Problem
When --fit is on (the default), llama_params_fit estimates GPU memory usage to decide how many layers to
offload. However, it only accounts for:
- LLM main model weights
- KV cache
- Compute buffers
When a multimodal projector (mmproj) is present, either auto-detected via -hf or manually specified
via -mm/--mmproj, it is loaded onto the GPU after the fit estimation runs. Its allocation consumes the free
memory margin and can cause an OOM (ggml_cuda_pool_alloc / CUDA out of memory).
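To make the arithmetic concrete, here is a minimal Python sketch of the headroom left once the projector is loaded. It is an illustrative model only, not llama.cpp code: the class and function names and all sizes are hypothetical, chosen so that the fit leaves exactly the default 1024 MiB margin cited in this report.

```python
from dataclasses import dataclass


@dataclass
class FitEstimate:
    """The three components the fit accounts for (sizes in MiB, hypothetical)."""
    weights_mib: int
    kv_cache_mib: int
    compute_mib: int

    def total(self) -> int:
        return self.weights_mib + self.kv_cache_mib + self.compute_mib


def headroom_mib(free_vram_mib: int, est: FitEstimate, mmproj_mib: int = 0) -> int:
    """Free VRAM left after loading everything; negative means the load cannot fit."""
    return free_vram_mib - est.total() - mmproj_mib


est = FitEstimate(weights_mib=12000, kv_cache_mib=9000, compute_mib=1976)
print(headroom_mib(24000, est))                   # what the fit sees: 1024
print(headroom_mib(24000, est, mmproj_mib=1400))  # after mmproj loads: -376
```

In this model the fit result looks safe on its own, and the shortfall only appears once the unaccounted mmproj allocation is added.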
Steps to reproduce
# With any multimodal model that auto-detects mmproj via -hf,
# on a GPU where fit needs to reduce n_gpu_layers or n_ctx:
llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536
# Or with manual mmproj:
llama-server -m model.gguf -mm mmproj.gguf -c 65536
The OOM is intermittent: it depends on GPU memory fragmentation and how close the fit result is to the
VRAM limit.
Workaround
Increase the margin manually:
llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536 --fit-margin 2048
Or disable mmproj GPU offload:
llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536 --no-mmproj-offload
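One way to pick a margin value rather than guessing is to derive it from the mmproj file size. The sketch below is a rough heuristic under a loud assumption: that the projector's VRAM footprint is approximately its file size on disk. Its compute buffers are ignored, so treat the result as a lower bound. The helper name is hypothetical, and the 1024 MiB base is the default margin mentioned in this report.

```python
import os

MIB = 1024 * 1024


def suggested_fit_margin_mib(mmproj_path: str, base_margin_mib: int = 1024) -> int:
    """Base margin plus the mmproj file size, rounded up to whole MiB.

    Assumption: VRAM used by mmproj is roughly its file size (a lower bound).
    """
    size_bytes = os.path.getsize(mmproj_path)
    mmproj_mib = -(-size_bytes // MIB)  # ceiling division
    return base_margin_mib + mmproj_mib
```

The returned number could then be passed as the --fit-margin value in the first workaround command. When the projector is pulled in via -hf, the path of the downloaded mmproj file has to be located first.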
Possible root cause
The initialization order is:
- common_init_result constructor runs (common/common.cpp:1046)
  - Inside it, llama_params_fit() estimates GPU memory and decides n_gpu_layers (common/common.cpp:1053)
  - LLM model is loaded to GPU based on the fitted parameters (common/common.cpp:1061)
- Constructor returns
- Later, mmproj is loaded to GPU via mtmd_init_from_file():
  - In server: server-context.cpp:693
  - In mtmd-cli: mtmd-cli.cpp:120 → init_vision_context()
llama_params_fit_impl (src/llama.cpp:159) has no knowledge of mmproj — it doesn't take any
mmproj-related parameters. Its margins parameter (default 1024 MiB per device) is the only buffer, and
mmproj can easily exceed it.
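The effect of this ordering, and of one possible remedy, can be simulated with a toy model. This is a sketch under stated assumptions, not the llama.cpp implementation: it treats every layer as the same size, lumps KV cache and compute buffers into one fixed cost, ignores context-size reduction, and uses hypothetical names and numbers throughout. The reserved_mib argument stands in for a hypothetical fix where the fit is told the mmproj size up front.

```python
def fit_gpu_layers(free_mib, layer_mib, n_layers, fixed_mib, margin_mib, reserved_mib=0):
    """Largest layer count whose estimated usage still leaves margin + reserved free."""
    budget = free_mib - margin_mib - reserved_mib - fixed_mib
    if budget <= 0:
        return 0
    return min(n_layers, budget // layer_mib)


def free_after_init(free_mib, layer_mib, n_layers, fixed_mib, margin_mib,
                    mmproj_mib, reserve_mmproj):
    """Return (fitted layers, free MiB after the LLM and then mmproj are loaded)."""
    reserved = mmproj_mib if reserve_mmproj else 0
    n = fit_gpu_layers(free_mib, layer_mib, n_layers, fixed_mib, margin_mib, reserved)
    after_llm = free_mib - fixed_mib - n * layer_mib  # step 1: LLM loaded as fitted
    return n, after_llm - mmproj_mib                  # step 2: mmproj loaded later


cfg = dict(free_mib=16000, layer_mib=400, n_layers=40, fixed_mib=4000,
           margin_mib=1024, mmproj_mib=1400)
print(free_after_init(**cfg, reserve_mmproj=False))  # (27, -200): mmproj does not fit
print(free_after_init(**cfg, reserve_mmproj=True))   # (23, 1400): fewer layers, no OOM
```

In this model, reserving the projector's size before fitting trades a few offloaded layers for a load that stays within VRAM. Whether the real fix should extend the fit's parameters or reorder initialization is left open here.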