Eval bug: auto fit estimation does not account for mmproj GPU memory, causing OOM with multimodal models #19980

 ### Problem

  When `--fit on` is set (the default), `llama_params_fit` estimates GPU memory usage to decide how many layers to
  offload. However, it only accounts for:

  - LLM main model weights
  - KV cache
  - Compute buffers

  When a multimodal projector (mmproj) is present, either auto-detected via `-hf` or manually specified
  via `-mm`/`--mmproj`, it is loaded onto the GPU **after** the fit estimation runs. Its allocation eats into the
  free memory margin and can cause OOM (`ggml_cuda_pool_alloc` / CUDA out of memory).
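To make the arithmetic concrete, here is a minimal sketch of the budget described above. The numbers are illustrative assumptions, not measurements, and `headroom_after_mmproj` is a hypothetical helper rather than anything in llama.cpp. It assumes the fit fills the device up to the margin and that the mmproj is allocated afterwards.

```python
def headroom_after_mmproj(free_mib: int, fitted_usage_mib: int, mmproj_mib: int) -> int:
    """Free VRAM (MiB) left once the mmproj is loaded after the fit.

    A negative result means the mmproj allocation cannot succeed.
    """
    return free_mib - fitted_usage_mib - mmproj_mib


# Illustrative values only (assumptions, not measured on real hardware).
free_mib = 24000                           # free VRAM before loading anything
margin_mib = 1024                          # default margin per device, per this issue
fitted_usage_mib = free_mib - margin_mib   # fit uses everything except the margin
mmproj_mib = 1500                          # hypothetical projector footprint

remaining = headroom_after_mmproj(free_mib, fitted_usage_mib, mmproj_mib)
print(remaining)  # -476: the projector overshoots the margin
```

Any projector larger than the margin gives a negative result here. Smaller projectors may still fail in practice if the remaining memory is fragmented.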

  ### Steps to reproduce

  ```bash
  # With any multimodal model that auto-detects mmproj via -hf,
  # on a GPU where fit needs to reduce n_gpu_layers or n_ctx:
  llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536

  # Or with manual mmproj:
  llama-server -m model.gguf -mm mmproj.gguf -c 65536
 ```

  The OOM is intermittent: it depends on GPU memory fragmentation and on how close the fit result is to the
  VRAM limit.

  ### Workaround

  - Increase the margin manually: `llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536 --fit-margin 2048`
  - Or disable mmproj GPU offload: `llama-server -hf Qwen/Qwen3.5-35B-A3B-gguf -c 65536 --no-mmproj-offload`

 ### Possible root cause

  The initialization order is:

  1. `common_init_result` constructor runs (`common/common.cpp:1046`)
  2. Inside it, `llama_params_fit()` estimates GPU memory and decides `n_gpu_layers`
  (`common/common.cpp:1053`)
  3. LLM model is loaded to GPU based on the fitted parameters (`common/common.cpp:1061`)
  4. Constructor returns
  5. **Later**, mmproj is loaded to GPU via `mtmd_init_from_file()`:
     - In server: `server-context.cpp:693`
     - In mtmd-cli: `mtmd-cli.cpp:120` → `init_vision_context()`

  `llama_params_fit_impl` (`src/llama.cpp:159`) has no knowledge of mmproj: it takes no
  mmproj-related parameters. Its `margins` parameter (default 1024 MiB per device) is the only buffer, and
  an mmproj can easily exceed it.
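One possible direction, sketched here under stated assumptions, is to reserve room for the mmproj before the fit runs by enlarging the margin. The helper `margin_with_mmproj` is hypothetical and not part of llama.cpp. It uses the mmproj file size as a rough lower bound on its GPU footprint and ignores the projector's own compute buffers, so a real fix would likely need a better estimate.

```python
import math
import os
import tempfile

MIB = 1024 * 1024


def margin_with_mmproj(base_margin_mib: int, mmproj_path: str) -> int:
    """Return a margin (MiB) enlarged by the mmproj file size, rounded up.

    Assumption: on-disk size approximates the weight footprint on the GPU.
    """
    size_mib = math.ceil(os.path.getsize(mmproj_path) / MIB)
    return base_margin_mib + size_mib


# Demo with a stand-in 3 MiB file, since no real mmproj is assumed to exist here.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"\0" * (3 * MIB))
    demo_path = f.name

demo_margin = margin_with_mmproj(1024, demo_path)
os.remove(demo_path)
print(demo_margin)  # 1027
```

The resulting value could be supplied as the margin in the workaround above until the fit logic accounts for mmproj itself.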
