Summary
The native model router co-loads two models that pin to the same CUDA device and OOMs at load time, because the admit/fit decision reasons about total VRAM across devices instead of per-device free memory. Surfaced as intermittent HTTP 500s from the titan-llm deployment (2 × RTX 3090) during the 2026-06-04 heierchat incident.
Symptom
titan-llm runs the router with --models-max 2 --models-preset llama-router.ini. Every large preset pins to GPU0 via tensor-split = 100,0 (to leave GPU1 for a co-located ComfyUI). When two big CUDA0-pinned models are requested close together (e.g. devstral-24b ~13.5 GB already resident, then gemma-4-31b-iq4 needing ~15.6 GB weights), the second load fails:
[36701] CUDA0 : NVIDIA GeForce RTX 3090 (24124 MiB, 10552 MiB free)
[36701] W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 999, abort
[36701] E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15598.97 MiB on device 0: cudaMalloc failed: out of memory
The router then routes requests to the failed/closed instance, so the client sees proxy error: Could not establish connection (HTTP 500, 502, or 000). The failure is intermittent: it occurs only when another CUDA0-pinned model is already resident. If GPU0 is free, the same model loads fine (~23 GB on a 24 GB card, by design).
nvidia-smi at load time confirmed GPU0 had ~13.5 GB occupied (devstral) and GPU1 was nearly empty (389 MiB) — usable capacity the router won't touch because of the pin.
Root cause
--models-max 2 keeps two models resident, and the admit decision treats VRAM as one 48 GB pool. It does not account for tensor-split / --device pinning both models onto the same physical GPU0 (24 GB). So it co-admits roughly 14 GB + 16 GB (about 30 GB in total), believing 48 GB suffices, and the two loads collide on the 24 GB GPU0.
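The gap between the two ways of reasoning can be sketched with the numbers from the log above. This is an illustration only: fits_pooled and fits_on_device are hypothetical helpers, not functions in the router, and the GPU1 free figure (23735 MiB) assumes GPU1 has the same 24124 MiB total as GPU0 minus the 389 MiB reported by nvidia-smi.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helpers for illustration; names are not taken from the router.

// Pooled check: sums free memory across every device, which is the
// behaviour described in this issue.
bool fits_pooled(const std::vector<int64_t>& free_mib, int64_t need_mib) {
    const int64_t total = std::accumulate(free_mib.begin(), free_mib.end(), int64_t{0});
    return need_mib <= total;
}

// Per-device check: a pinned model must fit on the one device it targets.
bool fits_on_device(const std::vector<int64_t>& free_mib, std::size_t device, int64_t need_mib) {
    assert(device < free_mib.size());
    return need_mib <= free_mib[device];
}
```

With free memory of {10552, 23735} MiB and a 15599 MiB candidate, the pooled check passes while the per-device check for GPU0 fails, which matches the cudaMalloc failure in the log.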
Note: docker/unified-llm/entrypoint.sh deliberately uses --models-max 2 (not 1) to dodge a load-race (max=1 force-kills a model still in LOAD state via unload_lru after stop_timeout), justified as "without changing memory headroom (each load still respects the per-model GPU footprint)." That last clause is the bug — max=2 does overcommit a shared-device pin.
Proposed fix (per @ht-llama.cpp-dev's read)
Land in two places:
- Admit decision: common_fit_params.
- Eviction policy: tools/server/server-models.cpp (pick_any_resident + the LRU walk).
The router already parses which device(s) a model pins to (tensor-split). Add a compute_free_per_device(active_models) helper that subtracts loaded-model footprints from per-device cudaMemGetInfo. The admit check becomes: for each device the candidate targets, candidate.bytes <= free_after_lru_evict[device] — and LRU-evict from that device, not globally. This makes --models-max 2 safe for same-device pins (today it's the OOM trigger) and lets the router correctly use GPU1 when a second model would fit there.
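A minimal sketch of that admit check follows, under stated assumptions. Only the name compute_free_per_device comes from the proposal. The ModelFootprint struct, the plan_admit function, and the use of MiB instead of bytes are hypothetical choices made for illustration. The per-device baseline is passed in by the caller so the sketch runs without CUDA. In the router it would presumably come from cudaMemGetInfo, and whether the right baseline is that call's total or free figure depends on whether already-loaded models have been counted, which this sketch does not decide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record of one model's footprint on each device, in MiB.
struct ModelFootprint {
    std::string name;
    std::vector<int64_t> mib_per_device;  // 0 on devices the model does not use
    uint64_t last_used;                   // smaller value = less recently used
};

// Subtract every active model's footprint from the per-device baseline.
std::vector<int64_t> compute_free_per_device(const std::vector<int64_t>& baseline_mib,
                                             const std::vector<ModelFootprint>& active) {
    std::vector<int64_t> free_mib = baseline_mib;
    for (const auto& m : active) {
        assert(m.mib_per_device.size() == free_mib.size());
        for (std::size_t d = 0; d < free_mib.size(); ++d) {
            free_mib[d] -= m.mib_per_device[d];
        }
    }
    return free_mib;
}

// Returns the models to evict (possibly none) so the candidate fits on every
// device it targets, or nullopt if it cannot fit even after eviction.
// Only models occupying a device that is short on memory are evicted.
std::optional<std::vector<std::string>> plan_admit(const std::vector<int64_t>& baseline_mib,
                                                   std::vector<ModelFootprint> active,
                                                   const ModelFootprint& candidate) {
    assert(candidate.mib_per_device.size() == baseline_mib.size());
    // Oldest first, so the first match below is the LRU victim.
    std::sort(active.begin(), active.end(),
              [](const ModelFootprint& a, const ModelFootprint& b) {
                  return a.last_used < b.last_used;
              });
    std::vector<std::string> evicted;
    for (;;) {
        const std::vector<int64_t> free_mib = compute_free_per_device(baseline_mib, active);
        // A device is short when the candidate needs more than is free on it.
        auto short_on = [&](std::size_t d) { return candidate.mib_per_device[d] > free_mib[d]; };
        bool fits = true;
        for (std::size_t d = 0; d < free_mib.size(); ++d) {
            if (short_on(d)) fits = false;
        }
        if (fits) return evicted;
        // Pick the least recently used model that occupies a short device.
        auto victim = std::find_if(active.begin(), active.end(), [&](const ModelFootprint& m) {
            for (std::size_t d = 0; d < free_mib.size(); ++d) {
                if (short_on(d) && m.mib_per_device[d] > 0) return true;
            }
            return false;
        });
        if (victim == active.end()) return std::nullopt;  // nothing left to free there
        evicted.push_back(victim->name);
        active.erase(victim);
    }
}
```

In this sketch, a candidate pinned to GPU0 while devstral-24b holds about 13500 MiB there yields an eviction plan containing only devstral-24b, and an older model resident only on GPU1 is left alone. The same candidate targeted at GPU1 is admitted with no eviction.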
Environment
- ht-llama.cpp b0daec55b (origin/ht); deployed image unified-llm:dflash-65f46f0f8.
- titan: 2 × RTX 3090 (24 GB each). Router flags: --models-max 2 --parallel 1 --cont-batching.
- Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.
Related: companion issue on context-checkpoint host-RAM cap.