Router co-loads same-device-pinned models and OOMs: --models-max fit decision ignores per-device VRAM #66

@marksverdhei

Summary

The native model router co-loads two models that pin to the same CUDA device and OOMs at load time, because the admit/fit decision reasons about total VRAM across devices instead of per-device free memory. Surfaced as intermittent HTTP 500s from the titan-llm deployment (2 × RTX 3090) during the 2026-06-04 heierchat incident.

Symptom

titan-llm runs the router with --models-max 2 --models-preset llama-router.ini. Every large preset pins to GPU0 via tensor-split = 100,0 (to leave GPU1 for a co-located ComfyUI). When two big CUDA0-pinned models are requested close together (e.g. devstral-24b ~13.5 GB already resident, then gemma-4-31b-iq4 needing ~15.6 GB weights), the second load fails:

[36701] CUDA0 : NVIDIA GeForce RTX 3090 (24124 MiB, 10552 MiB free)
[36701] W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 999, abort
[36701] E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15598.97 MiB on device 0: cudaMalloc failed: out of memory

The router then routes requests to the failed, closed instance, so the client sees proxy error: Could not establish connection (HTTP 500), or 502 / 000. The failure is intermittent: it occurs only when another CUDA0-pinned model is already resident. If GPU0 is free, the same model loads fine (~23 GB on a 24 GB card, by design).

nvidia-smi at load time confirmed GPU0 had ~13.5 GB occupied (devstral) and GPU1 was nearly empty (389 MiB) — usable capacity the router won't touch because of the pin.

Root cause

--models-max 2 keeps two models resident, and the admit decision treats VRAM as one 48 GB pool. It does not account for tensor-split / --device pinning both models onto the same physical GPU (GPU0, 24 GB). It therefore co-admits roughly 13.5 GB + 15.6 GB of weights, believing 48 GB suffices, and the two loads collide on GPU0.
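The gap between the two views can be checked with the numbers from the log above. This is a minimal Python sketch of the arithmetic only, not router code: the function names are hypothetical, GPU1's total is assumed equal to GPU0's 24124 MiB, and the pooled check is a simplified model of the behaviour described in this section.

```python
# Free memory per device in MiB. GPU0 comes from the CUDA0 log line; GPU1 is
# an assumption: same 24124 MiB total minus the 389 MiB nvidia-smi showed in use.
free_mib = {0: 10552.0, 1: 24124.0 - 389.0}

# gemma-4-31b-iq4: buffer size from the cudaMalloc error, pinned to GPU0
# by tensor-split = 100,0.
candidate_mib = 15598.97
candidate_device = 0

def fits_pooled(need_mib, free):
    """Pooled view: compare the request against free memory summed over all devices."""
    return need_mib <= sum(free.values())

def fits_on_device(need_mib, device, free):
    """Per-device view: the pinned device alone must hold the whole request."""
    return need_mib <= free[device]

pooled_ok = fits_pooled(candidate_mib, free_mib)
per_device_ok = fits_on_device(candidate_mib, candidate_device, free_mib)
print(pooled_ok, per_device_ok)  # True False
```

The pooled check admits the load because GPU1's idle memory is counted, while the per-device check rejects it on GPU0, which matches the observed cudaMalloc failure.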

Note: docker/unified-llm/entrypoint.sh deliberately uses --models-max 2 (not 1) to dodge a load-race (max=1 force-kills a model still in LOAD state via unload_lru after stop_timeout), justified as "without changing memory headroom (each load still respects the per-model GPU footprint)." That last clause is the bug — max=2 does overcommit a shared-device pin.

Proposed fix (per @ht-llama.cpp-dev's read)

Land in two places:

  • Admit decision: common_fit_params.
  • Eviction policy: tools/server/server-models.cpp (pick_any_resident + the LRU walk).

The router already parses which device(s) a model pins to (tensor-split). Add a compute_free_per_device(active_models) helper that subtracts loaded-model footprints from per-device cudaMemGetInfo. The admit check becomes: for each device the candidate targets, candidate.bytes <= free_after_lru_evict[device] — and LRU-evict from that device, not globally. This makes --models-max 2 safe for same-device pins (today it's the OOM trigger) and lets the router correctly use GPU1 when a second model would fit there.
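The proposed admit check and per-device LRU walk could look roughly like the following. This is a Python sketch of the logic only, not the C++ that would land in server-models.cpp. The Model type, plan_admit, and the baseline_gb capacities are hypothetical, and the sketch assumes the baseline is the memory available to the router before its own models are counted. It does not model the --models-max count limit or load-state races.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    gb_per_device: dict  # device index -> footprint in GB on that device
    last_used: float     # larger value = more recently used

def compute_free_per_device(baseline_gb, active_models):
    """Subtract each resident model's footprint from its devices' baseline."""
    free = dict(baseline_gb)
    for m in active_models:
        for dev, gb in m.gb_per_device.items():
            free[dev] -= gb
    return free

def fits(candidate, free):
    """True if every device the candidate targets can hold its share."""
    return all(need <= free[dev] for dev, need in candidate.gb_per_device.items())

def plan_admit(candidate, baseline_gb, active_models):
    """Return names to evict (oldest first) so the candidate fits, or None."""
    free = compute_free_per_device(baseline_gb, active_models)
    evict = []
    for victim in sorted(active_models, key=lambda m: m.last_used):
        if fits(candidate, free):
            break
        # Devices where the candidate is still short of memory.
        short = {d for d, need in candidate.gb_per_device.items() if need > free[d]}
        # Skip residents that hold nothing on those devices: evicting them frees
        # no useful memory (this is the "not globally" part of the proposal).
        if not any(victim.gb_per_device.get(d, 0) > 0 for d in short):
            continue
        for dev, gb in victim.gb_per_device.items():
            free[dev] += gb
        evict.append(victim.name)
    return evict if fits(candidate, free) else None

baseline = {0: 24.0, 1: 24.0}  # 2 x RTX 3090, nominal GB
devstral = Model("devstral-24b", {0: 13.5}, last_used=1.0)
gemma_gpu0 = Model("gemma-4-31b-iq4", {0: 15.6}, last_used=2.0)
gemma_gpu1 = Model("gemma-4-31b-iq4", {1: 15.6}, last_used=2.0)  # hypothetical GPU1 pin

print(plan_admit(gemma_gpu0, baseline, [devstral]))  # ['devstral-24b']
print(plan_admit(gemma_gpu1, baseline, [devstral]))  # []
```

With both models pinned to GPU0, the sketch evicts devstral before admitting gemma rather than co-loading them. With the candidate on GPU1 it admits without eviction, which is the case where keeping two models resident is safe.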

Environment

  • ht-llama.cpp b0daec55b (origin/ht); deployed image unified-llm:dflash-65f46f0f8.
  • titan: 2 × RTX 3090 (24 GB each). Router flags: --models-max 2 --parallel 1 --cont-batching.
  • Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.

Related: companion issue on context-checkpoint host-RAM cap.
