Router co-loads same-device-pinned models and OOMs: --models-max fit decision ignores per-device VRAM #66

## Summary

The native model router co-loads two models that pin to the **same** CUDA device and OOMs at load time, because the admit/fit decision reasons about **total** VRAM across devices instead of **per-device** free memory. Surfaced as intermittent HTTP 500s from the titan-llm deployment (2 × RTX 3090) during the 2026-06-04 heierchat incident.

## Symptom

`titan-llm` runs the router with `--models-max 2 --models-preset llama-router.ini`. Every large preset pins to GPU0 via `tensor-split = 100,0` (to leave GPU1 for a co-located ComfyUI). When two big CUDA0-pinned models are requested close together (e.g. `devstral-24b` ~13.5 GB already resident, then `gemma-4-31b-iq4` needing ~15.6 GB weights), the second load fails:

```
[36701] CUDA0 : NVIDIA GeForce RTX 3090 (24124 MiB, 10552 MiB free)
[36701] W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 999, abort
[36701] E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15598.97 MiB on device 0: cudaMalloc failed: out of memory
```

The router then routes to the failed/closed instance, and the client sees `proxy error: Could not establish connection` (HTTP 500) / 502 / 000. The failure is **intermittent**: it occurs only when another CUDA0-pinned model is already resident. If GPU0 is free, the same model loads fine (~23 GB on a 24 GB card, by design).

`nvidia-smi` at load time confirmed GPU0 had ~13.5 GB occupied (devstral) and GPU1 was nearly empty (389 MiB) — usable capacity the router won't touch because of the pin.

## Root cause

`--models-max 2` keeps two models resident, and the admit decision treats VRAM as one 48 GB pool. It does not account for `tensor-split` / `--device` pinning both models onto the same physical GPU0 (24 GB). So it co-admits ~13.5 GB + ~15.6 GB (about 29 GB) believing 48 GB suffices; both land on GPU0, which has only 24 GB.
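
The gap between the two checks can be illustrated with the numbers from this incident. This is a minimal sketch, not the router's actual code: `Device`, `fits_pooled`, and `fits_per_device` are hypothetical names, and the sketch assumes a model's footprint divides across devices in proportion to its `tensor-split` weights.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical model of the admit decision; names are illustrative only.
struct Device {
    uint64_t free_mib; // free VRAM currently reported for this device
};

// Pooled check: sums free memory over all devices (the behavior described above).
bool fits_pooled(const std::vector<Device> & devs, uint64_t need_mib) {
    uint64_t total = 0;
    for (const auto & d : devs) {
        total += d.free_mib;
    }
    return need_mib <= total;
}

// Per-device check: split the footprint by tensor-split weight (assumption)
// and require every device to hold its own share.
bool fits_per_device(const std::vector<Device> & devs,
                     const std::vector<double> & tensor_split,
                     uint64_t need_mib) {
    assert(devs.size() == tensor_split.size());
    const double sum = std::accumulate(tensor_split.begin(), tensor_split.end(), 0.0);
    if (sum <= 0.0) {
        return false; // no device targeted
    }
    for (size_t i = 0; i < devs.size(); ++i) {
        const double share = (double) need_mib * (tensor_split[i] / sum);
        if (share > (double) devs[i].free_mib) {
            return false;
        }
    }
    return true;
}
```

With GPU0 at 10552 MiB free (from the log) and GPU1 at roughly 23735 MiB free (assuming the same 24124 MiB total minus the 389 MiB in use), a 15599 MiB request passes `fits_pooled` but fails `fits_per_device` with a split of `100,0`.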

Note: `docker/unified-llm/entrypoint.sh` deliberately uses `--models-max 2` (not 1) to dodge a load-race (max=1 force-kills a model still in LOAD state via `unload_lru` after `stop_timeout`), justified as *"without changing memory headroom (each load still respects the per-model GPU footprint)."* **That last clause is the bug** — max=2 *does* overcommit a shared-device pin.

## Proposed fix (per @ht-llama.cpp-dev's read)

Land in two places:
- **Admit decision** — `common_fit_params`.
- **Eviction policy** — `tools/server/server-models.cpp` (`pick_any_resident` + the LRU walk).

The router already parses which device(s) a model pins to (tensor-split). Add a `compute_free_per_device(active_models)` helper that subtracts loaded-model footprints from per-device `cudaMemGetInfo`. The admit check becomes: *for each device the candidate targets, `candidate.bytes <= free_after_lru_evict[device]`* — and LRU-evict from **that** device, not globally. This makes `--models-max 2` safe for same-device pins (today it's the OOM trigger) and lets the router correctly use GPU1 when a second model would fit there.
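
The admit check and per-device eviction described above could be sketched roughly as follows. Everything here is hypothetical and for illustration only: `ResidentModel`, `plan_admit`, and the signature of `compute_free_per_device` are not taken from the router. To stay self-contained, the sketch takes per-device capacity as a plain vector instead of calling `cudaMemGetInfo`, and it ignores memory held by other processes on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bookkeeping record; not a type from the router.
struct ResidentModel {
    std::string name;
    std::vector<uint64_t> mib_per_device; // footprint on each device
    uint64_t last_used;                   // smaller = less recently used
};

// Per-device capacity minus the footprints of resident models (clamped at 0).
std::vector<uint64_t> compute_free_per_device(
        const std::vector<uint64_t> & capacity_mib,
        const std::vector<ResidentModel> & active) {
    std::vector<uint64_t> free_mib = capacity_mib;
    for (const auto & m : active) {
        assert(m.mib_per_device.size() == free_mib.size());
        for (size_t d = 0; d < free_mib.size(); ++d) {
            free_mib[d] -= std::min(free_mib[d], m.mib_per_device[d]);
        }
    }
    return free_mib;
}

// Decide whether the candidate can be admitted. On success, `evicted` lists
// the models to unload first, chosen LRU-first among models that occupy a
// device where the candidate does not fit. Models on other devices are kept.
bool plan_admit(const std::vector<uint64_t> & capacity_mib,
                std::vector<ResidentModel> active, // copied: we simulate evictions
                const std::vector<uint64_t> & candidate_mib,
                std::vector<std::string> & evicted) {
    assert(candidate_mib.size() == capacity_mib.size());
    evicted.clear();
    // Oldest first, so the first match below is the LRU occupant.
    std::sort(active.begin(), active.end(),
              [](const ResidentModel & a, const ResidentModel & b) {
                  return a.last_used < b.last_used;
              });
    while (true) {
        const std::vector<uint64_t> free_mib = compute_free_per_device(capacity_mib, active);
        // Find the first device where the candidate's share does not fit.
        size_t d = 0;
        while (d < free_mib.size() && candidate_mib[d] <= free_mib[d]) {
            ++d;
        }
        if (d == free_mib.size()) {
            return true; // fits on every targeted device
        }
        auto victim = std::find_if(active.begin(), active.end(),
                                   [d](const ResidentModel & m) {
                                       return m.mib_per_device[d] > 0;
                                   });
        if (victim == active.end()) {
            evicted.clear();
            return false; // nothing left to evict on the short device
        }
        evicted.push_back(victim->name);
        active.erase(victim);
    }
}
```

In this sketch a GPU0-pinned candidate evicts only GPU0 residents, even when a model on GPU1 is older in LRU order, and a candidate targeting GPU1 is admitted without evicting anything if GPU1 has room.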

## Environment
- ht-llama.cpp `b0daec55b` (origin/ht); deployed image `unified-llm:dflash-65f46f0f8`.
- titan: 2 × RTX 3090 (24 GB each). Router flags: `--models-max 2 --parallel 1 --cont-batching`.
- Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.

Related: companion issue on context-checkpoint host-RAM cap.

