Summary
The server accumulates context checkpoints in host RAM, bounded only by a fixed count cap (32) with no byte budget. Under long contexts the checkpoint store grows to ~20 GB and triggers a kernel SystemOOM that kills llama-server. Observed on titan during the 2026-06-04 heierchat incident.
Symptom
A research-augmented client (heierchat) drives long contexts on gemma-4-31b-iq4. The server logs checkpoints accumulating:
[41329] slot create_check: id 0 | task 9827 | created context checkpoint 25 of 32 (... size = 637.519 MiB)
[41329] slot create_check: id 0 | task 9827 | created context checkpoint 26 of 32 (... size = 637.519 MiB)
At the cap, 32 × ~637 MiB ≈ 20 GB of host RAM goes to checkpoints alone. titan has 46 GB of RAM and only 3 GB of swap, so the kernel OOM-kills the server:
oom-kill: ... task=llama-server,pid=1026069
Out of memory: Killed process 1026069 (llama-server) total-vm:155205132kB, anon-rss:37099988kB, ...
After the kill, the router routes to the dead port and returns instant 500s until restarted (separate router-recovery concern; see PR #64 discussion).
Root cause
Checkpoints are kept in host RAM and capped by count (32), not by bytes or by available host memory. The KV cache itself is on-GPU (q8_0 + flash-attn) and the weights are mmap'd (file-backed), so the host anon-rss blowup comes from the checkpoint store plus CUDA host buffers. A box with less available RAM than 32 × per-checkpoint bytes will OOM regardless of GPU headroom.
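As a back-of-the-envelope check on the numbers above, the worst-case footprint under a pure count cap is simply the cap times the per-checkpoint size. A minimal sketch, using the logged size rounded down to 637 MiB; the helper name is hypothetical and not part of llama.cpp:

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Worst-case host bytes held by the checkpoint store when the only limit
// is a checkpoint count (hypothetical helper, for illustration only).
constexpr std::uint64_t worst_case_checkpoint_bytes(std::uint64_t max_checkpoints,
                                                    std::uint64_t bytes_per_checkpoint) {
    return max_checkpoints * bytes_per_checkpoint;
}

// Values from the logs: cap of 32, ~637.519 MiB each (rounded down to 637 MiB).
constexpr std::uint64_t titan_worst_case = worst_case_checkpoint_bytes(32, 637 * MiB);
static_assert(titan_worst_case == 20384 * MiB, "32 x 637 MiB = 20384 MiB");
```

That is 20384 MiB, just under 20 GiB, before counting CUDA host buffers or anything else in anon-rss.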
Proposed fix (per @ht-llama.cpp-dev's read)
Localised to tools/server/server-context.cpp. server_prompt_checkpoint already tracks per-checkpoint bytes (source of the 637.519 MiB log). Replace the fixed 32-count cap with a max-bytes budget (default e.g. min(8 GB, host_free * 0.3)) and evict the oldest checkpoints when the budget is exceeded. On Linux, available host memory can be probed cheaply via the MemAvailable field of /proc/meminfo.
This is independent of the per-GPU router-fit refactor and can land on its own.
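To make the proposal concrete, here is a rough sketch of the two pieces: probing MemAvailable and evicting oldest-first against a byte budget. It is not the actual server-context.cpp code. All names here (parse_mem_available, default_checkpoint_budget, checkpoint_stub, checkpoint_store) are hypothetical. Three choices are assumptions that would need discussion: "8 GB" is treated as 8 GiB, the budget falls back to that fixed cap when the probe fails, and the newest checkpoint is always kept even if it alone exceeds the budget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse "MemAvailable:  <n> kB" from a /proc/meminfo-style stream.
// Returns bytes, or 0 if the field is not found (e.g. non-Linux host).
inline std::uint64_t parse_mem_available(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string key;
        std::uint64_t kib = 0;
        if ((ls >> key >> kib) && key == "MemAvailable:") {
            return kib * 1024;
        }
    }
    return 0;
}

inline std::uint64_t probe_mem_available() {
    std::ifstream f("/proc/meminfo");  // a failed open yields 0 below
    return parse_mem_available(f);
}

// min(8 GiB, 30% of available host memory), per the proposal.
// Assumption: fall back to the fixed 8 GiB cap when the probe returns 0.
inline std::uint64_t default_checkpoint_budget(std::uint64_t mem_available) {
    const std::uint64_t hard_cap = 8ull << 30;
    if (mem_available == 0) {
        return hard_cap;
    }
    return std::min<std::uint64_t>(hard_cap, mem_available / 10 * 3);
}

// Hypothetical stand-in for server_prompt_checkpoint: only the size matters here.
struct checkpoint_stub {
    std::uint64_t n_bytes;
};

struct checkpoint_store {
    std::uint64_t budget_bytes;
    std::uint64_t used_bytes = 0;
    std::deque<checkpoint_stub> items;  // front = oldest

    explicit checkpoint_store(std::uint64_t budget) : budget_bytes(budget) {
        assert(budget > 0);
    }

    // Evict oldest checkpoints until the new one fits, then append it.
    // Returns the number of checkpoints evicted.
    std::size_t push(checkpoint_stub cp) {
        std::size_t evicted = 0;
        while (!items.empty() && used_bytes + cp.n_bytes > budget_bytes) {
            used_bytes -= items.front().n_bytes;
            items.pop_front();
            ++evicted;
        }
        used_bytes += cp.n_bytes;
        items.push_back(cp);
        return evicted;
    }
};
```

With an 8 GiB budget and ~637 MiB checkpoints, a store like this would settle at 12 checkpoints (about 7.5 GiB) instead of 32. Whether a count cap should remain alongside the byte budget is left open here.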
Environment
- ht-llama.cpp b0daec55b (origin/ht); titan 46 GB RAM, 3 GB swap, 2 × RTX 3090.
- Model: gemma-4-31b-iq4 (IQ4_XS), ctx-size 57344, cache-type-k/v q8_0, flash-attn on.
- Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.
Related: companion issue on per-GPU-aware router fit / --models-max overcommit.