Context checkpoints capped by count not bytes -> host-RAM OOM under long contexts #67

@marksverdhei

Summary

The server accumulates context checkpoints in host RAM, limited only by a fixed count cap (32) with no byte budget. Under long contexts the checkpoint store grows to ~20 GB and triggers a kernel SystemOOM that kills llama-server. Observed on titan during the 2026-06-04 heierchat incident.

Symptom

A research-augmented client (heierchat) drives long contexts on gemma-4-31b-iq4. The server logs checkpoints accumulating:

[41329] slot create_check: id 0 | task 9827 | created context checkpoint 25 of 32 (... size = 637.519 MiB)
[41329] slot create_check: id 0 | task 9827 | created context checkpoint 26 of 32 (... size = 637.519 MiB)

At the cap, 32 × ~637 MiB ≈ 20 GB of host RAM goes to checkpoints alone. titan has 46 GB of RAM and only 3 GB of swap, so the kernel OOM-kills the server:

oom-kill: ... task=llama-server,pid=1026069
Out of memory: Killed process 1026069 (llama-server) total-vm:155205132kB, anon-rss:37099988kB, ...
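
As a rough cross-check of the ~20 GB estimate against the kill record: this takes the logged per-checkpoint size at face value, assumes all 32 slots fill (the excerpt above only shows checkpoints 25 and 26), and reads the kernel's kB as KiB.

```latex
\begin{align*}
32 \times 637.519\ \text{MiB} &= 20400.608\ \text{MiB} \approx 19.9\ \text{GiB} \\
\text{anon-rss} = 37099988\ \text{kB} &\approx 35.4\ \text{GiB} \\
19.9\ \text{GiB} \,/\, 35.4\ \text{GiB} &\approx 0.56
\end{align*}
```

Under those assumptions a full checkpoint store would account for a bit over half of the resident anonymous memory at the time of the kill.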

After the kill, the router routes to the dead port and returns instant 500s until restarted (separate router-recovery concern; see PR #64 discussion).

Root cause

Checkpoints are kept in host RAM and capped by count (32), not by bytes or available host memory. KV itself is on-GPU (q8_0 + flash-attn), weights are mmap'd (file-backed); the host anon-rss blowup is the checkpoint store + CUDA host buffers. A box with less RAM than 32 × per-checkpoint-bytes will OOM regardless of GPU headroom.

Proposed fix (per @ht-llama.cpp-dev's read)

Localised to tools/server/server-context.cpp. server_prompt_checkpoint already tracks per-checkpoint bytes (source of the 637.519 MiB log). Replace the fixed 32-count cap with a max-bytes budget, default e.g. min(8 GB, host_free * 0.3), evicting oldest checkpoints when the budget is exceeded. Probe available host memory cheaply via /proc/meminfo MemAvailable.
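
A minimal sketch of the budget logic described above, not the actual patch. All names here (checkpoint_stub, parse_mem_available, checkpoint_budget_bytes, evict_to_budget) are hypothetical stand-ins; the real server_prompt_checkpoint layout is not shown in this issue. The sketch also assumes that "8 GB" means 8 GiB and that the newest checkpoint is always kept even if it alone exceeds the budget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical stand-in for server_prompt_checkpoint: only the byte size matters here.
struct checkpoint_stub {
    uint64_t n_bytes;
};

// Read the MemAvailable value (reported in kB, i.e. KiB) from a
// /proc/meminfo-style stream and return it in bytes.
std::optional<uint64_t> parse_mem_available(std::istream & in) {
    const std::string key = "MemAvailable:";
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            std::istringstream fields(line.substr(key.size()));
            uint64_t kib = 0;
            if (fields >> kib) {
                return kib * 1024;
            }
            return std::nullopt;
        }
    }
    return std::nullopt;
}

// Budget = min(cap, percent of available host memory).
// Falls back to the cap alone when the probe failed (e.g. non-Linux host).
uint64_t checkpoint_budget_bytes(std::optional<uint64_t> avail,
                                 uint64_t cap_bytes = 8ull << 30,
                                 uint64_t percent   = 30) {
    if (!avail) {
        return cap_bytes;
    }
    return std::min(cap_bytes, *avail / 100 * percent);
}

// Drop the oldest checkpoints (front of the deque) until the total fits the
// budget. The newest checkpoint is never evicted. Returns the eviction count.
size_t evict_to_budget(std::deque<checkpoint_stub> & cps, uint64_t budget_bytes) {
    uint64_t total = 0;
    for (const auto & c : cps) {
        total += c.n_bytes;
    }
    size_t n_evicted = 0;
    while (total > budget_bytes && cps.size() > 1) {
        total -= cps.front().n_bytes;
        cps.pop_front();
        ++n_evicted;
    }
    return n_evicted;
}
```

In the server, the probe would be fed from std::ifstream("/proc/meminfo") and the eviction would run after each new checkpoint is created. With the ~637 MiB checkpoints from the log and an 8 GiB budget, this policy would retain 12 checkpoints instead of 32.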

This is independent of the per-GPU router-fit refactor and can land on its own.

Environment

  • ht-llama.cpp b0daec55b (origin/ht); titan 46 GB RAM, 3 GB swap, 2 × RTX 3090.
  • Model: gemma-4-31b-iq4 (IQ4_XS), ctx-size 57344, cache-type-k/v q8_0, flash-attn on.
  • Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.

Related: companion issue on per-GPU-aware router fit / --models-max overcommit.
