Context checkpoints capped by count not bytes -> host-RAM OOM under long contexts

## Summary

The server accumulates context checkpoints in **host RAM**, bounded only by a fixed **count** cap (32) with no byte budget. Under long contexts the checkpoint store grows to ~20 GB and triggers a kernel SystemOOM that kills `llama-server`. Observed on titan during the 2026-06-04 heierchat incident.

## Symptom

A research-augmented client (heierchat) drives long contexts on `gemma-4-31b-iq4`. The server logs checkpoints accumulating:

```
[41329] slot create_check: id 0 | task 9827 | created context checkpoint 25 of 32 (... size = 637.519 MiB)
[41329] slot create_check: id 0 | task 9827 | created context checkpoint 26 of 32 (... size = 637.519 MiB)
```

32 × ~637 MiB ≈ **20 GB of host RAM** in checkpoints alone. titan has 46 GB of RAM and only 3 GB of swap, so the kernel OOM-kills the server:

```
oom-kill: ... task=llama-server,pid=1026069
Out of memory: Killed process 1026069 (llama-server) total-vm:155205132kB, anon-rss:37099988kB, ...
```

After the kill, the router keeps routing to the dead port and returns instant 500s until restarted. This is a separate router-recovery concern; see the PR #64 discussion.

## Root cause

Checkpoints are kept in host RAM and capped by **count** (32), not by **bytes** or available host memory. The KV cache itself is on-GPU (q8_0 + flash-attn) and the weights are mmap'd (file-backed), so the host anon-rss blowup comes from the checkpoint store plus CUDA host buffers. A box with less RAM than `32 × per-checkpoint-bytes` will OOM regardless of GPU headroom.
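
The count-only bound is easy to check with quick arithmetic. The following is a minimal sketch using the figures from the log above; the helper names are hypothetical and do not come from the server source.

```cpp
#include <cstdio>

// Hypothetical helper (not from the server source): worst-case host memory,
// in MiB, held by checkpoints when only their count is capped.
constexpr double worst_case_checkpoint_mib(int max_checkpoints, double mib_per_checkpoint) {
    return max_checkpoints * mib_per_checkpoint;
}

// Convert MiB to GiB for easier comparison against host RAM.
constexpr double mib_to_gib(double mib) {
    return mib / 1024.0;
}

// Print the worst case for the values seen in the incident log
// (32 checkpoints of 637.519 MiB each).
inline void print_incident_worst_case() {
    const double mib = worst_case_checkpoint_mib(32, 637.519);
    std::printf("worst case: %.1f MiB (%.2f GiB)\n", mib, mib_to_gib(mib));
}
```

For 32 checkpoints of 637.519 MiB each this comes to roughly 20,400 MiB, or about 19.9 GiB, which matches the ~20 GB estimate above.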

## Proposed fix (per @ht-llama.cpp-dev's read)

The change is localised to `tools/server/server-context.cpp`. `server_prompt_checkpoint` already tracks per-checkpoint bytes (the source of the `637.519 MiB` figure in the log). Replace the fixed 32-count cap with a **max-bytes budget**, with a default of e.g. `min(8 GB, host_free * 0.3)`, and evict the oldest checkpoints when the budget is exceeded. Available host memory can be probed cheaply via the `MemAvailable` field of `/proc/meminfo`.
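
A rough sketch of what the byte budget could look like, not the actual patch. Everything named here (`checkpoint_sketch`, `parse_mem_available`, `checkpoint_budget_bytes`, `checkpoint_store_sketch`) is hypothetical and stands in for the real server types. It also assumes that "8 GB" means 8 GiB, that the budget falls back to the hard cap when `MemAvailable` cannot be read, and that the newest checkpoint is always kept even if it alone exceeds the budget.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical stand-in for the server's checkpoint record; only the size matters here.
struct checkpoint_sketch {
    int         id;
    std::size_t n_bytes;
};

// Parse "MemAvailable:  <n> kB" from a /proc/meminfo-style stream; returns bytes.
std::optional<std::uint64_t> parse_mem_available(std::istream & in) {
    const std::string key = "MemAvailable:";
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(key, 0) == 0) {  // line starts with the key
            std::istringstream fields(line.substr(key.size()));
            std::uint64_t kib = 0;
            if (fields >> kib) {
                return kib * 1024;
            }
            return std::nullopt;
        }
    }
    return std::nullopt;
}

// Budget = min(8 GiB, 30% of available host memory).
// Assumption: fall back to the hard cap when the probe fails.
std::uint64_t checkpoint_budget_bytes(std::optional<std::uint64_t> mem_available) {
    constexpr std::uint64_t hard_cap = 8ull << 30;
    if (!mem_available) {
        return hard_cap;
    }
    return std::min<std::uint64_t>(hard_cap, *mem_available / 10 * 3);
}

// Byte-budgeted store: appends the new checkpoint, then evicts oldest-first.
class checkpoint_store_sketch {
public:
    explicit checkpoint_store_sketch(std::uint64_t budget) : budget_(budget) {}

    void add(const checkpoint_sketch & cp) {
        items_.push_back(cp);
        total_ += cp.n_bytes;
        // Assumption: always keep the newest checkpoint, even if it is over budget.
        while (total_ > budget_ && items_.size() > 1) {
            total_ -= items_.front().n_bytes;
            items_.pop_front();
        }
    }

    std::size_t   size()        const { return items_.size(); }
    std::uint64_t total_bytes() const { return total_; }
    int           oldest_id()   const { return items_.front().id; }

private:
    std::uint64_t budget_;
    std::uint64_t total_ = 0;
    std::deque<checkpoint_sketch> items_;
};
```

In real use, `parse_mem_available` would be fed a `std::ifstream` opened on `/proc/meminfo`. Taking a generic `std::istream` keeps the parser testable without touching the host. With about 20 GB reported as available, the 30% rule yields a budget of roughly 6.1 GB, so a store fed 637 MiB checkpoints would settle at 9 entries instead of 32.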

This is independent of the per-GPU router-fit refactor and can land on its own.

## Environment
- ht-llama.cpp `b0daec55b` (origin/ht); titan 46 GB RAM, 3 GB swap, 2 × RTX 3090.
- Model: `gemma-4-31b-iq4` (IQ4_XS), `ctx-size 57344`, `cache-type-k/v q8_0`, `flash-attn on`.
- Diagnosed jointly by snoop-kube (cluster) + ht-llama.cpp-dev.

Related: companion issue on per-GPU-aware router fit / `--models-max` overcommit.

