fix(server): per-slot byte cap on context checkpoints (closes #67) by marksverdhei · Pull Request #68 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-04T18:09:37Z

Closes #67.

Why

The existing checkpoint cap is count-only (--ctx-checkpoints, default 32), so a single slot can accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts. On titan (46 GB RAM / 3 GB swap) this drives llama-server to 37 GB anon-rss → SystemOOM → 500s.

What this adds

A per-slot byte budget alongside the existing count cap:

* CLI: --ctx-checkpoints-max-mib N (env LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB)
* Default: 4096 MiB per slot
* Disable / legacy behavior: --ctx-checkpoints-max-mib 0 (count-only)
* Eviction: FIFO until both caps are satisfied. Whichever cap bites first is reported via reason=count|bytes in the warning log.

+ auto over_count = [&]() { return slot.prompt.checkpoints.size() >= (size_t) params_base.n_ctx_checkpoints; };
+ auto over_bytes = [&]() { return byte_cap > 0 && !empty && total_bytes() >= byte_cap; };
- while (slot.prompt.checkpoints.size() >= (size_t) params_base.n_ctx_checkpoints) {
+ while (over_count() || over_bytes()) {

The success log also now reports slot total = X MiB / Y MiB cap so the current footprint is visible per checkpoint-create.
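
For readers who want to see the dual-cap eviction in isolation, here is a minimal, self-contained sketch. It is not the code from this PR: `Checkpoint`, `EvictionResult`, and `evict_fifo` are hypothetical names, a `std::deque` stands in for the slot's checkpoint list, and the choice to report `count` when both caps are exceeded at once is an assumption.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a context checkpoint; only its size matters here.
struct Checkpoint {
    std::size_t bytes;
};

struct EvictionResult {
    std::size_t evicted     = 0;  // number of checkpoints dropped
    std::size_t bytes_freed = 0;  // total size of the dropped checkpoints
    std::string first_reason;     // "count", "bytes", or empty if nothing was evicted
};

// Drop the oldest checkpoints until there is room for one more under both caps.
// byte_cap == 0 disables the byte cap (count-only behavior).
inline EvictionResult evict_fifo(std::deque<Checkpoint> & cps,
                                 std::size_t max_count,
                                 std::size_t byte_cap) {
    EvictionResult res;

    auto total_bytes = [&]() {
        std::size_t sum = 0;
        for (const Checkpoint & c : cps) sum += c.bytes;
        return sum;
    };
    auto over_count = [&]() { return cps.size() >= max_count; };
    auto over_bytes = [&]() { return byte_cap > 0 && total_bytes() >= byte_cap; };

    // The !cps.empty() guard keeps front()/pop_front() off an empty container.
    while (!cps.empty() && (over_count() || over_bytes())) {
        if (res.first_reason.empty()) {
            // Assumption: when both caps are exceeded, report the count cap.
            res.first_reason = over_count() ? "count" : "bytes";
        }
        res.bytes_freed += cps.front().bytes;
        cps.pop_front();
        ++res.evicted;
    }
    return res;
}
```

In this sketch eviction runs before the new checkpoint is appended, so the footprint right after the append can exceed the byte cap by up to one checkpoint's size. Whether the actual implementation behaves the same way is not stated in the PR text.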

Worst-case total bound

n_slots * --ctx-checkpoints-max-mib MiB. With heierchat's typical --parallel 1 (DFlash gate) that's 4 GiB; with --parallel 4 it's 16 GiB — both well under titan's 46 GiB and far below the 20 GiB single-slot accumulation we saw in the OOM logs.
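
The arithmetic behind that bound is small enough to check at compile time. The helper name `worst_case_checkpoint_mib` is hypothetical; the inputs are the default cap and the two `--parallel` values quoted above.

```cpp
#include <cstdint>

// Upper bound on checkpoint host RAM, in MiB: every slot filled to its byte cap.
// A cap of 0 means "byte cap disabled", so this bound does not apply in that case.
constexpr std::uint64_t worst_case_checkpoint_mib(std::uint64_t n_slots,
                                                  std::uint64_t cap_mib_per_slot) {
    return n_slots * cap_mib_per_slot;
}

// --parallel 1 with the 4096 MiB default: 4 GiB.
static_assert(worst_case_checkpoint_mib(1, 4096) == 4 * 1024, "1 slot -> 4 GiB");
// --parallel 4 with the 4096 MiB default: 16 GiB.
static_assert(worst_case_checkpoint_mib(4, 4096) == 16 * 1024, "4 slots -> 16 GiB");
```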

Follow-up (NOT in this PR)

Snoop-kube and I discussed an adaptive cap based on /proc/meminfo MemAvailable * 0.3 for unattended runs. Left for a separate change once the byte-cap mechanism has bake time in production.
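
To make the follow-up idea concrete, here is one possible shape for it. This is only a sketch of something that is explicitly not in this PR: `parse_mem_available_kib` and `adaptive_cap_mib` are hypothetical names, the 30% figure mirrors the `MemAvailable * 0.3` mentioned above but is expressed as an integer percentage to avoid floating-point rounding, and the parser reads from any stream so it can be exercised without a real `/proc/meminfo`. It assumes C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Find the "MemAvailable:" line in /proc/meminfo-style text and return its
// value, which the kernel reports in kB. Returns nullopt if the line is absent.
inline std::optional<std::uint64_t> parse_mem_available_kib(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string   key;
        std::uint64_t value = 0;
        if ((fields >> key >> value) && key == "MemAvailable:") {
            return value;
        }
    }
    return std::nullopt;
}

// Hypothetical policy: the byte cap is a fixed percentage of MemAvailable.
inline std::uint64_t adaptive_cap_mib(std::uint64_t mem_available_kib,
                                      std::uint64_t percent) {
    return mem_available_kib * percent / 100 / 1024;
}
```

On a Linux host the stream would typically be a `std::ifstream` opened on `/proc/meminfo`; when the line is missing, a caller could fall back to the fixed default.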

Verified

✅ cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-server succeeds.
✅ --help renders the new flag with the documented default.
No behavior change when --ctx-checkpoints-max-mib 0 (count-only legacy path).

The existing checkpoint cap is count-only (`--ctx-checkpoints`, default 32), which lets a single slot accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts and drives titan into SystemOOM (37 GB anon-rss on the 46 GB / 3 GB-swap node).

Adds a per-slot byte budget:

* `--ctx-checkpoints-max-mib N` (env `LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB`), default 4096 MiB / slot, 0 = disabled (count-only legacy behavior).
* Eviction in `create_checkpoint` now FIFO-evicts until BOTH caps are satisfied. Whichever cap bites first is reported via `reason=count|bytes` in the warning log so it's diagnosable from titan logs.
* The success log now also reports `slot total = X MiB / Y MiB cap` so the current footprint is visible per checkpoint create.

A 4 GiB-per-slot default bounds total host-RAM checkpoint use at `n_slots * 4 GiB`. With heierchat's typical `--parallel 1` (DFlash gate) that's 4 GiB worst-case; with `--parallel 4` it's 16 GiB. Both are well under titan's 46 GiB.

Follow-up (snoop-kube discussed): a dynamic cap based on `/proc/meminfo MemAvailable * 0.3` would adapt better than the fixed default. Left for a separate change once the byte-cap mechanism is in production.

Adds tools/server/tests/unit/test_ctx_checkpoints_bytes_cap.py with four scenarios for --ctx-checkpoints-max-mib:

* default-args: server starts; if checkpoints are created the "slot total = X MiB / Y MiB cap" footprint marker appears in the create_checkpoint log line.
* --ctx-checkpoints-max-mib 0: byte cap disabled, server starts fine, request succeeds (count-only legacy behavior).
* negative value: arg parser rejects, ServerProcess.start() raises.
* tiny byte cap + multi-turn chat: when eviction fires, the log reports reason=bytes. Skipped if tinyllama2 doesn't accumulate any checkpoints in the short conversation (rather than flaking).

ServerProcess gains three knobs for the existing/new flags: n_ctx_checkpoints, checkpoint_min_step, ctx_checkpoints_max_mib. All default to None (use server defaults) and only emit a CLI flag when set, so existing tests are unaffected.
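
The "only emit a CLI flag when set" behavior is what keeps existing tests unaffected, and it also distinguishes "unset" from an explicit 0. The real harness is Python; the sketch below shows the same pattern in C++ only to stay consistent with the diff above. `CheckpointKnobs` and `build_args` are hypothetical names, and only the two flags whose spelling appears in this PR are included. It assumes C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical mirror of the harness knobs; an empty optional means
// "use the server default" and produces no flag at all.
struct CheckpointKnobs {
    std::optional<int> n_ctx_checkpoints;        // --ctx-checkpoints
    std::optional<int> ctx_checkpoints_max_mib;  // --ctx-checkpoints-max-mib
};

inline std::vector<std::string> build_args(const CheckpointKnobs & knobs) {
    std::vector<std::string> args;
    if (knobs.n_ctx_checkpoints) {
        args.push_back("--ctx-checkpoints");
        args.push_back(std::to_string(*knobs.n_ctx_checkpoints));
    }
    if (knobs.ctx_checkpoints_max_mib) {
        // An explicit 0 is still emitted: it means "byte cap disabled",
        // which is different from leaving the knob unset.
        args.push_back("--ctx-checkpoints-max-mib");
        args.push_back(std::to_string(*knobs.ctx_checkpoints_max_mib));
    }
    return args;
}
```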

Bug-hunt finding from the PR #68 review: the eviction loop unconditionally calls slot.prompt.checkpoints.front() / .erase(begin()) based only on size >= n_ctx_checkpoints. When n_ctx_checkpoints is 0 and the list is empty (the likely user intent being "disable checkpoints"), both calls are undefined behavior on an empty container. The arg parser accepts 0 without complaint, and negative ints silently wrap through the size_t cast to values at or near SIZE_MAX, which makes the count cap a no-op.

Rather than tighten the arg parser and risk breaking unknown callers, treat n_ctx_checkpoints <= 0 as "checkpoints disabled" at the call boundary. That is a sensible interpretation, and it is also what the negative wrap was de facto delivering.

Adds test_ctx_checkpoints_zero_disables_creation to the existing test_ctx_checkpoints_bytes_cap.py: it drives the server with --ctx-checkpoints 0 and asserts that no "created context checkpoint" line ever fires while requests still succeed.
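
The two edge cases in this commit message can be reproduced without a server. The sketch below is illustrative only: `legacy_loop_would_evict` restates the pre-fix loop condition as a standalone predicate, and `checkpoints_enabled` is a hypothetical name for the `n_ctx_checkpoints <= 0` short-circuit.

```cpp
#include <cstddef>
#include <cstdint>

// The pre-fix loop condition: size >= (size_t) n_ctx_checkpoints.
// If this is true while the list is empty, front()/erase(begin()) would be
// called on an empty container.
inline bool legacy_loop_would_evict(std::size_t n_checkpoints, int n_ctx_checkpoints) {
    return n_checkpoints >= static_cast<std::size_t>(n_ctx_checkpoints);
}

// Guard in the spirit of the fix: a non-positive count disables checkpoints,
// so the eviction loop is never reached.
inline bool checkpoints_enabled(int n_ctx_checkpoints) {
    return n_ctx_checkpoints > 0;
}

// Signed-to-unsigned conversion is modular, so -1 becomes SIZE_MAX.
static_assert(static_cast<std::size_t>(-1) == SIZE_MAX, "-1 wraps to SIZE_MAX");
```

With `n_ctx_checkpoints == 0` and an empty list, the legacy predicate is true (0 >= 0), which is the empty-container case described above. With `-1` it is false for any realistic list size, so the count cap never fires.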

marksverdhei · 2026-06-05T14:10:40Z

Folded in an adjacent bug-hunt finding: when --ctx-checkpoints 0 is passed, the eviction loop hits UB on .front()/.erase(begin()) of an empty list. New commit 109a81b71 short-circuits create_checkpoint at n_ctx_checkpoints <= 0 (with test). Same fix would apply on ht regardless of the byte-cap work; rolling it into this PR avoids a conflicting patch later.

This was referenced Jun 4, 2026

Router co-loads same-device-pinned models and OOMs: --models-max fit decision ignores per-device VRAM #66

Open

Hivemind Maintenance Tasks Epoch 2 #79

Closed

marksverdhei added 2 commits June 5, 2026 15:58

This was referenced Jun 5, 2026

Hivemind Maintenance Tasks Epoch 3 #81

Closed

Hivemind Maintenance Tasks Epoch 4 #86

Closed

Hivemind Maintenance Tasks Epoch 5 #91

Closed

marksverdhei merged commit b0c7723 into ht Jun 12, 2026
6 of 12 checks passed

marksverdhei deleted the fix/server-checkpoint-bytes-cap branch June 12, 2026 18:35

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106

Merged

marksverdhei merged 3 commits into ht from fix/server-checkpoint-bytes-cap
