fix(server): per-slot byte cap on context checkpoints (closes #67)#68
Merged
Conversation
The existing checkpoint cap is count-only (`--ctx-checkpoints`, default 32), which lets a single slot accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts and drives titan into SystemOOM (37 GB anon-rss on the 46 GB / 3 GB-swap node). Adds a per-slot byte budget: * `--ctx-checkpoints-max-mib N` (env `LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB`), default 4096 MiB / slot, 0 = disabled (count-only legacy behavior). * Eviction in `create_checkpoint` now FIFO-evicts until BOTH caps satisfy. Whichever cap bites first is reported via `reason=count|bytes` in the warning log so it's diagnosable from titan logs. * The success log now also reports `slot total = X MiB / Y MiB cap` so the current footprint is visible per checkpoint create. A 4 GiB-per-slot default bounds total host-RAM checkpoint use at `n_slots * 4 GiB`. With heierchat's typical `--parallel 1` (DFlash gate) that's 4 GiB worst-case; with `--parallel 4` it's 16 GiB — both well under titan's 46 GiB. Follow-up (snoop-kube discussed): a dynamic cap based on `/proc/meminfo MemAvailable * 0.3` would adapt better than the fixed default — left for a separate change once the byte-cap mechanism is in production.
This was referenced Jun 4, 2026
added 2 commits
June 5, 2026 15:58
Adds tools/server/tests/unit/test_ctx_checkpoints_bytes_cap.py with four scenarios for --ctx-checkpoints-max-mib: * default-args: server starts; if checkpoints are created the "slot total = X MiB / Y MiB cap" footprint marker appears in the create_checkpoint log line. * --ctx-checkpoints-max-mib 0: byte cap disabled, server starts fine, request succeeds (count-only legacy behavior). * negative value: arg parser rejects, ServerProcess.start() raises. * tiny byte cap + multi-turn chat: when eviction fires, the log reports reason=bytes. Skipped if tinyllama2 doesn't accumulate any checkpoints in the short conversation (rather than flaking). ServerProcess gains three knobs for the existing/new flags: n_ctx_checkpoints, checkpoint_min_step, ctx_checkpoints_max_mib — all default to None (use server defaults) and only emit a CLI flag when set, so existing tests are unaffected.
Bug-hunt finding during the PR #68 review territory: the eviction loop unconditionally calls slot.prompt.checkpoints.front() / .erase(begin()) based only on size >= n_ctx_checkpoints. When n_ctx_checkpoints is 0 and the list is empty (the user-likely "disable checkpoints" intent), both calls hit empty-container UB. The arg parser accepts 0 without complaint and silently wraps negative ints via the size_t cast to SIZE_MAX (which is also a no-op cap). Rather than tighten the arg parser and risk breaking unknown callers, treat n_ctx_checkpoints <= 0 as "checkpoints disabled" at the call boundary — a sensible interpretation that's also what the negative-wrap was de facto delivering. Adds test_ctx_checkpoints_zero_disables_creation to the existing test_ctx_checkpoints_bytes_cap.py: drives the server with --ctx-checkpoints 0 and asserts no "created context checkpoint" line ever fires while requests still succeed.
Author
|
Folded in an adjacent bug-hunt finding: when |
This was referenced Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #67.
Why
The existing checkpoint cap is count-only (
--ctx-checkpoints, default 32), so a single slot can accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts. On titan (46 GB RAM / 3 GB swap) this drivesllama-serverto 37 GB anon-rss → SystemOOM → 500s.What this adds
A per-slot byte budget alongside the existing count cap:
--ctx-checkpoints-max-mib N(envLLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB)--ctx-checkpoints-max-mib 0(count-only)reason=count|bytesin the warning log.The success log also now reports
slot total = X MiB / Y MiB capso the current footprint is visible per checkpoint-create.Worst-case total bound
n_slots * --ctx-checkpoints-max-mibMiB. With heierchat's typical--parallel 1(DFlash gate) that's 4 GiB; with--parallel 4it's 16 GiB — both well under titan's 46 GiB and far below the 20 GiB single-slot accumulation we saw in the OOM logs.Follow-up (NOT in this PR)
Snoop-kube and I discussed an adaptive cap based on
/proc/meminfo MemAvailable * 0.3for unattended runs. Left for a separate change once the byte-cap mechanism has bake time in production.Verified
cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-serversucceeds.--helprenders the new flag with the documented default.--ctx-checkpoints-max-mib 0(count-only legacy path).