fix(server): per-slot byte cap on context checkpoints (closes #67)#68

Merged: marksverdhei merged 3 commits into ht from fix/server-checkpoint-bytes-cap on Jun 12, 2026

Closes #67.

Why

The existing checkpoint cap is count-only (--ctx-checkpoints, default 32), so a single slot can accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts. On titan (46 GB RAM / 3 GB swap) this drives llama-server to 37 GB anon-rss → SystemOOM → 500s.

What this adds

A per-slot byte budget alongside the existing count cap:

  • CLI: --ctx-checkpoints-max-mib N (env LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB)
  • Default: 4096 MiB per slot
  • Disable / legacy behavior: --ctx-checkpoints-max-mib 0 (count-only)
  • Eviction: FIFO until both caps are satisfied. Whichever cap bites first is reported via reason=count|bytes in the warning log.

```diff
+ auto over_count = [&]() { return slot.prompt.checkpoints.size() >= (size_t) params_base.n_ctx_checkpoints; };
+ auto over_bytes = [&]() { return byte_cap > 0 && !empty && total_bytes() >= byte_cap; };
- while (slot.prompt.checkpoints.size() >= (size_t) params_base.n_ctx_checkpoints) {
+ while (over_count() || over_bytes()) {
```

The success log now also reports slot total = X MiB / Y MiB cap, so the slot's current footprint is visible each time a checkpoint is created.
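For readers who want to see the dual-cap loop in isolation, here is a minimal standalone sketch. It is not the server code: `Checkpoint`, `evict_fifo`, and the choice to report `count` when both caps are exceeded at once are assumptions made for illustration, and the byte cap is assumed to be already converted from MiB to bytes.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a context checkpoint; only its size matters here.
struct Checkpoint {
    std::size_t n_bytes;
};

// Evicts oldest-first until both caps are satisfied. Returns the cap that
// triggered the first eviction ("count" or "bytes"), or "" if nothing was
// evicted. byte_cap == 0 disables the byte cap (count-only behavior).
inline std::string evict_fifo(std::deque<Checkpoint> & checkpoints,
                              std::size_t max_count,
                              std::size_t byte_cap) {
    auto total_bytes = [&]() {
        std::size_t sum = 0;
        for (const auto & c : checkpoints) {
            sum += c.n_bytes;
        }
        return sum;
    };
    auto over_count = [&]() { return checkpoints.size() >= max_count; };
    auto over_bytes = [&]() {
        return byte_cap > 0 && !checkpoints.empty() && total_bytes() >= byte_cap;
    };

    std::string reason;
    // The extra !empty() guard keeps pop_front() safe even if max_count is 0.
    while (!checkpoints.empty() && (over_count() || over_bytes())) {
        if (reason.empty()) {
            // Assumption: when both caps are exceeded, report "count".
            reason = over_count() ? "count" : "bytes";
        }
        checkpoints.pop_front();
    }
    return reason;
}
```

If the new checkpoint is appended only after this loop has run, as the `>=` comparisons suggest, a slot can briefly sit above the byte cap by up to the size of that one new checkpoint.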

Worst-case total bound

The worst-case host-RAM total is n_slots * --ctx-checkpoints-max-mib MiB. With heierchat's typical --parallel 1 (DFlash gate) that's 4 GiB; with --parallel 4 it's 16 GiB. Both are well under titan's 46 GiB and below the 20 GiB single-slot accumulation we saw in the OOM logs.
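The arithmetic can be checked at compile time. This is only a sketch: `worst_case_mib` is a hypothetical helper, and it assumes the slot count equals the `--parallel` value and that every slot fills its budget exactly.

```cpp
#include <cstdint>

// Worst-case checkpoint footprint across all slots, in MiB.
constexpr std::uint64_t worst_case_mib(std::uint64_t n_slots,
                                       std::uint64_t cap_mib_per_slot) {
    return n_slots * cap_mib_per_slot;
}

// Default cap of 4096 MiB per slot.
static_assert(worst_case_mib(1, 4096) == 4096,  "--parallel 1: 4 GiB");
static_assert(worst_case_mib(4, 4096) == 16384, "--parallel 4: 16 GiB");
```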

Follow-up (NOT in this PR)

Snoop-kube and I discussed an adaptive cap based on /proc/meminfo MemAvailable * 0.3 for unattended runs. Left for a separate change once the byte-cap mechanism has bake time in production.
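To make the follow-up idea concrete, a rough sketch of how such a cap could be derived is shown here. Nothing in it exists in this PR: `parse_mem_available_kib` and `adaptive_cap_mib` are hypothetical names, and whether the 30% would apply per slot or across all slots is left open.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Parses the "MemAvailable:" line from /proc/meminfo-style text.
// Returns the value in KiB (the unit /proc/meminfo labels "kB"),
// or 0 if the field is missing.
inline std::uint64_t parse_mem_available_kib(std::istream & in) {
    const std::string key = "MemAvailable:";
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(key, 0) == 0) {  // line starts with the key
            std::istringstream fields(line.substr(key.size()));
            std::uint64_t kib = 0;
            fields >> kib;
            return kib;
        }
    }
    return 0;
}

// Hypothetical policy: 30% of MemAvailable, expressed in MiB.
inline std::uint64_t adaptive_cap_mib(std::uint64_t mem_available_kib) {
    return mem_available_kib * 3 / 10 / 1024;
}
```

On Linux this would be fed from `std::ifstream("/proc/meminfo")`; a result of 0 would need a fallback to the fixed default rather than being read as "byte cap disabled".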

Verified

  • cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-server succeeds.
  • --help renders the new flag with the documented default.
  • No behavior change when --ctx-checkpoints-max-mib 0 (count-only legacy path).

The existing checkpoint cap is count-only (`--ctx-checkpoints`, default 32),
which lets a single slot accumulate ~20 GB of host-RAM checkpoints under
heierchat's long contexts and drives titan into SystemOOM (37 GB anon-rss
on the 46 GB / 3 GB-swap node).

Adds a per-slot byte budget:

* `--ctx-checkpoints-max-mib N` (env `LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB`),
  default 4096 MiB / slot, 0 = disabled (count-only legacy behavior).
* Eviction in `create_checkpoint` now FIFO-evicts until both caps are
  satisfied. Whichever cap bites first is reported via `reason=count|bytes`
  in the warning log so it's diagnosable from titan logs.
* The success log now also reports `slot total = X MiB / Y MiB cap` so the
  current footprint is visible per checkpoint create.

A 4 GiB-per-slot default bounds total host-RAM checkpoint use at
`n_slots * 4 GiB`. With heierchat's typical `--parallel 1` (DFlash gate)
that's 4 GiB worst-case; with `--parallel 4` it's 16 GiB — both well
under titan's 46 GiB.

Follow-up (snoop-kube discussed): a dynamic cap based on
`/proc/meminfo MemAvailable * 0.3` would adapt better than the fixed
default — left for a separate change once the byte-cap mechanism is
in production.
marksverdhei added 2 commits June 5, 2026 15:58
Adds tools/server/tests/unit/test_ctx_checkpoints_bytes_cap.py with
four scenarios for --ctx-checkpoints-max-mib:

* default-args: server starts; if checkpoints are created the
  "slot total = X MiB / Y MiB cap" footprint marker appears in the
  create_checkpoint log line.
* --ctx-checkpoints-max-mib 0: byte cap disabled, server starts fine,
  request succeeds (count-only legacy behavior).
* negative value: arg parser rejects, ServerProcess.start() raises.
* tiny byte cap + multi-turn chat: when eviction fires, the log
  reports reason=bytes. Skipped if tinyllama2 doesn't accumulate
  any checkpoints in the short conversation (rather than flaking).

ServerProcess gains three knobs for the existing/new flags:
n_ctx_checkpoints, checkpoint_min_step, ctx_checkpoints_max_mib —
all default to None (use server defaults) and only emit a CLI flag
when set, so existing tests are unaffected.
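
The two log markers these scenarios look for can be matched mechanically. The sketch below is in C++ rather than the Python of the test suite, and it is an illustration only: `parse_footprint` and `parse_evict_reason` are hypothetical helpers, and the numeric format of X and Y (plain or decimal numbers) is an assumption, since only the surrounding marker text is documented here.

```cpp
#include <optional>
#include <regex>
#include <string>

struct Footprint {
    double used_mib;
    double cap_mib;
};

// Extracts X and Y from a "slot total = X MiB / Y MiB cap" marker, if present.
inline std::optional<Footprint> parse_footprint(const std::string & line) {
    static const std::regex re(
        R"(slot total = ([0-9]+(?:\.[0-9]+)?) MiB / ([0-9]+(?:\.[0-9]+)?) MiB cap)");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return std::nullopt;
    }
    return Footprint{std::stod(m[1].str()), std::stod(m[2].str())};
}

// Extracts "count" or "bytes" from a "reason=count|bytes" marker, if present.
inline std::optional<std::string> parse_evict_reason(const std::string & line) {
    static const std::regex re(R"(reason=(count|bytes))");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return std::nullopt;
    }
    return m[1].str();
}
```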

Bug-hunt finding from the PR #68 review: the eviction loop
unconditionally calls slot.prompt.checkpoints.front() / .erase(begin())
based only on size >= n_ctx_checkpoints. When n_ctx_checkpoints is 0 and
the list is empty (the likely user intent being "disable checkpoints"),
both calls hit empty-container UB.

The arg parser accepts 0 without complaint, and negative ints silently
wrap through the size_t cast (-1 becomes SIZE_MAX), which makes the count
cap a no-op. Rather than tighten the arg parser and risk breaking unknown
callers, treat n_ctx_checkpoints <= 0 as "checkpoints disabled" at the
call boundary, a sensible interpretation that's also what the
negative-wrap was de facto delivering.

Adds test_ctx_checkpoints_zero_disables_creation to the existing
test_ctx_checkpoints_bytes_cap.py: drives the server with
--ctx-checkpoints 0 and asserts no "created context checkpoint" line
ever fires while requests still succeed.
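
The hazard and the guard can be reproduced outside the server. This is a reduced sketch, not the committed code: `create_checkpoint_sketch` is a hypothetical name and the checkpoints are plain ints.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Casting a negative int to size_t wraps around: -1 becomes SIZE_MAX, so a
// "size() >= (size_t) n" comparison never fires for n == -1.
static_assert(static_cast<std::size_t>(-1) == SIZE_MAX, "-1 wraps to SIZE_MAX");

// Hypothetical reduction of create_checkpoint.
// Returns true if the checkpoint was stored.
inline bool create_checkpoint_sketch(std::deque<int> & checkpoints,
                                     int n_ctx_checkpoints,
                                     int new_checkpoint) {
    if (n_ctx_checkpoints <= 0) {
        // Disabled: return before the loop, so front()/erase(begin()) are
        // never reached with an empty container.
        return false;
    }
    // With n_ctx_checkpoints >= 1, the loop body only runs when the
    // container holds at least one element, so erase(begin()) is valid.
    while (checkpoints.size() >= static_cast<std::size_t>(n_ctx_checkpoints)) {
        checkpoints.erase(checkpoints.begin());
    }
    checkpoints.push_back(new_checkpoint);
    return true;
}
```

Without the early return, n_ctx_checkpoints == 0 makes the loop condition `0 >= 0` true on an empty container, which is exactly the undefined behavior described above.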

@marksverdhei (Author):

Folded in an adjacent bug-hunt finding: when --ctx-checkpoints 0 is passed, the eviction loop hits UB on .front()/.erase(begin()) of an empty list. New commit 109a81b71 short-circuits create_checkpoint at n_ctx_checkpoints <= 0 (with test). Same fix would apply on ht regardless of the byte-cap work; rolling it into this PR avoids a conflicting patch later.

marksverdhei merged commit b0c7723 into ht on Jun 12, 2026
6 of 12 checks passed
marksverdhei deleted the fix/server-checkpoint-bytes-cap branch on June 12, 2026 18:35

Successfully merging this pull request may close these issues.

Context checkpoints capped by count not bytes -> host-RAM OOM under long contexts
