fix: bound compactFullSweep so a compaction sweep cannot hang the turn #712

Merged: jalehman merged 3 commits into Martian-Engineering:main from 100yenadmin:fix/bound-compact-full-sweep on May 19, 2026
100yenadmin (Collaborator) commented May 19, 2026

Problem

On OpenClaw 2026.5.18 + lossless-claw 0.10.0, a context compaction can leave the
agent unresponsive for tens of minutes. Three compounding LCM-side causes:

  1. compactFullSweep was unbounded. The Phase 1 leaf-pass loop in
    src/compaction.ts (while (true)) had no max-iteration cap and no
    wall-clock deadline — one leaf pass per raw-message chunk. Large
    conversations produce many passes (observed: 16 passes on a 308K-token
    conversation).
  2. The sweep ran synchronously on the turn-critical path. With
    proactiveThresholdCompactionMode: "deferred", recorded compaction debt is
    drained inside the next turn's assemble(), which runs compactFullSweep
    before returning the assembled prompt. Each leaf pass also does a synchronous
    node:sqlite scan that blocks the Node event loop.
  3. compactUntilUnder was unbounded across rounds. Bounding one sweep is
    not enough: compactUntilUnder (src/compaction.ts) wraps a
    for (round = 1; round <= maxRounds; round++) loop around sweeps, calling
    compact() → compactFullSweep once per round with maxRounds = 10. Each
    compactFullSweep invocation re-initializes its own sweepStartedAt /
    sweepDeadlineAt, so the per-sweep sweepDeadlineMs resets every round.
    Worst case is maxRounds × sweepDeadlineMs ≈ 10 × 120s ≈ 20 minutes. This is
    the codex automatic compaction path (force:false,
    compactionTarget:"budget"), which is the common one, so the per-sweep
    bound alone still leaves a real ~20-minute stall.

When the summarizer is slow or rate-limited, each pass burns its full
summaryTimeoutMs (default 60s, often configured 180s) before falling back;
16 passes × that = 16–48 minutes of a hung turn.
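
The stall estimates above can be checked with a few lines of arithmetic. This is only a worked calculation using the figures quoted in this description (16 passes, 60s and 180s summarizer timeouts, maxRounds = 10, sweepDeadlineMs = 120000); the function names are made up for illustration and do not exist in the plugin.

```typescript
// Worked arithmetic for the worst-case stalls described above.
// Function names are hypothetical; only the input figures come from the PR text.
const MS_PER_MINUTE = 60_000;

/** Minutes spent if every one of `passes` leaf passes burns its full summarizer timeout. */
function worstCaseSweepMinutes(passes: number, summaryTimeoutMs: number): number {
  return (passes * summaryTimeoutMs) / MS_PER_MINUTE;
}

/** Minutes spent if every compactUntilUnder round uses its whole per-sweep deadline. */
function worstCaseRoundsMinutes(maxRounds: number, sweepDeadlineMs: number): number {
  return (maxRounds * sweepDeadlineMs) / MS_PER_MINUTE;
}

console.log(worstCaseSweepMinutes(16, 60_000));   // 16
console.log(worstCaseSweepMinutes(16, 180_000));  // 48
console.log(worstCaseRoundsMinutes(10, 120_000)); // 20
```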

Fix

  • Bound the sweep. compactFullSweep now enforces both a hard per-pass
    iteration cap (maxSweepIterations, default 12) and a wall-clock deadline
    (sweepDeadlineMs, default 120000). The counter and deadline are shared
    across Phase 1 (leaf) and Phase 2 (condensed), so the total sweep stays
    bounded. On hitting either limit the sweep stops before starting another
    pass and returns the consistent partial result, logging a clear
    compactFullSweep stopped at … warning. Remaining context pressure is
    picked up by the next sweep.
  • Bound the whole compactUntilUnder operation. compactUntilUnder now
    computes one operation-wide wall-clock deadline at its start and (a) threads
    it into every round's compactFullSweep via a new optional
    operationDeadlineAt input, so a sweep stops at whichever is sooner — its
    own sweepDeadlineMs or the operation deadline — and (b) checks it in the
    for round loop, bailing before the next round with the consistent partial
    result and a compactUntilUnder stopped at … warning. The total budget is a
    separate knob, compactUntilUnderDeadlineMs (default 300000), not a
    reuse of sweepDeadlineMs: a single legitimate round's sweep can use most of
    sweepDeadlineMs, so reusing it as the operation-wide budget would let the
    first round alone exhaust it and break multi-round compaction. 5 minutes
    leaves room for a few full-deadline sweeps while capping the worst case well
    below 20 minutes.
  • Yield the event loop. The sweep now awaits a macrotask
    (setImmediate) between the synchronous node:sqlite scans of consecutive
    passes, so a long sweep cannot freeze the gateway for its whole duration.
  • Time-box the inline drain. The assemble() deferred-debt drain reaches
    compactFullSweep through compact(), so the deadline above also bounds
    what assemble() can do inline — the turn is no longer held hostage by a
    full sweep.
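
As a rough illustration of the bounded sweep described in the bullets above, the loop might look something like the sketch below. It is not the code in src/compaction.ts: only maxSweepIterations, sweepDeadlineMs and operationDeadlineAt are names from this PR, while boundedSweep, runPass, SweepLimits, SweepResult and yieldToEventLoop are hypothetical. The sketch assumes a single pass callback stands in for both the leaf and condensed phases.

```typescript
// Minimal sketch of a sweep bounded by an iteration cap and a wall-clock deadline.
// Hypothetical names: boundedSweep, runPass, SweepLimits, SweepResult, yieldToEventLoop.
interface SweepLimits {
  maxSweepIterations: number;
  sweepDeadlineMs: number;
  operationDeadlineAt?: number; // absolute epoch ms, optional
}

interface SweepResult {
  passesRun: number;
  stoppedBy: "done" | "iteration-cap" | "deadline";
}

/** Resolves on a later macrotask so other event-loop work can run between passes. */
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setImmediate(() => resolve()));

async function boundedSweep(
  runPass: () => Promise<boolean>, // resolves false when nothing is left to compact
  limits: SweepLimits,
  now: () => number = Date.now,
): Promise<SweepResult> {
  // The effective deadline is whichever comes first: sweep or operation deadline.
  const deadlineAt = Math.min(
    now() + limits.sweepDeadlineMs,
    limits.operationDeadlineAt ?? Infinity,
  );
  let passesRun = 0;

  while (true) {
    // Both limits are checked before starting another pass, never mid-pass.
    if (passesRun >= limits.maxSweepIterations) return { passesRun, stoppedBy: "iteration-cap" };
    if (now() >= deadlineAt) return { passesRun, stoppedBy: "deadline" };

    const didWork = await runPass();
    if (!didWork) return { passesRun, stoppedBy: "done" };
    passesRun++;

    await yieldToEventLoop();
  }
}
```

Checking the limits only at pass boundaries is what keeps the partial result consistent in this sketch: a pass either completes or is never started.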

All three limits are configurable via plugin config (maxSweepIterations,
sweepDeadlineMs, compactUntilUnderDeadlineMs) or the
LCM_MAX_SWEEP_ITERATIONS / LCM_SWEEP_DEADLINE_MS /
LCM_COMPACT_UNTIL_UNDER_DEADLINE_MS environment variables, following the
existing config conventions.
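
To make the three knobs concrete, here is one way such settings could be resolved. The setting names, environment variable names and defaults are taken from this description; the resolver itself (resolveSweepConfig, parsePositiveInt) is hypothetical, and the precedence shown (environment variable, then plugin config, then default) is an assumption rather than a statement of how the plugin's config loader actually behaves.

```typescript
// Hypothetical resolver for the three limits. Precedence (env > plugin config > default)
// is assumed for illustration; the real loader may differ.
interface SweepConfig {
  maxSweepIterations: number;
  sweepDeadlineMs: number;
  compactUntilUnderDeadlineMs: number;
}

// Defaults as stated in the PR description.
const DEFAULTS: SweepConfig = {
  maxSweepIterations: 12,
  sweepDeadlineMs: 120_000,
  compactUntilUnderDeadlineMs: 300_000,
};

const ENV_NAMES: Record<keyof SweepConfig, string> = {
  maxSweepIterations: "LCM_MAX_SWEEP_ITERATIONS",
  sweepDeadlineMs: "LCM_SWEEP_DEADLINE_MS",
  compactUntilUnderDeadlineMs: "LCM_COMPACT_UNTIL_UNDER_DEADLINE_MS",
};

/** Returns a positive integer, or undefined for missing or malformed input. */
function parsePositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

function resolveSweepConfig(
  pluginConfig: Partial<SweepConfig>,
  env: Record<string, string | undefined>,
): SweepConfig {
  const resolved: SweepConfig = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS) as (keyof SweepConfig)[]) {
    resolved[key] = parsePositiveInt(env[ENV_NAMES[key]]) ?? pluginConfig[key] ?? DEFAULTS[key];
  }
  return resolved;
}

// Example: one value from plugin config, one from the environment, one default.
console.log(resolveSweepConfig({ sweepDeadlineMs: 90_000 }, { LCM_MAX_SWEEP_ITERATIONS: "20" }));
```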

This is complementary to the interceptCompaction handoff work (#665) and the
threshold hard floor (#619): it bounds the sweep itself, orthogonal to those.

The host-side half of this fix — giving the OpenClaw host a safety timeout
so it does not await a slow plugin-owned compaction forever — is in
openclaw/openclaw#84083. The two PRs together close the stall: this one
keeps an individual sweep bounded; that one bounds the host's wait on the
plugin regardless.

Related: #584 (withTimeout doesn't abort the underlying summarizer call). This
PR does not change withTimeout — bounding the sweep already caps total time
even when a single pass burns its full timeout. Aborting the underlying call is
a separate, larger change.

Tests

  • New compactFullSweep bounds suite in test/lcm-integration.test.ts: a sweep
    that would exceed the iteration cap stops cleanly with a consistent partial
    result; raising the cap genuinely runs more passes (probe-verified: 14 passes
    uncapped vs 2 capped — directly reproduces the 16-pass bug); a sweep that
    exceeds the wall-clock deadline stops cleanly; a bounded sweep returns within
    a small multiple of the deadline.
  • New compactUntilUnder bounds suite in test/lcm-integration.test.ts: a
    multi-round compactUntilUnder that would otherwise run
    maxRounds × sweepDeadlineMs stops at the operation deadline instead (total
    wall-clock bounded to a small multiple of compactUntilUnderDeadlineMs, not
    maxRounds × sweepDeadlineMs); it returns a consistent partial result on the
    deadline; and a generous deadline does not cut a legitimate fast multi-round
    run short.
  • test/config.test.ts: default, plugin-config, and env-override coverage for
    all three new settings.
  • test/circuit-breaker.test.ts: the cooldown test now fakes only Date
    (its cooldown is Date.now()-based) so the new in-sweep setImmediate yield
    still runs under that test.
  • npm run build (the CI build gate) passes; the local vitest suite passes for
    all touched and adjacent files; tsc --noEmit introduces zero new errors
    versus the prior branch state.

Fixes #711

Takeover hardening update

During takeover review, an additional edge case was found and fixed in eb53bcf: if the sweep deadline expired while selecting a leaf chunk or condensation candidate, compactFullSweep could still start one more summarizer pass after the budget was gone.

The branch now rechecks the sweep budget after selection and before starting either the leaf or condensed summarizer pass. Two regression tests cover deadline expiry during selection for both paths.
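
The recheck described above can be pictured as a second deadline test between candidate selection and the summarizer call. The sketch below is illustrative only: runOnePass, selectCandidate and summarize are hypothetical names, and the real code has separate leaf and condensed paths rather than one generic function.

```typescript
// Sketch of rechecking the budget after selection and before summarizing.
// All names here are hypothetical and not taken from the plugin source.
type PassOutcome = "ran" | "nothing-to-do" | "deadline";

async function runOnePass<T>(
  selectCandidate: () => Promise<T | undefined>,
  summarize: (candidate: T) => Promise<void>,
  deadlineAt: number, // absolute epoch ms
  now: () => number = Date.now,
): Promise<PassOutcome> {
  if (now() >= deadlineAt) return "deadline";

  const candidate = await selectCandidate();
  if (candidate === undefined) return "nothing-to-do";

  // Selection can itself take time, so the budget may be gone by now.
  // Without this second check one more summarizer pass would start late.
  if (now() >= deadlineAt) return "deadline";

  await summarize(candidate);
  return "ran";
}
```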

Validation run from /Volumes/LEXAR/repos/lossless-claw-fix-sweep:

  • ./node_modules/.bin/vitest run test/lcm-integration.test.ts test/config.test.ts test/circuit-breaker.test.ts --maxWorkers=1 -> 160/160 passing
  • npm run build -> passing
  • GitHub checks on the updated head -> passing

A threshold full sweep ran an unbounded leaf-pass loop (one pass per raw
chunk) synchronously on the turn-critical path — drained inside the next
turn's assemble() under deferred compaction. With a slow or rate-limited
summarizer each pass burns its full summaryTimeoutMs, so a large
conversation (observed: 16 passes on 308K tokens) left the agent
unresponsive for tens of minutes.

Bound compactFullSweep with both a hard per-pass iteration cap
(maxSweepIterations, default 12) and a wall-clock deadline
(sweepDeadlineMs, default 120000), shared across Phase 1 and Phase 2 so
the total sweep is bounded. On hitting either limit the sweep stops
before starting another pass and returns the consistent partial result,
logging a clear warning. Remaining context pressure is picked up by the
next sweep. The deadline also time-boxes the inline assemble()
deferred-debt drain.

Yield the Node event loop (setImmediate) between the synchronous
node:sqlite scans of consecutive passes so a long sweep cannot freeze
the gateway for its whole duration.

Both limits are configurable via plugin config or the
LCM_MAX_SWEEP_ITERATIONS / LCM_SWEEP_DEADLINE_MS env vars.

Eva added 2 commits May 19, 2026 20:51
PR Martian-Engineering#712 bounded a single compactFullSweep with a per-pass iteration cap
and a wall-clock deadline (sweepDeadlineMs), but compactUntilUnder wraps
a `for round = 1..maxRounds` loop around sweeps. Each compactFullSweep
invocation re-initializes its own sweepDeadlineAt, so Martian-Engineering#712's deadline
resets every round. Worst case for the codex automatic compaction path
(force:false, compactionTarget:"budget") is maxRounds × sweepDeadlineMs
≈ 10 × 120s ≈ 20 minutes — a real stall Martian-Engineering#712 does not catch.

Bound the whole compactUntilUnder operation with its own wall-clock
budget. _compactUntilUnderImpl computes one operationDeadlineAt at the
start and (a) threads it into every round's compactFullSweep via a new
optional operationDeadlineAt input, so a sweep stops at whichever is
sooner — its own sweepDeadlineMs or the operation deadline — and (b)
checks it in the `for round` loop, bailing before the next round with
the consistent partial result and a `compactUntilUnder stopped at …`
warning.

The total budget is a separate knob, compactUntilUnderDeadlineMs
(default 300000), not a reuse of sweepDeadlineMs: a single legitimate
round's sweep can use most of sweepDeadlineMs, so reusing it as the
operation-wide budget would let the first round alone exhaust it and
break multi-round compaction. 5 minutes leaves room for a few
full-deadline sweeps while capping the worst case well below 20
minutes. Configurable via plugin config or LCM_COMPACT_UNTIL_UNDER_DEADLINE_MS.
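
The operation-wide budget in the commit message above might be sketched as follows. This is a simplified illustration, not the body of _compactUntilUnderImpl: compactUntilUnderSketch, runSweep and the result shape are hypothetical, and the sweep is assumed to honor the operationDeadlineAt it is handed.

```typescript
// Sketch of one wall-clock deadline shared by every round of a multi-round operation.
// Hypothetical names: compactUntilUnderSketch, runSweep, OperationResult.
interface OperationResult {
  rounds: number;
  stoppedBy: "under-target" | "max-rounds" | "deadline";
}

async function compactUntilUnderSketch(
  // Runs one sweep; resolves true once the context is under the target.
  runSweep: (operationDeadlineAt: number) => Promise<boolean>,
  opts: { maxRounds: number; compactUntilUnderDeadlineMs: number },
  now: () => number = Date.now,
): Promise<OperationResult> {
  // Computed once, so it does not reset per round the way a per-sweep deadline does.
  const operationDeadlineAt = now() + opts.compactUntilUnderDeadlineMs;
  let rounds = 0;

  for (let round = 1; round <= opts.maxRounds; round++) {
    // Bail before starting another round once the budget is spent.
    if (now() >= operationDeadlineAt) return { rounds, stoppedBy: "deadline" };

    const underTarget = await runSweep(operationDeadlineAt);
    rounds = round;
    if (underTarget) return { rounds, stoppedBy: "under-target" };
  }
  return { rounds, stoppedBy: "max-rounds" };
}
```

With maxRounds = 10, sweeps that each use a full 120s, and a 300s budget, this sketch stops after the third round at the 5-minute mark instead of running all ten rounds.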
@jalehman jalehman self-assigned this May 19, 2026
@jalehman jalehman merged commit 67b7f51 into Martian-Engineering:main May 19, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Compaction stall: LCM's compactFullSweep and compactUntilUnder are unbounded — a compaction can hang the agent for ~20 minutes
