Compaction Timeout With Late Success: Transcript Compacted, Counter Stale
Summary
A manual compaction on session agent:christina:feishu:group:oc_03f1133e89a8d5ee60128a2c3ebca80a reported:
Compaction failed: Compaction timed out
but the same session later showed a sharply reduced context footprint, and the session transcript contains a real persisted compaction entry for that attempt.
This means the compaction did eventually complete, but the session store counter used by /status was not updated to reflect it.
Symptom
- /status before manual compaction showed Context: 245k/272k (90%) · Compactions: 1
- OpenClaw later emitted Compaction failed: Compaction timed out
- The session transcript later persisted a second compaction entry for that session
- /status after that showed Context: 85k/272k (31%) · Compactions: 1
Observed result:
- transcript state says compaction happened
- context usage behavior says compaction happened
- /status compaction counter still says it did not
Evidence
Session transcript
File:
/Users/jojo/.openclaw/agents/christina/sessions/22efdbf9-aac9-44ad-bb64-2d91b20305aa.jsonl
Relevant entries:
- line 489: status before timeout
- line 490: Compaction failed: Compaction timed out
- line 491: actual persisted compaction entry from the same attempt
- line 516: later status showing much smaller context but unchanged compaction count
Session store
File:
/Users/jojo/.openclaw/agents/christina/sessions/sessions.json
The persisted compactionCount for this session remained 1 even though the transcript contains two compaction entries total.
Root Cause Hypothesis
This looks like a race between:
- the runner's compaction wait timeout
- the actual asynchronous completion of compaction
- the end-of-run writeback into sessions.json
Current behavior appears to be:
- the runner waits up to 60s for compaction retry bookkeeping
- if that wait times out, the run proceeds
- end-of-run session-store update uses result.meta.agentMeta.compactionCount
- if that meta value is still 0 at writeback time, sessions.json does not increment compactionCount
- if compaction finishes later and persists into transcript, the counter is not reconciled afterward
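That sequence can be sketched as a minimal TypeScript model (all names here are assumptions reconstructed from this report, not the actual OpenClaw implementation):

```typescript
// Simplified model of the suspected writeback race. If the compaction end
// event lands after the 60s wait times out, the run's meta counter is still
// 0 at writeback time and the store is never incremented.
type StoreEntry = { compactionCount: number };

function updateStoreAfterRun(
  entry: StoreEntry,
  runMeta: { compactionCount?: number },
): StoreEntry {
  const compactionsThisRun = Math.max(0, runMeta.compactionCount ?? 0);
  if (compactionsThisRun === 0) {
    // Late success: the transcript already has the compaction entry,
    // but nothing ever reconciles the store afterward.
    return entry;
  }
  return { compactionCount: entry.compactionCount + compactionsThisRun };
}
```

In the reported incident, the run-level counter was presumably still unset when the writeback ran, leaving the persisted count at 1.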
Code References
- wait timeout path: /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js, around waitForCompactionRetryWithAggregateTimeout(...)
- end-of-run store update: /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js, updateSessionStoreAfterAgentRun(...)
- current counter write logic: increments only from result.meta.agentMeta.compactionCount
Why This Is A Bug
Yes, this should be treated as a bug.
The problem is not the 60s timeout itself. The timeout behavior is reasonable.
The bug is that:
- the user-visible failure message implies compaction did not complete
- the transcript later proves it did complete
- the /status counter remains stale
That is a state-consistency bug between transcript persistence and session-store reporting.
Non-Goals
- Do not increase the 60s timeout
- Do not block the active conversation longer
- Do not make the runner poll aggressively or hold extra heavy state in memory
Recommended Fix Direction
Preferred fix: post-compaction reconciliation write
When a compaction eventually persists successfully, perform a lightweight session-store reconciliation step that updates compactionCount independently of the original run's already-finished writeback.
Concretely:
- after a compaction entry is durably appended to transcript
- issue a tiny follow-up store update for that sessionKey
- set compactionCount = max(existing compactionCount, transcript compaction count)
- optionally also refresh a lightweight updatedAt
Advantages:
- no timeout increase
- no need to keep waiting in the runner
- no expensive transcript rescans on every /status
- directly fixes the stale counter at the point where truth becomes known
Acceptable alternative: lazy reconcile on /status
When /status loads a session, if there is evidence of a recent compaction-timeout mismatch, reconcile compactionCount from transcript before rendering.
This is less attractive because:
- it pushes repair into read path
- it can add latency to /status
- it leaves stale state around until someone explicitly checks status
Another acceptable alternative: append an explicit async completion event
If compaction completes after timeout, emit a small internal completion event and let a background handler update the store.
This is also viable, but more moving parts than the direct reconciliation write.
Recommended Minimal Implementation
- Keep the current 60s timeout unchanged.
- Keep the current "proceed after timeout" behavior unchanged.
- After transcript compaction persistence succeeds, call a dedicated helper like:
reconcileSessionStoreAfterCompaction(sessionKey, sessionFile)
- That helper should:
- read current session-store entry
- determine authoritative compaction count cheaply
- update only if transcript-derived count is greater than stored count
Suggested data source for reconciliation
Best option:
- increment store from the same success path that appends the compaction entry
Fallback option:
- count transcript type:"compaction" entries only when a late-success path is detected
Avoid:
- full transcript scans on every request
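For the fallback path only, a cheap per-session count might look like this sketch (the type:"compaction" line shape is assumed from this report, and the scan should run only on the rare timeout-late-success branch):

```typescript
import { readFileSync } from "node:fs";

// Count persisted compaction entries in a single session transcript (.jsonl).
// Best-effort: malformed or partial lines are skipped, not fatal.
function countTranscriptCompactions(sessionFile: string): number {
  let count = 0;
  for (const line of readFileSync(sessionFile, "utf8").split("\n")) {
    if (!line.trim()) continue;
    try {
      if (JSON.parse(line)?.type === "compaction") count++;
    } catch {
      // Tolerate garbage lines; reconciliation must never throw.
    }
  }
  return count;
}
```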
Severity
Moderate.
It does not appear to corrupt transcript state, but it makes /status misleading and can cause operators to draw the wrong conclusion about whether compaction actually happened.
Notes
- This issue is compatible with keeping the current timeout policy.
- The bug is not "compaction timed out"; the bug is "late compaction success is not reconciled into session-store counters."
Latest Main Investigation
Investigated against fresh origin/main worktree:
- repo: /Users/jojo/XinWorld/projects/openclaw-main-investigation
- fetched commit: 0ece3834f
Relevant write paths
/status counter for command-agent sessions is updated from run metadata, not transcript truth:
- src/commands/agent/session-store.ts, updateSessionStoreAfterAgentRun(...)
- uses: const compactionsThisRun = Math.max(0, result.meta.agentMeta?.compactionCount ?? 0)
- only increments next.compactionCount when compactionsThisRun > 0
Embedded runner produces that meta counter from attempt-local subscription state:
- src/agents/pi-embedded-runner/run/attempt.ts returns compactionCount: getCompactionCount()
- src/agents/pi-embedded-runner/run.ts accumulates attempt.compactionCount into autoCompactionCount and writes agentMeta.compactionCount: autoCompactionCount > 0 ? autoCompactionCount : undefined
Attempt-local counting depends on the compaction end event being seen before the attempt returns:
- src/agents/pi-embedded-subscribe.handlers.compaction.ts
- handleAutoCompactionEnd(...) calls ctx.incrementCompactionCount?.() only on successful compaction end
Timeout behavior
The runner still uses a hard aggregate wait timeout:
- src/agents/pi-embedded-runner/run/attempt.ts
- COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000
When that wait times out:
- timedOutDuringCompaction = true
- the attempt proceeds using a snapshot selection path
- later cleanup unsubscribes the subscription
This means the counter is only reliable if compaction completion is observed before the attempt returns and unsubscribes.
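The underlying pattern is an ordinary bounded wait racing against work that is not cancelled on timeout, roughly (a generic sketch, not OpenClaw's actual code):

```typescript
// A bounded wait gives up after `ms`, but the awaited work keeps running:
// it can still complete (and persist its transcript entry) afterwards.
async function waitWithTimeout<T>(
  work: Promise<T>,
  ms: number,
): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Anything observed only through a subscription that is torn down after this wait, like the compaction end event, is lost whenever the result is "timeout".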
Existing persistence helper
There is already a lightweight direct store-update helper suitable for reconciliation:
- src/auto-reply/reply/session-updates.ts, incrementCompactionCount(...)
That helper:
- updates only the target session entry
- can also refresh totalTokens using tokensAfter
- does not require changing timeout behavior
Auto-reply note
Auto-reply paths already do a post-run counter write when autoCompactionCompleted is true:
src/auto-reply/reply/agent-runner.ts
src/auto-reply/reply/followup-runner.ts
That means the same class of bug can happen there too if compaction completion lands after the run outcome is finalized.
Concrete Repair Plan
Preferred fix
Add a post-compaction success reconciliation hook on the event path that already knows compaction actually completed, instead of relying exclusively on the enclosing run's final metadata.
Minimal design
- Keep the existing 60s timeout unchanged.
- Keep the current "continue after timeout" behavior unchanged.
- On successful compaction end, perform a tiny best-effort session-store update immediately.
- Make that update monotonic so duplicate signals cannot overcount.
Proposed implementation shape
Add a helper, for example:
reconcileCompactionCountAfterSuccess({ sessionKey, agentId, config, observedCompactionCount, sessionId? })
Suggested behavior:
- resolve the correct sessions.json path from config via resolveStorePath(...)
- load the current session store entry
- set compactionCount = max(existing compactionCount, observedCompactionCount)
- optionally update updatedAt
- optionally update totalTokens when a trustworthy post-compaction token estimate is available
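Under those assumptions (helper name from this report; store shapes hypothetical, with persistence of sessions.json left to the existing session-store helpers), the core of the helper could look like:

```typescript
// Hypothetical store entry shape; field names follow this report.
type SessionEntry = {
  compactionCount: number;
  totalTokens?: number;
  updatedAt?: number;
};

function reconcileCompactionCountAfterSuccess(
  entry: SessionEntry,
  observedCompactionCount: number,
  tokensAfter?: number,
): SessionEntry {
  // Monotonic repair: never decrease, never double-count duplicate signals.
  const next = Math.max(entry.compactionCount, observedCompactionCount);
  if (next === entry.compactionCount && tokensAfter === undefined) {
    return entry; // nothing to repair; skip the store write entirely
  }
  return {
    ...entry,
    compactionCount: next,
    ...(tokensAfter !== undefined ? { totalTokens: tokensAfter } : {}),
    updatedAt: Date.now(),
  };
}
```

Returning the unchanged entry when there is nothing to repair lets the caller avoid a redundant sessions.json write on the common path.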
Best hook point
Primary hook point:
src/agents/pi-embedded-subscribe.handlers.compaction.ts
- inside handleAutoCompactionEnd(...)
Reason:
- this is the first place where success is actually known
- it already distinguishes hasResult and wasAborted
- it runs independently of whether the outer run later times out or finishes
Required plumbing
handleAutoCompactionEnd(...) currently has sessionKey, sessionId, agentId, and config via subscribe params, but not direct session-store access.
To support reconciliation cleanly:
- extend subscribe context/helper plumbing to resolve store path from config + agentId
- call into a small shared session-store helper
- keep failures best-effort and log-only
Why max(...) instead of +1
Using +1 in the late-success path risks double increments when:
- the normal run-finalization path already counted the compaction
- retries or duplicated end signals occur
Using:
compactionCount = max(existing, observed)
keeps the repair idempotent and safe.
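A two-line comparison makes the failure mode concrete:

```typescript
// Duplicate end signals or an already-counted compaction: max() is
// idempotent, while +1 drifts upward on every repeated signal.
const viaMax = (stored: number, observed: number) => Math.max(stored, observed);
const viaIncrement = (stored: number) => stored + 1;

let stored = 2; // run finalization already counted this compaction
stored = viaMax(stored, 2);      // late signal for the same compaction: still 2
stored = viaMax(stored, 2);      // repeated end signal: still 2

let drifted = 2;
drifted = viaIncrement(drifted); // 3: double-counted
drifted = viaIncrement(drifted); // 4: keeps drifting on retries
```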
Optional stronger variant
If attempt-local observed count is not trusted for all late-success cases, add a targeted transcript reconciliation helper only for the timeout-late-success branch:
- count transcript type:"compaction" entries for that one session
- write max(existing, transcriptCount)
This should remain fallback-only, not the default hot path.
Proposed Tests
- Add a unit/integration test where:
  - compaction wait hits the 60s aggregate timeout
  - compaction end event arrives before final teardown
  - transcript success is simulated
  - sessions.json.compactionCount still converges to the correct value
- Add an idempotency test showing:
  - normal run-finalization increments once
  - late reconciliation does not increment a second time
- Add an auto-reply regression test for:
  - autoCompactionCompleted false at run-finalization time
  - later compaction success still repairs store count