
Compaction late success leaves /status compactionCount stale after timeout #45492

@jackal092927

Description

Compaction Timeout With Late Success: Transcript Compacted, Counter Stale

Summary

A manual compaction on session agent:christina:feishu:group:oc_03f1133e89a8d5ee60128a2c3ebca80a reported:

Compaction failed: Compaction timed out

but the same session later showed a sharply reduced context footprint, and the session transcript contains a real persisted compaction entry for that attempt.

This means the compaction did eventually complete, but the session store counter used by /status was not updated to reflect it.

Symptom

  • /status before manual compaction showed Context: 245k/272k (90%) · Compactions: 1
  • OpenClaw later emitted Compaction failed: Compaction timed out
  • The session transcript later persisted a second compaction entry for that session
  • /status after that showed Context: 85k/272k (31%) · Compactions: 1

Observed result:

  • transcript state says compaction happened
  • context usage behavior says compaction happened
  • /status compaction counter still says it did not

Evidence

Session transcript

File:

  • /Users/jojo/.openclaw/agents/christina/sessions/22efdbf9-aac9-44ad-bb64-2d91b20305aa.jsonl

Relevant entries:

  • line 489: status before timeout
  • line 490: Compaction failed: Compaction timed out
  • line 491: actual persisted compaction entry from the same attempt
  • line 516: later status showing much smaller context but unchanged compaction count

Session store

File:

  • /Users/jojo/.openclaw/agents/christina/sessions/sessions.json

The persisted compactionCount for this session remained 1 even though the transcript contains two compaction entries total.

Root Cause Hypothesis

This looks like a race between:

  1. the runner's compaction wait timeout
  2. the actual asynchronous completion of compaction
  3. the end-of-run writeback into sessions.json

Current behavior appears to be:

  • the runner waits up to 60s for compaction retry bookkeeping
  • if that wait times out, the run proceeds
  • end-of-run session-store update uses result.meta.agentMeta.compactionCount
  • if that meta value is still 0 at writeback time, sessions.json does not increment compactionCount
  • if compaction finishes later and persists into transcript, the counter is not reconciled afterward

Code References

  • wait timeout path:
    • /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js
    • around waitForCompactionRetryWithAggregateTimeout(...)
  • end-of-run store update:
    • /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js
    • updateSessionStoreAfterAgentRun(...)
  • current counter write logic:
    • increments only from result.meta.agentMeta.compactionCount

Why This Is A Bug

Yes, this should be treated as a bug.

The problem is not the 60s timeout itself. The timeout behavior is reasonable.

The bug is that:

  • the user-visible failure message implies compaction did not complete
  • the transcript later proves it did complete
  • the /status counter remains stale

That is a state-consistency bug between transcript persistence and session-store reporting.

Non-Goals

  • Do not increase the 60s timeout
  • Do not block the active conversation longer
  • Do not make the runner poll aggressively or hold extra heavy state in memory

Recommended Fix Direction

Preferred fix: post-compaction reconciliation write

When a compaction eventually persists successfully, perform a lightweight session-store reconciliation step that updates compactionCount independently of the original run's already-finished writeback.

Concretely:

  • after a compaction entry is durably appended to transcript
  • issue a tiny follow-up store update for that sessionKey
  • set:
    • compactionCount = max(existing compactionCount, transcript compaction count)
  • optionally also refresh a lightweight updatedAt

Advantages:

  • no timeout increase
  • no need to keep waiting in the runner
  • no expensive transcript rescans on every /status
  • directly fixes the stale counter at the point where truth becomes known

Acceptable alternative: lazy reconcile on /status

When /status loads a session, if there is evidence of a recent compaction-timeout mismatch, reconcile compactionCount from transcript before rendering.

This is less attractive because:

  • it pushes repair into read path
  • it can add latency to /status
  • it leaves stale state around until someone explicitly checks status

Another acceptable alternative: append an explicit async completion event

If compaction completes after timeout, emit a small internal completion event and let a background handler update the store.

This is also viable, but more moving parts than the direct reconciliation write.
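For completeness, a rough sketch of what that event plumbing could look like, using Node's built-in EventEmitter. The event name, payload shape, and handler wiring are all assumptions for illustration, not OpenClaw internals:

```ts
import { EventEmitter } from "node:events";

// Hypothetical event name and payload; OpenClaw's internal plumbing may differ.
type LateCompaction = { sessionKey: string; observedCount: number };

const compactionEvents = new EventEmitter();

// The compaction success path emits even when the enclosing run has already
// timed out and returned.
export function notifyCompactionSuccess(payload: LateCompaction): void {
  compactionEvents.emit("compaction:success", payload);
}

// A background handler owns the store update, decoupled from the run lifecycle.
export function onCompactionSuccess(
  update: (payload: LateCompaction) => Promise<void>,
): void {
  compactionEvents.on("compaction:success", (payload: LateCompaction) => {
    // Best-effort: a failed reconcile must never take down the process.
    void update(payload).catch((err) =>
      console.warn("late-compaction reconcile failed", err),
    );
  });
}
```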

Recommended Minimal Implementation

  1. Keep the current 60s timeout unchanged.
  2. Keep the current "proceed after timeout" behavior unchanged.
  3. After transcript compaction persistence succeeds, call a dedicated helper like:
     reconcileSessionStoreAfterCompaction(sessionKey, sessionFile)
  4. That helper should:
    • read the current session-store entry
    • determine the authoritative compaction count cheaply
    • update only if the transcript-derived count is greater than the stored count
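A minimal sketch of that helper, assuming JSONL transcript entries with a type field and a flat sessions.json map keyed by sessionKey; the storePath parameter and the entry shape are hypothetical, not the actual OpenClaw API:

```ts
import { promises as fs } from "node:fs";

// Hypothetical shape: the real sessions.json entry carries more fields.
interface SessionEntry {
  compactionCount?: number;
  updatedAt?: string;
}

// Cheap authoritative count: one pass over a single session's JSONL transcript.
// Only invoked after a compaction entry was durably appended, never per-request.
async function countTranscriptCompactions(sessionFile: string): Promise<number> {
  const raw = await fs.readFile(sessionFile, "utf8");
  let count = 0;
  for (const line of raw.split("\n")) {
    if (!line.trim()) continue;
    try {
      if (JSON.parse(line).type === "compaction") count++;
    } catch {
      // tolerate a partial trailing write
    }
  }
  return count;
}

// storePath is assumed to be resolved by the caller (the agent's sessions.json).
export async function reconcileSessionStoreAfterCompaction(
  sessionKey: string,
  sessionFile: string,
  storePath: string,
): Promise<void> {
  const transcriptCount = await countTranscriptCompactions(sessionFile);
  const store: Record<string, SessionEntry> = JSON.parse(
    await fs.readFile(storePath, "utf8"),
  );
  const entry = store[sessionKey];
  if (!entry) return;
  // Monotonic: write only when the transcript proves the stored counter is stale.
  if (transcriptCount > (entry.compactionCount ?? 0)) {
    entry.compactionCount = transcriptCount;
    entry.updatedAt = new Date().toISOString();
    await fs.writeFile(storePath, JSON.stringify(store, null, 2));
  }
}
```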

Suggested data source for reconciliation

Best option:

  • increment store from the same success path that appends the compaction entry

Fallback option:

  • count transcript type:"compaction" entries only when a late-success path is detected

Avoid:

  • full transcript scans on every request

Severity

Moderate.

It does not appear to corrupt transcript state, but it makes /status misleading and can cause operators to draw the wrong conclusion about whether compaction actually happened.

Notes

  • This issue is compatible with keeping the current timeout policy.
  • The bug is not "compaction timed out"; the bug is "late compaction success is not reconciled into session-store counters."

Latest Main Investigation

Investigated against a fresh origin/main worktree:

  • repo: /Users/jojo/XinWorld/projects/openclaw-main-investigation
  • fetched commit: 0ece3834f

Relevant write paths

/status counter for command-agent sessions is updated from run metadata, not transcript truth:

  • src/commands/agent/session-store.ts
  • updateSessionStoreAfterAgentRun(...)
  • uses:
    • const compactionsThisRun = Math.max(0, result.meta.agentMeta?.compactionCount ?? 0)
    • only increments next.compactionCount when compactionsThisRun > 0
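Restated as a self-contained sketch (the types here are illustrative; only the quoted expression is verbatim), the writeback only fires when the attempt-local meta counter is non-zero:

```ts
// Illustrative types; the real result/session shapes carry more fields.
interface AgentMeta { compactionCount?: number }
interface RunResult { meta: { agentMeta?: AgentMeta } }
interface StoredSession { compactionCount?: number }

// Paraphrase of the increment in updateSessionStoreAfterAgentRun(...):
function applyRunWriteback(result: RunResult, next: StoredSession): void {
  const compactionsThisRun = Math.max(0, result.meta.agentMeta?.compactionCount ?? 0);
  // When compaction completes after the attempt has already returned,
  // agentMeta.compactionCount is still unset here, this branch is skipped,
  // and the stored counter goes stale.
  if (compactionsThisRun > 0) {
    next.compactionCount = (next.compactionCount ?? 0) + compactionsThisRun;
  }
}
```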

Embedded runner produces that meta counter from attempt-local subscription state:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • returns compactionCount: getCompactionCount()
  • src/agents/pi-embedded-runner/run.ts
    • accumulates attempt.compactionCount into autoCompactionCount
    • writes agentMeta.compactionCount: autoCompactionCount > 0 ? autoCompactionCount : undefined

Attempt-local counting depends on the compaction end event being seen before the attempt returns:

  • src/agents/pi-embedded-subscribe.handlers.compaction.ts
  • handleAutoCompactionEnd(...) calls ctx.incrementCompactionCount?.() only on successful compaction end

Timeout behavior

The runner still uses a hard aggregate wait timeout:

  • src/agents/pi-embedded-runner/run/attempt.ts
  • COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000

When that wait times out:

  • timedOutDuringCompaction = true
  • the attempt proceeds using a snapshot selection path
  • later cleanup unsubscribes the subscription

This means the counter is only reliable if compaction completion is observed before the attempt returns and unsubscribes.

Existing persistence helper

There is already a lightweight direct store-update helper suitable for reconciliation:

  • src/auto-reply/reply/session-updates.ts
  • incrementCompactionCount(...)

That helper:

  • updates only the target session entry
  • can also refresh totalTokens using tokensAfter
  • does not require changing timeout behavior

Auto-reply note

Auto-reply paths already do a post-run counter write when autoCompactionCompleted is true:

  • src/auto-reply/reply/agent-runner.ts
  • src/auto-reply/reply/followup-runner.ts

That means the same class of bug can happen there too if compaction completion lands after the run outcome is finalized.

Concrete Repair Plan

Preferred fix

Add a post-compaction success reconciliation hook on the event path that already knows compaction actually completed, instead of relying exclusively on the enclosing run's final metadata.

Minimal design

  1. Keep the existing 60s timeout unchanged.
  2. Keep the current "continue after timeout" behavior unchanged.
  3. On successful compaction end, perform a tiny best-effort session-store update immediately.
  4. Make that update monotonic so duplicate signals cannot overcount.

Proposed implementation shape

Add a helper, for example:

reconcileCompactionCountAfterSuccess({ sessionKey, agentId, config, observedCompactionCount, sessionId? })

Suggested behavior:

  • resolve the correct sessions.json path from config via resolveStorePath(...)
  • load the current session store entry
  • set compactionCount = max(existing compactionCount, observedCompactionCount)
  • optionally update updatedAt
  • optionally update totalTokens when a trustworthy post-compaction token estimate is available

Best hook point

Primary hook point:

  • src/agents/pi-embedded-subscribe.handlers.compaction.ts
  • inside handleAutoCompactionEnd(...)

Reason:

  • this is the first place where success is actually known
  • it already distinguishes hasResult and wasAborted
  • it runs independently of whether the outer run later times out or finishes

Required plumbing

handleAutoCompactionEnd(...) currently has sessionKey, sessionId, agentId, and config via subscribe params, but not direct session-store access.

To support reconciliation cleanly:

  • extend subscribe context/helper plumbing to resolve store path from config + agentId
  • call into a small shared session-store helper
  • keep failures best-effort and log-only
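Put together, the hook could look roughly like this. The context shape and the helper are injected to keep the sketch self-contained, and every name below is an assumption about the eventual implementation, not existing code:

```ts
// Hypothetical signature for the small shared session-store helper: loads the
// entry, applies compactionCount = max(existing, observed), persists.
type Reconcile = (args: {
  sessionKey: string;
  agentId: string;
  observedCompactionCount: number;
  sessionId?: string;
}) => Promise<void>;

// Sketch of the addition inside handleAutoCompactionEnd(...), after the
// existing hasResult / wasAborted success checks.
function handleAutoCompactionEndHook(
  ctx: {
    sessionKey: string;
    agentId: string;
    sessionId?: string;
    incrementCompactionCount?: () => void;
    getCompactionCount: () => number;
  },
  hasResult: boolean,
  wasAborted: boolean,
  reconcile: Reconcile,
): void {
  if (!hasResult || wasAborted) return; // only count real successes
  ctx.incrementCompactionCount?.();     // existing attempt-local counter
  // New: immediate best-effort store write, independent of the outer run's fate.
  void reconcile({
    sessionKey: ctx.sessionKey,
    agentId: ctx.agentId,
    observedCompactionCount: ctx.getCompactionCount(),
    sessionId: ctx.sessionId,
  }).catch((err) => console.warn("compaction reconcile failed (log-only)", err));
}
```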

Why max(...) instead of +1

Using +1 in the late-success path risks double increments when:

  • the normal run-finalization path already counted the compaction
  • retries or duplicated end signals occur

Using:

  • compactionCount = max(existing, observed)

keeps the repair idempotent and safe.
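A short illustration of the difference:

```ts
// Monotonic max(...) is idempotent under duplicate signals; +1 is not.
const withMax = (stored: number, observed: number) => Math.max(stored, observed);
const withPlusOne = (stored: number) => stored + 1;

let a = withMax(1, 2);  // 2 (late success applied)
a = withMax(a, 2);      // 2 (a duplicated end signal is harmless)
let b = withPlusOne(1); // 2
b = withPlusOne(b);     // 3 (the same signal double-counts)
```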

Optional stronger variant

If attempt-local observed count is not trusted for all late-success cases, add a targeted transcript reconciliation helper only for the timeout-late-success branch:

  • count transcript type:"compaction" entries for that one session
  • write max(existing, transcriptCount)

This should remain fallback-only, not the default hot path.

Proposed Tests

  1. Add a unit/integration test where:
    • compaction wait hits the 60s aggregate timeout
    • compaction end event arrives before final teardown
    • transcript success is simulated
    • sessions.json compactionCount still converges to the correct value
  2. Add an idempotency test showing:
    • normal run-finalization increments once
    • late reconciliation does not increment a second time
  3. Add an auto-reply regression test for:
    • autoCompactionCompleted false at run-finalization time
    • later compaction success still repairs the store count
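A sketch of the idempotency test (item 2), Vitest-style; applyReconcile stands in for whichever store-update function the fix lands on:

```ts
import { describe, expect, it } from "vitest";

// Hypothetical stand-in for the proposed monotonic store update.
const applyReconcile = (entry: { compactionCount: number }, observed: number) => {
  entry.compactionCount = Math.max(entry.compactionCount, observed);
};

describe("compactionCount reconciliation idempotency", () => {
  it("converges after late success and stays stable on duplicate signals", () => {
    const entry = { compactionCount: 1 }; // state after the stale run writeback
    applyReconcile(entry, 2);             // late compaction success observed
    expect(entry.compactionCount).toBe(2);
    applyReconcile(entry, 2);             // duplicated end signal
    expect(entry.compactionCount).toBe(2); // no double increment
  });
});
```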
