
Compaction late success leaves /status compactionCount stale after timeout #45492

@jackal092927

Description

Compaction Timeout With Late Success: Transcript Compacted, Counter Stale

Summary

A manual compaction on session agent:christina:feishu:group:oc_03f1133e89a8d5ee60128a2c3ebca80a reported:

Compaction failed: Compaction timed out

but the same session later showed a sharply reduced context footprint, and the session transcript contains a real persisted compaction entry for that attempt.

This means the compaction did eventually complete, but the session store counter used by /status was not updated to reflect it.

Symptom

  • /status before manual compaction showed Context: 245k/272k (90%) · Compactions: 1
  • OpenClaw later emitted Compaction failed: Compaction timed out
  • The session transcript later persisted a second compaction entry for that session
  • /status after that showed Context: 85k/272k (31%) · Compactions: 1

Observed result:

  • transcript state says compaction happened
  • context usage behavior says compaction happened
  • /status compaction counter still says it did not

Evidence

Session transcript

File:

  • /Users/jojo/.openclaw/agents/christina/sessions/22efdbf9-aac9-44ad-bb64-2d91b20305aa.jsonl

Relevant entries:

  • line 489: status before timeout
  • line 490: Compaction failed: Compaction timed out
  • line 491: actual persisted compaction entry from the same attempt
  • line 516: later status showing much smaller context but unchanged compaction count

Session store

File:

  • /Users/jojo/.openclaw/agents/christina/sessions/sessions.json

The persisted compactionCount for this session remained 1 even though the transcript contains two compaction entries total.

Root Cause Hypothesis

This looks like a race between:

  1. the runner's compaction wait timeout
  2. the actual asynchronous completion of compaction
  3. the end-of-run writeback into sessions.json

Current behavior appears to be:

  • the runner waits up to 60s for compaction retry bookkeeping
  • if that wait times out, the run proceeds
  • end-of-run session-store update uses result.meta.agentMeta.compactionCount
  • if that meta value is still 0 at writeback time, sessions.json does not increment compactionCount
  • if compaction finishes later and persists into transcript, the counter is not reconciled afterward

Code References

  • wait timeout path:
    • /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js
    • around waitForCompactionRetryWithAggregateTimeout(...)
  • end-of-run store update:
    • /usr/local/lib/node_modules/openclaw/dist/plugin-sdk/reply-C0BWJKME.js
    • updateSessionStoreAfterAgentRun(...)
  • current counter write logic:
    • increments only from result.meta.agentMeta.compactionCount

Why This Is A Bug

Yes, this should be treated as a bug.

The problem is not the 60s timeout itself. The timeout behavior is reasonable.

The bug is that:

  • the user-visible failure message implies compaction did not complete
  • the transcript later proves it did complete
  • the /status counter remains stale

That is a state-consistency bug between transcript persistence and session-store reporting.

Non-Goals

  • Do not increase the 60s timeout
  • Do not block the active conversation longer
  • Do not make the runner poll aggressively or hold extra heavy state in memory

Recommended Fix Direction

Preferred fix: post-compaction reconciliation write

When a compaction eventually persists successfully, perform a lightweight session-store reconciliation step that updates compactionCount independently of the original run's already-finished writeback.

Concretely:

  • after a compaction entry is durably appended to transcript
  • issue a tiny follow-up store update for that sessionKey
  • set:
    • compactionCount = max(existing compactionCount, transcript compaction count)
  • optionally also refresh a lightweight updatedAt

Advantages:

  • no timeout increase
  • no need to keep waiting in the runner
  • no expensive transcript rescans on every /status
  • directly fixes the stale counter at the point where truth becomes known

Acceptable alternative: lazy reconcile on /status

When /status loads a session, if there is evidence of a recent compaction-timeout mismatch, reconcile compactionCount from transcript before rendering.

This is less attractive because:

  • it pushes repair into read path
  • it can add latency to /status
  • it leaves stale state around until someone explicitly checks status

Another acceptable alternative: append an explicit async completion event

If compaction completes after timeout, emit a small internal completion event and let a background handler update the store.

This is also viable, but more moving parts than the direct reconciliation write.
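For completeness, a rough sketch of what that event plumbing could look like, using Node's built-in EventEmitter. The event name, payload shape, and handler wiring are all assumptions for illustration, not OpenClaw internals:

```ts
import { EventEmitter } from "node:events";

// Hypothetical event name and payload; OpenClaw's internal plumbing may differ.
type LateCompaction = { sessionKey: string; observedCount: number };

const compactionEvents = new EventEmitter();

// The compaction success path emits even when the enclosing run has already
// timed out and returned.
export function notifyCompactionSuccess(payload: LateCompaction): void {
  compactionEvents.emit("compaction:success", payload);
}

// A background handler owns the store update, decoupled from the run lifecycle.
export function onCompactionSuccess(
  update: (payload: LateCompaction) => Promise<void>,
): void {
  compactionEvents.on("compaction:success", (payload: LateCompaction) => {
    // Best-effort: a failed reconcile must never take down the process.
    void update(payload).catch((err) =>
      console.warn("late-compaction reconcile failed", err),
    );
  });
}
```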

Recommended Minimal Implementation

  1. Keep the current 60s timeout unchanged.
  2. Keep the current "proceed after timeout" behavior unchanged.
  3. After transcript compaction persistence succeeds, call a dedicated helper like:
     reconcileSessionStoreAfterCompaction(sessionKey, sessionFile)
  4. That helper should:
    • read the current session-store entry
    • determine the authoritative compaction count cheaply
    • update only if the transcript-derived count is greater than the stored count
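A minimal sketch of that helper, assuming JSONL transcript entries with a type field and a flat sessions.json map keyed by sessionKey; the storePath parameter and the entry shape are hypothetical, not the actual OpenClaw API:

```ts
import { promises as fs } from "node:fs";

// Hypothetical shape: the real sessions.json entry carries more fields.
interface SessionEntry {
  compactionCount?: number;
  updatedAt?: string;
}

// Cheap authoritative count: one pass over a single session's JSONL transcript.
// Only invoked after a compaction entry was durably appended, never per-request.
async function countTranscriptCompactions(sessionFile: string): Promise<number> {
  const raw = await fs.readFile(sessionFile, "utf8");
  let count = 0;
  for (const line of raw.split("\n")) {
    if (!line.trim()) continue;
    try {
      if (JSON.parse(line).type === "compaction") count++;
    } catch {
      // tolerate a partial trailing write
    }
  }
  return count;
}

// storePath is assumed to be resolved by the caller (the agent's sessions.json).
export async function reconcileSessionStoreAfterCompaction(
  sessionKey: string,
  sessionFile: string,
  storePath: string,
): Promise<void> {
  const transcriptCount = await countTranscriptCompactions(sessionFile);
  const store: Record<string, SessionEntry> = JSON.parse(
    await fs.readFile(storePath, "utf8"),
  );
  const entry = store[sessionKey];
  if (!entry) return;
  // Monotonic: write only when the transcript proves the stored counter is stale.
  if (transcriptCount > (entry.compactionCount ?? 0)) {
    entry.compactionCount = transcriptCount;
    entry.updatedAt = new Date().toISOString();
    await fs.writeFile(storePath, JSON.stringify(store, null, 2));
  }
}
```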

Suggested data source for reconciliation

Best option:

  • increment store from the same success path that appends the compaction entry

Fallback option:

  • count transcript type:"compaction" entries only when a late-success path is detected

Avoid:

  • full transcript scans on every request

Severity

Moderate.

It does not appear to corrupt transcript state, but it makes /status misleading and can cause operators to draw the wrong conclusion about whether compaction actually happened.

Notes

  • This issue is compatible with keeping the current timeout policy.
  • The bug is not "compaction timed out"; the bug is "late compaction success is not reconciled into session-store counters."

Latest Main Investigation

Investigated against a fresh origin/main worktree:

  • repo: /Users/jojo/XinWorld/projects/openclaw-main-investigation
  • fetched commit: 0ece3834f

Relevant write paths

/status counter for command-agent sessions is updated from run metadata, not transcript truth:

  • src/commands/agent/session-store.ts
  • updateSessionStoreAfterAgentRun(...)
  • uses:
    • const compactionsThisRun = Math.max(0, result.meta.agentMeta?.compactionCount ?? 0)
    • only increments next.compactionCount when compactionsThisRun > 0
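Restated as a self-contained sketch (the types here are illustrative; only the quoted expression is verbatim), the writeback only fires when the attempt-local meta counter is non-zero:

```ts
// Illustrative types; the real result/session shapes carry more fields.
interface AgentMeta { compactionCount?: number }
interface RunResult { meta: { agentMeta?: AgentMeta } }
interface StoredSession { compactionCount?: number }

// Paraphrase of the increment in updateSessionStoreAfterAgentRun(...):
function applyRunWriteback(result: RunResult, next: StoredSession): void {
  const compactionsThisRun = Math.max(0, result.meta.agentMeta?.compactionCount ?? 0);
  // When compaction completes after the attempt has already returned,
  // agentMeta.compactionCount is still unset here, this branch is skipped,
  // and the stored counter goes stale.
  if (compactionsThisRun > 0) {
    next.compactionCount = (next.compactionCount ?? 0) + compactionsThisRun;
  }
}
```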

Embedded runner produces that meta counter from attempt-local subscription state:

  • src/agents/pi-embedded-runner/run/attempt.ts
    • returns compactionCount: getCompactionCount()
  • src/agents/pi-embedded-runner/run.ts
    • accumulates attempt.compactionCount into autoCompactionCount
    • writes agentMeta.compactionCount: autoCompactionCount > 0 ? autoCompactionCount : undefined

Attempt-local counting depends on the compaction end event being seen before the attempt returns:

  • src/agents/pi-embedded-subscribe.handlers.compaction.ts
  • handleAutoCompactionEnd(...) calls ctx.incrementCompactionCount?.() only on successful compaction end

Timeout behavior

The runner still uses a hard aggregate wait timeout:

  • src/agents/pi-embedded-runner/run/attempt.ts
  • COMPACTION_RETRY_AGGREGATE_TIMEOUT_MS = 60_000

When that wait times out:

  • timedOutDuringCompaction = true
  • the attempt proceeds using a snapshot selection path
  • later cleanup unsubscribes the subscription

This means the counter is only reliable if compaction completion is observed before the attempt returns and unsubscribes.

Existing persistence helper

There is already a lightweight direct store-update helper suitable for reconciliation:

  • src/auto-reply/reply/session-updates.ts
  • incrementCompactionCount(...)

That helper:

  • updates only the target session entry
  • can also refresh totalTokens using tokensAfter
  • does not require changing timeout behavior

Auto-reply note

Auto-reply paths already do a post-run counter write when autoCompactionCompleted is true:

  • src/auto-reply/reply/agent-runner.ts
  • src/auto-reply/reply/followup-runner.ts

That means the same class of bug can happen there too if compaction completion lands after the run outcome is finalized.

Concrete Repair Plan

Preferred fix

Add a post-compaction success reconciliation hook on the event path that already knows compaction actually completed, instead of relying exclusively on the enclosing run's final metadata.

Minimal design

  1. Keep the existing 60s timeout unchanged.
  2. Keep the current "continue after timeout" behavior unchanged.
  3. On successful compaction end, perform a tiny best-effort session-store update immediately.
  4. Make that update monotonic so duplicate signals cannot overcount.

Proposed implementation shape

Add a helper, for example:

reconcileCompactionCountAfterSuccess({ sessionKey, agentId, config, observedCompactionCount, sessionId? })

Suggested behavior:

  • resolve the correct sessions.json path from config via resolveStorePath(...)
  • load the current session store entry
  • set compactionCount = max(existing compactionCount, observedCompactionCount)
  • optionally update updatedAt
  • optionally update totalTokens when a trustworthy post-compaction token estimate is available

Best hook point

Primary hook point:

  • src/agents/pi-embedded-subscribe.handlers.compaction.ts
  • inside handleAutoCompactionEnd(...)

Reason:

  • this is the first place where success is actually known
  • it already distinguishes hasResult and wasAborted
  • it runs independently of whether the outer run later times out or finishes

Required plumbing

handleAutoCompactionEnd(...) currently has sessionKey, sessionId, agentId, and config via subscribe params, but not direct session-store access.

To support reconciliation cleanly:

  • extend subscribe context/helper plumbing to resolve store path from config + agentId
  • call into a small shared session-store helper
  • keep failures best-effort and log-only
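Put together, the hook could look roughly like this. The context shape and the helper are injected to keep the sketch self-contained, and every name below is an assumption about the eventual implementation, not existing code:

```ts
// Hypothetical signature for the small shared session-store helper: loads the
// entry, applies compactionCount = max(existing, observed), persists.
type Reconcile = (args: {
  sessionKey: string;
  agentId: string;
  observedCompactionCount: number;
  sessionId?: string;
}) => Promise<void>;

// Sketch of the addition inside handleAutoCompactionEnd(...), after the
// existing hasResult / wasAborted success checks.
function handleAutoCompactionEndHook(
  ctx: {
    sessionKey: string;
    agentId: string;
    sessionId?: string;
    incrementCompactionCount?: () => void;
    getCompactionCount: () => number;
  },
  hasResult: boolean,
  wasAborted: boolean,
  reconcile: Reconcile,
): void {
  if (!hasResult || wasAborted) return; // only count real successes
  ctx.incrementCompactionCount?.();     // existing attempt-local counter
  // New: immediate best-effort store write, independent of the outer run's fate.
  void reconcile({
    sessionKey: ctx.sessionKey,
    agentId: ctx.agentId,
    observedCompactionCount: ctx.getCompactionCount(),
    sessionId: ctx.sessionId,
  }).catch((err) => console.warn("compaction reconcile failed (log-only)", err));
}
```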

Why max(...) instead of +1

Using +1 in the late-success path risks double increments when:

  • the normal run-finalization path already counted the compaction
  • retries or duplicated end signals occur

Using:

  • compactionCount = max(existing, observed)

keeps the repair idempotent and safe.
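A short illustration of the difference:

```ts
// Monotonic max(...) is idempotent under duplicate signals; +1 is not.
const withMax = (stored: number, observed: number) => Math.max(stored, observed);
const withPlusOne = (stored: number) => stored + 1;

let a = withMax(1, 2);  // 2 (late success applied)
a = withMax(a, 2);      // 2 (a duplicated end signal is harmless)
let b = withPlusOne(1); // 2
b = withPlusOne(b);     // 3 (the same signal double-counts)
```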

Optional stronger variant

If attempt-local observed count is not trusted for all late-success cases, add a targeted transcript reconciliation helper only for the timeout-late-success branch:

  • count transcript type:"compaction" entries for that one session
  • write max(existing, transcriptCount)

This should remain fallback-only, not the default hot path.

Proposed Tests

  1. Add a unit/integration test where:
    • compaction wait hits the 60s aggregate timeout
    • compaction end event arrives before final teardown
    • transcript success is simulated
    • sessions.json compactionCount still converges to the correct value
  2. Add an idempotency test showing:
    • normal run-finalization increments once
    • late reconciliation does not increment a second time
  3. Add an auto-reply regression test for:
    • autoCompactionCompleted false at run-finalization time
    • later compaction success still repairs the store count
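A sketch of the idempotency test (item 2), Vitest-style; applyReconcile stands in for whichever store-update function the fix lands on:

```ts
import { describe, expect, it } from "vitest";

// Hypothetical stand-in for the proposed monotonic store update.
const applyReconcile = (entry: { compactionCount: number }, observed: number) => {
  entry.compactionCount = Math.max(entry.compactionCount, observed);
};

describe("compactionCount reconciliation idempotency", () => {
  it("converges after late success and stays stable on duplicate signals", () => {
    const entry = { compactionCount: 1 }; // state after the stale run writeback
    applyReconcile(entry, 2);             // late compaction success observed
    expect(entry.compactionCount).toBe(2);
    applyReconcile(entry, 2);             // duplicated end signal
    expect(entry.compactionCount).toBe(2); // no double increment
  });
});
```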
