
fix(compaction): bound stale transcript usage#81916

Open
jbetala7 wants to merge 3 commits into openclaw:main from jbetala7:fix/81178-bound-stale-transcript-usage

Conversation

@jbetala7

@jbetala7 jbetala7 commented May 14, 2026

Contributor

Fixes #81178

Summary

  • bound stale raw transcript prompt-usage snapshots against the recent active replay estimate
  • keep post-usage tail pressure additive for fresh usage records so interrupted tool output after the latest usage record remains conservative
  • when a post-usage compaction marker proves the latest usage snapshot is stale, count only the post-marker tail bytes instead of stale pre-compaction tail bytes
  • drop marker-proven stale output pressure even when the stale prompt usage is moderate and does not trip the prompt-disparity heuristic
  • add preflight compaction regressions for both giant stale usage and moderate stale prompt usage with large stale output/pre-marker tail
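
The bullets above can be sketched as a small estimator. This is only one possible reading of the summary, not the code in agent-runner-memory.ts: the UsageSnapshot shape, its field names, and estimatePreflightPressure are hypothetical, and the prompt-disparity heuristic is left out for brevity.

```typescript
// Hypothetical snapshot shape; none of these names come from the OpenClaw source.
interface UsageSnapshot {
  promptTokens: number; // prompt tokens reported by the latest usage record
  outputTokens: number; // output tokens reported by the latest usage record
  tailTokens: number; // estimated tokens written after the latest usage record
  postMarkerTailTokens?: number; // set only when a compaction marker follows the usage record
  recentReplayTokens: number; // estimate of the recent active replay
}

function estimatePreflightPressure(s: UsageSnapshot): number {
  if (s.postMarkerTailTokens !== undefined) {
    // A post-usage marker proves the snapshot is stale: bound the prompt by the
    // recent replay estimate, drop the stale output, and count only the tail
    // that follows the marker.
    const boundedPrompt = Math.min(s.promptTokens, s.recentReplayTokens);
    return boundedPrompt + s.postMarkerTailTokens;
  }
  // Fresh usage record: tail pressure stays additive so interrupted tool output
  // after the latest usage record is still counted conservatively.
  return s.promptTokens + s.outputTokens + s.tailTokens;
}
```

Keeping the fresh path additive is the conservative choice: the estimate only shrinks when the transcript itself carries evidence that the usage record predates a compaction.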

Real behavior proof

Behavior addressed: stale pre-compaction transcript usage, stale pre-marker tail bytes, and stale output tokens should not force another preflight compaction when a post-usage compaction marker proves that the active post-compaction replay is small and no explicit transcript-byte policy is exceeded.

Real environment tested: local OpenClaw source runtime from PR head 74ba2e34ec on macOS, using a real temporary JSONL session transcript and the actual runPreflightCompactionIfNeeded path. The compact dependency was instrumented only to count whether the runtime attempted compaction.

Exact steps or command run after this patch: from the PR checkout at 74ba2e34ec, ran a node --import tsx source-runtime proof that wrote a transcript with stale assistant usage input=9000, stale assistant usage output=80000, a 450000-byte stale pre-marker tool result, a compaction marker, and a small post-compaction replay; then it invoked runPreflightCompactionIfNeeded with a 100000-token context window.
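
For readers who want to build a similar transcript, here is a minimal sketch of the four records described above. The record shapes (role, usage, content field names) are assumptions for illustration and may not match the persisted JSONL schema; only the type "compaction" marker and the numbers come from this PR.

```typescript
// Illustrative record shapes only; the real transcript schema may differ.
type TranscriptRecord = Record<string, unknown>;

function buildStaleTranscriptLines(): string[] {
  const records: TranscriptRecord[] = [
    // Stale assistant usage recorded before compaction.
    { type: "message", role: "assistant", usage: { input: 9000, output: 80000 } },
    // 450000 ASCII characters, so 450000 bytes of stale pre-marker tool output.
    { type: "message", role: "toolResult", content: "x".repeat(450000) },
    // Structured compaction marker written after the usage record.
    { type: "compaction" },
    // Small post-compaction replay.
    { type: "message", role: "user", content: "short follow-up" },
  ];
  // JSONL: one JSON document per line.
  return records.map((record) => JSON.stringify(record));
}
```

Joining the returned lines with "\n" and writing them to a temporary file gives a transcript with the same ordering as the proof: usage, large tail, marker, small replay.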

Evidence after fix: copied terminal output from the direct source-runtime proof:

{
  "head": "74ba2e34ec",
  "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure",
  "staleUsageInputTokens": 9000,
  "staleOutputTokens": 80000,
  "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000,
  "compactCalls": 0,
  "phaseCalls": [],
  "returnedOriginalEntry": true
}

Observed result after fix: the source-runtime preflight call returned the original session entry, never entered the preflight_compacting phase, and made zero compaction calls despite stale output plus stale pre-marker tail pressure that would otherwise exceed the configured 100k context window.
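
A rough worked calculation of why the stale figures would trip a 100k window. The 4-bytes-per-token ratio is an assumption chosen for illustration; the estimator's actual conversion is not stated in this PR.

```typescript
// Assumed conversion ratio; not taken from the OpenClaw estimator.
const ASSUMED_BYTES_PER_TOKEN = 4;

function bytesToApproxTokens(bytes: number): number {
  return Math.ceil(bytes / ASSUMED_BYTES_PER_TOKEN);
}

const staleInputTokens = 9000;
const staleOutputTokens = 80000;
const preMarkerTailBytes = 450000;
const contextWindowTokens = 100000;

// 450000 bytes / 4 = 112500 approximate tokens of stale tail.
const staleTailTokens = bytesToApproxTokens(preMarkerTailBytes);

// 9000 + 80000 + 112500 = 201500, roughly twice the window.
const stalePressureTokens = staleInputTokens + staleOutputTokens + staleTailTokens;
const staleWouldExceedWindow = stalePressureTokens > contextWindowTokens;
```

Under this assumption the stale usage alone (89000 tokens) stays below the window, so the stale pre-marker tail is what pushes the estimate over.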

What was not tested: I did not run a live Discord/Pi provider session because the failure is in the local preflight estimator and the source-runtime proof exercises that estimator path directly.

Verification

  • node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts (25 passed)
  • node_modules/.bin/oxfmt --check --threads=1 src/auto-reply/reply/agent-runner-memory.ts src/auto-reply/reply/agent-runner-memory.test.ts
  • git diff --check
  • git diff --cached --check
  • git merge-tree --write-tree HEAD origin/main
  • node --import tsx -e ... source-runtime proof copied above

Re-verified at exact branch head 74ba2e34ec on 2026-06-03 (the earlier proof referenced b90c31584d, the head before the Dependency Guard rebase; the patch is unchanged — two commits on top of current main ed4c4afc0f).

Before fix (origin/main estimator, same standalone source-runtime harness and transcript): runPreflightCompactionIfNeeded throws Preflight compaction required but failed: not_compacted — i.e. main forces preflight compaction for the stale-after-compaction transcript (the bug).

After fix (PR head 74ba2e34ec), both regression scenarios via the real source-runtime path (node --import tsx, compaction action instrumented only to count attempts):

{ "head": "74ba2e34ec", "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure",
  "staleUsageInputTokens": 9000, "staleOutputTokens": 80000, "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true }
{ "head": "74ba2e34ec", "behavior": "giant stale pre-compaction usage dropped after compaction marker",
  "staleUsageInputTokens": 240000, "staleOutputTokens": 120000, "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true }
PROOF_RESULT: PASS

Focused suite at this head: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts → 42 passed.


Update 2026-06-03 — addressed ClawSweeper P1 (head 26d1293708)

ClawSweeper re-review (proof judged diamond lobster) raised a P1 patch-quality
risk: transcriptLineHasPostUsageCompactionMarker also matched the post-compaction
refresh phrases ([Post-compaction context refresh], Session was just compacted.)
in free message text, so ordinary user/tool content echoing them could masquerade
as a marker and wrongly drop stale-usage pressure.

Fix: detect only structured compaction records (type/payload.type ===
"compaction" | "session.compacted" — the records the runtime actually persists
via transcript-file-state.ts / session-manager.ts). Dropped the free-text
fallback and the now-unused collectTranscriptText helper (net -? prod LOC).
The refresh phrases come from post-compaction-context.ts and are prompt-injected
context, not persisted transcript markers, so structured detection fully covers the
real signal.
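
A minimal sketch of structured-only detection, assuming each transcript line is one JSON document. The helper name isStructuredCompactionRecord is hypothetical and this is not the body of transcriptLineHasPostUsageCompactionMarker; only the two type values and the type/payload.type locations come from the description above.

```typescript
const COMPACTION_RECORD_TYPES = new Set(["compaction", "session.compacted"]);

// Hypothetical helper: true only for records whose type or payload.type is a
// compaction type. Free message text is never inspected.
function isStructuredCompactionRecord(line: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // not JSON, so it cannot be a structured record
  }
  if (typeof parsed !== "object" || parsed === null) return false;

  const record = parsed as { type?: unknown; payload?: unknown };
  if (typeof record.type === "string" && COMPACTION_RECORD_TYPES.has(record.type)) {
    return true;
  }
  const payload = record.payload;
  if (typeof payload === "object" && payload !== null) {
    const payloadType = (payload as { type?: unknown }).type;
    return typeof payloadType === "string" && COMPACTION_RECORD_TYPES.has(payloadType);
  }
  return false;
}
```

Because the check never reads message content, a user or tool message that quotes the refresh phrases is treated like any other message.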

Real behavior proof at head 26d1293708 (node --import tsx, real
runPreflightCompactionIfNeeded, compaction instrumented to count attempts):

# structured {type:"compaction"} markers still suppress compaction (feature preserved)
moderate stale (in=9000 out=80000, 450k tail): compactCalls=0, returnedOriginalEntry=true
giant   stale (in=240000 out=120000, 450k tail): compactCalls=0, returnedOriginalEntry=true
PROOF_RESULT: PASS

# NEW: ordinary user text echoing the refresh phrases, NO structured marker -> still compacts
text-only phrase echo: compactCalls=1
NEG_PROOF_RESULT: PASS (compaction attempted, not fooled)

Focused suite: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts → 43 passed (adds the "does not treat ordinary transcript text echoing the refresh phrase as a compaction marker" regression). oxfmt --check clean.

@clawsweeper

clawsweeper Bot commented May 14, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed June 3, 2026, 3:28 AM ET / 07:28 UTC.

Summary
The PR changes preflight compaction transcript estimation so stale pre-compaction usage, output, and pre-marker tail pressure are discounted only when a post-usage structured compaction marker proves the snapshot is stale.

PR surface: Source +103, Tests +210. Total +313 across 2 files.

Reproducibility: yes. Source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
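
The "weaker of" rule can be illustrated with a short sketch. It assumes the ranks are ordered as in the legend further down (challenger crab highest, unranked krab lowest); the RANK_ORDER array and overallRank function are hypothetical and not ClawSweeper's implementation.

```typescript
// Assumed ordering, lowest to highest, taken from the rank legend in this review.
const RANK_ORDER = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANK_ORDER)[number];

// Overall readiness is whichever of the two ranks sits lower in the ordering.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  return RANK_ORDER.indexOf(proof) <= RANK_ORDER.indexOf(patchQuality) ? proof : patchQuality;
}
```

With proof at diamond lobster and patch quality at platinum hermit, this yields platinum hermit, which matches the Overall line above.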

Rank-up moves:

  • none.

Risk before merge

  • [P1] This changes whether preflight compaction runs for sessions with stale transcript usage after compaction; a bad marker or tail estimate could under-compact or over-compact session state even if unit tests pass.

Maintainer options:

  1. Accept With Focused Compaction Proof (recommended)
    Maintainers can accept the bounded session-state risk if the focused source-runtime proof and required checks remain green at the reviewed head.
  2. Ask For Broader Runtime Replay
    If maintainers want stronger confidence, request one packaged or live long-session replay showing a post-compaction turn does not immediately compact again.

Next step before merge

  • No automated repair is needed because the latest head addresses the prior finding and I found no new blocking patch defect.

Security
Cleared: The diff only changes local transcript estimation logic and focused tests; I found no new dependency, secret, workflow, package, or supply-chain surface.

Review details

Best possible solution:

Keep the structured-marker estimator shape and land it only after normal required checks confirm the focused compaction regression suite stays green.

Do we have a high-confidence way to reproduce the issue?

Yes, source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction.

Is this the best way to solve the issue?

Yes, this is the best narrow fix for the reported path: it adjusts the estimator at the transcript snapshot boundary and avoids the unsafe global usage cap by requiring a structured post-usage compaction marker.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 85e5d486df11.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a normal-priority regression fix for repeated preflight compactions with limited blast radius around session compaction accounting.
  • merge-risk: 🚨 session-state: The diff changes token-pressure accounting that decides whether an existing session compacts before the next run.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
Evidence reviewed

PR surface:

Source +103, Tests +210. Total +313 across 2 files.

View PR surface stats
Area       Files  Added  Removed  Net
Source     1      105    2        +103
Tests      1      210    0        +210
Docs       0      0      0        0
Config     0      0      0        0
Generated  0      0      0        0
Other      0      0      0        0
Total      2      315    2        +313

What I checked:

Likely related people:

  • steipete: Recent history shows repeated maintenance of preflight compaction, transcript tail scan performance, and oversized transcript guard behavior in agent-runner-memory.ts. (role: recent area contributor; confidence: high; commits: 29af4add2a8e, 39bc43cb6068, b005f01c1304; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/memory-flush.ts)
  • ArthurNie: Authored the recent current-main change that made preflight compaction a hard gate before oversized agent turns, which is directly adjacent to this PR's estimator path. (role: recent adjacent owner; confidence: medium; commits: 9d54285b0d4a; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/agent-runner-memory.test.ts)
  • jared596: File history identifies the earlier change that triggered preflight compaction from transcript estimates when usage is stale, the behavior this PR refines. (role: introduced related behavior; confidence: medium; commits: c6d8318d07f5; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/memory-flush.ts)
  • giodl73-repo: Authored local transcript usage estimation in the gateway transcript reader, which is adjacent to the bounded recent replay estimate used by this PR. (role: adjacent transcript-estimation contributor; confidence: medium; commits: 2c59ea8a2e76; files: src/gateway/session-utils.fs.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 14, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 14, 2026
@stielemans

This comment was marked as low quality.

@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 16, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 16, 2026
@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. labels May 17, 2026
@jbetala7 jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from e7c39ee to 3fa8b5d Compare May 17, 2026 16:12
@openclaw-barnacle openclaw-barnacle Bot removed the cli CLI command changes label May 17, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Contributor

ClawSweeper PR egg: 🎁 locked until real behavior proof passes.

Details
  • No creature or rarity is rolled until proof passes.
  • Eggs are collectible flavor only; they do not affect labels, ratings, merge decisions, or automation.

@jbetala7 jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from 896b5c8 to 460f423 Compare May 31, 2026 07:28
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 31, 2026
@jbetala7 jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from 460f423 to 74ba2e3 Compare June 1, 2026 08:10
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels Jun 1, 2026
@jbetala7

jbetala7 commented Jun 3, 2026

Contributor Author

@clawsweeper re-review

The Real behavior proof and auto-response checks came from the 06-01 Dependency Guard rebase run (auto-response shows "A task was canceled"), and the prior proof referenced the pre-rebase head b90c31584d. I re-ran the real source-runtime proof at the exact current head 74ba2e34ec and updated the PR body.

Real behavior proof (real runPreflightCompactionIfNeeded source path via node --import tsx; only the terminal compaction action is instrumented to count attempts):

  • Before (origin/main estimator): same harness + transcript -> runPreflightCompactionIfNeeded throws Preflight compaction required but failed: not_compacted (main forces preflight compaction on the stale-after-compaction transcript — the bug).
  • After (head 74ba2e34ec): both scenarios return the original session entry, never enter preflight_compacting, compactCalls: 0:
    • moderate stale usage input=9000 output=80000, 450000-byte pre-marker tail, 100k window -> compactCalls: 0, returnedOriginalEntry: true
    • giant stale usage input=240000 output=120000, same tail/window -> compactCalls: 0, returnedOriginalEntry: true

Focused suite at this head: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts -> 42 passed.

@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 3, 2026
ClawSweeper re-review flagged a P1: transcriptLineHasPostUsageCompactionMarker
also matched the post-compaction refresh phrases ("[Post-compaction context
refresh]", "Session was just compacted.") in free message text. Those phrases
are prompt-injected context, not persisted markers, so ordinary user/tool
content echoing them could masquerade as a compaction marker and wrongly drop
stale-usage pressure.

Detect only structured compaction records (type "compaction"/"session.compacted",
the records the runtime actually writes via transcript-file-state / session-manager)
and drop the free-text fallback plus the now-unused collectTranscriptText helper.
Add a regression test proving ordinary transcript text echoing the refresh phrase
(with no structured marker) keeps preflight compaction conservative.
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 3, 2026
@jbetala7

jbetala7 commented Jun 3, 2026

Contributor Author

Addressed the P1 from the last re-review at head 26d1293708.

transcriptLineHasPostUsageCompactionMarker now keys only on structured compaction records (type/payload.type === compaction | session.compacted). I removed the free-text phrase fallback (and the now-unused collectTranscriptText helper) because [Post-compaction context refresh] / Session was just compacted. originate from post-compaction-context.ts as prompt-injected context, not persisted transcript markers — so ordinary user/tool content echoing them can no longer masquerade as a marker.

Proof (real runPreflightCompactionIfNeeded, compaction counted):

  • structured {type:"compaction"} markers still suppress compaction: moderate + giant stale cases both compactCalls=0, returns original entry (feature preserved).
  • new regression — ordinary user text echoing both phrases with no structured record: compactCalls=1 (still compacts, not fooled).

Focused suite now 43 passed (adds the phrase-echo regression ClawSweeper requested). @clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 3, 2026

Labels

merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: repeated early preflight compactions after compaction due to stale transcript usage

2 participants