fix(compaction): bound stale transcript usage#81916
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 3, 2026, 3:28 AM ET / 07:28 UTC. Summary PR surface: Source +103, Tests +210. Total +313 across 2 files. Reproducibility: yes. source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Keep the structured-marker estimator shape and land it only after normal required checks confirm the focused compaction regression suite stays green. Do we have a high-confidence way to reproduce the issue? Yes, source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction. Is this the best way to solve the issue? Yes, this is the best narrow fix for the reported path: it adjusts the estimator at the transcript snapshot boundary and avoids the unsafe global usage cap by requiring a structured post-usage compaction marker. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 85e5d486df11. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +103, Tests +210. Total +313 across 2 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
This comment was marked as low quality.
This comment was marked as low quality.
e7c39ee to
3fa8b5d
Compare
|
ClawSweeper PR egg: 🎁 locked until real behavior proof passes. Details
|
896b5c8 to
460f423
Compare
460f423 to
74ba2e3
Compare
|
@clawsweeper re-review The Real behavior proof (real
Focused suite at this head: |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
ClawSweeper re-review flagged a P1: transcriptLineHasPostUsageCompactionMarker
also matched the post-compaction refresh phrases ("[Post-compaction context
refresh]", "Session was just compacted.") in free message text. Those phrases
are prompt-injected context, not persisted markers, so ordinary user/tool
content echoing them could masquerade as a compaction marker and wrongly drop
stale-usage pressure.
Detect only structured compaction records (type "compaction"/"session.compacted",
the records the runtime actually writes via transcript-file-state / session-manager)
and drop the free-text fallback plus the now-unused collectTranscriptText helper.
Add a regression test proving ordinary transcript text echoing the refresh phrase
(with no structured marker) keeps preflight compaction conservative.
|
Addressed the P1 from the last re-review at head
Proof (real
Focused suite now 43 passed (adds the phrase-echo regression ClawSweeper requested). @clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. |
Fixes #81178
Summary
Real behavior proof
Behavior addressed: stale pre-compaction transcript usage, stale pre-marker tail bytes, and stale output tokens should not force another preflight compaction when a post-usage compaction marker proves that the active post-compaction replay is small and no explicit transcript-byte policy is exceeded.
Real environment tested: local OpenClaw source runtime from PR head
74ba2e34econ macOS, using a real temporary JSONL session transcript and the actualrunPreflightCompactionIfNeededpath. The compact dependency was instrumented only to count whether the runtime attempted compaction.Exact steps or command run after this patch: from the PR checkout at
74ba2e34ec, ran anode --import tsxsource-runtime proof that wrote a transcript with stale assistant usageinput=9000, stale assistant usageoutput=80000, a 450000-byte stale pre-marker tool result, a compaction marker, and a small post-compaction replay; then it invokedrunPreflightCompactionIfNeededwith a 100000-token context window.Evidence after fix: copied terminal output from the direct source-runtime proof:
{ "head": "74ba2e34ec", "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure", "staleUsageInputTokens": 9000, "staleOutputTokens": 80000, "preMarkerTailBytes": 450000, "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true }Observed result after fix: the source-runtime preflight call returned the original session entry, never entered the
preflight_compactingphase, and made zero compaction calls despite stale output plus stale pre-marker tail pressure that would otherwise exceed the configured 100k context window.What was not tested: I did not run a live Discord/Pi provider session because the failure is in the local preflight estimator and the source-runtime proof exercises that estimator path directly.
Verification
node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts(25 passed)node_modules/.bin/oxfmt --check --threads=1 src/auto-reply/reply/agent-runner-memory.ts src/auto-reply/reply/agent-runner-memory.test.tsgit diff --checkgit diff --cached --checkgit merge-tree --write-tree HEAD origin/mainnode --import tsx -e ...source-runtime proof copied aboveRe-verified at exact branch head
74ba2e34econ 2026-06-03 (the earlier proof referencedb90c31584d, the head before the Dependency Guard rebase; the patch is unchanged — two commits on top of currentmained4c4afc0f).Before fix (origin/main estimator, same standalone source-runtime harness and transcript):
runPreflightCompactionIfNeededthrowsPreflight compaction required but failed: not_compacted— i.e. main forces preflight compaction for the stale-after-compaction transcript (the bug).After fix (PR head
74ba2e34ec), both regression scenarios via the real source-runtime path (node --import tsx, compaction action instrumented only to count attempts):{ "head": "74ba2e34ec", "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure", "staleUsageInputTokens": 9000, "staleOutputTokens": 80000, "preMarkerTailBytes": 450000, "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true } { "head": "74ba2e34ec", "behavior": "giant stale pre-compaction usage dropped after compaction marker", "staleUsageInputTokens": 240000, "staleOutputTokens": 120000, "preMarkerTailBytes": 450000, "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true } PROOF_RESULT: PASSFocused suite at this head:
node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts→ 42 passed.Update 2026-06-03 — addressed ClawSweeper P1 (head
26d1293708)ClawSweeper re-review (proof judged
diamond lobster) raised a P1 patch-qualityrisk:
transcriptLineHasPostUsageCompactionMarkeralso matched the post-compactionrefresh phrases (
[Post-compaction context refresh],Session was just compacted.)in free message text, so ordinary user/tool content echoing them could masquerade
as a marker and wrongly drop stale-usage pressure.
Fix: detect only structured compaction records (
type/payload.type==="compaction"|"session.compacted"— the records the runtime actually persistsvia
transcript-file-state.ts/session-manager.ts). Dropped the free-textfallback and the now-unused
collectTranscriptTexthelper (net -? prod LOC).The refresh phrases come from
post-compaction-context.tsand are prompt-injectedcontext, not persisted transcript markers, so structured detection fully covers the
real signal.
Real behavior proof at head
26d1293708(node --import tsx, realrunPreflightCompactionIfNeeded, compaction instrumented to count attempts):Focused suite:
node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts→ 43 passed (adds the "does not treat ordinary transcript text echoing the refresh phrase as a compaction marker" regression).oxfmt --checkclean.