fix(compaction): bound stale transcript usage by jbetala7 · Pull Request #81916 · openclaw/openclaw

jbetala7 · 2026-05-14T20:39:55Z

Summary

- bound stale raw transcript prompt-usage snapshots against the recent active replay estimate
- keep post-usage tail pressure additive for fresh usage records so interrupted tool output after the latest usage record remains conservative
- when a post-usage compaction marker proves the latest usage snapshot is stale, count only the post-marker tail bytes instead of stale pre-compaction tail bytes
- drop marker-proven stale output pressure even when the stale prompt usage is moderate and does not trip the prompt-disparity heuristic
- add preflight compaction regressions for both giant stale usage and moderate stale prompt usage with large stale output/pre-marker tail
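
The accounting rules in this summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `TranscriptSnapshot` shape, the `estimatePreflightTokens` name, the use of `Math.min` as the bound, and the 4-bytes-per-token ratio are all assumptions made for the example, and the prompt-disparity heuristic is not modeled.

```typescript
// Hypothetical input shape; the real estimator's types are not shown in this thread.
interface TranscriptSnapshot {
  usagePromptTokens: number; // prompt tokens from the latest usage record
  usageOutputTokens: number; // output tokens from the latest usage record
  recentReplayEstimateTokens: number; // estimate of the active replay
  postUsageTailBytes: number; // all bytes after the latest usage record
  postMarkerTailBytes: number; // bytes after the compaction marker, if any
  hasPostUsageCompactionMarker: boolean; // structured marker seen after the usage record
}

// Assumed conversion ratio, for illustration only.
const ASSUMED_BYTES_PER_TOKEN = 4;

function bytesToTokens(bytes: number): number {
  return Math.ceil(bytes / ASSUMED_BYTES_PER_TOKEN);
}

function estimatePreflightTokens(s: TranscriptSnapshot): number {
  if (s.hasPostUsageCompactionMarker) {
    // Marker proves the usage snapshot is stale: bound the prompt usage by the
    // replay estimate, drop stale output, and count only post-marker tail bytes.
    const boundedPrompt = Math.min(s.usagePromptTokens, s.recentReplayEstimateTokens);
    return boundedPrompt + bytesToTokens(s.postMarkerTailBytes);
  }
  // Fresh usage record: stay conservative and keep tail pressure additive.
  return s.usagePromptTokens + s.usageOutputTokens + bytesToTokens(s.postUsageTailBytes);
}
```

With a marker present, a snapshot of 9000 prompt tokens, 80000 output tokens, a 2000-token replay estimate, and 1200 post-marker bytes yields 2300 in this sketch; the same snapshot without a marker and a 451200-byte tail yields 201800.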

Real behavior proof

Behavior addressed: stale pre-compaction transcript usage, stale pre-marker tail bytes, and stale output tokens should not force another preflight compaction when a post-usage compaction marker proves that the active post-compaction replay is small and no explicit transcript-byte policy is exceeded.

Real environment tested: local OpenClaw source runtime from PR head 74ba2e34ec on macOS, using a real temporary JSONL session transcript and the actual runPreflightCompactionIfNeeded path. The compact dependency was instrumented only to count whether the runtime attempted compaction.

Exact steps or command run after this patch: from the PR checkout at 74ba2e34ec, ran a node --import tsx source-runtime proof that wrote a transcript with stale assistant usage input=9000, stale assistant usage output=80000, a 450000-byte stale pre-marker tool result, a compaction marker, and a small post-compaction replay; then it invoked runPreflightCompactionIfNeeded with a 100000-token context window.

Evidence after fix: copied terminal output from the direct source-runtime proof:

{
  "head": "74ba2e34ec",
  "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure",
  "staleUsageInputTokens": 9000,
  "staleOutputTokens": 80000,
  "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000,
  "compactCalls": 0,
  "phaseCalls": [],
  "returnedOriginalEntry": true
}

Observed result after fix: the source-runtime preflight call returned the original session entry, never entered the preflight_compacting phase, and made zero compaction calls despite stale output plus stale pre-marker tail pressure that would otherwise exceed the configured 100k context window.
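
To make the "would otherwise exceed" claim concrete, here is a small worked calculation using the numbers from the proof output. The bytes-per-token ratio of 4 is an assumption for illustration, and `stalePressureTokens` is a hypothetical helper, not a function from the PR.

```typescript
// Hypothetical helper: sums stale prompt, stale output, and stale tail pressure.
function stalePressureTokens(
  stale: { inputTokens: number; outputTokens: number; tailBytes: number },
  assumedBytesPerToken = 4, // assumption; the runtime's real ratio is not stated here
): number {
  const tailTokens = Math.ceil(stale.tailBytes / assumedBytesPerToken);
  return stale.inputTokens + stale.outputTokens + tailTokens;
}

const contextWindowTokens = 100_000;
const moderateCase = stalePressureTokens({
  inputTokens: 9_000,
  outputTokens: 80_000,
  tailBytes: 450_000,
});

// 9000 + 80000 + 112500 = 201500, roughly twice the 100k window.
console.log(moderateCase, moderateCase > contextWindowTokens);
```

Under this assumed ratio the stale records alone account for about twice the context window, which is why counting them forces a preflight compaction even though the active post-compaction replay is small.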

What was not tested: I did not run a live Discord/Pi provider session because the failure is in the local preflight estimator and the source-runtime proof exercises that estimator path directly.

Verification

- node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts (25 passed)
- node_modules/.bin/oxfmt --check --threads=1 src/auto-reply/reply/agent-runner-memory.ts src/auto-reply/reply/agent-runner-memory.test.ts
- git diff --check
- git diff --cached --check
- git merge-tree --write-tree HEAD origin/main
- node --import tsx -e ... source-runtime proof copied above

Re-verified at exact branch head 74ba2e34ec on 2026-06-03 (the earlier proof referenced b90c31584d, the head before the Dependency Guard rebase; the patch is unchanged — two commits on top of current main ed4c4afc0f).

Before fix (origin/main estimator, same standalone source-runtime harness and transcript): runPreflightCompactionIfNeeded throws Preflight compaction required but failed: not_compacted — i.e. main forces preflight compaction for the stale-after-compaction transcript (the bug).

After fix (PR head 74ba2e34ec), both regression scenarios via the real source-runtime path (node --import tsx, compaction action instrumented only to count attempts):

{ "head": "74ba2e34ec", "behavior": "post-usage compaction marker drops stale output and pre-marker tail pressure",
  "staleUsageInputTokens": 9000, "staleOutputTokens": 80000, "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true }
{ "head": "74ba2e34ec", "behavior": "giant stale pre-compaction usage dropped after compaction marker",
  "staleUsageInputTokens": 240000, "staleOutputTokens": 120000, "preMarkerTailBytes": 450000,
  "contextWindowTokens": 100000, "compactCalls": 0, "phaseCalls": [], "returnedOriginalEntry": true }
PROOF_RESULT: PASS

Focused suite at this head: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts → 42 passed.

Update 2026-06-03 — addressed ClawSweeper P1 (head `26d1293708`)

ClawSweeper re-review (proof judged diamond lobster) raised a P1 patch-quality
risk: transcriptLineHasPostUsageCompactionMarker also matched the post-compaction
refresh phrases ([Post-compaction context refresh], Session was just compacted.)
in free message text, so ordinary user/tool content echoing them could masquerade
as a marker and wrongly drop stale-usage pressure.

Fix: detect only structured compaction records (type/payload.type ===
"compaction" | "session.compacted" — the records the runtime actually persists
via transcript-file-state.ts / session-manager.ts). Dropped the free-text
fallback and the now-unused collectTranscriptText helper (net -? prod LOC).
The refresh phrases come from post-compaction-context.ts and are prompt-injected
context, not persisted transcript markers, so structured detection fully covers the
real signal.
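
A structured-only check along these lines might look like the sketch below. It is not the PR's `transcriptLineHasPostUsageCompactionMarker`; the function name `isStructuredCompactionMarkerLine` and the assumption that each transcript line is a standalone JSON object are illustrative. Only the two record type strings come from the description above.

```typescript
// Record types named in the PR description as the persisted compaction records.
const COMPACTION_RECORD_TYPES = new Set(["compaction", "session.compacted"]);

// Hypothetical helper: true only for a JSONL line whose `type` or
// `payload.type` is a structured compaction record. Free text is never matched.
function isStructuredCompactionMarkerLine(line: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // not JSON, so it cannot be a structured record
  }
  if (typeof parsed !== "object" || parsed === null) return false;

  const record = parsed as { type?: unknown; payload?: unknown };
  if (typeof record.type === "string" && COMPACTION_RECORD_TYPES.has(record.type)) {
    return true;
  }

  const payload = record.payload;
  if (typeof payload === "object" && payload !== null) {
    const payloadType = (payload as { type?: unknown }).type;
    return typeof payloadType === "string" && COMPACTION_RECORD_TYPES.has(payloadType);
  }
  return false;
}
```

Because the check never inspects message content, a user or tool message that merely echoes "[Post-compaction context refresh]" or "Session was just compacted." is treated as ordinary transcript text.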

Real behavior proof at head 26d1293708 (node --import tsx, real
runPreflightCompactionIfNeeded, compaction instrumented to count attempts):

# structured {type:"compaction"} markers still suppress compaction (feature preserved)
moderate stale (in=9000 out=80000, 450k tail): compactCalls=0, returnedOriginalEntry=true
giant   stale (in=240000 out=120000, 450k tail): compactCalls=0, returnedOriginalEntry=true
PROOF_RESULT: PASS

# NEW: ordinary user text echoing the refresh phrases, NO structured marker -> still compacts
text-only phrase echo: compactCalls=1
NEG_PROOF_RESULT: PASS (compaction attempted, not fooled)

Focused suite: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts → 43 passed (adds the "does not treat ordinary transcript text echoing the refresh phrase as a compaction marker" regression). oxfmt --check clean.

clawsweeper · 2026-05-14T20:43:31Z

Codex review: needs maintainer review before merge. Reviewed June 3, 2026, 3:28 AM ET / 07:28 UTC.

Summary
The PR changes preflight compaction transcript estimation so stale pre-compaction usage, output, and pre-marker tail pressure are discounted only when a post-usage structured compaction marker proves the snapshot is stale.

PR surface: Source +103, Tests +210. Total +313 across 2 files.

Reproducibility: yes. Source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
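
The "weaker of the two" rule can be illustrated with a short sketch. The ordering below is inferred from the rank legend later in this comment and is an assumption, as is the `overallRank` name; the non-applicable "off-meta tidepool" rating is left out.

```typescript
// Assumed ordering, weakest first, inferred from the rank descriptions.
const RANKS_WEAKEST_FIRST = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS_WEAKEST_FIRST)[number];

// Hypothetical helper: overall readiness is capped by the weaker input.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  const proofIndex = RANKS_WEAKEST_FIRST.indexOf(proof);
  const patchIndex = RANKS_WEAKEST_FIRST.indexOf(patchQuality);
  return proofIndex <= patchIndex ? proof : patchQuality;
}

console.log(overallRank("diamond lobster", "platinum hermit"));
```

For this PR, proof at diamond lobster and patch quality at platinum hermit give an overall of platinum hermit, matching the verdict above.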

Rank-up moves:

none.

Risk before merge

[P1] This changes whether preflight compaction runs for sessions with stale transcript usage after compaction; a bad marker or tail estimate could under-compact or over-compact session state even if unit tests pass.

Maintainer options:

Accept With Focused Compaction Proof (recommended)
Maintainers can accept the bounded session-state risk if the focused source-runtime proof and required checks remain green at the reviewed head.
Ask For Broader Runtime Replay
If maintainers want stronger confidence, request one packaged or live long-session replay showing a post-compaction turn does not immediately compact again.

Next step before merge

No automated repair is needed because the latest head addresses the prior finding and I found no new blocking patch defect.

Security
Cleared: The diff only changes local transcript estimation logic and focused tests; I found no new dependency, secret, workflow, package, or supply-chain surface.

Review details

Best possible solution:

Keep the structured-marker estimator shape and land it only after normal required checks confirm the focused compaction regression suite stays green.

Do we have a high-confidence way to reproduce the issue?

Yes, source inspection plus the PR's before/after source-runtime proof give a high-confidence reproduction path: current main counts stale pre-compaction usage and tail pressure, while the PR's structured-marker scenario returns without compaction.

Is this the best way to solve the issue?

Yes, this is the best narrow fix for the reported path: it adjusts the estimator at the transcript snapshot boundary and avoids the unsafe global usage cap by requiring a structured post-usage compaction marker.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 85e5d486df11.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

P2: This is a normal-priority regression fix for repeated preflight compactions with limited blast radius around session compaction accounting.
merge-risk: 🚨 session-state: The diff changes token-pressure accounting that decides whether an existing session compacts before the next run.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied terminal output from a real source-runtime invocation of runPreflightCompactionIfNeeded at the current head, including the new negative phrase-echo proof.

Evidence reviewed

PR surface:

Source +103, Tests +210. Total +313 across 2 files.

Area	Files	Added	Removed	Net
Source	1	105	2	+103
Tests	1	210	0	+210
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	2	315	2	+313

What I checked:

Root policy read: Read the full root repository policy and applied the PR review guidance for session-state and compaction-sensitive changes. (AGENTS.md:24, 85e5d486df11)
Current main behavior: Current main combines transcript prompt usage with all post-usage tail bytes and output tokens, so stale pre-compaction transcript records can still drive preflight token pressure. (src/auto-reply/reply/agent-runner-memory.ts:650, 85e5d486df11)
PR implementation: The PR bounds prompt usage only when a structured post-usage compaction marker is present, counts only post-marker tail bytes, and drops stale output pressure in that marker-proven path. (src/auto-reply/reply/agent-runner-memory.ts:731, 26d129370858)
Trusted marker contract: The runtime persists compaction as structured type: "compaction" entries, while the refresh phrase comes from prompt-injected context rather than a persisted marker. (src/agents/embedded-agent-runner/transcript-file-state.ts:554, 85e5d486df11)
Regression coverage: The PR adds focused tests for giant stale usage, moderate stale prompt usage with stale output/tail pressure, and phrase-echo text without a structured marker still triggering compaction. (src/auto-reply/reply/agent-runner-memory.test.ts:1527, 26d129370858)
Runtime caller path: The preflight compaction path is invoked before memory flush and normal embedded-agent execution in both direct reply and follow-up runner flows. (src/auto-reply/reply/agent-runner.ts:1483, 85e5d486df11)

Likely related people:

steipete: Recent history shows repeated maintenance of preflight compaction, transcript tail scan performance, and oversized transcript guard behavior in agent-runner-memory.ts. (role: recent area contributor; confidence: high; commits: 29af4add2a8e, 39bc43cb6068, b005f01c1304; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/memory-flush.ts)
ArthurNie: Authored the recent current-main change that made preflight compaction a hard gate before oversized agent turns, which is directly adjacent to this PR's estimator path. (role: recent adjacent owner; confidence: medium; commits: 9d54285b0d4a; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/agent-runner-memory.test.ts)
jared596: File history identifies the earlier change that triggered preflight compaction from transcript estimates when usage is stale, the behavior this PR refines. (role: introduced related behavior; confidence: medium; commits: c6d8318d07f5; files: src/auto-reply/reply/agent-runner-memory.ts, src/auto-reply/reply/memory-flush.ts)
giodl73-repo: Authored local transcript usage estimation in the gateway transcript reader, which is adjacent to the bounded recent replay estimate used by this PR. (role: adjacent transcript-estimation contributor; confidence: medium; commits: 2c59ea8a2e76; files: src/gateway/session-utils.fs.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-05-28T07:01:44Z

ClawSweeper PR egg: 🎁 locked until real behavior proof passes.

Details

No creature or rarity is rolled until proof passes.
Eggs are collectible flavor only; they do not affect labels, ratings, merge decisions, or automation.

jbetala7 · 2026-06-03T06:49:31Z

@clawsweeper re-review

The "Real behavior proof" and auto-response checks were from the 06-01 Dependency Guard rebase run (auto-response shows "A task was canceled"), and the prior proof referenced the pre-rebase head b90c31584d. I re-ran the real source-runtime proof at the exact current head 74ba2e34ec and updated the PR body.

Real behavior proof (real runPreflightCompactionIfNeeded source path via node --import tsx; only the terminal compaction action is instrumented to count attempts):

Before (origin/main estimator): same harness + transcript -> runPreflightCompactionIfNeeded throws Preflight compaction required but failed: not_compacted (main forces preflight compaction on the stale-after-compaction transcript — the bug).
After (head 74ba2e34ec): both scenarios return the original session entry, never enter preflight_compacting, compactCalls: 0:
- moderate stale usage input=9000 output=80000, 450000-byte pre-marker tail, 100k window -> compactCalls: 0, returnedOriginalEntry: true
- giant stale usage input=240000 output=120000, same tail/window -> compactCalls: 0, returnedOriginalEntry: true

Focused suite at this head: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-memory.test.ts -> 42 passed.

clawsweeper · 2026-06-03T06:49:34Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26868589918
Updated: 2026-06-03T06:59:34.038Z

ClawSweeper re-review flagged a P1: transcriptLineHasPostUsageCompactionMarker also matched the post-compaction refresh phrases ("[Post-compaction context refresh]", "Session was just compacted.") in free message text. Those phrases are prompt-injected context, not persisted markers, so ordinary user/tool content echoing them could masquerade as a compaction marker and wrongly drop stale-usage pressure. Detect only structured compaction records (type "compaction"/"session.compacted", the records the runtime actually writes via transcript-file-state / session-manager) and drop the free-text fallback plus the now-unused collectTranscriptText helper. Add a regression test proving ordinary transcript text echoing the refresh phrase (with no structured marker) keeps preflight compaction conservative.

jbetala7 · 2026-06-03T07:22:13Z

Addressed the P1 from the last re-review at head 26d1293708.

transcriptLineHasPostUsageCompactionMarker now keys only on structured compaction records (type/payload.type === compaction | session.compacted). I removed the free-text phrase fallback (and the now-unused collectTranscriptText helper) because [Post-compaction context refresh] / Session was just compacted. originate from post-compaction-context.ts as prompt-injected context, not persisted transcript markers — so ordinary user/tool content echoing them can no longer masquerade as a marker.

Proof (real runPreflightCompactionIfNeeded, compaction counted):

structured {type:"compaction"} markers still suppress compaction: moderate + giant stale cases both compactCalls=0, returns original entry (feature preserved).
new regression — ordinary user text echoing both phrases with no structured record: compactCalls=1 (still compacts, not fooled).

Focused suite now 43 passed (adds the phrase-echo regression ClawSweeper requested). @clawsweeper re-review

clawsweeper · 2026-06-03T07:22:16Z

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

jbetala7 mentioned this pull request May 14, 2026

[Bug]: repeated early preflight compactions after compaction due to stale transcript usage #81178

Open

openclaw-barnacle Bot added the labels "size: S" and "triage: needs-real-behavior-proof" (Candidate: external PR needs after-fix proof from a real setup.) on May 14, 2026

openclaw-barnacle Bot added the label "proof: supplied" (External PR includes structured after-fix real behavior proof.) and removed the label "triage: needs-real-behavior-proof" on May 14, 2026

clawsweeper Bot added the label "proof: sufficient" (ClawSweeper judged the real behavior proof convincing.) on May 14, 2026

openclaw-barnacle Bot added the label "size: M" and removed the label "size: S" on May 14, 2026

clawsweeper Bot mentioned this pull request May 15, 2026

fix(memory): prevent stale preflight compaction loops and assert OpenAI model list IDs #81231

Closed

clawsweeper Bot added the label "proof: sufficient" (ClawSweeper judged the real behavior proof convincing.) on May 16, 2026

openclaw-barnacle Bot added the labels "cli" (CLI command changes), "scripts" (Repository scripts), "commands" (Command implementations), "agents" (Agent runtime and tooling), "extensions: kimi-coding", "extensions: codex", "extensions: deepinfra", "extensions: openrouter", and "extensions: xai" on May 17, 2026

clawsweeper Bot added the labels "P2" (Normal backlog priority with limited blast radius.) and "impact:session-state" (Session, memory, transcript, context, or agent state can drift or corrupt.) on May 17, 2026

jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from e7c39ee to 3fa8b5d on May 17, 2026 16:12

openclaw-barnacle Bot removed the label "cli" (CLI command changes) on May 17, 2026

jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from 896b5c8 to 460f423 on May 31, 2026 07:28

clawsweeper Bot added the label "rating: 🦐 gold shrimp" (Decent PR readiness signal, but merge confidence is limited.) and removed the label "rating: 🦪 silver shellfish" (Thin PR readiness signal; proof, validation, or implementation needs work.) on May 31, 2026

jbetala7 added 2 commits June 1, 2026 13:39

fix(compaction): bound stale transcript usage

e3d85e3

fix(compaction): drop marker-stale tail pressure

74ba2e3

jbetala7 force-pushed the fix/81178-bound-stale-transcript-usage branch from 460f423 to 74ba2e3 on June 1, 2026 08:10

clawsweeper Bot added the label "rating: 🦪 silver shellfish" (Thin PR readiness signal; proof, validation, or implementation needs work.) and removed the label "rating: 🦐 gold shrimp" (Decent PR readiness signal, but merge confidence is limited.) on Jun 1, 2026

openclaw-barnacle Bot removed the label "proof: sufficient" (ClawSweeper judged the real behavior proof convincing.) on Jun 3, 2026

yetval mentioned this pull request Jun 8, 2026

fix(reply): project preflight compaction gate by next-input size on fresh tokens #91488

Open

jbetala7 wants to merge 3 commits into openclaw:main from jbetala7:fix/81178-bound-stale-transcript-usage

2 participants
