Bound aggregate persisted tool results#87639
Conversation
86adec9 to
5d1e1d8
Compare
|
Codex review: needs real behavior proof before merge. Reviewed May 28, 2026, 4:09 PM ET / 20:09 UTC. Summary PR surface: Source +73, Tests +137. Total +210 across 4 files. Reproducibility: yes. for source-level reproduction: current main only caps individual toolResult messages, and the linked report describes repeated sub-cap persisted tool results causing short later prompts to exceed context. I did not run a live Telegram/Windows reproduction in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land after current-head proof shows the streamFn path bounds provider-visible history while preserving canonical session state, and maintainers accept the 4x older-first aggregate policy. Do we have a high-confidence way to reproduce the issue? Yes for source-level reproduction: current main only caps individual toolResult messages, and the linked report describes repeated sub-cap persisted tool results causing short later prompts to exceed context. I did not run a live Telegram/Windows reproduction in this read-only review. Is this the best way to solve the issue? Unclear until the merge policy is accepted: the final implementation is narrower than rewriting persisted messages and has focused tests, but current-head real behavior proof should be refreshed after the final streamFn-transform change. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 59205bd63c6a. Label changesLabel justifications:
Evidence reviewedPR surface: Source +73, Tests +137. Total +210 across 4 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg: 🔥 warming; proof passed, review follow-up or readiness checks remain. Hatch with Rules and detailsHatchability:
About:
|
5d1e1d8 to
8930882
Compare
|
Added a tracked bug report with the redacted real 2026.5.26 failure details: The concrete user-visible symptom was repeated Telegram replies like: This happened even after short follow-up messages such as I am intentionally not attaching the raw screenshot/session JSONL because it includes private Telegram/session details, but the issue includes the redacted failure text, environment, and persisted-message shape. |
8930882 to
985c40f
Compare
985c40f to
b47071d
Compare
b47071d to
765ed27
Compare
2b6bf3e to
6b5f4f6
Compare
6b5f4f6 to
6d7603d
Compare
6d7603d to
1b428fe
Compare
1b428fe to
433df50
Compare
433df50 to
f87b233
Compare
f87b233 to
f04fd94
Compare
|
Maintainer rewrite pushed. Behavior addressed: bound aggregate tool-result history at the prompt boundary, without rewriting the just-appended persisted session entry. Real environment tested: local macOS focused Vitest plus Blacksmith Testbox Linux changed-lane proof. Exact steps or command run after this patch:
Evidence after fix: final head Observed result after fix: provider-visible prompt history is aggregate-bounded, older aggregate tool results are reduced before newer ones, persisted transcript/session history remains canonical, and context-engine What was not tested: live provider/Telegram reproduction was not run; this was covered with focused unit/integration tests plus changed-lane CI. |
|
Rebased after Behavior addressed: unchanged from the proof above. Real environment tested: same Testbox proof above plus local focused rebase sanity. Exact steps or command run after this patch:
Evidence after fix: final rebased head Observed result after fix: no behavior change from the already-reviewed/proved diff; rebase only moved the base to latest What was not tested: repeated full Testbox after the clean rebase; avoided per moving-main policy after one green full run plus focused rebase sanity. |
Keep aggregate tool-result truncation on the prompt history boundary instead of rewriting the just-appended persisted branch entry. Co-authored-by: luyifan <al3060388206@gmail.com>
|
Merged as f49a3e4. Verification run on final PR head 4f54861:
|
Summary
toolResulttext kept on the active session branchWhy
A long-running tool can append many
toolResultmessages where each result is under the per-message cap. The persisted session then replays the aggregate output on later turns, so even a short follow-up can exceed the model context window.Real behavior proof
Behavior or issue addressed: User-reported OpenClaw 2026.5.26 failure with concrete runtime/error evidence: on a real Telegram/Windows setup, a long-running
processtask persisted many individually sub-captoolResultchunks. Later one-character Telegram replies such as1failed withYour input exceeds the context window of this modelbecause the direct session history was already poisoned by aggregate tool output.Real environment tested: Windows/Telegram OpenClaw 2026.5.26 bot session (
agent:main:telegram:default:direct:8589344021) plus this branch in a local OpenClaw checkout on macOS.Exact steps or command run after this patch: Reproduced the same shape with repeated
process-named tool results against the realSessionManager/installSessionToolResultGuardpath in this branch, then inspected the active branch aggregate after each append. Also verified the original real session backup contained manytoolName: processtoolResultentries around ~30k chars each before reset.Evidence after fix: Copied terminal output from the patched guard path showing aggregate persisted tool output stays bounded instead of accumulating unbounded across repeated process results:
Observed result after fix: Repeated sub-cap tool outputs no longer grow the active branch's persisted
toolResulttext without bound; once aggregate output exceeds the configured aggregate budget, older persisted branch entries are rewritten with truncation notices before the next user turn can replay the full accumulated output.What was not tested: I did not hotpatch the user's production Windows
distruntime with this unmerged PR; the user's preference was to avoid further Windows hotfixes unless official packages are unusable.Test plan
./node_modules/.bin/vitest run --config test/vitest/vitest.unit-fast.config.ts src/agents/session-tool-result-guard.test.ts --reporter=dot./node_modules/.bin/vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/session-tool-result-guard.transcript-events.test.ts --reporter=dotpnpm exec oxfmt --check --threads=1 src/agents/session-tool-result-guard.ts src/agents/session-tool-result-guard.test.ts src/agents/session-tool-result-guard.transcript-events.test.ts src/agents/embedded-agent-runner/tool-result-truncation.tspnpm exec oxlint src/agents/session-tool-result-guard.ts src/agents/session-tool-result-guard.test.ts src/agents/session-tool-result-guard.transcript-events.test.ts src/agents/embedded-agent-runner/tool-result-truncation.ts src/plugin-sdk/approval-reaction-runtime.ts src/plugins/discovery.tspnpm run test:extensions:package-boundary:compileCI unblock note: after rebasing onto current
origin/main, the extension/package boundary prep exposed two existing type-check issues (approval-reaction-runtimecommand-action narrowing and nullable package-manifest stat). This branch includes minimal fixes for those so the PR can exercise the full CI path.Note:
pnpm check:changedreached the core tsgo typecheck stage but was killed by the local runner before completion.