Skip to content

Bound aggregate persisted tool results#87639

Merged
steipete merged 1 commit into
openclaw:mainfrom
ooiuuii:fix/bound-session-tool-output
May 28, 2026
Merged

Bound aggregate persisted tool results#87639
steipete merged 1 commit into
openclaw:mainfrom
ooiuuii:fix/bound-session-tool-output

Conversation

@ooiuuii

@ooiuuii ooiuuii commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

  • bound the aggregate persisted toolResult text kept on the active session branch
  • preserve the existing per-result cap while adding a larger aggregate cap for repeated sub-cap outputs
  • cover long-running tools that emit many individually acceptable results but can poison later prompts

Why

A long-running tool can append many toolResult messages where each result is under the per-message cap. The persisted session then replays the aggregate output on later turns, so even a short follow-up can exceed the model context window.

Real behavior proof

  • Behavior or issue addressed: User-reported OpenClaw 2026.5.26 failure with concrete runtime/error evidence: on a real Telegram/Windows setup, a long-running process task persisted many individually sub-cap toolResult chunks. Later one-character Telegram replies such as 1 failed with Your input exceeds the context window of this model because the direct session history was already poisoned by aggregate tool output.

  • Real environment tested: Windows/Telegram OpenClaw 2026.5.26 bot session (agent:main:telegram:default:direct:8589344021) plus this branch in a local OpenClaw checkout on macOS.

  • Exact steps or command run after this patch: Reproduced the same shape with repeated process-named tool results against the real SessionManager/installSessionToolResultGuard path in this branch, then inspected the active branch aggregate after each append. Also verified the original real session backup contained many toolName: process toolResult entries around ~30k chars each before reset.

  • Evidence after fix: Copied terminal output from the patched guard path showing aggregate persisted tool output stays bounded instead of accumulating unbounded across repeated process results:

    scenario: 8 process toolResult messages, each ~850 chars, maxToolResultChars=1000
    previous aggregate shape: 8 * ~853 chars = ~6824 chars persisted on the active branch
    patched active branch result:
    {
      "toolResultMessages": 8,
      "aggregateToolResultChars": 3998,
      "aggregateBudgetChars": 4000,
      "hasTruncationNotice": true,
      "activeBranchMessages": 16
    }
    real-session failure evidence before fix/reset from the 2026.5.26 report: Telegram replies such as "1" returned "Your input exceeds the context window of this model"; inspection of the persisted direct session showed repeated large `toolName: process` / `toolResult` chunks, many around ~30k chars, accumulated before the session was reset.
    
  • Observed result after fix: Repeated sub-cap tool outputs no longer grow the active branch's persisted toolResult text without bound; once aggregate output exceeds the configured aggregate budget, older persisted branch entries are rewritten with truncation notices before the next user turn can replay the full accumulated output.

  • What was not tested: I did not hotpatch the user's production Windows dist runtime with this unmerged PR; the user's preference was to avoid further Windows hotfixes unless official packages are unusable.

Test plan

  • ./node_modules/.bin/vitest run --config test/vitest/vitest.unit-fast.config.ts src/agents/session-tool-result-guard.test.ts --reporter=dot
  • ./node_modules/.bin/vitest run --config test/vitest/vitest.agents-core.config.ts src/agents/session-tool-result-guard.transcript-events.test.ts --reporter=dot
  • pnpm exec oxfmt --check --threads=1 src/agents/session-tool-result-guard.ts src/agents/session-tool-result-guard.test.ts src/agents/session-tool-result-guard.transcript-events.test.ts src/agents/embedded-agent-runner/tool-result-truncation.ts
  • pnpm exec oxlint src/agents/session-tool-result-guard.ts src/agents/session-tool-result-guard.test.ts src/agents/session-tool-result-guard.transcript-events.test.ts src/agents/embedded-agent-runner/tool-result-truncation.ts src/plugin-sdk/approval-reaction-runtime.ts src/plugins/discovery.ts
  • pnpm run test:extensions:package-boundary:compile

CI unblock note: after rebasing onto current origin/main, the extension/package boundary prep exposed two existing type-check issues (approval-reaction-runtime command-action narrowing and nullable package-manifest stat). This branch includes minimal fixes for those so the PR can exercise the full CI path.

Note: pnpm check:changed reached the core tsgo typecheck stage but was killed by the local runner before completion.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: L triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@ooiuuii ooiuuii force-pushed the fix/bound-session-tool-output branch from 86adec9 to 5d1e1d8 Compare May 28, 2026 12:52
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 28, 2026, 4:09 PM ET / 20:09 UTC.

Summary
The PR adds aggregate tool-result truncation to embedded agent prompt history/provider submission and tests that canonical session messages remain unrewritten.

PR surface: Source +73, Tests +137. Total +210 across 4 files.

Reproducibility: yes. for source-level reproduction: current main only caps individual toolResult messages, and the linked report describes repeated sub-cap persisted tool results causing short later prompts to exceed context. I did not run a live Telegram/Windows reproduction in this read-only review.

Review metrics: 1 noteworthy metric.

  • Prompt-History Budget: 1 new aggregate cap: 4x resolved per-result toolResult cap. This changes the default provider-visible session replay budget, so maintainers should notice it before merge.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add current-head real behavior proof for 4f54861, with private session details redacted.
  • Have a maintainer explicitly accept the 4x aggregate budget and older-first provider-history truncation policy.

Proof guidance:

  • [P1] Needs stronger real behavior proof before merge: Earlier terminal/Testbox proof exists, but the current head changed the prompt-submission mechanism after that proof; refreshed current-head proof is needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] Provider-visible session replay will replace older accumulated toolResult text with truncation notices under a new 4x aggregate budget, which can change model behavior for sessions that relied on older tool output.
  • [P1] The recorded Testbox/terminal proof predates the final streamFn-transform implementation at 4f54861.

Maintainer options:

  1. Refresh Current-Head Proof (recommended)
    Add terminal, Testbox, or redacted runtime proof for final head 4f54861 showing provider-visible aggregate truncation and unchanged canonical session history.
  2. Accept Prior Proof Explicitly
    A maintainer may decide the earlier Testbox run plus focused tests are enough despite the final implementation rewrite, but that should be an explicit merge decision.
  3. Pause For Budget Policy
    If the 4x aggregate budget or older-first reduction is not the desired default, pause this branch and choose the permanent session-history policy before merge.

Next step before merge

  • [P1] Manual review is needed because no narrow code repair is identified; merge depends on current-head proof and maintainer acceptance of the session-history policy.

Security
Cleared: The diff is limited to agent runtime prompt-history code and focused tests, with no CI, dependency, secrets, lockfile, or supply-chain surface changed.

Review details

Best possible solution:

Land after current-head proof shows the streamFn path bounds provider-visible history while preserving canonical session state, and maintainers accept the 4x older-first aggregate policy.

Do we have a high-confidence way to reproduce the issue?

Yes for source-level reproduction: current main only caps individual toolResult messages, and the linked report describes repeated sub-cap persisted tool results causing short later prompts to exceed context. I did not run a live Telegram/Windows reproduction in this read-only review.

Is this the best way to solve the issue?

Unclear until the merge policy is accepted: the final implementation is narrower than rewriting persisted messages and has focused tests, but current-head real behavior proof should be refreshed after the final streamFn-transform change.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 59205bd63c6a.

Label changes

Label justifications:

  • P2: This fixes a real session-history context overflow with limited surface, but it is not an emergency runtime outage.
  • merge-risk: 🚨 session-state: The PR changes provider-visible session history by truncating older aggregate toolResult text while leaving persisted history canonical.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: Earlier terminal/Testbox proof exists, but the current head changed the prompt-submission mechanism after that proof; refreshed current-head proof is needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +73, Tests +137. Total +210 across 4 files.

View PR surface stats
Area Files Added Removed Net
Source 2 99 26 +73
Tests 2 141 4 +137
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 4 240 30 +210

What I checked:

  • Root and scoped review policy read: Root AGENTS.md and the scoped agent/embedded-runner AGENTS.md files were read; the root policy treats auth/session state and persisted context behavior as compatibility-sensitive, and the runner policy asks for focused helper coverage plus nearby runner/context-engine proof. (AGENTS.md:1, 59205bd63c6a)
  • Current main only has a per-result prompt guard: On current main, truncateOversizedToolResultsInMessages accepts only a per-result max and returns each sub-cap toolResult unchanged, so repeated sub-cap tool results can still accumulate in aggregate. (src/agents/embedded-agent-runner/tool-result-truncation.ts:335, 59205bd63c6a)
  • PR head adds aggregate prompt-history truncation: The PR head resolves the per-result cap, applies a 4x aggregate cap before prompt pressure checks, and wraps provider stream messages so provider-visible history is aggregate-bounded without rewriting canonical session state. (src/agents/embedded-agent-runner/run/attempt.ts:3443, 4f54861333fb)
  • Focused tests cover the intended session-state boundary: The new runner/context-engine test verifies the provider-submitted messages are capped while the prompt handler and afterTurn/canonical messages remain above the aggregate cap and do not contain truncation notices. (src/agents/embedded-agent-runner/run/attempt.spawn-workspace.context-engine.test.ts:2406, 4f54861333fb)
  • Current-head proof gap: The last proof comment covers earlier heads, while the diff from 64c761d to final head replaces the temporary session-message swap with a streamFn message transform; current-head proof should be refreshed for that final mechanism. (src/agents/embedded-agent-runner/run/attempt.ts:3924, 4f54861333fb)

Likely related people:

  • steipete: Authored the final PR head and maintainer rewrite/proof comments for the prompt-history truncation path under review. (role: recent area contributor; confidence: high; commits: 4f54861333fb; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/tool-result-truncation.ts, src/agents/embedded-agent-runner/run/attempt.spawn-workspace.context-engine.test.ts)
  • alkor2000: Available local blame/log history for the existing current-main per-result truncation and nearby embedded-runner lines points to commit b3db1db. (role: current-main baseline contributor; confidence: low; commits: b3db1dba85fe; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/tool-result-truncation.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg: 🔥 warming; proof passed, review follow-up or readiness checks remain. Hatch with @clawsweeper hatch when eligible.

Rules and details

Hatchability:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

About:

  • Eggs appear after real-behavior proof passes. They are collectible flavor only.
  • Review momentum changes the shell state: follow-up work warms it, re-review makes it wobble, and a clean final review lets it hatch.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@ooiuuii ooiuuii force-pushed the fix/bound-session-tool-output branch from 5d1e1d8 to 8930882 Compare May 28, 2026 13:01
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 28, 2026
@ooiuuii

ooiuuii commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

Added a tracked bug report with the redacted real 2026.5.26 failure details:

The concrete user-visible symptom was repeated Telegram replies like:

Your input exceeds the context window of this model. Please adjust your input and try again.

This happened even after short follow-up messages such as 1, which confirmed the current user input was not the large payload. Inspection of the affected persisted direct session showed repeated large toolName: process / toolResult chunks accumulated on the active history branch.

I am intentionally not attaching the raw screenshot/session JSONL because it includes private Telegram/session details, but the issue includes the redacted failure text, environment, and persisted-message shape.

@ooiuuii ooiuuii force-pushed the fix/bound-session-tool-output branch from 8930882 to 985c40f Compare May 28, 2026 13:43
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@ooiuuii ooiuuii force-pushed the fix/bound-session-tool-output branch from 985c40f to b47071d Compare May 28, 2026 13:51
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@ooiuuii ooiuuii force-pushed the fix/bound-session-tool-output branch from b47071d to 765ed27 Compare May 28, 2026 14:27
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@steipete steipete force-pushed the fix/bound-session-tool-output branch from 2b6bf3e to 6b5f4f6 Compare May 28, 2026 18:14
@openclaw-barnacle openclaw-barnacle Bot added size: M and removed size: S proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 28, 2026
@steipete steipete force-pushed the fix/bound-session-tool-output branch from 6b5f4f6 to 6d7603d Compare May 28, 2026 18:16
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 28, 2026
@steipete steipete force-pushed the fix/bound-session-tool-output branch from 6d7603d to 1b428fe Compare May 28, 2026 18:39
@steipete steipete force-pushed the fix/bound-session-tool-output branch from 1b428fe to 433df50 Compare May 28, 2026 18:42
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@steipete steipete force-pushed the fix/bound-session-tool-output branch from 433df50 to f87b233 Compare May 28, 2026 18:56
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@steipete steipete force-pushed the fix/bound-session-tool-output branch from f87b233 to f04fd94 Compare May 28, 2026 19:04
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@steipete

Copy link
Copy Markdown
Contributor

Maintainer rewrite pushed.

Behavior addressed: bound aggregate tool-result history at the prompt boundary, without rewriting the just-appended persisted session entry.

Real environment tested: local macOS focused Vitest plus Blacksmith Testbox Linux changed-lane proof.

Exact steps or command run after this patch:

  • pnpm test src/agents/embedded-agent-runner/tool-result-truncation.test.ts src/agents/embedded-agent-runner/run/attempt.spawn-workspace.context-engine.test.ts -- --reporter=verbose
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
  • node scripts/crabbox-wrapper.mjs run -provider blacksmith-testbox -blacksmith-workflow .github/workflows/ci-check-testbox.yml -blacksmith-job check -label pr-87639-check-changed-final -shell -- "pnpm check:changed"

Evidence after fix: final head 9c6b76d0b889010bea6e7b865caba283d6151f91; focused Vitest passed 91 tests; autoreview clean; Testbox tbx_01ksr0a12gaks2x5f0waexrnsf passed with exit 0; GitHub PR checks are green.

Observed result after fix: provider-visible prompt history is aggregate-bounded, older aggregate tool results are reduced before newer ones, persisted transcript/session history remains canonical, and context-engine afterTurn does not ingest prompt-only truncation notices.

What was not tested: live provider/Telegram reproduction was not run; this was covered with focused unit/integration tests plus changed-lane CI.

@steipete

Copy link
Copy Markdown
Contributor

Rebased after main moved during merge.

Behavior addressed: unchanged from the proof above.

Real environment tested: same Testbox proof above plus local focused rebase sanity.

Exact steps or command run after this patch:

  • git fetch origin main && git rebase origin/main
  • pnpm test src/agents/embedded-agent-runner/tool-result-truncation.test.ts src/agents/embedded-agent-runner/run/attempt.spawn-workspace.context-engine.test.ts -- --reporter=verbose

Evidence after fix: final rebased head 64c761db08eb19e107d908e9fcc6d0e5b45eb4d3; same four-file diff; focused Vitest passed 91 tests after rebase; prior final Testbox tbx_01ksr0a12gaks2x5f0waexrnsf passed pnpm check:changed before the clean rebase.

Observed result after fix: no behavior change from the already-reviewed/proved diff; rebase only moved the base to latest origin/main.

What was not tested: repeated full Testbox after the clean rebase; avoided per moving-main policy after one green full run plus focused rebase sanity.

Keep aggregate tool-result truncation on the prompt history boundary instead of rewriting the just-appended persisted branch entry.

Co-authored-by: luyifan <al3060388206@gmail.com>
@steipete

Copy link
Copy Markdown
Contributor

Merged as f49a3e4.

Verification run on final PR head 4f54861:

  • pnpm test src/agents/embedded-agent-runner/tool-result-truncation.test.ts src/agents/embedded-agent-runner/run/attempt.spawn-workspace.context-engine.test.ts -- --reporter=verbose - 91 tests passed.
  • node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo - passed.
  • pnpm lint --threads=8 - passed.
  • Autoreview command: /Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main - clean after the accepted slash-command/provider-boundary finding was fixed.
  • GitHub CI rerun 26599074695 - passed.
  • GitHub CodeQL 26599074742, CodeQL Critical Quality 26599074700, OpenGrep PR Diff 26599075126, Workflow Sanity 26599075130 - passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants