fix: move compaction planning off the event loop#88132
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 29, 2026, 4:46 PM ET / 20:46 UTC. Summary PR surface: Source +662, Tests +125, Other +2. Total +789 across 11 files. Reproducibility: yes. for the source-level failure shape: current main and the latest release run token estimation, splitting, and pruning synchronously in compaction, matching the linked timer-delay report. This review did not run a live Telegram reproduction. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land after maintainer review accepts the worker failure semantics and package proof, or add packaged/live compaction proof first if production-path confidence is required. Do we have a high-confidence way to reproduce the issue? Yes for the source-level failure shape: current main and the latest release run token estimation, splitting, and pruning synchronously in compaction, matching the linked timer-delay report. This review did not run a live Telegram reproduction. Is this the best way to solve the issue? Yes, moving deterministic planning off the main event loop is the narrow maintainable fix for the reported starvation. The remaining question is proof depth, not a different implementation direction. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 61e7b042b633. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +662, Tests +125, Other +2. Total +789 across 11 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
d153113 to
1d566e3
Compare
1d566e3 to
15320f2
Compare
|
Pre-merge verification for this PR:
Known CI gaps: latest |
Summary
Fixes #86358.
This replaces the yield-only mitigation with a real worker-backed compaction planning path:
src/agents/compaction-planning.ts: token estimates, safe transcript projection, token-share splitting, max-token chunking, oversized fallback planning, adaptive chunk ratio, and history pruning.src/agents/compaction-planning.worker.tsas the worker entry andsrc/agents/compaction-planning-worker.tsas the main-thread runner.generateSummary/provider path.workerDatastructured clone sotoolResult.detailsand runtime-context custom messages do not cross the thread boundary or get cloned into worker memory.>=64messages). Small histories stay synchronous to avoid worker startup overhead and to preserve cheap test/mocked behavior.summarizeInStages: history pruning, adaptive chunk-ratio calculation, stage splitting, summary chunking, and oversized-message fallback planning.tsdown, the tsdown config test, release packaging requirements, and release-check fixtures.The important architecture point: this does not put sessions, auth, plugin providers, model calls, or hooks in a worker. It moves the deterministic CPU planning that was starving the Node event loop while leaving runtime side effects on the main thread.
Verification
OPENCLAW_VITEST_MAX_WORKERS=1 OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-cache-86358-worker2 /usr/bin/time -l pnpm test src/agents/compaction-planning-worker.test.ts -- --reporter=verboseOPENCLAW_VITEST_MAX_WORKERS=1 OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-cache-86358-sanitize /usr/bin/time -l pnpm test src/agents/compaction.token-sanitize.test.ts src/agents/compaction-planning-worker.test.ts src/agents/compaction-partial-summary.test.ts src/agents/compaction.identifier-preservation.test.ts src/agents/compaction.summarize-fallback.test.ts src/agents/compaction.tool-result-details.test.ts src/agents/agent-hooks/compaction-safeguard.test.ts test/release-check.test.ts -- --reporter=verboseOPENCLAW_VITEST_MAX_WORKERS=1 OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-cache-86358-compaction3 /usr/bin/time -l pnpm test src/agents/compaction.test.ts src/agents/compaction.token-sanitize.test.ts src/agents/compaction-planning-worker.test.ts src/infra/tsdown-config.test.ts -- --reporter=verboseOPENCLAW_VITEST_MAX_WORKERS=1 OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/tmp/openclaw-vitest-cache-86358-release /usr/bin/time -l pnpm test test/release-check.test.ts -- --reporter=verbose/usr/bin/time -l pnpm build/Users/steipete/Projects/agent-skills/skills/autoreview/scripts/autoreview --mode localautoreview clean: no accepted/actionable findings reported.git diff --checkOPENCLAW_TESTBOX=0 /usr/bin/time -l pnpm check:changedsrc/config/sessions.cache.test.tsat lines 743, 772, 791, and 796. This patch does not touch that file. The same failure was observed before this final patch while validating the issue.Attempted remote changed-gate proof:
tbx_01kstpjpm958rksk5f1mjw7f95, GitHub Actions run https://github.com/openclaw/openclaw/actions/runs/26660367101pnpm installbefore reachingcheck:changed, so I cancelled it and ran the local changed gate above.Real behavior proof
Behavior addressed: Large compaction planning no longer monopolizes the main Node event loop before summarization requests. Telegram/MCP timers should get main-thread turns while CPU-heavy compaction planning runs in a worker.
Real environment tested: Local Node/OpenClaw source checkout plus actual Node worker-thread execution in the worker regression test. No live Telegram account or production MCP backend was used.
Exact steps or command run after this patch: See the verification commands above, especially
src/agents/compaction-planning-worker.test.tsandpnpm build.Evidence after fix: The worker regression schedules a zero-delay timer and a large stage-split planning job at the same time; the timer resolves first while the worker continues planning, proving the planning CPU is not blocking the main event loop in that path.
Observed result after fix: Worker planning returns the same summary chunk/stage planning shape, missing worker runtime is classified as unavailable for fallback, sanitized worker inputs omit
toolResult.detailsand runtime-context custom messages, and existing compaction fallback/identifier/safeguard tests continue to pass.What was not tested: Live Telegram timeout behavior against a real bot session was not exercised in this PR; the proof is focused worker-thread/event-loop behavior plus existing compaction unit coverage.