fix(diagnostics): recover orphaned session activity and yield event loop during lock contention #87028
Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 9:45 PM ET / 01:45 UTC.
Summary
PR surface: Source +35, Tests +299, Docs +1. Total +335 across 9 files.
Reproducibility: yes, at source level: current main keeps idle queued ownerless model/tool activity outside recoverable
Review metrics: none identified.
Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Review details
Best possible solution: Have a maintainer review the recovery semantics and latest-head guard checks, then land this narrow fix if they accept the session-state and availability tradeoff as the right short-term repair for the related production failure.
Do we have a high-confidence way to reproduce the issue? Yes, at source level: current main keeps idle queued ownerless model/tool activity outside recoverable
Is this the best way to solve the issue? Yes, conditionally: the patch is a narrow repair in the existing diagnostics and session-lock paths, with expected-state gates and no new config surface. Because it changes recovery semantics in a session hot path, maintainer acceptance and green focused checks remain the best merge gate.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against b74984dd5069.
…oop during lock contention
Idle sessions with queued work and stale orphaned activity (model_call or tool_call without an active embedded owner) were classified as non-recoverable stalled active work. This left the queue wedged until the underlying timeout unwound (up to 2 hours), blocking message delivery.
Additionally, session write lock callbacks that run synchronous process inspection (readProcessArgsSync) now yield to the event loop between retry attempts, preventing a single lock contention storm from starving all other sessions.
Related openclaw#84903
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
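A minimal sketch of the classification change described in the commit message, not the actual diff. The labels session.stuck and stalled_agent_run, the model_call / tool_call kinds, and recoveryEligible come from the PR text; every other type name, field name, the "other" bucket, and the 60-second staleness threshold are assumptions made for illustration.

```typescript
// Hypothetical shapes: the real diagnostic tracker types are not shown in this PR text.
type SessionState = "idle" | "active";
type ActiveWorkKind = "model_call" | "tool_call" | "embedded_run";

interface SessionSnapshot {
  state: SessionState;
  queuedCount: number;
  activeWorkKind?: ActiveWorkKind;
  hasActiveEmbeddedOwner: boolean;
  lastActivityAgeMs: number;
}

interface Classification {
  kind: "session.stuck" | "stalled_agent_run" | "other";
  recoveryEligible: boolean;
}

// Assumed threshold, for illustration only.
const STALE_AFTER_MS = 60_000;

function classify(s: SessionSnapshot): Classification {
  const stale = s.lastActivityAgeMs >= STALE_AFTER_MS;
  if (!stale) {
    return { kind: "other", recoveryEligible: false };
  }

  // Orphaned: model/tool activity is still tracked but no embedded owner is alive.
  const orphaned =
    (s.activeWorkKind === "model_call" || s.activeWorkKind === "tool_call") &&
    !s.hasActiveEmbeddedOwner;

  // Idle + queued + stale + ownerless: route to idle-state recovery.
  if (s.state === "idle" && s.queuedCount > 0 && orphaned) {
    return { kind: "session.stuck", recoveryEligible: true };
  }

  // Any other stale tracked work keeps the protected, non-recoverable label.
  if (s.activeWorkKind !== undefined) {
    return { kind: "stalled_agent_run", recoveryEligible: false };
  }

  // Cases without tracked work are not modeled in this sketch.
  return { kind: "other", recoveryEligible: false };
}
```

The point of passing the session state into the classifier is that the same stale model_call is treated differently depending on whether an owner is still alive: with an owner it stays protected, without one the queue is allowed to recover.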
ecfb51d to 1666849
Verification before merge:
Behavior addressed: idle queued sessions with stale ownerless model/tool activity now classify as recoverable session.stuck, and session write-lock stale-owner checks yield before synchronous process inspection.
Real environment tested: local macOS checkout plus Blacksmith Testbox.
Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: diagnostics and recovery tests cover orphaned model_call/tool_call activity on idle queued sessions; lock tests still pass with the event-loop yield around stale process inspection.
What was not tested: no full 47-agent live replay, no WSL2/Feishu live repro, and no fresh Telegram production replay beyond the author-provided real macOS Telegram gateway proof.
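The event-loop yield around stale process inspection mentioned above can be sketched as follows. yieldEventLoop here is a guess at the helper's shape based on the PR description (a setImmediate-backed promise); inspectOwnerSync is a hypothetical stand-in for the synchronous readProcessArgsSync check, and the retry loop is illustrative rather than the real lock code.

```typescript
// Resolves on the next setImmediate turn, letting queued callbacks from other sessions run.
function yieldEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setImmediate(resolve));
}

// Hypothetical stand-in for a blocking check such as readProcessArgsSync.
// For the demo, even pids are treated as dead owners.
function inspectOwnerSync(pid: number): boolean {
  return pid % 2 === 0;
}

// Yield first, then do the synchronous inspection.
async function shouldReclaim(pid: number): Promise<boolean> {
  await yieldEventLoop();
  return inspectOwnerSync(pid);
}

// Retries against a live owner while unrelated work keeps making progress.
async function demo(): Promise<{ attempts: number; otherTicks: number }> {
  let otherTicks = 0;
  let running = true;

  // Simulated work belonging to other sessions.
  const tick = (): void => {
    if (running) {
      otherTicks++;
      setImmediate(tick);
    }
  };
  setImmediate(tick);

  let attempts = 0;
  while (attempts < 5) {
    attempts++;
    await shouldReclaim(1); // odd pid: never reclaimable, so every attempt retries
  }

  running = false;
  return { attempts, otherTicks };
}
```

Without the await in shouldReclaim, a tight retry loop of synchronous inspections would keep the other sessions' callbacks queued until the loop finished; with it, each retry gives them a turn first.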
Summary
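Classify idle queued sessions with stale ownerless model/tool activity as recoverable session.stuck instead of non-recoverable stalled_agent_run
Yield to the event loop in shouldReclaim/shouldRemoveStaleLock callbacks before synchronous process inspection, preventing lock contention retry storms from starving other sessions
Context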
Related #84903.
This targets two production failure shapes:
Orphaned activity blocking queue: When an embedded owner dies but model_call or tool_call activity remains in the diagnostic tracker, the session sits idle with queued work that never drains. The classifier saw activeWorkKind and returned stalled_agent_run (non-recoverable) instead of routing to idle-state recovery. This PR threads state into the classifier so idle+queued+stale+ownerless activity is classified as session.stuck with recoveryEligible: true.
Lock contention event loop starvation: shouldReclaim and shouldRemoveStaleLock callbacks call readProcessArgsSync on every retry attempt. Under high lock contention, repeated synchronous process inspection blocks the event loop entirely. This PR adds await yieldEventLoop() (setImmediate) before the sync inspection so other sessions can make progress between retry attempts.
Safety
Recovery uses expectedState: "idle" and does not set allowActiveAbort, so true active embedded/model work remains protected by existing gates
Tests
Ran locally on macOS, Node 24.16.0:
node scripts/run-vitest.mjs run --config test/vitest/vitest.logging.config.ts: 282 passed
node scripts/run-vitest.mjs run src/agents/session-write-lock.test.ts: 78 passed
node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts: 68 passed
node scripts/run-vitest.mjs run src/commands/doctor-session-locks.test.ts: 4 passed
node scripts/run-vitest.mjs run --config test/vitest/vitest.process.config.ts: 111 passed
pnpm build: clean
New test coverage added:
Real behavior proof
Behavior addressed: Gateway event loop starvation and session queue deadlock caused by orphaned diagnostic activity blocking idle-state recovery, combined with synchronous lock contention callbacks starving the event loop. Targets the #84903 production shape where one stalled session blocks all others.
Real environment tested: OpenClaw 2026.5.26 (e8f584e) on macOS Darwin 25.5.0, Node v24.16.0, Telegram direct channel, gateway running as LaunchAgent.
Exact steps or command run after this patch:
diagnostic-session-attention.ts, diagnostic.ts, session-write-lock.ts
pnpm build: clean
pnpm openclaw gateway restart
pnpm openclaw message send --channel telegram --target <redacted> --message "test": delivered as message ID 430
Evidence after fix:
Observed result after fix: Gateway event loop stays at 0ms max delay. Telegram messages deliver in ~1s. CPU idle at 0.3%. No session queue deadlock observed over the monitoring period.
What was not tested: Full multi-agent (47+ agents) production replay with concurrent lock contention was not reproduced. WSL2/Linux and Feishu channel paths were not tested. The broader lock isolation architecture (#84903 long-term fix) is not claimed resolved by this PR.
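The PR text does not say how the max event loop delay was measured. One way to take a comparable reading, using Node's built-in perf_hooks histogram, is sketched below. The function names, the 10 ms sampling resolution, and the sleep and stall durations are assumptions, and readings from this sampler are not guaranteed to match whatever tool produced the figure quoted above.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Samples event loop delay while `work` runs and returns the maximum in milliseconds.
async function maxEventLoopDelayMs(work: () => Promise<void>): Promise<number> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  try {
    await work();
    await sleep(50); // give the sampler timer a chance to fire after the work
  } finally {
    histogram.disable();
  }
  return histogram.max / 1e6; // histogram values are reported in nanoseconds
}

// Simulates a synchronous stall, such as repeated blocking process inspection.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait on purpose
  }
}

// Example: measure a run that contains an 80 ms synchronous stall.
async function measureStalledRun(): Promise<number> {
  return maxEventLoopDelayMs(async () => {
    await sleep(30);
    blockFor(80);
  });
}
```

Running the same measurement around a gateway workload before and after the patch would show whether lock contention retries still produce long stalls.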
🤖 AI-assisted (Claude Code). Human-tested on real OpenClaw gateway with Telegram channel.