Skip to content

fix(diagnostics): recover orphaned session activity and yield event loop during lock contention#87028

Merged
steipete merged 2 commits into
openclaw:mainfrom
samuelsoaress:fix/session-isolation-event-loop-yield
May 27, 2026
Merged

fix(diagnostics): recover orphaned session activity and yield event loop during lock contention#87028
steipete merged 2 commits into
openclaw:mainfrom
samuelsoaress:fix/session-isolation-event-loop-yield

Conversation

@samuelsoaress

Copy link
Copy Markdown
Contributor

Summary

  • Classify idle sessions with queued work and stale orphaned activity (model_call or tool_call) as recoverable session.stuck instead of non-recoverable stalled_agent_run
  • Yield to the event loop in session write lock shouldReclaim/shouldRemoveStaleLock callbacks before synchronous process inspection, preventing lock contention retry storms from starving other sessions
  • Add stress test coverage for multiple simultaneous stalled sessions, orphaned tool_call recovery, and idle-state recovery without active abort

Context

Related #84903.

This targets two production failure shapes:

  1. Orphaned activity blocking queue: When an embedded owner dies but model_call or tool_call activity remains in the diagnostic tracker, the session sits idle with queued work that never drains. The classifier saw activeWorkKind and returned stalled_agent_run (non-recoverable) instead of routing to idle-state recovery. This PR threads state into the classifier so idle+queued+stale+ownerless activity is classified as session.stuck with recoveryEligible: true.

  2. Lock contention event loop starvation: shouldReclaim and shouldRemoveStaleLock callbacks call readProcessArgsSync on every retry attempt. Under high lock contention, repeated synchronous process inspection blocks the event loop entirely. This PR adds await yieldEventLoop() (setImmediate) before the sync inspection so other sessions can make progress between retry attempts.

Safety

  • Orphaned activity recovery uses expectedState: "idle" and does not set allowActiveAbort, so true active embedded/model work remains protected by existing gates
  • The event loop yield is a no-op under normal (no contention) conditions — it only matters when the lock manager is retrying

Tests

Ran locally on macOS, Node 24.16.0:

  • node scripts/run-vitest.mjs run --config test/vitest/vitest.logging.config.ts282 passed
  • node scripts/run-vitest.mjs run src/agents/session-write-lock.test.ts78 passed
  • node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts68 passed
  • node scripts/run-vitest.mjs run src/commands/doctor-session-locks.test.ts4 passed
  • node scripts/run-vitest.mjs run --config test/vitest/vitest.process.config.ts111 passed
  • pnpm build — clean

New test coverage added:

  • Classifier: idle queued stale model_call without embedded owner → recoverable
  • Classifier: idle queued stale tool_call without embedded owner → recoverable
  • Classifier: processing session with orphaned activity → still non-recoverable (safety gate)
  • Heartbeat: orphaned model_call triggers idle-state recovery without allowActiveAbort
  • Heartbeat: orphaned tool_call triggers idle-state recovery without allowActiveAbort
  • Heartbeat: two independent stalled sessions recover independently (cross-session isolation)
  • Recovery runtime: idle queued work with orphaned model_call resets lane without abort
  • Recovery runtime: idle queued work with orphaned tool_call resets lane without abort

Real behavior proof

Behavior addressed: Gateway event loop starvation and session queue deadlock caused by orphaned diagnostic activity blocking idle-state recovery, combined with synchronous lock contention callbacks starving the event loop. Targets the #84903 production shape where one stalled session blocks all others.

Real environment tested: OpenClaw 2026.5.26 (e8f584e) on macOS Darwin 25.5.0, Node v24.16.0, Telegram direct channel, gateway running as LaunchAgent.

Exact steps or command run after this patch:

  1. Applied patch to diagnostic-session-attention.ts, diagnostic.ts, session-write-lock.ts
  2. pnpm build — clean
  3. pnpm openclaw gateway restart
  4. pnpm openclaw message send --channel telegram --target <redacted> --message "test" — delivered as message ID 430
  5. Monitored gateway logs and process metrics for 15+ minutes

Evidence after fix:

Before (gateway frozen, pre-fix):
  eventLoopMax=12650.0ms
  provider auth pre-warmed in 114079ms
  sessions.list: 17467ms
  agents.list: 3636ms
  CPU: 100.1%  RSS: 931MB

After (gateway healthy, post-fix):
  eventLoopMax=0.0ms
  provider auth pre-warmed in 596ms
  message.action: 1073ms
  chat.history: 146ms
  CPU: 0.3%  RSS: 532MB

Observed result after fix: Gateway event loop stays at 0ms max delay. Telegram messages deliver in ~1s. CPU idle at 0.3%. No session queue deadlock observed over the monitoring period.

What was not tested: Full multi-agent (47+ agents) production replay with concurrent lock contention was not reproduced. WSL2/Linux and Feishu channel paths were not tested. The broader lock isolation architecture (#84903 long-term fix) is not claimed resolved by this PR.


🤖 AI-assisted (Claude Code). Human-tested on real OpenClaw gateway with Telegram channel.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M proof: supplied External PR includes structured after-fix real behavior proof. labels May 26, 2026
@clawsweeper

clawsweeper Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 9:45 PM ET / 01:45 UTC.

Summary
The branch makes idle queued sessions with stale ownerless model/tool activity recover as session.stuck, adds an event-loop yield before synchronous lock-owner inspection during contention callbacks, and updates related diagnostics docs/tests.

PR surface: Source +35, Tests +299, Docs +1. Total +335 across 9 files.

Reproducibility: yes. at source level: current main keeps idle queued ownerless model/tool activity outside recoverable session.stuck, and the PR adds focused tests that exercise the previously wedged shape. I did not run a live reproduction in this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Maintainer should confirm latest-head focused checks for diagnostics and session write locks before merge.
  • Maintainer should explicitly accept the short-term recovery behavior versus waiting for the broader linked isolation architecture.

Risk before merge

  • The PR changes when stale ownerless model/tool activity can clear an idle queued session lane, so a bad classification would directly affect session state and queued message delivery.
  • The lock change intentionally adds an async yield inside contention callbacks; that is the right direction for event-loop starvation, but it changes lock retry timing in a gateway availability hot path.
  • The proof is strong for one macOS Telegram gateway but the PR body says the full multi-agent production replay, WSL2/Linux, and Feishu paths were not reproduced.

Maintainer options:

  1. Land after maintainer acceptance (recommended)
    A maintainer can accept the bounded recovery behavior and land once the relevant diagnostic/session-lock checks are green at the current head.
  2. Ask for broader runtime proof
    Maintainers can require an additional Linux/WSL or multi-session stress proof before merge if they want evidence closer to the original Feishu/WSL production shape.
  3. Pause for the broader isolation plan
    If maintainers decide this changes recovery policy too much for a tactical patch, pause this PR and keep the broader session-isolation work in the linked production issue as canonical.

Next step before merge
No narrow automated repair is needed; the remaining action is maintainer review of a high-impact session recovery and availability change.

Security
Cleared: The diff does not touch dependencies, workflows, lockfiles, permissions, credential handling, or package execution surfaces; no concrete security or supply-chain concern was found.

Review details

Best possible solution:

Have a maintainer review the recovery semantics and latest-head guard checks, then land this narrow fix if they accept the session-state and availability tradeoff as the right short-term repair for the related production failure.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main keeps idle queued ownerless model/tool activity outside recoverable session.stuck, and the PR adds focused tests that exercise the previously wedged shape. I did not run a live reproduction in this read-only review.

Is this the best way to solve the issue?

Yes, conditionally: the patch is a narrow repair in the existing diagnostics and session-lock paths, with expected-state gates and no new config surface. Because it changes recovery semantics in a session hot path, maintainer acceptance and green focused checks remain the best merge gate.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b74984dd5069.

Label changes

Label justifications:

  • P1: The PR targets an active gateway event-loop/session queue failure that can block real message processing across sessions.
  • merge-risk: 🚨 session-state: The diff changes diagnostic recovery decisions and can clear idle queued session state when stale ownerless activity is observed.
  • merge-risk: 🚨 message-delivery: The affected recovery path controls whether queued inbound work drains or remains blocked, so mistakes can suppress or mis-handle messages.
  • merge-risk: 🚨 availability: The lock-contention change is aimed at event-loop starvation but sits in a gateway hot path where retry timing can affect process availability.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (logs): The PR body includes after-fix real gateway proof with a macOS LaunchAgent, Telegram send, redacted target, copied runtime metrics, and observed healthy delivery/CPU/event-loop behavior.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix real gateway proof with a macOS LaunchAgent, Telegram send, redacted target, copied runtime metrics, and observed healthy delivery/CPU/event-loop behavior.
Evidence reviewed

PR surface:

Source +35, Tests +299, Docs +1. Total +335 across 9 files.

View PR surface stats
Area Files Added Removed Net
Source 3 40 5 +35
Tests 3 300 1 +299
Docs 3 7 6 +1
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 9 347 12 +335

Acceptance criteria:

  • node scripts/run-vitest.mjs run --config test/vitest/vitest.logging.config.ts
  • node scripts/run-vitest.mjs run src/agents/session-write-lock.test.ts
  • node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
  • node scripts/run-vitest.mjs run src/commands/doctor-session-locks.test.ts
  • node scripts/run-vitest.mjs run --config test/vitest/vitest.process.config.ts

What I checked:

  • Repository policy read and applied: Root AGENTS.md and the relevant scoped docs/agents guides were read; their guidance makes session state, message delivery, gateway availability, docs wording, and broad runtime recovery behavior maintainer-sensitive review surfaces. (AGENTS.md:1, b74984dd5069)
  • Current main does not already contain the orphaned-activity recovery: On current main, classifySessionAttention has no state parameter and stale active model/tool activity still classifies as non-recoverable stalled active work rather than recoverable session.stuck. (src/logging/diagnostic-session-attention.ts:27, b74984dd5069)
  • Current main only wakes idle queued recovery for embedded ownership: The current heartbeat gate is named isIdleQueuedEmbeddedRunStall and requires an embedded owner, so ownerless stale model_call or tool_call activity is outside the existing idle queued recovery path. (src/logging/diagnostic.ts:542, b74984dd5069)
  • Current main lock callbacks do synchronous inspection without yielding first: The current shouldReclaim and shouldRemoveStaleLock callbacks call inspectLockPayloadForSession directly, which can invoke synchronous process-argument inspection under contention. (src/agents/session-write-lock.ts:770, b74984dd5069)
  • PR adds the recoverable orphaned-activity classifier branch: At the PR head, the classifier accepts session state and returns recoverable session.stuck for idle queued stale activity without an active embedded owner. (src/logging/diagnostic-session-attention.ts:27, 166684921612)
  • PR threads the recovery decision through heartbeat: The PR expands the idle queued recovery predicate to cover orphaned model/tool activity and passes the current state/generation into recovery, preserving the existing expected-state gate. (src/logging/diagnostic.ts:542, 166684921612)

Likely related people:

  • Shakker: Local current-main blame for the central diagnostic classifier, idle queued recovery predicate, and session-lock contention callbacks points to commit 209eadc; the checkout is shallow/grafted, so this is a routing signal rather than full authorship history. (role: recent area contributor; confidence: medium; commits: 209eadcd2d3f; files: src/logging/diagnostic.ts, src/logging/diagnostic-session-attention.ts, src/agents/session-write-lock.ts)
  • steipete: The PR timeline shows maintainer assignment and a force-push, and the current head includes Peter Steinberger's docs clarification commit on the same diagnostics behavior. (role: recent reviewer/adjacent owner; confidence: medium; commits: 166684921612; files: docs/concepts/agent-loop.md, docs/concepts/queue.md, docs/gateway/opentelemetry.md)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 26, 2026
@clawsweeper

clawsweeper Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Frosted Crabkin

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: sparkles near resolved comments.
Image traits: location status garden; accessory tiny test log scroll; palette plum, gold, and soft gray; mood patient; pose stepping out of a freshly hatched shell; shell paper lantern shell; lighting calm overcast light; background little resolved-comment flags.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Frosted Crabkin in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@steipete steipete self-assigned this May 27, 2026
samuelsoaress and others added 2 commits May 27, 2026 02:32
…oop during lock contention

Idle sessions with queued work and stale orphaned activity (model_call or
tool_call without an active embedded owner) were classified as non-recoverable
stalled active work. This left the queue wedged until the underlying timeout
unwound (up to 2 hours), blocking message delivery.

Additionally, session write lock callbacks that run synchronous process
inspection (readProcessArgsSync) now yield to the event loop between retry
attempts, preventing a single lock contention storm from starving all other
sessions.

Related openclaw#84903

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@steipete steipete force-pushed the fix/session-isolation-event-loop-yield branch from ecfb51d to 1666849 Compare May 27, 2026 01:39
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime labels May 27, 2026

Copy link
Copy Markdown
Contributor

Verification before merge:

Behavior addressed: idle queued sessions with stale ownerless model/tool activity now classify as recoverable session.stuck, and session write-lock stale-owner checks yield before synchronous process inspection.

Real environment tested: local macOS checkout plus Blacksmith Testbox.

Exact steps or command run after this patch:

  • git diff --check
  • node scripts/run-vitest.mjs src/logging/diagnostic-session-attention.test.ts src/logging/diagnostic.test.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts src/agents/session-write-lock.test.ts
  • pnpm docs:list
  • pnpm changed:lanes --json
  • pnpm check:changed
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main

Evidence after fix:

  • Focused Vitest: 5 files, 166 tests passed.
  • Blacksmith Testbox check:changed: tbx_01kskh6vxd54dd585r9e7c2mjc, GitHub Actions run 26485453993, exit 0.
  • Autoreview: no accepted/actionable findings.
  • PR CI on 1666849: CI 26485547434 success; CodeQL 26485547452 success; CodeQL Critical Quality 26485547453 success; Workflow Sanity 26485547455 success; OpenGrep PR Diff 26485547456 success.

Observed result after fix: diagnostics and recovery tests cover orphaned model_call/tool_call activity on idle queued sessions; lock tests still pass with the event-loop yield around stale process inspection.

What was not tested: no full 47-agent live replay, no WSL2/Feishu live repro, and no fresh Telegram production replay beyond the author-provided real macOS Telegram gateway proof.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants