A single stalled agent session blocks the entire Gateway event loop (isolation failure) #84903

@Sylaaaaas

Description

A single agent's stalled session (model call hung due to lock contention) blocks the entire Gateway event loop, causing all other sessions to stop processing messages. This is a session isolation failure — one agent's hang should not affect the availability of other agents or the Gateway itself.

Environment

  • OpenClaw version: 2026.5.20-beta.1 (also observed in 2026.5.19-beta.2)
  • OS: Linux (WSL2) 6.6.114.1-microsoft-standard-WSL2 x64
  • Node.js: v22.22.0
  • Channel: Feishu (飞书) group chat
  • Model: zai/glm-5-turbo

Steps to Reproduce

  1. One agent (agent-architect) spawns a subagent that enters a model call
  2. The model call hangs (likely due to session write lock contention / retry storm)
  3. The session is marked as stalled after ~6 minutes
  4. Gateway event loop reaches 100% utilization
  5. All other sessions stop processing inbound messages
  6. Other sessions' dispatches are aborted silently

Observed Behavior

Event Loop Degradation

16:44:02 liveness warning: eventLoopDelayP99Ms=2548 eventLoopUtilization=1 cpuCoreRatio=1.358
16:46:18 liveness warning: eventLoopDelayP99Ms=2514 eventLoopUtilization=1 cpuCoreRatio=1.291
16:48:37 liveness warning: eventLoopDelayP99Ms=2933 eventLoopUtilization=1 cpuCoreRatio=1.341
16:50:56 liveness warning: eventLoopDelayP99Ms=1019 eventLoopUtilization=1 cpuCoreRatio=1.288
16:53:19 liveness warning: eventLoopDelayP99Ms=19562 eventLoopMaxMs=19562 eventLoopUtilization=1
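
For anyone trying to reproduce these numbers outside the Gateway, the fields in the warnings above map naturally onto Node.js's built-in `perf_hooks` instrumentation. How the Gateway actually computes them is not shown in this issue, so the following is only a sketch: `createLivenessSampler` and the `LivenessSample` shape are hypothetical names chosen to mirror the log fields.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical sample shape; field names mirror the liveness warnings in this issue.
interface LivenessSample {
  eventLoopDelayP99Ms: number;
  eventLoopMaxMs: number;
  eventLoopUtilization: number;
}

// The delay histogram reports values in nanoseconds.
const NS_PER_MS = 1e6;

function createLivenessSampler(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  return {
    // Returns metrics for the window since the previous call, then resets the window.
    sample(): LivenessSample {
      const current = performance.eventLoopUtilization();
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;
      const result: LivenessSample = {
        eventLoopDelayP99Ms: histogram.percentile(99) / NS_PER_MS,
        eventLoopMaxMs: histogram.max / NS_PER_MS,
        // Guard against 0/0 when two samples are taken back to back.
        eventLoopUtilization: Number.isFinite(delta.utilization) ? delta.utilization : 0,
      };
      histogram.reset();
      return result;
    },
    stop(): void {
      histogram.disable();
    },
  };
}
```

Calling `sample()` on a fixed interval and logging when `eventLoopUtilization` stays near 1 would give a comparable signal. Note that a sampler driven by a timer is itself delayed when the loop is starved, which is consistent with the irregular spacing of the warnings above.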

Stalled Session

stalled session: sessionId=f9274af7 sessionKey=agent:agent-architect:sub:r1-15-skill-audit
  state=processing
  age=1016s (17 minutes!)
  reason=active_work_without_progress
  activeWorkKind=model_call

Impact on Other Sessions

  • Other agent sessions in the same Gateway could not process new messages
  • Dispatched messages returned queuedFinal=false, replies=0 (silent drop)
  • User messages were aborted before processing
  • Gateway RSS memory grew from ~1.8GB to 4.4GB during the stall

Recovery

Gateway's stuck session recovery eventually aborted the embedded run after ~17 minutes:

stuck session recovery: action=abort_embedded_run aborted=true

However, the event loop remained blocked even after recovery, and a full Gateway restart was required.
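
The roughly 17-minute gap between the stall and the abort suggests that a tighter no-progress budget per session would help. A minimal sketch of one follows, assuming the model call accepts an `AbortSignal`. `NoProgressWatchdog` is a hypothetical helper, not an existing OpenClaw API.

```typescript
// Hypothetical helper: aborts a unit of work when no progress is reported for `budgetMs`.
class NoProgressWatchdog {
  private readonly budgetMs: number;
  private readonly controller = new AbortController();
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(budgetMs: number) {
    this.budgetMs = budgetMs;
    this.arm();
  }

  // Pass this signal to the model call so it can be cancelled.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Call whenever the session makes observable progress (e.g. a streamed chunk arrives).
  reportProgress(): void {
    if (!this.controller.signal.aborted) this.arm();
  }

  // Call when the work finishes normally.
  stop(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  // (Re)starts the countdown; fires the abort if it ever runs out.
  private arm(): void {
    this.stop();
    this.timer = setTimeout(() => {
      this.controller.abort(new Error(`no progress for ${this.budgetMs}ms`));
    }, this.budgetMs);
  }
}
```

A timer-based watchdog only fires if the event loop is still turning. Given that the loop stayed blocked here even after recovery, this would likely need to be paired with a fix for the starvation itself, or run from a separate thread or process.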

Root Cause Analysis

The stalled session's lock contention appears to trigger a retry storm (session write lock -> fail -> retry -> fail). Although the API calls themselves are async, the lock acquisition/release and retry logic generate synchronous overhead on the event loop, which eventually blocks it entirely.
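
One plausible mechanism, offered as an assumption rather than a confirmed diagnosis: if each failed lock attempt resolves without waiting on real I/O or a timer, the retry loop runs entirely in the microtask queue and other sessions' callbacks never get a turn, even though every step is `async`. The sketch below demonstrates that effect in isolation. `tryAcquireLock` is a hypothetical stand-in for the contended session write lock.

```typescript
// Hypothetical lock that always fails immediately, standing in for a contended session write lock.
async function tryAcquireLock(): Promise<boolean> {
  return false;
}

// Resolves on the next check phase, letting queued I/O callbacks and immediates run first.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

// Retries up to `attempts` times and reports whether unrelated queued work
// managed to run before the retry loop finished.
async function otherWorkRanDuringRetries(
  attempts: number,
  yieldBetween: boolean,
): Promise<boolean> {
  let otherWorkRan = false;
  // Stands in for another session's queued message handler.
  setImmediate(() => {
    otherWorkRan = true;
  });

  for (let i = 0; i < attempts; i++) {
    if (await tryAcquireLock()) break;
    if (yieldBetween) await yieldToEventLoop();
  }
  return otherWorkRan;
}
```

With `yieldBetween` set to `false` the function resolves to `false` no matter how many attempts are made, because the queued handler cannot run until the microtask queue drains. With `yieldBetween` set to `true` it resolves to `true` after the first retry.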

Expected Behavior

A single agent session stall should NOT:

  1. Block the event loop (other sessions should continue normally)
  2. Cause other sessions to silently drop messages
  3. Require a full Gateway restart to recover

Suggested Fix

  1. Per-session timeout budgets: Abort a session's embedded run after N seconds of no progress, independent of the model call timeout
  2. Better async isolation: Ensure lock contention retries don't block the main event loop (use setImmediate/yield between retries)
  3. Circuit breaker: After repeated lock failures, skip the write instead of retrying
  4. Per-session resource limits: Cap CPU and memory usage per session so one runaway session can't starve others
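
Suggestions 2 and 3 could be combined along these lines. This is a sketch under stated assumptions, not the Gateway's actual lock code: `writeWithBoundedRetry`, `LockRetryOptions`, and the `tryWrite` callback are hypothetical names. It is also a bounded retry that gives up, which is simpler than a full circuit breaker with open and half-open states.

```typescript
import { setTimeout as sleep } from "node:timers/promises";

type WriteOutcome = "written" | "skipped";

interface LockRetryOptions {
  maxAttempts: number; // give up (skip the write) after this many failed attempts
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound for the exponential backoff
}

// `tryWrite` is a hypothetical callback that resolves to true when the lock
// was acquired and the session write succeeded.
async function writeWithBoundedRetry(
  tryWrite: () => Promise<boolean>,
  options: LockRetryOptions,
): Promise<WriteOutcome> {
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    if (await tryWrite()) return "written";

    const isLastAttempt = attempt === options.maxAttempts - 1;
    if (!isLastAttempt) {
      // Awaiting a timer returns control to the event loop between retries,
      // so other sessions can make progress while this one backs off.
      const delayMs = Math.min(options.maxDelayMs, options.baseDelayMs * 2 ** attempt);
      await sleep(delayMs);
    }
  }
  // Repeated failures: skip the write instead of retrying forever.
  return "skipped";
}
```

Whether skipping a session write is acceptable is a product decision, since it trades possible session-state loss for Gateway availability. The caller would at least want to log every `"skipped"` outcome.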

Metadata

Assignees

No one assigned

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
