[Bug]: Gateway liveness reports severe event-loop stalls under subagent load #82936

@galiniliev

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Gateway diagnostics reported repeated multi-second event-loop stalls while several agent/subagent runs were active.

Steps to reproduce

  1. Start the gateway with diagnostics enabled.
  2. Run a workload with several active agent/subagent sessions.
  3. Observe gateway liveness warnings reporting multi-second event-loop delay while active work is present.

Expected behavior

The gateway diagnostic event path should not monopolize the main event loop during bursts from concurrent agent/subagent activity.

Actual behavior

The captured gateway log contains repeated liveness warnings with event-loop delay measured in seconds, including samples with active agent/subagent work and queued work.

OpenClaw version

NOT_ENOUGH_INFO

Operating system

NOT_ENOUGH_INFO

Install method

pnpm dev

Model

NOT_ENOUGH_INFO

Provider / routing chain

NOT_ENOUGH_INFO

Additional provider/model setup details

NOT_ENOUGH_INFO

Logs, screenshots, and evidence

Trace/proof:
- gateway-dev.log:10274
  "liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=30s eventLoopDelayP99Ms=6123.7 eventLoopDelayMaxMs=7709.1 eventLoopUtilization=0.975 cpuCoreRatio=0.983 active=5 waiting=0 queued=1 ..."
- gateway-dev.log:11542
  "liveness warning: reasons=event_loop_delay,event_loop_utilization interval=329s eventLoopDelayP99Ms=12750.7 eventLoopDelayMaxMs=12750.7 eventLoopUtilization=1 ..."
- gateway-dev.log:4190
  "liveness warning: reasons=event_loop_delay,cpu interval=30s eventLoopDelayP99Ms=1780.5 eventLoopDelayMaxMs=2128.6 eventLoopUtilization=0.892 cpuCoreRatio=0.951 active=6 ..."
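
To triage lines like the ones above, it can help to pull the `key=value` fields out into numbers. The following is a minimal sketch, assuming the liveness line format is exactly what the quoted samples show; `parseLivenessLine` and `LivenessSample` are hypothetical names, not part of the gateway source.

```typescript
// Hypothetical helper: parses one "liveness warning:" log line into numeric fields.
// Assumes space-separated key=value tokens, as in the quoted log samples.
interface LivenessSample {
  reasons: string[];
  fields: Record<string, number>;
}

function parseLivenessLine(line: string): LivenessSample | null {
  const marker = "liveness warning:";
  const start = line.indexOf(marker);
  if (start === -1) return null; // not a liveness line

  const reasons: string[] = [];
  const fields: Record<string, number> = {};

  for (const token of line.slice(start + marker.length).split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq <= 0) continue; // skips empty tokens and the elided "..." tail

    const key = token.slice(0, eq);
    const raw = token.slice(eq + 1);

    if (key === "reasons") {
      reasons.push(...raw.split(","));
      continue;
    }

    // parseFloat drops a trailing unit suffix, so "30s" becomes 30.
    const value = Number.parseFloat(raw);
    if (!Number.isNaN(value)) fields[key] = value;
  }

  return { reasons, fields };
}
```

Filtering the parsed samples on `fields.eventLoopDelayP99Ms` then gives a quick count of how many intervals crossed a chosen threshold.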

Impact and severity

Affected: Gateway users running concurrent agent/subagent workloads with diagnostics enabled.
Severity: High, because seconds-long event-loop stalls can delay polling, streaming, queue handling, and cleanup.
Frequency: 154 captured liveness lines matched eventLoopDelayP99Ms= in the observed log.
Consequence: Gateway responsiveness degrades while active agent/subagent work is running.

Additional information

The implicated source path is the diagnostic liveness and diagnostic event dispatch path. High-frequency diagnostic events were deferred, but the async queue drained the whole backlog in a single setImmediate turn, which can starve other gateway work during concurrent agent/subagent bursts.
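
One possible mitigation for the drain behavior described above is to cap how many events are handled per `setImmediate` turn and reschedule the remainder, so other gateway work can run between batches. The sketch below illustrates that idea only; `BatchedQueue`, `maxBatch`, and the injectable `schedule` parameter are hypothetical names and do not reflect the actual gateway implementation or any linked fix.

```typescript
// Hypothetical sketch of a bounded-batch drain. Not taken from the gateway source.
type Scheduler = (fn: () => void) => void;

class BatchedQueue<T> {
  private backlog: T[] = [];
  private scheduled = false;
  private readonly handle: (item: T) => void;
  private readonly maxBatch: number;
  private readonly schedule: Scheduler;

  constructor(
    handle: (item: T) => void,
    maxBatch: number = 64, // assumed cap; tune against measured event-loop delay
    schedule: Scheduler = (fn) => { setImmediate(fn); },
  ) {
    this.handle = handle;
    this.maxBatch = maxBatch;
    this.schedule = schedule;
  }

  get size(): number {
    return this.backlog.length;
  }

  push(item: T): void {
    this.backlog.push(item);
    this.ensureScheduled();
  }

  private ensureScheduled(): void {
    if (this.scheduled) return; // at most one pending drain turn
    this.scheduled = true;
    this.schedule(() => this.drainOnce());
  }

  private drainOnce(): void {
    this.scheduled = false;
    // Take at most maxBatch items; the rest wait for a later turn.
    const batch = this.backlog.splice(0, this.maxBatch);
    for (const item of batch) {
      try {
        this.handle(item);
      } catch {
        // A throwing listener should not block the remaining events.
      }
    }
    // Yield back to the event loop before handling the remainder.
    if (this.backlog.length > 0) this.ensureScheduled();
  }
}
```

With the default scheduler, each batch runs in its own `setImmediate` turn, which lets pending I/O callbacks and timers run between batches instead of waiting for the whole backlog. A time budget per turn would be a reasonable alternative to a fixed item count.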

Metadata

Assignees

No one assigned

    Labels

    - P1: High-priority user-facing bug, regression, or broken workflow.
    - bug: Something isn't working
    - clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
    - clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
    - clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
    - impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
    - maintainer: Maintainer-authored PR