Event-loop saturation and ACP session leak on 2026.4.27 (9d8de70) #74345

@solomonneas

Gateway event-loop pegs and ACP session lifecycle leaks on 2026.4.27 post-tag drift (commit 9d8de70)

Summary

Between a8b64b7d52 (good) and 9d8de70c20 (bad), both shipped under the 2026.4.27 tag with ~509 commits of post-tag drift between them, the gateway becomes unable to close ACP sessions during task-registry-maintenance runs. Sessions accumulate without bound, and long-running synchronous work holds the event loop for 480-490 seconds at a time. As a result, every embedded model run surfaces decision=surface_error reason=timeout, Telegram polling stalls (getUpdates stuck for 700+ s), and Discord disconnects with "gateway was not ready after 15000ms". The bot becomes effectively unreachable across all transports.

A simple systemctl --user restart openclaw-gateway does not clear it: a fresh process reproduces the leak within seconds at idle (no user activity required).

Rolling back the worktree to a8b64b7d52 and rebuilding fully resolves the issue. Both versions report OpenClaw 2026.4.27 (<short-hash>).

Reproduction

# Bad commit — symptom appears immediately on a fresh process at idle
cd ~/openclaw
git checkout 9d8de70c20
rm -rf dist && pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
journalctl --user -u openclaw-gateway -f   # watch for the symptoms below

Symptom fingerprint

Five log signals appear together within ~30 seconds of a fresh gateway start, with no user activity:

  1. ACP session-close maintenance failures looping

    [tasks/task-registry-maintenance] Failed to close orphaned parent-owned ACP session during task maintenance
    [tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance
    

    ~10 per minute, observed ~2,278 in 24 hours.

  2. Session-write-lock holds far past max

    [session-write-lock] releasing lock held for 489034ms (max=15000ms): /home/<user>/.openclaw/agents/claude/sessions/sessions.json.lock
    [session-write-lock] releasing lock held for 76908ms  (max=15000ms): /home/<user>/.openclaw/agents/main/sessions/sessions.json.lock
    

    Holds of 29-489 seconds recur; the allowed maximum is 15 s.

  3. Event loop pegged

    [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
        eventLoopDelayP99Ms=309.1 eventLoopDelayMaxMs=480767.9 eventLoopUtilization=0.993 cpuCoreRatio=1.002
        active=0 waiting=0 queued=0
    

    eventLoopDelayMaxMs is consistently 480000+ ms (≈8 min) per ~500 s window. active=0 waiting=0 queued=0 rules out backed-up agent work, so something is holding the loop synchronously.

  4. Embedded model runs all time out

    [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.5
    [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4
    

    Every model call times out, both gpt-5.5 and the gpt-5.4 fallback. OpenAI status page is green, access_token validates fine — this is local event-loop saturation, not a provider issue.

  5. Transport flapping

    [telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
    [discord] gateway was not ready after 15000ms; restarting gateway
    

    Both inbound transports flap, presumably because their heartbeat and polling timers run on the same event loop, which is blocked and cannot service them.

At the process level, the gateway node process sits at 30-77% CPU with no user activity, and the cgroup task count (Tasks) climbs without bound; I observed 1138 before manual intervention. On a8b64b7d52, idle is 12% CPU and a steady ~85 tasks.
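For anyone who wants to confirm the task-count climb on their own box, here is a rough sketch. It assumes the unit is the same systemd user service named openclaw-gateway and that task accounting is enabled for it (otherwise TasksCurrent may print "[not set]"). The function names sample_tasks and tasks_growth are mine, not part of OpenClaw.

```shell
# Print the unit's current cgroup task count once per interval, N times.
# Assumes a systemd user unit; defaults are illustrative only.
sample_tasks() {
  unit=${1:-openclaw-gateway}
  n=${2:-6}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$n" ]; do
    systemctl --user show "$unit" -p TasksCurrent --value
    i=$((i + 1))
    if [ "$i" -lt "$n" ]; then
      sleep "$interval"
    fi
  done
}

# Read one task count per line on stdin and summarize the change
# between the first and last sample.
tasks_growth() {
  awk '
    NR == 1 { first = $1 }
            { last = $1 }
    END     { if (NR) printf "first=%d last=%d growth=%d\n", first, last, last - first }
  '
}
```

Running `sample_tasks | tasks_growth` on the bad build should show a clearly positive growth value within a minute or so if the leak is present; on the good build the count should stay roughly flat.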

Verified workaround

cd ~/openclaw
git reset --hard a8b64b7d523170ffdcabb538e601c6a871d8a7a7
rm -rf dist
pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway

After ~90 seconds, all five symptoms disappear (verified by 0 maintenance failures / 0 liveness warnings / 0 long lock holds in a 90s observation window).
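The three counts above can be reproduced with something like the following. This is a minimal sketch that matches only the log strings quoted in this issue; the function name symptom_counts is hypothetical, and a lock hold is counted as "long" when the reported hold exceeds the max printed on the same line.

```shell
# Read gateway journal text on stdin and count three of the symptom
# signatures quoted in this issue.
symptom_counts() {
  awk '
    /\[tasks\/task-registry-maintenance\] Failed to close/ { maint++ }
    /\[diagnostic\] liveness warning/                      { live++ }
    /\[session-write-lock\] releasing lock held for/ {
      # Extract "held for <N>ms" and "max=<M>ms" from the same line.
      held = $0; sub(/.*held for /, "", held); sub(/ms.*/, "", held)
      max  = $0; sub(/.*max=/, "", max);       sub(/ms.*/, "", max)
      if (held + 0 > max + 0) lock++
    }
    END {
      printf "maintenance_failures=%d liveness_warnings=%d long_lock_holds=%d\n", maint, live, lock
    }
  '
}
```

Typical use would be `journalctl --user -u openclaw-gateway --since "90s ago" --no-pager | symptom_counts`, which should print all zeros on a healthy build.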

Likely culprits

git log --oneline a8b64b7..9d8de70 lists 509 commits. The highest-suspicion candidates, based on the fingerprint (session-write-lock holds + ACP session lifecycle + gateway transport), are:

  • 023d3371a5 refactor(gateway): classify gateway transport failures
  • 2b811fe6d9 fix(memory): make qmd gateway startup lazy
  • afc4f06ca3 fix(memory): isolate qmd boot refresh
  • Any change to task-registry / ACP session close paths

A bisect across that range should land it quickly given how immediately the symptom reproduces.
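With 509 commits in the range, a bisect needs roughly nine build-and-observe rounds. A sketch of how that could be automated with git bisect run follows. It assumes the maintenance-failure log line is a reliable marker of a bad build and that 90 seconds of idle observation is enough, as it was for me. The function names and the script name bisect-check.sh are hypothetical.

```shell
# Classify journal text on stdin: return 0 (good) if no ACP session-close
# maintenance failure is present, 1 (bad) if one is.
bisect_verdict() {
  if grep -q 'Failed to close .* ACP session during task maintenance'; then
    return 1
  fi
  return 0
}

# One bisect round: rebuild, restart the gateway, observe, classify.
# Exit code 125 tells git bisect run to skip a commit that cannot be tested.
bisect_step() {
  rm -rf dist
  { pnpm install && pnpm build && pnpm ui:build; } || return 125
  systemctl --user stop openclaw-gateway
  start=$(date '+%Y-%m-%d %H:%M:%S')   # only read logs from the new process
  systemctl --user start openclaw-gateway || return 125
  sleep "${OBSERVE_SECONDS:-90}"
  journalctl --user -u openclaw-gateway --since "$start" --no-pager | bisect_verdict
}

# To use: save this as bisect-check.sh with a final line calling bisect_step,
# make it executable, then from ~/openclaw run:
#   git bisect start 9d8de70c20 a8b64b7d52
#   git bisect run ./bisect-check.sh
#   git bisect reset
```

The classifier is kept separate from the build step so it can be checked against saved journal excerpts before committing to a full bisect run.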

Environment

  • OpenClaw 2026.4.27 (both commits report this version)
  • Node 22.22.2, pnpm 10.33.0
  • Linux x86_64, systemd-managed user service
  • Channels enabled: Telegram, Discord, Signal
  • ACP plugin (@zed-industries/codex-acp) and claude-agent-acp wrappers active

Additional logs / artifacts

I have ~24 hours of journal output covering the broken build, plus a side-by-side comparison against the fresh post-rollback gateway. Happy to attach a redacted excerpt or run any specific diagnostic if it would help with bisecting.
