# Gateway event-loop pegs and ACP session lifecycle leaks on 2026.4.27 post-tag drift (commit 9d8de70)

## Summary
Between `a8b64b7d52` (good) and `9d8de70c20` (bad) — both shipped under the `2026.4.27` tag, ~509 commits of post-tag drift — the gateway becomes unable to close ACP sessions during `task-registry-maintenance` runs. Sessions accumulate unbounded, the event loop is held for 480-490 seconds at a time by long-running synchronous work, every embedded model run surfaces `decision=surface_error reason=timeout`, Telegram polling stalls (`getUpdates` stuck for 700+s), and Discord disconnects with `gateway was not ready after 15000ms`. The bot becomes effectively unreachable across all transports.
A simple `systemctl restart openclaw-gateway` does not clear it — a fresh process reproduces the leak within seconds at idle (no user activity required).

Rolling back the worktree to `a8b64b7d52` and rebuilding fully resolves the issue. Both versions report `OpenClaw 2026.4.27 (<short-hash>)`.
## Reproduction
```shell
# Bad commit — symptom appears immediately on a fresh process at idle
cd ~/openclaw
git checkout 9d8de70c20
rm -rf dist && pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
journalctl --user -u openclaw-gateway -f   # watch for the symptoms below
```
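For triage, a single grep can match the whole fingerprint described in the next section. A minimal sketch, run here against an inline log excerpt rather than the live journal (the patterns are taken from the signals below; the sample file path is illustrative):

```shell
# Hypothetical triage filter; in real use, pipe
# `journalctl --user -u openclaw-gateway -f` into the same grep.
cat > /tmp/gateway-sample.log <<'EOF'
[tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance
[session-write-lock] releasing lock held for 489034ms (max=15000ms): /tmp/x.lock
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
[telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
[info] unrelated line
EOF
# Count lines matching any of the five symptom signatures
hits=$(grep -cE 'Failed to close|releasing lock held|liveness warning|Polling stall|not ready after' /tmp/gateway-sample.log)
echo "$hits matching symptom lines"
```

On a healthy build this filter should stay silent at idle; on the bad commit it fires within seconds.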
## Symptom fingerprint
Five log signals appear together within ~30 seconds of a fresh gateway start, with no user activity:
- **ACP session-close maintenance failures looping**

  ```
  [tasks/task-registry-maintenance] Failed to close orphaned parent-owned ACP session during task maintenance
  [tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance
  ```

  ~10 per minute; ~2,278 observed over 24 hours.
- **Session-write-lock holds far past max**

  ```
  [session-write-lock] releasing lock held for 489034ms (max=15000ms): /home/<user>/.openclaw/agents/claude/sessions/sessions.json.lock
  [session-write-lock] releasing lock held for 76908ms (max=15000ms): /home/<user>/.openclaw/agents/main/sessions/sessions.json.lock
  ```

  Repeated holds of 29-489 seconds against an allowed max of 15s.
- **Event loop pegged**

  ```
  [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
  eventLoopDelayP99Ms=309.1 eventLoopDelayMaxMs=480767.9 eventLoopUtilization=0.993 cpuCoreRatio=1.002
  active=0 waiting=0 queued=0
  ```

  `eventLoopDelayMaxMs` is consistently 480,000+ ms (≈8 min) per ~500s window. `active=0 waiting=0 queued=0` rules out backed-up agent work — something is holding the loop synchronously.
- **Embedded model runs all time out**

  ```
  [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.5
  [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4
  ```

  Every model call times out, on both `gpt-5.5` and the `gpt-5.4` fallback. The OpenAI status page is green and the `access_token` validates fine — this is local event-loop saturation, not a provider issue.
- **Transport flapping**

  ```
  [telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
  [discord] gateway was not ready after 15000ms; restarting gateway
  ```

  Both inbound transports flap because the saturated event loop cannot service their heartbeat timers in time.
Process-level: the gateway `node` process sits at 30-77% CPU with no user activity, and the cgroup `Tasks` count climbs unbounded — 1,138 observed before manual intervention. On `a8b64b7d52`, idle is 12% CPU with a steady ~85 tasks.
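The lock-hold signal above is easy to quantify mechanically. A sketch (sample lines stand in for real journal output; `over_max` is a hypothetical helper, not part of OpenClaw) that extracts session-write-lock holds exceeding their advertised max:

```shell
# Pull "held for Nms (max=Mms)" pairs out of session-write-lock lines
# and print any hold duration that exceeded its max.
over_max() {
  grep -oE 'held for [0-9]+ms \(max=[0-9]+ms\)' \
    | sed -E 's/held for ([0-9]+)ms \(max=([0-9]+)ms\)/\1 \2/' \
    | awk '$1 > $2 { print $1 }'
}
worst=$(printf '%s\n' \
  '[session-write-lock] releasing lock held for 489034ms (max=15000ms): /tmp/a.lock' \
  '[session-write-lock] releasing lock held for 1200ms (max=15000ms): /tmp/b.lock' \
  | over_max)
echo "over-max hold: ${worst}ms"
```

Feeding a journal excerpt through the same pipeline gives the 29-489s distribution quoted above without hand-counting.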
## Verified workaround
```shell
cd ~/openclaw
git reset --hard a8b64b7d523170ffdcabb538e601c6a871d8a7a7
rm -rf dist
pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
```
After ~90 seconds, all five symptoms disappear (verified by 0 maintenance failures / 0 liveness warnings / 0 long lock holds in a 90s observation window).
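The verification can be scripted in the same style. A sketch, assuming a captured excerpt of the healthy post-rollback journal (the sample file here is illustrative):

```shell
# Post-rollback check: the symptom filter over a healthy observation
# window should count zero hits.
cat > /tmp/healthy-sample.log <<'EOF'
[gateway] started
[telegram] getUpdates ok
[tasks/task-registry-maintenance] maintenance pass completed
EOF
# grep -c prints 0 and exits nonzero when nothing matches; keep going anyway
fails=$(grep -cE 'Failed to close|liveness warning|releasing lock held' /tmp/healthy-sample.log || true)
echo "symptom lines in window: $fails"
```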
## Likely culprits
`git log --oneline a8b64b7..9d8de70` spans 509 commits. Highest-suspicion candidates based on the fingerprint (session-write-lock holds + ACP session lifecycle + gateway transport):

- `023d3371a5` refactor(gateway): classify gateway transport failures
- `2b811fe6d9` fix(memory): make qmd gateway startup lazy
- `afc4f06ca3` fix(memory): isolate qmd boot refresh
- Any change to `task-registry` / ACP session-close paths
A bisect across that range should land it quickly given how immediately the symptom reproduces.
## Environment

- OpenClaw `2026.4.27` (both commits report this version)
- Node 22.22.2, pnpm 10.33.0
- Linux x86_64, systemd-managed user service
- Channels enabled: Telegram, Discord, Signal
- ACP plugin (`@zed-industries/codex-acp`) and `claude-agent-acp` wrappers active
## Additional logs / artifacts
I have ~24 hours of journal output covering the broken build, plus a side-by-side comparison against the fresh post-rollback gateway. Happy to attach a redacted excerpt or run any specific diagnostic if it would help bisecting.