# Gateway event-loop pegs and ACP session lifecycle leaks on 2026.4.27 post-tag drift (commit 9d8de70)

## Summary
Between `a8b64b7d52` (good) and `9d8de70c20` (bad) — both shipped under the `2026.4.27` tag, ~509 commits of post-tag drift — the gateway becomes unable to close ACP sessions during `task-registry-maintenance` runs. Sessions accumulate unbounded, the event loop is held for 480-490 seconds at a time by long-running synchronous work, every embedded model run surfaces `decision=surface_error reason=timeout`, Telegram polling stalls (`getUpdates` stuck for 700+s), and Discord disconnects with `gateway was not ready after 15000ms`. The bot becomes effectively unreachable across all transports.
A simple `systemctl restart openclaw-gateway` does not clear it — a fresh process reproduces the leak within seconds at idle (no user activity required).

Rolling back the worktree to `a8b64b7d52` and rebuilding fully resolves the issue. Both versions report `OpenClaw 2026.4.27 (<short-hash>)`.
## Reproduction
```shell
# Bad commit — symptom appears immediately on a fresh process at idle
cd ~/openclaw
git checkout 9d8de70c20
rm -rf dist && pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
journalctl --user -u openclaw-gateway -f   # watch for the symptoms below
```
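For triage, a single grep can match the whole fingerprint described in the next section. A minimal sketch, run here against an inline log excerpt rather than the live journal (the patterns are taken from the signals below; the sample file path is illustrative):

```shell
# Hypothetical triage filter; in real use, pipe
# `journalctl --user -u openclaw-gateway -f` into the same grep.
cat > /tmp/gateway-sample.log <<'EOF'
[tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance
[session-write-lock] releasing lock held for 489034ms (max=15000ms): /tmp/x.lock
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
[telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
[info] unrelated line
EOF
# Count lines matching any of the five symptom signatures
hits=$(grep -cE 'Failed to close|releasing lock held|liveness warning|Polling stall|not ready after' /tmp/gateway-sample.log)
echo "$hits matching symptom lines"
```

On a healthy build this filter should stay silent at idle; on the bad commit it fires within seconds.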
## Symptom fingerprint
Five log signals appear together within ~30 seconds of a fresh gateway start, with no user activity:
- **ACP session-close maintenance failures looping**

  ```
  [tasks/task-registry-maintenance] Failed to close orphaned parent-owned ACP session during task maintenance
  [tasks/task-registry-maintenance] Failed to close terminal ACP session during task maintenance
  ```

  ~10 per minute; ~2,278 observed over 24 hours.
- **Session-write-lock holds far past max**

  ```
  [session-write-lock] releasing lock held for 489034ms (max=15000ms): /home/<user>/.openclaw/agents/claude/sessions/sessions.json.lock
  [session-write-lock] releasing lock held for 76908ms (max=15000ms): /home/<user>/.openclaw/agents/main/sessions/sessions.json.lock
  ```

  Repeated holds of 29-489 seconds against an allowed max of 15s.
- **Event loop pegged**

  ```
  [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=487s
  eventLoopDelayP99Ms=309.1 eventLoopDelayMaxMs=480767.9 eventLoopUtilization=0.993 cpuCoreRatio=1.002
  active=0 waiting=0 queued=0
  ```

  `eventLoopDelayMaxMs` is consistently 480,000+ ms (≈8 min) per ~500s window. `active=0 waiting=0 queued=0` rules out backed-up agent work — something is holding the loop synchronously.
- **Embedded model runs all time out**

  ```
  [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.5
  [agent/embedded] embedded run failover decision: runId=… stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4
  ```

  Every model call times out, on both `gpt-5.5` and the `gpt-5.4` fallback. The OpenAI status page is green and the `access_token` validates fine — this is local event-loop saturation, not a provider issue.
- **Transport flapping**

  ```
  [telegram] Polling stall detected (active getUpdates stuck for 701.09s); forcing restart.
  [discord] gateway was not ready after 15000ms; restarting gateway
  ```

  Both inbound transports flap because the saturated event loop cannot service their heartbeat timers in time.
Process-level: the gateway `node` process sits at 30-77% CPU with no user activity, and the cgroup `Tasks` count climbs unbounded — 1,138 observed before manual intervention. On `a8b64b7d52`, idle is 12% CPU with a steady ~85 tasks.
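The lock-hold signal above is easy to quantify mechanically. A sketch (sample lines stand in for real journal output; `over_max` is a hypothetical helper, not part of OpenClaw) that extracts session-write-lock holds exceeding their advertised max:

```shell
# Pull "held for Nms (max=Mms)" pairs out of session-write-lock lines
# and print any hold duration that exceeded its max.
over_max() {
  grep -oE 'held for [0-9]+ms \(max=[0-9]+ms\)' \
    | sed -E 's/held for ([0-9]+)ms \(max=([0-9]+)ms\)/\1 \2/' \
    | awk '$1 > $2 { print $1 }'
}
worst=$(printf '%s\n' \
  '[session-write-lock] releasing lock held for 489034ms (max=15000ms): /tmp/a.lock' \
  '[session-write-lock] releasing lock held for 1200ms (max=15000ms): /tmp/b.lock' \
  | over_max)
echo "over-max hold: ${worst}ms"
```

Feeding a journal excerpt through the same pipeline gives the 29-489s distribution quoted above without hand-counting.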
## Verified workaround
```shell
cd ~/openclaw
git reset --hard a8b64b7d523170ffdcabb538e601c6a871d8a7a7
rm -rf dist
pnpm install && pnpm build && pnpm ui:build
systemctl --user restart openclaw-gateway
```
After ~90 seconds, all five symptoms disappear (verified by 0 maintenance failures / 0 liveness warnings / 0 long lock holds in a 90s observation window).
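The verification can be scripted in the same style. A sketch, assuming a captured excerpt of the healthy post-rollback journal (the sample file here is illustrative):

```shell
# Post-rollback check: the symptom filter over a healthy observation
# window should count zero hits.
cat > /tmp/healthy-sample.log <<'EOF'
[gateway] started
[telegram] getUpdates ok
[tasks/task-registry-maintenance] maintenance pass completed
EOF
# grep -c prints 0 and exits nonzero when nothing matches; keep going anyway
fails=$(grep -cE 'Failed to close|liveness warning|releasing lock held' /tmp/healthy-sample.log || true)
echo "symptom lines in window: $fails"
```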
## Likely culprits
`git log --oneline a8b64b7..9d8de70` spans 509 commits. Highest-suspicion candidates based on the fingerprint (session-write-lock holds + ACP session lifecycle + gateway transport):

- `023d3371a5` refactor(gateway): classify gateway transport failures
- `2b811fe6d9` fix(memory): make qmd gateway startup lazy
- `afc4f06ca3` fix(memory): isolate qmd boot refresh
- Any change to `task-registry` / ACP session-close paths
A bisect across that range should land it quickly given how immediately the symptom reproduces.
## Environment

- OpenClaw `2026.4.27` (both commits report this version)
- Node 22.22.2, pnpm 10.33.0
- Linux x86_64, systemd-managed user service
- Channels enabled: Telegram, Discord, Signal
- ACP plugin (`@zed-industries/codex-acp`) and `claude-agent-acp` wrappers active
## Additional logs / artifacts
I have ~24 hours of journal output covering the broken build, plus a side-by-side comparison against the fresh post-rollback gateway. Happy to attach a redacted excerpt or run any specific diagnostic if it would help bisecting.