Upstream Bug 1 — pi-agent-core lifecycle race: "Agent listener invoked outside active run"
Repo: github.com/openclaw/openclaw
Suggested labels: bug, gateway, pi-agent-core, stability
OpenClaw version: 2026.4.5
Node: 22.22.1
OS: macOS (Apple Silicon)
Title
Gateway: unhandled promise rejection from pi-agent-core Agent.processEvents after run abort/timeout corrupts in-memory run state
Summary
When a model request times out or is aborted mid-stream, the embedded @mariozechner/pi-agent-core library can fire listener callbacks after the run has already been closed. The check inside Agent.processEvents (agent.js:388) raises Error: Agent listener invoked outside active run, which surfaces as an unhandled promise rejection in the gateway. This corrupts the gateway's in-memory run state and cascades failures down any remaining fallback chain — even models that would otherwise succeed.
We hit this repeatedly during a 24-hour cascade incident on 2026-04-06/07; the gateway only fully recovered after a restart.
Environment
- openclaw@2026.4.5
- Embedded dep: @mariozechner/pi-agent-core (located at /opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js)
- macOS 14, Node 22.22.1
- LaunchAgent-managed gateway daemon
Reproduction
- Configure an agent with a multi-tier fallback chain (e.g. anthropic/claude-opus-4-6 → openai-codex/gpt-5.4 → zai/glm-5.1).
- Run a heavy-context turn that triggers compaction (~92% prompt usage works reliably).
- Force the primary model to fail or stall (we hit this naturally with a 60s idle timeout under load).
- Observe the gateway log: a fallback handoff fires, and shortly after — sometimes after the next model has already started — the late listener event from the aborted primary run fires.
- Result: unhandled rejection bubbles up, in-memory run state is corrupted, and subsequent models in the same chain fail spuriously.
We saw this 5+ times in a 24-hour window, always immediately after timeout/abort sequences during compaction or handoff transitions.
Stack trace (from ~/.openclaw/logs/gateway.err.log)
[2026-04-07T07:31:36.765-05:00] [openclaw] Unhandled promise rejection: Error: Agent listener invoked outside active run
at Agent.processEvents (file:///opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js:388:19)
Identical stack appeared at:
- 2026-04-06 14:36 CT
- 2026-04-06 15:20 CT
- 2026-04-06 15:43 CT
- 2026-04-06 15:55 CT
- 2026-04-06 16:06 CT
- 2026-04-07 07:31 CT
…etc.
Expected behavior
Late listener events fired after a run is closed should be dropped silently (or logged at debug level) — they should not throw an unhandled rejection that crashes/corrupts the outer gateway run state.
Actual behavior
- Unhandled rejection bubbles up to the gateway process
- Subsequent fallback attempts in the same chain fail (we believe due to corrupted in-memory run state — the symptom is "all models failed" even when the next chain entry would normally be healthy)
- Only a full gateway restart clears it
Cascade evidence
After the lifecycle race fires once, we routinely see:
2026-04-07T08:04:38.138-05:00 Embedded agent failed before reply: All models failed (2):
openai-codex/gpt-5.4: LLM error api_error: Internal server error (timeout)
zai/glm-5.1: LLM error api_error: Internal server error (timeout)
…even when the underlying providers are independently healthy when curled directly. Gateway restart immediately restores the chain.
Suggested fix
In pi-agent-core's Agent.processEvents():
- Add a guard at the top of the function: if the run is already closed, return early instead of throwing.
- Or, attach an 'error' handler in the gateway that drops these specific errors at the run-controller level.
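A minimal sketch of the early-return guard, assuming events can be tagged with the run they belong to. We have not read how pi-agent-core tracks runs internally, so RunGuard, activeRunId, and dispatch are hypothetical names for illustration, not the library's API.

```javascript
// Hypothetical stand-in for the run bookkeeping inside pi-agent-core.
// RunGuard, activeRunId and dispatch are illustrative names, not the real API.
class RunGuard {
  constructor(log = () => {}) {
    this.activeRunId = null; // null means no run is active
    this.log = log;
    this.dropped = 0; // count of late events dropped so far
  }

  startRun(runId) {
    this.activeRunId = runId;
  }

  closeRun(runId) {
    // Only the owner of the active run may close it.
    if (this.activeRunId === runId) this.activeRunId = null;
  }

  // Deliver an event only while the run it belongs to is still active.
  // Returns true if delivered, false if dropped as a late event.
  dispatch(runId, event, listener) {
    if (this.activeRunId !== runId) {
      this.dropped += 1;
      this.log(`dropping late event "${event.type}" for closed run ${runId}`);
      return false; // early return instead of throwing
    }
    listener(event);
    return true;
  }
}

// Usage: the primary run is closed, a fallback starts, then a late event arrives.
const guard = new RunGuard();
const received = [];
guard.startRun("primary");
guard.dispatch("primary", { type: "delta" }, (e) => received.push(e.type));
guard.closeRun("primary");
guard.startRun("fallback");
const delivered = guard.dispatch("primary", { type: "delta" }, (e) => received.push(e.type));
```

The sketch compares run identifiers rather than checking a single "closed" flag because of the case described in the reproduction steps, where the late event fires after the next model has already started. A boolean check would see an active run and hand the stale event to the wrong run.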
A defense-in-depth option in OpenClaw itself: register a process.on('unhandledRejection') handler that logs but does not propagate Agent listener invoked outside active run errors; we believe these are safe to ignore because the run they target is already closed.
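A minimal sketch of such a filter, using only the standard Node process 'unhandledRejection' event. The helper names (isLateListenerError, handleUnhandledRejection) are ours, and the "[openclaw] Unhandled promise rejection" prefix in the log suggests the gateway may already register a handler of its own, in which case only the filtering step would apply.

```javascript
const LATE_LISTENER_MESSAGE = "Agent listener invoked outside active run";

// True only for the specific late-listener error seen in the stack trace.
function isLateListenerError(reason) {
  return reason instanceof Error && reason.message.includes(LATE_LISTENER_MESSAGE);
}

// Logs and swallows the known late-listener error; rethrows everything else.
// Hypothetical helper, not part of OpenClaw.
function handleUnhandledRejection(reason, log = console.warn) {
  if (isLateListenerError(reason)) {
    log(`[gateway] ignoring late listener rejection: ${reason.message}`);
    return "ignored";
  }
  // Rethrowing from the handler turns the rejection into an uncaught
  // exception, so unrelated failures are not silently swallowed.
  throw reason;
}

process.on("unhandledRejection", (reason) => handleUnhandledRejection(reason));
```

Matching on the message string is brittle if the library ever rewords the error, so this is best treated as a stopgap until the guard lands upstream. The rethrow matters because registering any 'unhandledRejection' handler replaces Node's default behavior for every rejection, not just this one.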
Workaround we are using
- Manual gateway restart when the cascade is detected (clears in-memory corruption).
- A plugin-level "recovery clock" in our fallback-router that forces chainIndex = 0 after 15 min of being stuck on a fallback (helps the next run recover even if the current one is corrupted).
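The recovery clock can be sketched roughly as follows. This is a simplified illustration rather than our plugin's actual code: RecoveryClock, moveTo, and nextStartIndex are hypothetical names, and the clock is injectable so the 15-minute limit can be exercised without waiting.

```javascript
const STUCK_LIMIT_MS = 15 * 60 * 1000; // 15 minutes, as in the workaround

// Hypothetical router state: tracks the current chain entry and the time at
// which the router first left the primary (index 0).
class RecoveryClock {
  constructor(now = Date.now) {
    this.now = now;
    this.chainIndex = 0;
    this.fallbackSince = null; // null while on the primary
  }

  // Record that the router moved to chain entry `index`.
  moveTo(index) {
    if (index === 0) {
      this.fallbackSince = null;
    } else if (this.fallbackSince === null) {
      this.fallbackSince = this.now(); // start the clock on first fallback
    }
    this.chainIndex = index;
  }

  // Call before each run; resets to the primary once the limit has elapsed.
  nextStartIndex() {
    const stuck =
      this.fallbackSince !== null &&
      this.now() - this.fallbackSince >= STUCK_LIMIT_MS;
    if (stuck) this.moveTo(0);
    return this.chainIndex;
  }
}

// Usage with a fake clock (milliseconds).
let t = 0;
const clock = new RecoveryClock(() => t);
clock.moveTo(1); // primary failed, now on the first fallback
t = 14 * 60 * 1000;
const before = clock.nextStartIndex(); // still on the fallback
t = 15 * 60 * 1000;
const after = clock.nextStartIndex(); // forced back to the primary
```

The clock starts when the router first leaves the primary and keeps running across moves between fallbacks, so hopping from one fallback to another does not postpone the reset.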
Related
- Sister issue: gateway session-resume returns modelApplied: true even when the actual inference runs on a stale resumed model (filed separately).
- Full incident report: ~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md (local; happy to share excerpts on request).