Gateway: unhandled rejection from pi-agent-core Agent.processEvents after run abort corrupts in-memory run state

# Upstream Bug 1 — pi-agent-core lifecycle race: "Agent listener invoked outside active run"

**Repo:** github.com/openclaw/openclaw
**Suggested labels:** `bug`, `gateway`, `pi-agent-core`, `stability`
**OpenClaw version:** 2026.4.5
**Node:** 22.22.1
**OS:** macOS (Apple Silicon)

---

## Title
Gateway: unhandled promise rejection from `pi-agent-core` `Agent.processEvents` after run abort/timeout corrupts in-memory run state

## Summary
When a model request times out or is aborted mid-stream, the embedded `@mariozechner/pi-agent-core` library can fire listener callbacks **after** the run has already been closed. The check inside `Agent.processEvents` (`agent.js:388`) then throws `Error: Agent listener invoked outside active run`, which surfaces as an **unhandled promise rejection** in the gateway. This appears to corrupt the gateway's in-memory run state, and failures then cascade down the rest of the fallback chain, including models that would otherwise succeed.

We hit this repeatedly during a 24-hour cascade incident on 2026-04-06/07; the gateway only fully recovered after a restart.

## Environment
- `openclaw@2026.4.5`
- Embedded dep: `@mariozechner/pi-agent-core` (located at `/opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js`)
- macOS 14, Node 22.22.1
- LaunchAgent-managed gateway daemon

## Reproduction
1. Configure an agent with a multi-tier fallback chain (e.g. `anthropic/claude-opus-4-6` → `openai-codex/gpt-5.4` → `zai/glm-5.1`).
2. Run a heavy-context turn that triggers compaction (~92% prompt usage works reliably).
3. Force the primary model to fail or stall (we hit this naturally with a 60s idle timeout under load).
4. Observe the gateway log: a fallback handoff fires, and shortly after — sometimes **after** the next model has already started — the late listener event from the aborted primary run fires.
5. Result: unhandled rejection bubbles up, in-memory run state is corrupted, and subsequent models in the same chain fail spuriously.

We saw this **5+ times** in a 24-hour window, always immediately after timeout/abort sequences during compaction or handoff transitions.

## Stack trace (from `~/.openclaw/logs/gateway.err.log`)
```
[2026-04-07T07:31:36.765-05:00] [openclaw] Unhandled promise rejection: Error: Agent listener invoked outside active run
    at Agent.processEvents (file:///opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js:388:19)
```

Identical stack appeared at:
- 2026-04-06 14:36 CT
- 2026-04-06 15:20 CT
- 2026-04-06 15:43 CT
- 2026-04-06 15:55 CT
- 2026-04-06 16:06 CT
- 2026-04-07 07:31 CT
…etc.

## Expected behavior
Late listener events fired after a run is closed should be **dropped silently** (or logged at debug level) — they should not throw an unhandled rejection that crashes/corrupts the outer gateway run state.

## Actual behavior
- Unhandled rejection bubbles up to the gateway process
- Subsequent fallback attempts in the same chain fail (we believe due to corrupted in-memory run state — the symptom is "all models failed" even when the next chain entry would normally be healthy)
- Only a full gateway restart clears it

## Cascade evidence
After the lifecycle race fires once, we routinely see:
```
2026-04-07T08:04:38.138-05:00 Embedded agent failed before reply: All models failed (2):
  openai-codex/gpt-5.4: LLM error api_error: Internal server error (timeout)
  zai/glm-5.1: LLM error api_error: Internal server error (timeout)
```
…even though the underlying providers respond normally when queried directly with `curl`. A gateway restart immediately restores the chain.

## Suggested fix
Either of the following would address it:
1. In `pi-agent-core`'s `Agent.processEvents()`, add a guard at the top of the function: if the run is already closed, return early instead of throwing.
2. Alternatively, in the gateway, attach an `'error'` handler at the run-controller level that drops these specific errors.
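
A minimal sketch of the early-return guard from option 1. This is not `pi-agent-core`'s actual code: `SketchAgent`, `activeRunId`, `processEvent`, and the counters are hypothetical names used only to illustrate the behavior we are asking for, under the assumption that each event can be associated with the run that produced it.

```javascript
// Hypothetical stand-in for an agent with a run lifecycle.
// Names and structure are assumptions, not pi-agent-core internals.
class SketchAgent {
  constructor() {
    this.activeRunId = null;    // null means no run is currently open
    this.delivered = [];        // events that reached listeners
    this.droppedLateEvents = 0; // events ignored because their run was closed
  }

  startRun(runId) {
    this.activeRunId = runId;
  }

  endRun() {
    this.activeRunId = null;
  }

  // Returns true if the event was delivered, false if it was dropped.
  processEvent(runId, event) {
    // Guard: an event from a closed or superseded run is dropped
    // instead of throwing "Agent listener invoked outside active run".
    if (this.activeRunId === null || this.activeRunId !== runId) {
      this.droppedLateEvents += 1;
      return false;
    }
    this.delivered.push(event);
    return true;
  }
}
```

Comparing against a run identifier, rather than only checking whether some run is open, also covers the case from the reproduction where the late event arrives after the next model in the chain has already started.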

A defense-in-depth option in OpenClaw itself: register a `process.on('unhandledRejection')` handler that logs but does **not** propagate `Agent listener invoked outside active run` errors, since they are known-safe to ignore.
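
A rough sketch of what such a handler could look like. The function names and the log prefix are hypothetical, and matching on the message string is an assumption, since we do not know whether the library exposes an error code or class for this case. Note that once any `unhandledRejection` handler is registered, Node no longer applies its default handling, so this sketch rethrows everything that is not the known late-listener error.

```javascript
const LATE_LISTENER_MESSAGE = "Agent listener invoked outside active run";

// True only for the specific late-listener error seen in gateway.err.log.
function isLateListenerError(reason) {
  return reason instanceof Error && reason.message.includes(LATE_LISTENER_MESSAGE);
}

// Logs and swallows the known late-listener error; rethrows anything else
// so that unrelated rejections still fail loudly.
function onUnhandledRejection(reason, log = console) {
  if (isLateListenerError(reason)) {
    log.warn(`[gateway] dropped late listener event: ${reason.message}`);
    return "dropped";
  }
  // Throwing from inside the handler surfaces the rejection as an
  // uncaught exception instead of silently ignoring it.
  throw reason;
}

process.on("unhandledRejection", (reason) => onUnhandledRejection(reason));
```

This only stops the rejection from reaching the process level; it does not repair any run state that was already left inconsistent, which is why the guard inside `processEvents` is still the preferred fix.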

## Workaround we are using
- Manual gateway restart when the cascade is detected (clears in-memory corruption).
- A plugin-level "recovery clock" in our `fallback-router` that forces `chainIndex = 0` after 15 min of being stuck on a fallback (helps the *next* run recover even if the current one is corrupted).
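
The recovery clock can be sketched roughly as follows. Only `chainIndex` and the 15-minute window come from our plugin as described above; `applyRecoveryClock`, `fallbackSinceMs`, and the state shape are illustrative names, not the plugin's real API.

```javascript
const RECOVERY_WINDOW_MS = 15 * 60 * 1000; // 15 minutes

// Given the router state and the current time, returns the state to use
// for the next run. chainIndex 0 is the primary model; anything above 0
// is a fallback tier.
function applyRecoveryClock(state, nowMs) {
  if (state.chainIndex === 0) {
    // Already on the primary model: nothing to recover from.
    return { chainIndex: 0, fallbackSinceMs: null };
  }
  // Start the clock the first time we observe a fallback tier.
  const since = state.fallbackSinceMs ?? nowMs;
  if (nowMs - since >= RECOVERY_WINDOW_MS) {
    // Stuck on a fallback for the full window: force a retry of the primary.
    return { chainIndex: 0, fallbackSinceMs: null };
  }
  return { chainIndex: state.chainIndex, fallbackSinceMs: since };
}
```

For example, a state of `{ chainIndex: 2, fallbackSinceMs: 0 }` is left on tier 2 when evaluated 14 minutes later and reset to tier 0 when evaluated at 15 minutes or more.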

## Related
- Sister issue: gateway session-resume returns `modelApplied: true` even when the actual inference runs on a stale resumed model (filed separately).
- Full incident report: `~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md` (local; happy to share excerpts on request).


Gateway: unhandled rejection from pi-agent-core Agent.processEvents after run abort corrupts in-memory run state #63220
