Skip to content

Gateway: unhandled rejection from pi-agent-core Agent.processEvents after run abort corrupts in-memory run state #63220

@jdroblee-afk

Description

@jdroblee-afk

Upstream Bug 1 — pi-agent-core lifecycle race: "Agent listener invoked outside active run"

Repo: github.com/openclaw/openclaw
Suggested labels: bug, gateway, pi-agent-core, stability
OpenClaw version: 2026.4.5
Node: 22.22.1
OS: macOS (Apple Silicon)


Title

Gateway: unhandled promise rejection from pi-agent-core Agent.processEvents after run abort/timeout corrupts in-memory run state

Summary

When a model request times out or is aborted mid-stream, the embedded @mariozechner/pi-agent-core library can fire listener callbacks after the run has already been closed. The check inside Agent.processEvents (agent.js:388) raises Error: Agent listener invoked outside active run, which surfaces as an unhandled promise rejection in the gateway. This corrupts the gateway's in-memory run state and cascades failures down any remaining fallback chain — even models that would otherwise succeed.

We hit this repeatedly during a 24-hour cascade incident on 2026-04-06/07; the gateway only fully recovered after a restart.

Environment

  • openclaw@2026.4.5
  • Embedded dep: @mariozechner/pi-agent-core (located at /opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js)
  • macOS 14, Node 22.22.1
  • LaunchAgent-managed gateway daemon

Reproduction

  1. Configure an agent with a multi-tier fallback chain (e.g. anthropic/claude-opus-4-6openai-codex/gpt-5.4zai/glm-5.1).
  2. Run a heavy-context turn that triggers compaction (~92% prompt usage works reliably).
  3. Force the primary model to fail or stall (we hit this naturally with a 60s idle timeout under load).
  4. Observe the gateway log: a fallback handoff fires, and shortly after — sometimes after the next model has already started — the late listener event from the aborted primary run fires.
  5. Result: unhandled rejection bubbles up, in-memory run state is corrupted, and subsequent models in the same chain fail spuriously.

We saw this 5+ times in a 24-hour window, always immediately after timeout/abort sequences during compaction or handoff transitions.

Stack trace (from ~/.openclaw/logs/gateway.err.log)

[2026-04-07T07:31:36.765-05:00] [openclaw] Unhandled promise rejection: Error: Agent listener invoked outside active run
    at Agent.processEvents (file:///opt/homebrew/lib/node_modules/openclaw/node_modules/@mariozechner/pi-agent-core/dist/agent.js:388:19)

Identical stack appeared at:

  • 2026-04-06 14:36 CT
  • 2026-04-06 15:20 CT
  • 2026-04-06 15:43 CT
  • 2026-04-06 15:55 CT
  • 2026-04-06 16:06 CT
  • 2026-04-07 07:31 CT
    …etc.

Expected behavior

Late listener events fired after a run is closed should be dropped silently (or logged at debug level) — they should not throw an unhandled rejection that crashes/corrupts the outer gateway run state.

Actual behavior

  • Unhandled rejection bubbles up to the gateway process
  • Subsequent fallback attempts in the same chain fail (we believe due to corrupted in-memory run state — the symptom is "all models failed" even when the next chain entry would normally be healthy)
  • Only a full gateway restart clears it

Cascade evidence

After the lifecycle race fires once, we routinely see:

2026-04-07T08:04:38.138-05:00 Embedded agent failed before reply: All models failed (2):
  openai-codex/gpt-5.4: LLM error api_error: Internal server error (timeout)
  zai/glm-5.1: LLM error api_error: Internal server error (timeout)

…even when the underlying providers are independently healthy when curled directly. Gateway restart immediately restores the chain.

Suggested fix

In pi-agent-core's Agent.processEvents():

  1. Add a guard at the top of the function: if the run is already closed, return early instead of throwing.
  2. Or, attach an 'error' handler in the gateway that drops these specific errors at the run-controller level.

A defense-in-depth option in OpenClaw itself: register a process.on('unhandledRejection') handler that logs but does not propagate Agent listener invoked outside active run errors, since they are known-safe to ignore.

Workaround we are using

  • Manual gateway restart when the cascade is detected (clears in-memory corruption).
  • A plugin-level "recovery clock" in our fallback-router that forces chainIndex = 0 after 15 min of being stuck on a fallback (helps the next run recover even if the current one is corrupted).

Related

  • Sister issue: gateway session-resume returns modelApplied: true even when the actual inference runs on a stale resumed model (filed separately).
  • Full incident report: ~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md (local; happy to share excerpts on request).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions