Bug Summary
When an embedded run times out but the underlying provider promise never settles (e.g., dead HTTP connection, hung stream), the run handle stays in ACTIVE_EMBEDDED_RUNS permanently. This silently kills all subsequent heartbeat deliveries for the session.
Steps to Reproduce
- Configure a heartbeat sharing a session with the main agent (default behavior)
- Trigger a condition where the provider connection hangs (e.g., misconfigured endpoint, network interruption)
- Wait for the embedded run timeout to fire
- Observe that ACTIVE_EMBEDDED_RUNS still contains the zombie handle after timeout
- All subsequent heartbeat ticks are silently dropped
Root Cause Analysis
In src/agents/pi-embedded-runner/run/attempt.ts, clearActiveEmbeddedRun is placed in a finally block:
    } finally {
      clearTimeout(abortTimer);
      clearActiveEmbeddedRun(params.sessionId, queueHandle, params.sessionKey);
    }
However, this finally only executes when the await abortable(activeSession.prompt(...)) promise resolves or rejects. If the abort signal fires but the provider stream/promise never settles, the finally block never runs.
The abort timer fires and calls abortRun(true), but if the underlying HTTP client does not honor the abort (or the connection is in a state where it cannot be interrupted), the promise hangs indefinitely.
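The failure mode can be reproduced in isolation. The sketch below is a minimal model, not the code in attempt.ts: `runAttempt`, `neverSettles`, and the `AttemptState` flags are hypothetical names chosen for illustration. It shows that a `finally` wrapped around an `await` is skipped for as long as the awaited promise stays pending, even after the abort timer has fired.

```typescript
// Minimal model of the hang. All names are illustrative, not from the codebase.
interface AttemptState {
  aborted: boolean;
  cleanedUp: boolean;
}

// Stands in for a provider call that ignores the abort signal and never settles.
function neverSettles(_signal: AbortSignal): Promise<void> {
  return new Promise<void>(() => {
    /* intentionally never resolves or rejects */
  });
}

async function runAttempt(state: AttemptState, timeoutMs: number): Promise<void> {
  const controller = new AbortController();
  const abortTimer = setTimeout(() => {
    controller.abort();
    state.aborted = true;
  }, timeoutMs);
  try {
    await neverSettles(controller.signal);
  } finally {
    // Only reached once the awaited promise settles, which never happens here.
    clearTimeout(abortTimer);
    state.cleanedUp = true;
  }
}

const state: AttemptState = { aborted: false, cleanedUp: false };
void runAttempt(state, 10);

// Well after the timeout: the abort fired, but the cleanup did not run.
setTimeout(() => {
  console.log(state); // { aborted: true, cleanedUp: false }
}, 50);
```

In the real runner the equivalent of `cleanedUp` is the call to clearActiveEmbeddedRun, so the handle stays registered for the lifetime of the process.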
In src/auto-reply/reply/queue-policy.ts:
    export function resolveActiveRunQueueAction(params) {
      if (!params.isActive) return "run-now";
      if (params.isHeartbeat) return "drop"; // heartbeat silently killed!
      // ...
    }
Since the zombie handle keeps isEmbeddedPiRunActive(sessionId) returning true, every heartbeat tick hits "drop" and exits without any log message.
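To make the interaction concrete, here is a simplified model of the registry and the queue policy together. It is a sketch under assumptions: `activeRuns`, `isRunActive`, `resolveAction`, and `heartbeatTick` are stand-ins for the real functions, and the `"enqueue"` action is only a placeholder for the branches elided in the snippet above.

```typescript
// "enqueue" is a placeholder; the real non-heartbeat branches are not shown in the issue.
type QueueAction = "run-now" | "drop" | "enqueue";

// Stand-in for ACTIVE_EMBEDDED_RUNS: sessionId -> run handle.
const activeRuns = new Map<string, { runId: string }>();

function isRunActive(sessionId: string): boolean {
  return activeRuns.has(sessionId);
}

function resolveAction(params: { isActive: boolean; isHeartbeat: boolean }): QueueAction {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop"; // no log line is emitted on this path
  return "enqueue";
}

function heartbeatTick(sessionId: string): QueueAction {
  return resolveAction({ isActive: isRunActive(sessionId), isHeartbeat: true });
}

// A run registers itself and then hangs, so its entry is never cleared.
activeRuns.set("50d88d7d", { runId: "620fd6bf" });

// Every tick while the zombie entry exists is dropped.
const whileZombie: QueueAction[] = [1, 2, 3].map(() => heartbeatTick("50d88d7d"));

// Once the entry is removed, the next tick runs immediately.
activeRuns.delete("50d88d7d");
const afterClear: QueueAction = heartbeatTick("50d88d7d");

console.log(whileZombie, afterClear); // [ 'drop', 'drop', 'drop' ] run-now
```

Because the drop path is silent, nothing in the logs distinguishes a session with a zombie handle from one whose heartbeat is simply idle.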
Observed Behavior
From gateway logs:
2026-03-21T13:08:05 [agent/embedded] embedded run timeout: runId=620fd6bf sessionId=50d88d7d timeoutMs=600000
2026-03-21T13:08:13 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
The zombie run persisted for 10+ hours. During that period, no heartbeat was delivered.
Suggested Fix
After the abort timer fires and a grace period elapses (e.g., 30-60s), forcibly remove the handle:
    // In the abort timer callback, after abortRun(true):
    setTimeout(() => {
      if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
        ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
        notifyEmbeddedRunEnded(params.sessionId);
        log.warn(`force-cleared zombie run: sessionId=${params.sessionId}`);
      }
    }, ZOMBIE_CLEANUP_GRACE_MS); // e.g., 30_000
Alternatively, waitForActiveEmbeddedRuns could forcibly clear runs that have exceeded their timeout.
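That alternative could look roughly like the sweep below. This is a sketch under stated assumptions, not a patch: the issue does not show what a run handle stores, so the `RunHandle` type with its `startedAt` and `timeoutMs` fields, the `sweepExpiredRuns` function, and the `graceMs` parameter are all hypothetical. The current time is passed in so the logic can be exercised without real timers.

```typescript
// Hypothetical handle shape; the real handle may not carry these fields.
interface RunHandle {
  runId: string;
  startedAt: number; // epoch ms when the run began (assumed)
  timeoutMs: number; // the run's configured timeout (assumed)
}

// Removes every run whose timeout plus grace period has elapsed and
// returns the session ids that were cleared.
function sweepExpiredRuns(
  runs: Map<string, RunHandle>,
  now: number,
  graceMs: number,
  onCleared: (sessionId: string, handle: RunHandle) => void = () => {},
): string[] {
  const cleared: string[] = [];
  // Deleting the entry currently being visited is safe during Map iteration.
  runs.forEach((handle, sessionId) => {
    const deadline = handle.startedAt + handle.timeoutMs + graceMs;
    if (now >= deadline) {
      runs.delete(sessionId);
      onCleared(sessionId, handle);
      cleared.push(sessionId);
    }
  });
  return cleared;
}

const runs = new Map<string, RunHandle>([
  ["hung", { runId: "a", startedAt: 0, timeoutMs: 600_000 }],
  ["healthy", { runId: "b", startedAt: 600_000, timeoutMs: 600_000 }],
]);

// At t = 630s the hung run is past timeout + 30s grace; the healthy one is not.
const cleared = sweepExpiredRuns(runs, 630_000, 30_000);
console.log(cleared, runs.has("healthy")); // [ 'hung' ] true
```

The `onCleared` callback is where a warning log and a call such as notifyEmbeddedRunEnded would go, so that a forced cleanup is never silent.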
Impact
- Heartbeat failure causes session to appear dead
- No auto-recovery without gateway restart
- Silent failure - very difficult to diagnose
Labels
bug, regression, embedded-run, heartbeat