[Bug]: Embedded run timeout leaves zombie handle blocking heartbeat delivery #52231

@ai-nurmamat

Bug Summary

When an embedded run times out but the underlying provider promise never settles (e.g., dead HTTP connection, hung stream), the run handle stays in ACTIVE_EMBEDDED_RUNS permanently. This silently kills all subsequent heartbeat deliveries for the session.

Steps to Reproduce

  1. Configure a heartbeat that shares a session with the main agent (the default behavior)
  2. Trigger a condition where the provider connection hangs (e.g., misconfigured endpoint, network interruption)
  3. Wait for the embedded run timeout to fire
  4. Observe that ACTIVE_EMBEDDED_RUNS still contains the zombie handle after timeout
  5. All subsequent heartbeat ticks are silently dropped

Root Cause Analysis

In src/agents/pi-embedded-runner/run/attempt.ts, clearActiveEmbeddedRun is placed in a finally block:

} finally {
    clearTimeout(abortTimer);
    clearActiveEmbeddedRun(params.sessionId, queueHandle, params.sessionKey);
}

However, this finally block runs only once the awaited abortable(activeSession.prompt(...)) promise resolves or rejects. If the abort signal fires but the provider stream/promise never settles, the await never returns and the finally block never runs.

The abort timer fires and calls abortRun(true), but if the underlying HTTP client does not honor the abort (or the connection is in a state where it cannot be interrupted), the promise hangs indefinitely.
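The hang described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: withHardDeadline, hungProviderCall, activeRuns, and runAttempt are hypothetical names, and the Set merely stands in for ACTIVE_EMBEDDED_RUNS. It shows that racing the awaited work against a deadline promise guarantees the await settles, so the finally block runs even when the provider promise never does.

```typescript
// Hypothetical helper (not from the codebase): rejects if `work` has not
// settled within `ms`, so a caller's `finally` is guaranteed to run.
async function withHardDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`hard deadline of ${ms}ms exceeded`)),
      ms,
    );
  });
  try {
    // Settles as soon as either promise settles.
    return await Promise.race([work, deadline]);
  } finally {
    // Avoid a stray rejection if `work` finished first.
    clearTimeout(timer);
  }
}

// Stand-in for a provider call whose promise never settles.
const hungProviderCall = new Promise<never>(() => {});

// Stand-in for ACTIVE_EMBEDDED_RUNS (assumption: keyed by session id).
const activeRuns = new Set<string>();

async function runAttempt(sessionId: string, graceMs: number): Promise<string> {
  activeRuns.add(sessionId);
  try {
    await withHardDeadline(hungProviderCall, graceMs);
    return "completed";
  } catch {
    return "timed-out";
  } finally {
    // Reached even though hungProviderCall itself never settles.
    activeRuns.delete(sessionId);
  }
}
```

Note that Promise.race does not cancel the losing promise. The hung provider call stays pending; the race only unblocks the caller so that cleanup can proceed.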

In src/auto-reply/reply/queue-policy.ts:

export function resolveActiveRunQueueAction(params) {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop";  // heartbeat silently killed!
  // ...
}

Since the zombie handle keeps isEmbeddedPiRunActive(sessionId) returning true, every heartbeat tick hits "drop" and exits without any log message.
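Because the drop path is silent, one way to make this state visible is to count consecutive dropped heartbeats per session and emit a warning past a threshold. This is a sketch under stated assumptions: createHeartbeatDropTracker and the warnAfter parameter are hypothetical, and the only action values it relies on are the "run-now" and "drop" strings shown above.

```typescript
// Hypothetical diagnostic (not part of the codebase). Returns a function that
// records each queue action for a session and yields a warning message once
// `warnAfter` heartbeats in a row have been dropped.
function createHeartbeatDropTracker(
  warnAfter: number,
): (sessionId: string, action: string) => string | undefined {
  const consecutiveDrops = new Map<string, number>();

  return (sessionId, action) => {
    if (action !== "drop") {
      // Any non-drop action means the session is making progress again.
      consecutiveDrops.delete(sessionId);
      return undefined;
    }
    const count = (consecutiveDrops.get(sessionId) ?? 0) + 1;
    consecutiveDrops.set(sessionId, count);
    return count >= warnAfter
      ? `heartbeat dropped ${count} times in a row: sessionId=${sessionId}`
      : undefined;
  };
}
```

A caller could pass the returned message to its logger whenever it is defined. This would not fix the zombie handle, but it would turn a silent failure into a diagnosable one.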

Observed Behavior

From gateway logs:

2026-03-21T13:08:05 [agent/embedded] embedded run timeout: runId=620fd6bf sessionId=50d88d7d timeoutMs=600000
2026-03-21T13:08:13 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000

The zombie run persisted for 10+ hours. During that period, no heartbeat was delivered.

Suggested Fix

After the abort timer fires and a grace period elapses (e.g., 30-60s), forcibly remove the handle:

// In the abort timer callback, after abortRun(true):
setTimeout(() => {
  if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
    ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
    notifyEmbeddedRunEnded(params.sessionId);
    log.warn(`force-cleared zombie run: sessionId=${params.sessionId}`);
  }
}, ZOMBIE_CLEANUP_GRACE_MS); // e.g., 30_000

Alternatively, waitForActiveEmbeddedRuns could forcibly clear runs that have exceeded their timeout.
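The alternative could look roughly like the sweep below. It is only a sketch: TrackedRun, deadlineAt, and sweepExpiredRuns are hypothetical, and it assumes each registered run records the absolute time at which its timeout expires, which the current registry may not do.

```typescript
// Hypothetical shape for a registry entry (assumption: the real handle type
// differs and does not necessarily carry a deadline today).
interface TrackedRun {
  handle: object;
  deadlineAt: number; // epoch ms at which the run's own timeout expires
}

// Removes every run whose deadline plus grace period has passed and returns
// the session ids that were force-cleared, so the caller can log them.
function sweepExpiredRuns(
  runs: Map<string, TrackedRun>,
  now: number,
  graceMs: number,
): string[] {
  const cleared: string[] = [];
  // Deleting the current entry during Map.forEach is safe in JavaScript.
  runs.forEach((run, sessionId) => {
    if (now >= run.deadlineAt + graceMs) {
      runs.delete(sessionId);
      cleared.push(sessionId);
    }
  });
  return cleared;
}
```

Compared with a per-run timer, a sweep keeps the cleanup logic in one place and also catches handles whose abort timer callback never ran, at the cost of depending on something calling the sweep regularly.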

Impact

  • Heartbeat failure makes the session appear dead
  • No automatic recovery short of a gateway restart
  • Silent failure: nothing is logged when heartbeats are dropped, so it is very difficult to diagnose

Labels

bug, regression, embedded-run, heartbeat
