fix(workflows): prevent zombie workflow runs from hung Pi cleanup#1563
Conversation
) When a Pi/Minimax DAG node received an SDK error result, the executor threw inside the for-await loop and bridgeSession's finally hung on `await promptPromise.catch(...)` because session.dispose() does not settle session.prompt(). Bun's event loop drained with no remaining I/O sources and the process exited before the catch and DB finalization could run, leaving the workflow run stuck in 'running' indefinitely. Changes: - Wrap the post-dispose prompt-promise await in Promise.race against a 10s timeout so cleanup always completes - Add a defensive finally backstop in executeWorkflow that flips any still-'running' run to 'failed' before returning - Add regression test for the cleanup timeout - Update store mocks in executor tests to include getWorkflowRunStatus Fixes #1561
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR adds defensive mechanisms to prevent "zombie" workflow runs that exit without proper finalization. The event-bridge cleanup is refactored to avoid blocking on unresolved promise settlement, while the executor gains a finally backstop that detects and fails any run left in ChangesZombie Run Prevention
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
Comprehensive PR ReviewPR: #1563 — fix(workflows): prevent zombie workflow runs from hung Pi cleanup SummaryThe fix correctly addresses the zombie workflow run problem (closes #1561) via two complementary mechanisms: a Verdict:
🟡 Medium IssuesM1: Backstop fires silently — no log when it corrects a zombie run📍 When the backstop detects View fixconst backstopStatus = await deps.store.getWorkflowRunStatus(runId).catch(() => null);
if (backstopStatus === 'running') {
getLog().warn(
{ workflowRunId: runId },
'executor.backstop_triggered'
);
await deps.store
.failWorkflowRun(runId, 'Workflow exited without finalizing — see logs')
.catch((err: Error) => {
getLog().error({ err, workflowRunId: runId }, 'executor.backstop_fail_failed');
});
}M2: executor.ts
|
- Clear the cleanup setTimeout after Promise.race resolves to prevent a dangling timer from holding the event loop open for up to 10s in fast CLI Pi workflows - Add debug log inside promptPromise.catch for Pi SDK cleanup-phase rejections so secondary errors surface without polluting output - Add warn log when the executor finally backstop fires so zombie corrections are visible in logs (not just the DB record) - Add executor finally backstop tests covering the 'running' and 'completed' status branches - Add fast-path test asserting bridgeSession cleanup completes in <1s when prompt() settles promptly (guards the clearTimeout fix)
⚡ Self-Fix Report (Aggressive)Status: COMPLETE Fixes Applied (5 total)
View all fixes
Tests Added
Skipped(none — all findings addressed) Validation✅ Type check | ✅ Lint | ✅ Tests (43 event-bridge + 33 executor passed) Self-fix by Archon · aggressive mode · fixes pushed to |
The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout race to prevent the cleanup from hanging when Pi's session.prompt() never settles after dispose(). That solved #1561 but introduced a magic number and unnecessary work. The await itself was the bug. The queue is closed before the await runs, and the .then() handlers attached to promptPromise only push to that queue — closed pushes are no-ops. There's nothing the cleanup is actually waiting for. Whether prompt() resolves in 1ms or never, no observable behavior changes for the caller. Drop the await entirely. Attach .catch() fire-and-forget so a stray async rejection (the .then() handlers should preclude this, but belt-and- suspenders) doesn't bubble up as an unhandled-rejection process exit. Cleanup now returns in <200ms regardless of Pi's prompt() behavior. Tests: - "cleanup does not block when prompt() hangs forever" — asserts <200ms (vs. the previous ~10s tolerance) - "late prompt() rejection does not become unhandled" — regression for the .catch() belt
…leam00#1563) * fix(workflows): prevent zombie workflow runs from hung Pi cleanup (coleam00#1561) When a Pi/Minimax DAG node received an SDK error result, the executor threw inside the for-await loop and bridgeSession's finally hung on `await promptPromise.catch(...)` because session.dispose() does not settle session.prompt(). Bun's event loop drained with no remaining I/O sources and the process exited before the catch and DB finalization could run, leaving the workflow run stuck in 'running' indefinitely. Changes: - Wrap the post-dispose prompt-promise await in Promise.race against a 10s timeout so cleanup always completes - Add a defensive finally backstop in executeWorkflow that flips any still-'running' run to 'failed' before returning - Add regression test for the cleanup timeout - Update store mocks in executor tests to include getWorkflowRunStatus Fixes coleam00#1561 * fix: address review findings from PR coleam00#1563 - Clear the cleanup setTimeout after Promise.race resolves to prevent a dangling timer from holding the event loop open for up to 10s in fast CLI Pi workflows - Add debug log inside promptPromise.catch for Pi SDK cleanup-phase rejections so secondary errors surface without polluting output - Add warn log when the executor finally backstop fires so zombie corrections are visible in logs (not just the DB record) - Add executor finally backstop tests covering the 'running' and 'completed' status branches - Add fast-path test asserting bridgeSession cleanup completes in <1s when prompt() settles promptly (guards the clearTimeout fix) * simplify: use unknown instead of Error in backstop catch handler * fix(pi): drop blocking await on prompt promise; cleanup is non-blocking The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout race to prevent the cleanup from hanging when Pi's session.prompt() never settles after dispose(). That solved coleam00#1561 but introduced a magic number and unnecessary work. The await itself was the bug. The queue is closed before the await runs, and the .then() handlers attached to promptPromise only push to that queue — closed pushes are no-ops. There's nothing the cleanup is actually waiting for. Whether prompt() resolves in 1ms or never, no observable behavior changes for the caller. Drop the await entirely. Attach .catch() fire-and-forget so a stray async rejection (the .then() handlers should preclude this, but belt-and- suspenders) doesn't bubble up as an unhandled-rejection process exit. Cleanup now returns in <200ms regardless of Pi's prompt() behavior. Tests: - "cleanup does not block when prompt() hangs forever" — asserts <200ms (vs. the previous ~10s tolerance) - "late prompt() rejection does not become unhandled" — regression for the .catch() belt
Summary
bridgeSessionhangs forever —session.dispose()closes Pi's I/O handles but doesn't settlesession.prompt(). Bun drains its event loop and exits with code 0, leaving the workflow run stuck instatus = 'running'indefinitely.abandon. No automatic recovery exists.bridgeSessionpost-disposeawaitnow races against a 10 s timeout so the event loop is never left with zero anchors; (C2)executeWorkflowgets afinallybackstop that flips any still-runningrun tofailedafter all code paths complete; (C3) regression test added that intentionally passes a never-settlingprompt()mock.withIdleTimeouttiming, DAG execution logic, process signal handling, and all existing behaviour on successful runs.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
dag-executor.tsevent-bridge.ts(viawithIdleTimeout→generator.return)event-bridge.tsbridgeSessionfinallypromptPromisePromise.racewith 10 s timeoutexecutor.tsexecuteWorkflowIWorkflowStore.getWorkflowRunStatusfinallyexecutor.tsexecuteWorkflowIWorkflowStore.failWorkflowRunfailedinfinallyLabel Snapshot
risk: lowsize: Sworkflows,providersworkflows:executor,providers:piChange Metadata
bugmulti(workflows + providers)Linked Issue
Validation Evidence (required)
All checks passed:
bun run check:bundled— PASS (bundled-defaults.generated.ts up to date, 36 commands, 20 workflows)bun run type-check— PASS (all 10 packages clean)bun run lint --max-warnings 0— PASS (0 errors, 0 warnings)bun run format:check— PASSbun run test— PASS (full suite green; newbridgeSessiontimeout test passes in ~10.04s, dominated by the intentional 10s timeout it exercises)The new regression test in
event-bridge.test.tsdirectly validates the fix: a mock session whoseprompt()returns a never-settling promise no longer hangs cleanup — the consumer's thrown error propagates within ~10 s. Without the fix, this test would hang indefinitely (caught by the 15 s test-level timeout).Security Impact (required)
Compatibility / Migration
bridgeSessionchange is Pi-specific; thefinallybackstop inexecutor.tsis a no-op on runs that already finalize correctlyHuman Verification (required)
bun run validatesuite passed (type-check, lint, format, tests, bundled-defaults check)getWorkflowRunStatusthrow in backstop is caught and swallowed; (C2)failWorkflowRunthrow in backstop is caught and logged; (C3)workflowRunundefined guard prevents backstop from running on run-creation failure; (C4) timeout fires after 10 s in the test, not soonerSide Effects / Blast Radius (required)
executor.tsfinallybackstop is universal but fires only when a run is stillrunningpost-execution (should be rare/never on non-Pi paths)bridgeSession's finally is triggered for cleanup, so the race resolves immediately (prompt already settled)executor.backstop_fail_failedlog event emitted if the backstop's ownfailWorkflowRuncall failsRollback Plan (required)
853f491d(git revert 853f491d)runningafter a Pi SDK error;archon workflow statusshowsrunningindefinitely; subsequent runs on same worktree blocked with "worktree is in use"Risks and Mitigations
prompt()mocks. Isolated in its ownbun testinvocation via the existing@archon/providerstest batch configuration.Summary by CodeRabbit
Bug Fixes
Tests