Skip to content

fix(workflows): prevent zombie workflow runs from hung Pi cleanup#1563

Merged
Wirasm merged 4 commits into
devfrom
archon/task-fix-issue-1561
May 4, 2026
Merged

fix(workflows): prevent zombie workflow runs from hung Pi cleanup#1563
Wirasm merged 4 commits into
devfrom
archon/task-fix-issue-1561

Conversation

@Wirasm

@Wirasm Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Problem: When a Pi DAG node receives an SDK error result, the thrown error never reaches the executor's catch block because the async generator cleanup in bridgeSession hangs forever — session.dispose() closes Pi's I/O handles but doesn't settle session.prompt(). Bun drains its event loop and exits with code 0, leaving the workflow run stuck in status = 'running' indefinitely.
  • Why it matters: Zombie runs block all subsequent runs on the same worktree path (path-lock) and require manual abandon. No automatic recovery exists.
  • What changed: (C1) bridgeSession post-dispose await now races against a 10 s timeout so the event loop is never left with zero anchors; (C2) executeWorkflow gets a finally backstop that flips any still-running run to failed after all code paths complete; (C3) regression test added that intentionally passes a never-settling prompt() mock.
  • What did NOT change: Claude and Codex providers, withIdleTimeout timing, DAG execution logic, process signal handling, and all existing behaviour on successful runs.

UX Journey

Before

User                    Archon CLI               Pi Provider
────                    ──────────               ───────────
archon workflow run ──▶ starts DAG run
                        (status = 'running')
                        streams to Pi ─────────▶ returns isError: true
                        dag-executor throws
                        [for-await calls .return()]
                        bridgeSession.finally:
                          session.dispose()
                          await promptPromise   ← hangs forever
                                                  (Bun event loop drains)
                        [process exits code 0]
                                                  catch block: NEVER RUNS
                                                  status stays 'running'
User sees:              $ (prompt returns, status = 'running' in DB)
archon workflow status  shows run stuck 'running' — zombie forever
archon workflow run ... "worktree is in use" error on next run

After

User                    Archon CLI               Pi Provider
────                    ──────────               ───────────
archon workflow run ──▶ starts DAG run
                        (status = 'running')
                        streams to Pi ─────────▶ returns isError: true
                        dag-executor throws
                        [for-await calls .return()]
                        bridgeSession.finally:
                          session.dispose()
                          [Promise.race: prompt vs 10s timeout]
                          timeout wins after 10s  ← event loop stays alive
                        throw propagates to catch
                        dag_node_failed logged
                        executeWorkflow catch runs
                        [finally backstop checks status]
                        status → 'failed'
User sees:              run status = 'failed' (not 'running')
archon workflow status  shows correct terminal state
archon workflow run ... next run proceeds normally

Architecture Diagram

Before

dag-executor.ts
  executeNodeInternal()
    for await (msg of withIdleTimeout(aiClient.sendQuery()))
      if (msg.isError) throw Error          ← throws
        [.return() called on withIdleTimeout]
          idle-timeout.ts finally:
            await generator.return(undefined)
              event-bridge.ts finally:
                session.dispose()           ← synchronous
                await promptPromise.catch() ← HANGS FOREVER (Pi bug)
                                              process exits, catch never runs

After

dag-executor.ts
  executeNodeInternal()
    for await (msg of withIdleTimeout(aiClient.sendQuery()))
      if (msg.isError) throw Error          ← throws
        [.return() called on withIdleTimeout]
          idle-timeout.ts finally:
            await generator.return(undefined)
              event-bridge.ts finally:      [~MODIFIED]
                session.dispose()
                [Promise.race([promptPromise, 10s timeout])]
                                            ← resolves within 10s
                throw propagates upward
  executor.ts executeWorkflow catch:
    failWorkflowRun(...)                    ← status = 'failed'
  executor.ts executeWorkflow finally:      [+NEW]
    if status still 'running': flip to 'failed'  ← backstop

Connection inventory:

From To Status Notes
dag-executor.ts event-bridge.ts (via withIdleTimeoutgenerator.return) unchanged Throw propagation path
event-bridge.ts bridgeSession finally promptPromise modified Now Promise.race with 10 s timeout
executor.ts executeWorkflow IWorkflowStore.getWorkflowRunStatus new Backstop reads current status in finally
executor.ts executeWorkflow IWorkflowStore.failWorkflowRun new Backstop flips zombie to failed in finally

Label Snapshot

  • Risk: risk: low
  • Size: size: S
  • Scope: workflows, providers
  • Module: workflows:executor, providers:pi

Change Metadata

  • Change type: bug
  • Primary scope: multi (workflows + providers)

Linked Issue

Validation Evidence (required)

bun run validate

All checks passed:

  • bun run check:bundled — PASS (bundled-defaults.generated.ts up to date, 36 commands, 20 workflows)
  • bun run type-check — PASS (all 10 packages clean)
  • bun run lint --max-warnings 0 — PASS (0 errors, 0 warnings)
  • bun run format:check — PASS
  • bun run test — PASS (full suite green; new bridgeSession timeout test passes in ~10.04s, dominated by the intentional 10s timeout it exercises)

The new regression test in event-bridge.test.ts directly validates the fix: a mock session whose prompt() returns a never-settling promise no longer hangs cleanup — the consumer's thrown error propagates within ~10 s. Without the fix, this test would hang indefinitely (caught by the 15 s test-level timeout).

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — existing Claude and Codex workflows are completely unaffected; the bridgeSession change is Pi-specific; the finally backstop in executor.ts is a no-op on runs that already finalize correctly
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios: Full bun run validate suite passed (type-check, lint, format, tests, bundled-defaults check)
  • Edge cases checked: (C1) getWorkflowRunStatus throw in backstop is caught and swallowed; (C2) failWorkflowRun throw in backstop is caught and logged; (C3) workflowRun undefined guard prevents backstop from running on run-creation failure; (C4) timeout fires after 10 s in the test, not sooner
  • What was not verified: Live end-to-end run against a real Pi/Minimax provider (requires external API key)

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Pi provider cleanup path only; executor.ts finally backstop is universal but fires only when a run is still running post-execution (should be rare/never on non-Pi paths)
  • Potential unintended effects: The 10 s post-dispose wait only affects already-errored Pi sessions; normal successful runs complete and close the generator cleanly before bridgeSession's finally is triggered for cleanup, so the race resolves immediately (prompt already settled)
  • Guardrails: New executor.backstop_fail_failed log event emitted if the backstop's own failWorkflowRun call fails

Rollback Plan (required)

  • Fast rollback command/path: Revert commit 853f491d (git revert 853f491d)
  • Feature flags or config toggles: None
  • Observable failure symptoms: Workflow runs stuck in running after a Pi SDK error; archon workflow status shows running indefinitely; subsequent runs on same worktree blocked with "worktree is in use"

Risks and Mitigations

  • Risk: 10 s post-dispose timeout could delay test suite if many Pi tests use never-settling mocks
    • Mitigation: The single regression test is intentional and already accounts for this (~10 s). No other tests use never-settling prompt() mocks. Isolated in its own bun test invocation via the existing @archon/providers test batch configuration.

Summary by CodeRabbit

  • Bug Fixes

    • Improved cleanup behavior to prevent blocking on unresolved operations.
    • Added safeguards to properly handle workflows that unexpectedly exit while still running, ensuring they are marked as failed and logged appropriately.
  • Tests

    • Added regression tests validating cleanup resilience and workflow failure handling.

)

When a Pi/Minimax DAG node received an SDK error result, the executor
threw inside the for-await loop and bridgeSession's finally hung on
`await promptPromise.catch(...)` because session.dispose() does not
settle session.prompt(). Bun's event loop drained with no remaining
I/O sources and the process exited before the catch and DB finalization
could run, leaving the workflow run stuck in 'running' indefinitely.

Changes:
- Wrap the post-dispose prompt-promise await in Promise.race against a
  10s timeout so cleanup always completes
- Add a defensive finally backstop in executeWorkflow that flips any
  still-'running' run to 'failed' before returning
- Add regression test for the cleanup timeout
- Update store mocks in executor tests to include getWorkflowRunStatus

Fixes #1561
@coderabbitai

coderabbitai Bot commented May 4, 2026

Copy link
Copy Markdown

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae5c02ec-c497-484d-8581-e031dd5c81ed

📥 Commits

Reviewing files that changed from the base of the PR and between 88d0109 and cd9670c.

📒 Files selected for processing (5)
  • packages/providers/src/community/pi/event-bridge.test.ts
  • packages/providers/src/community/pi/event-bridge.ts
  • packages/workflows/src/executor-preamble.test.ts
  • packages/workflows/src/executor.test.ts
  • packages/workflows/src/executor.ts

📝 Walkthrough

Walkthrough

This PR adds defensive mechanisms to prevent "zombie" workflow runs that exit without proper finalization. The event-bridge cleanup is refactored to avoid blocking on unresolved promise settlement, while the executor gains a finally backstop that detects and fails any run left in running status after process exit.

Changes

Zombie Run Prevention

Layer / File(s) Summary
Event-Bridge Cleanup
packages/providers/src/community/pi/event-bridge.ts
bridgeSession finally block stops awaiting promptPromise. Queue closes, listeners clean up, and a defensive .catch() is attached to log late rejections without blocking process exit.
Executor Backstop
packages/workflows/src/executor.ts
executeWorkflow adds finally block that queries run status; if still 'running', calls failWorkflowRun() with "exited without finalizing" message to prevent orphaned zombie runs.
Test Fixtures
packages/workflows/src/executor-preamble.test.ts, packages/workflows/src/executor.test.ts
Mocked IWorkflowStore adds getWorkflowRunStatus() method returning 'completed', enabling run-status-aware test assertions.
Regression & Backstop Tests
packages/providers/src/community/pi/event-bridge.test.ts, packages/workflows/src/executor.test.ts
Two event-bridge cleanup regression tests verify prompt hanging does not block cleanup and late rejections do not surface as unhandled errors; executor backstop tests verify failWorkflowRun() called only when status is 'running', not 'completed'.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Poem

🧟 A zombie once lurked in the flow,
Process exiting high, status stuck low,
But cleanup now won't block on a dream,
And finally backstops guard the workflow stream,
No more undead runs haunting the queue! 🐰✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch archon/task-fix-issue-1561

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator Author

Comprehensive PR Review

PR: #1563 — fix(workflows): prevent zombie workflow runs from hung Pi cleanup
Reviewed by: 4 specialized agents (code-review, error-handling, test-coverage, docs-impact)
Date: 2026-05-04


Summary

The fix correctly addresses the zombie workflow run problem (closes #1561) via two complementary mechanisms: a Promise.race timeout in event-bridge.ts to unblock hung Pi cleanup, and a defensive finally backstop in executor.ts to catch any run still left in 'running'. Implementation is type-safe, scope-disciplined, well-commented, and ships a strong regression test for the primary fix. No documentation changes needed.

Verdict: REQUEST_CHANGES — 2 medium issues, 3 low advisory issues. No blockers.

Severity Count
🔴 CRITICAL 0
🟠 HIGH 0
🟡 MEDIUM 2
🟢 LOW 3

🟡 Medium Issues

M1: Backstop fires silently — no log when it corrects a zombie run

📍 packages/workflows/src/executor.ts:840-845

When the backstop detects backstopStatus === 'running' and calls failWorkflowRun, there is no structured log entry. Only the failure of the backstop is logged. If a zombie run is ever corrected in production, the DB record flips to 'failed' with nothing in the logs to show the backstop triggered — making incident investigation very difficult.

View fix
const backstopStatus = await deps.store.getWorkflowRunStatus(runId).catch(() => null);
if (backstopStatus === 'running') {
  getLog().warn(
    { workflowRunId: runId },
    'executor.backstop_triggered'
  );
  await deps.store
    .failWorkflowRun(runId, 'Workflow exited without finalizing — see logs')
    .catch((err: Error) => {
      getLog().error({ err, workflowRunId: runId }, 'executor.backstop_fail_failed');
    });
}

M2: executor.ts finally backstop has no test for the 'running' branch

📍 packages/workflows/src/executor.ts:839-846 / packages/workflows/src/executor.test.ts

All makeStore mocks default to returning 'completed' from getWorkflowRunStatus. The guarded block (if (backstopStatus === 'running')) is dead code in the entire test suite. A typo in the status string ('Running' vs 'running') would silently disable zombie cleanup with no alarm.

View suggested tests
describe('finally backstop', () => {
  it('calls failWorkflowRun when run is still running at finally', async () => {
    const failSpy = mock(async () => {});
    const store = makeStore({
      getWorkflowRunStatus: mock(async () => 'running' as const),
      failWorkflowRun: failSpy,
    });
    const deps = makeDeps(store);
    await executeWorkflow(deps, makePlatform(), 'conv-1', '/tmp', makeWorkflow(), 'test', 'db-conv-1');
    const call = failSpy.mock.calls.find(
      (c: unknown[]) => typeof c[1] === 'string' && (c[1] as string).includes('exited without finalizing')
    );
    expect(call).toBeDefined();
  });

  it('does not call failWorkflowRun when run already completed', async () => {
    const failSpy = mock(async () => {});
    const store = makeStore({
      getWorkflowRunStatus: mock(async () => 'completed' as const),
      failWorkflowRun: failSpy,
    });
    const deps = makeDeps(store);
    await executeWorkflow(deps, makePlatform(), 'conv-1', '/tmp', makeWorkflow(), 'test', 'db-conv-1');
    const backstopCall = failSpy.mock.calls.find(
      (c: unknown[]) => typeof c[1] === 'string' && (c[1] as string).includes('exited without finalizing')
    );
    expect(backstopCall).toBeUndefined();
  });
});

🟢 Low Issues

View 3 low-priority suggestions

L1: Dangling setTimeout — can add up to 10s hang after fast Pi cleanup in CLI workflows

📍 packages/providers/src/community/pi/event-bridge.ts:416-418

When promptPromise settles before the 10s timer (normal/fast path), Promise.race resolves but the setTimeout stays queued. A live timer holds the Bun/Node event loop open. For fast CLI Pi workflows (<10s total), this can add up to 10s of unnecessary wait before process exit — a regression vs. pre-fix behavior.

// Fix: capture and clear the timer
let cleanupTimeoutId: ReturnType<typeof setTimeout> | undefined;
await Promise.race([
  promptPromise.catch(() => {
    /* errors already surfaced through the queue */
  }),
  new Promise<void>(resolve => {
    cleanupTimeoutId = setTimeout(resolve, 10_000);
  }),
]);
clearTimeout(cleanupTimeoutId);

Note: Does not affect server deployments (process never exits). Only impacts fast CLI Pi executions.

L2: No fast-path test for bridgeSession.finally completion time

📍 packages/providers/src/community/pi/event-bridge.test.ts:484-538

The regression test verifies elapsed < 11s (zombie path). No test verifies that when promptPromise settles quickly, cleanup also completes quickly. Pairs with L1 fix — a companion test would catch any future regression that accidentally reintroduces a fixed delay. Advisory only.

L3: promptPromise.catch swallows secondary cleanup errors silently

📍 packages/providers/src/community/pi/event-bridge.ts:413-415

Out of scope per scope.md (Pi SDK internals). A getLog().debug({ err }, 'pi.event-bridge.prompt_rejected_during_cleanup') inside the catch would surface SDK regressions during cleanup without polluting output. Advisory only.


What's Good

  • Two independent layers of defense: the Pi timeout prevents the hang; the executor backstop catches any future code path that misses failWorkflowRun. Neither depends on the other; both are needed.
  • event-bridge regression test is exemplary: directly reproduces the exact failure mode (never-settling prompt() after dispose()), validates cleanup within the 10s window, covers the consumer-throws path that triggers cleanup.
  • Mock completeness: proactively updated executor.test.ts and executor-preamble.test.ts to add getWorkflowRunStatus to makeStore — prevented 18/31 executor tests from failing.
  • Backstop condition is narrowly scoped: backstopStatus === 'running' means the backstop cannot fire for legitimately paused, cancelled, or completed runs.
  • CLAUDE.md compliance: within-process backstop correctly follows "No Autonomous Lifecycle Mutation Across Process Boundaries"; comments explain WHY not WHAT; no any types; full type annotations.
  • Scope discipline: Claude/Codex providers untouched, STEP_IDLE_TIMEOUT_MS untouched, no process.on('exit') handler added.

Suggested Next Steps

  1. Add the warn log in the backstop if block (M1) — one line
  2. Add the describe('finally backstop') tests (M2) — low effort, two it() blocks
  3. Optionally fix the dangling timer (L1) — prevents potential 10s hang regression for fast CLI Pi workflows
  4. Rebase onto dev (currently 2 commits behind)

Artifacts: /Users/rasmus/.archon/workspaces/coleam00/Archon/artifacts/runs/aba41af102dd6a958af92d5d4cac9300/review/

- Clear the cleanup setTimeout after Promise.race resolves to prevent
  a dangling timer from holding the event loop open for up to 10s in
  fast CLI Pi workflows
- Add debug log inside promptPromise.catch for Pi SDK cleanup-phase
  rejections so secondary errors surface without polluting output
- Add warn log when the executor finally backstop fires so zombie
  corrections are visible in logs (not just the DB record)
- Add executor finally backstop tests covering the 'running' and
  'completed' status branches
- Add fast-path test asserting bridgeSession cleanup completes in <1s
  when prompt() settles promptly (guards the clearTimeout fix)
@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator Author

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-fix-issue-1561
Commit: 92886af
Philosophy: Fix everything unless clearly a new concern


Fixes Applied (5 total)

Severity Count
🔴 CRITICAL 0
🟠 HIGH 0
🟡 MEDIUM 2
🟢 LOW 3
View all fixes
  • Backstop fires silently (executor.ts:840) — Added getLog().warn({ workflowRunId: runId }, 'executor.backstop_triggered') so zombie corrections leave a log breadcrumb
  • executor.ts finally backstop untested (executor.test.ts) — Added describe('finally backstop') with two tests: status 'running' fires failWorkflowRun, status 'completed' does not
  • Dangling setTimeout (event-bridge.ts:416) — Extracted timer to cleanupTimeoutId; clearTimeout(cleanupTimeoutId) runs after Promise.race so fast CLI Pi workflows don't hang 10s post-completion
  • Missing fast-path cleanup test (event-bridge.test.ts) — Added test asserting elapsed < 1_000ms when dispose() settles prompt() immediately; catches future fixed-delay regressions
  • promptPromise.catch swallows cleanup errors silently (event-bridge.ts:413) — Added getLog().debug({ err }, 'pi.event-bridge.prompt_rejected_during_cleanup') with explanatory comment

Tests Added

  • event-bridge.test.ts: fast-path cleanup timing test (elapsed < 1_000)
  • executor.test.ts: backstop fires for 'running' status
  • executor.test.ts: backstop skips for 'completed' status

Skipped

(none — all findings addressed)


Validation

✅ Type check | ✅ Lint | ✅ Tests (43 event-bridge + 33 executor passed)


Self-fix by Archon · aggressive mode · fixes pushed to archon/task-fix-issue-1561

The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout
race to prevent the cleanup from hanging when Pi's session.prompt() never
settles after dispose(). That solved #1561 but introduced a magic number
and unnecessary work.

The await itself was the bug. The queue is closed before the await runs,
and the .then() handlers attached to promptPromise only push to that
queue — closed pushes are no-ops. There's nothing the cleanup is actually
waiting for. Whether prompt() resolves in 1ms or never, no observable
behavior changes for the caller.

Drop the await entirely. Attach .catch() fire-and-forget so a stray async
rejection (the .then() handlers should preclude this, but belt-and-
suspenders) doesn't bubble up as an unhandled-rejection process exit.
Cleanup now returns in <200ms regardless of Pi's prompt() behavior.

Tests:
- "cleanup does not block when prompt() hangs forever" — asserts <200ms
  (vs. the previous ~10s tolerance)
- "late prompt() rejection does not become unhandled" — regression for
  the .catch() belt
@Wirasm Wirasm marked this pull request as ready for review May 4, 2026 12:55
@Wirasm Wirasm merged commit 5593498 into dev May 4, 2026
6 of 7 checks passed
@Wirasm Wirasm mentioned this pull request May 12, 2026
cropse pushed a commit to cropse/Archon that referenced this pull request May 19, 2026
…leam00#1563)

* fix(workflows): prevent zombie workflow runs from hung Pi cleanup (coleam00#1561)

When a Pi/Minimax DAG node received an SDK error result, the executor
threw inside the for-await loop and bridgeSession's finally hung on
`await promptPromise.catch(...)` because session.dispose() does not
settle session.prompt(). Bun's event loop drained with no remaining
I/O sources and the process exited before the catch and DB finalization
could run, leaving the workflow run stuck in 'running' indefinitely.

Changes:
- Wrap the post-dispose prompt-promise await in Promise.race against a
  10s timeout so cleanup always completes
- Add a defensive finally backstop in executeWorkflow that flips any
  still-'running' run to 'failed' before returning
- Add regression test for the cleanup timeout
- Update store mocks in executor tests to include getWorkflowRunStatus

Fixes coleam00#1561

* fix: address review findings from PR coleam00#1563

- Clear the cleanup setTimeout after Promise.race resolves to prevent
  a dangling timer from holding the event loop open for up to 10s in
  fast CLI Pi workflows
- Add debug log inside promptPromise.catch for Pi SDK cleanup-phase
  rejections so secondary errors surface without polluting output
- Add warn log when the executor finally backstop fires so zombie
  corrections are visible in logs (not just the DB record)
- Add executor finally backstop tests covering the 'running' and
  'completed' status branches
- Add fast-path test asserting bridgeSession cleanup completes in <1s
  when prompt() settles promptly (guards the clearTimeout fix)

* simplify: use unknown instead of Error in backstop catch handler

* fix(pi): drop blocking await on prompt promise; cleanup is non-blocking

The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout
race to prevent the cleanup from hanging when Pi's session.prompt() never
settles after dispose(). That solved coleam00#1561 but introduced a magic number
and unnecessary work.

The await itself was the bug. The queue is closed before the await runs,
and the .then() handlers attached to promptPromise only push to that
queue — closed pushes are no-ops. There's nothing the cleanup is actually
waiting for. Whether prompt() resolves in 1ms or never, no observable
behavior changes for the caller.

Drop the await entirely. Attach .catch() fire-and-forget so a stray async
rejection (the .then() handlers should preclude this, but belt-and-
suspenders) doesn't bubble up as an unhandled-rejection process exit.
Cleanup now returns in <200ms regardless of Pi's prompt() behavior.

Tests:
- "cleanup does not block when prompt() hangs forever" — asserts <200ms
  (vs. the previous ~10s tolerance)
- "late prompt() rejection does not become unhandled" — regression for
  the .catch() belt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream

1 participant