fix(workflows): prevent zombie workflow runs from hung Pi cleanup by Wirasm · Pull Request #1563 · coleam00/Archon

Wirasm · 2026-05-04T09:24:02Z

Summary

Problem: When a Pi DAG node receives an SDK error result, the thrown error never reaches the executor's catch block because the async generator cleanup in bridgeSession hangs forever — session.dispose() closes Pi's I/O handles but doesn't settle session.prompt(). Bun drains its event loop and exits with code 0, leaving the workflow run stuck in status = 'running' indefinitely.
Why it matters: Zombie runs block all subsequent runs on the same worktree path (path-lock) and require manual abandon. No automatic recovery exists.
What changed: (C1) bridgeSession post-dispose await now races against a 10 s timeout so the event loop is never left with zero anchors; (C2) executeWorkflow gets a finally backstop that flips any still-running run to failed after all code paths complete; (C3) regression test added that intentionally passes a never-settling prompt() mock.
What did NOT change: Claude and Codex providers, withIdleTimeout timing, DAG execution logic, process signal handling, and all existing behaviour on successful runs.

UX Journey

Before

User                    Archon CLI               Pi Provider
────                    ──────────               ───────────
archon workflow run ──▶ starts DAG run
                        (status = 'running')
                        streams to Pi ─────────▶ returns isError: true
                        dag-executor throws
                        [for-await calls .return()]
                        bridgeSession.finally:
                          session.dispose()
                          await promptPromise   ← hangs forever
                                                  (Bun event loop drains)
                        [process exits code 0]
                                                  catch block: NEVER RUNS
                                                  status stays 'running'
User sees:              $ (prompt returns, status = 'running' in DB)
archon workflow status  shows run stuck 'running' — zombie forever
archon workflow run ... "worktree is in use" error on next run

After

User                    Archon CLI               Pi Provider
────                    ──────────               ───────────
archon workflow run ──▶ starts DAG run
                        (status = 'running')
                        streams to Pi ─────────▶ returns isError: true
                        dag-executor throws
                        [for-await calls .return()]
                        bridgeSession.finally:
                          session.dispose()
                          [Promise.race: prompt vs 10s timeout]
                          timeout wins after 10s  ← event loop stays alive
                        throw propagates to catch
                        dag_node_failed logged
                        executeWorkflow catch runs
                        [finally backstop checks status]
                        status → 'failed'
User sees:              run status = 'failed' (not 'running')
archon workflow status  shows correct terminal state
archon workflow run ... next run proceeds normally

Architecture Diagram

Before

dag-executor.ts
  executeNodeInternal()
    for await (msg of withIdleTimeout(aiClient.sendQuery()))
      if (msg.isError) throw Error          ← throws
        [.return() called on withIdleTimeout]
          idle-timeout.ts finally:
            await generator.return(undefined)
              event-bridge.ts finally:
                session.dispose()           ← synchronous
                await promptPromise.catch() ← HANGS FOREVER (Pi bug)
                                              process exits, catch never runs

After

dag-executor.ts
  executeNodeInternal()
    for await (msg of withIdleTimeout(aiClient.sendQuery()))
      if (msg.isError) throw Error          ← throws
        [.return() called on withIdleTimeout]
          idle-timeout.ts finally:
            await generator.return(undefined)
              event-bridge.ts finally:      [~MODIFIED]
                session.dispose()
                [Promise.race([promptPromise, 10s timeout])]
                                            ← resolves within 10s
                throw propagates upward
  executor.ts executeWorkflow catch:
    failWorkflowRun(...)                    ← status = 'failed'
  executor.ts executeWorkflow finally:      [+NEW]
    if status still 'running': flip to 'failed'  ← backstop

Connection inventory:

From	To	Status	Notes
`dag-executor.ts`	`event-bridge.ts` (via `withIdleTimeout` → `generator.return`)	unchanged	Throw propagation path
`event-bridge.ts` `bridgeSession` finally	`promptPromise`	modified	Now `Promise.race` with 10 s timeout
`executor.ts` `executeWorkflow`	`IWorkflowStore.getWorkflowRunStatus`	new	Backstop reads current status in `finally`
`executor.ts` `executeWorkflow`	`IWorkflowStore.failWorkflowRun`	new	Backstop flips zombie to `failed` in `finally`

Label Snapshot

Risk: risk: low
Size: size: S
Scope: workflows, providers
Module: workflows:executor, providers:pi

Change Metadata

Change type: bug
Primary scope: multi (workflows + providers)

Linked Issue

Closes Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream #1561

Validation Evidence (required)

bun run validate

All checks passed:

bun run check:bundled — PASS (bundled-defaults.generated.ts up to date, 36 commands, 20 workflows)
bun run type-check — PASS (all 10 packages clean)
bun run lint --max-warnings 0 — PASS (0 errors, 0 warnings)
bun run format:check — PASS
bun run test — PASS (full suite green; new bridgeSession timeout test passes in ~10.04s, dominated by the intentional 10s timeout it exercises)

The new regression test in event-bridge.test.ts directly validates the fix: a mock session whose prompt() returns a never-settling promise no longer hangs cleanup — the consumer's thrown error propagates within ~10 s. Without the fix, this test would hang indefinitely (caught by the 15 s test-level timeout).

Security Impact (required)

New permissions/capabilities? No
New external network calls? No
Secrets/tokens handling changed? No
File system access scope changed? No

Compatibility / Migration

Backward compatible? Yes — existing Claude and Codex workflows are completely unaffected; the bridgeSession change is Pi-specific; the finally backstop in executor.ts is a no-op on runs that already finalize correctly
Config/env changes? No
Database migration needed? No

Human Verification (required)

Verified scenarios: Full bun run validate suite passed (type-check, lint, format, tests, bundled-defaults check)
Edge cases checked: (C1) getWorkflowRunStatus throw in backstop is caught and swallowed; (C2) failWorkflowRun throw in backstop is caught and logged; (C3) workflowRun undefined guard prevents backstop from running on run-creation failure; (C4) timeout fires after 10 s in the test, not sooner
What was not verified: Live end-to-end run against a real Pi/Minimax provider (requires external API key)

Side Effects / Blast Radius (required)

Affected subsystems/workflows: Pi provider cleanup path only; executor.ts finally backstop is universal but fires only when a run is still running post-execution (should be rare/never on non-Pi paths)
Potential unintended effects: The 10 s post-dispose wait only affects already-errored Pi sessions; normal successful runs complete and close the generator cleanly before bridgeSession's finally is triggered for cleanup, so the race resolves immediately (prompt already settled)
Guardrails: New executor.backstop_fail_failed log event emitted if the backstop's own failWorkflowRun call fails

Rollback Plan (required)

Fast rollback command/path: Revert commit 853f491d (git revert 853f491d)
Feature flags or config toggles: None
Observable failure symptoms: Workflow runs stuck in running after a Pi SDK error; archon workflow status shows running indefinitely; subsequent runs on same worktree blocked with "worktree is in use"

Risks and Mitigations

Risk: 10 s post-dispose timeout could delay test suite if many Pi tests use never-settling mocks
- Mitigation: The single regression test is intentional and already accounts for this (~10 s). No other tests use never-settling prompt() mocks. Isolated in its own bun test invocation via the existing @archon/providers test batch configuration.

Summary by CodeRabbit

Bug Fixes
- Improved cleanup behavior to prevent blocking on unresolved operations.
- Added safeguards to properly handle workflows that unexpectedly exit while still running, ensuring they are marked as failed and logged appropriately.
Tests
- Added regression tests validating cleanup resilience and workflow failure handling.

) When a Pi/Minimax DAG node received an SDK error result, the executor threw inside the for-await loop and bridgeSession's finally hung on `await promptPromise.catch(...)` because session.dispose() does not settle session.prompt(). Bun's event loop drained with no remaining I/O sources and the process exited before the catch and DB finalization could run, leaving the workflow run stuck in 'running' indefinitely. Changes: - Wrap the post-dispose prompt-promise await in Promise.race against a 10s timeout so cleanup always completes - Add a defensive finally backstop in executeWorkflow that flips any still-'running' run to 'failed' before returning - Add regression test for the cleanup timeout - Update store mocks in executor tests to include getWorkflowRunStatus Fixes #1561

coderabbitai · 2026-05-04T09:25:39Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae5c02ec-c497-484d-8581-e031dd5c81ed

📥 Commits

Reviewing files that changed from the base of the PR and between 88d0109 and cd9670c.

📒 Files selected for processing (5)

packages/providers/src/community/pi/event-bridge.test.ts
packages/providers/src/community/pi/event-bridge.ts
packages/workflows/src/executor-preamble.test.ts
packages/workflows/src/executor.test.ts
packages/workflows/src/executor.ts

📝 Walkthrough

Walkthrough

This PR adds defensive mechanisms to prevent "zombie" workflow runs that exit without proper finalization. The event-bridge cleanup is refactored to avoid blocking on unresolved promise settlement, while the executor gains a finally backstop that detects and fails any run left in running status after process exit.

Changes

Zombie Run Prevention

Layer / File(s)	Summary
Event-Bridge Cleanup `packages/providers/src/community/pi/event-bridge.ts`	`bridgeSession` `finally` block stops awaiting `promptPromise`. Queue closes, listeners clean up, and a defensive `.catch()` is attached to log late rejections without blocking process exit.
Executor Backstop `packages/workflows/src/executor.ts`	`executeWorkflow` adds `finally` block that queries run status; if still `'running'`, calls `failWorkflowRun()` with "exited without finalizing" message to prevent orphaned zombie runs.
Test Fixtures `packages/workflows/src/executor-preamble.test.ts`, `packages/workflows/src/executor.test.ts`	Mocked `IWorkflowStore` adds `getWorkflowRunStatus()` method returning `'completed'`, enabling run-status-aware test assertions.
Regression & Backstop Tests `packages/providers/src/community/pi/event-bridge.test.ts`, `packages/workflows/src/executor.test.ts`	Two event-bridge cleanup regression tests verify prompt hanging does not block cleanup and late rejections do not surface as unhandled errors; executor backstop tests verify `failWorkflowRun()` called only when status is `'running'`, not `'completed'`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream #1561: This PR directly implements the defensive backstop for zombie runs (calls failWorkflowRun() in executor finally block when status is 'running') and addresses provider-level teardown that could cause zombie-run symptoms via the event-bridge cleanup refactor.

Possibly related PRs

feat(providers/pi): best-effort structured output via prompt engineering #1297: Modifies the same bridgeSession logic in event-bridge.ts for session/prompt handling and cleanup behavior.
feat(workflows): add mutates_checkout to allow concurrent runs on live checkout #1438: Also modifies executeWorkflow() in executor.ts, introducing conditional guard logic alongside this PR's finally backstop.
fix(providers/pi): lazy-load Pi SDK to unbreak compiled archon binary #1355: Updates event-bridge/bridgeSession consumers to dynamically import, intersecting with this PR's event-bridge cleanup refactor.

Poem

🧟 A zombie once lurked in the flow,
Process exiting high, status stuck low,
But cleanup now won't block on a dream,
And finally backstops guard the workflow stream,
No more undead runs haunting the queue! 🐰✨

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch archon/task-fix-issue-1561

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Wirasm · 2026-05-04T09:34:28Z

Comprehensive PR Review

PR: #1563 — fix(workflows): prevent zombie workflow runs from hung Pi cleanup
Reviewed by: 4 specialized agents (code-review, error-handling, test-coverage, docs-impact)
Date: 2026-05-04

Summary

The fix correctly addresses the zombie workflow run problem (closes #1561) via two complementary mechanisms: a Promise.race timeout in event-bridge.ts to unblock hung Pi cleanup, and a defensive finally backstop in executor.ts to catch any run still left in 'running'. Implementation is type-safe, scope-disciplined, well-commented, and ships a strong regression test for the primary fix. No documentation changes needed.

Verdict: REQUEST_CHANGES — 2 medium issues, 3 low advisory issues. No blockers.

Severity	Count
🔴 CRITICAL	0
🟠 HIGH	0
🟡 MEDIUM	2
🟢 LOW	3

🟡 Medium Issues

M1: Backstop fires silently — no log when it corrects a zombie run

📍 packages/workflows/src/executor.ts:840-845

When the backstop detects backstopStatus === 'running' and calls failWorkflowRun, there is no structured log entry. Only the failure of the backstop is logged. If a zombie run is ever corrected in production, the DB record flips to 'failed' with nothing in the logs to show the backstop triggered — making incident investigation very difficult.

View fix

const backstopStatus = await deps.store.getWorkflowRunStatus(runId).catch(() => null);
if (backstopStatus === 'running') {
  getLog().warn(
    { workflowRunId: runId },
    'executor.backstop_triggered'
  );
  await deps.store
    .failWorkflowRun(runId, 'Workflow exited without finalizing — see logs')
    .catch((err: Error) => {
      getLog().error({ err, workflowRunId: runId }, 'executor.backstop_fail_failed');
    });
}

M2: executor.ts `finally` backstop has no test for the `'running'` branch

📍 packages/workflows/src/executor.ts:839-846 / packages/workflows/src/executor.test.ts

All makeStore mocks default to returning 'completed' from getWorkflowRunStatus. The guarded block (if (backstopStatus === 'running')) is dead code in the entire test suite. A typo in the status string ('Running' vs 'running') would silently disable zombie cleanup with no alarm.

View suggested tests

describe('finally backstop', () => {
  it('calls failWorkflowRun when run is still running at finally', async () => {
    const failSpy = mock(async () => {});
    const store = makeStore({
      getWorkflowRunStatus: mock(async () => 'running' as const),
      failWorkflowRun: failSpy,
    });
    const deps = makeDeps(store);
    await executeWorkflow(deps, makePlatform(), 'conv-1', '/tmp', makeWorkflow(), 'test', 'db-conv-1');
    const call = failSpy.mock.calls.find(
      (c: unknown[]) => typeof c[1] === 'string' && (c[1] as string).includes('exited without finalizing')
    );
    expect(call).toBeDefined();
  });

  it('does not call failWorkflowRun when run already completed', async () => {
    const failSpy = mock(async () => {});
    const store = makeStore({
      getWorkflowRunStatus: mock(async () => 'completed' as const),
      failWorkflowRun: failSpy,
    });
    const deps = makeDeps(store);
    await executeWorkflow(deps, makePlatform(), 'conv-1', '/tmp', makeWorkflow(), 'test', 'db-conv-1');
    const backstopCall = failSpy.mock.calls.find(
      (c: unknown[]) => typeof c[1] === 'string' && (c[1] as string).includes('exited without finalizing')
    );
    expect(backstopCall).toBeUndefined();
  });
});

🟢 Low Issues

View 3 low-priority suggestions

L1: Dangling `setTimeout` — can add up to 10s hang after fast Pi cleanup in CLI workflows

📍 packages/providers/src/community/pi/event-bridge.ts:416-418

When promptPromise settles before the 10s timer (normal/fast path), Promise.race resolves but the setTimeout stays queued. A live timer holds the Bun/Node event loop open. For fast CLI Pi workflows (<10s total), this can add up to 10s of unnecessary wait before process exit — a regression vs. pre-fix behavior.

// Fix: capture and clear the timer
let cleanupTimeoutId: ReturnType<typeof setTimeout> | undefined;
await Promise.race([
  promptPromise.catch(() => {
    /* errors already surfaced through the queue */
  }),
  new Promise<void>(resolve => {
    cleanupTimeoutId = setTimeout(resolve, 10_000);
  }),
]);
clearTimeout(cleanupTimeoutId);

Note: Does not affect server deployments (process never exits). Only impacts fast CLI Pi executions.

L2: No fast-path test for `bridgeSession.finally` completion time

📍 packages/providers/src/community/pi/event-bridge.test.ts:484-538

The regression test verifies elapsed < 11s (zombie path). No test verifies that when promptPromise settles quickly, cleanup also completes quickly. Pairs with L1 fix — a companion test would catch any future regression that accidentally reintroduces a fixed delay. Advisory only.

L3: `promptPromise.catch` swallows secondary cleanup errors silently

📍 packages/providers/src/community/pi/event-bridge.ts:413-415

Out of scope per scope.md (Pi SDK internals). A getLog().debug({ err }, 'pi.event-bridge.prompt_rejected_during_cleanup') inside the catch would surface SDK regressions during cleanup without polluting output. Advisory only.

What's Good

Two independent layers of defense: the Pi timeout prevents the hang; the executor backstop catches any future code path that misses failWorkflowRun. Neither depends on the other; both are needed.
event-bridge regression test is exemplary: directly reproduces the exact failure mode (never-settling prompt() after dispose()), validates cleanup within the 10s window, covers the consumer-throws path that triggers cleanup.
Mock completeness: proactively updated executor.test.ts and executor-preamble.test.ts to add getWorkflowRunStatus to makeStore — prevented 18/31 executor tests from failing.
Backstop condition is narrowly scoped: backstopStatus === 'running' means the backstop cannot fire for legitimately paused, cancelled, or completed runs.
CLAUDE.md compliance: within-process backstop correctly follows "No Autonomous Lifecycle Mutation Across Process Boundaries"; comments explain WHY not WHAT; no any types; full type annotations.
Scope discipline: Claude/Codex providers untouched, STEP_IDLE_TIMEOUT_MS untouched, no process.on('exit') handler added.

Suggested Next Steps

Add the warn log in the backstop if block (M1) — one line
Add the describe('finally backstop') tests (M2) — low effort, two it() blocks
Optionally fix the dangling timer (L1) — prevents potential 10s hang regression for fast CLI Pi workflows
Rebase onto dev (currently 2 commits behind)

Artifacts: /Users/rasmus/.archon/workspaces/coleam00/Archon/artifacts/runs/aba41af102dd6a958af92d5d4cac9300/review/

- Clear the cleanup setTimeout after Promise.race resolves to prevent a dangling timer from holding the event loop open for up to 10s in fast CLI Pi workflows - Add debug log inside promptPromise.catch for Pi SDK cleanup-phase rejections so secondary errors surface without polluting output - Add warn log when the executor finally backstop fires so zombie corrections are visible in logs (not just the DB record) - Add executor finally backstop tests covering the 'running' and 'completed' status branches - Add fast-path test asserting bridgeSession cleanup completes in <1s when prompt() settles promptly (guards the clearTimeout fix)

Wirasm · 2026-05-04T09:37:12Z

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-fix-issue-1561
Commit: 92886af
Philosophy: Fix everything unless clearly a new concern

Fixes Applied (5 total)

Severity	Count
🔴 CRITICAL	0
🟠 HIGH	0
🟡 MEDIUM	2
🟢 LOW	3

View all fixes

✅ Backstop fires silently (executor.ts:840) — Added getLog().warn({ workflowRunId: runId }, 'executor.backstop_triggered') so zombie corrections leave a log breadcrumb
✅ executor.ts finally backstop untested (executor.test.ts) — Added describe('finally backstop') with two tests: status 'running' fires failWorkflowRun, status 'completed' does not
✅ Dangling setTimeout (event-bridge.ts:416) — Extracted timer to cleanupTimeoutId; clearTimeout(cleanupTimeoutId) runs after Promise.race so fast CLI Pi workflows don't hang 10s post-completion
✅ Missing fast-path cleanup test (event-bridge.test.ts) — Added test asserting elapsed < 1_000ms when dispose() settles prompt() immediately; catches future fixed-delay regressions
✅ promptPromise.catch swallows cleanup errors silently (event-bridge.ts:413) — Added getLog().debug({ err }, 'pi.event-bridge.prompt_rejected_during_cleanup') with explanatory comment

Tests Added

event-bridge.test.ts: fast-path cleanup timing test (elapsed < 1_000)
executor.test.ts: backstop fires for 'running' status
executor.test.ts: backstop skips for 'completed' status

Skipped

(none — all findings addressed)

Validation

✅ Type check | ✅ Lint | ✅ Tests (43 event-bridge + 33 executor passed)

Self-fix by Archon · aggressive mode · fixes pushed to archon/task-fix-issue-1561

The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout race to prevent the cleanup from hanging when Pi's session.prompt() never settles after dispose(). That solved #1561 but introduced a magic number and unnecessary work. The await itself was the bug. The queue is closed before the await runs, and the .then() handlers attached to promptPromise only push to that queue — closed pushes are no-ops. There's nothing the cleanup is actually waiting for. Whether prompt() resolves in 1ms or never, no observable behavior changes for the caller. Drop the await entirely. Attach .catch() fire-and-forget so a stray async rejection (the .then() handlers should preclude this, but belt-and- suspenders) doesn't bubble up as an unhandled-rejection process exit. Cleanup now returns in <200ms regardless of Pi's prompt() behavior. Tests: - "cleanup does not block when prompt() hangs forever" — asserts <200ms (vs. the previous ~10s tolerance) - "late prompt() rejection does not become unhandled" — regression for the .catch() belt

…leam00#1563) * fix(workflows): prevent zombie workflow runs from hung Pi cleanup (coleam00#1561) When a Pi/Minimax DAG node received an SDK error result, the executor threw inside the for-await loop and bridgeSession's finally hung on `await promptPromise.catch(...)` because session.dispose() does not settle session.prompt(). Bun's event loop drained with no remaining I/O sources and the process exited before the catch and DB finalization could run, leaving the workflow run stuck in 'running' indefinitely. Changes: - Wrap the post-dispose prompt-promise await in Promise.race against a 10s timeout so cleanup always completes - Add a defensive finally backstop in executeWorkflow that flips any still-'running' run to 'failed' before returning - Add regression test for the cleanup timeout - Update store mocks in executor tests to include getWorkflowRunStatus Fixes coleam00#1561 * fix: address review findings from PR coleam00#1563 - Clear the cleanup setTimeout after Promise.race resolves to prevent a dangling timer from holding the event loop open for up to 10s in fast CLI Pi workflows - Add debug log inside promptPromise.catch for Pi SDK cleanup-phase rejections so secondary errors surface without polluting output - Add warn log when the executor finally backstop fires so zombie corrections are visible in logs (not just the DB record) - Add executor finally backstop tests covering the 'running' and 'completed' status branches - Add fast-path test asserting bridgeSession cleanup completes in <1s when prompt() settles promptly (guards the clearTimeout fix) * simplify: use unknown instead of Error in backstop catch handler * fix(pi): drop blocking await on prompt promise; cleanup is non-blocking The previous fix wrapped `await promptPromise.catch(...)` in a 10s timeout race to prevent the cleanup from hanging when Pi's session.prompt() never settles after dispose(). That solved coleam00#1561 but introduced a magic number and unnecessary work. The await itself was the bug. The queue is closed before the await runs, and the .then() handlers attached to promptPromise only push to that queue — closed pushes are no-ops. There's nothing the cleanup is actually waiting for. Whether prompt() resolves in 1ms or never, no observable behavior changes for the caller. Drop the await entirely. Attach .catch() fire-and-forget so a stray async rejection (the .then() handlers should preclude this, but belt-and- suspenders) doesn't bubble up as an unhandled-rejection process exit. Cleanup now returns in <200ms regardless of Pi's prompt() behavior. Tests: - "cleanup does not block when prompt() hangs forever" — asserts <200ms (vs. the previous ~10s tolerance) - "late prompt() rejection does not become unhandled" — regression for the .catch() belt

simplify: use unknown instead of Error in backstop catch handler

6dd2e79

Wirasm mentioned this pull request May 4, 2026

Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream #1561

Closed

Wirasm marked this pull request as ready for review May 4, 2026 12:55

Wirasm merged commit 5593498 into dev May 4, 2026
6 of 7 checks passed

Wirasm mentioned this pull request May 4, 2026

fix(pi): surface SDK error messages and cap concurrency to stop cascade failures #1572

Merged

This was referenced May 5, 2026

Internal/main #1584

Closed

fix(core): Enable Pi agent to successfully use slash commands: Accumulate multi-chunk AI commands before parsing #1581

Merged

fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun (closes #1392) #1646

Merged

Wirasm mentioned this pull request May 12, 2026

Release 0.3.11 #1653

Merged

This was referenced May 13, 2026

feat(workflows): add $WORKFLOW_NAME variable and effort estimation node #1666

Closed

fix(cli/workflow): add SIGTERM grace period to avoid losing completion to a race #1707

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workflows): prevent zombie workflow runs from hung Pi cleanup#1563

fix(workflows): prevent zombie workflow runs from hung Pi cleanup#1563
Wirasm merged 4 commits into
devfrom
archon/task-fix-issue-1561

Wirasm commented May 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 4, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

Wirasm commented May 4, 2026

L1: Dangling `setTimeout` — can add up to 10s hang after fast Pi cleanup in CLI workflows

L2: No fast-path test for `bridgeSession.finally` completion time

L3: `promptPromise.catch` swallows secondary cleanup errors silently

Uh oh!

Wirasm commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Wirasm commented May 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

UX Journey

Before

After

Architecture Diagram

Before

After

Label Snapshot

Change Metadata

Linked Issue

Validation Evidence (required)

Security Impact (required)

Compatibility / Migration

Human Verification (required)

Side Effects / Blast Radius (required)

Rollback Plan (required)

Risks and Mitigations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

Wirasm commented May 4, 2026

Comprehensive PR Review

Summary

🟡 Medium Issues

M1: Backstop fires silently — no log when it corrects a zombie run

M2: executor.ts finally backstop has no test for the 'running' branch

🟢 Low Issues

L1: Dangling setTimeout — can add up to 10s hang after fast Pi cleanup in CLI workflows

L2: No fast-path test for bridgeSession.finally completion time

L3: promptPromise.catch swallows secondary cleanup errors silently

What's Good

Suggested Next Steps

Uh oh!

Wirasm commented May 4, 2026

⚡ Self-Fix Report (Aggressive)

Fixes Applied (5 total)

Tests Added

Skipped

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Wirasm commented May 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading

M2: executor.ts `finally` backstop has no test for the `'running'` branch

L1: Dangling `setTimeout` — can add up to 10s hang after fast Pi cleanup in CLI workflows

L2: No fast-path test for `bridgeSession.finally` completion time

L3: `promptPromise.catch` swallows secondary cleanup errors silently