Summary
- What broke: When a DAG node returns an SDK error result mid-stream, the workflow's bash process can exit cleanly with code 0 without running the finalization block. The
remote_agent_workflow_runs row stays stuck in running forever, blocking subsequent workflows on the same path (since the path-lock check counts non-terminal runs).
- Severity:
major — recoverable via manual archon workflow abandon, but degrades trust in the lifecycle and silently accumulates zombie rows.
- Most reliable trigger today: Pi/Minimax provider hitting an SDK error on a node, e.g.
closed-dedup-check in repo-triage-minimax (which uses the Claude-only agents: feature → Pi rejects → SDK error after ~31m).
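To make the blocking effect concrete, here is a minimal sketch of a path-lock check that counts non-terminal runs. The `WorkflowRun` shape, the status names, and `blockingRuns` are all hypothetical stand-ins, not the real schema or the real check.

```typescript
// Hypothetical shapes: the real schema and status names may differ.
type RunStatus = 'pending' | 'running' | 'paused' | 'completed' | 'failed' | 'cancelled';

interface WorkflowRun {
  id: string;
  workingPath: string;
  status: RunStatus;
}

// Assumed set of terminal statuses; any other status holds the lock.
const TERMINAL: ReadonlySet<RunStatus> = new Set<RunStatus>(['completed', 'failed', 'cancelled']);

// Returns the runs that would block a new workflow on `path`.
function blockingRuns(runs: WorkflowRun[], path: string): WorkflowRun[] {
  return runs.filter((r) => r.workingPath === path && !TERMINAL.has(r.status));
}

const runs: WorkflowRun[] = [
  { id: 'a', workingPath: '/repo', status: 'failed' },
  { id: 'b', workingPath: '/repo', status: 'running' }, // zombie row left by the bug
  { id: 'c', workingPath: '/other', status: 'running' },
];

const blockers = blockingRuns(runs, '/repo').map((r) => r.id);
console.log(blockers);
```

Under these assumptions a single row that never leaves running is enough to block every later run on the same path until someone abandons it.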
Steps to Reproduce
- Run any DAG workflow on the
pi provider that contains a node which will trip an SDK error. Easy repro:
archon workflow run repo-triage-minimax ""
(closed-dedup-check declares agents: which Pi doesn't support — the node logs dag.unsupported_capabilities then runs, eventually hitting dag.node_sdk_error_result after a long timeout.)
- Wait for the bash process to exit. Task notification reports exit code 0.
- Run archon workflow status: the run is still listed as running.
- Re-launch the same workflow (or any workflow on the same path) — it errors with
❌ This worktree is in use by ....
Expected vs Actual
- Expected: Workflow run row transitions to
failed (one node failed → failWorkflowRun fires). CLI either prints Workflow did not complete successfully and exits non-zero, or prints a clear failure summary. Subsequent runs on the same path are not blocked.
- Actual: The run stays
running indefinitely. CLI exits 0. No failure log, no terminal status. Operator must manually archon workflow abandon <id>.
What the logs show (and don't)
In the failing run, dag.node_sdk_error_result is the last log line. None of the events that should follow ever fire:
| Expected log | File | Fired? |
| --- | --- | --- |
| dag_node_failed (catch in executeNodeInternal) | dag-executor.ts:1222 | ❌ |
| dag_layer_had_failures | dag-executor.ts:2990 | ❌ |
| dag_workflow_finished | dag-executor.ts:3057 | ❌ |
| failWorkflowRun | dag-executor.ts:3100 | ❌ |
| CLI Workflow did not complete successfully | cli/commands/workflow.ts:746 | ❌ |
The throw at dag-executor.ts:916 (`Node '${node.id}' failed: SDK returned ${subtype}`) should be caught at :1204, return { state: 'failed' }, and let the layer loop process it normally. For Pi/Minimax it doesn't.
User Flow
CLI                       executeWorkflow             executeDagWorkflow         executeNodeInternal      Pi/Minimax
───                       ───────────────             ───────────────────        ──────────────────       ──────────
archon workflow run ────▶ create run row (running)
                          executeDagWorkflow ───────▶ layer loop
                                                      executeNodeInternal ─────▶ for-await stream ──────▶ pi.sendQuery
                                                                                                          …31m…
                                                                                                          result.isError
                                                                                 log dag.node_sdk_error_result
                                                                                 throw Error
                                                                                 [X] catch never logs dag_node_failed
                                                                                 process exits cleanly (code 0)
                                                      (no return, no finalize)
                          (no terminal DB write)
(no result message printed)
exit 0 ◀──────────────────
DB row: still 'running'
Two layers of fix needed
1. Root cause (Pi-specific) — investigate why the SDK throw at dag-executor.ts:916 doesn't reach the catch at :1204 for Pi/Minimax nodes. Suspects:
- A fire-and-forget
deps.store.createWorkflowEvent(...).catch(logErr) upstream firing an unhandled rejection that aborts the runtime
- Pi's stream not closing after
result.isError, leaving for await suspended, with cleanup hitting a Bun process-exit watchdog
- Provider-level signal-handling / event-emitter teardown order
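One mechanism consistent with the second suspect, offered as a hypothesis rather than a confirmed cause: when the body of a for await loop throws, the loop first awaits the iterator's return() and only then propagates the error. If the stream's teardown never settles, the catch never runs. The sketch below uses a hypothetical `stuckStream` generator in place of the Pi stream, and a short timer so the demo itself terminates.

```typescript
// Hypothetical stand-in for a provider stream whose teardown never completes.
async function* stuckStream(): AsyncGenerator<{ isError: boolean }> {
  try {
    yield { isError: true };
  } finally {
    // Simulates cleanup that waits on an event that never arrives.
    await new Promise<void>(() => {});
  }
}

// Records which steps ran, in order.
const steps: string[] = [];

async function runNode(): Promise<string> {
  try {
    for await (const msg of stuckStream()) {
      if (msg.isError) {
        steps.push('throw');
        throw new Error('SDK returned error');
      }
    }
    return 'completed';
  } catch {
    // Not reached here: the loop is still awaiting stuckStream's return().
    steps.push('catch');
    return 'failed';
  }
}

// Race the node against a short timer so the demo does not hang silently.
async function demo(): Promise<string> {
  const timeout = new Promise<string>((resolve) => setTimeout(() => resolve('still pending'), 50));
  return Promise.race([runNode(), timeout]);
}

const outcome = demo();
outcome.then((o) => console.log(o, steps));
```

In Node.js, a process whose only remaining work is a pending promise like this, with no timers or sockets keeping the event loop alive, exits with code 0 and never resumes the awaiting function. Whether Bun behaves the same way, and whether the Pi stream actually hangs in teardown, are assumptions to verify.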
A quick win: add getLog().error(..., 'dag_node_executor_internal_throw') immediately before `throw new Error(...)` at :916 so we know whether the throw was emitted, then add a log call as the first statement of the catch block at :1204 so we know whether the catch was entered. That bisects which side dies.
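The two markers could look roughly like this. It is a simplified model, not the real executeNodeInternal: the `log` object is a hypothetical in-memory stand-in for getLog(), the event name `dag_node_executor_catch_entered` is invented for illustration, and `example_error` is a placeholder subtype.

```typescript
// Hypothetical in-memory logger standing in for getLog(); it records event names only.
const events: string[] = [];
const log = {
  error: (_fields: Record<string, unknown>, event: string): void => {
    events.push(event);
  },
};

interface SdkResult {
  isError: boolean;
  subtype: string;
}

type NodeOutcome = { state: 'completed' } | { state: 'failed'; error: string };

// Simplified model of the node executor with both bisect markers in place.
async function executeNodeSketch(nodeId: string, stream: AsyncIterable<SdkResult>): Promise<NodeOutcome> {
  try {
    for await (const result of stream) {
      if (result.isError) {
        // Marker 1: proves the throw site was reached.
        log.error({ nodeId, subtype: result.subtype }, 'dag_node_executor_internal_throw');
        throw new Error(`Node '${nodeId}' failed: SDK returned ${result.subtype}`);
      }
    }
    return { state: 'completed' };
  } catch (err) {
    // Marker 2 (hypothetical event name): proves the catch was entered.
    log.error({ nodeId }, 'dag_node_executor_catch_entered');
    return { state: 'failed', error: (err as Error).message };
  }
}

// A well-behaved stream that reports one error result and then ends.
async function* errorStream(): AsyncGenerator<SdkResult> {
  yield { isError: true, subtype: 'example_error' };
}

const bisect = executeNodeSketch('closed-dedup-check', errorStream());
bisect.then((outcome) => console.log(outcome.state, events));
```

With a well-behaved stream both markers appear in order. On a failing Pi run, seeing marker 1 without marker 2 would point at something between the throw and the catch, such as iterator teardown; seeing neither would point upstream of :916.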
2. Defensive backstop (provider-agnostic) — executor.ts has no top-level finally that guarantees a terminal DB status before exit. Add:
// In executor.ts, wrapping the body of executeWorkflow
try {
  // ... existing logic
} finally {
  // Backstop: if the run never reached a terminal status, mark it failed.
  const final = await deps.store.getWorkflowRun(workflowRun.id);
  if (final && (final.status === 'running' || final.status === null)) {
    await deps.store.failWorkflowRun(workflowRun.id, 'Workflow exited without finalizing (see logs)');
  }
}
This wouldn't mask the underlying bug (the operator still gets a failed row + log line) but would prevent zombie accumulation.
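A self-contained sketch of the same backstop, runnable without the real store. `InMemoryStore`, `runWithBackstop`, and `completeWorkflowRun` are hypothetical names for illustration; only getWorkflowRun and failWorkflowRun come from the snippet above. The inner try/catch is an addition, so that a store failure inside the finally does not replace the original error.

```typescript
type Status = 'running' | 'completed' | 'failed';

interface RunRow {
  status: Status;
  reason?: string;
}

// Hypothetical in-memory store; the real deps.store is backed by a database.
class InMemoryStore {
  private rows = new Map<string, RunRow>();

  createRun(id: string): void {
    this.rows.set(id, { status: 'running' });
  }
  async getWorkflowRun(id: string): Promise<RunRow | undefined> {
    return this.rows.get(id);
  }
  async completeWorkflowRun(id: string): Promise<void> {
    this.rows.set(id, { status: 'completed' });
  }
  async failWorkflowRun(id: string, reason: string): Promise<void> {
    this.rows.set(id, { status: 'failed', reason });
  }
}

// Runs `body`, then guarantees the row is not left in 'running'.
async function runWithBackstop(store: InMemoryStore, runId: string, body: () => Promise<void>): Promise<void> {
  try {
    await body();
  } finally {
    try {
      const final = await store.getWorkflowRun(runId);
      if (final && final.status === 'running') {
        await store.failWorkflowRun(runId, 'Workflow exited without finalizing');
      }
    } catch (backstopErr) {
      // Log and swallow so the original error, if any, still propagates.
      console.error('backstop_failed', backstopErr);
    }
  }
}

async function demo(): Promise<Status[]> {
  const store = new InMemoryStore();
  const ids = ['forgot', 'ok', 'threw'];
  ids.forEach((id) => store.createRun(id));

  await runWithBackstop(store, 'forgot', async () => {}); // returns without finalizing
  await runWithBackstop(store, 'ok', () => store.completeWorkflowRun('ok'));
  await runWithBackstop(store, 'threw', async () => {
    throw new Error('boom');
  }).catch(() => {});

  const rows = await Promise.all(ids.map((id) => store.getWorkflowRun(id)));
  return rows.map((row) => row!.status);
}

const statuses = demo();
statuses.then((s) => console.log(s)); // [ 'failed', 'completed', 'failed' ]
```

One caveat: a finally block only runs if control returns to the function. If the process exits while an await inside the try is still pending, the backstop never fires, so it probably needs to be paired with the signal handling discussed next or with a startup sweep of stale rows.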
Also worth adding a CLI-level process.on('exit') / SIGTERM / SIGINT handler that flips any active runs from this process to failed before exit, since Bun async log flushes can be lost on abrupt exit.
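A rough sketch of the signal half of that idea. In Node.js, listeners for the 'exit' event must be synchronous, so an async DB write cannot be awaited there; this sketch therefore covers only SIGINT and SIGTERM and assumes Bun behaves similarly. `registerShutdownHandlers` and the `failActiveRuns` callback are hypothetical, and `exit` is injected so the code can be exercised without killing the process.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical callback that marks this process's active runs as failed.
type FailActiveRuns = (reason: string) => Promise<void>;

// Registers SIGINT/SIGTERM listeners on a process-like emitter.
function registerShutdownHandlers(
  proc: EventEmitter,
  failActiveRuns: FailActiveRuns,
  exit: (code: number) => void,
): void {
  let shuttingDown = false;
  // Exit codes follow the 128 + signal number convention.
  for (const [signal, code] of [['SIGINT', 130], ['SIGTERM', 143]] as const) {
    proc.on(signal, () => {
      if (shuttingDown) return; // ignore repeated signals while cleanup is in flight
      shuttingDown = true;
      failActiveRuns(`Interrupted by ${signal}`)
        .catch((err) => console.error('failed to mark runs as failed', err))
        .finally(() => exit(code));
    });
  }
}

// Exercise the handler with a fake emitter instead of the real process object.
const fakeProc = new EventEmitter();
const reasons: string[] = [];
const exited = new Promise<number>((resolve) => {
  registerShutdownHandlers(fakeProc, async (reason) => { reasons.push(reason); }, resolve);
});
fakeProc.emit('SIGTERM');
exited.then((code) => console.log(code, reasons));
```

In the CLI this would presumably be wired as registerShutdownHandlers(process, ..., process.exit). It does not cover the clean exit 0 case from this issue, where no signal is delivered.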
Environment
- Platform: CLI
- Database: SQLite
- Running in worktree?: No (
worktree.enabled: false workflows — repo-triage-minimax, maintainer-review-pr with paused gate, etc.)
- OS: macOS (Darwin 25.3.0), Bun runtime
Impact
- Affected workflows/commands: any DAG workflow on
pi provider where a node hits an SDK error — observed today on repo-triage-minimax (closed-dedup-check) and 3 days ago on 6× maintainer-review-pr runs (Pi code-review nodes). Other providers may be vulnerable to the defensive-gap layer.
- Reproduction rate: Always (for
repo-triage-minimax's closed-dedup-check on Pi)
- Workaround:
archon workflow abandon <run-id> after each affected run. Tedious and easy to miss.
- Data loss risk: No (only DB lifecycle state; artifacts on disk are intact).
Scope
- Package(s) likely involved:
workflows, providers (community/pi)
- Modules:
workflows:dag-executor (lines 916, 1204, 3057), workflows:executor (lines 730–765), providers:community/pi:provider, cli:commands/workflow