Workflow runs zombie in 'running' state when DAG node hits SDK error mid-stream #1561

@Wirasm

Summary

  • What broke: When a DAG node returns an SDK error result mid-stream, the workflow's bash process can exit cleanly with code 0 without running the finalization block. The remote_agent_workflow_runs row stays stuck in running forever, blocking subsequent workflows on the same path (since the path-lock check counts non-terminal runs).
  • Severity: major — recoverable via manual archon workflow abandon, but degrades trust in the lifecycle and silently accumulates zombie rows.
  • Most reliable trigger today: Pi/Minimax provider hitting an SDK error on a node, e.g. closed-dedup-check in repo-triage-minimax (which uses the Claude-only agents: feature → Pi rejects → SDK error after ~31m).
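To make the blocking mechanism concrete, here is a minimal sketch of a path-lock check that counts non-terminal runs. The `RunRow` shape, the status names other than `running` and `failed`, and `findBlockingRun` are assumptions for illustration, not the actual archon schema or API.

```typescript
// Hypothetical status set; the issue only confirms 'running' and 'failed'.
type RunStatus = 'pending' | 'running' | 'paused' | 'completed' | 'failed' | 'cancelled';

// Hypothetical row shape standing in for remote_agent_workflow_runs.
interface RunRow {
  id: string;
  workingPath: string;
  status: RunStatus;
}

// Statuses assumed to release the path lock.
const TERMINAL_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>([
  'completed',
  'failed',
  'cancelled',
]);

// Returns the first run that still holds the path, or undefined if the path is free.
function findBlockingRun(rows: RunRow[], path: string): RunRow | undefined {
  return rows.find((row) => row.workingPath === path && !TERMINAL_STATUSES.has(row.status));
}

// A zombie row keeps the path locked even though its process is long gone.
const zombieRows: RunRow[] = [{ id: 'run-1', workingPath: '/repo', status: 'running' }];
const blockedBy = findBlockingRun(zombieRows, '/repo');
```

Under these assumptions, `blockedBy` is the zombie row until something writes a terminal status to it, which is why every later run on the same path is rejected.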

Steps to Reproduce

  1. Run any DAG workflow on the pi provider that contains a node which will trip an SDK error. Easy repro:
    archon workflow run repo-triage-minimax ""
    (closed-dedup-check declares agents: which Pi doesn't support — the node logs dag.unsupported_capabilities then runs, eventually hitting dag.node_sdk_error_result after a long timeout.)
  2. Wait for the bash process to exit. Task notification reports exit code 0.
  3. archon workflow status — the run is still listed as running.
  4. Re-launch the same workflow (or any workflow on the same path) — it errors with ❌ This worktree is in use by ....

Expected vs Actual

  • Expected: Workflow run row transitions to failed (one node failed → failWorkflowRun fires). CLI either prints Workflow did not complete successfully and exits non-zero, or prints a clear failure summary. Subsequent runs on the same path are not blocked.
  • Actual: The run stays running indefinitely. CLI exits 0. No failure log, no terminal status. Operator must manually archon workflow abandon <id>.

What the logs show (and don't)

In the failing run, dag.node_sdk_error_result is the last log line. None of the events that should follow ever fire:

Expected log                                      File                           Fired?
dag_node_failed (catch in executeNodeInternal)    dag-executor.ts:1222           No
dag_layer_had_failures                            dag-executor.ts:2990           No
dag_workflow_finished                             dag-executor.ts:3057           No
failWorkflowRun                                   dag-executor.ts:3100           No
CLI Workflow did not complete successfully        cli/commands/workflow.ts:746   No

The throw at dag-executor.ts:916 (`Node '${node.id}' failed: SDK returned ${subtype}`) should be caught at :1204, return { state: 'failed' }, and let the layer loop process it normally. For Pi/Minimax it doesn't.
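For reference, the expected control flow can be sketched roughly as below. This is a simplified synchronous model, not the real dag-executor code: `executeNodeSketch`, `runLayerSketch`, and the `error_during_execution` subtype value are placeholders, and only the log event names are taken from this issue.

```typescript
type NodeOutcome = { state: 'completed' } | { state: 'failed'; error: string };

interface NodeInput {
  id: string;
  subtype: string | null; // null means the SDK result was not an error
}

// Models the throw at :916 and the catch at :1204.
function executeNodeSketch(node: NodeInput, log: string[]): NodeOutcome {
  try {
    if (node.subtype !== null) {
      log.push('dag.node_sdk_error_result');
      throw new Error(`Node '${node.id}' failed: SDK returned ${node.subtype}`);
    }
    return { state: 'completed' };
  } catch (err) {
    log.push('dag_node_failed');
    return { state: 'failed', error: (err as Error).message };
  }
}

// Models the layer loop and finalization: one failed node fails the whole run.
function runLayerSketch(nodes: NodeInput[], log: string[]): 'completed' | 'failed' {
  const outcomes = nodes.map((node) => executeNodeSketch(node, log));
  const anyFailed = outcomes.some((outcome) => outcome.state === 'failed');
  if (anyFailed) log.push('dag_layer_had_failures');
  log.push('dag_workflow_finished');
  if (anyFailed) {
    log.push('failWorkflowRun');
    return 'failed';
  }
  return 'completed';
}

const expectedLog: string[] = [];
const expectedStatus = runLayerSketch(
  [{ id: 'closed-dedup-check', subtype: 'error_during_execution' }],
  expectedLog,
);
```

In the failing Pi/Minimax runs, only the first entry of `expectedLog` (`dag.node_sdk_error_result`) ever shows up.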

User Flow

CLI                       executeWorkflow              executeDagWorkflow         executeNodeInternal     Pi/Minimax
───                       ───────────────              ───────────────────         ──────────────────       ──────────
archon workflow run ────▶ create run row (running)
                          executeDagWorkflow ───────▶  layer loop
                                                       executeNodeInternal ─────▶ for-await stream ──────▶ pi.sendQuery
                                                                                                           …31m…
                                                                                                           result.isError
                                                                                  log dag.node_sdk_error_result
                                                                                  throw Error
                                                                                  [X] catch never logs dag_node_failed
                                                                                  process exits cleanly (code 0)
                          (no return, no finalize)                                
                          (no terminal DB write)
                          (no result message printed)
exit 0 ◀──────────────────                                                        
                                                                                                          DB row: still 'running'

Two layers of fix needed

1. Root cause (Pi-specific) — investigate why the SDK throw at dag-executor.ts:916 doesn't reach the catch at :1204 for Pi/Minimax nodes. Suspects:

  • A fire-and-forget deps.store.createWorkflowEvent(...).catch(logErr) upstream firing an unhandled rejection that aborts the runtime
  • Pi's stream not closing after result.isError, leaving for await suspended, with cleanup hitting a Bun process-exit watchdog
  • Provider-level signal-handling / event-emitter teardown order
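The second suspect is compatible with standard JavaScript semantics: when the body of a `for await` loop throws, the runtime first calls the iterator's `return()` and awaits it before the exception propagates. If the stream's teardown never settles, the surrounding `catch` and `finally` never run. The sketch below demonstrates that mechanism only; `providerStream` and `runNodeSketch` are hypothetical stand-ins, and whether Pi's stream actually behaves this way is unverified.

```typescript
const hangTrace: string[] = [];

// Hypothetical provider stream whose cleanup never finishes.
async function* providerStream(): AsyncGenerator<{ isError: boolean }> {
  try {
    yield { isError: true };
  } finally {
    hangTrace.push('stream cleanup started');
    await new Promise<never>(() => {}); // assumption: teardown never settles
  }
}

async function runNodeSketch(): Promise<'completed' | 'failed'> {
  try {
    for await (const msg of providerStream()) {
      if (msg.isError) {
        hangTrace.push('throwing');
        // The loop must close the iterator before this error can escape.
        throw new Error('SDK returned error');
      }
    }
    return 'completed';
  } catch {
    hangTrace.push('catch reached');
    return 'failed';
  } finally {
    hangTrace.push('finally reached');
  }
}

// The returned promise stays pending forever, so it is deliberately not awaited.
void runNodeSketch();
```

Here `hangTrace` stops at `throwing` followed by `stream cleanup started`; neither `catch reached` nor `finally reached` is ever recorded. If this is what happens for Pi, a `finally`-based backstop would not fire for this failure mode either, which is worth checking when bisecting.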

A quick win: add getLog().error(..., 'dag_node_executor_internal_throw') immediately before `throw new Error(...)` at :916 so we know whether the throw was emitted, then add a log call as the first statement of the catch block at :1204. That bisects which side dies.

2. Defensive backstop (provider-agnostic): executor.ts has no top-level finally that guarantees a terminal DB status before exit. Add:

// In executor.ts executeWorkflow's outer try
try {
  // ... existing logic
} finally {
  const final = await deps.store.getWorkflowRun(workflowRun.id);
  if (final && (final.status === 'running' || final.status === null)) {
    await deps.store.failWorkflowRun(workflowRun.id, 'Workflow exited without finalizing — see logs');
  }
}

This wouldn't mask the underlying bug (the operator still gets a failed row + log line) but would prevent zombie accumulation.

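Also worth adding CLI-level process.on('exit') / SIGTERM / SIGINT handlers that flip any active runs from this process to failed before exit, since Bun async log flushes can be lost on abrupt exit. An 'exit' listener can only do synchronous work, so the status write in that handler has to be synchronous.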
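A rough sketch of what such a CLI-level handler could look like follows. Everything named here is hypothetical: `activeRunIds`, `failActiveRuns`, `installExitBackstop`, and the synchronous `failRunSync` callback are illustrations, and the real store API may only offer async writes. `ProcessLike` is a minimal slice of the process API so the sketch does not depend on Node or Bun typings.

```typescript
// Minimal slice of the process API used by this sketch.
interface ProcessLike {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
  exit(code?: number): void;
}

// Hypothetical synchronous store call; 'exit' listeners cannot wait for async work.
type FailRunSync = (runId: string, reason: string) => void;

// Hypothetical registry of runs started by this CLI process.
const activeRunIds = new Set<string>();

// Marks every still-active run as failed and returns how many were flipped.
function failActiveRuns(failRunSync: FailRunSync, reason: string): number {
  let flipped = 0;
  for (const runId of activeRunIds) {
    failRunSync(runId, reason);
    flipped += 1;
  }
  activeRunIds.clear();
  return flipped;
}

function installExitBackstop(proc: ProcessLike, failRunSync: FailRunSync): void {
  // Covers exits that bypass normal finalization, including an emptied event loop.
  proc.on('exit', () => {
    failActiveRuns(failRunSync, 'CLI process exited with run still active');
  });
  for (const signal of ['SIGINT', 'SIGTERM']) {
    proc.on(signal, () => {
      failActiveRuns(failRunSync, `CLI process received ${signal}`);
      proc.exit(1); // the 'exit' listener then finds an empty registry and does nothing
    });
  }
}

// Usage in the CLI entrypoint (assumed wiring):
//   installExitBackstop(process, (runId, reason) => { /* synchronous DB write */ });
//   activeRunIds.add(workflowRun.id) when a run starts, delete it after finalization.
```

Unlike a `finally` block, the 'exit' listener still fires when the process ends because the event loop has drained, so the two backstops are complementary.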

Environment

  • Platform: CLI
  • Database: SQLite
  • Running in worktree?: No (worktree.enabled: false workflows — repo-triage-minimax, maintainer-review-pr with paused gate, etc.)
  • OS: macOS (Darwin 25.3.0), Bun runtime

Impact

  • Affected workflows/commands: any DAG workflow on pi provider where a node hits an SDK error — observed today on repo-triage-minimax (closed-dedup-check) and 3 days ago on 6× maintainer-review-pr runs (Pi code-review nodes). Other providers may be vulnerable to the defensive-gap layer.
  • Reproduction rate: Always (for repo-triage-minimax's closed-dedup-check on Pi)
  • Workaround: archon workflow abandon <run-id> after each affected run. Tedious and easy to miss.
  • Data loss risk: No (only DB lifecycle state; artifacts on disk are intact).

Scope

  • Package(s) likely involved: workflows, providers (community/pi)
  • Modules: workflows:dag-executor (lines 916, 1204, 3057), workflows:executor (lines 730–765), providers:community/pi:provider, cli:commands/workflow

Labels: bug (Something is broken)
