Skip to content

bug(workflow): two silent-failure paths in DAG executor — Codex turn.failed exits stream, Claude success-with-stop-sequence thrown as failure #1425

@kallitaxi

Description

@kallitaxi

Filing two related silent-vs-loud failure modes in packages/workflows/src/dag-executor.ts. They were observed during a single workflow run (archon-fix-github-issue against an external repo) and combine to produce a workflow run that looks completed but is fictional.

Both stem from the executor not handling SDK result quirks defensively. Listing them together because the second-order behaviour they produce is the same: a draft PR opened with a fabricated description and an empty diff, while downstream nodes either run on the empty diff or get skipped by trigger rules.


Bug A — Codex turn.failed silently exits the stream → executor records success with empty output

Repro path: any Codex provider failure that surfaces as turn.failed rather than a thrown exception (auth error misclassified as crash via #1266; mid-turn rate limit; transient API error).

Where the failure is swallowed:

  • packages/providers/src/codex/provider.ts:222-228streamCodexEvents handler for turn.failed yields a system chunk with the error text and then breaks out of the loop. The async generator exits without throwing and without emitting a result chunk.

Where the executor mis-classifies it as success:

  • packages/workflows/src/dag-executor.ts:751nodeOutputText only accumulates on assistant chunks
  • packages/workflows/src/dag-executor.ts:844result chunk is the terminal signal, but no flag tracks whether it was actually seen
  • packages/workflows/src/dag-executor.ts:1105 — falls through to dag_node_completed log
  • packages/workflows/src/dag-executor.ts:1147 — returns { state: 'completed', output: '' }
  • packages/workflows/src/dag-executor.ts:2917-2992 — run-level status derivation sees anyFailed=false and marks the workflow completed

Why "5 retries / 18 s" instead of fail-fast on a 401: provider.ts:121-128,449 correctly classifies 401 as shouldRetry: false, but #1266's AbortSignal-poison bug causes each retry attempt to be SIGTERM'd before it authenticates. The error gets reclassified as a crash (class A in #1266) and runs through the retry loop. After exhaustion, Bug A above swallows the failure.

This bug is distinct from #1266: even with #1266 fixed, the turn.failed path remains a silent-success vector.

Suggested fix:

  1. provider.ts:222-228 — replace break with throw new Error(...) so the generator surfaces the failure
  2. dag-executor.ts — add a receivedResultChunk flag set true at the line ~873 result handler; before emitting dag_node_completed at line 1105, treat !receivedResultChunk && nodeOutputText === '' as state: 'failed'
  3. (stretch) Provider-level auth precheck before the first SDK call — fail fast with a clear "Codex auth not configured" error rather than relying on the SDK to surface a 401

Bug B — Claude SDK result with is_error: true + errorSubtype: 'success' is thrown as a failure

Repro path: any Claude node configured with stop sequences (review-classify pipelines, structured-output nodes, agents that use </foo> style termination markers). The model produces the stop sequence, signalling clean completion. The Claude SDK still flags this as is_error: true because the stop reason is stop_sequence rather than end_turn.

Where the false-failure is thrown:

  • packages/workflows/src/dag-executor.ts:908:
    throw new Error(`Node '${node.id}' failed: SDK returned ${subtype}${errorsDetail}`);
    When msg.errorSubtype === 'success', this throws "Node 'X' failed: SDK returned success" — a contradictory error message that's user-hostile and operationally wrong (the node actually completed correctly).

Observed: in a real run, two review-pipeline nodes (code-review, test-coverage) both produced this error simultaneously, blocking the trigger-rule join for synthesize / self-fix / simplify / report (all four were skipped). The agents had done their work; the workflow couldn't deliver the result.

Suggested fix:

  • dag-executor.ts:908 — check if (subtype === 'success') { ... treat as completed ... } else { throw }. Alternatively, gate on the presence of meaningful msg.errors content rather than the is_error boolean alone, since errorSubtype: 'success' is the SDK's own way of saying "this isn't really an error."

Combined effect (real-world incident)

A single archon-fix-github-issue run against kallitaxi/Client-Dashboard#2:

  1. Codex implementation node hit Bug A → produced zero source code, executor recorded success
  2. Workflow proceeded to create-pr node, which generated a draft PR description from the plan (not the diff) — body claimed extensive Zoho integration work; actual diff was 23 unrelated build-fix lines, with Fixes #2 in the trailer
  3. Review-scope node correctly flagged "all Zoho-specific files are absent"
  4. code-review and test-coverage nodes ran, hit Bug B → false-failure
  5. Trigger rules skipped synthesize / self-fix / simplify / report — review findings never written back
  6. Workflow exited with ❌ DAG workflow ... completed with failures: 'code-review': SDK returned success; 'test-coverage': SDK returned success

The user almost merged the fictional PR (Fixes #2 would have auto-closed the issue while delivering none of the work).

Related

Why filing together

Both bugs live in the same file, both involve mis-handling SDK result patterns, and the fix for both is the same shape: distinguish "real failure" from "the SDK happened to use this protocol marker." The sequence above shows they compound — fixing only one leaves the other as a parallel silent-failure vector.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High priority - Address soon, next in queuearea: workflowsWorkflow enginebugSomething is brokencodexCodex SDK integrationeffort/highCross-cutting changes, multiple domains, requires design decisions

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions