Filing two related silent-vs-loud failure modes in packages/workflows/src/dag-executor.ts. They were observed during a single workflow run (archon-fix-github-issue against an external repo) and combine to produce a workflow run that looks completed but is fictional.
Both stem from the executor not handling SDK result quirks defensively. Listing them together because the second-order behaviour they produce is the same: a draft PR opened with a fabricated description and an empty diff, while downstream nodes either run on the empty diff or get skipped by trigger rules.
Bug A — Codex turn.failed silently exits the stream → executor records success with empty output
Repro path: any Codex provider failure that surfaces as turn.failed rather than a thrown exception (auth error misclassified as crash via #1266; mid-turn rate limit; transient API error).
Where the failure is swallowed:
packages/providers/src/codex/provider.ts:222-228 — streamCodexEvents handler for turn.failed yields a system chunk with the error text and then breaks out of the loop. The async generator exits without throwing and without emitting a result chunk.
Where the executor mis-classifies it as success:
packages/workflows/src/dag-executor.ts:751 — nodeOutputText only accumulates on assistant chunks
packages/workflows/src/dag-executor.ts:844 — result chunk is the terminal signal, but no flag tracks whether it was actually seen
packages/workflows/src/dag-executor.ts:1105 — falls through to dag_node_completed log
packages/workflows/src/dag-executor.ts:1147 — returns { state: 'completed', output: '' }
packages/workflows/src/dag-executor.ts:2917-2992 — run-level status derivation sees anyFailed=false and marks the workflow completed
Why "5 retries / 18 s" instead of fail-fast on a 401: provider.ts:121-128,449 correctly classifies 401 as shouldRetry: false, but #1266's AbortSignal-poison bug causes each retry attempt to be SIGTERM'd before it authenticates. The error gets reclassified as a crash (class A in #1266) and runs through the retry loop. After exhaustion, Bug A above swallows the failure.
This bug is distinct from #1266: even with #1266 fixed, the turn.failed path remains a silent-success vector.
Suggested fix:
provider.ts:222-228 — replace break with throw new Error(...) so the generator surfaces the failure
dag-executor.ts — add a receivedResultChunk flag set true at the line ~873 result handler; before emitting dag_node_completed at line 1105, treat !receivedResultChunk && nodeOutputText === '' as state: 'failed'
- (stretch) Provider-level auth precheck before the first SDK call — fail fast with a clear "Codex auth not configured" error rather than relying on the SDK to surface a 401
Bug B — Claude SDK result with is_error: true + errorSubtype: 'success' is thrown as a failure
Repro path: any Claude node configured with stop sequences (review-classify pipelines, structured-output nodes, agents that use </foo> style termination markers). The model produces the stop sequence, signalling clean completion. The Claude SDK still flags this as is_error: true because the stop reason is stop_sequence rather than end_turn.
Where the false-failure is thrown:
packages/workflows/src/dag-executor.ts:908:
throw new Error(`Node '${node.id}' failed: SDK returned ${subtype}${errorsDetail}`);
When msg.errorSubtype === 'success', this throws "Node 'X' failed: SDK returned success" — a contradictory error message that's user-hostile and operationally wrong (the node actually completed correctly).
Observed: in a real run, two review-pipeline nodes (code-review, test-coverage) both produced this error simultaneously, blocking the trigger-rule join for synthesize / self-fix / simplify / report (all four were skipped). The agents had done their work; the workflow couldn't deliver the result.
Suggested fix:
dag-executor.ts:908 — check if (subtype === 'success') { ... treat as completed ... } else { throw }. Alternatively, gate on the presence of meaningful msg.errors content rather than the is_error boolean alone, since errorSubtype: 'success' is the SDK's own way of saying "this isn't really an error."
Combined effect (real-world incident)
A single archon-fix-github-issue run against kallitaxi/Client-Dashboard#2:
- Codex implementation node hit Bug A → produced zero source code, executor recorded success
- Workflow proceeded to
create-pr node, which generated a draft PR description from the plan (not the diff) — body claimed extensive Zoho integration work; actual diff was 23 unrelated build-fix lines, with Fixes #2 in the trailer
- Review-scope node correctly flagged
"all Zoho-specific files are absent"
code-review and test-coverage nodes ran, hit Bug B → false-failure
- Trigger rules skipped
synthesize / self-fix / simplify / report — review findings never written back
- Workflow exited with
❌ DAG workflow ... completed with failures: 'code-review': SDK returned success; 'test-coverage': SDK returned success
The user almost merged the fictional PR (Fixes #2 would have auto-closed the issue while delivering none of the work).
Related
Why filing together
Both bugs live in the same file, both involve mis-handling SDK result patterns, and the fix for both is the same shape: distinguish "real failure" from "the SDK happened to use this protocol marker." The sequence above shows they compound — fixing only one leaves the other as a parallel silent-failure vector.
Filing two related silent-vs-loud failure modes in
packages/workflows/src/dag-executor.ts. They were observed during a single workflow run (archon-fix-github-issueagainst an external repo) and combine to produce a workflow run that looks completed but is fictional.Both stem from the executor not handling SDK result quirks defensively. Listing them together because the second-order behaviour they produce is the same: a draft PR opened with a fabricated description and an empty diff, while downstream nodes either run on the empty diff or get skipped by trigger rules.
Bug A — Codex
turn.failedsilently exits the stream → executor records success with empty outputRepro path: any Codex provider failure that surfaces as
turn.failedrather than a thrown exception (auth error misclassified as crash via #1266; mid-turn rate limit; transient API error).Where the failure is swallowed:
packages/providers/src/codex/provider.ts:222-228—streamCodexEventshandler forturn.failedyields asystemchunk with the error text and thenbreaks out of the loop. The async generator exits without throwing and without emitting aresultchunk.Where the executor mis-classifies it as success:
packages/workflows/src/dag-executor.ts:751—nodeOutputTextonly accumulates onassistantchunkspackages/workflows/src/dag-executor.ts:844—resultchunk is the terminal signal, but no flag tracks whether it was actually seenpackages/workflows/src/dag-executor.ts:1105— falls through todag_node_completedlogpackages/workflows/src/dag-executor.ts:1147— returns{ state: 'completed', output: '' }packages/workflows/src/dag-executor.ts:2917-2992— run-level status derivation seesanyFailed=falseand marks the workflowcompletedWhy "5 retries / 18 s" instead of fail-fast on a 401:
provider.ts:121-128,449correctly classifies 401 asshouldRetry: false, but #1266's AbortSignal-poison bug causes each retry attempt to be SIGTERM'd before it authenticates. The error gets reclassified as a crash (class A in #1266) and runs through the retry loop. After exhaustion, Bug A above swallows the failure.This bug is distinct from #1266: even with #1266 fixed, the
turn.failedpath remains a silent-success vector.Suggested fix:
provider.ts:222-228— replacebreakwiththrow new Error(...)so the generator surfaces the failuredag-executor.ts— add areceivedResultChunkflag set true at the line ~873 result handler; before emittingdag_node_completedat line 1105, treat!receivedResultChunk && nodeOutputText === ''asstate: 'failed'Bug B — Claude SDK result with
is_error: true+errorSubtype: 'success'is thrown as a failureRepro path: any Claude node configured with stop sequences (review-classify pipelines, structured-output nodes, agents that use
</foo>style termination markers). The model produces the stop sequence, signalling clean completion. The Claude SDK still flags this asis_error: truebecause the stop reason isstop_sequencerather thanend_turn.Where the false-failure is thrown:
packages/workflows/src/dag-executor.ts:908:msg.errorSubtype === 'success', this throws"Node 'X' failed: SDK returned success"— a contradictory error message that's user-hostile and operationally wrong (the node actually completed correctly).Observed: in a real run, two review-pipeline nodes (
code-review,test-coverage) both produced this error simultaneously, blocking the trigger-rule join forsynthesize/self-fix/simplify/report(all four were skipped). The agents had done their work; the workflow couldn't deliver the result.Suggested fix:
dag-executor.ts:908— checkif (subtype === 'success') { ... treat as completed ... } else { throw }. Alternatively, gate on the presence of meaningfulmsg.errorscontent rather than theis_errorboolean alone, sinceerrorSubtype: 'success'is the SDK's own way of saying "this isn't really an error."Combined effect (real-world incident)
A single
archon-fix-github-issuerun againstkallitaxi/Client-Dashboard#2:create-prnode, which generated a draft PR description from the plan (not the diff) — body claimed extensive Zoho integration work; actual diff was 23 unrelated build-fix lines, withFixes #2in the trailer"all Zoho-specific files are absent"code-reviewandtest-coveragenodes ran, hit Bug B → false-failuresynthesize/self-fix/simplify/report— review findings never written back❌ DAG workflow ... completed with failures: 'code-review': SDK returned success; 'test-coverage': SDK returned successThe user almost merged the fictional PR (
Fixes #2would have auto-closed the issue while delivering none of the work).Related
Why filing together
Both bugs live in the same file, both involve mis-handling SDK result patterns, and the fix for both is the same shape: distinguish "real failure" from "the SDK happened to use this protocol marker." The sequence above shows they compound — fixing only one leaves the other as a parallel silent-failure vector.