Skip to content

Workflow exits success when condition_json_parse_failed cascades to silent node skips #1673

@Wirasm

Description

@Wirasm

Summary

  • What broke: When a workflow node's output can't be parsed for a downstream when: condition (e.g. $node.output.verdict), the condition-evaluator logs condition_json_parse_failed and the dependent nodes are marked Skipped (when_condition). The workflow then exits 0 — indistinguishable from a successful run where the gate model legitimately decided to skip those branches.
  • When it started (if known): pre-existing; surfaces frequently with Pi/Minimax outputs in maintainer-review-pr (the gate node sometimes prepends reasoning prose or wraps JSON in a markdown fence, breaking field extraction).
  • Severity: major

Steps to Reproduce

  1. Run bun run cli workflow run maintainer-review-pr "<PR#>" against a PR with Pi/Minimax as the provider.
  2. Wait for the gate node to complete. ~10–20% of the time Pi emits prose like Let me read the direction.md... before the JSON, or wraps it in ```json.
  3. Workflow "completes successfully" — exit code 0. But no review is posted, no findings recorded, every aspect node shows Skipped (when_condition).

Expected vs Actual

  • Expected: A condition_json_parse_failed against a when: clause should fail the workflow with a clear error pointing at the offending node and the unparseable output, so the user can retry or fix the prompt.
  • Actual: The error is logged at warn level and the DAG silently continues by treating the missing field as falsy. Every dependent node is skipped. The CLI reports "Workflow completed successfully."

User Flow

User                       Archon                       Pi
────                       ──────                       ──
runs maintainer-review-pr ▶ executes gate node ────────▶ returns "Let me think...\n```json\n{...}\n```"
                            condition_json_parse_failed (warn log only)
                            $gate.output.verdict == 'review' → falsy
                            [X] every downstream aspect node → Skipped (when_condition)
                            workflow finishes
sees "Workflow completed   ◀── exit 0, no review posted
successfully" but no
review exists on PR

Environment

  • Platform: CLI (also reproducible from any adapter)
  • Database: SQLite
  • Running in worktree? No (workflow has worktree.enabled: false)
  • OS: macOS / Linux (provider-agnostic; issue is in @archon/workflows condition-evaluator)

Logs

{"level":40,"module":"workflow.condition-evaluator","nodeId":"gate","field":"verdict","outputPreview":"\n\nLet me read the direction.md to verify my alignment assessment before writing artifacts.\n\n\nAll gat","msg":"condition_json_parse_failed"}
{"level":30,"module":"workflow.dag-executor","nodeId":"review-classify","when":"$gate.output.verdict == 'review'","msg":"dag_node_skipped_condition"}
[review-classify] Skipped (when_condition)
[approve-decline] Skipped (when_condition)
[approve-unclear] Skipped (when_condition)
[maintainer-review-code-review] Skipped (trigger_rule)
[maintainer-review-test-coverage] Skipped (trigger_rule)
... (all downstream nodes skipped)
{"module":"workflow.dag-executor","nodeCount":18,"anyCompleted":true,"anyFailed":false,"msg":"dag_workflow_finished"}
Workflow completed successfully.

Impact

  • Affected workflows/commands: any workflow with when: clauses referencing $node.output.<field> where the producing node is an LLM and the field isn't guaranteed parseable. In practice most frequent with Pi/Minimax (no structured-output SDK enforcement), but also possible with Claude/Codex if the prompt drifts.
  • Reproduction rate: Intermittent on Pi — observed twice in ~12 maintainer-review-pr runs in one session (~17%).
  • Workaround available: Re-run the workflow; usually succeeds on retry. No way to detect from the CLI exit code that the review didn't actually happen — must inspect logs.
  • Data loss risk: No, but real cost — a 5-minute workflow "succeeded" without doing the work.

Scope

  • Package(s) likely involved: workflows
  • Module: workflows:condition-evaluator, workflows:dag-executor

Proposed direction

Treat condition_json_parse_failed against a when: reference as a node error, not a warning. The DAG should fail the dependent node (and its descendants via the existing failure-propagation path), and the CLI should exit non-zero with a message naming the unparseable node and surfacing the output preview. This aligns with the project's "Fail Fast + Explicit Errors" principle — silent fallback in agent runtimes can create unsafe or costly behavior, and condition_json_parse_failed cascading to skipped nodes is precisely that.

Open questions for the implementer:

  • Should this be opt-out per-node (e.g. condition_strict: false for legacy workflows that depended on the silent fallback)? Probably not — the silent path is wrong everywhere.
  • Should the field-extraction layer also attempt graceful repairs (strip ```json fences, find balanced braces) before declaring parse failure? Worth doing, but as a separate improvement — the failure surfacing is the root issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething is broken

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions