Problem
When a workflow run is auto-resumed, every node previously marked `completed` is skipped via the `prior_success` path and its output is replayed from cache to downstream consumers. There's no escape hatch when a producer node exited 0 but its output is structurally bad (corrupted file, malformed JSON written to disk, missing fields). The downstream consumer keeps failing forever, and the only way to clear the bad cache is to rename the producer node so its identity changes.
Repro
- Build a workflow with a producer (
bash: or script:) that writes a file to $ARTIFACTS_DIR and exits 0, plus a consumer (script:) that reads that file.
- Author the producer with a subtle bash bug — e.g. `printf '%s' '$node.output'` (the
'…' collide with the auto-shell-quoter, the JSON gets bash-brace-expanded into garbage). Producer exits 0; file is corrupted.
- Consumer fails parsing the file.
- Fix the producer.
- Re-run `archon workflow run ...` — observe `workflow.dag_resuming`, `priorCompletedCount: N`, producer `Skipped (prior_success)`, consumer fails again with the same corrupted file.
- Repeat — fix doesn't take effect because producer is forever cached.
- Workaround: rename producer node `foo` → `foo-v2`, update all
depends_on references, update all $foo.output references downstream. Cache miss on the new id, producer re-runs, fix takes effect.
This happened to me building a release workflow. The bridge node (persist-draft-changelog) wrote corrupted JSON because of an unrelated quoting bug, but the bash command exited 0. Three subsequent "fix and re-run" cycles all failed at the consumer because the cached corrupted file kept being used. Discovered the workaround by renaming to save-draft-json.
Why this is the worst class of bug
"Producer exit 0 + bad output" is structurally the hardest workflow failure mode to debug:
- The producer's log says "Completed (10ms)" — looks fine
- The consumer's error points at the consumer code ("JSON parse error") — misleading
- The bad artifact persists across resumes — every iteration sees the same broken downstream
- The
--resume-cached behavior actively hides the fix
- Renaming a node and chasing every reference is exactly the kind of yak-shave that breaks an authoring flow
For agents authoring workflows, this trap is catastrophic — they don't know to rename the node, and their fix-the-producer iterations all look like the producer is fine and the consumer is broken.
Proposed direction (sketches)
Any of these would help; not prescriptive:
archon workflow node invalidate <run-id> <node-id> — CLI escape hatch. Drops the cached output for one node; on next resume it re-runs.
archon workflow node invalidate <run-id> --downstream-of <node-id> — invalidate everything that consumed <node-id>'s output. Useful when the producer was OK but you changed its consumer's contract.
--no-resume flag on archon workflow run — force fresh run, ignore prior cached state. Currently the only way to do this is archon workflow abandon <id> followed by archon workflow run (and abandon errors with "already terminal" if the run is already failed — see issue F).
always_run: true on a node — opts out of resume caching for that node only. Makes producers like "persist AI output to disk" idempotent-by-policy.
- Make node identity content-addressable — hash the node's resolved script body + inputs into the cache key. A code change to the producer auto-busts the cache. (Heavier change, but solves the root cause.)
Adjacent
- See #1389 — "errors should help agents self-correct". This issue is the same family: an agent debugging a failing workflow gets pointed at the wrong problem.
- See [issue F: auto-resume default behavior] — the resume-by-default behavior amplifies this trap because authors don't realize they're resuming.
Found while
Building a workshop archon-release workflow. The workflow itself is sound; this Archon behavior cost ~3 iteration cycles before I figured out the cache-invalidation trick.
Problem
When a workflow run is auto-resumed, every node previously marked `completed` is skipped via the `prior_success` path and its output is replayed from cache to downstream consumers. There's no escape hatch when a producer node exited 0 but its output is structurally bad (corrupted file, malformed JSON written to disk, missing fields). The downstream consumer keeps failing forever, and the only way to clear the bad cache is to rename the producer node so its identity changes.
Repro
bash:orscript:) that writes a file to$ARTIFACTS_DIRand exits 0, plus a consumer (script:) that reads that file.'…'collide with the auto-shell-quoter, the JSON gets bash-brace-expanded into garbage). Producer exits 0; file is corrupted.depends_onreferences, update all$foo.outputreferences downstream. Cache miss on the new id, producer re-runs, fix takes effect.This happened to me building a release workflow. The bridge node (
persist-draft-changelog) wrote corrupted JSON because of an unrelated quoting bug, but the bash command exited 0. Three subsequent "fix and re-run" cycles all failed at the consumer because the cached corrupted file kept being used. Discovered the workaround by renaming tosave-draft-json.Why this is the worst class of bug
"Producer exit 0 + bad output" is structurally the hardest workflow failure mode to debug:
--resume-cached behavior actively hides the fixFor agents authoring workflows, this trap is catastrophic — they don't know to rename the node, and their fix-the-producer iterations all look like the producer is fine and the consumer is broken.
Proposed direction (sketches)
Any of these would help; not prescriptive:
archon workflow node invalidate <run-id> <node-id>— CLI escape hatch. Drops the cached output for one node; on next resume it re-runs.archon workflow node invalidate <run-id> --downstream-of <node-id>— invalidate everything that consumed<node-id>'s output. Useful when the producer was OK but you changed its consumer's contract.--no-resumeflag onarchon workflow run— force fresh run, ignore prior cached state. Currently the only way to do this isarchon workflow abandon <id>followed byarchon workflow run(and abandon errors with "already terminal" if the run is already failed — see issue F).always_run: trueon a node — opts out of resume caching for that node only. Makes producers like "persist AI output to disk" idempotent-by-policy.Adjacent
Found while
Building a workshop
archon-releaseworkflow. The workflow itself is sound; this Archon behavior cost ~3 iteration cycles before I figured out the cache-invalidation trick.