Auto-resume cache traps workflows on "successful" producer with bad output — only escape is renaming the node

## Problem

When a workflow run is auto-resumed, every node previously marked \`completed\` is skipped via the \`prior_success\` path and its output is replayed from cache to downstream consumers. There's no escape hatch when a producer node exited 0 but its output is structurally bad (corrupted file, malformed JSON written to disk, missing fields). The downstream consumer keeps failing forever, and the only way to clear the bad cache is to **rename the producer node** so its identity changes.

## Repro

1. Build a workflow with a producer (`bash:` or `script:`) that writes a file to `$ARTIFACTS_DIR` and exits 0, plus a consumer (`script:`) that reads that file.
2. Author the producer with a subtle bash bug — e.g. \`printf '%s' '\$node.output'\` (the `'…'` collide with the auto-shell-quoter, the JSON gets bash-brace-expanded into garbage). Producer exits 0; file is corrupted.
3. Consumer fails parsing the file.
4. Fix the producer.
5. Re-run \`archon workflow run <name> ...\` — observe \`workflow.dag_resuming\`, \`priorCompletedCount: N\`, producer \`Skipped (prior_success)\`, consumer fails again with the **same** corrupted file.
6. Repeat — fix doesn't take effect because producer is forever cached.
7. Workaround: rename producer node \`foo\` → \`foo-v2\`, update all `depends_on` references, update all `$foo.output` references downstream. Cache miss on the new id, producer re-runs, fix takes effect.

This happened to me building a release workflow. The bridge node (`persist-draft-changelog`) wrote corrupted JSON because of an unrelated quoting bug, but the bash command exited 0. Three subsequent \"fix and re-run\" cycles all failed at the consumer because the cached corrupted file kept being used. Discovered the workaround by renaming to `save-draft-json`.

## Why this is the worst class of bug

\"Producer exit 0 + bad output\" is structurally the hardest workflow failure mode to debug:

- The producer's log says \"Completed (10ms)\" — looks fine
- The consumer's error points at the consumer code (\"JSON parse error\") — misleading
- The bad artifact persists across resumes — every iteration sees the same broken downstream
- The `--resume`-cached behavior actively hides the fix
- Renaming a node and chasing every reference is exactly the kind of yak-shave that breaks an authoring flow

For agents authoring workflows, this trap is catastrophic — they don't know to rename the node, and their fix-the-producer iterations all look like the producer is fine and the consumer is broken.

## Proposed direction (sketches)

Any of these would help; not prescriptive:

1. **`archon workflow node invalidate <run-id> <node-id>`** — CLI escape hatch. Drops the cached output for one node; on next resume it re-runs.
2. **`archon workflow node invalidate <run-id> --downstream-of <node-id>`** — invalidate everything that consumed `<node-id>`'s output. Useful when the producer was OK but you changed its consumer's contract.
3. **`--no-resume` flag on `archon workflow run`** — force fresh run, ignore prior cached state. Currently the *only* way to do this is `archon workflow abandon <id>` followed by `archon workflow run` (and abandon errors with \"already terminal\" if the run is already failed — see issue F).
4. **`always_run: true` on a node** — opts out of resume caching for that node only. Makes producers like \"persist AI output to disk\" idempotent-by-policy.
5. **Make node identity content-addressable** — hash the node's resolved script body + inputs into the cache key. A code change to the producer auto-busts the cache. (Heavier change, but solves the root cause.)

## Adjacent

- See [#1389](https://github.com/coleam00/Archon/issues/1389) — \"errors should help agents self-correct\". This issue is the same family: an agent debugging a failing workflow gets pointed at the wrong problem.
- See [issue F: auto-resume default behavior] — the resume-by-default behavior amplifies this trap because authors don't realize they're resuming.

## Found while

Building a workshop `archon-release` workflow. The workflow itself is sound; this Archon behavior cost ~3 iteration cycles before I figured out the cache-invalidation trick.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-resume cache traps workflows on "successful" producer with bad output — only escape is renaming the node #1391

Problem

Repro

Why this is the worst class of bug

Proposed direction (sketches)

Adjacent

Found while

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Auto-resume cache traps workflows on "successful" producer with bad output — only escape is renaming the node #1391

Description

Problem

Repro

Why this is the worst class of bug

Proposed direction (sketches)

Adjacent

Found while

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions