Skip to content

Auto-resume cache traps workflows on "successful" producer with bad output — only escape is renaming the node #1391

@Wirasm

Description

@Wirasm

Problem

When a workflow run is auto-resumed, every node previously marked `completed` is skipped via the `prior_success` path and its output is replayed from cache to downstream consumers. There's no escape hatch when a producer node exited 0 but its output is structurally bad (corrupted file, malformed JSON written to disk, missing fields). The downstream consumer keeps failing forever, and the only way to clear the bad cache is to rename the producer node so its identity changes.

Repro

  1. Build a workflow with a producer (bash: or script:) that writes a file to $ARTIFACTS_DIR and exits 0, plus a consumer (script:) that reads that file.
  2. Author the producer with a subtle bash bug — e.g. `printf '%s' '$node.output'` (the '…' collide with the auto-shell-quoter, the JSON gets bash-brace-expanded into garbage). Producer exits 0; file is corrupted.
  3. Consumer fails parsing the file.
  4. Fix the producer.
  5. Re-run `archon workflow run ...` — observe `workflow.dag_resuming`, `priorCompletedCount: N`, producer `Skipped (prior_success)`, consumer fails again with the same corrupted file.
  6. Repeat — fix doesn't take effect because producer is forever cached.
  7. Workaround: rename producer node `foo` → `foo-v2`, update all depends_on references, update all $foo.output references downstream. Cache miss on the new id, producer re-runs, fix takes effect.

This happened to me building a release workflow. The bridge node (persist-draft-changelog) wrote corrupted JSON because of an unrelated quoting bug, but the bash command exited 0. Three subsequent "fix and re-run" cycles all failed at the consumer because the cached corrupted file kept being used. Discovered the workaround by renaming to save-draft-json.

Why this is the worst class of bug

"Producer exit 0 + bad output" is structurally the hardest workflow failure mode to debug:

  • The producer's log says "Completed (10ms)" — looks fine
  • The consumer's error points at the consumer code ("JSON parse error") — misleading
  • The bad artifact persists across resumes — every iteration sees the same broken downstream
  • The --resume-cached behavior actively hides the fix
  • Renaming a node and chasing every reference is exactly the kind of yak-shave that breaks an authoring flow

For agents authoring workflows, this trap is catastrophic — they don't know to rename the node, and their fix-the-producer iterations all look like the producer is fine and the consumer is broken.

Proposed direction (sketches)

Any of these would help; not prescriptive:

  1. archon workflow node invalidate <run-id> <node-id> — CLI escape hatch. Drops the cached output for one node; on next resume it re-runs.
  2. archon workflow node invalidate <run-id> --downstream-of <node-id> — invalidate everything that consumed <node-id>'s output. Useful when the producer was OK but you changed its consumer's contract.
  3. --no-resume flag on archon workflow run — force fresh run, ignore prior cached state. Currently the only way to do this is archon workflow abandon <id> followed by archon workflow run (and abandon errors with "already terminal" if the run is already failed — see issue F).
  4. always_run: true on a node — opts out of resume caching for that node only. Makes producers like "persist AI output to disk" idempotent-by-policy.
  5. Make node identity content-addressable — hash the node's resolved script body + inputs into the cache key. A code change to the producer auto-busts the cache. (Heavier change, but solves the root cause.)

Adjacent

  • See #1389 — "errors should help agents self-correct". This issue is the same family: an agent debugging a failing workflow gets pointed at the wrong problem.
  • See [issue F: auto-resume default behavior] — the resume-by-default behavior amplifies this trap because authors don't realize they're resuming.

Found while

Building a workshop archon-release workflow. The workflow itself is sound; this Archon behavior cost ~3 iteration cycles before I figured out the cache-invalidation trick.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium priority - Backlog, when time permitsarea: cliCLI commands and interfacearea: workflowsWorkflow enginebugSomething is brokeneffort/mediumFew files, one domain or module, some coordination needed

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions