/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549

@ztech-gthb

Maintainer note (updated 2026-06-09): Confirmed still LIVE on dev and re-verified against current code. The original report (preserved below) is accurate, but the resume path has been substantially refactored since this was filed (#1646 hydrateResumableRun, #1830 CAS guards), so the implementation must be built on the current model — not the code shape described in the original report. PR #1551 (the original fix, by @ztech-gthb) was closed because its /workflow resume re-routing collides with the new CAS model (text-merges clean, semantically broken). Implement fresh using the guide below; reuse #1551's UX verbatim and co-credit the author.


🔧 Implementation guide (for the eng picking this up)

The bug, in current-code terms

A fresh /workflow run X "B" silently auto-resumes a prior FAILED run of X in the same conversation, executing in that run's old worktree with its stale persisted user_message. The new args ("B") are discarded with no UI/log signal. Compounding: /workflow abandon rejects failed runs as terminal, so users can't easily escape.

Verified root cause (current code):

  • findResumableRunByParentConversation(name, convId, codebaseId) filters status IN ('failed', 'paused') (packages/core/src/db/workflows.ts:421).
  • Resume detection runs for every /workflow run (all platforms) inside dispatchOrchestratorWorkflow in packages/core/src/orchestrator/orchestrator-agent.ts (the findResumableRunByParentConversation call, currently ~line 542). A fresh run finds the prior failed run and resumes its working_path.
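
To make the root cause concrete, here is a minimal in-memory sketch of the lookup's filter. It is not the real implementation: the real function runs SQL and also takes a codebaseId, and the row shape, the helper name findResumableRun, and the 'running' status are simplified assumptions for illustration.

```typescript
// Hypothetical, simplified row shape; the real table has more columns.
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

interface WorkflowRun {
  id: string;
  workflowName: string;
  parentConversationId: string;
  status: RunStatus;
  userMessage: string;
  workingPath: string;
}

// In-memory stand-in for the SQL filter `status IN ('failed', 'paused')`.
// Note that the new request's args are not an input at all.
function findResumableRun(
  runs: WorkflowRun[],
  workflowName: string,
  conversationId: string,
): WorkflowRun | null {
  const match = runs.find(
    (run) =>
      run.workflowName === workflowName &&
      run.parentConversationId === conversationId &&
      (run.status === 'failed' || run.status === 'paused'),
  );
  return match ?? null;
}

const runA: WorkflowRun = {
  id: 'run-A',
  workflowName: 'X',
  parentConversationId: 'conv-1',
  status: 'failed',
  userMessage: 'A',
  workingPath: 'thread-run-A',
};

// A fresh request for "B" in the same conversation still matches run-A.
const hit = findResumableRun([runA], 'X', 'conv-1');
console.log(hit?.id, hit?.userMessage); // run-A A
```

Because the filter only sees (workflow name, conversation, status), nothing in the lookup can distinguish "continue run-A" from "start something new".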

Architecture you MUST build on (do not regress)

The resume/approval path was reworked after this issue was filed (#1646 hydrateResumableRun, #1830 CAS guards).

There are three resume entry points, all flowing through dispatchOrchestratorWorkflow. The fix must treat them differently:

| Entry point | Location | Desired behavior |
|---|---|---|
| a. Fresh /workflow run X "B" (the bug) | dispatch resume-detection | If a failed run exists → PROMPT (don't resume). If paused → resume (approval-gate continuation). |
| b. Natural-language approval of a paused gate | orchestrator-agent.ts ~line 1037 | dev marks the approved run status:'failed' then re-dispatches with pausedRun.user_message → must RESUME, not prompt. |
| c. /workflow resume <id> | command-handler.ts case 'resume' (~line 687) | Explicit user resume → must RESUME, not prompt (no prompt-loop). |

Fix design

  1. dispatchOrchestratorWorkflow — add options?: { force?: boolean; resumeRunId?: string }.
    • Guard the lookup: const resumableRun = options?.force ? null : await findResumableRunByParentConversation(...).
    • Gate before hydrateResumableRun:
      // resumableRun is null when --force skipped the lookup or nothing matched
      if (resumableRun && resumableRun.status !== 'paused' && resumableRun.id !== options?.resumeRunId) {
        // emit the 3-option prompt (see UX below) + log 'orchestrator.failed_resume_user_prompted'
        return;
      }
    • Everything else (hydrate + CAS + executeWorkflow) stays as dev has it.
  2. NL-approval path (entry b): pass { resumeRunId: pausedRun.id } so the approved-then-marked-failed run bypasses the prompt and resumes.
  3. /workflow resume <id> (entry c): carry an explicit-resume signal so the subsequent dispatch bypasses the prompt without pre-transitioning the run out of failed/paused before hydrateResumableRun reads it. (i.e. let hydrate own the CAS; the command path only needs to mark intent — e.g. thread resumeRunId onto the workflow result and into options.)
  4. abandonWorkflow (packages/core/src/operations/workflow-operations.ts): accept failed (transition failed → cancelled); reject only completed | cancelled. After abandon, a subsequent /workflow run must not prompt (cancelled is excluded from the lookup).
  5. --force parsing (command-handler.ts, /workflow run): recognize --force anywhere in args, strip it from the workflow args, thread it onto the result → options.force.
  6. types (packages/core/src/types/index.ts): force?: boolean and resumeRunId?: string on the workflow command result.
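
The gate in step 1 and the flag handling in step 5 can be sketched as two pure functions. This is a sketch under stated assumptions, not the code to merge: decideDispatch, parseForceFlag, and DispatchDecision are hypothetical names, only force and resumeRunId come from the guide, and the args are assumed to be already tokenized into a string array.

```typescript
interface ResumableRun {
  id: string;
  status: 'paused' | 'failed'; // the only statuses the lookup returns
}

interface DispatchOptions {
  force?: boolean;
  resumeRunId?: string;
}

type DispatchDecision = 'fresh' | 'prompt' | 'resume';

// Hypothetical helper mirroring the gate that sits before hydrateResumableRun.
function decideDispatch(
  resumableRun: ResumableRun | null,
  options?: DispatchOptions,
): DispatchDecision {
  // --force skips the lookup, so no prior run can steer the dispatch.
  if (options?.force || resumableRun === null) return 'fresh';
  // Paused runs keep auto-resuming; failed runs resume only on explicit intent.
  if (resumableRun.status === 'paused' || resumableRun.id === options?.resumeRunId) {
    return 'resume';
  }
  return 'prompt';
}

// Hypothetical helper: recognize --force anywhere in the args and strip it.
function parseForceFlag(args: string[]): { force: boolean; args: string[] } {
  const rest = args.filter((arg) => arg !== '--force');
  return { force: rest.length !== args.length, args: rest };
}

// Entry a: fresh run with a prior failed run.
console.log(decideDispatch({ id: 'run-A', status: 'failed' })); // prompt
// Entries b and c: the caller names the run it intends to resume.
console.log(decideDispatch({ id: 'run-A', status: 'failed' }, { resumeRunId: 'run-A' })); // resume
```

Keeping the decision separate from hydrate and CAS makes the three entry points cheap to cover in orchestrator-agent.test.ts without touching the database.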

UX (reuse from closed PR #1551 — by @ztech-gthb)

Failed-run prompt = three options, with a preview of the prior run's user_message (truncate ~160 chars) so the user can tell whether "resume" matches their intent:

  1. /workflow resume <id> — resume that run (re-runs the prior prompt)
  2. /workflow abandon <id> then re-run the command — discard + fresh
  3. /workflow run <name> --force "<msg>" — fresh, leave the failed run as-is

Escape \, ", and ` in the interpolated suggested command (a raw backtick would break the markdown code fence). Co-credit the original author on the commit: Co-authored-by: Zolto <zolto@zhome.local>.
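
The preview and escaping rules above can be sketched as follows. The prompt wording here is placeholder text, not the wording from PR #1551 (which should be reused verbatim). The helper names are hypothetical, and backslash escaping is an assumption, since the issue names the characters to escape but not the escaping scheme.

```typescript
const PREVIEW_LIMIT = 160;

// Truncate the prior run's user_message for the preview line.
function previewMessage(message: string, limit = PREVIEW_LIMIT): string {
  return message.length <= limit ? message : `${message.slice(0, limit)}...`;
}

// Assumption: backslash-escape \, " and ` in the interpolated command.
function escapeForCommand(message: string): string {
  return message.replace(/[\\"`]/g, (ch) => `\\${ch}`);
}

// Placeholder wording; only the three options themselves come from the guide.
function buildFailedRunPrompt(
  run: { id: string; userMessage: string },
  workflowName: string,
  newMessage: string,
): string {
  return [
    `A failed run of ${workflowName} exists (${run.id}).`,
    `Prior prompt: ${previewMessage(run.userMessage)}`,
    `1. /workflow resume ${run.id}`,
    `2. /workflow abandon ${run.id}, then re-run the command`,
    `3. /workflow run ${workflowName} --force "${escapeForCommand(newMessage)}"`,
  ].join('\n');
}

console.log(buildFailedRunPrompt({ id: 'run-A', userMessage: 'task A' }, 'X', 'say "hi"'));
```

The new message is echoed back inside option 3, so the user's fresh args survive the prompt instead of being discarded.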

Acceptance criteria

  • Fresh /workflow run X "B" with a prior failed run → prompt fires, executeWorkflow not called, new args not discarded.
  • --force → fresh run dispatched, failed row untouched.
  • /workflow abandon <failed-id> → cancelled; subsequent /workflow run no longer prompts.
  • NL-approval of a paused gate → resumes (no prompt). ← regression guard, easy to break
  • /workflow resume <id> → resumes the prior run, no prompt-loop, no fresh run.
  • Paused-run auto-resume (approval gate, PR #914) unchanged.
  • bun run validate clean.
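
The abandon criterion can be checked against a small status model. This is a sketch that assumes a simplified status set (the 'running' value is an assumption) and hypothetical helper names; the real transition lives in abandonWorkflow and goes through the database.

```typescript
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

// Only completed and cancelled runs are rejected; failed becomes abandonable.
function nextStatusOnAbandon(current: RunStatus): RunStatus {
  if (current === 'completed' || current === 'cancelled') {
    throw new Error(`Cannot abandon a run that is already ${current}`);
  }
  return 'cancelled';
}

// Same filter as the resumable-run lookup: cancelled is not in it.
function isResumable(status: RunStatus): boolean {
  return status === 'failed' || status === 'paused';
}

const afterAbandon = nextStatusOnAbandon('failed');
console.log(afterAbandon, isResumable(afterAbandon)); // cancelled false
```

Since cancelled is excluded from the lookup, abandoning a failed run is enough to stop later /workflow run calls from prompting.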

Files

  • packages/core/src/orchestrator/orchestrator-agent.ts: dispatch gate, NL-approval bypass, thread options
  • packages/core/src/handlers/command-handler.ts: --force parsing, resume-intent signal
  • packages/core/src/operations/workflow-operations.ts: abandonWorkflow accepts failed
  • packages/core/src/types/index.ts: force? / resumeRunId?
  • packages/core/src/db/workflows.ts: likely no change (SELECT * already returns status + user_message)
  • Tests: orchestrator-agent.test.ts, command-handler.test.ts

Original report

Summary

  • What broke: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message ("task A"). The new prompt is discarded with no UI/log indication. The user sees a positive completion report on task A and is confused why task B never happened. Compounding: /workflow abandon rejects failed runs as "already terminal", so users hit by this cannot easily escape.
  • When it started (if known): introduced in PR #914 (fix: foreground resume for interactive workflows + chat auto-resume), which added findResumableRunByParentConversation with status IN ('failed', 'paused'). The 'failed' clause was scoped to support manual /workflow resume <id>; using it for automatic resume on a fresh /workflow run produces the silent-hijack behavior.
  • Severity: major (silent data loss / silent intent loss; trust-corroding)

Steps to Reproduce

  1. Pick any workflow whose first node materializes the user_message into $ARTIFACTS_DIR/.X files (most non-trivial workflows do this — e.g. parse-args style scripts).
  2. Run it with input that fails an early step:
    /workflow run my-workflow "input that fails parse"
    
  3. Observe: run is failed in remote_agent_workflow_runs, working_path = .../worktrees/archon/thread-<old-id>/.
  4. In the same chat conversation, run it with new input:
    /workflow run my-workflow "completely different task"
    
  5. Observe in server logs:
    module=command-handler   args="completely different task"   ← what the user typed
    module=orchestrator-agent msg=orchestrator.foreground_resume_detected
                              resumableRunId=<old-id>
                              workingPath=…/thread-<old-id>/
    
  6. The workflow runs again, in the same sub-worktree, with the previous run's user_message (preserved in $ARTIFACTS_DIR/.X files). Step 4's input is never used.

Expected vs Actual

  • Expected: a fresh /workflow run with new args dispatches a fresh run in a fresh worktree with the new args. The prior failed run remains as an audit-trail row but does not steer execution. If the user wants to continue the failed run from where it stopped, they explicitly type /workflow resume <id>.
  • Actual: the orchestrator silently picks up any failed | paused resumable run for the same (workflow_name, parent_conversation_id), calls executeWorkflow with the failed run's working_path, and the workflow re-reads stale state from disk. The new args travel through the call as userMessage but are discarded by parse-args/script-style early nodes.
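
Why the new args vanish can be illustrated with a toy model. Everything here is hypothetical: a two-node workflow where the first node writes user_message to an artifact and the second reads only that artifact, with completed nodes skipped on resume. It models the reported behavior and is not the real DAG executor.

```typescript
// Hypothetical model: one artifact store per worktree, keyed by working path.
type ArtifactStore = Map<string, string>;

interface RunState {
  workingPath: string;
  completedNodes: Set<string>;
}

const worktrees = new Map<string, ArtifactStore>();

function artifactsFor(workingPath: string): ArtifactStore {
  let store = worktrees.get(workingPath);
  if (!store) {
    store = new Map();
    worktrees.set(workingPath, store);
  }
  return store;
}

// "materialize" persists user_message; "parse-args" reads it back.
// Nodes that already completed are skipped when the run is resumed.
function executeWorkflow(run: RunState, userMessage: string): string {
  const artifacts = artifactsFor(run.workingPath);
  if (!run.completedNodes.has('materialize')) {
    artifacts.set('.X', userMessage);
    run.completedNodes.add('materialize');
  }
  // parse-args never looks at userMessage, only at the persisted file.
  return artifacts.get('.X') ?? '';
}

const staleRun: RunState = { workingPath: 'thread-run-A', completedNodes: new Set() };
executeWorkflow(staleRun, 'task A'); // first run persists "task A", then fails later
const seen = executeWorkflow(staleRun, 'task B'); // auto-resume reuses run-A's state
console.log(seen); // task A
```

In this model the second call receives "task B" but returns "task A", which mirrors the log excerpt where the resumed run fails on the old reformulated message.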

User Flow

User                              Archon                              DB
────                              ──────                              ──
runs /workflow run X "A" ───────▶ findResumable... → null
                                  dispatch fresh                  ───▶ run-A row → status='failed'
                                  (parse-args fails on input "A")

runs /workflow run X "B" ───────▶ findResumable... → run-A
                                  [X] auto-resume in run-A's worktree
                                      with run-A's persisted state
                                  executeWorkflow(
                                      working_path=thread-run-A,
                                      userMessage="B"
                                      ↑ scripts ignore: they read
                                        $ARTIFACTS_DIR/.X from run-A
                                  )                              ───▶ run-A re-executed,
                                                                       still on task A
sees positive report ◀────────── task-A success report
   "I asked for B"                (still no idea task B was hijacked)

The [X] is where intent silently disappears.

Environment

  • Platform: Web (orchestrator agent path)
  • Database: SQLite (PostgreSQL has the same SQL, same behavior)
  • Running in worktree? Yes (workflow sub-worktrees)
  • OS: macOS host with Linux container; not OS-specific

Logs

{"level":30,"module":"command-handler","workflow":"ztech-marimo-edit",
 "args":"fortigapminder.marimo.py Remove redundant local tomllib re-imports
         from cells 4, 7, 12 and 13",                       ← user's correct args
 "msg":"cmd.workflow_starting"}

{"level":30,"module":"orchestrator-agent",
 "workflowName":"ztech-marimo-edit",
 "resumableRunId":"92d86ea89fd6808c5f6534b4ef34acbc",       ← prior failed run
 "workingPath":"/.archon/.../worktrees/archon/thread-85a590f9",
 "msg":"orchestrator.foreground_resume_detected"}

{"level":30,"module":"workflow.dag-executor",
 "priorCompletedCount":5,
 "msg":"dag.workflow_resume_prepopulated"}                  ← old state restored

{"level":50,"module":"workflow.dag-executor","exitCode":1,
 "stderrTail":"ERROR: First argument must be a notebook path ending in .py
              [...] INPUT (arg $1)='Edit the notebook at fortigapminder...'",
                                          ↑ THE OLD reformulated user_message,
                                            not the new args
 "msg":"dag_node_failed"}

The fresh /workflow run typed the correct path-prefixed args, but the resumed run reads the old natural-language reformulation from the .edit-description artifact persisted by run-A.

Impact

  • Affected workflows/commands: any workflow with a first node that materializes user_message into $ARTIFACTS_DIR/.X files (most non-trivial DAG workflows). archon-fix-issue, archon-feature-development, custom user workflows, etc.
  • Reproduction rate: Always — deterministic given the SQL match (failed run + same workflow + same conversation).
  • Workaround available: pre-this-PR there was none. /workflow abandon rejected failed runs as terminal; /workflow resume <id> re-ran the same stale state; the only way out was direct DB manipulation (UPDATE remote_agent_workflow_runs SET status='cancelled' WHERE id=...).
  • Data loss risk: Yes — silent intent loss. The user's request is discarded with no log/UI indication.

Scope

  • Package(s): core
  • Module: core:orchestrator (dispatch logic), core:db (findResumableRunByParentConversation), core:operations (abandonWorkflow)

    Labels: P1 (High priority: address soon, next in queue), area: cli (CLI commands and interface), bug (Something is broken)
