Maintainer note (updated 2026-06-09): Confirmed still LIVE on dev and re-verified against current code. The original report (preserved below) is accurate, but the resume path has been substantially refactored since this was filed (#1646 hydrateResumableRun, #1830 CAS guards), so the implementation must be built on the current model, not the code shape described in the original report. PR #1551 (the original fix, by @ztech-gthb) was closed because its /workflow resume re-routing collides with the new CAS model (text-merges clean, semantically broken). Implement fresh using the guide below; reuse #1551's UX verbatim and co-credit the author.
🔧 Implementation guide (for the eng picking this up)
The bug, in current-code terms
A fresh /workflow run X "B" silently auto-resumes a prior FAILED run of X in the same conversation, executing in that run's old worktree with its stale persisted user_message. The new args ("B") are discarded with no UI/log signal. Compounding: /workflow abandon rejects failed runs as terminal, so users can't easily escape.
Verified root cause (current code):
findResumableRunByParentConversation(name, convId, codebaseId) filters status IN ('failed', 'paused') — packages/core/src/db/workflows.ts:421
Resume detection runs for every /workflow run (all platforms) inside dispatchOrchestratorWorkflow, in packages/core/src/orchestrator/orchestrator-agent.ts (at the findResumableRunByParentConversation call, currently ~line 542). A fresh run finds the prior failed run and resumes its working_path.
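To make the root cause concrete, here is a minimal in-memory sketch of the lookup filter. It is not the real query in packages/core/src/db/workflows.ts: the row shape, the codebase_id column name, the function name findResumableRun, and the "first match wins" ordering are all assumptions made for illustration.

```typescript
// Hypothetical in-memory model of the lookup; the real function queries the database.
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

interface WorkflowRunRow {
  id: string;
  workflow_name: string;
  parent_conversation_id: string;
  codebase_id: string; // assumed column name
  status: RunStatus;
  user_message: string;
  working_path: string;
}

// Mirrors the `status IN ('failed', 'paused')` filter described above.
const RESUMABLE: ReadonlySet<RunStatus> = new Set<RunStatus>(['failed', 'paused']);

function findResumableRun(
  rows: readonly WorkflowRunRow[],
  name: string,
  convId: string,
  codebaseId: string,
): WorkflowRunRow | null {
  // Assumption: the first matching row wins; the real ordering is not shown in this issue.
  const match = rows.find(
    (r) =>
      r.workflow_name === name &&
      r.parent_conversation_id === convId &&
      r.codebase_id === codebaseId &&
      RESUMABLE.has(r.status),
  );
  return match ?? null;
}

// A failed run of X in the same conversation matches a later, unrelated request.
const exampleRows: WorkflowRunRow[] = [
  {
    id: 'run-A',
    workflow_name: 'X',
    parent_conversation_id: 'conv-1',
    codebase_id: 'cb-1',
    status: 'failed',
    user_message: 'A',
    working_path: '/worktrees/thread-run-A',
  },
];
const hijacked = findResumableRun(exampleRows, 'X', 'conv-1', 'cb-1');
console.log(hijacked?.user_message);
```

Note that the lookup never sees the new args, so nothing in it can distinguish "continue task A" from "start task B".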
Architecture you MUST build on (do not regress)
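The resume/approval path was reworked after this issue was filed:
hydrateResumableRun(deps, run) (#1646, "fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun", closes #1392): performs the resume CAS transition itself and returns the prepared run (priorCompletedNodes etc.). Do not transition the run to running before dispatch and then rely on findResumable: it only matches failed/paused, so a pre-transitioned run would be missed and a fresh run would start. This is exactly the trap PR #1551 ("fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag") fell into.
WorkflowNotResumableError CAS guards (#1830, "fix(core): concurrency-safe workflow resume/cancel (CAS guards)"): concurrent resumers (web Resume button / chat re-dispatch / CLI) must not double-claim a worktree. Keep this intact.
There are three resume entry points, all flowing through dispatchOrchestratorWorkflow. The fix must treat them differently:
a. Fresh /workflow run X "B" (the bug).
b. NL-approval of a paused gate (orchestrator-agent.ts, ~line 1037): the run is marked status: 'failed' and then re-dispatched with pausedRun.user_message → must RESUME, not prompt.
c. /workflow resume <id> (command-handler.ts, case 'resume', ~line 687).
Fix design
dispatchOrchestratorWorkflow: add options?: { force?: boolean; resumeRunId?: string }.
Lookup: const resumableRun = options?.force ? null : await findResumableRunByParentConversation(...).
Gate (before hydrateResumableRun):
    if (resumableRun.status !== 'paused' && resumableRun.id !== options?.resumeRunId) {
      // emit the 3-option prompt (see UX below) + log 'orchestrator.failed_resume_user_prompted'
      return;
    }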
Everything else (hydrate + CAS + executeWorkflow) stays as dev has it.
NL-approval path (entry b): pass { resumeRunId: pausedRun.id } so the approved-then-marked-failed run bypasses the prompt and resumes.
/workflow resume <id> (entry c): carry an explicit-resume signal so the subsequent dispatch bypasses the prompt without pre-transitioning the run out of failed/paused before hydrateResumableRun reads it. (i.e. let hydrate own the CAS; the command path only needs to mark intent — e.g. thread resumeRunId onto the workflow result and into options.)
abandonWorkflow (packages/core/src/operations/workflow-operations.ts): accept failed (transition failed → cancelled); reject only completed | cancelled. After abandon, a subsequent /workflow run must not prompt (cancelled is excluded from the lookup).
--force parsing (command-handler.ts, /workflow run): recognize --force anywhere in args, strip it from the workflow args, thread it onto the result → options.force.
types (packages/core/src/types/index.ts): force?: boolean and resumeRunId?: string on the workflow command result.
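The fix design above boils down to a three-way decision per dispatch. The sketch below models only that decision as a pure function, so the three entry points can be checked side by side. The names decideDispatch and DispatchDecision are hypothetical, and the lookup result is passed in as an argument, whereas the design skips the lookup call entirely when force is set.

```typescript
// Hypothetical pure model of the dispatch gate; not the orchestrator code itself.
interface ResumableRun {
  id: string;
  status: 'failed' | 'paused'; // the only statuses the lookup returns
}

interface DispatchOptions {
  force?: boolean;
  resumeRunId?: string;
}

type DispatchDecision =
  | { kind: 'fresh' }
  | { kind: 'prompt'; runId: string }
  | { kind: 'resume'; runId: string };

function decideDispatch(
  found: ResumableRun | null,
  options?: DispatchOptions,
): DispatchDecision {
  // --force: behave as if no resumable run exists; the failed row is left untouched.
  const resumableRun = options?.force ? null : found;
  if (resumableRun === null) {
    return { kind: 'fresh' };
  }
  // A failed run that nobody explicitly asked to resume: ask the user, do not execute.
  if (resumableRun.status !== 'paused' && resumableRun.id !== options?.resumeRunId) {
    return { kind: 'prompt', runId: resumableRun.id };
  }
  // Paused runs, and failed runs named by resumeRunId, continue to hydrate + CAS.
  return { kind: 'resume', runId: resumableRun.id };
}

const failedRun: ResumableRun = { id: 'run-A', status: 'failed' };
console.log(decideDispatch(failedRun).kind); // entry a: fresh command, prior failed run
console.log(decideDispatch(failedRun, { resumeRunId: 'run-A' }).kind); // entries b and c
console.log(decideDispatch(failedRun, { force: true }).kind); // --force
```

Keeping the decision separate from hydrateResumableRun means the prompt branch returns before any CAS transition is attempted, which is the property the architecture notes ask not to regress.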
UX (reuse from closed PR #1551, by @ztech-gthb)
Failed-run prompt = three options, with a preview of the prior run's user_message (truncate ~160 chars) so the user can tell whether "resume" matches their intent:
/workflow resume <id> — resume that run (re-runs the prior prompt)
/workflow abandon <id> then re-run the command — discard + fresh
/workflow run <name> --force "<msg>" — fresh, leave the failed run as-is
Escape \, ", and ` in the interpolated suggested command (a raw backtick would break the markdown code fence). Co-credit the original author on the commit: Co-authored-by: Zolto <zolto@zhome.local>.
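As a rough illustration of the truncation and escaping rules above, the following sketch builds the three-option prompt text. The function names, the backslash-prefix escaping scheme, and the first line of the prompt are assumptions; the actual wording should be taken from #1551, which is not reproduced here.

```typescript
// Hypothetical helpers; the real prompt wording comes from PR #1551.
interface FailedRunPreview {
  id: string;
  user_message: string;
}

const PREVIEW_MAX = 160; // "~160 chars" per the guide

function truncate(text: string, max: number = PREVIEW_MAX): string {
  return text.length <= max ? text : text.slice(0, max) + '…';
}

// Assumption: escaping means prefixing \, " and ` with a backslash.
function escapeForSuggestedCommand(text: string): string {
  return text.replace(/[\\"`]/g, (ch) => '\\' + ch);
}

function buildFailedRunPrompt(
  run: FailedRunPreview,
  workflowName: string,
  newMessage: string,
): string {
  const escaped = escapeForSuggestedCommand(newMessage);
  return [
    // Placeholder wording for the preview line.
    'Prior failed run prompt: ' + truncate(run.user_message),
    '/workflow resume ' + run.id + ' : resume that run (re-runs the prior prompt)',
    '/workflow abandon ' + run.id + ' then re-run the command : discard + fresh',
    '/workflow run ' + workflowName + ' --force "' + escaped + '" : fresh, leave the failed run as-is',
  ].join('\n');
}

console.log(buildFailedRunPrompt({ id: 'run-A', user_message: 'task A' }, 'X', 'task "B"'));
```

Only the interpolated new message is escaped here, because it is the only user-controlled text that ends up inside a suggested command.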
Acceptance criteria
Fresh /workflow run X "B" with a prior failed run → prompt fires, executeWorkflow not called, new args not discarded.
--force → fresh run dispatched, failed row untouched.
/workflow abandon <failed-id> → cancelled; subsequent /workflow run no longer prompts.
NL-approval of a paused gate → resumes (no prompt). ← regression guard, easy to break
/workflow resume <id> → resumes the prior run, no prompt-loop, no fresh run.
bun run validate clean.
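The --force criterion depends on the flag being recognized anywhere in the args and removed before the args reach the workflow. A minimal sketch of that step follows, assuming the command handler already has the args as a token array in which a quoted message is a single token; extractForceFlag and ParsedRunArgs are hypothetical names, not the existing command-handler.ts API.

```typescript
// Hypothetical helper; assumes args are pre-tokenized and quoted text is one token.
interface ParsedRunArgs {
  force: boolean;
  args: string[];
}

function extractForceFlag(tokens: readonly string[]): ParsedRunArgs {
  // Drop every standalone --force token, wherever it appears.
  const args = tokens.filter((token) => token !== '--force');
  return { force: args.length !== tokens.length, args };
}

console.log(extractForceFlag(['my-workflow', '--force', 'completely different task']));
console.log(extractForceFlag(['--force', 'my-workflow', 'task B']));
console.log(extractForceFlag(['my-workflow', 'task B']));
```

Because only exact token matches are stripped, a message that merely mentions --force inside its quoted text stays intact under the single-token assumption.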
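Files
packages/core/src/orchestrator/orchestrator-agent.ts: dispatch gate, NL-approval bypass, thread options
packages/core/src/handlers/command-handler.ts: --force parsing, resume-intent signal
packages/core/src/operations/workflow-operations.ts: abandonWorkflow accepts failed
packages/core/src/types/index.ts: force? / resumeRunId?
packages/core/src/db/workflows.ts: likely no change (SELECT * already returns status + user_message)
orchestrator-agent.test.ts, command-handler.test.ts
Original report
Summary
What broke: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message ("task A"). The new prompt is discarded with no UI/log indication. The user sees a positive completion report on task A and is confused why task B never happened. Compounding: /workflow abandon rejects failed runs as "already terminal", so users hit by this cannot easily escape.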
When it started (if known): introduced in PR #914 (fix: foreground resume for interactive workflows + chat auto-resume), which added findResumableRunByParentConversation with status IN ('failed', 'paused'). The 'failed' clause was scoped to support manual /workflow resume <id>; using it for automatic resume on a fresh /workflow run produces the silent-hijack behavior.
Severity: major (silent data loss / silent intent loss; trust-corroding)
Steps to Reproduce
1. Pick any workflow whose first node materializes the user_message into $ARTIFACTS_DIR/.X files (most non-trivial workflows do this, e.g. parse-args style scripts).
2. Run it with input that fails an early step:
   /workflow run my-workflow "input that fails parse"
3. Observe: the run is failed in remote_agent_workflow_runs, working_path = .../worktrees/archon/thread-<old-id>/.
4. In the same chat conversation, run it with new input:
   /workflow run my-workflow "completely different task"
5. Observe in server logs:
   module=command-handler     args="completely different task"   ← what the user typed
   module=orchestrator-agent  msg=orchestrator.foreground_resume_detected
                              resumableRunId=<old-id>
                              workingPath=…/thread-<old-id>/
6. The workflow runs again, in the same sub-worktree, with the previous run's user_message (preserved in $ARTIFACTS_DIR/.X files). Step 4's input is never used.
Expected vs Actual
Expected: a fresh /workflow run with new args dispatches a fresh run in a fresh worktree with the new args. The prior failed run remains as an audit-trail row but does not steer execution. If the user wants to continue the failed run from where it stopped, they explicitly type /workflow resume <id>.
Actual: the orchestrator silently picks up any failed | paused resumable run for the same (workflow_name, parent_conversation_id), calls executeWorkflow with the failed run's working_path, and the workflow re-reads stale state from disk. The new args travel through the call as userMessage but are discarded by parse-args/script-style early nodes.
User Flow
User                              Archon                                   DB
────                              ──────                                   ──
runs /workflow run X "A" ───────▶ findResumable... → null
                                  dispatch fresh ───▶                      run-A row → status='failed'
                                  (parse-args fails on input "A")
runs /workflow run X "B" ───────▶ findResumable... → run-A
                                  [X] auto-resume in run-A's worktree
                                      with run-A's persisted state
                                  executeWorkflow(
                                    working_path=thread-run-A,
                                    userMessage="B"
                                    ↑ scripts ignore: they read
                                      $ARTIFACTS_DIR/.X from run-A
                                  ) ───▶                                   run-A re-executed,
                                                                           still on task A
sees positive report ◀────────── task-A success report
"I asked for B"                   (still no idea task B was hijacked)
The [X] is where intent silently disappears.
Environment
Platform: Web (orchestrator agent path)
Database: SQLite (PostgreSQL has the same SQL, same behavior)
Running in worktree? Yes (workflow sub-worktrees)
OS: macOS host with Linux container; not OS-specific
Logs
{"level":30,"module":"command-handler","workflow":"ztech-marimo-edit",
"args":"fortigapminder.marimo.py Remove redundant local tomllib re-imports
from cells 4, 7, 12 and 13", ← user's correct args
"msg":"cmd.workflow_starting"}
{"level":30,"module":"orchestrator-agent",
"workflowName":"ztech-marimo-edit",
"resumableRunId":"92d86ea89fd6808c5f6534b4ef34acbc", ← prior failed run
"workingPath":"/.archon/.../worktrees/archon/thread-85a590f9",
"msg":"orchestrator.foreground_resume_detected"}
{"level":30,"module":"workflow.dag-executor",
"priorCompletedCount":5,
"msg":"dag.workflow_resume_prepopulated"} ← old state restored
{"level":50,"module":"workflow.dag-executor","exitCode":1,
"stderrTail":"ERROR: First argument must be a notebook path ending in .py
[...] INPUT (arg $1)='Edit the notebook at fortigapminder...'",
↑ THE OLD reformulated user_message,
not the new args
"msg":"dag_node_failed"}
The fresh /workflow run passed the correct path-prefixed args, but the resumed run reads the old natural-language reformulation from the .edit-description artifact persisted by run-A.
Impact
Affected workflows/commands: any workflow with a first node that materializes user_message into $ARTIFACTS_DIR/.X files (most non-trivial DAG workflows), including archon-fix-issue, archon-feature-development, and custom user workflows.
Reproduction rate: Always — deterministic given the SQL match (failed run + same workflow + same conversation).
Workaround available: pre-this-PR there was none. /workflow abandon rejected failed runs as terminal; /workflow resume <id> re-ran the same stale state; the only way out was direct DB manipulation (UPDATE remote_agent_workflow_runs SET status='cancelled' WHERE id=...).
Data loss risk: Yes — silent intent loss. The user's request is discarded with no log/UI indication.
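The missing workaround traces back to abandon treating failed as a terminal status. A minimal sketch of the relaxed check the fix design asks for (accept failed, reject only completed and cancelled) could look like this; the status union and the function name nextStatusOnAbandon are assumptions, and the real abandonWorkflow may track more statuses than are listed here.

```typescript
// Hypothetical status check; the real abandonWorkflow lives in workflow-operations.ts.
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

function nextStatusOnAbandon(status: RunStatus): 'cancelled' {
  // Only completed and cancelled runs are rejected; failed is now abandonable.
  if (status === 'completed' || status === 'cancelled') {
    throw new Error('Cannot abandon a ' + status + ' run');
  }
  return 'cancelled';
}

console.log(nextStatusOnAbandon('failed'));
```

Once a failed run has moved to cancelled, it no longer matches the failed/paused lookup, so a later /workflow run starts fresh without a prompt.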
Scope
core
core:orchestrator (dispatch logic), core:db (findResumableRunByParentConversation), core:operations (abandonWorkflow)