/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549

> **Maintainer note (updated 2026-06-09):** Confirmed **still LIVE** on `dev` and re-verified against current code. The original report (preserved below) is accurate, but the resume path has been **substantially refactored since this was filed** (#1646 `hydrateResumableRun`, #1830 CAS guards), so the implementation must be built on the *current* model — not the code shape described in the original report. PR #1551 (the original fix, by @ztech-gthb) was closed because its `/workflow resume` re-routing collides with the new CAS model (text-merges clean, semantically broken). **Implement fresh** using the guide below; reuse #1551's UX verbatim and co-credit the author.

---

## 🔧 Implementation guide (for the eng picking this up)

### The bug, in current-code terms
A fresh `/workflow run X "B"` silently **auto-resumes a prior FAILED run** of `X` in the same conversation, executing in that run's old worktree with its stale persisted `user_message`. The new args ("B") are discarded with no UI/log signal. Compounding: `/workflow abandon` rejects failed runs as terminal, so users can't easily escape.

**Verified root cause (current code):**
- `findResumableRunByParentConversation(name, convId, codebaseId)` filters `status IN ('failed', 'paused')` — `packages/core/src/db/workflows.ts:421`
- Resume detection runs for **every** `/workflow run` (all platforms) inside `dispatchOrchestratorWorkflow` in `packages/core/src/orchestrator/orchestrator-agent.ts` (at the `findResumableRunByParentConversation` call, currently around line 542). A fresh run finds the prior `failed` run and resumes in its `working_path`.
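
To make the root cause concrete, here is a minimal in-memory model of the lookup. The `Run` shape and the `findResumable` helper are illustrative stand-ins, not the real `findResumableRunByParentConversation`; only the `status IN ('failed', 'paused')` filter is taken from the code. The `codebaseId` argument and any row ordering are deliberately left out.

```typescript
// Illustrative model only: the real lookup is a SQL query in packages/core/src/db/workflows.ts.
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

interface Run {
  id: string;
  workflowName: string;
  parentConversationId: string;
  status: RunStatus;
  userMessage: string;
}

// Mirrors the filter `status IN ('failed', 'paused')`.
// Which row wins when several match is not modeled here.
function findResumable(runs: Run[], workflowName: string, conversationId: string): Run | null {
  return (
    runs.find(
      (r) =>
        r.workflowName === workflowName &&
        r.parentConversationId === conversationId &&
        (r.status === 'failed' || r.status === 'paused'),
    ) ?? null
  );
}

const runs: Run[] = [
  { id: 'run-A', workflowName: 'X', parentConversationId: 'conv-1', status: 'failed', userMessage: 'A' },
];

// A fresh `/workflow run X "B"` in the same conversation still matches run-A,
// so the dispatcher would resume run-A with its stale userMessage.
const hit = findResumable(runs, 'X', 'conv-1');
console.log(hit?.id, hit?.userMessage);
```

Once the row is `cancelled` (for example after an abandon), the same call returns `null` and a fresh run would start.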

### Architecture you MUST build on (do not regress)
The resume/approval path was reworked after this issue was filed:
- **`hydrateResumableRun(deps, run)`** (#1646) — performs the resume **CAS transition itself** and returns the prepared run (priorCompletedNodes etc.). Do **not** transition the run to `running` *before* dispatch and then rely on `findResumable` — it only matches `failed`/`paused`, so a pre-transitioned run would be missed and a **fresh** run would start. This is exactly the trap PR #1551 fell into.
- **`WorkflowNotResumableError` CAS guards** (#1830) — concurrent resumers (web Resume button / chat re-dispatch / CLI) must not double-claim a worktree. Keep this intact.
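
The compare-and-set idea behind both bullets can be sketched with an in-memory store. Everything here is hypothetical (`statusById`, `claimForResume`, and the local `NotResumableError` class standing in for `WorkflowNotResumableError`); in the real code the check and the write would presumably be a single conditional database update rather than two JavaScript statements.

```typescript
// Hypothetical in-memory stand-in for the CAS guard; not the real DB-backed implementation.
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

class NotResumableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'NotResumableError';
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, NotResumableError.prototype);
  }
}

const statusById: Record<string, RunStatus> = { 'run-A': 'failed' };

// Compare-and-set: claim the run only if it is still failed/paused at the moment of the write.
function claimForResume(runId: string): void {
  const current = statusById[runId];
  if (current !== 'failed' && current !== 'paused') {
    throw new NotResumableError(`run ${runId} is ${current ?? 'missing'}, not resumable`);
  }
  statusById[runId] = 'running';
}

claimForResume('run-A'); // first resumer wins and owns the worktree

let secondClaimRejected = false;
try {
  claimForResume('run-A'); // second resumer sees 'running' and is rejected
} catch (err) {
  secondClaimRejected = err instanceof NotResumableError;
}
console.log(secondClaimRejected); // true
```

The same model shows the #1551 trap: if a command handler flips the run to `running` before dispatch, the lookup (which only matches `failed`/`paused`) no longer sees it, so nothing is left for the hydrate step to claim.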

There are **three** resume entry points, all flowing through `dispatchOrchestratorWorkflow`. The fix must treat them differently:

| Entry point | Location | Desired behavior |
|-------------|----------|------------------|
| **a. Fresh `/workflow run X "B"`** (the bug) | dispatch resume-detection | If a **failed** run exists → **PROMPT** (don't resume). If **paused** → resume (approval-gate continuation). |
| **b. Natural-language approval** of a paused gate | `orchestrator-agent.ts` ~line 1037 | dev marks the approved run **`status:'failed'`** then re-dispatches with `pausedRun.user_message` → must **RESUME**, not prompt. |
| **c. `/workflow resume <id>`** | `command-handler.ts` `case 'resume'` (~line 687) | Explicit user resume → must **RESUME**, not prompt (no prompt-loop). |

### Fix design
1. **`dispatchOrchestratorWorkflow`** — add `options?: { force?: boolean; resumeRunId?: string }`.
   - Guard the lookup: `const resumableRun = options?.force ? null : await findResumableRunByParentConversation(...)`.
   - **Gate before `hydrateResumableRun`:**
     ```ts
     if (resumableRun && resumableRun.status !== 'paused' && resumableRun.id !== options?.resumeRunId) {
       // emit the 3-option prompt (see UX below) + log 'orchestrator.failed_resume_user_prompted'
       return;
     }
     ```
   - Everything else (hydrate + CAS + `executeWorkflow`) stays as dev has it.
2. **NL-approval path** (entry b): pass `{ resumeRunId: pausedRun.id }` so the approved-then-marked-`failed` run bypasses the prompt and resumes.
3. **`/workflow resume <id>`** (entry c): carry an explicit-resume signal so the subsequent dispatch bypasses the prompt **without** pre-transitioning the run out of `failed`/`paused` before `hydrateResumableRun` reads it. (i.e. let `hydrate` own the CAS; the command path only needs to mark intent — e.g. thread `resumeRunId` onto the workflow result and into `options`.)
4. **`abandonWorkflow`** (`packages/core/src/operations/workflow-operations.ts`): accept `failed` (transition `failed → cancelled`); reject only `completed | cancelled`. After abandon, a subsequent `/workflow run` must not prompt (cancelled is excluded from the lookup).
5. **`--force` parsing** (`command-handler.ts`, `/workflow run`): recognize `--force` anywhere in args, strip it from the workflow args, thread it onto the result → `options.force`.
6. **types** (`packages/core/src/types/index.ts`): `force?: boolean` and `resumeRunId?: string` on the workflow command result.
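
The gate in step 1, combined with the bypasses in steps 2, 3 and 5, reduces to a small decision table. The sketch below expresses it as a pure function so it can be unit-tested in isolation. `decideDispatch`, `DispatchDecision` and the `ResumableRun` shape are hypothetical names for illustration; the real change would live inline in `dispatchOrchestratorWorkflow`, where `force` skips the lookup altogether.

```typescript
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

interface ResumableRun {
  id: string;
  status: RunStatus;
}

interface DispatchOptions {
  force?: boolean;
  resumeRunId?: string;
}

type DispatchDecision = 'fresh' | 'resume' | 'prompt';

// Hypothetical pure helper. `resumableRun` is whatever the lookup returned
// (null when nothing matched, or when `force` skipped the lookup).
function decideDispatch(resumableRun: ResumableRun | null, options?: DispatchOptions): DispatchDecision {
  if (options?.force || resumableRun === null) return 'fresh';
  // Paused runs keep today's behavior: approval-gate continuation.
  if (resumableRun.status === 'paused') return 'resume';
  // Explicit intent (NL approval or `/workflow resume <id>`) bypasses the prompt.
  if (resumableRun.id === options?.resumeRunId) return 'resume';
  // A failed run found by a plain `/workflow run`: ask instead of hijacking.
  return 'prompt';
}

const failedRun: ResumableRun = { id: 'run-A', status: 'failed' };

console.log(decideDispatch(failedRun)); // 'prompt'  (entry a)
console.log(decideDispatch(failedRun, { resumeRunId: 'run-A' })); // 'resume'  (entries b and c)
console.log(decideDispatch(failedRun, { force: true })); // 'fresh'
```

Keeping the decision separate from the side effects (prompt emission, hydrate, `executeWorkflow`) is a design suggestion, not something the current code is known to do.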

### UX (reuse from closed PR #1551 — by @ztech-gthb)
Failed-run prompt = three options, with a **preview of the prior run's `user_message`** (truncate ~160 chars) so the user can tell whether "resume" matches their intent:
1. `/workflow resume <id>` — resume that run (re-runs the prior prompt)
2. `/workflow abandon <id>` then re-run the command — discard + fresh
3. `/workflow run <name> --force "<msg>"` — fresh, leave the failed run as-is

Escape `\`, `"`, and `` ` `` in the interpolated suggested command (a raw backtick would break the markdown code fence). **Co-credit the original author** on the commit: `Co-authored-by: Zolto <zolto@zhome.local>`.
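
A rough sketch of the preview and escaping described above. The helper names, the prompt wording, the ellipsis marker and the backslash-prefix escaping scheme are all assumptions; #1551's actual text should be reused verbatim where it differs.

```typescript
// Prefix each backslash, double quote, and backtick with a backslash.
// The backslash-prefix scheme is an assumption, not necessarily what #1551 did.
function escapeForSuggestedCommand(text: string): string {
  return text.replace(/[\\"`]/g, (ch) => `\\${ch}`);
}

// Truncate the preview to `max` characters; the trailing ellipsis is an assumption.
function previewMessage(message: string, max = 160): string {
  return message.length <= max ? message : `${message.slice(0, max)}…`;
}

interface FailedRun {
  id: string;
  workflowName: string;
  userMessage: string;
}

// Hypothetical prompt builder: previews the old message and lists the three options.
function buildFailedRunPrompt(run: FailedRun, newMessage: string): string {
  const escaped = escapeForSuggestedCommand(newMessage);
  return [
    `A previous run of ${run.workflowName} failed. Its prompt was: "${previewMessage(run.userMessage)}"`,
    `1. /workflow resume ${run.id}`,
    `2. /workflow abandon ${run.id}, then re-run your command`,
    `3. /workflow run ${run.workflowName} --force "${escaped}"`,
  ].join('\n');
}

const promptText = buildFailedRunPrompt(
  { id: 'run-A', workflowName: 'my-workflow', userMessage: 'input that fails parse' },
  'fix "it" now',
);
console.log(promptText);
```

Note that the preview shows the prior run's message while option 3 interpolates the new one, so the user can compare the two before choosing.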

### Acceptance criteria
- [ ] Fresh `/workflow run X "B"` with a prior **failed** run → prompt fires, `executeWorkflow` **not** called, new args not discarded.
- [ ] `--force` → fresh run dispatched, failed row untouched.
- [ ] `/workflow abandon <failed-id>` → `cancelled`; subsequent `/workflow run` no longer prompts.
- [ ] **NL-approval** of a paused gate → resumes (no prompt). ← regression guard, easy to break
- [ ] **`/workflow resume <id>`** → resumes the prior run, no prompt-loop, no fresh run.
- [ ] **Paused**-run auto-resume (approval gate, PR #914) unchanged.
- [ ] `bun run validate` clean.
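
Two of the criteria above (`--force` and abandon) depend on small, easily tested rules. The sketch below is one possible shape for them. `parseRunArgs` and `canAbandon` are hypothetical helpers, and the parser assumes the arguments arrive as a single raw string, which may not match how `command-handler.ts` actually tokenizes input.

```typescript
type RunStatus = 'running' | 'paused' | 'failed' | 'completed' | 'cancelled';

// Hypothetical parser: recognizes `--force` anywhere in the args and strips it.
// Caveats: runs of whitespace collapse to single spaces, and a literal `--force`
// inside a quoted message would also be stripped.
function parseRunArgs(rawArgs: string): { force: boolean; args: string } {
  const tokens = rawArgs.split(/\s+/).filter((t) => t.length > 0);
  const force = tokens.includes('--force');
  const args = tokens.filter((t) => t !== '--force').join(' ');
  return { force, args };
}

// Abandon rule from the fix design: reject only `completed` and `cancelled`,
// so `failed` runs become abandonable.
function canAbandon(status: RunStatus): boolean {
  return status !== 'completed' && status !== 'cancelled';
}

console.log(parseRunArgs('fix the --force parser'));
console.log(canAbandon('failed'), canAbandon('completed'));
```

If quoted messages containing `--force` need to survive intact, the flag check would have to run after quote-aware tokenization instead.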

### Files
- `packages/core/src/orchestrator/orchestrator-agent.ts` — dispatch gate, NL-approval bypass, thread `options`
- `packages/core/src/handlers/command-handler.ts` — `--force` parsing, resume-intent signal
- `packages/core/src/operations/workflow-operations.ts` — `abandonWorkflow` accepts `failed`
- `packages/core/src/types/index.ts` — `force?` / `resumeRunId?`
- `packages/core/src/db/workflows.ts` — likely no change (`SELECT *` already returns `status` + `user_message`)
- Tests: `orchestrator-agent.test.ts`, `command-handler.test.ts`

---

## Original report

## Summary

- What broke: `/workflow run X "task B"` silently auto-resumes a prior failed run of `X` in the same chat, executing in the failed run's sub-worktree with the failed run's persisted `user_message` ("task A"). The new prompt is discarded with no UI/log indication. The user sees a positive completion report on task A and is confused why task B never happened. Compounding: `/workflow abandon` rejects failed runs as "already terminal", so users hit by this cannot easily escape.
- When it started (if known): introduced in PR #914 (`fix: foreground resume for interactive workflows + chat auto-resume`) which added `findResumableRunByParentConversation` with `status IN ('failed', 'paused')`. The `'failed'` clause was scoped to support manual `/workflow resume <id>`; using it for *automatic* resume on a fresh `/workflow run` produces the silent-hijack behavior.
- Severity: `major` (silent data loss / silent intent loss; trust-corroding)

## Steps to Reproduce

1. Pick any workflow whose first node materializes the `user_message` into `$ARTIFACTS_DIR/.X` files (most non-trivial workflows do this, e.g. `parse-args`-style scripts).
2. Run it with input that fails an early step:
   ```
   /workflow run my-workflow "input that fails parse"
   ```
3. Observe: run is `failed` in `remote_agent_workflow_runs`, `working_path = .../worktrees/archon/thread-<old-id>/`.
4. In the **same chat conversation**, run it with new input:
   ```
   /workflow run my-workflow "completely different task"
   ```
5. Observe in server logs:
   ```
   module=command-handler   args="completely different task"   ← what the user typed
   module=orchestrator-agent msg=orchestrator.foreground_resume_detected
                             resumableRunId=<old-id>
                             workingPath=…/thread-<old-id>/
   ```
6. The workflow runs again, in the **same** sub-worktree, with the **previous run's** `user_message` (preserved in `$ARTIFACTS_DIR/.X` files). Step 4's input is never used.

## Expected vs Actual

- **Expected**: a fresh `/workflow run` with new args dispatches a fresh run in a fresh worktree with the new args. The prior failed run remains as an audit-trail row but does not steer execution. If the user wants to continue the failed run from where it stopped, they explicitly type `/workflow resume <id>`.
- **Actual**: the orchestrator silently picks up any `failed | paused` resumable run for the same `(workflow_name, parent_conversation_id)`, calls `executeWorkflow` with the failed run's `working_path`, and the workflow re-reads stale state from disk. The new args travel through the call as `userMessage` but are discarded by parse-args/script-style early nodes.

## User Flow

```
User                              Archon                              DB
────                              ──────                              ──
runs /workflow run X "A" ───────▶ findResumable... → null
                                  dispatch fresh                  ───▶ run-A row → status='failed'
                                  (parse-args fails on input "A")

runs /workflow run X "B" ───────▶ findResumable... → run-A
                                  [X] auto-resume in run-A's worktree
                                      with run-A's persisted state
                                  executeWorkflow(
                                      working_path=thread-run-A,
                                      userMessage="B"
                                      ↑ scripts ignore: they read
                                        $ARTIFACTS_DIR/.X from run-A
                                  )                              ───▶ run-A re-executed,
                                                                       still on task A
sees positive report ◀────────── task-A success report
   "I asked for B"                (still no idea task B was hijacked)
```

The `[X]` is where intent silently disappears.

## Environment

- Platform: Web (orchestrator agent path)
- Database: SQLite (PostgreSQL has the same SQL, same behavior)
- Running in worktree? **Yes** (workflow sub-worktrees)
- OS: macOS host with Linux container; not OS-specific

## Logs

```
{"level":30,"module":"command-handler","workflow":"ztech-marimo-edit",
 "args":"fortigapminder.marimo.py Remove redundant local tomllib re-imports
         from cells 4, 7, 12 and 13",                       ← user's correct args
 "msg":"cmd.workflow_starting"}

{"level":30,"module":"orchestrator-agent",
 "workflowName":"ztech-marimo-edit",
 "resumableRunId":"92d86ea89fd6808c5f6534b4ef34acbc",       ← prior failed run
 "workingPath":"/.archon/.../worktrees/archon/thread-85a590f9",
 "msg":"orchestrator.foreground_resume_detected"}

{"level":30,"module":"workflow.dag-executor",
 "priorCompletedCount":5,
 "msg":"dag.workflow_resume_prepopulated"}                  ← old state restored

{"level":50,"module":"workflow.dag-executor","exitCode":1,
 "stderrTail":"ERROR: First argument must be a notebook path ending in .py
              [...] INPUT (arg $1)='Edit the notebook at fortigapminder...'",
                                          ↑ THE OLD reformulated user_message,
                                            not the new args
 "msg":"dag_node_failed"}
```

The fresh `/workflow run` typed the correct path-prefixed args, but the resumed run reads the old natural-language reformulation from `.edit-description` artifact persisted by run-A.

## Impact

- Affected workflows/commands: any workflow with a first node that materializes `user_message` into `$ARTIFACTS_DIR/.X` files (most non-trivial DAG workflows), e.g. `archon-fix-issue`, `archon-feature-development`, and custom user workflows.
- Reproduction rate: **Always** — deterministic given the SQL match (failed run + same workflow + same conversation).
- Workaround available: pre-this-PR there was none. `/workflow abandon` rejected failed runs as terminal; `/workflow resume <id>` re-ran the same stale state; the only way out was direct DB manipulation (`UPDATE remote_agent_workflow_runs SET status='cancelled' WHERE id=...`).
- Data loss risk: **Yes** — silent intent loss. The user's request is discarded with no log/UI indication.

## Scope

- Package(s): `core`
- Module: `core:orchestrator` (dispatch logic), `core:db` (`findResumableRunByParentConversation`), `core:operations` (`abandonWorkflow`)


