fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag by ztech-gthb · Pull Request #1551 · coleam00/Archon

ztech-gthb · 2026-05-03T12:57:03Z

Summary

Problem: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message (= "task A"). The new prompt is discarded with no UI/log indication.
Why it matters: confident-but-wrong reports of work the user didn't ask for. Trust-corroding silent data loss; users debug nonexistent problems while the assistant runs the same stale task repeatedly.
What changed: failed runs now emit a 3-option prompt (resume / abandon-then-rerun / --force) instead of silently auto-resuming. /workflow abandon accepts failed runs (transitions them to cancelled). New --force flag bypasses the resume-detection lookup for callers who explicitly want a fresh run while keeping the failed audit row.
What did not change: paused-run auto-resume (PR 🐛 UserReportedError: Manual bug report #914 approval-gate path) is unchanged; /workflow resume <id> still accepts failed runs; DB schema unchanged; no migrations.

UX Journey

Before

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              [silent foreground resume]
                              executeWorkflow(working_path=
                                  thread-of-run-A,
                                  userMessage="B"
                                  ↑ ignored: scripts read
                                  $ARTIFACTS_DIR/.X persisted
                                  from run-A)             ───▶ "completed" on
                                                                task A again

sees positive report ◀──── reports task A as success
   (still confused about
    why task B didn't happen)

After

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              status === 'paused'?
                                no → emit 3-option prompt    [no new run]
                                     **return** (no dispatch)
sees prompt with: ◀────
  /workflow resume <id>           [option 1]
  /workflow abandon <id>          [option 2: works on failed now]
   then re-run command
  /workflow run X --force "B"     [option 3: keep run-A as-is]

Architecture Diagram

Before

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ args.slice(2).join(' ') = "..."
                   │     ▼
                   │  CommandResult.workflow = { definition, args }
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand
                         │
                         ▼
                      dispatchOrchestratorWorkflow
                         │
                         ▼
                      findResumableRunByParentConversation
                      (status IN ['failed','paused'])
                         │
                         ▼
                      resumableRun? → executeWorkflow [silent]
                                     (auto-resume regardless of status)


workflow-operations.ts: abandonWorkflow rejects ALL terminal statuses
  (= completed | failed | cancelled) — failed runs un-abandonable

After

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ NEW: also parses --force from args
                   │     ▼
                   │  CommandResult.workflow = { definition, args, force? } [~]
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand(..., options={force}) [~]
                         │
                         ▼
                      dispatchOrchestratorWorkflow(..., options) [~]
                         │
                         ▼
                      options.force? null : findResumableRun(...) [+]
                         │
                         ▼
                      resumableRun?
                         ├─ status==='paused' → executeWorkflow [unchanged]
                         └─ else (failed/...)
                              → platform.sendMessage(3-option prompt) [+]
                              → return [+]


workflow-operations.ts: abandonWorkflow rejects only completed | cancelled [~]
  (failed → cancelled is the user's discard action; row stays in DB)

Connection inventory:

From	To	Status	Notes
`command-handler.ts:case 'run'`	`args.findIndex('--force')`	new	flag parsing
`command-handler.ts`	`CommandResult.workflow.force`	new	type-level pass-through
`orchestrator-agent.ts:handleWorkflowRunCommand`	`dispatchOrchestratorWorkflow`	modified	new `options` param
`dispatchOrchestratorWorkflow`	`findResumableRunByParentConversation`	modified	guarded by `options.force`
`dispatchOrchestratorWorkflow`	`platform.sendMessage` (3-option prompt)	new	failed-run user prompt
`dispatchOrchestratorWorkflow`	`executeWorkflow` (paused branch)	unchanged	PR #914 path preserved
`abandonWorkflow`	`TERMINAL_WORKFLOW_STATUSES` check	removed	replaced with explicit `completed/cancelled`

Label Snapshot

Risk: risk: low
Size: size: M
Scope: core
Module: core:orchestrator, core:operations, core:db

Change Metadata

Change type: bug
Primary scope: core

Linked Issue

Closes /workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549
Related archon-assist silently discards edits — workflow has no persistence step #1546 — archon-assist persistence; same UX class (silent edit loss), independent mechanism
Builds on 🐛 UserReportedError: Manual bug report #914 — fix: foreground resume for interactive workflows + chat auto-resume introduced the findResumableRunByParentConversation lookup this PR refines

Validation Evidence (required)

bun run type-check    # clean across all 10 packages
bun --filter @archon/core test
# 102 pass / 0 fail
# - foreground_resume_detected: existing test, fixture switched 'failed' → 'paused'
# - failed_resume_user_prompted: new (asserts no executeWorkflow, prompt text contains all 3 commands)
# - --force flag: new (asserts findResumable not consulted, dispatchBackground fires)

bun run lint          # clean for all touched files
bun run format:check  # clean

End-to-end manual verification on a real chat thread (2026-05-03):

Existing failed run 32c786ef… from prior session stayed failed in DB.
/workflow run ztech-marimo-edit "..." produced the 3-option prompt. No silent dispatch, no executeWorkflow call observed in workflow logs.
/workflow run ztech-marimo-edit --force "..." dispatched fresh, completed success: true with new feature branch (wf/marimo-edit-1777809446). Original 32c786ef… row untouched.
/workflow abandon 32c786ef… transitioned the row failed → cancelled, completed_at populated. Subsequent /workflow run in the same chat no longer triggered the prompt.

Security Impact (required)

New permissions/capabilities? No
New external network calls? No
Secrets/tokens handling changed? No
File system access scope changed? No

The prompt text is constructed via plain string concatenation; the userMessage's embedded quotes are escaped (\") when interpolated into the suggested re-run command block. No new untrusted-input parsing path.

Compatibility / Migration

Backward compatible? Yes for the paused-run auto-resume path (PR 🐛 UserReportedError: Manual bug report #914), /workflow resume <id> for failed runs, /workflow run callers without --force. The only behavior change is: failed-run-on-fresh-/workflow run now prompts instead of silently auto-resuming — the prior behavior was silent data loss, so this is the intended fix, not a regression.
Config/env changes? No
Database migration needed? No

Human Verification (required)

Verified scenarios:

Failed-run prompt fires for status='failed' resumable, three commands appear with correct IDs interpolated.
--force skips the resume lookup entirely (verified via mockFindResumableRunByParentConversation.not.toHaveBeenCalled() in test, plus end-to-end real run).
/workflow abandon on a failed run transitions to cancelled (DB confirmed).
After abandon, /workflow run in the same conversation no longer prompts (cancelled is excluded from the resume lookup).
paused-status auto-resume continues to work in the existing test fixture.

Edge cases checked:

--force token recognized at any position in args (not just immediately after workflow name).
userMessage with embedded " is escaped in the option-3 suggested command block.

What was not verified:

A workflow that legitimately wants to retry a transient failed run silently (if such a case exists, this PR makes it require either /workflow resume <id> or --force). No such workflow has been identified in the default workflows.

Side Effects / Blast Radius (required)

Affected subsystems: core:orchestrator dispatch path, core:operations abandon validation, core:handlers workflow command parsing, types.
Potential unintended effects: a workflow author who depended on the silent failed-run auto-resume as a "retry on flaky steps" mechanism will see the new prompt. The prompt's first option (/workflow resume <id>) preserves that capability with one extra step.
Guardrails: the new orchestrator.failed_resume_user_prompted log event lets operators detect users hitting this path. The existing orchestrator.foreground_resume_detected log fires only on the (unchanged) paused branch.

Rollback Plan (required)

Fast rollback: revert this commit. Behavior returns to silent auto-resume of failed runs and abandon-rejection. No data migration needed.
Feature flags: none. The behavior change is unconditional but additive (--force is opt-in for the override case, and the prompt itself emits one user-visible message).
Observable failure symptoms if rollback is needed: users complaining the prompt is too noisy (would suggest making the prompt suppressible per-workflow), or workflows expecting silent retry-on-failure breaking (would suggest a per-workflow auto_resume_failed flag). Neither has been observed in testing.

Risks and Mitigations

Risk: a workflow author relied on silent auto-resume of failed runs as a retry mechanism.
- Mitigation: the prompt's first option offers /workflow resume <id> as a copy-pasteable command; the retry capability is preserved with one extra user step.
Risk: --force is recognized as a token anywhere in args. A user passing --force as part of literal description text ("... change the --force flag handling ...") could accidentally trigger force.
- Mitigation: known pattern in CLI tools without -- separator support. If literal collision becomes a real problem, a -- separator could be added in a follow-up. Not a regression: the flag didn't exist before, no existing input pattern is broken.
Risk: hand-maintained 3-option prompt drifts from the actual command-handler behavior over time.
- Mitigation: the new test (failed_resume_user_prompted) asserts the three command strings are present in the prompt; CI catches accidental removal. A future PR could derive the prompt from command-handler metadata, but is premature given the small surface area.

Summary by CodeRabbit

New Features
- Global --force option to start a fresh workflow run, skipping resume detection.
- /workflow resume <id> now returns workflow details (including optional resumeRunId and force) to trigger immediate foreground resumption.
Bug Fixes
- Paused runs auto-resume; failed runs prompt users with clear choices: resume, abandon+retry, or start fresh.
- Abandon now accepts running, paused, and failed runs.
- Clear error shown when a resumed workflow definition is missing.

… --force flag

coderabbitai · 2026-05-03T12:57:14Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b024ee2-7d50-495c-9bc7-94570458a80a

📥 Commits

Reviewing files that changed from the base of the PR and between 332426e and 4088888.

📒 Files selected for processing (2)

packages/core/src/orchestrator/orchestrator-agent.test.ts
packages/core/src/orchestrator/orchestrator-agent.ts

✅ Files skipped from review due to trivial changes (1)

packages/core/src/orchestrator/orchestrator-agent.test.ts

📝 Walkthrough

Walkthrough

This PR prevents silent auto-resume of failed runs: only paused runs auto-resume, failed runs prompt the user (or require explicit resumeRunId), /workflow run supports a positional --force to skip resume detection, abandonWorkflow permits abandoning failed runs, and tests/types/handlers/orchestrator updated.

Changes

Auto-resume hijack fix (single cohort)

Layer / File(s)	Summary
Data Shape `packages/core/src/types/index.ts`	`CommandResult.workflow` adds optional `force?: boolean` and `resumeRunId?: string`.
Command Parsing / CLI `packages/core/src/handlers/command-handler.ts`	`/workflow run` parses `--force` positionally (first bare arg), strips it from args, and returns `workflow.force`; `/workflow resume <id>` resumes the run then reloads discovered workflows from the resumed run's cwd, validates the definition exists, and returns a `workflow` payload with `definition`, `args` (from `run.user_message ?? ''`) and `resumeRunId`, or an explicit discovery/missing-workflow error.
Core Orchestration Logic `packages/core/src/orchestrator/orchestrator-agent.ts`	`dispatchOrchestratorWorkflow(options?: {force?: boolean; resumeRunId?: string})` added. On web: skip resumable-run lookup when `options.force` is true; otherwise auto-resume only if prior run `status === 'paused'` or `resumableRun.id === options.resumeRunId`; for other prior runs (e.g., `failed`) send a user prompt with resume/abandon/force choices. Deterministic `/workflow` handling forwards `force` and `resumeRunId` into `handleWorkflowRunCommand`. Natural-language approval resume passes `{ resumeRunId }`.
Operations Layer `packages/core/src/operations/workflow-operations.ts`	`abandonWorkflow()` now rejects only `completed` and `cancelled`; it allows abandoning `failed`, `running`, and `paused` runs. Removed `TERMINAL_WORKFLOW_STATUSES` usage.
DB Docs `packages/core/src/db/workflows.ts`	Expanded doc comment for `findResumableRunByParentConversation` to document orchestrator semantics: `paused` runs auto-resume; `failed` runs cause a prompt to avoid reusing stale persisted `user_message`.
Tests `packages/core/src/orchestrator/orchestrator-agent.test.ts`, `packages/core/src/handlers/command-handler.test.ts`	Reset `mockFindResumableRunByParentConversation` in `beforeEach`. Updated foreground-resume test to use `status: 'paused'`. Added tests covering `failed`-run prompt, explicit resume via `workflow.resumeRunId`, `--force` bypass, and updated `/workflow resume` assertions and discovery seeding helper.

Sequence Diagram

sequenceDiagram
    actor User
    participant Handler as Command Handler
    participant Orchestrator as Orchestrator Agent
    participant DB as Database
    participant Executor as Workflow Executor

    rect rgba(0,150,0,0.5)
    Note over User,Executor: Forced fresh run (skip resume lookup)
    User->>Handler: /workflow run my-workflow "task B" --force
    Handler->>Handler: extract --force, clean args
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:true})
    Orchestrator->>Orchestrator: skip findResumableRunByParentConversation
    Orchestrator->>Executor: dispatchBackgroundWorkflow(fresh run)
    Executor-->>Orchestrator: run dispatched
    end

    rect rgba(100,150,255,0.5)
    Note over User,Executor: Paused run auto-resume
    User->>Handler: /workflow run my-workflow "continue"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'paused', working_path:'...'}
    Orchestrator->>Executor: executeWorkflow(working_path from prior run)
    Executor-->>Orchestrator: run resumed
    end

    rect rgba(255,100,100,0.5)
    Note over User,Orchestrator: Failed run → user prompt
    User->>Handler: /workflow run my-workflow "task B"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'failed', id:'abc123'}
    Orchestrator->>User: prompt (resume abc123 / abandon+rerun / start fresh --force)
    User-->>Handler: selects action
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

archon workflow run auto-resumes failed runs without --resume — docs say opt-in #1392: Implements status-aware resume behavior (paused vs failed) matching the reported auto-resume hijack.
bug(workflows): auto-resume after reject-with-redraft bypasses the approval gate and posts #1429: Related — distinguishes paused vs failed runs and adds resume/force controls.
/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549: Directly addressed — prevents silent auto-resume of failed runs and adds --force/prompt behavior.

Possibly related PRs

coleam00/Archon#1025: Related changes to orchestrator dispatch payloads and resume handling.
coleam00/Archon#1329: Overlapping orchestrator resume logic and tests.

Poem

🐰 I hopped through worktrees, found an old paused trail,

Failed runs now knock before they set sail.
--force clears the path, fresh starts shine bright,
Abandon hears the failed, resume when named right. 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the three main changes: prompting user on failed-run resume, allowing abandonment of failed runs, and adding the --force flag for fresh runs.
Description check	✅ Passed	The description comprehensively covers all required template sections: Summary (4 bullets), UX Journey (before/after flows), Architecture Diagram with connection inventory, Label Snapshot, Change Metadata, Linked Issues, Validation Evidence, Security Impact, Compatibility/Migration, Human Verification, Side Effects, Rollback Plan, and Risks/Mitigations.
Linked Issues check	✅ Passed	All primary objectives from issue `#1549` are met: failed-run resume now prompts instead of silently auto-resuming; --force flag bypasses resume detection; /workflow abandon accepts failed runs; paused-run auto-resume is preserved; explicit resumeRunId option enables programmatic resume.
Out of Scope Changes check	✅ Passed	All code changes directly address issue `#1549`'s objectives: command-handler --force parsing, orchestrator options threading, workflow-operations abandon validation, type additions, and test coverage are all scoped to the failed-run resume problem statement.
Docstring Coverage	✅ Passed	Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/core/src/operations/workflow-operations.ts (1)

112-127: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return the post-cancel state from abandonWorkflow.

This still returns the pre-update row, so callers that read run.status will continue to see failed/paused after a successful abandon. Now that failed runs are intentionally abandonable, that stale status is much easier to surface in UI/metadata.

Suggested fix

 export async function abandonWorkflow(runId: string): Promise<WorkflowRun> {
   const run = await getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed');
   if (run.status === 'completed' || run.status === 'cancelled') {
     throw new Error(`Cannot abandon run with status '${run.status}'. Run is already terminal.`);
   }
   try {
     await workflowDb.cancelWorkflowRun(runId);
   } catch (error) {
     const err = error as Error;
     getLog().error(
       { err, errorType: err.constructor.name, runId },
       'operations.workflow_abandon_failed'
     );
     throw new Error(`Failed to abandon workflow run ${runId}: ${err.message}`);
   }
-  return run;
+  return (await workflowDb.getWorkflowRun(runId)) ?? { ...run, status: 'cancelled' };
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/operations/workflow-operations.ts` around lines 112 - 127,
abandonWorkflow currently returns the pre-cancel WorkflowRun, so update it to
return the post-cancel state: after awaiting workflowDb.cancelWorkflowRun(runId)
call get the updated run (e.g. via getRunOrThrow(runId,
'operations.workflow_abandon_lookup_failed') or the appropriate workflowDb fetch
method) and return that WorkflowRun instead of the original `run`; keep the
existing error logging around workflowDb.cancelWorkflowRun and rethrow as
before.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 281-348: The code is prompting users when a prior run.status ===
'failed' even for explicit resume attempts because resumeWorkflow() doesn't mark
the intent; update the orchestration logic to detect an explicit resume request
(e.g., accept a resumeRunId or resume flag passed into the entry point that
calls orchestrator-agent) and treat resumableRun as allowed to auto-resume when
resumableRun.id === resumeRunId (or when a resume=true flag is present), so the
branch that currently shows prompt will instead call executeWorkflow(...) for
that specific resumableRun; wire the resumeRunId from the /workflow resume <id>
handler (or make that handler dispatch directly to executeWorkflow) so
resumeWorkflow() sets that identifier before re-entering the resumable-run
check.

---

Outside diff comments:
In `@packages/core/src/operations/workflow-operations.ts`:
- Around line 112-127: abandonWorkflow currently returns the pre-cancel
WorkflowRun, so update it to return the post-cancel state: after awaiting
workflowDb.cancelWorkflowRun(runId) call get the updated run (e.g. via
getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed') or the
appropriate workflowDb fetch method) and return that WorkflowRun instead of the
original `run`; keep the existing error logging around
workflowDb.cancelWorkflowRun and rethrow as before.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13320894-c3d7-4717-a5cc-d4a2e25a008f

📥 Commits

Reviewing files that changed from the base of the PR and between 69b2c89 and 599209c.

📒 Files selected for processing (6)

packages/core/src/db/workflows.ts
packages/core/src/handlers/command-handler.ts
packages/core/src/operations/workflow-operations.ts
packages/core/src/orchestrator/orchestrator-agent.test.ts
packages/core/src/orchestrator/orchestrator-agent.ts
packages/core/src/types/index.ts

…oleam00#1552 from coleam00#1551)

…V2a prompt for explicit resume (resumeRunId option)

ztech-gthb · 2026-05-03T13:33:25Z

Good catch — fixed in latest commit. Added options.resumeRunId to dispatchOrchestratorWorkflow; the V2a check is now status === 'paused' || resumableRun.id === options?.resumeRunId. Two callers wire it: /workflow resume <id> now dispatches directly (no more 'run same workflow again' instruction that was looping users through the prompt) and the natural-language approval handler passes the paused run's id when it transitions status to 'failed' before dispatching. New regression test: resumeRunId option: failed run DOES auto-resume when CommandResult.workflow.resumeRunId matches. 103/103 in orchestrator-agent.test.ts, type-check + lint + format clean.

Wirasm · 2026-05-04T06:59:09Z

@ztech-gthb related to #1549 — workflow resume prompt fix.

Wirasm · 2026-05-04T17:36:33Z

Thanks @ztech-gthb — the architectural primitives here are right. The three-option failure prompt captures the actual decision space, and the force / resumeRunId distinction in the orchestrator avoids the prompt loop cleanly. The doc comment on abandonWorkflow is exemplary.

Three items worth tightening before merge:

1. `--force` parsing matches the flag anywhere in args (`packages/core/src/handlers/command-handler.ts:817-828`)

const forceIndex = restArgs.findIndex(a => a === '--force');

findIndex matches --force anywhere in the user's args. If someone runs:

/workflow run deploy "deploy --force to staging"

…the parser strips --force from inside the quoted message, sets force=true, and passes deploy to staging as args. Silent corruption of the user's input.

Suggested fix: only strip --force when it's a bareword in a flag-position (e.g., immediately after the workflow name, or at the end before the quoted message), not as a substring of the quoted args. Or use -- as a verbatim-args separator. The existing "be lenient" comment hints at the intent, but lenience here is unsafe.

2. Markdown escaping in the failure prompt is incomplete (`packages/core/src/orchestrator/orchestrator-agent.ts`)

const escapedMsg = userMessage.replace(/"/g, '\\"');

Only " is escaped. If the user's original message contains a backtick, the markdown code-blocks in the prompt close early and the rest renders as plain text. If it contains \, that won't be re-escaped either. Not a security issue, but a UX paper-cut whenever the user's first attempt had backticks in it.

Suggested fix: also escape backticks and backslashes, or render the failed-message as an indented code block (4-space) so backticks inside don't terminate it.

3. `/workflow resume <id>` mutates DB before checking the YAML exists (`packages/core/src/handlers/command-handler.ts`)

Order today:

resumeWorkflow(runId) — transitions DB row to running
discoverWorkflowsWithConfig(...) — load YAML
If YAML missing, error

If step 3 fails, the DB is in running state but no execution happens. The error message ("Restore the YAML and try again") is fine UX-wise, but the run sits in a half-resumed state until the user retries.

Suggested fix: discover the workflow YAML before calling resumeWorkflow, or roll the DB back when discovery fails.

CI is green and the rest of the change looks solid. Happy to merge once these three are addressed (or to merge as-is and track them as a follow-up issue if you'd rather move on — your call).

…ks/backslashes in failed-run prompt (review feedback)

ztech-gthb · 2026-05-04T18:55:19Z

Thanks @Wirasm — addressed in 332426e:

1. --force flag-position parsing (packages/core/src/handlers/command-handler.ts)
Tightened to restArgs[0] === '--force'. Worth noting: parseCommand already splits quoted strings into a single arg before the handler sees them, so the /workflow run deploy "deploy --force to staging" example resolves to args = ['run', 'deploy', 'deploy --force to staging'] — findIndex would have returned -1 there. But the principle is right: lenient matching invites surprises if parseCommand is ever changed. Strict position-0 match is the safer contract.

2. Markdown escaping (packages/core/src/orchestrator/orchestrator-agent.ts)
Extended to [\\\\\"\`] so backticks and backslashes are also escaped. The triple-backtick code-fence in the failed-run prompt is now safe against arbitrary userMessage content.

3. /workflow resume <id> — DB mutation before YAML check
I read this twice and I believe it's a non-issue. resumeWorkflow in packages/core/src/operations/workflow-operations.ts is a pure validator:

```ts
export async function resumeWorkflow(runId: string): Promise {
const run = await getRunOrThrow(runId, 'operations.workflow_resume_lookup_failed');
if (!RESUMABLE_WORKFLOW_STATUSES.includes(run.status)) {
throw new Error(...);
}
return run;
}
```

No status transition, no `runs` row update, no event insert. The actual resume happens later in `dispatchOrchestratorWorkflow` once the workflow YAML has been resolved and validated. The function name is admittedly misleading; happy to rename to `validateResumable` (or similar) in a follow-up if you'd like — but no half-resumed-state risk in current code. Let me know if I'm missing a side-effect somewhere.

…e callout The resume-prompt for a prior failed run previously showed only the runId, leaving Option 1 ('Resume that run') ambiguous: the user couldn't tell whether resuming would continue their current intent or re-run a stale, possibly-unrelated prompt. Real-world failure: user asked about a max_tokens UI field; the orchestrator found a stale failed run from earlier the same day about input-field restoration; choosing Option 1 would silently switch topics with no visible warning. Fix: include a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label, wrapped in horizontal rules for visual separation. Option 1 wording disambiguated to 're-runs the prompt shown above, not your current message'. Adds two regression tests covering the preview path and truncation.

ztech-gthb · 2026-05-05T12:42:29Z

Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally:

The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible.

Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation.

Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first.

ztech-gthb · 2026-05-05T12:43:04Z

Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally:

The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible.

Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation.

Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first.

Wirasm · 2026-05-13T12:16:30Z

Review Summary

Verdict: minor-fixes-needed

Two bugs in the PR wiring mean resumeRunId and --force are dropped before reaching dispatchOrchestratorWorkflow, so the V2a prompt fires when it shouldn't after natural-language approval or /workflow resume. Additionally, the new --force flag and changed behavior of /workflow resume and /workflow abandon are missing from the docs. Fixable — the core logic and test coverage are solid.

Blocking issues

packages/core/src/orchestrator/orchestrator-agent.ts:762-770: After a user approves a paused approval gate (run transitions to failed), dispatchOrchestratorWorkflow is called without options. Without options.resumeRunId, the V2a "Found a prior failed run" prompt fires anyway — the approval flow loops indefinitely. Pass { resumeRunId: pausedRun.id } as the 8th argument.
packages/core/src/handlers/command-handler.ts:835-843: When handleCommand returns a result.workflow from /workflow resume <id>, the resumeRunId and force fields are dropped before reaching dispatchOrchestratorWorkflow because handleWorkflowRunCommand is called without options. Pass { force: result.workflow.force, resumeRunId: result.workflow.resumeRunId }.

Suggested fixes

packages/docs-web/src/content/docs/reference/commands.md + packages/docs-web/src/content/docs/book/quick-reference.md: Add /workflow run <name> --force [args] to the commands table. Explain it bypasses failed-run detection and starts a fresh run.
packages/docs-web/src/content/docs/getting-started/overview.md: Update /workflow abandon description from "Discard a non-terminal run" to "Discard a failed, paused, or running run."
packages/docs-web/src/content/docs/guides/authoring-workflows.md: (1) Update "Run the same workflow again to auto-resume" to reflect that /workflow resume <id> now dispatches immediately. (2) Document the three-option failed-run prompt. (3) Add one sentence: "Paused runs auto-resume silently; failed runs surface a prompt."
packages/core/src/handlers/command-handler.test.ts:1356-1360 + packages/core/src/operations/workflow-operations.test.ts:298-320: Add test for status: 'failed' in the /workflow abandon and abandonWorkflow test suites.
packages/core/src/handlers/command-handler.ts:726-729: Add a test that mocks discoverWorkflowsWithConfig to throw during resume handling. Assert result.success === false.

Minor / nice-to-have

packages/core/src/command-handler.test.ts:1364-1365: Remove // see PR #1551 — the PR number will rot. The surrounding comment explains the behavior without it.
packages/core/src/types/index.ts:113: Remove "V2a-" from "skips the V2a-prompt branch" — "user-prompt branch" is clearer and doesn't leak an internal name.
packages/core/src/workflows.ts:340: Rephrase dispatchOrchestratorWorkflow reference to "the orchestrator" to avoid a future stale comment if the function is renamed.

Compliments

The V2a "failed run user prompt" design with three copy-pasteable options (resume, abandon+re-run, --force fresh start) and the message preview is excellent UX — it gives users exactly what they need without being verbose.
The --force flag-position rationale comment explains why the narrowing was necessary and why the old lenient match was wrong. That's a comment that saves future readers from undoing the fix.
Five new orchestrator-agent tests covering the V2a flow, message truncation, resumeRunId bypass, and --force skip make the new surface well-guarded.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact.

Wirasm · 2026-05-25T17:59:44Z

Hi @ztech-gthb — your recent commit addressed the review findings, but the PR has since gone DIRTY from dev moving on. Could you rebase onto current dev? Once it's clean I'll re-verify and merge. Ping me if you hit any blockers.

Wirasm · 2026-06-09T10:11:53Z

Thank you for this, @ztech-gthb — genuinely one of the most thorough PRs we've had: clear root-cause analysis, the 3-option UX, the run-preview, end-to-end verification on a real thread, rollback plan, the lot. The diagnosis in #1549 is exactly right and your fix direction is the one we're shipping.

I'm closing this PR not because anything's wrong with it, but because it was authored against dev as it stood in early May, and the resume path has been substantially restructured since:

fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun (closes #1392) #1646 introduced hydrateResumableRun (single-roundtrip hydrate of the resumable candidate)
fix(core): concurrency-safe workflow resume/cancel (CAS guards) #1830 added compare-and-swap (CAS) guards so concurrent resumers (web Resume button, chat re-dispatch, CLI) can't double-claim a worktree

The core of your fix — --force guard, the failed-run prompt, and abandon-accepts-failed — still grafts cleanly onto that. But the later /workflow resume <id> re-routing commit (calling resumeWorkflow() then dispatching with resumeRunId) now collides with the CAS model: on current dev, hydrateResumableRun performs the status transition itself, and findResumableRunByParentConversation only matches failed/paused — so transitioning to running first means the re-dispatch finds nothing and starts a fresh run. It rebases with zero text conflicts but is semantically broken on today's dev — which is the dangerous kind (it'd pass CI and surface as a subtle regression).

Rather than risk that slipping through, I'm re-implementing the fix fresh on current dev, built natively on the hydrate/CAS resume model, and reusing your UX verbatim — the 3-option prompt, the run-message preview, --force, and abandon-failed. You'll be credited as Co-authored-by on the commit, and the new PR will close #1549.

Really appreciate the contribution — this made the fix better than it would've been from a cold start. 🙏

fix(workflow): prompt user on resume of failed run + abandon failed +…

599209c

… --force flag

coderabbitai Bot reviewed May 3, 2026

View reviewed changes

Comment thread packages/core/src/orchestrator/orchestrator-agent.ts

ztech-gthb pushed a commit to ztech-gthb/Archon that referenced this pull request May 3, 2026

fix(orchestrator): drop --force from slash-command catalog (decouple c…

968a262

…oleam00#1552 from coleam00#1551)

ztech-gthb mentioned this pull request May 3, 2026

fix(orchestrator): catalog /workflow slash commands in system prompt #1552

Open

fix(workflow): also dispatch /workflow resume <id> directly + bypass …

60ae397

…V2a prompt for explicit resume (resumeRunId option)

ztech-gthb mentioned this pull request May 3, 2026

fix(workflows): archon-assist runs in live checkout (closes #1546) #1555

Merged

test(handlers): update /workflow resume tests for resumeRunId-pattern

f1168a0

This was referenced May 4, 2026

/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests #1549

Open

Pi/Minimax SDK errors cascade under concurrent load — needs throttling, better classification, and richer error surface #1569

Closed

fix(workflow): tighten --force flag-position parsing + escape backtic…

332426e

…ks/backslashes in failed-run prompt (review feedback)

Wirasm closed this Jun 9, 2026

Wirasm mentioned this pull request Jun 9, 2026

Gate stale workflow resumes #1935

Draft

Conversation

ztech-gthb commented May 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

UX Journey

Before

After

Architecture Diagram

Before

After

Label Snapshot

Change Metadata

Linked Issue

Validation Evidence (required)

Security Impact (required)

Compatibility / Migration

Human Verification (required)

Side Effects / Blast Radius (required)

Rollback Plan (required)

Risks and Mitigations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ztech-gthb commented May 3, 2026

Uh oh!

Wirasm commented May 4, 2026

Uh oh!

Wirasm commented May 4, 2026

1. --force parsing matches the flag anywhere in args (packages/core/src/handlers/command-handler.ts:817-828)

2. Markdown escaping in the failure prompt is incomplete (packages/core/src/orchestrator/orchestrator-agent.ts)

3. /workflow resume <id> mutates DB before checking the YAML exists (packages/core/src/handlers/command-handler.ts)

Uh oh!

ztech-gthb commented May 4, 2026

Uh oh!

ztech-gthb commented May 5, 2026

Uh oh!

ztech-gthb commented May 5, 2026

Uh oh!

Wirasm commented May 13, 2026

Review Summary

Blocking issues

Suggested fixes

Minor / nice-to-have

Compliments

Uh oh!

Wirasm commented May 25, 2026

Uh oh!

Wirasm commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ztech-gthb commented May 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 3, 2026 •

edited

Loading

1. `--force` parsing matches the flag anywhere in args (`packages/core/src/handlers/command-handler.ts:817-828`)

2. Markdown escaping in the failure prompt is incomplete (`packages/core/src/orchestrator/orchestrator-agent.ts`)

3. `/workflow resume <id>` mutates DB before checking the YAML exists (`packages/core/src/handlers/command-handler.ts`)