Skip to content

fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551

Closed
ztech-gthb wants to merge 5 commits into
coleam00:devfrom
ztech-gthb:fix/workflow-resume-prompts-user
Closed

fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551
ztech-gthb wants to merge 5 commits into
coleam00:devfrom
ztech-gthb:fix/workflow-resume-prompts-user

Conversation

@ztech-gthb

@ztech-gthb ztech-gthb commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: /workflow run X "task B" silently auto-resumes a prior failed run of X in the same chat, executing in the failed run's sub-worktree with the failed run's persisted user_message (= "task A"). The new prompt is discarded with no UI/log indication.
  • Why it matters: confident-but-wrong reports of work the user didn't ask for. Trust-corroding silent data loss; users debug nonexistent problems while the assistant runs the same stale task repeatedly.
  • What changed: failed runs now emit a 3-option prompt (resume / abandon-then-rerun / --force) instead of silently auto-resuming. /workflow abandon accepts failed runs (transitions them to cancelled). New --force flag bypasses the resume-detection lookup for callers who explicitly want a fresh run while keeping the failed audit row.
  • What did not change: paused-run auto-resume (PR 🐛 UserReportedError: Manual bug report #914 approval-gate path) is unchanged; /workflow resume <id> still accepts failed runs; DB schema unchanged; no migrations.

UX Journey

Before

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              [silent foreground resume]
                              executeWorkflow(working_path=
                                  thread-of-run-A,
                                  userMessage="B"
                                  ↑ ignored: scripts read
                                  $ARTIFACTS_DIR/.X persisted
                                  from run-A)             ───▶ "completed" on
                                                                task A again

sees positive report ◀──── reports task A as success
   (still confused about
    why task B didn't happen)

After

User                          Archon orchestrator           DB
────                          ───────────────────           ──
runs /workflow run X "A" ───▶ findResumable... → null
                              dispatch fresh run        ───▶ run-A row (failed)

runs /workflow run X "B" ───▶ findResumable... → run-A
                              status === 'paused'?
                                no → emit 3-option prompt    [no new run]
                                     **return** (no dispatch)
sees prompt with: ◀────
  /workflow resume <id>           [option 1]
  /workflow abandon <id>          [option 2: works on failed now]
   then re-run command
  /workflow run X --force "B"     [option 3: keep run-A as-is]

Architecture Diagram

Before

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ args.slice(2).join(' ') = "..."
                   │     ▼
                   │  CommandResult.workflow = { definition, args }
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand
                         │
                         ▼
                      dispatchOrchestratorWorkflow
                         │
                         ▼
                      findResumableRunByParentConversation
                      (status IN ['failed','paused'])
                         │
                         ▼
                      resumableRun? → executeWorkflow [silent]
                                     (auto-resume regardless of status)


workflow-operations.ts: abandonWorkflow rejects ALL terminal statuses
  (= completed | failed | cancelled) — failed runs un-abandonable

After

                   ┌─ command-handler.ts (parses /workflow run)
                   │     │ NEW: also parses --force from args
                   │     ▼
                   │  CommandResult.workflow = { definition, args, force? } [~]
                   │     │
orchestrator-      │     ▼
agent.ts ──────────┴─ handleWorkflowRunCommand(..., options={force}) [~]
                         │
                         ▼
                      dispatchOrchestratorWorkflow(..., options) [~]
                         │
                         ▼
                      options.force? null : findResumableRun(...) [+]
                         │
                         ▼
                      resumableRun?
                         ├─ status==='paused' → executeWorkflow [unchanged]
                         └─ else (failed/...)
                              → platform.sendMessage(3-option prompt) [+]
                              → return [+]


workflow-operations.ts: abandonWorkflow rejects only completed | cancelled [~]
  (failed → cancelled is the user's discard action; row stays in DB)

Connection inventory:

From To Status Notes
command-handler.ts:case 'run' args.findIndex('--force') new flag parsing
command-handler.ts CommandResult.workflow.force new type-level pass-through
orchestrator-agent.ts:handleWorkflowRunCommand dispatchOrchestratorWorkflow modified new options param
dispatchOrchestratorWorkflow findResumableRunByParentConversation modified guarded by options.force
dispatchOrchestratorWorkflow platform.sendMessage (3-option prompt) new failed-run user prompt
dispatchOrchestratorWorkflow executeWorkflow (paused branch) unchanged PR #914 path preserved
abandonWorkflow TERMINAL_WORKFLOW_STATUSES check removed replaced with explicit completed/cancelled

Label Snapshot

  • Risk: risk: low
  • Size: size: M
  • Scope: core
  • Module: core:orchestrator, core:operations, core:db

Change Metadata

  • Change type: bug
  • Primary scope: core

Linked Issue

Validation Evidence (required)

bun run type-check    # clean across all 10 packages
bun --filter @archon/core test
# 102 pass / 0 fail
# - foreground_resume_detected: existing test, fixture switched 'failed' → 'paused'
# - failed_resume_user_prompted: new (asserts no executeWorkflow, prompt text contains all 3 commands)
# - --force flag: new (asserts findResumable not consulted, dispatchBackground fires)

bun run lint          # clean for all touched files
bun run format:check  # clean

End-to-end manual verification on a real chat thread (2026-05-03):

  1. Existing failed run 32c786ef… from prior session stayed failed in DB.
  2. /workflow run ztech-marimo-edit "..." produced the 3-option prompt. No silent dispatch, no executeWorkflow call observed in workflow logs.
  3. /workflow run ztech-marimo-edit --force "..." dispatched fresh, completed success: true with new feature branch (wf/marimo-edit-1777809446). Original 32c786ef… row untouched.
  4. /workflow abandon 32c786ef… transitioned the row failed → cancelled, completed_at populated. Subsequent /workflow run in the same chat no longer triggered the prompt.

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

The prompt text is constructed via plain string concatenation; the userMessage's embedded quotes are escaped (\") when interpolated into the suggested re-run command block. No new untrusted-input parsing path.

Compatibility / Migration

  • Backward compatible? Yes for the paused-run auto-resume path (PR 🐛 UserReportedError: Manual bug report #914), /workflow resume <id> for failed runs, /workflow run callers without --force. The only behavior change is: failed-run-on-fresh-/workflow run now prompts instead of silently auto-resuming — the prior behavior was silent data loss, so this is the intended fix, not a regression.
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

Verified scenarios:

  • Failed-run prompt fires for status='failed' resumable, three commands appear with correct IDs interpolated.
  • --force skips the resume lookup entirely (verified via mockFindResumableRunByParentConversation.not.toHaveBeenCalled() in test, plus end-to-end real run).
  • /workflow abandon on a failed run transitions to cancelled (DB confirmed).
  • After abandon, /workflow run in the same conversation no longer prompts (cancelled is excluded from the resume lookup).
  • paused-status auto-resume continues to work in the existing test fixture.

Edge cases checked:

  • --force token recognized at any position in args (not just immediately after workflow name).
  • userMessage with embedded " is escaped in the option-3 suggested command block.

What was not verified:

  • A workflow that legitimately wants to retry a transient failed run silently (if such a case exists, this PR makes it require either /workflow resume <id> or --force). No such workflow has been identified in the default workflows.

Side Effects / Blast Radius (required)

  • Affected subsystems: core:orchestrator dispatch path, core:operations abandon validation, core:handlers workflow command parsing, types.
  • Potential unintended effects: a workflow author who depended on the silent failed-run auto-resume as a "retry on flaky steps" mechanism will see the new prompt. The prompt's first option (/workflow resume <id>) preserves that capability with one extra step.
  • Guardrails: the new orchestrator.failed_resume_user_prompted log event lets operators detect users hitting this path. The existing orchestrator.foreground_resume_detected log fires only on the (unchanged) paused branch.

Rollback Plan (required)

  • Fast rollback: revert this commit. Behavior returns to silent auto-resume of failed runs and abandon-rejection. No data migration needed.
  • Feature flags: none. The behavior change is unconditional but additive (--force is opt-in for the override case, and the prompt itself emits one user-visible message).
  • Observable failure symptoms if rollback is needed: users complaining the prompt is too noisy (would suggest making the prompt suppressible per-workflow), or workflows expecting silent retry-on-failure breaking (would suggest a per-workflow auto_resume_failed flag). Neither has been observed in testing.

Risks and Mitigations

  • Risk: a workflow author relied on silent auto-resume of failed runs as a retry mechanism.
    • Mitigation: the prompt's first option offers /workflow resume <id> as a copy-pasteable command; the retry capability is preserved with one extra user step.
  • Risk: --force is recognized as a token anywhere in args. A user passing --force as part of literal description text ("... change the --force flag handling ...") could accidentally trigger force.
    • Mitigation: known pattern in CLI tools without -- separator support. If literal collision becomes a real problem, a -- separator could be added in a follow-up. Not a regression: the flag didn't exist before, no existing input pattern is broken.
  • Risk: hand-maintained 3-option prompt drifts from the actual command-handler behavior over time.
    • Mitigation: the new test (failed_resume_user_prompted) asserts the three command strings are present in the prompt; CI catches accidental removal. A future PR could derive the prompt from command-handler metadata, but is premature given the small surface area.

Summary by CodeRabbit

  • New Features

    • Global --force option to start a fresh workflow run, skipping resume detection.
    • /workflow resume <id> now returns workflow details (including optional resumeRunId and force) to trigger immediate foreground resumption.
  • Bug Fixes

    • Paused runs auto-resume; failed runs prompt users with clear choices: resume, abandon+retry, or start fresh.
    • Abandon now accepts running, paused, and failed runs.
    • Clear error shown when a resumed workflow definition is missing.

@coderabbitai

coderabbitai Bot commented May 3, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b024ee2-7d50-495c-9bc7-94570458a80a

📥 Commits

Reviewing files that changed from the base of the PR and between 332426e and 4088888.

📒 Files selected for processing (2)
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/core/src/orchestrator/orchestrator-agent.test.ts

📝 Walkthrough

Walkthrough

This PR prevents silent auto-resume of failed runs: only paused runs auto-resume, failed runs prompt the user (or require explicit resumeRunId), /workflow run supports a positional --force to skip resume detection, abandonWorkflow permits abandoning failed runs, and tests/types/handlers/orchestrator updated.

Changes

Auto-resume hijack fix (single cohort)

Layer / File(s) Summary
Data Shape
packages/core/src/types/index.ts
CommandResult.workflow adds optional force?: boolean and resumeRunId?: string.
Command Parsing / CLI
packages/core/src/handlers/command-handler.ts
/workflow run parses --force positionally (first bare arg), strips it from args, and returns workflow.force; /workflow resume <id> resumes the run then reloads discovered workflows from the resumed run's cwd, validates the definition exists, and returns a workflow payload with definition, args (from run.user_message ?? '') and resumeRunId, or an explicit discovery/missing-workflow error.
Core Orchestration Logic
packages/core/src/orchestrator/orchestrator-agent.ts
dispatchOrchestratorWorkflow(options?: {force?: boolean; resumeRunId?: string}) added. On web: skip resumable-run lookup when options.force is true; otherwise auto-resume only if prior run status === 'paused' or resumableRun.id === options.resumeRunId; for other prior runs (e.g., failed) send a user prompt with resume/abandon/force choices. Deterministic /workflow handling forwards force and resumeRunId into handleWorkflowRunCommand. Natural-language approval resume passes { resumeRunId }.
Operations Layer
packages/core/src/operations/workflow-operations.ts
abandonWorkflow() now rejects only completed and cancelled; it allows abandoning failed, running, and paused runs. Removed TERMINAL_WORKFLOW_STATUSES usage.
DB Docs
packages/core/src/db/workflows.ts
Expanded doc comment for findResumableRunByParentConversation to document orchestrator semantics: paused runs auto-resume; failed runs cause a prompt to avoid reusing stale persisted user_message.
Tests
packages/core/src/orchestrator/orchestrator-agent.test.ts, packages/core/src/handlers/command-handler.test.ts
Reset mockFindResumableRunByParentConversation in beforeEach. Updated foreground-resume test to use status: 'paused'. Added tests covering failed-run prompt, explicit resume via workflow.resumeRunId, --force bypass, and updated /workflow resume assertions and discovery seeding helper.

Sequence Diagram

sequenceDiagram
    actor User
    participant Handler as Command Handler
    participant Orchestrator as Orchestrator Agent
    participant DB as Database
    participant Executor as Workflow Executor

    rect rgba(0,150,0,0.5)
    Note over User,Executor: Forced fresh run (skip resume lookup)
    User->>Handler: /workflow run my-workflow "task B" --force
    Handler->>Handler: extract --force, clean args
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:true})
    Orchestrator->>Orchestrator: skip findResumableRunByParentConversation
    Orchestrator->>Executor: dispatchBackgroundWorkflow(fresh run)
    Executor-->>Orchestrator: run dispatched
    end

    rect rgba(100,150,255,0.5)
    Note over User,Executor: Paused run auto-resume
    User->>Handler: /workflow run my-workflow "continue"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'paused', working_path:'...'}
    Orchestrator->>Executor: executeWorkflow(working_path from prior run)
    Executor-->>Orchestrator: run resumed
    end

    rect rgba(255,100,100,0.5)
    Note over User,Orchestrator: Failed run → user prompt
    User->>Handler: /workflow run my-workflow "task B"
    Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
    Orchestrator->>DB: findResumableRunByParentConversation()
    DB-->>Orchestrator: {status:'failed', id:'abc123'}
    Orchestrator->>User: prompt (resume abc123 / abandon+rerun / start fresh --force)
    User-->>Handler: selects action
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I hopped through worktrees, found an old paused trail,

Failed runs now knock before they set sail.
--force clears the path, fresh starts shine bright,
Abandon hears the failed, resume when named right. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the three main changes: prompting user on failed-run resume, allowing abandonment of failed runs, and adding the --force flag for fresh runs.
Description check ✅ Passed The description comprehensively covers all required template sections: Summary (4 bullets), UX Journey (before/after flows), Architecture Diagram with connection inventory, Label Snapshot, Change Metadata, Linked Issues, Validation Evidence, Security Impact, Compatibility/Migration, Human Verification, Side Effects, Rollback Plan, and Risks/Mitigations.
Linked Issues check ✅ Passed All primary objectives from issue #1549 are met: failed-run resume now prompts instead of silently auto-resuming; --force flag bypasses resume detection; /workflow abandon accepts failed runs; paused-run auto-resume is preserved; explicit resumeRunId option enables programmatic resume.
Out of Scope Changes check ✅ Passed All code changes directly address issue #1549's objectives: command-handler --force parsing, orchestrator options threading, workflow-operations abandon validation, type additions, and test coverage are all scoped to the failed-run resume problem statement.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/core/src/operations/workflow-operations.ts (1)

112-127: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return the post-cancel state from abandonWorkflow.

This still returns the pre-update row, so callers that read run.status will continue to see failed/paused after a successful abandon. Now that failed runs are intentionally abandonable, that stale status is much easier to surface in UI/metadata.

Suggested fix
 export async function abandonWorkflow(runId: string): Promise<WorkflowRun> {
   const run = await getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed');
   if (run.status === 'completed' || run.status === 'cancelled') {
     throw new Error(`Cannot abandon run with status '${run.status}'. Run is already terminal.`);
   }
   try {
     await workflowDb.cancelWorkflowRun(runId);
   } catch (error) {
     const err = error as Error;
     getLog().error(
       { err, errorType: err.constructor.name, runId },
       'operations.workflow_abandon_failed'
     );
     throw new Error(`Failed to abandon workflow run ${runId}: ${err.message}`);
   }
-  return run;
+  return (await workflowDb.getWorkflowRun(runId)) ?? { ...run, status: 'cancelled' };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/operations/workflow-operations.ts` around lines 112 - 127,
abandonWorkflow currently returns the pre-cancel WorkflowRun, so update it to
return the post-cancel state: after awaiting workflowDb.cancelWorkflowRun(runId)
call get the updated run (e.g. via getRunOrThrow(runId,
'operations.workflow_abandon_lookup_failed') or the appropriate workflowDb fetch
method) and return that WorkflowRun instead of the original `run`; keep the
existing error logging around workflowDb.cancelWorkflowRun and rethrow as
before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 281-348: The code is prompting users when a prior run.status ===
'failed' even for explicit resume attempts because resumeWorkflow() doesn't mark
the intent; update the orchestration logic to detect an explicit resume request
(e.g., accept a resumeRunId or resume flag passed into the entry point that
calls orchestrator-agent) and treat resumableRun as allowed to auto-resume when
resumableRun.id === resumeRunId (or when a resume=true flag is present), so the
branch that currently shows prompt will instead call executeWorkflow(...) for
that specific resumableRun; wire the resumeRunId from the /workflow resume <id>
handler (or make that handler dispatch directly to executeWorkflow) so
resumeWorkflow() sets that identifier before re-entering the resumable-run
check.

---

Outside diff comments:
In `@packages/core/src/operations/workflow-operations.ts`:
- Around line 112-127: abandonWorkflow currently returns the pre-cancel
WorkflowRun, so update it to return the post-cancel state: after awaiting
workflowDb.cancelWorkflowRun(runId) call get the updated run (e.g. via
getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed') or the
appropriate workflowDb fetch method) and return that WorkflowRun instead of the
original `run`; keep the existing error logging around
workflowDb.cancelWorkflowRun and rethrow as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13320894-c3d7-4717-a5cc-d4a2e25a008f

📥 Commits

Reviewing files that changed from the base of the PR and between 69b2c89 and 599209c.

📒 Files selected for processing (6)
  • packages/core/src/db/workflows.ts
  • packages/core/src/handlers/command-handler.ts
  • packages/core/src/operations/workflow-operations.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
  • packages/core/src/types/index.ts

Comment thread packages/core/src/orchestrator/orchestrator-agent.ts
…V2a prompt for explicit resume (resumeRunId option)
@ztech-gthb

Copy link
Copy Markdown
Contributor Author

Good catch — fixed in latest commit. Added options.resumeRunId to dispatchOrchestratorWorkflow; the V2a check is now status === 'paused' || resumableRun.id === options?.resumeRunId. Two callers wire it: /workflow resume <id> now dispatches directly (no more 'run same workflow again' instruction that was looping users through the prompt) and the natural-language approval handler passes the paused run's id when it transitions status to 'failed' before dispatching. New regression test: resumeRunId option: failed run DOES auto-resume when CommandResult.workflow.resumeRunId matches. 103/103 in orchestrator-agent.test.ts, type-check + lint + format clean.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

@ztech-gthb related to #1549 — workflow resume prompt fix.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

Thanks @ztech-gthb — the architectural primitives here are right. The three-option failure prompt captures the actual decision space, and the force / resumeRunId distinction in the orchestrator avoids the prompt loop cleanly. The doc comment on abandonWorkflow is exemplary.

Three items worth tightening before merge:

1. --force parsing matches the flag anywhere in args (packages/core/src/handlers/command-handler.ts:817-828)

const forceIndex = restArgs.findIndex(a => a === '--force');

findIndex matches --force anywhere in the user's args. If someone runs:

/workflow run deploy "deploy --force to staging"

…the parser strips --force from inside the quoted message, sets force=true, and passes deploy to staging as args. Silent corruption of the user's input.

Suggested fix: only strip --force when it's a bareword in a flag-position (e.g., immediately after the workflow name, or at the end before the quoted message), not as a substring of the quoted args. Or use -- as a verbatim-args separator. The existing "be lenient" comment hints at the intent, but lenience here is unsafe.

2. Markdown escaping in the failure prompt is incomplete (packages/core/src/orchestrator/orchestrator-agent.ts)

const escapedMsg = userMessage.replace(/"/g, '\\"');

Only " is escaped. If the user's original message contains a backtick, the markdown code-blocks in the prompt close early and the rest renders as plain text. If it contains \, that won't be re-escaped either. Not a security issue, but a UX paper-cut whenever the user's first attempt had backticks in it.

Suggested fix: also escape backticks and backslashes, or render the failed-message as an indented code block (4-space) so backticks inside don't terminate it.

3. /workflow resume <id> mutates DB before checking the YAML exists (packages/core/src/handlers/command-handler.ts)

Order today:

  1. resumeWorkflow(runId) — transitions DB row to running
  2. discoverWorkflowsWithConfig(...) — load YAML
  3. If YAML missing, error

If step 3 fails, the DB is in running state but no execution happens. The error message ("Restore the YAML and try again") is fine UX-wise, but the run sits in a half-resumed state until the user retries.

Suggested fix: discover the workflow YAML before calling resumeWorkflow, or roll the DB back when discovery fails.


CI is green and the rest of the change looks solid. Happy to merge once these three are addressed (or to merge as-is and track them as a follow-up issue if you'd rather move on — your call).

…ks/backslashes in failed-run prompt (review feedback)
@ztech-gthb

Copy link
Copy Markdown
Contributor Author

Thanks @Wirasm — addressed in 332426e:

1. --force flag-position parsing (packages/core/src/handlers/command-handler.ts)
Tightened to restArgs[0] === '--force'. Worth noting: parseCommand already splits quoted strings into a single arg before the handler sees them, so the /workflow run deploy "deploy --force to staging" example resolves to args = ['run', 'deploy', 'deploy --force to staging']findIndex would have returned -1 there. But the principle is right: lenient matching invites surprises if parseCommand is ever changed. Strict position-0 match is the safer contract.

2. Markdown escaping (packages/core/src/orchestrator/orchestrator-agent.ts)
Extended to [\\\\\"\`] so backticks and backslashes are also escaped. The triple-backtick code-fence in the failed-run prompt is now safe against arbitrary userMessage content.

3. /workflow resume <id> — DB mutation before YAML check
I read this twice and I believe it's a non-issue. resumeWorkflow in packages/core/src/operations/workflow-operations.ts is a pure validator:

```ts
export async function resumeWorkflow(runId: string): Promise {
const run = await getRunOrThrow(runId, 'operations.workflow_resume_lookup_failed');
if (!RESUMABLE_WORKFLOW_STATUSES.includes(run.status)) {
throw new Error(...);
}
return run;
}
```

No status transition, no `runs` row update, no event insert. The actual resume happens later in `dispatchOrchestratorWorkflow` once the workflow YAML has been resolved and validated. The function name is admittedly misleading; happy to rename to `validateResumable` (or similar) in a follow-up if you'd like — but no half-resumed-state risk in current code. Let me know if I'm missing a side-effect somewhere.

…e callout

The resume-prompt for a prior failed run previously showed only the runId,
leaving Option 1 ('Resume that run') ambiguous: the user couldn't tell
whether resuming would continue their current intent or re-run a stale,
possibly-unrelated prompt.

Real-world failure: user asked about a max_tokens UI field; the orchestrator
found a stale failed run from earlier the same day about input-field
restoration; choosing Option 1 would silently switch topics with no visible
warning.

Fix: include a 160-char blockquote preview of the failed run's stored
user_message under an explicit 'Run prompt was:' label, wrapped in horizontal
rules for visual separation. Option 1 wording disambiguated to 're-runs the
prompt shown above, not your current message'.

Adds two regression tests covering the preview path and truncation.
@ztech-gthb

Copy link
Copy Markdown
Contributor Author

Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally:

The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible.

Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation.

Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first.

1 similar comment
@ztech-gthb

Copy link
Copy Markdown
Contributor Author

Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally:

The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible.

Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation.

Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first.

@Wirasm

Wirasm commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Review Summary

Verdict: minor-fixes-needed

Two bugs in the PR wiring mean resumeRunId and --force are dropped before reaching dispatchOrchestratorWorkflow, so the V2a prompt fires when it shouldn't after natural-language approval or /workflow resume. Additionally, the new --force flag and changed behavior of /workflow resume and /workflow abandon are missing from the docs. Fixable — the core logic and test coverage are solid.

Blocking issues

  • packages/core/src/orchestrator/orchestrator-agent.ts:762-770: After a user approves a paused approval gate (run transitions to failed), dispatchOrchestratorWorkflow is called without options. Without options.resumeRunId, the V2a "Found a prior failed run" prompt fires anyway — the approval flow loops indefinitely. Pass { resumeRunId: pausedRun.id } as the 8th argument.

  • packages/core/src/handlers/command-handler.ts:835-843: When handleCommand returns a result.workflow from /workflow resume <id>, the resumeRunId and force fields are dropped before reaching dispatchOrchestratorWorkflow because handleWorkflowRunCommand is called without options. Pass { force: result.workflow.force, resumeRunId: result.workflow.resumeRunId }.

Suggested fixes

  • packages/docs-web/src/content/docs/reference/commands.md + packages/docs-web/src/content/docs/book/quick-reference.md: Add /workflow run <name> --force [args] to the commands table. Explain it bypasses failed-run detection and starts a fresh run.

  • packages/docs-web/src/content/docs/getting-started/overview.md: Update /workflow abandon description from "Discard a non-terminal run" to "Discard a failed, paused, or running run."

  • packages/docs-web/src/content/docs/guides/authoring-workflows.md: (1) Update "Run the same workflow again to auto-resume" to reflect that /workflow resume <id> now dispatches immediately. (2) Document the three-option failed-run prompt. (3) Add one sentence: "Paused runs auto-resume silently; failed runs surface a prompt."

  • packages/core/src/handlers/command-handler.test.ts:1356-1360 + packages/core/src/operations/workflow-operations.test.ts:298-320: Add test for status: 'failed' in the /workflow abandon and abandonWorkflow test suites.

  • packages/core/src/handlers/command-handler.ts:726-729: Add a test that mocks discoverWorkflowsWithConfig to throw during resume handling. Assert result.success === false.

Minor / nice-to-have

  • packages/core/src/command-handler.test.ts:1364-1365: Remove // see PR #1551 — the PR number will rot. The surrounding comment explains the behavior without it.
  • packages/core/src/types/index.ts:113: Remove "V2a-" from "skips the V2a-prompt branch" — "user-prompt branch" is clearer and doesn't leak an internal name.
  • packages/core/src/workflows.ts:340: Rephrase dispatchOrchestratorWorkflow reference to "the orchestrator" to avoid a future stale comment if the function is renamed.

Compliments

  • The V2a "failed run user prompt" design with three copy-pasteable options (resume, abandon+re-run, --force fresh start) and the message preview is excellent UX — it gives users exactly what they need without being verbose.
  • The --force flag-position rationale comment explains why the narrowing was necessary and why the old lenient match was wrong. That's a comment that saves future readers from undoing the fix.
  • Five new orchestrator-agent tests covering the V2a flow, message truncation, resumeRunId bypass, and --force skip make the new surface well-guarded.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact.

@Wirasm

Wirasm commented May 25, 2026

Copy link
Copy Markdown
Collaborator

Hi @ztech-gthb — your recent commit addressed the review findings, but the PR has since gone DIRTY from dev moving on. Could you rebase onto current dev? Once it's clean I'll re-verify and merge. Ping me if you hit any blockers.

@Wirasm

Wirasm commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Thank you for this, @ztech-gthb — genuinely one of the most thorough PRs we've had: clear root-cause analysis, the 3-option UX, the run-preview, end-to-end verification on a real thread, rollback plan, the lot. The diagnosis in #1549 is exactly right and your fix direction is the one we're shipping.

I'm closing this PR not because anything's wrong with it, but because it was authored against dev as it stood in early May, and the resume path has been substantially restructured since:

The core of your fix — --force guard, the failed-run prompt, and abandon-accepts-failed — still grafts cleanly onto that. But the later /workflow resume <id> re-routing commit (calling resumeWorkflow() then dispatching with resumeRunId) now collides with the CAS model: on current dev, hydrateResumableRun performs the status transition itself, and findResumableRunByParentConversation only matches failed/paused — so transitioning to running first means the re-dispatch finds nothing and starts a fresh run. It rebases with zero text conflicts but is semantically broken on today's dev — which is the dangerous kind (it'd pass CI and surface as a subtle regression).

Rather than risk that slipping through, I'm re-implementing the fix fresh on current dev, built natively on the hydrate/CAS resume model, and reusing your UX verbatim — the 3-option prompt, the run-message preview, --force, and abandon-failed. You'll be credited as Co-authored-by on the commit, and the new PR will close #1549.

Really appreciate the contribution — this made the fix better than it would've been from a cold start. 🙏

@Wirasm Wirasm closed this Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests

2 participants