fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551
fix(workflow): prompt user on resume of failed run + allow abandoning failed + add --force flag#1551ztech-gthb wants to merge 5 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThis PR prevents silent auto-resume of failed runs: only ChangesAuto-resume hijack fix (single cohort)
Sequence DiagramsequenceDiagram
actor User
participant Handler as Command Handler
participant Orchestrator as Orchestrator Agent
participant DB as Database
participant Executor as Workflow Executor
rect rgba(0,150,0,0.5)
Note over User,Executor: Forced fresh run (skip resume lookup)
User->>Handler: /workflow run my-workflow "task B" --force
Handler->>Handler: extract --force, clean args
Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:true})
Orchestrator->>Orchestrator: skip findResumableRunByParentConversation
Orchestrator->>Executor: dispatchBackgroundWorkflow(fresh run)
Executor-->>Orchestrator: run dispatched
end
rect rgba(100,150,255,0.5)
Note over User,Executor: Paused run auto-resume
User->>Handler: /workflow run my-workflow "continue"
Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
Orchestrator->>DB: findResumableRunByParentConversation()
DB-->>Orchestrator: {status:'paused', working_path:'...'}
Orchestrator->>Executor: executeWorkflow(working_path from prior run)
Executor-->>Orchestrator: run resumed
end
rect rgba(255,100,100,0.5)
Note over User,Orchestrator: Failed run → user prompt
User->>Handler: /workflow run my-workflow "task B"
Handler->>Orchestrator: dispatchOrchestratorWorkflow(options:{force:false})
Orchestrator->>DB: findResumableRunByParentConversation()
DB-->>Orchestrator: {status:'failed', id:'abc123'}
Orchestrator->>User: prompt (resume abc123 / abandon+rerun / start fresh --force)
User-->>Handler: selects action
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/core/src/operations/workflow-operations.ts (1)
112-127:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReturn the post-cancel state from
abandonWorkflow.This still returns the pre-update row, so callers that read
run.statuswill continue to seefailed/pausedafter a successful abandon. Now that failed runs are intentionally abandonable, that stale status is much easier to surface in UI/metadata.Suggested fix
export async function abandonWorkflow(runId: string): Promise<WorkflowRun> { const run = await getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed'); if (run.status === 'completed' || run.status === 'cancelled') { throw new Error(`Cannot abandon run with status '${run.status}'. Run is already terminal.`); } try { await workflowDb.cancelWorkflowRun(runId); } catch (error) { const err = error as Error; getLog().error( { err, errorType: err.constructor.name, runId }, 'operations.workflow_abandon_failed' ); throw new Error(`Failed to abandon workflow run ${runId}: ${err.message}`); } - return run; + return (await workflowDb.getWorkflowRun(runId)) ?? { ...run, status: 'cancelled' }; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/core/src/operations/workflow-operations.ts` around lines 112 - 127, abandonWorkflow currently returns the pre-cancel WorkflowRun, so update it to return the post-cancel state: after awaiting workflowDb.cancelWorkflowRun(runId) call get the updated run (e.g. via getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed') or the appropriate workflowDb fetch method) and return that WorkflowRun instead of the original `run`; keep the existing error logging around workflowDb.cancelWorkflowRun and rethrow as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 281-348: The code is prompting users when a prior run.status ===
'failed' even for explicit resume attempts because resumeWorkflow() doesn't mark
the intent; update the orchestration logic to detect an explicit resume request
(e.g., accept a resumeRunId or resume flag passed into the entry point that
calls orchestrator-agent) and treat resumableRun as allowed to auto-resume when
resumableRun.id === resumeRunId (or when a resume=true flag is present), so the
branch that currently shows prompt will instead call executeWorkflow(...) for
that specific resumableRun; wire the resumeRunId from the /workflow resume <id>
handler (or make that handler dispatch directly to executeWorkflow) so
resumeWorkflow() sets that identifier before re-entering the resumable-run
check.
---
Outside diff comments:
In `@packages/core/src/operations/workflow-operations.ts`:
- Around line 112-127: abandonWorkflow currently returns the pre-cancel
WorkflowRun, so update it to return the post-cancel state: after awaiting
workflowDb.cancelWorkflowRun(runId) call get the updated run (e.g. via
getRunOrThrow(runId, 'operations.workflow_abandon_lookup_failed') or the
appropriate workflowDb fetch method) and return that WorkflowRun instead of the
original `run`; keep the existing error logging around
workflowDb.cancelWorkflowRun and rethrow as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 13320894-c3d7-4717-a5cc-d4a2e25a008f
📒 Files selected for processing (6)
packages/core/src/db/workflows.tspackages/core/src/handlers/command-handler.tspackages/core/src/operations/workflow-operations.tspackages/core/src/orchestrator/orchestrator-agent.test.tspackages/core/src/orchestrator/orchestrator-agent.tspackages/core/src/types/index.ts
…V2a prompt for explicit resume (resumeRunId option)
|
Good catch — fixed in latest commit. Added |
|
@ztech-gthb related to #1549 — workflow resume prompt fix. |
|
Thanks @ztech-gthb — the architectural primitives here are right. The three-option failure prompt captures the actual decision space, and the Three items worth tightening before merge: 1.
|
…ks/backslashes in failed-run prompt (review feedback)
|
Thanks @Wirasm — addressed in 332426e: 1. 2. Markdown escaping ( 3. ```ts No status transition, no `runs` row update, no event insert. The actual resume happens later in `dispatchOrchestratorWorkflow` once the workflow YAML has been resolved and validated. The function name is admittedly misleading; happy to rename to `validateResumable` (or similar) in a follow-up if you'd like — but no half-resumed-state risk in current code. Let me know if I'm missing a side-effect somewhere. |
…e callout
The resume-prompt for a prior failed run previously showed only the runId,
leaving Option 1 ('Resume that run') ambiguous: the user couldn't tell
whether resuming would continue their current intent or re-run a stale,
possibly-unrelated prompt.
Real-world failure: user asked about a max_tokens UI field; the orchestrator
found a stale failed run from earlier the same day about input-field
restoration; choosing Option 1 would silently switch topics with no visible
warning.
Fix: include a 160-char blockquote preview of the failed run's stored
user_message under an explicit 'Run prompt was:' label, wrapped in horizontal
rules for visual separation. Option 1 wording disambiguated to 're-runs the
prompt shown above, not your current message'.
Adds two regression tests covering the preview path and truncation.
|
Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally: The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible. Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation. Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first. |
1 similar comment
|
Small follow-up in 4088888 addressing a UX gap I hit while testing this PR locally: The resume-prompt previously surfaced only the runId of the prior failed run. If the failed run is hours/days old, choosing Option 1 ('Resume that run') re-runs the run's stored user_message — which can be a totally different topic from what the user just typed. Without the message visible, the topic-switch is invisible. Now the prompt includes a 160-char blockquote preview of the failed run's stored user_message under an explicit 'Run prompt was:' label (wrapped in '---' separators), and Option 1 wording is disambiguated to 're-runs the prompt shown above, not your current message'. Two regression tests cover the preview path and truncation. Same scope as the rest of this PR, drop-in additive; happy to split into a separate PR if you'd prefer to merge the parsing/escape fixes first. |
Review SummaryVerdict: minor-fixes-needed Two bugs in the PR wiring mean Blocking issues
Suggested fixes
Minor / nice-to-have
Compliments
Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality, docs-impact. |
|
Hi @ztech-gthb — your recent commit addressed the review findings, but the PR has since gone DIRTY from dev moving on. Could you rebase onto current |
|
Thank you for this, @ztech-gthb — genuinely one of the most thorough PRs we've had: clear root-cause analysis, the 3-option UX, the run-preview, end-to-end verification on a real thread, rollback plan, the lot. The diagnosis in #1549 is exactly right and your fix direction is the one we're shipping. I'm closing this PR not because anything's wrong with it, but because it was authored against
The core of your fix — Rather than risk that slipping through, I'm re-implementing the fix fresh on current Really appreciate the contribution — this made the fix better than it would've been from a cold start. 🙏 |
Summary
/workflow run X "task B"silently auto-resumes a prior failed run ofXin the same chat, executing in the failed run's sub-worktree with the failed run's persisteduser_message(= "task A"). The new prompt is discarded with no UI/log indication.--force) instead of silently auto-resuming./workflow abandonaccepts failed runs (transitions them tocancelled). New--forceflag bypasses the resume-detection lookup for callers who explicitly want a fresh run while keeping the failed audit row./workflow resume <id>still accepts failed runs; DB schema unchanged; no migrations.UX Journey
Before
After
Architecture Diagram
Before
After
Connection inventory:
command-handler.ts:case 'run'args.findIndex('--force')command-handler.tsCommandResult.workflow.forceorchestrator-agent.ts:handleWorkflowRunCommanddispatchOrchestratorWorkflowoptionsparamdispatchOrchestratorWorkflowfindResumableRunByParentConversationoptions.forcedispatchOrchestratorWorkflowplatform.sendMessage(3-option prompt)dispatchOrchestratorWorkflowexecuteWorkflow(paused branch)abandonWorkflowTERMINAL_WORKFLOW_STATUSEScheckcompleted/cancelledLabel Snapshot
risk: lowsize: Mcorecore:orchestrator,core:operations,core:dbChange Metadata
bugcoreLinked Issue
archon-assistpersistence; same UX class (silent edit loss), independent mechanismfix: foreground resume for interactive workflows + chat auto-resumeintroduced thefindResumableRunByParentConversationlookup this PR refinesValidation Evidence (required)
End-to-end manual verification on a real chat thread (2026-05-03):
32c786ef…from prior session stayedfailedin DB./workflow run ztech-marimo-edit "..."produced the 3-option prompt. No silent dispatch, noexecuteWorkflowcall observed in workflow logs./workflow run ztech-marimo-edit --force "..."dispatched fresh, completedsuccess: truewith new feature branch (wf/marimo-edit-1777809446). Original32c786ef…row untouched./workflow abandon 32c786ef…transitioned the rowfailed → cancelled,completed_atpopulated. Subsequent/workflow runin the same chat no longer triggered the prompt.Security Impact (required)
The prompt text is constructed via plain string concatenation; the
userMessage's embedded quotes are escaped (\") when interpolated into the suggested re-run command block. No new untrusted-input parsing path.Compatibility / Migration
paused-run auto-resume path (PR 🐛 UserReportedError: Manual bug report #914),/workflow resume <id>for failed runs,/workflow runcallers without--force. The only behavior change is: failed-run-on-fresh-/workflow runnow prompts instead of silently auto-resuming — the prior behavior was silent data loss, so this is the intended fix, not a regression.Human Verification (required)
Verified scenarios:
--forceskips the resume lookup entirely (verified viamockFindResumableRunByParentConversation.not.toHaveBeenCalled()in test, plus end-to-end real run)./workflow abandonon a failed run transitions tocancelled(DB confirmed)./workflow runin the same conversation no longer prompts (cancelled is excluded from the resume lookup).paused-status auto-resume continues to work in the existing test fixture.Edge cases checked:
--forcetoken recognized at any position in args (not just immediately after workflow name).userMessagewith embedded"is escaped in the option-3 suggested command block.What was not verified:
failedrun silently (if such a case exists, this PR makes it require either/workflow resume <id>or--force). No such workflow has been identified in the default workflows.Side Effects / Blast Radius (required)
core:orchestratordispatch path,core:operationsabandon validation,core:handlersworkflow command parsing, types./workflow resume <id>) preserves that capability with one extra step.orchestrator.failed_resume_user_promptedlog event lets operators detect users hitting this path. The existingorchestrator.foreground_resume_detectedlog fires only on the (unchanged) paused branch.Rollback Plan (required)
--forceis opt-in for the override case, and the prompt itself emits one user-visible message).auto_resume_failedflag). Neither has been observed in testing.Risks and Mitigations
/workflow resume <id>as a copy-pasteable command; the retry capability is preserved with one extra user step.--forceis recognized as a token anywhere in args. A user passing--forceas part of literal description text ("... change the --force flag handling ...") could accidentally trigger force.--separator support. If literal collision becomes a real problem, a--separator could be added in a follow-up. Not a regression: the flag didn't exist before, no existing input pattern is broken.failed_resume_user_prompted) asserts the three command strings are present in the prompt; CI catches accidental removal. A future PR could derive the prompt from command-handler metadata, but is premature given the small surface area.Summary by CodeRabbit
New Features
/workflow resume <id>now returns workflow details (including optional resumeRunId and force) to trigger immediate foreground resumption.Bug Fixes