Skip to content

Gate stale workflow resumes#1935

Draft
Wirasm wants to merge 3 commits into
devfrom
archon/task-fix-issue-1549
Draft

Gate stale workflow resumes#1935
Wirasm wants to merge 3 commits into
devfrom
archon/task-fix-issue-1549

Conversation

@Wirasm

@Wirasm Wirasm commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Describe this PR in 2-5 bullets:

  • Problem: Fresh /workflow run <name> ... commands could silently resume a prior failed run for the same workflow/conversation/codebase and execute the failed run's stale user_message instead of the new request.
  • Why it matters: This caused silent intent loss in a P1 workflow path, where a user-visible fresh command could dispatch old work with no confirmation and no clean abandon path for failed runs.
  • What changed: Added explicit dispatch intent for forced fresh runs and explicit resumes, gated failed resume candidates behind a 3-option prompt before hydration, preserved paused approval auto-resume, parsed /workflow run --force anywhere in args, and allowed failed runs to be abandoned to cancelled.
  • What did not change (scope boundary): The existing findResumableRunByParentConversation lookup, hydrateResumableRun resume model, and WorkflowNotResumableError CAS guards remain the owner of the failed/paused -> running transition.

UX Journey

Before

User                         Archon                             Workflow engine
----                         ------                             ---------------
/workflow run fix A ----->   creates/dispatches workflow ---->  run fails

/workflow run fix B ----->   finds prior failed run
                             hydrates/resumes old row ------->  executes with old user_message A
User sees fresh command      no prompt shown                    stale intent runs

After

User                         Archon                             Workflow engine
----                         ------                             ---------------
/workflow run fix A ----->   creates/dispatches workflow ---->  run fails

/workflow run fix B ----->   finds prior failed run
                             [shows 3-option prompt]
                             returns without hydrate/execute
User chooses:
  /workflow resume <id> -->  [resumeRunId bypass] ----------->  hydrateResumableRun owns CAS resume
  /workflow abandon <id> ->  failed -> cancelled
  /workflow run fix B --force [skip lookup] ----------------->  starts fresh run with B

Architecture Diagram

Before

command-handler
  | CommandResult.workflow { definition, args }
  v
orchestrator-agent dispatchOrchestratorWorkflow
  | findResumableRunByParentConversation(failed|paused)
  | hydrateResumableRun(any candidate)
  v
workflows executor / dag executor
  | uses persisted workflowRun.user_message
  v
workflow db

workflow-operations abandonWorkflow
  | rejects all terminal statuses, including failed
  v
workflow db

After

command-handler [~]
  | CommandResult.workflow { definition, args, force, resumeRunId }
  v
orchestrator-agent dispatchOrchestratorWorkflow [~]
  | [force skips lookup]
  | findResumableRunByParentConversation(failed|paused)
  | [failed + no matching resumeRunId -> 3-option prompt, return]
  | hydrateResumableRun(paused or explicit resume)
  v
workflows executor / dag executor
  | unchanged CAS-owned resume and persisted run hydration
  v
workflow db

workflow-operations abandonWorkflow [~]
  | rejects completed/cancelled only; failed -> cancelled
  v
workflow db

Connection inventory (list every module-to-module edge, mark changes):

From To Status Notes
packages/core/src/handlers/command-handler.ts packages/core/src/types/index.ts modified Workflow command payload now carries force and resumeRunId intent.
packages/core/src/orchestrator/orchestrator-agent.ts packages/core/src/db/workflows.ts unchanged Still uses findResumableRunByParentConversation; caller now gates failed rows before hydration.
packages/core/src/orchestrator/orchestrator-agent.ts packages/workflows/src/executor.ts modified Dispatch only passes resumeRunId-authorized or paused candidates into hydration.
packages/core/src/orchestrator/orchestrator-agent.ts platform response sending modified Fresh run with prior failed run now sends the verbatim 3-option prompt and returns.
packages/core/src/operations/workflow-operations.ts workflow DB update path modified Failed runs can be abandoned/cancelled; completed and cancelled remain rejected.
packages/core/src/**/*.test.ts command/orchestrator/operation modules modified Added regression coverage for all issue acceptance paths.

Label Snapshot

  • Risk: risk: medium
  • Size: size: M
  • Scope: core|tests
  • Module: core:orchestrator, core:command-handler, core:workflow-operations

Change Metadata

  • Change type: bug
  • Primary scope: core

Linked Issue

Validation Evidence (required)

Commands and result summary:

bun run type-check
bun run lint
bun run format:check
bun run test
bun run validate
  • Evidence provided (test/log/trace/screenshot): Run artifact validation reports type-check passed, lint passed with 0 warnings, format check passed, bun run test passed with 4834 passed / 0 failed, build passed, and bun run validate passed.
  • If any command is intentionally skipped, explain why: None skipped. The validation artifact notes root bun test was not used as the authoritative command because this repo documents that root bun test causes Bun mock.module() pollution; bun run test is the supported isolated test entrypoint and passed.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • New external network calls? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • File system access scope changed? (Yes/No) No
  • If any Yes, describe risk and mitigation: No security-sensitive behavior changes were introduced.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Database migration needed? (Yes/No) No
  • If yes, exact upgrade steps: No migration or operator action required.

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: Fresh /workflow run with a prior failed run prompts and does not call executeWorkflow; --force bypasses the lookup and starts fresh; /workflow resume <id> dispatches with resumeRunId; natural-language approval resumes without prompting; paused auto-resume remains unchanged; failed abandon transitions to cancelled.
  • Edge cases checked: --force can appear anywhere after the workflow name and is stripped from workflow args; completed and cancelled runs still cannot be abandoned; the failed-run prompt escapes backslash, double quote, and backtick in the suggested command and truncates the prior user message preview.
  • What was not verified: No manual end-to-end UI screenshot was captured; coverage is through focused command/orchestrator/operation regression tests plus full repository validation.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Core workflow command handling, orchestrator workflow dispatch, workflow abandon operation, and related tests.
  • Potential unintended effects: Users with a failed run will now see a prompt instead of implicit failed-run resume when issuing a fresh /workflow run; automation that depended on implicit failed auto-resume should switch to /workflow resume <id>.
  • Guardrails/monitoring for early detection: Regression tests assert no executeWorkflow call occurs on the prompt path and that explicit resume paths still hydrate through the existing CAS model.

Rollback Plan (required)

  • Fast rollback command/path: Revert commit cde8b12b96b98b54d3a99de504898a8f801a2e50.
  • Feature flags or config toggles (if any): None.
  • Observable failure symptoms: /workflow run prompts unexpectedly for failed candidates, explicit /workflow resume <id> loops back to a prompt, natural-language approval fails to resume, or failed runs cannot be abandoned.

Risks and Mitigations

List real risks in this PR (or write None).

  • Risk: Explicit resume and paused approval paths could be mistaken for fresh-run intent and prompt-loop.
    • Mitigation: resumeRunId bypasses the failed-run prompt, and tests cover /workflow resume <id> plus natural-language approval redispatch.
  • Risk: The fix could accidentally bypass the existing CAS guard by pre-transitioning runs.
    • Mitigation: No pre-transition is added; hydration remains before execution and owns the CAS transition through the current WorkflowNotResumableError path.
  • Risk: The new prompt could make stale failed runs harder to clear.
    • Mitigation: abandonWorkflow now accepts failed runs and transitions them to cancelled so the prompt can be cleared intentionally.

Fresh /workflow run commands could silently resume a prior failed run and execute stale persisted args. Gate failed candidates behind explicit user intent while preserving paused approval resume and CAS-owned hydration.

Changes:
- Add force and resumeRunId intent to workflow command dispatch
- Prompt before hydrating failed resume candidates unless explicitly resumed
- Parse /workflow run --force and allow abandoning failed runs
- Add regressions for failed prompts, force, explicit resume, NL approval, and abandon

Fixes #1549

Co-authored-by: Zolto <zolto@zhome.local>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34f52c67-b5aa-4251-b939-dd1a30ef374c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch archon/task-fix-issue-1549

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Wirasm commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

🔍 Comprehensive PR Review

PR: #1935
Reviewed by: 3 available specialized agent artifacts; 2 requested artifacts were missing
Date: 2026-06-09T10:56:15Z


Summary

The PR is well scoped and covers the main stale failed-run gate behavior without pre-transitioning runs to running. The strongest issue is that explicit /workflow resume <id> intent is not preserved all the way through dispatch: the command validates one run, but dispatch can hydrate or prompt from the newest resumable run for the same workflow conversation instead. Two medium issues remain around command suggestion round-tripping and misleading workflow-load feedback during explicit resume.

Verdict: REQUEST_CHANGES

Severity Count
🔴 CRITICAL 0
🟠 HIGH 1
🟡 MEDIUM 2
🟢 LOW 0

🔴 Critical Issues

None.


🟠 High Issues (Auto-fixing)

Explicit Resume Can Hydrate Or Prompt For The Wrong Run

📍 packages/core/src/orchestrator/orchestrator-agent.ts:546

/workflow resume <id> validates the requested run in the command handler, but dispatch later calls findResumableRunByParentConversation(workflow.name, conversation.id, codebase.id). That helper returns the latest resumable run for the workflow conversation, not necessarily the requested resumeRunId. If multiple failed or paused runs exist, explicit resume can prompt for a newer failed run or hydrate a different paused run than the user selected.

View recommended fix
interface WorkflowDispatchOptions {
  force?: boolean;
  resumeRun?: WorkflowRun;
}

const resumableRun = options?.force
  ? null
  : options?.resumeRun ??
    (await workflowDb.findResumableRunByParentConversation(
      workflow.name,
      conversation.id,
      codebase.id
    ));

if (options?.resumeRun && options.resumeRun.status !== 'paused' && !options.resumeRun.working_path) {
  await platform.sendMessage(
    conversationId,
    `Cannot resume ${options.resumeRun.id}: missing working path.`
  );
  return;
}

Then set resumeRun: run in the /workflow resume result path and use the approved run object for natural-language approval dispatch. Add a regression test where resumeRunId is old-run but the broad lookup returns new-run, asserting the dispatcher does not hydrate or prompt for new-run.


🟡 Medium Issues (Needs Decision)

Suggested Escaped Commands Do Not Round-Trip Through parseCommand()

📍 packages/core/src/orchestrator/orchestrator-agent.ts:555

The failed-run prompt escapes backslash, double quote, and backtick as if the generated command will be parsed by a shell. Archon's parseCommand() uses a regex tokenizer and does not interpret backslash escapes inside quoted strings, so copied suggestions containing escaped quotes can be tokenized incorrectly.

Options: Fix now | Create issue | Skip

View details
Option Approach Effort Risk if Skipped
Fix Now Replace the regex tokenizer with a small quoted-string parser that supports escaped quotes and backslashes, with regression tests for the generated command. MED Users can copy a suggested command that silently changes workflow args.
Create Issue Defer command parser hardening to a separate PR. LOW Prompt UX remains unreliable for messages with quotes or backslashes.
Skip Accept as-is. NONE Generated commands remain non-deterministic for common text inputs.

Recommendation: Fix now if the prompt stays copyable. Otherwise, remove arbitrary quoted prompt text from the suggestion and tell users to re-run their original command with --force.


/workflow resume Hides Workflow Load Errors Behind "Not Found"

📍 packages/core/src/handlers/command-handler.ts:699

The explicit resume path calls discoverWorkflowsWithConfig() but keeps only result.workflows. Per-file workflow parse/read failures are returned in result.errors, not thrown. If the run's workflow exists but currently fails to parse or read, /workflow resume <id> reports that the workflow was not found instead of surfacing the actionable load error.

Options: Fix now | Create issue | Skip

View details
Option Approach Effort Risk if Skipped
Fix Now Preserve result.errors in the resume path and mirror /workflow run matching before returning "not found". LOW Users get the wrong recovery path and may not know a workflow file is broken.
Create Issue Defer the UX consistency fix. LOW Resume remains inconsistent with run when workflow loading partially fails.
Skip Accept misleading fallback. NONE Debugging resume failures requires inspecting logs or running other commands.

Recommendation: Fix now. This is a localized change and follows an existing command-handler pattern.


🟢 Low Issues

None reported in available artifacts.


✅ What's Good

  • Failed-run auto-resume is gated before hydrateResumableRun(), preserving the no-pre-transition resume model.
  • /workflow run --force is parsed anywhere after the workflow name, stripped from workflow args, and skips resumable lookup.
  • Failed-run abandon support is narrowly implemented while completed and cancelled runs remain rejected.
  • Tests cover the main happy and guard paths: failed prompt, force fresh dispatch, natural-language approval, explicit resume happy path, paused auto-resume, and failed-run abandon.
  • Resume CAS handling remains specific to WorkflowNotResumableError and does not fall through to fresh execution.

📋 Suggested Follow-up Issues

Issue Title Priority Related Finding
Make slash command parsing round-trip escaped quoted suggestions P2 Medium issue #1
Surface workflow discovery load errors during explicit resume P2 Medium issue #2

Review Coverage Note

The review directory contained these artifacts: code-review-findings.md, error-handling-findings.md, and test-coverage-findings.md. The requested comment-quality-findings.md and docs-impact-findings.md artifacts were not present, so this synthesis is based on the available three reports.


Next Steps

  1. ⚡ Auto-fix step should address the HIGH explicit-resume identity issue.
  2. 📝 Review the MEDIUM issues above.
  3. 🎯 Re-run focused tests for orchestrator dispatch and command handler resume behavior.

Reviewed by Archon comprehensive-pr-review workflow
Artifacts: /Users/rasmus/.archon/workspaces/coleam00/Archon/artifacts/runs/8d4bf8f85cdcac7542fd75849aee3e0f/review/

Fixed:
- Preserve explicit workflow resume identity through dispatch
- Parse escaped quoted slash-command arguments
- Surface workflow load errors during explicit resume

Tests added:
- Parser regression coverage for escaped quoted suggestions
- Resume load-error feedback coverage
- Explicit resume identity and missing working path coverage

Skipped:
- none
@Wirasm

Wirasm commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-fix-issue-1549
Commit: e097004679c1275853f0d14a3b8e8dff1d9db15d
Philosophy: Fix everything unless clearly a new concern


Fixes Applied (3 total)

Severity Count
🔴 CRITICAL 0
🟠 HIGH 1
🟡 MEDIUM 2
🟢 LOW 0
View all fixes
  • Explicit Resume Can Hydrate Or Prompt For The Wrong Run (packages/core/src/orchestrator/orchestrator-agent.ts:546) — carried the validated run through dispatch and hydrate the requested run directly.
  • Suggested Escaped Commands Do Not Round-Trip Through parseCommand() (packages/core/src/handlers/command-handler.ts:179) — added quoted-string escape parsing and regression tests.
  • /workflow resume Hides Workflow Load Errors Behind Not Found (packages/core/src/handlers/command-handler.ts:699) — preserved discovery load errors and surfaced targeted YAML/load feedback.

Tests Added

  • packages/core/src/orchestrator/orchestrator-agent.test.ts: explicit resume identity and missing working path coverage.
  • packages/core/src/handlers/command-handler.test.ts: escaped quote parsing and resume load-error feedback coverage.

Skipped (0)

(none — all findings addressed)


Suggested Follow-up Issues

(none)


Validation

✅ Type check | ✅ Lint | ✅ Tests (bun run test) | ✅ Format check


Self-fix by Archon · aggressive mode · fixes pushed to archon/task-fix-issue-1549

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/workflow run silently auto-resumes failed runs with stale args, hijacking fresh requests

1 participant