Skip to content

fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun (closes #1392)#1646

Merged
Wirasm merged 2 commits into
devfrom
fix/workflow-resume-explicit
May 12, 2026
Merged

fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun (closes #1392)#1646
Wirasm merged 2 commits into
devfrom
fix/workflow-resume-explicit

Conversation

@Wirasm

@Wirasm Wirasm commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Problem: executeWorkflow silently auto-resumed the most recent failed run for the same (workflow_name, cwd) pair. The --resume CLI flag was documented as opt-in but had no causal effect on this code path; it only steered worktree selection. We reproduced the live failure mode in PR Fix Issue #972 | Workflow Composition #1282's review queue — extract-pr-number's cached output "1634" leaked into every subsequent invocation of archon workflow run maintainer-review-pr <N> for the same cwd.
  • Why it matters: Users who changed inputs or wanted a fresh start silently got a cached resume instead — visible as wrong outputs, ghost state across "unrelated" runs, and behaviour that contradicted the docs.
  • What changed: Moves the resume decision out of executeWorkflow and into the caller. Two new helpers (hydrateResumableRun, prepareResumedRun) cover the two real lookup shapes. Trailing positional args consolidate into a single ExecuteWorkflowOptions bag — fixes the undefined, undefined, undefined, undefined, true smell at call sites. Orphan-cleanup code deletes outright.
  • What did NOT change (scope boundary): Approve/reject auto-resume from chat (still works — orchestrator now calls hydrateResumableRun directly with the run row it already found). Web UI resume button. archon workflow resume <id> semantics. DB schema. Platform adapters.

UX Journey

Before

User                          CLI                        executeWorkflow
────                          ───                        ───────────────
archon workflow run foo "x"   ─────────────────────────▶ findResumableRun()  ← fires always
                                                         found prior failed run
                                                         ▶ silently resumes (wrong inputs!)

After

User                          CLI                        executeWorkflow
────                          ───                        ───────────────
archon workflow run foo "x"   ─────────────────────────▶ no internal lookup
                                                         ▶ creates fresh run ✓

archon workflow run foo "x" --resume
                              looks up + hydrates  ───▶  executes resumed run ✓

archon workflow resume <id>   → workflowRunCommand({resume: true})
                              looks up + hydrates  ───▶  executes resumed run ✓

Architecture Diagram

Before

executor.ts
  findResumableRun() ← unconditional, internal
    │
    └─→ if found: resumeWorkflowRun + load priors
        else:     use preCreatedRun or create fresh

CLI workflow.ts (--resume)
  workflowDb.findResumableRun()  ← used only for worktree path selection
  executeWorkflow(...positionals)  ← executor re-discovers via its own findResumableRun

orchestrator-agent.ts (foreground-resume)
  findResumableRunByParentConversation()  ← used only for working_path
  executeWorkflow(...positionals)  ← executor re-discovers via its own findResumableRun

After

@archon/workflows/executor (NEW)
  prepareResumedRun(deps, workflow, cwd)
    └── findResumableRun → hydrateResumableRun
  hydrateResumableRun(deps, candidate)
    ├── getCompletedDagNodeOutputs
    ├── return null if 0 nodes + no interactive-loop state
    └── resumeWorkflowRun → { run, priorCompletedNodes }
  executeWorkflow(..., opts: ExecuteWorkflowOptions)
    ├── opts.preCreatedRun ? use it : createWorkflowRun
    ├── opts.priorCompletedNodes → flows to dag-executor
    └── NO store lookups

CLI workflow.ts (--resume)
  workflowDb.findResumableRun()  ← path resolution + candidate
  hydrateResumableRun(deps, candidate)  ← new
  executeWorkflow(..., { preCreatedRun, priorCompletedNodes })

orchestrator-agent.ts (foreground-resume)
  findResumableRunByParentConversation()  ← path resolution + candidate
  hydrateResumableRun(deps, candidate)  ← new
  executeWorkflow(..., { preCreatedRun, priorCompletedNodes })

Connection inventory:

From To Status Notes
executor.ts findResumableRun removed Executor no longer queries the store for prior runs
executor.ts prepareResumedRun / hydrateResumableRun new exports Caller-invoked resume preparation
executor.ts ExecuteWorkflowOptions new export All trailing optional args (codebaseId, issueContext, isolationContext, parentConversationId, preCreatedRun, priorCompletedNodes)
cli/workflow.ts hydrateResumableRun new Reuses the candidate already found for worktree-path resolution
orchestrator.ts executeWorkflow call shape Trailing args → opts bag
orchestrator-agent.ts foreground-resume hydrateResumableRun new Was relying on executor internal auto-resume
orchestrator-agent.ts interactive executeWorkflow call shape Trailing args → opts bag
Orphan-cleanup code in executor.ts (~120 lines) removed No preCreatedRun to orphan now that resume flows through opts

Label Snapshot

Change Metadata

  • Change type: bug
  • Primary scope: workflows

Linked Issue

Closes #1392. Supersedes #1396.

Validation Evidence (required)

bun run validate  # ✅ exit 0
  • check:bundled, check:bundled-skill, type-check, lint --max-warnings 0, format:check, all per-package test batches — green.
  • 72 test batches passed, 0 failures. New / changed tests:
    • prepareResumedRun suite (null when no resumable, hydrates candidate, propagates DB errors)
    • hydrateResumableRun suite (hydrate, return null on zero+no-loop, paused-loop hydration, DB error propagation on both getCompletedDagNodeOutputs and resumeWorkflowRun)
    • Regression test for archon workflow run auto-resumes failed runs without --resume — docs say opt-in #1392 cross-invocation leak: asserts the executor does NOT call findResumableRun on its own
    • executor.test.ts "uses caller-supplied preCreatedRun + priorCompletedNodes" — asserts opts flow into dag-executor at the right slot
    • Updated orchestrator + orchestrator-agent tests to assert resume payload on the opts bag

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? No for executeWorkflow callers (API), yes for users of documented CLI/UI surfaces.
    • The signature change is breaking for any external caller of executeWorkflow. There are no public extension points that call it directly; in-tree callers are migrated in this commit.
    • User-facing behaviour: archon workflow run <name> always starts fresh (matches docs); resume requires --resume, archon workflow resume <id>, the web UI button, or approve/reject in chat (unchanged).
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: Full validate suite passes; manually ran the failing reproduction from the standup queue (back-to-back archon workflow run maintainer-review-pr <N> for different <N>) — confirms each invocation now creates a fresh run, no dag.node_skipped_prior_success leak.
  • Edge cases checked:
    • Plain run with no prior failed run — fresh run created.
    • Plain run when a prior failed run exists for (workflow, cwd) — fresh run created (no implicit resume). This is the bug-fix.
    • --resume with a resumable prior run — resumes and skips completed nodes.
    • --resume with no resumable prior run — clear error, no fresh run.
    • Orchestrator foreground-resume (approve/reject from chat) — hydrateResumableRun succeeds and resume continues.
  • What was not verified:
    • Live end-to-end web UI resume button flow (the bridge call path is unchanged; only positional → opts call shape).

Side Effects / Blast Radius (required)

  • Affected subsystems: Workflow executor (packages/workflows/src/executor.ts), CLI workflow run / workflow resume, orchestrator chat dispatch (background + foreground-resume + interactive paths). No DB schema or adapter changes.
  • Potential unintended effects: Anyone scripting against the previous undocumented implicit auto-resume behaviour will now get fresh runs. This matches docs.
  • Guardrails: Existing tests for --resume, approve/reject, foreground-resume, and concurrent-run-guard are all updated and pass. A new regression test pins down the cross-invocation leak from archon workflow run auto-resumes failed runs without --resume — docs say opt-in #1392.

Rollback Plan (required)

  • Fast rollback: Revert the commit.
  • Feature flags: None.
  • Observable failure symptoms: Users reporting "--resume does not resume" (would indicate a positional/opts wiring bug). The fix has zero observable change for invocations that don't use --resume — they always start fresh, as documented.

Risks and Mitigations

  • Risk: Users who depended on the undocumented implicit auto-resume will now get a fresh run.
    • Mitigation: This matches the documented behaviour. The --resume flag and /workflow resume <id> command are the explicit paths. CHANGELOG entry calls this out.
  • Risk: Test-mock divergence on @archon/workflows/executor for callers that mock it.
    • Mitigation: Updated all in-tree mocks. The new hydrateResumableRun export is part of the module signature, so consumers needing tighter test-mocks now know the surface.

Summary by CodeRabbit

  • Bug Fixes

    • Workflow run no longer auto-resumes on re-run; resume is explicit via CLI --resume, archon workflow resume <id>, or Web UI resume. Web resume now validates a parent web conversation and dispatches a resume message.
  • Documentation

    • Clarified resume mechanics across guides and API docs: resuming skips already-completed steps and is opt-in.
  • Tests

    • Expanded tests to cover resume paths and error conditions.

Review Change Stack

…esumableRun (closes #1392)

Previously `executeWorkflow` called `findResumableRun(workflow.name, cwd)`
unconditionally and silently auto-resumed the most recent failed run for
the same `(workflow_name, cwd)` pair. The `--resume` CLI flag was
documented as opt-in but had no causal effect on this path — it only
steered worktree selection. Net effect: cached node outputs from a prior
failed run (e.g. an `extract-pr-number` returning "1634") bled into the
next invocation of the same workflow at the same path.

This commit moves the resume decision out of the executor and into the
caller. The executor takes a `preCreatedRun` and optional
`priorCompletedNodes` via a new options bag; it never queries the store
for a prior run. Two helpers in `@archon/workflows/executor` cover the
two real lookup shapes:

- `hydrateResumableRun(deps, candidate)` — caller already has the run
  row (CLI `--resume` post worktree-path resolution; orchestrator
  foreground-resume via `findResumableRunByParentConversation`).
- `prepareResumedRun(deps, workflow, cwd)` — canonical
  `(workflow, cwd)` lookup-and-hydrate.

Both throw on DB errors instead of the previous "warn and degrade to
fresh" silent fallback, matching the Fail Fast principle.

`executeWorkflow`'s trailing positional args (`codebaseId`,
`issueContext`, `isolationContext`, `parentConversationId`,
`preCreatedRun`, plus the new `priorCompletedNodes`) consolidate into a
single `ExecuteWorkflowOptions` bag — fixing the
`undefined, undefined, undefined, undefined, true` smell at every
caller. 7 required positionals + 1 opts; future additions go to opts.

Orphan-cleanup code in the executor deletes outright: with caller-side
resume there is no separate `preCreatedRun` to orphan when resume takes
over.

Supersedes #1396 (which gated the same internal lookup behind a flag
without addressing the positional explosion or the conceptual
responsibility leak).

Closes #1392
@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This PR refactors workflow resume from implicit auto-resume to explicit opt-in. executeWorkflow now accepts a single ExecuteWorkflowOptions object; resumable-run lookup and hydration moved to caller-side helpers (hydrateResumableRun). All orchestrator and CLI call sites updated to supply resume context explicitly. Tests and docs reflect that archon workflow run no longer auto-resumes without --resume flag.

Changes

Explicit Workflow Resume Refactoring

Layer / File(s) Summary
Resume Options Contract
packages/workflows/src/executor.ts
New ExecuteWorkflowOptions interface consolidates codebaseId, issueContext, isolationContext, parentConversationId, preCreatedRun, and priorCompletedNodes into a single options object.
Resume Preparation Helpers
packages/workflows/src/executor.ts
Added hydrateResumableRun(deps, candidate) to hydrate an already-located resumable run; executor no longer performs internal find/hydration.
Execute Workflow Core
packages/workflows/src/executor.ts
executeWorkflow signature changed from multiple trailing optional params to opts: ExecuteWorkflowOptions = {}; uses caller-provided preCreatedRun and priorCompletedNodes for resume state and sends tailored messages for node-skip vs. interactive-loop cases.
Executor Tests & Hydration Suite
packages/workflows/src/executor.test.ts, packages/workflows/src/executor-preamble.test.ts
Tests updated to assert executor does not call findResumableRun internally; added hydration tests covering success, null/no-op, interactive approval loop, and DB error propagation; preamble tests assert resume notifications and no fresh-run creation when preCreatedRun provided.
CLI Resume Hydration
packages/cli/src/commands/workflow.ts, packages/cli/src/commands/workflow.test.ts
When --resume flag is set, CLI finds and hydrates resumable run via hydrateResumableRun; throws if no usable resume data; passes resume context via options object to executeWorkflow.
Orchestrator Agent Call Sites
packages/core/src/orchestrator/orchestrator-agent.ts, packages/core/src/orchestrator/orchestrator.ts
Agent hydrates resumable runs on web foreground dispatch and calls executeWorkflow with options-object including codebaseId and parentConversationId; non-web and background paths updated to the same options shape.
Orchestrator Tests
packages/core/src/orchestrator/*
Agent tests mock hydrateResumableRun and verify parentConversationId and resume fields are passed via the trailing options object instead of positional args; added fallback hydration-null case.
Server Resume API
packages/server/src/routes/api.ts, packages/server/src/routes/api.workflow-runs.test.ts
POST /api/workflows/runs/{runId}/resume now dispatches a resume command to the parent web conversation (requires web parent_conversation_id); tests cover missing parent, missing conversation, non-web parent, and successful dispatch.
Documentation & Changelog
CHANGELOG.md, CLAUDE.md, packages/docs-web/src/content/docs/*
Updated docs and changelog to state resume is explicit (--resume, archon workflow resume <id>, or UI resume); plain archon workflow run starts fresh and does not implicitly resume.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant OrchestratorAgent
  participant hydrateResumableRun
  participant Executor
  participant DAG
  CLI->>OrchestratorAgent: /workflow run <name> [--resume?]
  OrchestratorAgent->>hydrateResumableRun: find & hydrate candidate (when resuming)
  hydrateResumableRun-->>OrchestratorAgent: {preCreatedRun, priorCompletedNodes} | null
  OrchestratorAgent->>Executor: executeWorkflow(opts{codebaseId, parentConversationId, preCreatedRun?, priorCompletedNodes?})
  Executor->>DAG: run dag, skipping priorCompletedNodes if provided
  DAG-->>Executor: result
  Executor-->>OrchestratorAgent: completion/status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • coleam00/Archon#1052: Modifies CLI workflow command and touches executeWorkflow argument handling.
  • coleam00/Archon#1329: Touches interactive-workflow dispatch and parentConversationId threading similar to this PR.
  • coleam00/Archon#1563: Also modifies packages/workflows/src/executor.ts and interacts with executor resume/cleanup behavior.

"A rabbit hops through changed lines,
Making resume an explicit sign;
Fresh runs start clean, resume by name—
Hop on, reviewers, play the game!" 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the core fix: making workflow resume explicit via new helper functions and closing the referenced issue.
Description check ✅ Passed The PR description is comprehensive and well-structured, including detailed before/after UX journey, architecture diagrams with module changes, validation evidence with test results, security impact, compatibility notes, and rollback plan.
Linked Issues check ✅ Passed The PR directly addresses issue #1392 by moving resume decision out of executeWorkflow into callers, preventing silent auto-resume and making resume explicit via --resume flag, archon workflow resume , or UI/chat mechanisms as required.
Out of Scope Changes check ✅ Passed All changes are scoped to resume-explicit mechanics: executor signature refactor, CLI/orchestrator caller updates, documentation clarifications, and test coverage for the new behavior—no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/workflow-resume-explicit

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/workflows/src/executor.ts (1)

329-401: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Bring preCreatedRun under failure handling before the first async setup.

preCreatedRun is not assigned to workflowRun until Line 401, but the function awaits config / env / provider work from Line 337 onward and only enters its catch/finally block at Line 567. Any throw in that window exits without finalizing the caller-created row. That leaves background-dispatch rows stuck pending, and resumed rows are worse because hydrateResumableRun() has already flipped them to running. Please either assign workflowRun = preCreatedRun before the first await and expand the top-level failure handler to cover the whole function, or defer the resumeWorkflowRun() state change until after this setup succeeds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/workflows/src/executor.ts` around lines 329 - 401, Assign the
caller-provided preCreatedRun to workflowRun immediately after destructuring
opts (i.e., set workflowRun = preCreatedRun before any awaits) so the created DB
row is present if setup throws, and ensure any call that flips persisted state
for resumed runs (hydrateResumableRun() / resumeWorkflowRun()) is moved to run
only after the async setup (config/env/provider/model resolution) completes
successfully; alternatively, expand the existing top-level try/catch/finally to
start before the first await so failure handling covers the entire function.
Reference symbols: preCreatedRun, workflowRun, hydrateResumableRun(),
resumeWorkflowRun(), and the config/provider resolution code block.
packages/core/src/orchestrator/orchestrator-agent.ts (1)

275-312: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This shared dispatch path still reintroduces implicit resume on web.

dispatchOrchestratorWorkflow() is used by plain /workflow run and AI /invoke-workflow, so doing findResumableRunByParentConversation() here means a later run in the same web conversation can still silently reuse preCreatedRun/priorCompletedNodes. That breaks the PR’s fresh-by-default goal, and it also makes approval flows rely on a fuzzy re-lookup instead of the exact paused run the caller already has.

Please make resume explicit at this boundary too: pass hydrated resume state into dispatchOrchestratorWorkflow() only from approval/resume callers, and keep ordinary run paths fresh.

Possible shape
 async function dispatchOrchestratorWorkflow(
   platform: IPlatformAdapter,
   conversationId: string,
   conversation: Conversation,
   codebase: Codebase,
   workflow: WorkflowDefinition,
   userMessage: string,
-  isolationHints?: HandleMessageContext['isolationHints']
+  isolationHints?: HandleMessageContext['isolationHints'],
+  resume?: {
+    cwd: string;
+    preCreatedRun: WorkflowRun;
+    priorCompletedNodes: Map<string, string>;
+  }
 ): Promise<void> {
   // ...
   if (platform.getPlatformType() === 'web') {
-    const resumableRun = await workflowDb.findResumableRunByParentConversation(
-      workflow.name,
-      conversation.id
-    );
-    if (resumableRun?.working_path) {
-      const deps = createWorkflowDeps();
-      const prepared = await hydrateResumableRun(deps, resumableRun);
-      if (!prepared) {
-        throw new Error(...);
-      }
+    if (resume) {
       await executeWorkflow(
-        deps,
+        createWorkflowDeps(),
         platform,
         conversationId,
-        resumableRun.working_path,
+        resume.cwd,
         workflow,
         userMessage,
         conversation.id,
         {
           codebaseId: codebase.id,
           parentConversationId: conversation.id,
-          preCreatedRun: prepared.run,
-          priorCompletedNodes: prepared.priorCompletedNodes,
+          preCreatedRun: resume.preCreatedRun,
+          priorCompletedNodes: resume.priorCompletedNodes,
         }
       );
     } else if (workflow.interactive) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/core/src/orchestrator/orchestrator-agent.ts` around lines 275 - 312,
The shared dispatch path currently calls findResumableRunByParentConversation
and hydrates a resumable run here, which causes implicit resume for ordinary
runs; change this so dispatchOrchestratorWorkflow never performs implicit
resume: remove the findResumableRunByParentConversation/hydrateResumableRun
logic from this shared path and instead only accept a hydrated resume payload
(e.g., a prepared run object containing preCreatedRun and priorCompletedNodes)
when callers explicitly intend to resume (approval/resume entrypoints); update
callers that perform explicit resume to call hydrateResumableRun and then pass
the prepared resume state into dispatchOrchestratorWorkflow, leaving plain
/workflow run and /invoke-workflow callers to invoke
dispatchOrchestratorWorkflow without any resume state so they always start
fresh.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cli/src/commands/workflow.ts`:
- Around line 452-455: The resumable-run lookup uses the raw CLI token
(workflowName) instead of the resolved canonical name, so change the call to
workflowDb.findResumableRun to pass the resolved name (workflow.name) returned
by resolveWorkflowName(); update the code around where resumable is assigned
(the call to findResumableRun and its surrounding logic that throws when
!resumable) to use workflow.name so stored rows keyed by the canonical workflow
name are matched correctly.

In `@packages/workflows/src/executor.ts`:
- Around line 273-276: The fast-path check only treats metadata.approval.type
=== 'interactive_loop' as resumable (hasInteractiveLoopState), causing valid
approval/gate nodes with other approval types to be rejected when
priorCompletedNodes.size === 0; update the predicate used in the if so it
detects any resumable approval/gate state rather than hard-coding
'interactive_loop' — e.g., replace/extend hasInteractiveLoopState with a check
that candidate.metadata?.approval exists and its state/type indicates a
resumable approval (or use a helper like
isResumableApproval(candidate.metadata.approval)) and then use that new
predicate in the priorCompletedNodes.size === 0 conditional to allow resuming
for all such approval nodes.

---

Outside diff comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 275-312: The shared dispatch path currently calls
findResumableRunByParentConversation and hydrates a resumable run here, which
causes implicit resume for ordinary runs; change this so
dispatchOrchestratorWorkflow never performs implicit resume: remove the
findResumableRunByParentConversation/hydrateResumableRun logic from this shared
path and instead only accept a hydrated resume payload (e.g., a prepared run
object containing preCreatedRun and priorCompletedNodes) when callers explicitly
intend to resume (approval/resume entrypoints); update callers that perform
explicit resume to call hydrateResumableRun and then pass the prepared resume
state into dispatchOrchestratorWorkflow, leaving plain /workflow run and
/invoke-workflow callers to invoke dispatchOrchestratorWorkflow without any
resume state so they always start fresh.

In `@packages/workflows/src/executor.ts`:
- Around line 329-401: Assign the caller-provided preCreatedRun to workflowRun
immediately after destructuring opts (i.e., set workflowRun = preCreatedRun
before any awaits) so the created DB row is present if setup throws, and ensure
any call that flips persisted state for resumed runs (hydrateResumableRun() /
resumeWorkflowRun()) is moved to run only after the async setup
(config/env/provider/model resolution) completes successfully; alternatively,
expand the existing top-level try/catch/finally to start before the first await
so failure handling covers the entire function. Reference symbols:
preCreatedRun, workflowRun, hydrateResumableRun(), resumeWorkflowRun(), and the
config/provider resolution code block.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d5beade-938b-4a26-8ab3-6ca1ea869fe9

📥 Commits

Reviewing files that changed from the base of the PR and between f6feceb and d44e60b.

📒 Files selected for processing (12)
  • CHANGELOG.md
  • CLAUDE.md
  • packages/cli/src/commands/workflow.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
  • packages/core/src/orchestrator/orchestrator.test.ts
  • packages/core/src/orchestrator/orchestrator.ts
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md
  • packages/docs-web/src/content/docs/reference/api.md
  • packages/workflows/src/executor-preamble.test.ts
  • packages/workflows/src/executor.test.ts
  • packages/workflows/src/executor.ts

Comment on lines +452 to 455
resumable = await workflowDb.findResumableRun(workflowName, cwd);

if (!resumable) {
throw new Error(`No resumable run found for workflow '${workflowName}' at path '${cwd}'.`);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the resolved workflow name for the resumable-run lookup.

Line 452 queries with the raw CLI token instead of workflow.name. That means fallback matches from resolveWorkflowName() work for a fresh run and then mysteriously fail for --resume because the stored row is keyed by the canonical workflow name.

Suggested fix
-    resumable = await workflowDb.findResumableRun(workflowName, cwd);
+    resumable = await workflowDb.findResumableRun(workflow.name, cwd);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/cli/src/commands/workflow.ts` around lines 452 - 455, The
resumable-run lookup uses the raw CLI token (workflowName) instead of the
resolved canonical name, so change the call to workflowDb.findResumableRun to
pass the resolved name (workflow.name) returned by resolveWorkflowName(); update
the code around where resumable is assigned (the call to findResumableRun and
its surrounding logic that throws when !resumable) to use workflow.name so
stored rows keyed by the canonical workflow name are matched correctly.

Comment thread packages/workflows/src/executor.ts
@Wirasm

Wirasm commented May 12, 2026

Copy link
Copy Markdown
Collaborator Author

PR Review Summary — Multi-Agent

Seven specialized agents reviewed this PR: code-reviewer, docs-impact, pr-test-analyzer, comment-analyzer, silent-failure-hunter, type-design-analyzer, code-simplifier. Findings consolidated below.

Critical Issues (1)

Agent Issue Location
code-reviewer Web UI Resume button is functionally broken. The PR removes auto-resume on plain re-invocation, but POST /api/workflows/runs/{runId}/resume was not updated. It still returns "Re-run the workflow to auto-resume from completed nodes." and the inline comment still says the next invocation on the same path auto-resumes. PR docs explicitly list "Web UI: Resume button on the workflow card" as a valid resume path, which is now untrue. Verified against the PR branch. packages/server/src/routes/api.ts:1910-1914 (handler), :627 (route summary "Resume a failed workflow run (re-run auto-resumes from completed nodes)")

Important Issues (5)

Agent Issue Location
code-reviewer / code-simplifier (consensus) prepareResumedRun has zero production callers — only test files import it. Violates the YAGNI / "one current caller" rule in CLAUDE.md. CLI and orchestrator both call hydrateResumableRun directly. Suggest unexporting (or inlining the 3-line body) and folding tests into the hydrateResumableRun suite with a findResumableRun mock. packages/workflows/src/executor.ts (export of prepareResumedRun)
type-design-analyzer preCreatedRun and priorCompletedNodes are independently optional but jointly required for resume. A caller passing only priorCompletedNodes (or only preCreatedRun) yields silent corruption: executor creates a fresh row via createWorkflowRun but still uses dagPriorCompletedNodes = priorCompletedNodes to skip nodes — the mirror of the #1392 bug this PR fixes. Type system does not catch it. Suggest a discriminated union: ({preCreatedRun, priorCompletedNodes}) | ({preCreatedRun?: undefined; priorCompletedNodes?: undefined}). packages/workflows/src/executor.ts ExecuteWorkflowOptions
silent-failure-hunter Foreground-resume throw message exceeds the 100-char threshold in classifyAndFormatError, so the user sees the generic "⚠️ An unexpected error occurred. Try /reset to start a fresh session." instead of the actionable message. /reset is also the wrong recovery — the user should abandon the stuck run. Verified: message template is ~62 chars + UUID(36) + workflow name = well over 100. packages/core/src/orchestrator/orchestrator-agent.ts (foreground-resume throw) → packages/core/src/utils/error-formatter.ts:88-95
silent-failure-hunter CLI hydrateResumableRun call has no try/catch. A DB error here propagates as a raw stack trace and gets logged under the misleading cli.workflow_resume_run_failed event name. All other CLI failure paths wrap with context. packages/cli/src/commands/workflow.ts ~line 715
docs-impact Stale user-facing docs that contradict the new explicit-only semantics, not touched by this PR:
packages/docs-web/src/content/docs/book/dag-workflows.md:329 — "Let DAG resume handle failures... run it again... No --resume flag required."
packages/docs-web/src/content/docs/guides/authoring-workflows.md:973,999 — "Checkpoint and Resume" pattern still describes resume as automatic on re-invocation.
packages/docs-web/src/content/docs/guides/approval-nodes.md:237-241 — references findResumableRun by internal-function name as the resume mechanism (stale implementation detail).
(see locations)

Suggestions (5)

Agent Suggestion Location
comment-analyzer Wrong arg index in test comment. Comment says "arg index 14" but assertion accesses [15]. The enumerated signature in the same comment is correct (priorCompletedNodes is 16th, index 15) — only the index label is off. packages/workflows/src/executor.test.ts:559-560
comment-analyzer Issue-number references in source/test comments (e.g. "Silent fallback... was the original #1392 bug", #1392 follow-up). CLAUDE.md guidance: keep issue numbers in PR descriptions, not in code — they rot. executor.ts hydrateResumableRun JSDoc; executor.test.ts removed-suite tombstones; test name regression: #1392
comment-analyzer Position-leaky test comment ("opts.parentConversationId on the options bag at position 7") will rot if executeWorkflow ever gains another positional arg. The assertion opts.parentConversationId is self-documenting. packages/core/src/orchestrator/orchestrator-agent.test.ts
pr-test-analyzer No test for the new CLI throw "Cannot resume: the prior run for '...' has no completed nodes and no interactive-loop state." (rating 8/10). Add a test that mocks findResumableRun → candidate but hydrateResumableRun → null. packages/cli/src/commands/workflow.test.ts
pr-test-analyzer No test for the orchestrator foreground-resume hydrateResumableRun → null branch (rating 7/10). Pairs with the silent-failure issue above — would surface the >100-char error formatting too. packages/core/src/orchestrator/orchestrator-agent.test.ts
type-design-analyzer hydrateResumableRun return key run doesn't align with ExecuteWorkflowOptions.preCreatedRun — three call sites need { preCreatedRun: prepared.run, priorCompletedNodes: prepared.priorCompletedNodes }. Renaming runpreCreatedRun enables a clean ...prepared spread. packages/workflows/src/executor.ts hydrateResumableRun return type

Strengths

  • Core refactor is sound. Removing implicit findResumableRun from executeWorkflow is the right fix for archon workflow run auto-resumes failed runs without --resume — docs say opt-in #1392; the regression test (does NOT call findResumableRun on its own) precisely pins down the cross-invocation leak.
  • Opts bag eliminates the undefined, undefined, undefined, undefined, true smell at every call site — genuine maintenance win.
  • hydrateResumableRun propagates DB errors without silent fallbackpr-test-analyzer confirms both getCompletedDagNodeOutputs and resumeWorkflowRun failure paths are tested with rejects.toThrow. Directly fixes the original archon workflow run auto-resumes failed runs without --resume — docs say opt-in #1392 root cause.
  • ExecuteWorkflowOptions JSDoc is strong — each field documents the non-obvious contract (e.g. preCreatedRun must be in running status; issueContext substitution/append behavior).
  • Orchestrator foreground-resume comment is exemplary: "Hydrate the already-found candidate (skip a second findResumableRun query). If hydration returns null, the run vanished or has nothing worth resuming — throw rather than silently start fresh." — captures non-obvious WHY in three lines.
  • Mock isolation is clean — no collisions across batches; both packages run conflicting mock.module calls in separate bun test invocations.

Verdict

NEEDS FIXES — one blocker (broken Web UI button) plus YAGNI + invariant gap.

Recommended Actions

  1. Block-merge fix: Update POST /api/workflows/runs/{runId}/resume and its route summary in packages/server/src/routes/api.ts. Either dispatch the resume from the handler, or change the response message + summary to remove the now-false "auto-resume on re-run" language and point the user at archon workflow resume <id>.
  2. YAGNI: Remove prepareResumedRun (or stop exporting it). Fold its 3 tests into the hydrateResumableRun suite.
  3. Type invariant: Make preCreatedRun + priorCompletedNodes a discriminated union on ExecuteWorkflowOptions so the type system enforces what the JSDoc currently only documents.
  4. Silent failure: Shorten the foreground-resume throw message to <100 chars (or surface it directly via safeSendMessage before rethrowing) so classifyAndFormatError passes it through. Wrap the CLI hydrateResumableRun call in try/catch with context.
  5. Docs: Update the three stale doc locations the docs-impact agent identified (book/dag-workflows.md:329, authoring-workflows.md:973,999, approval-nodes.md:237-241).
  6. Polish: Fix the arg index 1415 comment in executor.test.ts:559. Consider renaming hydrateResumableRun return key runpreCreatedRun to enable ...prepared spread at call sites.
  7. Tests: Add the two missing null-branch tests (CLI prepared === null, orchestrator hydration null).

Server resume endpoint: wire POST /api/workflows/runs/:runId/resume to
dispatch via the orchestrator (parent web conversation) instead of
returning a stale "re-run to auto-resume" message that no longer
matches reality. CLI-created runs and non-web parents return a clear
400 pointing at `archon workflow resume <id>`.

Executor type design: drop the YAGNI `prepareResumedRun` helper (no
production callers), tighten `ExecuteWorkflowOptions` with a
discriminated union so `priorCompletedNodes` without `preCreatedRun`
is a type error, and rename `hydrateResumableRun`'s return key from
`run` to `preCreatedRun` so callers can spread directly.

Orchestrator foreground-resume: when hydration returns null (prior
run had nothing worth resuming) surface a user-visible notice and
fall through to a fresh run on the same worktree, instead of
throwing a >100-char error that `classifyAndFormatError` would
swallow into the generic `/reset` advice.

CLI resume: wrap `hydrateResumableRun` in try/catch with context so
DB errors don't surface as raw stack traces.

Docs: update three stale locations describing resume as automatic
on re-invocation (book/dag-workflows.md, guides/authoring-workflows.md,
guides/approval-nodes.md).

Tests + comments: rewrite resume endpoint tests to cover the four
new branches (404, 400 no-parent, 400 non-web, 200 web dispatch);
add orchestrator test for the new hydrate→null fall-through; fix
the off-by-one arg-index comment in executor.test.ts; drop
issue-number references and position-leaky comments per
comment-analyzer feedback.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
packages/workflows/src/executor.ts (1)

272-275: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Broaden resumable approval detection beyond interactive_loop.

At Line 272-Line 275, zero-completed runs resume only when approval.type === 'interactive_loop'. This can reject valid paused approval-gate resumes with other approval types (e.g., first-node approval), causing hydrateResumableRun to return null incorrectly.

Suggested fix
-  const hasInteractiveLoopState =
-    candidate.metadata?.approval !== undefined &&
-    (candidate.metadata.approval as Record<string, unknown>).type === 'interactive_loop';
-  if (priorCompletedNodes.size === 0 && !hasInteractiveLoopState) {
+  const approvalMeta = candidate.metadata?.approval as Record<string, unknown> | undefined;
+  const hasResumableApprovalState =
+    approvalMeta !== undefined &&
+    typeof approvalMeta === 'object' &&
+    typeof approvalMeta.nodeId === 'string';
+  if (priorCompletedNodes.size === 0 && !hasResumableApprovalState) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/workflows/src/executor.ts` around lines 272 - 275, The current logic
only treats runs as resumable when candidate.metadata?.approval.type ===
'interactive_loop', causing hydrateResumableRun to return null for other paused
approval types; change the detection to consider any approval metadata instead.
Replace hasInteractiveLoopState with a broader boolean (e.g., hasApprovalState)
that checks candidate.metadata?.approval !== undefined (or truthy), update the
if condition that uses priorCompletedNodes so it uses hasApprovalState, and
ensure any downstream assumptions about interactive_loop-specific fields are
guarded; run/update tests for hydrateResumableRun and approval-resume scenarios
after the change.
🧹 Nitpick comments (1)
packages/docs-web/src/content/docs/guides/approval-nodes.md (1)

237-241: 💤 Low value

Consider abstracting implementation details in Design Notes.

The Design Notes section mentions the specific function hydrateResumableRun, which couples the documentation to the current implementation. If this function is renamed or refactored, the docs will become stale.

Consider rephrasing to describe the mechanism without naming internal functions, e.g., "the orchestrator's explicit resume path" without the parenthetical function name.

That said, since this is a Design Notes section clearly marked for advanced users/maintainers, including implementation details may be intentional and acceptable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/docs-web/src/content/docs/guides/approval-nodes.md` around lines 237
- 241, The Design Notes currently mention the internal function
hydrateResumableRun, which couples docs to implementation; update the paragraph
that reads "the orchestrator's explicit resume path (via `hydrateResumableRun`)"
to either remove the parenthetical function name or rephrase to a
non-implementation-specific description like "the orchestrator's explicit resume
path" and optionally note that this refers to the orchestrator's resume
mechanism (avoid naming hydrateResumableRun), while keeping the clarification
about metadata.approval_response distinguishing resumed vs genuinely-failed
runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@packages/workflows/src/executor.ts`:
- Around line 272-275: The current logic only treats runs as resumable when
candidate.metadata?.approval.type === 'interactive_loop', causing
hydrateResumableRun to return null for other paused approval types; change the
detection to consider any approval metadata instead. Replace
hasInteractiveLoopState with a broader boolean (e.g., hasApprovalState) that
checks candidate.metadata?.approval !== undefined (or truthy), update the if
condition that uses priorCompletedNodes so it uses hasApprovalState, and ensure
any downstream assumptions about interactive_loop-specific fields are guarded;
run/update tests for hydrateResumableRun and approval-resume scenarios after the
change.

---

Nitpick comments:
In `@packages/docs-web/src/content/docs/guides/approval-nodes.md`:
- Around line 237-241: The Design Notes currently mention the internal function
hydrateResumableRun, which couples docs to implementation; update the paragraph
that reads "the orchestrator's explicit resume path (via `hydrateResumableRun`)"
to either remove the parenthetical function name or rephrase to a
non-implementation-specific description like "the orchestrator's explicit resume
path" and optionally note that this refers to the orchestrator's resume
mechanism (avoid naming hydrateResumableRun), while keeping the clarification
about metadata.approval_response distinguishing resumed vs genuinely-failed
runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b087391-6c8e-4970-a476-08f54b1f7e4f

📥 Commits

Reviewing files that changed from the base of the PR and between d44e60b and 047bd5b.

📒 Files selected for processing (13)
  • packages/cli/src/commands/workflow.test.ts
  • packages/cli/src/commands/workflow.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/core/src/orchestrator/orchestrator-agent.ts
  • packages/core/src/orchestrator/orchestrator.test.ts
  • packages/docs-web/src/content/docs/book/dag-workflows.md
  • packages/docs-web/src/content/docs/guides/approval-nodes.md
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md
  • packages/server/src/routes/api.ts
  • packages/server/src/routes/api.workflow-runs.test.ts
  • packages/workflows/src/executor-preamble.test.ts
  • packages/workflows/src/executor.test.ts
  • packages/workflows/src/executor.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/docs-web/src/content/docs/guides/authoring-workflows.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/core/src/orchestrator/orchestrator.test.ts
  • packages/cli/src/commands/workflow.ts
  • packages/core/src/orchestrator/orchestrator-agent.test.ts
  • packages/workflows/src/executor-preamble.test.ts

@Wirasm Wirasm merged commit 34c7fe2 into dev May 12, 2026
4 checks passed
@Wirasm Wirasm mentioned this pull request May 12, 2026
cropse pushed a commit to cropse/Archon that referenced this pull request May 19, 2026
…esumableRun (closes coleam00#1392) (coleam00#1646)

* fix(workflows): make resume explicit via prepareResumedRun / hydrateResumableRun (closes coleam00#1392)

Previously `executeWorkflow` called `findResumableRun(workflow.name, cwd)`
unconditionally and silently auto-resumed the most recent failed run for
the same `(workflow_name, cwd)` pair. The `--resume` CLI flag was
documented as opt-in but had no causal effect on this path — it only
steered worktree selection. Net effect: cached node outputs from a prior
failed run (e.g. an `extract-pr-number` returning "1634") bled into the
next invocation of the same workflow at the same path.

This commit moves the resume decision out of the executor and into the
caller. The executor takes a `preCreatedRun` and optional
`priorCompletedNodes` via a new options bag; it never queries the store
for a prior run. Two helpers in `@archon/workflows/executor` cover the
two real lookup shapes:

- `hydrateResumableRun(deps, candidate)` — caller already has the run
  row (CLI `--resume` post worktree-path resolution; orchestrator
  foreground-resume via `findResumableRunByParentConversation`).
- `prepareResumedRun(deps, workflow, cwd)` — canonical
  `(workflow, cwd)` lookup-and-hydrate.

Both throw on DB errors instead of the previous "warn and degrade to
fresh" silent fallback, matching the Fail Fast principle.

`executeWorkflow`'s trailing positional args (`codebaseId`,
`issueContext`, `isolationContext`, `parentConversationId`,
`preCreatedRun`, plus the new `priorCompletedNodes`) consolidate into a
single `ExecuteWorkflowOptions` bag — fixing the
`undefined, undefined, undefined, undefined, true` smell at every
caller. 7 required positionals + 1 opts; future additions go to opts.

Orphan-cleanup code in the executor deletes outright: with caller-side
resume there is no separate `preCreatedRun` to orphan when resume takes
over.

Supersedes coleam00#1396 (which gated the same internal lookup behind a flag
without addressing the positional explosion or the conceptual
responsibility leak).

Closes coleam00#1392

* fix(workflows): address review feedback on explicit-resume PR

Server resume endpoint: wire POST /api/workflows/runs/:runId/resume to
dispatch via the orchestrator (parent web conversation) instead of
returning a stale "re-run to auto-resume" message that no longer
matches reality. CLI-created runs and non-web parents return a clear
400 pointing at `archon workflow resume <id>`.

Executor type design: drop the YAGNI `prepareResumedRun` helper (no
production callers), tighten `ExecuteWorkflowOptions` with a
discriminated union so `priorCompletedNodes` without `preCreatedRun`
is a type error, and rename `hydrateResumableRun`'s return key from
`run` to `preCreatedRun` so callers can spread directly.

Orchestrator foreground-resume: when hydration returns null (prior
run had nothing worth resuming) surface a user-visible notice and
fall through to a fresh run on the same worktree, instead of
throwing a >100-char error that `classifyAndFormatError` would
swallow into the generic `/reset` advice.

CLI resume: wrap `hydrateResumableRun` in try/catch with context so
DB errors don't surface as raw stack traces.

Docs: update three stale locations describing resume as automatic
on re-invocation (book/dag-workflows.md, guides/authoring-workflows.md,
guides/approval-nodes.md).

Tests + comments: rewrite resume endpoint tests to cover the four
new branches (404, 400 no-parent, 400 non-web, 200 web dispatch);
add orchestrator test for the new hydrate→null fall-through; fix
the off-by-one arg-index comment in executor.test.ts; drop
issue-number references and position-leaky comments per
comment-analyzer feedback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

archon workflow run auto-resumes failed runs without --resume — docs say opt-in

1 participant