fix(core): concurrency-safe workflow resume/cancel (CAS guards) by Wirasm · Pull Request #1830 · coleam00/Archon

Wirasm · 2026-06-01T15:12:01Z

Summary

Problem: resumeWorkflowRun flips a run to running with WHERE id = $1 and no status guard, so two callers racing the same run both succeed and both executeWorkflow on the same worktree (double-claim). The per-conversation lock doesn't protect against this — it's keyed on conversationId, not runId, the CLI resume path has no lock at all, and the web Resume button bypasses it. cancelWorkflowRun is likewise unguarded (WHERE id = $1) and can re-stamp completed_at / resurrect a terminal run.
Why it matters: a double-claimed worktree means two agents mutating the same checkout concurrently — silent corruption. This is the foundational concurrency-safety primitive the planned pending-state / suspend-resume refactor builds on, and it's a live race today.
What changed: resumeWorkflowRun is now a compare-and-swap — the UPDATE only matches a row still in a resumable state (the exact findResumableRun predicate: failed/paused, or a stale running orphan), and since it refreshes last_activity_at a second concurrent resumer is excluded in every case. On a CAS miss it probes the current status and throws an actionable not resumable (status: X). cancelWorkflowRun gains status NOT IN ('completed','cancelled') and treats a no-match as an idempotent no-op.
What did NOT change (scope boundary): no schema/migration, no status-enum change, no new resumeRun() entrypoint, no executor changes. failed stays cancellable (it's overloaded as a resumable/approval state) and a running run stays cancellable (cooperative cancellation). This is only the DB-layer guards.

UX Journey

Before

web "Resume" ─┐
chat re-msg  ─┼─▶ resumeWorkflowRun(id)  UPDATE … WHERE id=$1   ──▶ both succeed
CLI resume   ─┘                          (no status guard)          two executors,
                                                                    one worktree  ✗

After

web "Resume" ─┐                          UPDATE … WHERE id=$1 AND <still-resumable>
chat re-msg  ─┼─▶ resumeWorkflowRun(id)  *first wins (row → running, activity bumped)*
CLI resume   ─┘                          *second: rowCount 0 → "not resumable (status: running)"* ✓

Label Snapshot

Risk: risk: low
Size: size: S
Scope: core, tests
Module: core:db:workflows

Change Metadata

Change type: bug
Primary scope: core

Linked Issue

Related: workflow pause/resume → true suspend-resume design (foundational step). No single tracking issue.

Validation Evidence (required)

bun run type-check    # ✅ all packages
bun x eslint packages/core/src/db/workflows.ts   # ✅ clean
bun test packages/core/src/db/workflows.test.ts             # ✅ 72 pass
bun test packages/core/src/operations/workflow-operations.test.ts  # ✅ 19 pass (callers)
bun test packages/server/src/routes/api.workflow-runs.test.ts      # ✅ 79 pass (resume/cancel routes)

New tests: resume CAS predicate, not resumable (concurrent-activation) miss, not-found miss, probe-error path; cancel terminal-guard, idempotent no-op, db-error.

Security Impact (required)

New permissions/capabilities? No.
New external network calls? No.
Secrets/tokens handling changed? No.
File system access scope changed? No — narrows a concurrency hazard (prevents two executors on one worktree).

Compatibility / Migration

Backward compatible? Yes. Same call signatures and return types; legitimate single-caller resume/cancel behave identically (incl. stale-running orphan recovery).
Config/env changes? No.
Database migration needed? No.

Human Verification (required)

Verified: CAS predicate present in the UPDATE; concurrent-activation miss throws not resumable (status: running) (observed db.workflow_run_resume_not_resumable warn); cancel guard present; idempotent no-op on terminal; all caller suites green.
Edge cases: stale-running orphan still resumable (predicate mirrors findResumableRun); failed/running still cancellable; deleted-run probe → "not found".
Not verified: live multi-process race (covered by reasoning + the CAS predicate refreshing last_activity_at); Postgres path exercised via the mocked-dialect unit tests only.

Side Effects / Blast Radius (required)

Affected: resumeWorkflowRun / cancelWorkflowRun in @archon/core/db/workflows, reached by the executor (hydrateResumableRun), operations layer, command-handler, and the server resume/cancel/abandon routes.
Potential unintended effects: a resume that loses the CAS race now throws instead of silently proceeding — callers (hydrateResumableRun and up) already propagate/handle resume errors, which is the intended "someone else is resuming, back off" outcome.
Guardrails: db.workflow_run_resume_not_resumable / db.workflow_run_cancel_noop log lines for observability.

Rollback Plan (required)

Fast rollback: revert the single commit (no migration, no persisted state).
Observable failure symptoms: legitimate resumes erroring not resumable (would indicate the predicate is too tight); cancels silently no-op'ing when they shouldn't.

Risks and Mitigations

Risk: the CAS predicate rejects a legitimate resume → Mitigation: predicate is copied verbatim from findResumableRun, which is what selected the candidate in the first place; unit-tested incl. the stale-running orphan clause.
Risk: cancel no-op hides a real failure → Mitigation: only completed/cancelled are excluded (truly finished); everything else still cancels; no-op is logged.

Summary by CodeRabbit

Bug Fixes
- Cancellation is idempotent and reports whether a run was actually cancelled; endpoint returns a clear “nothing to cancel” message when applicable.
- Resumption now reliably refuses non-resumable runs, distinguishes “not found” vs “not resumable”, and avoids claiming runs lost to races.
New Features
- Orchestrator surfaces a user-facing warning when a resumable run was already claimed elsewhere.
Tests
- Expanded unit and integration tests covering resumption and cancellation edge cases.

coderabbitai · 2026-06-01T15:12:15Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b2f43285-67b5-43c9-8312-3dad2ffa5be3

📥 Commits

Reviewing files that changed from the base of the PR and between 48f5299 and 2eac852.

📒 Files selected for processing (8)

packages/core/package.json
packages/core/src/db/workflows.resume-cas.integration.test.ts
packages/core/src/db/workflows.test.ts
packages/core/src/db/workflows.ts
packages/core/src/orchestrator/orchestrator-agent.ts
packages/server/src/routes/api.ts
packages/server/src/routes/api.workflow-runs.test.ts
packages/workflows/src/store.ts

📝 Walkthrough

Walkthrough

Harden resumeWorkflowRun with a CAS-guarded UPDATE and status probe on CAS miss; add WorkflowNotResumableError. Make cancelWorkflowRun idempotent and return { cancelled }. Update orchestrator, API route, unit tests, and add an integration test.

Changes

Workflow Run State Transition Safety

Layer / File(s)	Summary
Shared resumable predicate & error type `packages/core/src/db/workflows.ts`	Adds `resumableStatusClause(dialect, dayParamIndex)`, `ORPHAN_RESUME_STALE_DAYS`, and exported `WorkflowNotResumableError` used across resume logic.
Resume UPDATE with CAS and probe `packages/core/src/db/workflows.ts`, `packages/core/src/db/workflows.test.ts`	`resumeWorkflowRun` now performs a CAS-style UPDATE to `running` guarded by the resumable predicate; on zero rows it SELECTs current status and throws either not-found or `WorkflowNotResumableError`. Unit tests assert the CAS predicate, bound day parameter, probe-select behavior, and error propagation.
Orchestrator race handling `packages/core/src/orchestrator/orchestrator-agent.ts`	`dispatchOrchestratorWorkflow` catches `WorkflowNotResumableError`, logs a resume-lost race, notifies the user with `currentStatus`, and avoids starting a fresh run.
SQLite integration test + CI script `packages/core/src/db/workflows.resume-cas.integration.test.ts`, `packages/core/package.json`	Adds an in-memory Bun/SQLite integration test exercising stale-orphan CAS/date parameter path and updates `scripts.test` to run it.
Idempotent cancellation and store interface `packages/core/src/db/workflows.ts`, `packages/workflows/src/store.ts`, `packages/core/src/db/workflows.test.ts`	`cancelWorkflowRun` now updates to `cancelled` only when not already terminal, returns `{ cancelled: boolean }`, logs no-op when no rows changed; `IWorkflowStore.cancelWorkflowRun` signature updated accordingly and tests cover success, idempotence, and error propagation.
API route and server tests `packages/server/src/routes/api.ts`, `packages/server/src/routes/api.workflow-runs.test.ts`	Cancel endpoint reads `{ cancelled }` and returns a conditional message; server tests updated to mock `{ cancelled: true/false }` and include a TOCTOU regression test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

coleam00/Archon#1212: Earlier modifications to resumeWorkflowRun and started_at refresh behavior that overlap at the SQL transition level.
coleam00/Archon#1646: Related orchestrator resume error-handling changes interacting with WorkflowNotResumableError.
coleam00/Archon#1756: Related orchestrator resumable-run selection and dispatch changes.

Poem

🐰 I nibble at race conditions with cheer,
CAS locks the path so no duplicate reappear,
Cancel replies "done" or "nothing to clear",
Probes tell the truth when updates disappear,
Hooray — safe runs hop on without fear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: implementing compare-and-swap guards for workflow resume/cancel operations to prevent concurrent execution hazards.
Description check	✅ Passed	The PR description comprehensively covers all required template sections: problem statement, rationale, changes made, scope boundaries, UX journey, architecture impact, validation evidence, security/compatibility analysis, human verification, side effects, and rollback plan.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/workflow-resume-concurrency-guards

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

packages/core/src/db/workflows.ts (2)
396-416: ⚡ Quick win

Extract the resumable-state predicate into a shared helper.

This PR’s safety guarantee depends on resumeWorkflowRun and findResumableRun staying in lockstep, but the stale-running predicate is now hand-copied in both places. Please build that fragment once and reuse it so a future edit can’t make lookup and CAS disagree.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/core/src/db/workflows.ts` around lines 396 - 416, The
resumable-state SQL predicate used in resumeWorkflowRun is duplicated from
findResumableRun and must be extracted into a single helper (e.g.,
resumableStatePredicate or buildResumablePredicate) so both functions use the
same fragment; implement a function that returns the predicate string (including
the stale-running clause with ${dialect.nowMinusDays(3)} and any
${dialect.now()} uses) and replace the inline WHERE clause in resumeWorkflowRun
and findResumableRun to reference that helper so future edits affect both places
atomically.
443-446: ⚡ Quick win

Use standard state suffixes for the new log events.

db.workflow_run_resume_not_found, db.workflow_run_resume_not_resumable, and db.workflow_run_cancel_noop don’t match the repo’s {domain}.{action}_{state} convention. Keep the suffix to a standard state and move the reason into structured fields so routing and dashboards stay consistent.

As per coding guidelines, "Structured logging with Pino (packages/paths/src/logger.ts); event naming: {domain}.{action}_{state} (standard states: _started, _completed, _failed, _validated, _rejected); always pair _started with _completed or _failed; include context: IDs, durations, error details; never log API keys, tokens (mask: token.slice(0, 8) + '...'), message content, PII".

Also applies to: 638-639
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/core/src/db/workflows.ts` around lines 443 - 446, The log event
names must follow the {domain}.{action}_{state} pattern and move human reasons
into structured fields: change calls to getLog().warn({ workflowRunId: id },
'db.workflow_run_resume_not_found') and getLog().warn({ workflowRunId: id,
currentStatus }, 'db.workflow_run_resume_not_resumable') to use standardized
suffixes (e.g. 'db.workflow_run_resume_failed' for missing run and
'db.workflow_run_resume_rejected' for non-resumable) and include the reason and
context in the object payload (e.g. { workflowRunId: id, currentStatus, reason:
'not_found' } and { workflowRunId: id, currentStatus, reason: 'not_resumable'
}); apply the same change for the related cancel noop logs (lines noted) and
ensure IDs/durations/errors are structured, no secrets are logged, and started
events elsewhere are paired with completed/failed as per the convention.
packages/core/src/db/workflows.test.ts (1)
747-761: ⚡ Quick win

Assert the stale-running guard, not just the running branch.

This test still passes if the last_activity_at cutoff disappears and any fresh running row becomes resumable again. Please also assert on the staleness clause so the regression test actually protects the double-claim fix.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/core/src/db/workflows.test.ts` around lines 747 - 761, The test for
resumeWorkflowRun currently only asserts that the UPDATE includes "status =
'running' AND" but not the staleness cutoff; update the test (around the
mockQuery inspection for updateQuery) to also assert the stale-running guard is
present by checking for the last_activity_at cutoff expression (e.g.,
expect(updateQuery).toContain("status = 'running' AND") and
expect(updateQuery).toContain("last_activity_at <") or the exact SQL fragment
used in resumeWorkflowRun that combines running with last_activity_at), so the
test enforces the stale-running clause that prevents double-claiming.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/core/src/db/workflows.test.ts`:
- Around line 747-761: The test for resumeWorkflowRun currently only asserts
that the UPDATE includes "status = 'running' AND" but not the staleness cutoff;
update the test (around the mockQuery inspection for updateQuery) to also assert
the stale-running guard is present by checking for the last_activity_at cutoff
expression (e.g., expect(updateQuery).toContain("status = 'running' AND") and
expect(updateQuery).toContain("last_activity_at <") or the exact SQL fragment
used in resumeWorkflowRun that combines running with last_activity_at), so the
test enforces the stale-running clause that prevents double-claiming.

In `@packages/core/src/db/workflows.ts`:
- Around line 396-416: The resumable-state SQL predicate used in
resumeWorkflowRun is duplicated from findResumableRun and must be extracted into
a single helper (e.g., resumableStatePredicate or buildResumablePredicate) so
both functions use the same fragment; implement a function that returns the
predicate string (including the stale-running clause with
${dialect.nowMinusDays(3)} and any ${dialect.now()} uses) and replace the inline
WHERE clause in resumeWorkflowRun and findResumableRun to reference that helper
so future edits affect both places atomically.
- Around line 443-446: The log event names must follow the
{domain}.{action}_{state} pattern and move human reasons into structured fields:
change calls to getLog().warn({ workflowRunId: id },
'db.workflow_run_resume_not_found') and getLog().warn({ workflowRunId: id,
currentStatus }, 'db.workflow_run_resume_not_resumable') to use standardized
suffixes (e.g. 'db.workflow_run_resume_failed' for missing run and
'db.workflow_run_resume_rejected' for non-resumable) and include the reason and
context in the object payload (e.g. { workflowRunId: id, currentStatus, reason:
'not_found' } and { workflowRunId: id, currentStatus, reason: 'not_resumable'
}); apply the same change for the related cancel noop logs (lines noted) and
ensure IDs/durations/errors are structured, no secrets are logged, and started
events elsewhere are paired with completed/failed as per the convention.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae104db0-c485-4af7-aa00-9e3fedceaa6e

📥 Commits

Reviewing files that changed from the base of the PR and between 367edfd and e6a5da2.

📒 Files selected for processing (2)

packages/core/src/db/workflows.test.ts
packages/core/src/db/workflows.ts

Wirasm · 2026-06-02T06:11:05Z

PR Review Summary — Multi-Agent

Six specialist agents reviewed this PR (code quality, docs impact, tests, comments, silent failures, simplification). The core concurrency design is sound — the compare-and-swap approach to preventing a double-claimed worktree is the right primitive — but there is one merge-blocking bug in the CAS UPDATE itself, which I reproduced empirically.

🔴 Critical (1) — blocks merge

C1 — The CAS predicate references an unbound SQL parameter $3; resume is broken.
packages/core/src/db/workflows.ts:415 + params at :417

dialect.nowMinusDays(n) takes a parameter index, not a day count — it emits the literal text $n and the day value must be supplied in the params array:

Postgres (postgres.ts:148): NOW() - ($3 || ' days')::INTERVAL
SQLite (sqlite.ts:542): datetime('now', '-' || $3 || ' days')

The new CAS UPDATE interpolates nowMinusDays(3) (→ $3) but passes only [id] (→ $1). $3 is never bound. Compare findResumableRun at :325/:329, which correctly passes [workflowName, workingPath, 1] so $3 = 1. The comment claiming the predicate "mirrors findResumableRun's exactly" copied the SQL text but not the parameterization convention.

Verified runtime impact:

Postgres — every resume throws. A query referencing $3 with one supplied value fails at Bind (bind message supplies 1 parameters, but prepared statement requires 3). This is caught and rethrown as Failed to resume workflow run: …. Resume is fully broken on Postgres — for failed/paused runs and orphans, since the bind fails before any row is evaluated.
SQLite — orphan recovery silently dead. The adapter's convertPlaceholders (sqlite.ts:129-145) maps $3 → params[2] = undefined, bound as NULL. last_activity_at < NULL is NULL (false), so the stale-running orphan arm never matches. failed/paused resume still works; orphan recovery does not.

I reproduced both on a live bun:sqlite DB:

# current PR code (params [id]):
resume r1 (stale 10-day orphan): changes=0   ← should resume, silently doesn't
resume r2 (failed):              changes=1

# fixed (nowMinusDays(2), params [id, 3]):
resume r1 (stale orphan, 3d threshold): changes=1   ← resumes
resume r3 (fresh 2h run,  3d threshold): changes=0   ← correctly excluded, no double-claim

Fix: bind the day count as $2 and call nowMinusDays(2) with params [id, <days>]. Note the value: findResumableRun binds 1 (its effective threshold is 1 day, despite the "3 days" comments). To truly mirror it — so a candidate selected by findResumableRun can't then be rejected by the resume CAS in the 1–3 day window — bind the same value here, or fix both to a shared constant and make the "3 days" comments true. The two predicates must agree on the number, not just the shape.

This bug exists precisely because the predicate is duplicated by hand (see S1).

🟠 Important (3)

I1 — Cancel route reports false success when the DB no-op fires.
packages/server/src/routes/api.ts:2126-2143 + workflows.ts:638
cancelWorkflowRun now returns void whether it cancelled a row or no-op'd a terminal one. The cancel route pre-checks status, then calls it, then unconditionally returns { success: true, message: "Cancelled workflow: …" }. In the TOCTOU window (run finishes between pre-check and UPDATE) the user is told "Cancelled" when nothing happened. Note: this is still strictly better than pre-PR, which would resurrect a completed run to cancelled and re-stamp completed_at — the bug this PR fixes. Low-cost improvement: return { cancelled: boolean } so callers can distinguish a real cancel from a no-op.

I2 — The not resumable throw isn't translated on the web/orchestrator resume path.
packages/core/src/orchestrator/orchestrator-agent.ts:~399
hydrateResumableRun → resumeWorkflowRun can now throw Workflow run is not resumable (id: …, status: …), but unlike the CLI path (cli/src/commands/workflow.ts:773-784, which wraps it in a friendly message), the orchestrator call is unguarded and the raw internal string (with DB ids) falls through to the generic orchestrator_message_failed catch. Wrap it and surface a "someone else is already resuming this" message.

I3 — Tests assert SQL substrings, not behavior — which is exactly why C1 slipped through.
packages/core/src/db/workflows.test.ts
The CAS-predicate test only does expect(updateQuery).toContain("status IN ('failed', 'paused')") against a mocked pool.query; it never asserts the params array and never executes SQL. Add expect(updateParams).toEqual(['…', <days>]) (would have caught C1 immediately) and at least one integration test that runs resume against a real SQLite DB to exercise the orphan-recovery arm and the dialect-specific date SQL.

🟡 Suggestions (4)

S1 — Extract the duplicated CAS predicate into a file-private helper. workflows.ts:324-326 and 413-416. The comment literally says these "must mirror" — duplication-that-must-stay-in-sync is the maintenance hazard that produced C1. A resumableStatusClause(dialect, dayParamIndex) helper used by both call sites makes the coupling structural and impossible to drift. Strongest non-blocking finding.

S2 — Log-level inconsistency. cancelWorkflowRun no-op logs at debug (:638, invisible at default prod level) while the analogous resume CAS-miss logs at warn (:446). Pick a consistent level for these concurrency-signal events (info is defensible for both).

S3 — Comment accuracy. (a) "mirrors findResumableRun's exactly" is now demonstrably false (C1) — fix it when fixing the bug. (b) "a second concurrent resumer is excluded in every case" relies on atomic row-level UPDATE serialization, which the comment doesn't state — the predicate alone isn't the mechanism. (c) The probe is labeled "best-effort" but actually rethrows on failure — "informational-only" is more accurate. (d) The executor "between-layer status check" cross-reference has no locatable pointer and omits the separate mid-stream cancel path (dag-executor.ts).

S4 — Probe-flow / minor. Capture probeRows (not currentStatus) across the try/catch so currentStatus narrows to string and the undefined check disappears; consider threading { cause: err } into the rethrown probe error.

✅ Strengths

The CAS approach correctly relies on atomic row-level UPDATE + last_activity_at refresh to exclude a second resumer — sound concurrency reasoning.
CAS-miss paths (not-found / not-resumable / probe-error / UPDATE-error) and cancel idempotency are each individually tested with clear intent-documenting names.
Cancel terminal-guard correctly keeps failed (overloaded resumable/approval state) and running (cooperative cancel) cancellable.
No silent failures at the DB layer — every error path is logged with context.

📄 Documentation

No updates needed — the changed error strings and the cancel no-op are internal/edge behaviors not described in user-facing docs; orphan-resume guidance in authoring-workflows.md remains accurate.

Verdict: NEEDS FIXES

C1 is merge-blocking: resume is currently broken on Postgres (hard error) and orphan-recovery is broken on SQLite. The fix is a few characters (bind the day param + matching value), but it needs a params-array assertion and ideally a real-DB integration test (I3) to lock it in, since the existing mock-only tests cannot catch it. I1–I3 should be addressed before merge; S1–S4 are polish. Re-run review after the fix.

Note on a pre-existing issue surfaced in passing: findResumableRun's "3 days" comments don't match its bound value of 1 — worth reconciling, but out of this PR's scope.

…rdening C1 (merge-blocking): the resume CAS predicate interpolated nowMinusDays(3) → $3 but bound only [id], leaving the day placeholder unbound — resume threw on Postgres (bind error: 1 param supplied, 3 required) and orphan recovery silently died on SQLite ($3 → NULL → `last_activity_at < NULL` is false). nowMinusDays takes a PARAMETER INDEX, not a day count. Fixed by binding the day value and, per S1, extracting the shared resumableStatusClause(dialect, idx) helper used by BOTH findResumableRun and resumeWorkflowRun so the predicates can't drift again (the hand-duplication is what caused C1). The day value is the named ORPHAN_RESUME_STALE_DAYS constant, bound identically at both call sites. - I1: cancelWorkflowRun returns { cancelled: boolean }; the cancel route reports "nothing to cancel" instead of a false "Cancelled" when the run already finished (TOCTOU no-op). IWorkflowStore updated to match. - I2: resumeWorkflowRun throws a typed WorkflowNotResumableError; the orchestrator resume path catches it and surfaces "already being resumed" instead of leaking the raw DB-id string to the generic failure catch. - I3: real-SQLite integration test (workflows.resume-cas.integration.test.ts) exercises orphan recovery + the dialect date SQL end-to-end — it fails under C1. Added params-array assertions to the mock tests (would have caught C1 directly). - S2: cancel no-op and resume CAS-miss both log at info (consistent). - S3: corrected the "mirrors exactly" comment (now a shared helper), stated the atomic-UPDATE exclusion mechanism, relabeled the probe "informational only". - S4: capture probeRows across the try/catch; thread { cause } into the rethrow.

Wirasm · 2026-06-02T06:31:29Z

Great catch on C1 — that was a real merge-blocker and you're exactly right about the root cause. Addressed in 48f52997:

Critical

C1 — fixed. nowMinusDays(n) takes a parameter index, not a day count; the CAS emitted $3 but bound only [id]. Now binds the day value, and — per S1 — both findResumableRun and resumeWorkflowRun share a resumableStatusClause(dialect, dayParamIndex) helper so the predicate can't drift again. The day value is a named ORPHAN_RESUME_STALE_DAYS constant bound identically at both sites (I kept the effective 1-day threshold findResumableRun already used, so a candidate it selects can't be rejected by the resume CAS).

Important

I1 — cancelWorkflowRun now returns { cancelled: boolean } (IWorkflowStore updated); the cancel route reports "nothing to cancel" instead of a false "Cancelled" on a TOCTOU no-op.
I2 — resumeWorkflowRun throws a typed WorkflowNotResumableError; the orchestrator resume path catches it and surfaces "already being resumed" instead of leaking the raw DB-id string to the generic catch.
I3 — added a real-SQLite integration test (workflows.resume-cas.integration.test.ts) that runs the actual resumeWorkflowRun against a bun:sqlite DB, exercising the orphan-recovery arm + dialect date SQL — it fails under C1. Also added params-array assertions (toEqual([id, 1])) to the mock tests, which would have caught C1 directly.

Suggestions

S2 — cancel no-op and resume CAS-miss both log at info now.
S3 — corrected the "mirrors exactly" comment (it's a shared helper now), stated the atomic-UPDATE exclusion mechanism explicitly, and relabeled the probe "informational only".
S4 — capture probeRows across the try/catch; threaded { cause } into the rethrown probe error. (Kept the undefined→not-found branch — it distinguishes a deleted run from a not-resumable one.)

Re the aside: findResumableRun's "3 days" comments vs the bound 1 — reconciled here via the shared ORPHAN_RESUME_STALE_DAYS constant, so the number now lives in one place.

Validation: type-check + lint clean; workflows 72, integration 6, operations 19, orchestrator-agent 122, server routes 80, @archon/workflows all green.

resumeWorkflowRun had no status guard — two callers racing the same run (the web Resume button + a chat re-dispatch, or the lock-less CLI path; the conversation lock is keyed per-conversation, not per-run) could both flip it to 'running' and double-claim the worktree. Make it a compare-and-swap: the UPDATE only matches a row still in a resumable state (the exact findResumableRun predicate — failed/paused, or a stale 'running' orphan), and because it refreshes last_activity_at a second concurrent resumer is excluded in every case. On a CAS miss, probe the current status and throw an actionable "not resumable (status: X)" instead of silently proceeding. cancelWorkflowRun was unguarded (WHERE id=$1) and could re-stamp completed_at or resurrect a terminal run. Guard it with status NOT IN ('completed','cancelled') (keeping 'failed' and 'running' cancellable by design) and treat a no-match as an idempotent no-op. This is the foundational concurrency-safety primitive the pending-state / suspend-resume refactor builds on — usable by the console UI and every other surface that resumes or cancels a run. Tests: CAS predicate + not-resumable/not-found/probe-error paths for resume; terminal guard + idempotent no-op + db-error paths for cancel.

…rdening C1 (merge-blocking): the resume CAS predicate interpolated nowMinusDays(3) → $3 but bound only [id], leaving the day placeholder unbound — resume threw on Postgres (bind error: 1 param supplied, 3 required) and orphan recovery silently died on SQLite ($3 → NULL → `last_activity_at < NULL` is false). nowMinusDays takes a PARAMETER INDEX, not a day count. Fixed by binding the day value and, per S1, extracting the shared resumableStatusClause(dialect, idx) helper used by BOTH findResumableRun and resumeWorkflowRun so the predicates can't drift again (the hand-duplication is what caused C1). The day value is the named ORPHAN_RESUME_STALE_DAYS constant, bound identically at both call sites. - I1: cancelWorkflowRun returns { cancelled: boolean }; the cancel route reports "nothing to cancel" instead of a false "Cancelled" when the run already finished (TOCTOU no-op). IWorkflowStore updated to match. - I2: resumeWorkflowRun throws a typed WorkflowNotResumableError; the orchestrator resume path catches it and surfaces "already being resumed" instead of leaking the raw DB-id string to the generic failure catch. - I3: real-SQLite integration test (workflows.resume-cas.integration.test.ts) exercises orphan recovery + the dialect date SQL end-to-end — it fails under C1. Added params-array assertions to the mock tests (would have caught C1 directly). - S2: cancel no-op and resume CAS-miss both log at info (consistent). - S3: corrected the "mirrors exactly" comment (now a shared helper), stated the atomic-UPDATE exclusion mechanism, relabeled the probe "informational only". - S4: capture probeRows across the try/catch; thread { cause } into the rethrow.

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

Wirasm added 2 commits June 2, 2026 13:15

Wirasm force-pushed the fix/workflow-resume-concurrency-guards branch from 48f5299 to 2eac852 Compare June 2, 2026 10:17

Wirasm merged commit 8d6be1c into dev Jun 2, 2026
3 of 4 checks passed

Wirasm deleted the fix/workflow-resume-concurrency-guards branch June 2, 2026 10:26

Wirasm mentioned this pull request Jun 9, 2026

Gate stale workflow resumes #1935

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): concurrency-safe workflow resume/cancel (CAS guards)#1830

fix(core): concurrency-safe workflow resume/cancel (CAS guards)#1830
Wirasm merged 2 commits into
devfrom
fix/workflow-resume-concurrency-guards

Wirasm commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Wirasm commented Jun 2, 2026

Uh oh!

Wirasm commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Wirasm commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

UX Journey

Before

After

Label Snapshot

Change Metadata

Linked Issue

Validation Evidence (required)

Security Impact (required)

Compatibility / Migration

Human Verification (required)

Side Effects / Blast Radius (required)

Rollback Plan (required)

Risks and Mitigations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Wirasm commented Jun 2, 2026

PR Review Summary — Multi-Agent

🔴 Critical (1) — blocks merge

🟠 Important (3)

🟡 Suggestions (4)

✅ Strengths

📄 Documentation

Verdict: NEEDS FIXES

Uh oh!

Wirasm commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Wirasm commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading