fix(core): concurrency-safe workflow resume/cancel (CAS guards)#1830
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughHarden resumeWorkflowRun with a CAS-guarded UPDATE and status probe on CAS miss; add WorkflowNotResumableError. Make cancelWorkflowRun idempotent and return ChangesWorkflow Run State Transition Safety
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
packages/core/src/db/workflows.ts (2)
396-416: ⚡ Quick winExtract the resumable-state predicate into a shared helper.
This PR’s safety guarantee depends on
resumeWorkflowRunandfindResumableRunstaying in lockstep, but the stale-running predicate is now hand-copied in both places. Please build that fragment once and reuse it so a future edit can’t make lookup and CAS disagree.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/db/workflows.ts` around lines 396 - 416, The resumable-state SQL predicate used in resumeWorkflowRun is duplicated from findResumableRun and must be extracted into a single helper (e.g., resumableStatePredicate or buildResumablePredicate) so both functions use the same fragment; implement a function that returns the predicate string (including the stale-running clause with ${dialect.nowMinusDays(3)} and any ${dialect.now()} uses) and replace the inline WHERE clause in resumeWorkflowRun and findResumableRun to reference that helper so future edits affect both places atomically.
443-446: ⚡ Quick winUse standard state suffixes for the new log events.
db.workflow_run_resume_not_found,db.workflow_run_resume_not_resumable, anddb.workflow_run_cancel_noopdon’t match the repo’s{domain}.{action}_{state}convention. Keep the suffix to a standard state and move the reason into structured fields so routing and dashboards stay consistent.As per coding guidelines, "Structured logging with Pino (
packages/paths/src/logger.ts); event naming:{domain}.{action}_{state}(standard states:_started,_completed,_failed,_validated,_rejected); always pair_startedwith_completedor_failed; include context: IDs, durations, error details; never log API keys, tokens (mask:token.slice(0, 8) + '...'), message content, PII".Also applies to: 638-639
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/db/workflows.ts` around lines 443 - 446, The log event names must follow the {domain}.{action}_{state} pattern and move human reasons into structured fields: change calls to getLog().warn({ workflowRunId: id }, 'db.workflow_run_resume_not_found') and getLog().warn({ workflowRunId: id, currentStatus }, 'db.workflow_run_resume_not_resumable') to use standardized suffixes (e.g. 'db.workflow_run_resume_failed' for missing run and 'db.workflow_run_resume_rejected' for non-resumable) and include the reason and context in the object payload (e.g. { workflowRunId: id, currentStatus, reason: 'not_found' } and { workflowRunId: id, currentStatus, reason: 'not_resumable' }); apply the same change for the related cancel noop logs (lines noted) and ensure IDs/durations/errors are structured, no secrets are logged, and started events elsewhere are paired with completed/failed as per the convention.packages/core/src/db/workflows.test.ts (1)
747-761: ⚡ Quick winAssert the stale-running guard, not just the
runningbranch.This test still passes if the
last_activity_atcutoff disappears and any freshrunningrow becomes resumable again. Please also assert on the staleness clause so the regression test actually protects the double-claim fix.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/core/src/db/workflows.test.ts` around lines 747 - 761, The test for resumeWorkflowRun currently only asserts that the UPDATE includes "status = 'running' AND" but not the staleness cutoff; update the test (around the mockQuery inspection for updateQuery) to also assert the stale-running guard is present by checking for the last_activity_at cutoff expression (e.g., expect(updateQuery).toContain("status = 'running' AND") and expect(updateQuery).toContain("last_activity_at <") or the exact SQL fragment used in resumeWorkflowRun that combines running with last_activity_at), so the test enforces the stale-running clause that prevents double-claiming.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/core/src/db/workflows.test.ts`:
- Around line 747-761: The test for resumeWorkflowRun currently only asserts
that the UPDATE includes "status = 'running' AND" but not the staleness cutoff;
update the test (around the mockQuery inspection for updateQuery) to also assert
the stale-running guard is present by checking for the last_activity_at cutoff
expression (e.g., expect(updateQuery).toContain("status = 'running' AND") and
expect(updateQuery).toContain("last_activity_at <") or the exact SQL fragment
used in resumeWorkflowRun that combines running with last_activity_at), so the
test enforces the stale-running clause that prevents double-claiming.
In `@packages/core/src/db/workflows.ts`:
- Around line 396-416: The resumable-state SQL predicate used in
resumeWorkflowRun is duplicated from findResumableRun and must be extracted into
a single helper (e.g., resumableStatePredicate or buildResumablePredicate) so
both functions use the same fragment; implement a function that returns the
predicate string (including the stale-running clause with
${dialect.nowMinusDays(3)} and any ${dialect.now()} uses) and replace the inline
WHERE clause in resumeWorkflowRun and findResumableRun to reference that helper
so future edits affect both places atomically.
- Around line 443-446: The log event names must follow the
{domain}.{action}_{state} pattern and move human reasons into structured fields:
change calls to getLog().warn({ workflowRunId: id },
'db.workflow_run_resume_not_found') and getLog().warn({ workflowRunId: id,
currentStatus }, 'db.workflow_run_resume_not_resumable') to use standardized
suffixes (e.g. 'db.workflow_run_resume_failed' for missing run and
'db.workflow_run_resume_rejected' for non-resumable) and include the reason and
context in the object payload (e.g. { workflowRunId: id, currentStatus, reason:
'not_found' } and { workflowRunId: id, currentStatus, reason: 'not_resumable'
}); apply the same change for the related cancel noop logs (lines noted) and
ensure IDs/durations/errors are structured, no secrets are logged, and started
events elsewhere are paired with completed/failed as per the convention.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ae104db0-c485-4af7-aa00-9e3fedceaa6e
📒 Files selected for processing (2)
packages/core/src/db/workflows.test.tspackages/core/src/db/workflows.ts
PR Review Summary — Multi-AgentSix specialist agents reviewed this PR (code quality, docs impact, tests, comments, silent failures, simplification). The core concurrency design is sound — the compare-and-swap approach to preventing a double-claimed worktree is the right primitive — but there is one merge-blocking bug in the CAS UPDATE itself, which I reproduced empirically. 🔴 Critical (1) — blocks mergeC1 — The CAS predicate references an unbound SQL parameter
The new CAS UPDATE interpolates Verified runtime impact:
I reproduced both on a live Fix: bind the day count as This bug exists precisely because the predicate is duplicated by hand (see S1). 🟠 Important (3)I1 — Cancel route reports false success when the DB no-op fires. I2 — The I3 — Tests assert SQL substrings, not behavior — which is exactly why C1 slipped through. 🟡 Suggestions (4)S1 — Extract the duplicated CAS predicate into a file-private helper. S2 — Log-level inconsistency. S3 — Comment accuracy. (a) "mirrors findResumableRun's exactly" is now demonstrably false (C1) — fix it when fixing the bug. (b) "a second concurrent resumer is excluded in every case" relies on atomic row-level UPDATE serialization, which the comment doesn't state — the predicate alone isn't the mechanism. (c) The probe is labeled "best-effort" but actually rethrows on failure — "informational-only" is more accurate. (d) The executor "between-layer status check" cross-reference has no locatable pointer and omits the separate mid-stream cancel path ( S4 — Probe-flow / minor. Capture ✅ Strengths
📄 DocumentationNo updates needed — the changed error strings and the cancel no-op are internal/edge behaviors not described in user-facing docs; orphan-resume guidance in Verdict: NEEDS FIXESC1 is merge-blocking: resume is currently broken on Postgres (hard error) and orphan-recovery is broken on SQLite. The fix is a few characters (bind the day param + matching value), but it needs a params-array assertion and ideally a real-DB integration test (I3) to lock it in, since the existing mock-only tests cannot catch it. I1–I3 should be addressed before merge; S1–S4 are polish. Re-run review after the fix. Note on a pre-existing issue surfaced in passing: |
…rdening
C1 (merge-blocking): the resume CAS predicate interpolated nowMinusDays(3) → $3
but bound only [id], leaving the day placeholder unbound — resume threw on
Postgres (bind error: 1 param supplied, 3 required) and orphan recovery silently
died on SQLite ($3 → NULL → `last_activity_at < NULL` is false). nowMinusDays
takes a PARAMETER INDEX, not a day count. Fixed by binding the day value and, per
S1, extracting the shared resumableStatusClause(dialect, idx) helper used by BOTH
findResumableRun and resumeWorkflowRun so the predicates can't drift again (the
hand-duplication is what caused C1). The day value is the named
ORPHAN_RESUME_STALE_DAYS constant, bound identically at both call sites.
- I1: cancelWorkflowRun returns { cancelled: boolean }; the cancel route reports
"nothing to cancel" instead of a false "Cancelled" when the run already finished
(TOCTOU no-op). IWorkflowStore updated to match.
- I2: resumeWorkflowRun throws a typed WorkflowNotResumableError; the orchestrator
resume path catches it and surfaces "already being resumed" instead of leaking
the raw DB-id string to the generic failure catch.
- I3: real-SQLite integration test (workflows.resume-cas.integration.test.ts)
exercises orphan recovery + the dialect date SQL end-to-end — it fails under C1.
Added params-array assertions to the mock tests (would have caught C1 directly).
- S2: cancel no-op and resume CAS-miss both log at info (consistent).
- S3: corrected the "mirrors exactly" comment (now a shared helper), stated the
atomic-UPDATE exclusion mechanism, relabeled the probe "informational only".
- S4: capture probeRows across the try/catch; thread { cause } into the rethrow.
|
Great catch on C1 — that was a real merge-blocker and you're exactly right about the root cause. Addressed in Critical
Important
Suggestions
Re the aside: Validation: type-check + lint clean; workflows 72, integration 6, operations 19, orchestrator-agent 122, server routes 80, |
resumeWorkflowRun had no status guard — two callers racing the same run (the web
Resume button + a chat re-dispatch, or the lock-less CLI path; the conversation
lock is keyed per-conversation, not per-run) could both flip it to 'running' and
double-claim the worktree. Make it a compare-and-swap: the UPDATE only matches a
row still in a resumable state (the exact findResumableRun predicate —
failed/paused, or a stale 'running' orphan), and because it refreshes
last_activity_at a second concurrent resumer is excluded in every case. On a CAS
miss, probe the current status and throw an actionable "not resumable
(status: X)" instead of silently proceeding.
cancelWorkflowRun was unguarded (WHERE id=$1) and could re-stamp completed_at or
resurrect a terminal run. Guard it with status NOT IN ('completed','cancelled')
(keeping 'failed' and 'running' cancellable by design) and treat a no-match as an
idempotent no-op.
This is the foundational concurrency-safety primitive the pending-state /
suspend-resume refactor builds on — usable by the console UI and every other
surface that resumes or cancels a run.
Tests: CAS predicate + not-resumable/not-found/probe-error paths for resume;
terminal guard + idempotent no-op + db-error paths for cancel.
…rdening
C1 (merge-blocking): the resume CAS predicate interpolated nowMinusDays(3) → $3
but bound only [id], leaving the day placeholder unbound — resume threw on
Postgres (bind error: 1 param supplied, 3 required) and orphan recovery silently
died on SQLite ($3 → NULL → `last_activity_at < NULL` is false). nowMinusDays
takes a PARAMETER INDEX, not a day count. Fixed by binding the day value and, per
S1, extracting the shared resumableStatusClause(dialect, idx) helper used by BOTH
findResumableRun and resumeWorkflowRun so the predicates can't drift again (the
hand-duplication is what caused C1). The day value is the named
ORPHAN_RESUME_STALE_DAYS constant, bound identically at both call sites.
- I1: cancelWorkflowRun returns { cancelled: boolean }; the cancel route reports
"nothing to cancel" instead of a false "Cancelled" when the run already finished
(TOCTOU no-op). IWorkflowStore updated to match.
- I2: resumeWorkflowRun throws a typed WorkflowNotResumableError; the orchestrator
resume path catches it and surfaces "already being resumed" instead of leaking
the raw DB-id string to the generic failure catch.
- I3: real-SQLite integration test (workflows.resume-cas.integration.test.ts)
exercises orphan recovery + the dialect date SQL end-to-end — it fails under C1.
Added params-array assertions to the mock tests (would have caught C1 directly).
- S2: cancel no-op and resume CAS-miss both log at info (consistent).
- S3: corrected the "mirrors exactly" comment (now a shared helper), stated the
atomic-UPDATE exclusion mechanism, relabeled the probe "informational only".
- S4: capture probeRows across the try/catch; thread { cause } into the rethrow.
48f5299 to
2eac852
Compare
Summary
resumeWorkflowRunflips a run torunningwithWHERE id = $1and no status guard, so two callers racing the same run both succeed and bothexecuteWorkflowon the same worktree (double-claim). The per-conversation lock doesn't protect against this — it's keyed onconversationId, notrunId, the CLI resume path has no lock at all, and the web Resume button bypasses it.cancelWorkflowRunis likewise unguarded (WHERE id = $1) and can re-stampcompleted_at/ resurrect a terminal run.resumeWorkflowRunis now a compare-and-swap — the UPDATE only matches a row still in a resumable state (the exactfindResumableRunpredicate:failed/paused, or a stalerunningorphan), and since it refresheslast_activity_ata second concurrent resumer is excluded in every case. On a CAS miss it probes the current status and throws an actionablenot resumable (status: X).cancelWorkflowRungainsstatus NOT IN ('completed','cancelled')and treats a no-match as an idempotent no-op.resumeRun()entrypoint, no executor changes.failedstays cancellable (it's overloaded as a resumable/approval state) and arunningrun stays cancellable (cooperative cancellation). This is only the DB-layer guards.UX Journey
Before
After
Label Snapshot
risk: lowsize: Score,testscore:db:workflowsChange Metadata
bugcoreLinked Issue
Validation Evidence (required)
not resumable(concurrent-activation) miss, not-found miss, probe-error path; cancel terminal-guard, idempotent no-op, db-error.Security Impact (required)
Compatibility / Migration
runningorphan recovery).Human Verification (required)
not resumable (status: running)(observeddb.workflow_run_resume_not_resumablewarn); cancel guard present; idempotent no-op on terminal; all caller suites green.runningorphan still resumable (predicate mirrorsfindResumableRun);failed/runningstill cancellable; deleted-run probe → "not found".last_activity_at); Postgres path exercised via the mocked-dialect unit tests only.Side Effects / Blast Radius (required)
resumeWorkflowRun/cancelWorkflowRunin@archon/core/db/workflows, reached by the executor (hydrateResumableRun), operations layer, command-handler, and the server resume/cancel/abandon routes.hydrateResumableRunand up) already propagate/handle resume errors, which is the intended "someone else is resuming, back off" outcome.db.workflow_run_resume_not_resumable/db.workflow_run_cancel_nooplog lines for observability.Rollback Plan (required)
not resumable(would indicate the predicate is too tight); cancels silently no-op'ing when they shouldn't.Risks and Mitigations
findResumableRun, which is what selected the candidate in the first place; unit-tested incl. the stale-runningorphan clause.completed/cancelledare excluded (truly finished); everything else still cancels; no-op is logged.Summary by CodeRabbit
Bug Fixes
New Features
Tests