Skip to content

fix: lock workflow runs by working_path (#1036, #1188 part 2)#1212

Merged
Wirasm merged 4 commits into
devfrom
fix/workflow-run-path-lock
Apr 14, 2026
Merged

fix: lock workflow runs by working_path (#1036, #1188 part 2)#1212
Wirasm merged 4 commits into
devfrom
fix/workflow-run-path-lock

Conversation

@Wirasm

@Wirasm Wirasm commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Both #1036 (concurrent-run guard TOCTOU) and #1188 part 2 (parallel worktree collision) reduce to the same primitive: there's no enforced lock on `working_path`, so two dispatches that resolve to the same filesystem location can race.

This PR treats the workflow_runs row as a lock token. `pending / running / paused` = lock held; terminal statuses release.

Fixes #1036
Fixes #1188 (part 2)

What changed

File Change
`packages/core/src/db/workflows.ts` `getActiveWorkflowRunByPath` includes `pending` (with 5-min stale-orphan age window), accepts `excludeId` + `selfStartedAt`, orders by `(started_at ASC, id ASC)` for deterministic older-wins tiebreaker
`packages/workflows/src/store.ts` Updated `IWorkflowStore.getActiveWorkflowRunByPath` interface
`packages/workflows/src/executor.ts` Moved guard to AFTER `workflowRun` is finalized (so we always have self-ID); on guard fire, mark self as `cancelled` (no zombie rows); new actionable error message; resume orphan cancellation
`packages/workflows/src/utils/duration.ts` New `formatDuration` helper for the error message
`packages/core/src/db/workflows.test.ts` 5 new tests for query semantics
`packages/workflows/src/executor.test.ts` 5 new tests for guard placement, self-cancellation, message format, resume orphan
`packages/workflows/src/executor-preamble.test.ts` Updated 2 existing tests for new structural behavior
`packages/workflows/src/utils/duration.test.ts` 8 new tests

Design — why this approach

First-principles framing: the resource being protected is the filesystem (`working_path`), not workflow run identity. The DB row is the lock token.

Why the older-wins tiebreaker matters: without ordering by `(started_at, id)`, two dispatches with similar timestamps could mutually see each other and BOTH abort. With the total order, both always agree which is "first" — only the newer aborts.

Why the 5-min stale window: orphaned `pending` rows from crashed dispatches or resume-replaced rows would otherwise permanently block a path. The window is conservative — workflows transition pending→running in seconds.

Why move the guard: the old guard ran BEFORE `workflowRun` was finalized (no self-ID to exclude). Moving it after lets us pass self-ID + started_at to the query, eliminating "self-detection" false positives and enabling the tiebreaker.

Resume orphan fix: when executor activates resume (`resumableRun` replaces `workflowRun`), the orchestrator's pre-created row is now redundant. Without releasing it, the user's back-to-back resume would block on its own ghost lock until the 5-min window. Best-effort cancellation; if it fails, the stale-pending window is the safety net.

New error message:

```
❌ This worktree is in use by `archon-implement` (running 2m 5s, run `abc12345`).
• Wait for it to finish: `/workflow status`
• Cancel it: `/workflow cancel abc12345`
• Use a different branch: `--branch `
```

Every line is a concrete next action.

Deferred (intentional)

  • Worktree-layer advisory lockfile for the residual Git-level worktree adoption bypasses cross-checkout guard and allows parallel collision #1188.2 microsecond race where both dispatches reach `provider.create()` before either's row commits. Bounded by git's own atomicity for `worktree add` ("path already exists" error). Add a `.archon/.lock` later if real users hit it.
  • Startup cleanup of pre-existing stale pending rows. The 5-min age window makes them harmless.
  • DB partial UNIQUE constraint migration. Code-only is sufficient and avoids migrating leaked stale rows in user DBs.

Testing

  • `bun run type-check` clean
  • `bun run lint` zero warnings
  • All existing `db/workflows` and `executor` tests pass with updates where structural behavior changed
  • 23 new tests covering query semantics, guard placement, self-cancellation, message format, resume orphan, duration formatting
  • Full test suite green

Manual reproduction (recommended verification)

```bash

#1036 reproduction

ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree "run1" &
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree "run2" &
wait

Expected: one succeeds; the other shows the new "in use" message

#1188.2 reproduction

ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --branch test/collide "a" &
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --branch test/collide "b" &
wait

Resume regression (back-to-back resume must NOT be blocked)

1. Run a failing workflow

2. Resume it

3. Immediately resume again → must succeed

```

Summary by CodeRabbit

  • Bug Fixes

    • Improved concurrent workflow detection with deterministic lock ordering to avoid race races, clearer actionable "in use" messages, and automatic cancellation of conflicting runs.
    • Cleanup of orphan/stale pending workflow rows during resume and on failure to prevent indefinite blocking.
  • New Features

    • Status messages now include human-readable elapsed time and state-aware instructions.
  • Tests

    • Added tests for concurrency guards, timestamp parsing, and duration/duration-formatting.
  • Documentation

    • Updated CLI and database docs to reflect abandon behavior and path-lock semantics.

Both bugs reduce to the same primitive: there's no enforced lock on
working_path, so two dispatches that resolve to the same filesystem
location can race. The DB row is the lock token; pending/running/paused
are "lock held"; terminal statuses release.

Changes:

- getActiveWorkflowRunByPath includes `pending` (with 5-min stale-orphan
  age window), accepts excludeId + selfStartedAt, and orders by
  (started_at ASC, id ASC) for a deterministic older-wins tiebreaker.
  Eliminates the both-abort race where two near-simultaneous dispatches
  with similar timestamps could mutually abort each other.

- Move the executor's guard call site to AFTER workflowRun is finalized
  (preCreated, resumed, or freshly created). This guarantees we always
  have self-ID + started_at to pass to the lock query.

- On guard fire after row creation: mark self as 'cancelled' so we don't
  leave a zombie pending row that would then become its own lock holder.

- New error message includes workflow name, duration, short run id, and
  three concrete next-action commands (status / cancel / different
  branch). Replaces the vague "Workflow already running".

- Resume orphan fix: when executor activates a resumable run, mark the
  orchestrator's pre-created row as 'cancelled'. Without this, every
  resume leaks a pending row that would block the user's own
  back-to-back resume until the 5-min stale window.

- New formatDuration helper for the error message (8 unit tests).

Tests:

- 5 new tests in db/workflows.test.ts: pending in active set, age window,
  excludeId exclusion, tiebreaker SQL shape, ordering.
- 5 new tests in executor.test.ts: self-id passed to query, self-cancel
  on guard fire, new message format, resume orphan cancellation,
  resume proceeds even if orphan cancel fails.
- Updated 2 executor-preamble tests for new structural behavior
  (row-then-guard, new message format).
- 8 new tests for formatDuration.

Deferred (kept scope tight):
- Worktree-layer advisory lockfile (residual #1188.2 microsecond race
  where both dispatches reach provider.create — bounded by git's own
  atomicity for `worktree add`).
- Startup cleanup of pre-existing stale pending rows (5-min age window
  makes them harmless).
- DB partial UNIQUE constraint migration (code-only is sufficient).

Fixes #1036
Fixes #1188 (part 2)
@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fdb3c7df-09c2-4f16-8245-8200d5e29ba4

📥 Commits

Reviewing files that changed from the base of the PR and between 008013d and 04069ee.

📒 Files selected for processing (10)
  • packages/core/src/db/adapters/sqlite.test.ts
  • packages/core/src/db/workflows.test.ts
  • packages/core/src/db/workflows.ts
  • packages/docs-web/src/content/docs/reference/cli.md
  • packages/docs-web/src/content/docs/reference/database.md
  • packages/workflows/src/executor.test.ts
  • packages/workflows/src/executor.ts
  • packages/workflows/src/store.ts
  • packages/workflows/src/utils/duration.test.ts
  • packages/workflows/src/utils/duration.ts
✅ Files skipped from review due to trivial changes (3)
  • packages/docs-web/src/content/docs/reference/database.md
  • packages/workflows/src/utils/duration.test.ts
  • packages/workflows/src/executor.test.ts
🚧 Files skipped from review as they are similar to previous changes (5)
  • packages/core/src/db/workflows.test.ts
  • packages/workflows/src/utils/duration.ts
  • packages/workflows/src/store.ts
  • packages/workflows/src/executor.ts
  • packages/core/src/db/workflows.ts

📝 Walkthrough

Walkthrough

getActiveWorkflowRunByPath now accepts an optional self object and treats recent pending rows as active using a STALE_PENDING_AGE_MS cutoff; executor re-queries after run finalization with self-exclusion, cancels self on conflicts (best-effort), and added duration parsing/formatting utilities and tests.

Changes

Cohort / File(s) Summary
DB queries & tests
packages/core/src/db/workflows.ts, packages/core/src/db/workflows.test.ts
Added export const STALE_PENDING_AGE_MS = 5 * 60 * 1000; getActiveWorkflowRunByPath(workingPath, self?) now includes recent pending rows, optional id != $N exclusion and deterministic tiebreaker (started_at, id); ordering changed to ORDER BY started_at ASC, id ASC; tests assert SQL, params ordering, and resume UPDATE sets started_at = NOW().
Store interface
packages/workflows/src/store.ts
Updated IWorkflowStore.getActiveWorkflowRunByPath signature to accept self?: { id: string; startedAt: Date } and docstring about ignoring stale pending rows.
Executor & tests
packages/workflows/src/executor.ts, packages/workflows/src/executor.test.ts, packages/workflows/src/executor-preamble.test.ts
Removed pre-finalization guard; added post-finalization path-lock guard that calls getActiveWorkflowRunByPath(cwd, {id, startedAt}); on active-other result, best-effort updateWorkflowRun to cancel self and send status-aware “in use” messages (paused vs pending/running); added orphan pre-created-run cancellation on resume and best-effort cleanup on error; tests updated for creation-before-guard, self-cancel behavior, message content, and failure modes.
Duration utils & tests
packages/workflows/src/utils/duration.ts, packages/workflows/src/utils/duration.test.ts
Added `parseDbTimestamp(value: Date
SQLite adapter tests
packages/core/src/db/adapters/sqlite.test.ts
Added tests demonstrating lexical vs chronological datetime comparisons and correct use of datetime() for TEXT timestamps.
Docs
packages/docs-web/src/content/docs/reference/cli.md, packages/docs-web/src/content/docs/reference/database.md
Adjusted CLI docs: workflow abandon marks run cancelled and immediately releases path lock; database docs clarify per-working_path locking, pending staleness (~5 min) ignored, and blocking behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Orch as Orchestrator
    participant Exec as Executor
    participant Store as WorkflowStore
    participant DB as Database

    Orch->>DB: INSERT pre-created row (status='pending')
    Orch->>Exec: executeWorkflow(workingPath, runId, startedAt)

    rect rgba(120,160,200,0.5)
        Note over Exec,Store: Path-lock guard (self-excluding)
        Exec->>Store: getActiveWorkflowRunByPath(workingPath, self={id: runId, startedAt})
        Store->>DB: SELECT ... WHERE working_path=$1 AND (status IN ('running','paused') OR (status='pending' AND started_at > cutoff)) AND optional exclusion/tiebreaker ORDER BY started_at ASC, id ASC LIMIT 1
        alt active other run found
            DB-->>Store: active_run
            Store-->>Exec: active_run
            Exec->>DB: updateWorkflowRun(runId, {status:'cancelled'}) (best-effort)
            Exec-->>Orch: return failure + "in use" message (elapsed, id, commands)
        else no conflict
            DB-->>Store: null
            Store-->>Exec: null
            Exec->>Exec: proceed with execution (resume/create)
            Exec-->>Orch: success
        end
    end
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I pre-made a row, then watched time tick,
Pending’s not sleeping — we added a stick.
I skip my own paw, older wins the race,
Orphans get shooed, durations keep pace. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: lock workflow runs by working_path (#1036, #1188 part 2)' clearly identifies the main change—introducing working_path locking—and references the two issues being fixed. It is specific, concise, and accurately summarizes the primary objective.
Description check ✅ Passed The PR description covers all key template sections: summary with bullets, detailed change table, design rationale, deferred items, testing status, manual reproduction steps, and security/compatibility notes. Content is complete and well-organized.
Linked Issues check ✅ Passed The PR addresses all coding requirements from #1036 and #1188 part 2: includes pending status in guard query with 5-min stale window, implements deterministic older-wins tiebreaker via (started_at, id) ordering, excludes self from query, moves guard after finalization, cancels self on conflict, and improves error messaging with concrete actions.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: lock semantics (workflows.ts, store.ts, executor.ts), supporting utilities (duration.ts), tests, and documentation. No unrelated refactors or cosmetic changes. Deferred items (worktree lockfile, startup cleanup, DB migration) are explicitly documented.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/workflow-run-path-lock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Two issues found during E2E smoke testing:

1. bun:sqlite rejects Date objects as bindings ("Binding expected
   string, TypedArray, boolean, number, bigint or null"). Serialize
   selfStartedAt to ISO string before passing — PostgreSQL accepts
   ISO strings for TIMESTAMPTZ comparison too.

2. SQLite returns datetimes as plain strings without timezone suffix
   ("YYYY-MM-DD HH:MM:SS"), and JS new Date() parses such strings as
   local time. The blocking message was showing "running 3h" for
   workflows started seconds ago in a UTC+3 timezone.

   Added parseDbTimestamp helper that:
   - Returns Date.getTime() unchanged for Date inputs (PG path)
   - Treats SQLite-style strings as UTC by appending Z

   Used at both call sites: the lock query (selfStartedAt) and the
   blocking message duration.

Tests:
- 4 new tests in duration.test.ts for parseDbTimestamp covering
  Date input, SQLite UTC interpretation, explicit Z, and explicit
  +/-HH:MM offsets.
- Updated workflows.test.ts assertion for ISO serialization.

E2E smoke verified end-to-end:
- Sanity (single dispatch) succeeds.
- Two concurrent --no-worktree dispatches: one wins, one blocked
  with actionable message showing correct "Xs" duration.
- Resume + back-to-back resume both succeed (orphan correctly
  cancelled when resume activates).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/workflows/src/executor.ts`:
- Around line 490-504: The user-facing message and error assume the blocker is
"running"; update the logic that formats the notification and returned error
(around sendCriticalMessage and the returned error where activeWorkflow is used)
to read and use activeWorkflow.status instead of hardcoding "running": map
status values like "running", "pending", "paused" to appropriate human
verbs/phrases (e.g., "running", "pending", "paused" or "paused (interactive)")
and change the suggestions accordingly (for paused suggest resume/unpause, for
pending suggest wait or cancel) so the message and the error string reflect the
actual status from getActiveWorkflowRunByPath(); ensure you still include
workflow_name and shortId in the text.
- Around line 375-401: The code only releases the pre-created run's lock when
resumeWorkflowRun succeeds, so if resumeWorkflowRun() or
getActiveWorkflowRunByPath() throws the preCreatedRun (orphanPreCreated) can
remain in pending and hold the working_path lock; update the logic around
resumeWorkflowRun (and the earlier getActiveWorkflowRunByPath code paths noted
at lines ~507-516) to ensure the preCreatedRun is released on all failure paths:
capture orphanPreCreated (preCreatedRun && preCreatedRun.id !== resumableRun.id)
before calling resumeWorkflowRun(), then wrap resumeWorkflowRun() (and any code
that may throw after obtaining resumableRun) in a try/catch/finally where in
catch/finally you call deps.store.updateWorkflowRun(orphanPreCreated.id, {
status: 'cancelled' }) and log errors the same way as the existing cleanup so
the pending preCreatedRun is always cancelled regardless of resume/getActive
errors.
- Around line 473-477: resumeWorkflowRun preserves the original started_at,
which lets a resumed row win the older-wins tiebreaker in
getActiveWorkflowRunByPath and bypass the path-lock guard; update
resumeWorkflowRun to refresh started_at (or add a resume_at) to the current
timestamp when marking a run resumed so the subsequent call to
getActiveWorkflowRunByPath uses the resume time for ordering; locate the
resumeWorkflowRun function and ensure it writes a new started_at (or resume_at)
alongside status/completed_at/last_activity_at so the path-lock comparison
treats the resumed attempt as fresh.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3bda85de-759e-426e-b2a1-fbfd4b4dec14

📥 Commits

Reviewing files that changed from the base of the PR and between daaab9d and 301a64c.

📒 Files selected for processing (5)
  • packages/core/src/db/workflows.test.ts
  • packages/core/src/db/workflows.ts
  • packages/workflows/src/executor.ts
  • packages/workflows/src/utils/duration.test.ts
  • packages/workflows/src/utils/duration.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/core/src/db/workflows.test.ts
  • packages/workflows/src/utils/duration.ts
  • packages/core/src/db/workflows.ts

Comment thread packages/workflows/src/executor.ts
Comment thread packages/workflows/src/executor.ts Outdated
Comment thread packages/workflows/src/executor.ts Outdated
@Wirasm

Wirasm commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator Author

PR Review Summary (6 specialized agents)

Multi-agent review of the path-based workflow-run lock (fixes #1036 and #1188 part 2).

Critical Issues (2 found)

Agent Issue Location
code-reviewer SQLite tiebreaker format mismatch silently disables older-wins logic. SQLite stores started_at as TEXT DEFAULT (datetime('now'))"YYYY-MM-DD HH:MM:SS". The tiebreaker binds selfStartedAt.toISOString()"YYYY-MM-DDTHH:MM:SS.mmmZ". SQLite compares these lexicographically; at position 10 space (0x20) < T (0x54), so every DB value is always "less than" every ISO param. Both near-simultaneous dispatches therefore see each other as "older" and both abort — the exact race this PR aims to prevent. SQLite is the default deployment. packages/core/src/db/workflows.ts:236-241
code-reviewer Lock-query catch leaks a pending row. When getActiveWorkflowRunByPath throws, the executor returns a DB-error message but does not cancel the row it just created. The path stays blocked for up to 5 min via the stale window with no user explanation — subsequent dispatches will be rejected by a ghost lock. packages/workflows/src/executor.ts:520-529

Important Issues (4 found)

Agent Issue Location
code-reviewer formatDuration(0) returns "0s" despite the comment saying sub-second rounds up to 1s. A run that just started will display "0s" in the lock message. Guard should be ms < 0 not ms <= 0. packages/workflows/src/utils/duration.ts:35
comment-analyzer Comment // We acquired the lock via createWorkflowRun, but lost the older-wins tiebreaker is inaccurate — createWorkflowRun is not a lock-acquisition mechanism; it creates a row. The lock is determined after creation by the query. Rewrite or remove. packages/workflows/src/executor.ts:492-493
type-design-analyzer excludeId + selfStartedAt must be passed together or not at all. Expressing them as two independent optionals lets future callers pass one without the other — the implementation silently degrades (skips tiebreaker). Collapse into self?: { excludeId; selfStartedAt } to make the paired-or-nothing invariant structural. packages/workflows/src/store.ts:46-56
docs-impact-agent workflow abandon docs say "marks as failed" — implementation calls cancelWorkflowRun → status is cancelled. User following the doc to diagnose would be confused. packages/docs-web/src/content/docs/reference/cli.md:174

Suggestions (7 found)

Agent Suggestion Location
pr-test-analyzer Add SQLite-dialect tests for the stale-orphan window boundary (4:59 / 5:00 / 5:01) and the numeric value (300 vs 300000). Currently only the PG clause is asserted. packages/core/src/db/workflows.test.ts
pr-test-analyzer Add a test where updateWorkflowRun throws during guard self-cancel — current tests only cover the success path. packages/workflows/src/executor.test.ts
pr-test-analyzer Add a result-level tiebreaker test with equal started_at and different ids to prove id-lexicographic ordering actually determines the winner. packages/core/src/db/workflows.test.ts
silent-failure-hunter Add cwd to workflow.guard_self_cancel_failed log context. packages/workflows/src/executor.ts:497-501
silent-failure-hunter Add pendingRunId: workflowRun?.id to db_active_workflow_check_failed log so operators can correlate the leaked row (related to Critical #2). packages/workflows/src/executor.ts:520-528
docs-impact-agent packages/docs-web/src/content/docs/reference/database.md:145 — "Prevents concurrent workflow execution" is now misleading. Reword to "Locks concurrent execution per working path; a second run on the same path is auto-cancelled with an actionable message." docs-web/reference/database.md:145
comment-analyzer Remove test-body comments that restate the test name ('passes self-id and started_at...', 'marks self as cancelled...', stale-pending window tests). WHY belongs in the implementation docstring, which already has it. executor.test.ts, workflows.test.ts

Strengths

  • parseDbTimestamp correctly handles the SQLite-without-Z UTC bug, with precise tests reproducing the original issue.
  • formatDuration is pure and well-tested (NaN, Infinity, negatives, sub-second rounding).
  • Resume orphan cleanup is properly classified as best-effort, logged at warn with err/orphanId/resumedRunId, and backed by the 5-min stale-window safety net.
  • IWorkflowStore.getActiveWorkflowRunByPath docstring accurately describes the three-part contract.
  • Dedicated tests for all four error paths (guard fires, guard self-cancel fails, lock query throws, resume-orphan cleanup fails).
  • Older-wins tiebreaker is the right design primitive (when it works on both dialects).

Verdict

NEEDS FIXES — Critical #1 silently defeats the PR's core guarantee under the default SQLite deployment; Critical #2 leaves a user-visible ghost lock on DB errors.

Recommended Actions

  1. Fix SQLite timestamp format. Either (a) format selfStartedAt as "YYYY-MM-DD HH:MM:SS" (UTC) when running SQLite, or (b) wrap both sides in datetime(...) in the SQLite clause so the comparison is date-based, not string-based. Add a SQLite-dialect tiebreaker test with near-simultaneous timestamps proving only the newer dispatch aborts.
  2. Cancel the pending row in the outer catch of the guard block, mirroring the if (activeWorkflow) branch's best-effort cancel + warn.
  3. Flip formatDuration guard to ms < 0 so 0ms elapsed displays as 1s.
  4. Fix the misleading executor.ts lock-acquisition comment and update docs-web/reference/cli.md:174 (abandoncancelled not failed).
  5. (Optional but worthwhile) Collapse excludeId + selfStartedAt into a self?: SelfContext pair so partial-call bugs become unrepresentable.

Re-run /prp-review-agents 1212 after fixes to verify.

CodeRabbit review on #1212 surfaced three real correctness gaps:

CRITICAL — resumeWorkflowRun preserved historical started_at, letting
a resumed row sort ahead of a currently-active holder in the lock
query's older-wins tiebreaker. Two active workflows could end up on
the same working_path. Fix: refresh started_at to NOW() in
resumeWorkflowRun. Original creation time is recoverable from
workflow_events history if needed for analytics.

MAJOR — lock-leak failure paths:
- If resumeWorkflowRun() throws, the orchestrator's pre-created row
  was left as 'pending' until the 5-min stale window. Fix: cancel
  preCreatedRun in the resume catch.
- If getActiveWorkflowRunByPath() throws, workflowRun (possibly
  already promoted to 'running' via resume) was left active with no
  auto-cleanup. Fix: cancel workflowRun in the guard catch.

MINOR — the blocking message always said "running" but the lock
query returns running, paused, AND fresh-pending rows. Telling a
user to "wait for it to finish" on a paused run (waiting on user
approval) would block them indefinitely. Fix: status-aware copy:
- paused: "paused waiting for user input" + approve/reject actions
- pending: "starting" verb
- running: keep current

Tests:
- New: resume refreshes started_at (asserts SQL contains
  `started_at = NOW()`)
- New: cancels preCreatedRun when resumeWorkflowRun throws
- New: cancels workflowRun when guard query throws
- New: paused message uses approve/reject actions, NOT "wait"
- New: pending message uses "starting" verb
- New: running message uses default copy
- Updated: existing tests for new error string ("already active"
  reflects status-aware semantics, not just "running")

Note: the user-facing error string changed from "already running on
this path" to "already active on this path (status)". Internal use
only — surfaced via getResult().error, not directly to users.
@Wirasm

Wirasm commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator Author

Review Response

All three CodeRabbit findings addressed in `008013da`. They were real correctness gaps, not style.

CRITICAL — `resumeWorkflowRun` preserved historical `started_at`

Confirmed: the resumed row could sort ahead of a currently-active holder in the older-wins tiebreaker, letting two active workflows land on the same `working_path`. Fix: refresh `started_at = NOW()` in `resumeWorkflowRun` so the resumed row competes with its current dispatch time. Original creation time is recoverable from `workflow_events` history if needed for analytics.

MAJOR — Lock-leak failure paths

Two cases left ghost locks:

Failure path Before After
`resumeWorkflowRun()` throws preCreatedRun left as 'pending' (5-min stale window only safety) Cancel preCreatedRun in resume catch
`getActiveWorkflowRunByPath()` throws workflowRun (possibly already 'running' via resume) left active with no auto-cleanup Cancel workflowRun in guard catch

Both use best-effort cleanup with a warn log on failure — same pattern as the success-path cancellation.

MINOR — Status-aware blocking message

The lock query returns running, paused, AND fresh-pending rows. The old message always said "running" and told the user to "wait for it to finish" — which is wrong for paused (run is waiting on user approval, will sit forever). New copy:

Status Verb Actions
`paused` "paused waiting for user input" approve, reject, cancel, different branch
`pending` "starting" wait, cancel, different branch
`running` "running" wait, cancel, different branch (unchanged)

Tests

8 new tests:

  • `resumeWorkflowRun` SQL contains `started_at = NOW()`
  • preCreatedRun cancelled when resumeWorkflowRun throws
  • workflowRun cancelled when guard query throws
  • paused message uses approve/reject (asserts `Wait for it to finish` is NOT in the message)
  • pending message uses "starting" verb
  • running message uses default copy
  • Updated 2 existing tests for new error string ("already active" reflects status-aware semantics; surfaced internally only via `result.error`)

Validation

  • `bun run type-check`: clean
  • `bun run lint`: zero warnings
  • 101 tests pass across affected files (8 new)
  • Full suite: green

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/workflows/src/executor-preamble.test.ts`:
- Around line 294-297: Update the test comment to reflect the new cleanup
behavior: instead of stating the newly-created row is left 'pending' to expire
via the 5-minute stale window, note that when the guard/lock query fails the
executor immediately cancels the just-created run. In the test around the mocked
call to store.createWorkflowRun, change the explanatory comment accordingly and
add an assertion that the run transitioned to `cancelled` (i.e., assert the
mock/call that records the cancelled transition was invoked) so the test
documents and verifies the new cancellation contract.

In `@packages/workflows/src/executor.test.ts`:
- Around line 260-291: The test uses a live Date.now() delta to assert "2m 5s"
which can flake; make the test deterministic by freezing time or mocking the
clock before creating started_at and calling executeWorkflow (or else relax the
assertion to a tolerant regex). Specifically, in the test that constructs
otherRun (via makeRun) and sets started_at, install a fake timer or mock
Date.now() to a fixed timestamp so executeWorkflow/formatted message will always
produce the expected duration string, then restore timers; alternatively change
the sentMessage assertions around sendMessageSpy/sentMessage to match a pattern
that accepts both "2m 5s" and "2m 6s". Ensure you update the test that calls
executeWorkflow/sendMessageSpy so the timing is deterministic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec65e155-a71f-49d1-acce-ca185a4e98d9

📥 Commits

Reviewing files that changed from the base of the PR and between 301a64c and 008013d.

📒 Files selected for processing (5)
  • packages/core/src/db/workflows.test.ts
  • packages/core/src/db/workflows.ts
  • packages/workflows/src/executor-preamble.test.ts
  • packages/workflows/src/executor.test.ts
  • packages/workflows/src/executor.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/core/src/db/workflows.test.ts
  • packages/workflows/src/executor.ts
  • packages/core/src/db/workflows.ts

Comment on lines +294 to +297
// The row is created BEFORE the guard runs (so the guard can exclude
// self). When the lock query throws, we abort early — the just-created
// row stays as 'pending' and falls out via the 5-min stale window.
expect((store.createWorkflowRun as ReturnType<typeof mock>).mock.calls.length).toBe(1);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update this comment to match the new cleanup behavior.

Lines 294-297 still describe the old "leave it pending and wait for the stale window" path, but the executor now cancels the just-created row when the guard query fails. Please rewrite the comment and, if this file is meant to document the contract, assert that cancelled transition here too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor-preamble.test.ts` around lines 294 - 297,
Update the test comment to reflect the new cleanup behavior: instead of stating
the newly-created row is left 'pending' to expire via the 5-minute stale window,
note that when the guard/lock query fails the executor immediately cancels the
just-created run. In the test around the mocked call to store.createWorkflowRun,
change the explanatory comment accordingly and add an assertion that the run
transitioned to `cancelled` (i.e., assert the mock/call that records the
cancelled transition was invoked) so the test documents and verifies the new
cancellation contract.

Comment on lines +260 to +291
it('uses the actionable "in use" message format with workflow name, duration, and short id', async () => {
const otherRun = makeRun({
id: 'abc12345-rest-of-uuid',
workflow_name: 'archon-implement',
status: 'running',
started_at: new Date(Date.now() - 125000).toISOString(), // 2m 5s ago
});
const sendMessageSpy = mock(async () => {});
const platform = {
sendMessage: sendMessageSpy,
getPlatformType: mock(() => 'test' as const),
} as unknown as IWorkflowPlatform;
const store = makeStore({
getActiveWorkflowRunByPath: mock(async () => otherRun),
});
const deps = makeDeps(store);

await executeWorkflow(
deps,
platform,
'conv-1',
'/tmp',
makeWorkflow(),
'test message',
'db-conv-1'
);

expect(sendMessageSpy).toHaveBeenCalled();
const sentMessage = (sendMessageSpy.mock.calls[0] as [string, string])[1];
expect(sentMessage).toContain('archon-implement');
expect(sentMessage).toContain('abc12345');
expect(sentMessage).toContain('2m 5s');

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Freeze the clock for this exact duration assertion.

Line 291 hard-codes 2m 5s from a live Date.now() delta. If CI stalls for just over a second between creating started_at and formatting the message, the same correct behavior becomes 2m 6s and this test flakes. Pin the clock here or avoid asserting an exact second value.

Based on learnings, keep tests deterministic with no flaky timing or network dependence without guardrails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workflows/src/executor.test.ts` around lines 260 - 291, The test
uses a live Date.now() delta to assert "2m 5s" which can flake; make the test
deterministic by freezing time or mocking the clock before creating started_at
and calling executeWorkflow (or else relax the assertion to a tolerant regex).
Specifically, in the test that constructs otherRun (via makeRun) and sets
started_at, install a fake timer or mock Date.now() to a fixed timestamp so
executeWorkflow/formatted message will always produce the expected duration
string, then restore timers; alternatively change the sentMessage assertions
around sendMessageSpy/sentMessage to match a pattern that accepts both "2m 5s"
and "2m 6s". Ensure you update the test that calls
executeWorkflow/sendMessageSpy so the timing is deterministic.

CodeRabbit second review found one critical issue and several polish
items not addressed in 008013d.

CRITICAL — SQLite tiebreaker silently broken under default deployment.
SQLite stores started_at as TEXT "YYYY-MM-DD HH:MM:SS" (space sep).
Our ISO param is "YYYY-MM-DDTHH:MM:SS.mmmZ" (T sep). SQLite compares
text lexically: char 11 is space (0x20) in column vs T (0x54) in param,
so EVERY column value lex-sorts before EVERY ISO param. Result:
`started_at < $param` is always TRUE regardless of actual time. In
true concurrent dispatches, both sides see each other as "older" and
both abort — defeating the older-wins guarantee under SQLite, which
is the default deployment.

Fix: dialect-aware comparison in getActiveWorkflowRunByPath:
  - PostgreSQL: `started_at < $3::timestamptz` (TIMESTAMPTZ + cast)
  - SQLite: `datetime(started_at) < datetime($3)` (forces chronological
    via SQLite's date/time functions)

Documented with reproducer tests in adapters/sqlite.test.ts: lexical
returns wrong answer for "2026-04-14 12:00:00" < "2026-04-14T10:00:00Z";
datetime() returns correct answer.

Type design — collapse paired params into struct.
`excludeId` and `selfStartedAt` had to travel together (tiebreaker
references both) but were two independent optionals — future callers
could pass one without the other and silently degrade. Replaced with
a single `self?: { id: string; startedAt: Date }` to make the
paired-or-nothing invariant structural.

formatDuration(0) consistency.
Old: `if (ms <= 0) return '0s'` — special-cased 0ms despite the
"sub-second rounds up to 1s" comment. Fixed to `ms < 0` so 0ms
returns '1s' (a run that just started in the same DB second should
display as active, not literal zero).

Comment fix: "We acquired the lock via createWorkflowRun" was
misleading — createWorkflowRun creates a row; the lock is determined
later by the query.

Log context: added cwd to workflow.guard_self_cancel_failed and
pendingRunId to db_active_workflow_check_failed so operators can
correlate leaked rows.

Doc fixes:
- /workflow abandon doc said "marks as failed" — actually 'cancelled'
- database.md "Prevents concurrent workflow execution" → accurate
  description of path-based lock with stale-pending tolerance

Test additions:
- 3 SQLite-direct tests in adapters/sqlite.test.ts proving the
  lexical-vs-chronological bug and the datetime() fix
- Guard self-cancel update throw still surfaces failure to user

Signature change rippled through:
- IWorkflowStore.getActiveWorkflowRunByPath now takes (path, self?)
- All internal callers updated
@Wirasm

Wirasm commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator Author

Review Response (round 2)

The second review caught a real Critical I missed. Fixed in `04069eee`.

CRITICAL — SQLite tiebreaker silently broken

Confirmed the bug at the SQL level:

```sql
sqlite> SELECT '2026-04-14 12:00:00' < '2026-04-14T10:00:00.000Z';
1 -- WRONG: 12:00 should not be < 10:00
sqlite> SELECT datetime('2026-04-14 12:00:00') < datetime('2026-04-14T10:00:00.000Z');
0 -- correct
```

SQLite TEXT lex-compares char-by-char. Position 11 is space (0x20) in the column vs T (0x54) in the ISO param, so the condition was always TRUE regardless of actual time. Under true concurrent dispatch on SQLite (the default deployment), both runs saw each other as "older" and both aborted — defeating the older-wins guarantee.

Fix: dialect-aware comparison in `getActiveWorkflowRunByPath`:

  • PostgreSQL: `started_at < $3::timestamptz` (TIMESTAMPTZ column + explicit cast)
  • SQLite: `datetime(started_at) < datetime($3)` (forces chronological via date/time functions)

3 new SQLite-direct tests in `adapters/sqlite.test.ts` document the bug and prove the fix.

Important — collapsed paired params into struct

`excludeId` and `selfStartedAt` had to travel together but were two independent optionals → future callers could pass one without the other and silently degrade. Now:

```typescript
getActiveWorkflowRunByPath(workingPath: string, self?: { id: string; startedAt: Date })
```

Paired-or-nothing is now structural.

Important — formatDuration(0) consistency

Comment said "sub-second rounds up to 1s" but `ms <= 0` returned `'0s'`. Changed to `ms < 0` so 0ms (just-started run in same DB second) displays as `'1s'`.

Polish

  • Comment "We acquired the lock via createWorkflowRun" → accurate description (lock is determined by the query)
  • Added `cwd` to `workflow.guard_self_cancel_failed` log
  • Added `pendingRunId` to `db_active_workflow_check_failed` log
  • `/workflow abandon` doc said "marks as failed" → actually `'cancelled'`
  • `database.md` "Prevents concurrent workflow execution" → accurate description of path-lock with stale-pending tolerance
  • Test for guard self-cancel update throw

Note on Critical #2

The reviewer's "Lock-query catch leaks a pending row" was already addressed in `008013da` (cancel workflowRun in guard catch) — the review snapshot was taken before that commit. The `pendingRunId` log addition is the new structured-fields polish on top.

Validation

  • type-check: clean
  • lint: 0 warnings
  • 283 tests pass in affected files (4 new + 1 reframed)
  • Full suite: green

@Wirasm Wirasm merged commit 33d31c4 into dev Apr 14, 2026
4 checks passed
@Wirasm Wirasm deleted the fix/workflow-run-path-lock branch April 14, 2026 12:19
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…part 2) (coleam00#1212)

* fix: lock workflow runs by working_path (coleam00#1036, coleam00#1188 part 2)

Both bugs reduce to the same primitive: there's no enforced lock on
working_path, so two dispatches that resolve to the same filesystem
location can race. The DB row is the lock token; pending/running/paused
are "lock held"; terminal statuses release.

Changes:

- getActiveWorkflowRunByPath includes `pending` (with 5-min stale-orphan
  age window), accepts excludeId + selfStartedAt, and orders by
  (started_at ASC, id ASC) for a deterministic older-wins tiebreaker.
  Eliminates the both-abort race where two near-simultaneous dispatches
  with similar timestamps could mutually abort each other.

- Move the executor's guard call site to AFTER workflowRun is finalized
  (preCreated, resumed, or freshly created). This guarantees we always
  have self-ID + started_at to pass to the lock query.

- On guard fire after row creation: mark self as 'cancelled' so we don't
  leave a zombie pending row that would then become its own lock holder.

- New error message includes workflow name, duration, short run id, and
  three concrete next-action commands (status / cancel / different
  branch). Replaces the vague "Workflow already running".

- Resume orphan fix: when executor activates a resumable run, mark the
  orchestrator's pre-created row as 'cancelled'. Without this, every
  resume leaks a pending row that would block the user's own
  back-to-back resume until the 5-min stale window.

- New formatDuration helper for the error message (8 unit tests).

Tests:

- 5 new tests in db/workflows.test.ts: pending in active set, age window,
  excludeId exclusion, tiebreaker SQL shape, ordering.
- 5 new tests in executor.test.ts: self-id passed to query, self-cancel
  on guard fire, new message format, resume orphan cancellation,
  resume proceeds even if orphan cancel fails.
- Updated 2 executor-preamble tests for new structural behavior
  (row-then-guard, new message format).
- 8 new tests for formatDuration.

Deferred (kept scope tight):
- Worktree-layer advisory lockfile (residual coleam00#1188.2 microsecond race
  where both dispatches reach provider.create — bounded by git's own
  atomicity for `worktree add`).
- Startup cleanup of pre-existing stale pending rows (5-min age window
  makes them harmless).
- DB partial UNIQUE constraint migration (code-only is sufficient).

Fixes coleam00#1036
Fixes coleam00#1188 (part 2)

* fix: SQLite Date binding + UTC timestamp parse for path lock guard

Two issues found during E2E smoke testing:

1. bun:sqlite rejects Date objects as bindings ("Binding expected
   string, TypedArray, boolean, number, bigint or null"). Serialize
   selfStartedAt to ISO string before passing — PostgreSQL accepts
   ISO strings for TIMESTAMPTZ comparison too.

2. SQLite returns datetimes as plain strings without timezone suffix
   ("YYYY-MM-DD HH:MM:SS"), and JS new Date() parses such strings as
   local time. The blocking message was showing "running 3h" for
   workflows started seconds ago in a UTC+3 timezone.

   Added parseDbTimestamp helper that:
   - Returns Date.getTime() unchanged for Date inputs (PG path)
   - Treats SQLite-style strings as UTC by appending Z

   Used at both call sites: the lock query (selfStartedAt) and the
   blocking message duration.

Tests:
- 4 new tests in duration.test.ts for parseDbTimestamp covering
  Date input, SQLite UTC interpretation, explicit Z, and explicit
  +/-HH:MM offsets.
- Updated workflows.test.ts assertion for ISO serialization.

E2E smoke verified end-to-end:
- Sanity (single dispatch) succeeds.
- Two concurrent --no-worktree dispatches: one wins, one blocked
  with actionable message showing correct "Xs" duration.
- Resume + back-to-back resume both succeed (orphan correctly
  cancelled when resume activates).

* fix: address review — resume timestamp, lock-leak paths, status copy

CodeRabbit review on coleam00#1212 surfaced three real correctness gaps:

CRITICAL — resumeWorkflowRun preserved historical started_at, letting
a resumed row sort ahead of a currently-active holder in the lock
query's older-wins tiebreaker. Two active workflows could end up on
the same working_path. Fix: refresh started_at to NOW() in
resumeWorkflowRun. Original creation time is recoverable from
workflow_events history if needed for analytics.

MAJOR — lock-leak failure paths:
- If resumeWorkflowRun() throws, the orchestrator's pre-created row
  was left as 'pending' until the 5-min stale window. Fix: cancel
  preCreatedRun in the resume catch.
- If getActiveWorkflowRunByPath() throws, workflowRun (possibly
  already promoted to 'running' via resume) was left active with no
  auto-cleanup. Fix: cancel workflowRun in the guard catch.

MINOR — the blocking message always said "running" but the lock
query returns running, paused, AND fresh-pending rows. Telling a
user to "wait for it to finish" on a paused run (waiting on user
approval) would block them indefinitely. Fix: status-aware copy:
- paused: "paused waiting for user input" + approve/reject actions
- pending: "starting" verb
- running: keep current

Tests:
- New: resume refreshes started_at (asserts SQL contains
  `started_at = NOW()`)
- New: cancels preCreatedRun when resumeWorkflowRun throws
- New: cancels workflowRun when guard query throws
- New: paused message uses approve/reject actions, NOT "wait"
- New: pending message uses "starting" verb
- New: running message uses default copy
- Updated: existing tests for new error string ("already active"
  reflects status-aware semantics, not just "running")

Note: the user-facing error string changed from "already running on
this path" to "already active on this path (status)". Internal use
only — surfaced via getResult().error, not directly to users.

* fix: SQLite tiebreaker dialect bug + paired self struct + UX polish

CodeRabbit second review found one critical issue and several polish
items not addressed in 008013d.

CRITICAL — SQLite tiebreaker silently broken under default deployment.
SQLite stores started_at as TEXT "YYYY-MM-DD HH:MM:SS" (space sep).
Our ISO param is "YYYY-MM-DDTHH:MM:SS.mmmZ" (T sep). SQLite compares
text lexically: char 11 is space (0x20) in column vs T (0x54) in param,
so EVERY column value lex-sorts before EVERY ISO param. Result:
`started_at < $param` is always TRUE regardless of actual time. In
true concurrent dispatches, both sides see each other as "older" and
both abort — defeating the older-wins guarantee under SQLite, which
is the default deployment.

Fix: dialect-aware comparison in getActiveWorkflowRunByPath:
  - PostgreSQL: `started_at < $3::timestamptz` (TIMESTAMPTZ + cast)
  - SQLite: `datetime(started_at) < datetime($3)` (forces chronological
    via SQLite's date/time functions)

Documented with reproducer tests in adapters/sqlite.test.ts: lexical
returns wrong answer for "2026-04-14 12:00:00" < "2026-04-14T10:00:00Z";
datetime() returns correct answer.

Type design — collapse paired params into struct.
`excludeId` and `selfStartedAt` had to travel together (tiebreaker
references both) but were two independent optionals — future callers
could pass one without the other and silently degrade. Replaced with
a single `self?: { id: string; startedAt: Date }` to make the
paired-or-nothing invariant structural.

formatDuration(0) consistency.
Old: `if (ms <= 0) return '0s'` — special-cased 0ms despite the
"sub-second rounds up to 1s" comment. Fixed to `ms < 0` so 0ms
returns '1s' (a run that just started in the same DB second should
display as active, not literal zero).

Comment fix: "We acquired the lock via createWorkflowRun" was
misleading — createWorkflowRun creates a row; the lock is determined
later by the query.

Log context: added cwd to workflow.guard_self_cancel_failed and
pendingRunId to db_active_workflow_check_failed so operators can
correlate leaked rows.

Doc fixes:
- /workflow abandon doc said "marks as failed" — actually 'cancelled'
- database.md "Prevents concurrent workflow execution" → accurate
  description of path-based lock with stale-pending tolerance

Test additions:
- 3 SQLite-direct tests in adapters/sqlite.test.ts proving the
  lexical-vs-chronological bug and the datetime() fix
- Guard self-cancel update throw still surfaces failure to user

Signature change rippled through:
- IWorkflowStore.getActiveWorkflowRunByPath now takes (path, self?)
- All internal callers updated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant