fix: lock workflow runs by working_path (#1036, #1188 part 2)#1212
Conversation
Both bugs reduce to the same primitive: there's no enforced lock on working_path, so two dispatches that resolve to the same filesystem location can race. The DB row is the lock token; pending/running/paused are "lock held"; terminal statuses release. Changes: - getActiveWorkflowRunByPath includes `pending` (with 5-min stale-orphan age window), accepts excludeId + selfStartedAt, and orders by (started_at ASC, id ASC) for a deterministic older-wins tiebreaker. Eliminates the both-abort race where two near-simultaneous dispatches with similar timestamps could mutually abort each other. - Move the executor's guard call site to AFTER workflowRun is finalized (preCreated, resumed, or freshly created). This guarantees we always have self-ID + started_at to pass to the lock query. - On guard fire after row creation: mark self as 'cancelled' so we don't leave a zombie pending row that would then become its own lock holder. - New error message includes workflow name, duration, short run id, and three concrete next-action commands (status / cancel / different branch). Replaces the vague "Workflow already running". - Resume orphan fix: when executor activates a resumable run, mark the orchestrator's pre-created row as 'cancelled'. Without this, every resume leaks a pending row that would block the user's own back-to-back resume until the 5-min stale window. - New formatDuration helper for the error message (8 unit tests). Tests: - 5 new tests in db/workflows.test.ts: pending in active set, age window, excludeId exclusion, tiebreaker SQL shape, ordering. - 5 new tests in executor.test.ts: self-id passed to query, self-cancel on guard fire, new message format, resume orphan cancellation, resume proceeds even if orphan cancel fails. - Updated 2 executor-preamble tests for new structural behavior (row-then-guard, new message format). - 8 new tests for formatDuration. Deferred (kept scope tight): - Worktree-layer advisory lockfile (residual #1188.2 microsecond race where both dispatches reach provider.create — bounded by git's own atomicity for `worktree add`). - Startup cleanup of pre-existing stale pending rows (5-min age window makes them harmless). - DB partial UNIQUE constraint migration (code-only is sufficient). Fixes #1036 Fixes #1188 (part 2)
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (5)
📝 WalkthroughWalkthroughgetActiveWorkflowRunByPath now accepts an optional self object and treats recent Changes
Sequence Diagram(s)sequenceDiagram
participant Orch as Orchestrator
participant Exec as Executor
participant Store as WorkflowStore
participant DB as Database
Orch->>DB: INSERT pre-created row (status='pending')
Orch->>Exec: executeWorkflow(workingPath, runId, startedAt)
rect rgba(120,160,200,0.5)
Note over Exec,Store: Path-lock guard (self-excluding)
Exec->>Store: getActiveWorkflowRunByPath(workingPath, self={id: runId, startedAt})
Store->>DB: SELECT ... WHERE working_path=$1 AND (status IN ('running','paused') OR (status='pending' AND started_at > cutoff)) AND optional exclusion/tiebreaker ORDER BY started_at ASC, id ASC LIMIT 1
alt active other run found
DB-->>Store: active_run
Store-->>Exec: active_run
Exec->>DB: updateWorkflowRun(runId, {status:'cancelled'}) (best-effort)
Exec-->>Orch: return failure + "in use" message (elapsed, id, commands)
else no conflict
DB-->>Store: null
Store-->>Exec: null
Exec->>Exec: proceed with execution (resume/create)
Exec-->>Orch: success
end
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Two issues found during E2E smoke testing:
1. bun:sqlite rejects Date objects as bindings ("Binding expected
string, TypedArray, boolean, number, bigint or null"). Serialize
selfStartedAt to ISO string before passing — PostgreSQL accepts
ISO strings for TIMESTAMPTZ comparison too.
2. SQLite returns datetimes as plain strings without timezone suffix
("YYYY-MM-DD HH:MM:SS"), and JS new Date() parses such strings as
local time. The blocking message was showing "running 3h" for
workflows started seconds ago in a UTC+3 timezone.
Added parseDbTimestamp helper that:
- Returns Date.getTime() unchanged for Date inputs (PG path)
- Treats SQLite-style strings as UTC by appending Z
Used at both call sites: the lock query (selfStartedAt) and the
blocking message duration.
Tests:
- 4 new tests in duration.test.ts for parseDbTimestamp covering
Date input, SQLite UTC interpretation, explicit Z, and explicit
+/-HH:MM offsets.
- Updated workflows.test.ts assertion for ISO serialization.
E2E smoke verified end-to-end:
- Sanity (single dispatch) succeeds.
- Two concurrent --no-worktree dispatches: one wins, one blocked
with actionable message showing correct "Xs" duration.
- Resume + back-to-back resume both succeed (orphan correctly
cancelled when resume activates).
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/workflows/src/executor.ts`:
- Around line 490-504: The user-facing message and error assume the blocker is
"running"; update the logic that formats the notification and returned error
(around sendCriticalMessage and the returned error where activeWorkflow is used)
to read and use activeWorkflow.status instead of hardcoding "running": map
status values like "running", "pending", "paused" to appropriate human
verbs/phrases (e.g., "running", "pending", "paused" or "paused (interactive)")
and change the suggestions accordingly (for paused suggest resume/unpause, for
pending suggest wait or cancel) so the message and the error string reflect the
actual status from getActiveWorkflowRunByPath(); ensure you still include
workflow_name and shortId in the text.
- Around line 375-401: The code only releases the pre-created run's lock when
resumeWorkflowRun succeeds, so if resumeWorkflowRun() or
getActiveWorkflowRunByPath() throws the preCreatedRun (orphanPreCreated) can
remain in pending and hold the working_path lock; update the logic around
resumeWorkflowRun (and the earlier getActiveWorkflowRunByPath code paths noted
at lines ~507-516) to ensure the preCreatedRun is released on all failure paths:
capture orphanPreCreated (preCreatedRun && preCreatedRun.id !== resumableRun.id)
before calling resumeWorkflowRun(), then wrap resumeWorkflowRun() (and any code
that may throw after obtaining resumableRun) in a try/catch/finally where in
catch/finally you call deps.store.updateWorkflowRun(orphanPreCreated.id, {
status: 'cancelled' }) and log errors the same way as the existing cleanup so
the pending preCreatedRun is always cancelled regardless of resume/getActive
errors.
- Around line 473-477: resumeWorkflowRun preserves the original started_at,
which lets a resumed row win the older-wins tiebreaker in
getActiveWorkflowRunByPath and bypass the path-lock guard; update
resumeWorkflowRun to refresh started_at (or add a resume_at) to the current
timestamp when marking a run resumed so the subsequent call to
getActiveWorkflowRunByPath uses the resume time for ordering; locate the
resumeWorkflowRun function and ensure it writes a new started_at (or resume_at)
alongside status/completed_at/last_activity_at so the path-lock comparison
treats the resumed attempt as fresh.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3bda85de-759e-426e-b2a1-fbfd4b4dec14
📒 Files selected for processing (5)
packages/core/src/db/workflows.test.tspackages/core/src/db/workflows.tspackages/workflows/src/executor.tspackages/workflows/src/utils/duration.test.tspackages/workflows/src/utils/duration.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/core/src/db/workflows.test.ts
- packages/workflows/src/utils/duration.ts
- packages/core/src/db/workflows.ts
PR Review Summary (6 specialized agents)Multi-agent review of the path-based workflow-run lock (fixes #1036 and #1188 part 2). Critical Issues (2 found)
Important Issues (4 found)
Suggestions (7 found)
Strengths
VerdictNEEDS FIXES — Critical #1 silently defeats the PR's core guarantee under the default SQLite deployment; Critical #2 leaves a user-visible ghost lock on DB errors. Recommended Actions
Re-run |
CodeRabbit review on #1212 surfaced three real correctness gaps: CRITICAL — resumeWorkflowRun preserved historical started_at, letting a resumed row sort ahead of a currently-active holder in the lock query's older-wins tiebreaker. Two active workflows could end up on the same working_path. Fix: refresh started_at to NOW() in resumeWorkflowRun. Original creation time is recoverable from workflow_events history if needed for analytics. MAJOR — lock-leak failure paths: - If resumeWorkflowRun() throws, the orchestrator's pre-created row was left as 'pending' until the 5-min stale window. Fix: cancel preCreatedRun in the resume catch. - If getActiveWorkflowRunByPath() throws, workflowRun (possibly already promoted to 'running' via resume) was left active with no auto-cleanup. Fix: cancel workflowRun in the guard catch. MINOR — the blocking message always said "running" but the lock query returns running, paused, AND fresh-pending rows. Telling a user to "wait for it to finish" on a paused run (waiting on user approval) would block them indefinitely. Fix: status-aware copy: - paused: "paused waiting for user input" + approve/reject actions - pending: "starting" verb - running: keep current Tests: - New: resume refreshes started_at (asserts SQL contains `started_at = NOW()`) - New: cancels preCreatedRun when resumeWorkflowRun throws - New: cancels workflowRun when guard query throws - New: paused message uses approve/reject actions, NOT "wait" - New: pending message uses "starting" verb - New: running message uses default copy - Updated: existing tests for new error string ("already active" reflects status-aware semantics, not just "running") Note: the user-facing error string changed from "already running on this path" to "already active on this path (status)". Internal use only — surfaced via getResult().error, not directly to users.
Review ResponseAll three CodeRabbit findings addressed in `008013da`. They were real correctness gaps, not style. CRITICAL — `resumeWorkflowRun` preserved historical `started_at`Confirmed: the resumed row could sort ahead of a currently-active holder in the older-wins tiebreaker, letting two active workflows land on the same `working_path`. Fix: refresh `started_at = NOW()` in `resumeWorkflowRun` so the resumed row competes with its current dispatch time. Original creation time is recoverable from `workflow_events` history if needed for analytics. MAJOR — Lock-leak failure pathsTwo cases left ghost locks:
Both use best-effort cleanup with a warn log on failure — same pattern as the success-path cancellation. MINOR — Status-aware blocking messageThe lock query returns running, paused, AND fresh-pending rows. The old message always said "running" and told the user to "wait for it to finish" — which is wrong for paused (run is waiting on user approval, will sit forever). New copy:
Tests8 new tests:
Validation
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/workflows/src/executor-preamble.test.ts`:
- Around line 294-297: Update the test comment to reflect the new cleanup
behavior: instead of stating the newly-created row is left 'pending' to expire
via the 5-minute stale window, note that when the guard/lock query fails the
executor immediately cancels the just-created run. In the test around the mocked
call to store.createWorkflowRun, change the explanatory comment accordingly and
add an assertion that the run transitioned to `cancelled` (i.e., assert the
mock/call that records the cancelled transition was invoked) so the test
documents and verifies the new cancellation contract.
In `@packages/workflows/src/executor.test.ts`:
- Around line 260-291: The test uses a live Date.now() delta to assert "2m 5s"
which can flake; make the test deterministic by freezing time or mocking the
clock before creating started_at and calling executeWorkflow (or else relax the
assertion to a tolerant regex). Specifically, in the test that constructs
otherRun (via makeRun) and sets started_at, install a fake timer or mock
Date.now() to a fixed timestamp so executeWorkflow/formatted message will always
produce the expected duration string, then restore timers; alternatively change
the sentMessage assertions around sendMessageSpy/sentMessage to match a pattern
that accepts both "2m 5s" and "2m 6s". Ensure you update the test that calls
executeWorkflow/sendMessageSpy so the timing is deterministic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec65e155-a71f-49d1-acce-ca185a4e98d9
📒 Files selected for processing (5)
packages/core/src/db/workflows.test.tspackages/core/src/db/workflows.tspackages/workflows/src/executor-preamble.test.tspackages/workflows/src/executor.test.tspackages/workflows/src/executor.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/core/src/db/workflows.test.ts
- packages/workflows/src/executor.ts
- packages/core/src/db/workflows.ts
| // The row is created BEFORE the guard runs (so the guard can exclude | ||
| // self). When the lock query throws, we abort early — the just-created | ||
| // row stays as 'pending' and falls out via the 5-min stale window. | ||
| expect((store.createWorkflowRun as ReturnType<typeof mock>).mock.calls.length).toBe(1); |
There was a problem hiding this comment.
Update this comment to match the new cleanup behavior.
Lines 294-297 still describe the old "leave it pending and wait for the stale window" path, but the executor now cancels the just-created row when the guard query fails. Please rewrite the comment and, if this file is meant to document the contract, assert that cancelled transition here too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/workflows/src/executor-preamble.test.ts` around lines 294 - 297,
Update the test comment to reflect the new cleanup behavior: instead of stating
the newly-created row is left 'pending' to expire via the 5-minute stale window,
note that when the guard/lock query fails the executor immediately cancels the
just-created run. In the test around the mocked call to store.createWorkflowRun,
change the explanatory comment accordingly and add an assertion that the run
transitioned to `cancelled` (i.e., assert the mock/call that records the
cancelled transition was invoked) so the test documents and verifies the new
cancellation contract.
| it('uses the actionable "in use" message format with workflow name, duration, and short id', async () => { | ||
| const otherRun = makeRun({ | ||
| id: 'abc12345-rest-of-uuid', | ||
| workflow_name: 'archon-implement', | ||
| status: 'running', | ||
| started_at: new Date(Date.now() - 125000).toISOString(), // 2m 5s ago | ||
| }); | ||
| const sendMessageSpy = mock(async () => {}); | ||
| const platform = { | ||
| sendMessage: sendMessageSpy, | ||
| getPlatformType: mock(() => 'test' as const), | ||
| } as unknown as IWorkflowPlatform; | ||
| const store = makeStore({ | ||
| getActiveWorkflowRunByPath: mock(async () => otherRun), | ||
| }); | ||
| const deps = makeDeps(store); | ||
|
|
||
| await executeWorkflow( | ||
| deps, | ||
| platform, | ||
| 'conv-1', | ||
| '/tmp', | ||
| makeWorkflow(), | ||
| 'test message', | ||
| 'db-conv-1' | ||
| ); | ||
|
|
||
| expect(sendMessageSpy).toHaveBeenCalled(); | ||
| const sentMessage = (sendMessageSpy.mock.calls[0] as [string, string])[1]; | ||
| expect(sentMessage).toContain('archon-implement'); | ||
| expect(sentMessage).toContain('abc12345'); | ||
| expect(sentMessage).toContain('2m 5s'); |
There was a problem hiding this comment.
Freeze the clock for this exact duration assertion.
Line 291 hard-codes 2m 5s from a live Date.now() delta. If CI stalls for just over a second between creating started_at and formatting the message, the same correct behavior becomes 2m 6s and this test flakes. Pin the clock here or avoid asserting an exact second value.
Based on learnings, keep tests deterministic with no flaky timing or network dependence without guardrails.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/workflows/src/executor.test.ts` around lines 260 - 291, The test
uses a live Date.now() delta to assert "2m 5s" which can flake; make the test
deterministic by freezing time or mocking the clock before creating started_at
and calling executeWorkflow (or else relax the assertion to a tolerant regex).
Specifically, in the test that constructs otherRun (via makeRun) and sets
started_at, install a fake timer or mock Date.now() to a fixed timestamp so
executeWorkflow/formatted message will always produce the expected duration
string, then restore timers; alternatively change the sentMessage assertions
around sendMessageSpy/sentMessage to match a pattern that accepts both "2m 5s"
and "2m 6s". Ensure you update the test that calls
executeWorkflow/sendMessageSpy so the timing is deterministic.
CodeRabbit second review found one critical issue and several polish items not addressed in 008013d. CRITICAL — SQLite tiebreaker silently broken under default deployment. SQLite stores started_at as TEXT "YYYY-MM-DD HH:MM:SS" (space sep). Our ISO param is "YYYY-MM-DDTHH:MM:SS.mmmZ" (T sep). SQLite compares text lexically: char 11 is space (0x20) in column vs T (0x54) in param, so EVERY column value lex-sorts before EVERY ISO param. Result: `started_at < $param` is always TRUE regardless of actual time. In true concurrent dispatches, both sides see each other as "older" and both abort — defeating the older-wins guarantee under SQLite, which is the default deployment. Fix: dialect-aware comparison in getActiveWorkflowRunByPath: - PostgreSQL: `started_at < $3::timestamptz` (TIMESTAMPTZ + cast) - SQLite: `datetime(started_at) < datetime($3)` (forces chronological via SQLite's date/time functions) Documented with reproducer tests in adapters/sqlite.test.ts: lexical returns wrong answer for "2026-04-14 12:00:00" < "2026-04-14T10:00:00Z"; datetime() returns correct answer. Type design — collapse paired params into struct. `excludeId` and `selfStartedAt` had to travel together (tiebreaker references both) but were two independent optionals — future callers could pass one without the other and silently degrade. Replaced with a single `self?: { id: string; startedAt: Date }` to make the paired-or-nothing invariant structural. formatDuration(0) consistency. Old: `if (ms <= 0) return '0s'` — special-cased 0ms despite the "sub-second rounds up to 1s" comment. Fixed to `ms < 0` so 0ms returns '1s' (a run that just started in the same DB second should display as active, not literal zero). Comment fix: "We acquired the lock via createWorkflowRun" was misleading — createWorkflowRun creates a row; the lock is determined later by the query. Log context: added cwd to workflow.guard_self_cancel_failed and pendingRunId to db_active_workflow_check_failed so operators can correlate leaked rows. Doc fixes: - /workflow abandon doc said "marks as failed" — actually 'cancelled' - database.md "Prevents concurrent workflow execution" → accurate description of path-based lock with stale-pending tolerance Test additions: - 3 SQLite-direct tests in adapters/sqlite.test.ts proving the lexical-vs-chronological bug and the datetime() fix - Guard self-cancel update throw still surfaces failure to user Signature change rippled through: - IWorkflowStore.getActiveWorkflowRunByPath now takes (path, self?) - All internal callers updated
Review Response (round 2)The second review caught a real Critical I missed. Fixed in `04069eee`. CRITICAL — SQLite tiebreaker silently brokenConfirmed the bug at the SQL level: ```sql SQLite TEXT lex-compares char-by-char. Position 11 is space (0x20) in the column vs T (0x54) in the ISO param, so the condition was always TRUE regardless of actual time. Under true concurrent dispatch on SQLite (the default deployment), both runs saw each other as "older" and both aborted — defeating the older-wins guarantee. Fix: dialect-aware comparison in `getActiveWorkflowRunByPath`:
3 new SQLite-direct tests in `adapters/sqlite.test.ts` document the bug and prove the fix. Important — collapsed paired params into struct`excludeId` and `selfStartedAt` had to travel together but were two independent optionals → future callers could pass one without the other and silently degrade. Now: ```typescript Paired-or-nothing is now structural. Important — formatDuration(0) consistencyComment said "sub-second rounds up to 1s" but `ms <= 0` returned `'0s'`. Changed to `ms < 0` so 0ms (just-started run in same DB second) displays as `'1s'`. Polish
Note on Critical #2The reviewer's "Lock-query catch leaks a pending row" was already addressed in `008013da` (cancel workflowRun in guard catch) — the review snapshot was taken before that commit. The `pendingRunId` log addition is the new structured-fields polish on top. Validation
|
…part 2) (coleam00#1212) * fix: lock workflow runs by working_path (coleam00#1036, coleam00#1188 part 2) Both bugs reduce to the same primitive: there's no enforced lock on working_path, so two dispatches that resolve to the same filesystem location can race. The DB row is the lock token; pending/running/paused are "lock held"; terminal statuses release. Changes: - getActiveWorkflowRunByPath includes `pending` (with 5-min stale-orphan age window), accepts excludeId + selfStartedAt, and orders by (started_at ASC, id ASC) for a deterministic older-wins tiebreaker. Eliminates the both-abort race where two near-simultaneous dispatches with similar timestamps could mutually abort each other. - Move the executor's guard call site to AFTER workflowRun is finalized (preCreated, resumed, or freshly created). This guarantees we always have self-ID + started_at to pass to the lock query. - On guard fire after row creation: mark self as 'cancelled' so we don't leave a zombie pending row that would then become its own lock holder. - New error message includes workflow name, duration, short run id, and three concrete next-action commands (status / cancel / different branch). Replaces the vague "Workflow already running". - Resume orphan fix: when executor activates a resumable run, mark the orchestrator's pre-created row as 'cancelled'. Without this, every resume leaks a pending row that would block the user's own back-to-back resume until the 5-min stale window. - New formatDuration helper for the error message (8 unit tests). Tests: - 5 new tests in db/workflows.test.ts: pending in active set, age window, excludeId exclusion, tiebreaker SQL shape, ordering. - 5 new tests in executor.test.ts: self-id passed to query, self-cancel on guard fire, new message format, resume orphan cancellation, resume proceeds even if orphan cancel fails. - Updated 2 executor-preamble tests for new structural behavior (row-then-guard, new message format). - 8 new tests for formatDuration. Deferred (kept scope tight): - Worktree-layer advisory lockfile (residual coleam00#1188.2 microsecond race where both dispatches reach provider.create — bounded by git's own atomicity for `worktree add`). - Startup cleanup of pre-existing stale pending rows (5-min age window makes them harmless). - DB partial UNIQUE constraint migration (code-only is sufficient). Fixes coleam00#1036 Fixes coleam00#1188 (part 2) * fix: SQLite Date binding + UTC timestamp parse for path lock guard Two issues found during E2E smoke testing: 1. bun:sqlite rejects Date objects as bindings ("Binding expected string, TypedArray, boolean, number, bigint or null"). Serialize selfStartedAt to ISO string before passing — PostgreSQL accepts ISO strings for TIMESTAMPTZ comparison too. 2. SQLite returns datetimes as plain strings without timezone suffix ("YYYY-MM-DD HH:MM:SS"), and JS new Date() parses such strings as local time. The blocking message was showing "running 3h" for workflows started seconds ago in a UTC+3 timezone. Added parseDbTimestamp helper that: - Returns Date.getTime() unchanged for Date inputs (PG path) - Treats SQLite-style strings as UTC by appending Z Used at both call sites: the lock query (selfStartedAt) and the blocking message duration. Tests: - 4 new tests in duration.test.ts for parseDbTimestamp covering Date input, SQLite UTC interpretation, explicit Z, and explicit +/-HH:MM offsets. - Updated workflows.test.ts assertion for ISO serialization. E2E smoke verified end-to-end: - Sanity (single dispatch) succeeds. - Two concurrent --no-worktree dispatches: one wins, one blocked with actionable message showing correct "Xs" duration. - Resume + back-to-back resume both succeed (orphan correctly cancelled when resume activates). * fix: address review — resume timestamp, lock-leak paths, status copy CodeRabbit review on coleam00#1212 surfaced three real correctness gaps: CRITICAL — resumeWorkflowRun preserved historical started_at, letting a resumed row sort ahead of a currently-active holder in the lock query's older-wins tiebreaker. Two active workflows could end up on the same working_path. Fix: refresh started_at to NOW() in resumeWorkflowRun. Original creation time is recoverable from workflow_events history if needed for analytics. MAJOR — lock-leak failure paths: - If resumeWorkflowRun() throws, the orchestrator's pre-created row was left as 'pending' until the 5-min stale window. Fix: cancel preCreatedRun in the resume catch. - If getActiveWorkflowRunByPath() throws, workflowRun (possibly already promoted to 'running' via resume) was left active with no auto-cleanup. Fix: cancel workflowRun in the guard catch. MINOR — the blocking message always said "running" but the lock query returns running, paused, AND fresh-pending rows. Telling a user to "wait for it to finish" on a paused run (waiting on user approval) would block them indefinitely. Fix: status-aware copy: - paused: "paused waiting for user input" + approve/reject actions - pending: "starting" verb - running: keep current Tests: - New: resume refreshes started_at (asserts SQL contains `started_at = NOW()`) - New: cancels preCreatedRun when resumeWorkflowRun throws - New: cancels workflowRun when guard query throws - New: paused message uses approve/reject actions, NOT "wait" - New: pending message uses "starting" verb - New: running message uses default copy - Updated: existing tests for new error string ("already active" reflects status-aware semantics, not just "running") Note: the user-facing error string changed from "already running on this path" to "already active on this path (status)". Internal use only — surfaced via getResult().error, not directly to users. * fix: SQLite tiebreaker dialect bug + paired self struct + UX polish CodeRabbit second review found one critical issue and several polish items not addressed in 008013d. CRITICAL — SQLite tiebreaker silently broken under default deployment. SQLite stores started_at as TEXT "YYYY-MM-DD HH:MM:SS" (space sep). Our ISO param is "YYYY-MM-DDTHH:MM:SS.mmmZ" (T sep). SQLite compares text lexically: char 11 is space (0x20) in column vs T (0x54) in param, so EVERY column value lex-sorts before EVERY ISO param. Result: `started_at < $param` is always TRUE regardless of actual time. In true concurrent dispatches, both sides see each other as "older" and both abort — defeating the older-wins guarantee under SQLite, which is the default deployment. Fix: dialect-aware comparison in getActiveWorkflowRunByPath: - PostgreSQL: `started_at < $3::timestamptz` (TIMESTAMPTZ + cast) - SQLite: `datetime(started_at) < datetime($3)` (forces chronological via SQLite's date/time functions) Documented with reproducer tests in adapters/sqlite.test.ts: lexical returns wrong answer for "2026-04-14 12:00:00" < "2026-04-14T10:00:00Z"; datetime() returns correct answer. Type design — collapse paired params into struct. `excludeId` and `selfStartedAt` had to travel together (tiebreaker references both) but were two independent optionals — future callers could pass one without the other and silently degrade. Replaced with a single `self?: { id: string; startedAt: Date }` to make the paired-or-nothing invariant structural. formatDuration(0) consistency. Old: `if (ms <= 0) return '0s'` — special-cased 0ms despite the "sub-second rounds up to 1s" comment. Fixed to `ms < 0` so 0ms returns '1s' (a run that just started in the same DB second should display as active, not literal zero). Comment fix: "We acquired the lock via createWorkflowRun" was misleading — createWorkflowRun creates a row; the lock is determined later by the query. Log context: added cwd to workflow.guard_self_cancel_failed and pendingRunId to db_active_workflow_check_failed so operators can correlate leaked rows. Doc fixes: - /workflow abandon doc said "marks as failed" — actually 'cancelled' - database.md "Prevents concurrent workflow execution" → accurate description of path-based lock with stale-pending tolerance Test additions: - 3 SQLite-direct tests in adapters/sqlite.test.ts proving the lexical-vs-chronological bug and the datetime() fix - Guard self-cancel update throw still surfaces failure to user Signature change rippled through: - IWorkflowStore.getActiveWorkflowRunByPath now takes (path, self?) - All internal callers updated
Summary
Both #1036 (concurrent-run guard TOCTOU) and #1188 part 2 (parallel worktree collision) reduce to the same primitive: there's no enforced lock on `working_path`, so two dispatches that resolve to the same filesystem location can race.
This PR treats the workflow_runs row as a lock token. `pending / running / paused` = lock held; terminal statuses release.
Fixes #1036
Fixes #1188 (part 2)
What changed
Design — why this approach
First-principles framing: the resource being protected is the filesystem (`working_path`), not workflow run identity. The DB row is the lock token.
Why the older-wins tiebreaker matters: without ordering by `(started_at, id)`, two dispatches with similar timestamps could mutually see each other and BOTH abort. With the total order, both always agree which is "first" — only the newer aborts.
Why the 5-min stale window: orphaned `pending` rows from crashed dispatches or resume-replaced rows would otherwise permanently block a path. The window is conservative — workflows transition pending→running in seconds.
Why move the guard: the old guard ran BEFORE `workflowRun` was finalized (no self-ID to exclude). Moving it after lets us pass self-ID + started_at to the query, eliminating "self-detection" false positives and enabling the tiebreaker.
Resume orphan fix: when executor activates resume (`resumableRun` replaces `workflowRun`), the orchestrator's pre-created row is now redundant. Without releasing it, the user's back-to-back resume would block on its own ghost lock until the 5-min window. Best-effort cancellation; if it fails, the stale-pending window is the safety net.
New error message:
```
❌ This worktree is in use by `archon-implement` (running 2m 5s, run `abc12345`).
• Wait for it to finish: `/workflow status`
• Cancel it: `/workflow cancel abc12345`
• Use a different branch: `--branch `
```
Every line is a concrete next action.
Deferred (intentional)
Testing
Manual reproduction (recommended verification)
```bash
#1036 reproduction
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree "run1" &
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree "run2" &
wait
Expected: one succeeds; the other shows the new "in use" message
#1188.2 reproduction
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --branch test/collide "a" &
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --branch test/collide "b" &
wait
Resume regression (back-to-back resume must NOT be blocked)
1. Run a failing workflow
2. Resume it
3. Immediately resume again → must succeed
```
Summary by CodeRabbit
Bug Fixes
New Features
Tests
Documentation