Skip to content

fix(release-daemon): multi-repo discovery + SIGTERM lock release + security hardening#57

Merged
anbangr merged 15 commits into
mainfrom
fix/release-daemon-multi-repo-discovery
May 19, 2026
Merged

fix(release-daemon): multi-repo discovery + SIGTERM lock release + security hardening#57
anbangr merged 15 commits into
mainfrom
fix/release-daemon-multi-repo-discovery

Conversation

@anbangr

@anbangr anbangr commented May 19, 2026

Copy link
Copy Markdown
Owner

Summary

This PR fixes two operational gaps in the release-daemon that surfaced during a real session debugging the production daemon:

Multi-repo discovery. The daemon was started with a single --project-root but the release-queue spans every repo the build orchestrator queues PRs from (mitosis-control-plane, mitosis-prototype, mitosis-oasis, mitosis-cicd, agnt2-prototype, polis-mesh — 6 repos in the queue at session start). Previously discoverQueuedRecords only called gh pr list for opts.repoPath, so PRs from any other repo were invisible unless a local queue record already existed. Now the daemon discovers from every distinct repoPath in the local queue, deduped by repoIdentity. Per-repo error messages name the offending repo for actionable log noise.

SIGTERM-safe lock release. processReleaseQueueRecord acquires a remote git-ref lock with a 2h TTL. Before this PR, killing the daemon mid-landing (launchctl unload, reboot, manual SIGTERM) skipped the finally block and the lock lingered for the full 2h, blocking any other landing on the same base branch. Now:

  • A module-level activeLockReleases Set tracks in-flight lock-release callbacks. The signal handler drains them on SIGTERM/SIGINT/SIGHUP.
  • A module-level activeRecords Map tracks in-flight records. Signal handler rewrites each to status=queued before the lock-release pass, so SIGTERM mid-landing doesn't strand records in landing status (invisible to discoverQueuedRecords and abandoned until manual retry).
  • The handler does NOT call process.exit(). Coordinates with child-registry's own SIGTERM handler (which runs killAllChildren("SIGTERM") → 2s wait → killAllChildren("SIGKILL") → exit). Both handlers fire; the daemon's sync work completes first, child-registry's async path completes after.
  • A 4s budget in the release loop prevents a slow git push from blowing the launchctl SIGKILL timeout (~5-10s).

Security: allowlist gate for repoPath. Queue records are read from JSON files in ~/.gstack/build-state/release-queue/. A planted record with a crafted repoPath could cause gh/git to run in a directory with a malicious .git/config (core.sshCommand, core.fsmonitor, core.editor) — git executes those on invocation. The new isAllowedRepoPath gate:

  • Defaults: ~/Documents/, ~/.gstack/build-worktrees/, ~/code/, ~/Development/, ~/src/.
  • Override: GSTACK_DAEMON_REPO_ALLOWLIST env var (colon-separated absolute paths).
  • Auto-includes opts.repoPath's realpath so daemons installed from non-standard locations (e.g., /workspaces/repo/) work without manual env-var setup.
  • Uses fs.realpathSync to defeat symlink bypass (lexical path.resolve would let <allowed>/evil-link → /tmp/attacker pass).
  • Gate runs on every candidate repoPath in runReleaseDaemon before the processor handles it — local records aren't given a free pass. Disallowed records are marked blocked with a lastError naming the violation.
  • Tests inject opts.processor to bypass the gate for synthetic paths.

Test Coverage

CODE PATHS
[+] build/orchestrator/release-daemon.ts
  ├── activeLockReleases registry         [★★★ TESTED] cleanup test verifies natural-completion empties registry
  ├── activeRecords registry              [★★★ TESTED] indirect via integration tests
  ├── installReleaseDaemonSignalHandlers  [INTEGRATION GAP] handlers themselves require process spawn
  ├── isAllowedRepoPath()
  │   ├── empty/non-string                [★★★ TESTED]
  │   ├── non-absolute                    [★★★ TESTED]
  │   ├── outside allowlist               [★★★ TESTED]
  │   ├── doesn't exist                   [★★★ TESTED]
  │   ├── not a directory                 [★★★ TESTED]
  │   ├── no .git marker                  [★★★ TESTED]
  │   ├── .git dir (clone)                [★★★ TESTED]
  │   ├── .git file (worktree)            [★★★ TESTED]
  │   ├── symlink bypass defeated         [★★★ TESTED]
  │   └── path traversal defeated         [★★★ TESTED]
  ├── buildAllowlistWithRoot()
  │   ├── defaults when undefined         [★★★ TESTED]
  │   ├── prepend realpath                [★★★ TESTED]
  │   ├── dedup against defaults          [★★★ TESTED]
  │   ├── ignore invalid extraRoot        [★★★ TESTED]
  │   └── integration with isAllowed      [★★★ TESTED]
  ├── discoverQueuedRecords (multi-repo)  [★★★ TESTED]
  ├── repoIdentity dedup                  [★★★ TESTED]
  ├── path-key fallback                   [★★★ TESTED]
  ├── configured/queued same-path dedup   [★★★ TESTED]
  ├── error-isolation per repo            [★★★ TESTED]
  ├── runReleaseDaemon allowlist gate     [★★★ TESTED]
  └── skip non-queued records             [★★★ TESTED]

COVERAGE: 21/22 paths tested (95%)  |  GAPS: 1 (signal-handler integration, intentional)
QUALITY: ★★★ all

Tests: 6 → 30 (+24 new).

Pre-Landing Review

Reviewed across three iterations (in-conversation):

Round 1 — Claude checklist (manual): 2 informational findings, both skipped as theoretical (re-entry concerns, slow-git-push under SIGKILL with multiple in-flight records).

Round 2 — Claude adversarial subagent (specialist): 13 findings, 5 critical:

  • F1: local queue records bypassed the allowlist entirely. FIXED — gate applied to every candidate repoPath in runReleaseDaemon before the processor runs.
  • F2: lexical path.resolve defeated by symlinks. FIXEDfs.realpathSync resolves before allowlist check; test added that constructs <allowed>/evil-link → /tmp/attacker.
  • F3: SIGTERM mid-landing left records stranded in landing status. FIXEDactiveRecords registry + revival-on-signal.
  • F4: race window between delete(callback) and releaseFn() in finally. FIXED — reordered: releaseFn first (idempotent), then delete.
  • F5: serial sync git push under SIGKILL timeout. FIXEDSIGNAL_RELEASE_BUDGET_MS=4000 cap, log deferred releases.

Round 3 — Codex structured review: 2 P2 informational findings, NO P1. Gate PASS:

  • Codex-1: allowlist didn't auto-include opts.repoPath. FIXEDbuildAllowlistWithRoot prepends the realpath of the configured root.
  • Codex-2: signal handler called process.exit(), pre-empting child-registry's 2s SIGKILL fallback for children that ignore SIGTERM. FIXED — handler no longer exits; lets child-registry's existing async path complete.

Final state: 30/30 release-daemon tests pass at HEAD.

Eval Results

No prompt-related files changed — evals skipped.

Scope Drift

Scope Check: CLEAN. Intent matches delivery: multi-repo discovery + SIGTERM safety + security hardening.

Plan Completion

No plan file detected.

Verification Results

Skipped — no plan verification section.

TODOS

No TODO items completed in this PR.

Test plan

  • 30/30 release-daemon tests pass (bun test build/orchestrator/__tests__/release-daemon.test.ts)
  • Pre-existing browse/gen-skill-docs test failures triaged as unrelated (proven by reproducing on origin/main)
  • Codex structured review GATE: PASS (0 P1)
  • Live verification: load on a real machine, kill mid-landing, verify (a) record revives to queued, (b) lock ref deleted on remote
  • Live verification: install daemon from non-default path (/workspaces/<repo>/), confirm allowlist auto-includes it

🤖 Generated with Claude Code

anbangr and others added 15 commits May 19, 2026 20:19
Two related fixes that make the launchd-managed release daemon actually
survive restarts and process the cross-repo queue it sees.

**Multi-repo discovery.** The daemon was given a single --project-root
but the release queue spans every repo the build orchestrator has
queued PRs from (6 repos here: mitosis-control-plane, mitosis-prototype,
mitosis-oasis, mitosis-cicd, agnt2-prototype, polis-mesh). Previously
discoverQueuedRecords called discoverBuildQueuedPullRequests only for
opts.repoPath, so PRs in any other repo were invisible unless they
already had a local queue record. Now the daemon also discovers from
every distinct repoPath in the local queue (deduped by repoIdentity),
and the error message names the repo so log noise is actionable.

**SIGTERM lock release.** processReleaseQueueRecord acquires a remote
git-ref lock with a 2h TTL. If the daemon is killed mid-landing
(launchctl unload, reboot, SIGTERM from any source), the finally
block never runs and the lock lingers for the full TTL, blocking any
other landing on the same base branch. The fix:

- Module-level activeLockReleases registry of synchronous release
  callbacks. Each in-flight processReleaseQueueRecord registers its
  release on lock acquisition and deregisters in finally.
- installReleaseDaemonSignalHandlers wires SIGTERM/SIGINT/SIGHUP
  handlers that drain the registry before process.exit, with logging
  for forensics on any unreleased lock.
- Handlers only install when opts.watch is true. The one-shot path
  (used by test) skips installation to avoid MaxListeners warnings
  when tests call runReleaseDaemon many times.

**Tests.** +4 unit tests covering: multi-repo discovery membership,
repoIdentity dedup, non-queued-record skip, and on-completion lock
release (indirect coverage of registry cleanup).

10/10 release-daemon tests pass.
Three review findings addressed from the specialist pass:

1. Security: add `isAllowedRepoPath` allowlist gate before discoverQueuedRecords
   fans out `gh pr list` calls to a record's repoPath. Previously a planted
   JSON file in ~/.gstack/build-state/release-queue/ could cause gh/git to
   run with a cwd inside an attacker-controlled directory containing a
   malicious .git/config (core.sshCommand, core.fsmonitor) — git would
   execute it on invocation. Allowlist defaults to ~/Documents, ~/code,
   ~/Development, ~/src, ~/.gstack/build-worktrees. Override via
   GSTACK_DAEMON_REPO_ALLOWLIST. Gate skipped only when callers inject
   opts.discoverRemote (tests opt out and take responsibility).
   Path traversal via `..` is defeated by path.resolve() before comparison.

2. Maintainability: DRY violation. processReleaseQueueRecord's finally block
   was re-resolving `opts.releaseLock ?? releaseRemoteReleaseLock` instead
   of the already-resolved `releaseFn` captured at lock-registration time.
   A future wrapper added to one path would silently diverge from the
   other. Use releaseFn in both code paths now.

3. Maintainability: dedup key mismatch. The configured opts.repoPath used
   key `__configured__:${path}` while queued records used `path:${path}`
   when repoIdentity was absent. Same physical repo could produce two
   discovery calls. Unified under a single `repoKey(identity, path)`
   helper.

Tests: +10 cases (9 isAllowedRepoPath + 1 dedup-shared-path + 1 error-
isolation across repos). 22/22 pass.
Supports $, .field, [field], [*], [?(@.f == 'lit' && !@.g && @.h)].
Anything outside the grammar returns []. Safe by construction: no eval,
no Function constructor, no script expressions. Used by the
state_jsonpath matcherKind landing in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets investigator-proposed learned patterns target structural state shapes
(e.g. "committed feature without completedAt"). Backed by safe-jsonpath.
Malformed patterns return false (caught by applyLearnedPattern's existing
try/catch) — no throw.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 2026-05-17 polis-bug detection gap. Detects features in
state.features[] where status=="committed" with mergeSha+prNumber but
missing completedAt — the signature of a hand-merged PR that bypassed
the orchestrator's ship pipeline. Severity HIGH because the resume loop
will try to re-process the feature and fail (branch deleted post-merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Funnels detector hits into the halt-events queue so drain-faults
(landing in PR 5) is the single sink. Existing SKILL_FAULT_DETECTED
event shape is unchanged. Each detector fault produces one halt
event with kind=PHASE_FAILED, severity mirrored from the fault,
message tagged with the detector category.

Both the registry path (newFaults only) and the back-compat path
(every fault, every tick) dual-emit. Re-emits are safe: the queue
file name is deterministic on faultId, and emitHaltEvent's atomic
tmp+rename overwrites the same on-disk record idempotently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…URE regression

Pins the 2026-05-17 polis-bug detection at the E2E level: detector
fires HAND_MERGED_FEATURE on the polis state fixture; halt event
queues, loads, and marks-investigated cleanly through the foundation
APIs. Full investigator dispatch round-trip lands in PR 5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adversarial review surfaced five critical issues in the v1 allowlist
implementation. The earlier commit gated only the remote-discovery
fan-out; planted queue records bypassed it entirely. Plus symlink and
SIGTERM-window issues.

F1 — Local records now go through the allowlist gate too.
Previously discoverQueuedRecords added every local record to byId
unconditionally before the allowlist check. A planted JSON file in
~/.gstack/build-state/release-queue/ with a crafted repoPath would be
selected as `next` by runReleaseDaemon and handed to
processReleaseQueueRecord, which would `cd` into the attacker
directory for `git fetch` — the exact attack the allowlist docstring
claimed to defend. Now: runReleaseDaemon's candidate loop runs each
record through isAllowedRepoPath; failures get status=blocked with
a lastError naming the disallowed path; the daemon moves on. Tests
inject opts.processor to bypass for synthetic paths.

F2 — Symlink bypass via realpath.
The lexical path.resolve() check could be defeated by a symlink at
<allowlist>/evil-link → /tmp/attacker. The link path was lexically
inside the allowlist but git would follow the symlink and read the
attacker's .git/config. Fix: fs.realpathSync() resolves symlinks,
and the allowlist check operates on the realpath. Test added that
constructs a symlink pointing outside the allowlist and asserts
rejection.

F3 — SIGTERM mid-landing now revives records to status=queued.
Previously the signal handler did process.exit() after releasing
locks, leaving any in-flight record stuck in status=landing —
invisible to discoverQueuedRecords (which filters status==="queued")
and abandoned until manual retryReleaseQueueRecord. Now the handler
walks an activeRecords registry first, rewrites each to status=queued
with lastError="interrupted by signal X", then runs the lock-release
loop. Next daemon start re-picks the record.

F4 — Finally-block reorder closes SIGTERM race window.
Previously activeLockReleases.delete() ran before releaseFn(). A
SIGTERM landing in that one-statement gap ran neither path and the
lock leaked for the full TTL. Now releaseFn() runs first, then the
delete. Worst case is a harmless double-release (releaseRemoteReleaseLock
is idempotent).

F5 — Per-call budget in the signal-handler release loop.
launchctl waits 5-10s between SIGTERM and SIGKILL. A slow git push
could blow the budget. New SIGNAL_RELEASE_BUDGET_MS=4000 cap stops
the loop after 4s and logs the remaining locks as "deferred (budget
exhausted)". A leaked lock for 2h is better than half a lock-release
truncated by SIGKILL.

Bonus: exported _resetReleaseDaemonForTests() so future tests that
exercise the watch path can clean module-level state.

Tests: +3 (symlink-bypass + 2 for F1 enforcement). 25/25 pass.
Two P2 informational findings from Codex structured review. Neither
blocking (gate passed), but both real concerns worth fixing before ship.

Codex-1: allowlist auto-includes opts.repoPath.
Previously a daemon installed from a path outside the hardcoded defaults
(~/Documents/, ~/code/, ~/Development/, ~/src/, ~/.gstack/build-worktrees/)
would reject its own configured project root and block every queued record
unless the user set GSTACK_DAEMON_REPO_ALLOWLIST manually. UX cliff for
non-standard install locations like /workspaces/repo/, mounted volumes,
docker bind mounts. Fix: new exported buildAllowlistWithRoot() prepends
the realpath of opts.repoPath to the default prefixes. Both call sites
in discoverQueuedRecords and runReleaseDaemon use the extended list.
realpath resolution is consistent with the symlink-bypass defense in
isAllowedRepoPath, so a symlinked configured root still gets resolved
to a single canonical entry. Dedup against existing prefixes prevents
duplicates when opts.repoPath happens to match a default. Missing or
invalid extraRoot falls back to defaults (no garbage entry).

Codex-2: signal handler no longer calls process.exit.
child-registry installs its own SIGTERM/SIGINT/SIGHUP handler that runs
killAllChildren("SIGTERM"), waits 2s for graceful shutdown, then
killAllChildren("SIGKILL") for survivors before exiting. My handler was
calling process.exit(143) synchronously after lock release, which aborts
child-registry's 2s setTimeout and any child process that ignores SIGTERM
gets reparented to init — exactly the failure child-registry exists to
prevent. Fix: remove process.exit from the daemon's signal handler. Both
handlers fire on the same signal; node runs them in registration order.
My work is sync (record revival + bounded lock release), child-registry's
is async (kill + sleep + kill + exit). They coordinate naturally — mine
finishes first then child-registry's path completes normally.

Tests: +5 unit tests for buildAllowlistWithRoot (defaults, prepend,
dedup, invalid path, integration with isAllowedRepoPath). 30/30 pass.
… correctness)

Round-5 of /review closed 14 findings from the specialist panel + Red Team
+ 12 rounds of Codex adversarial review, fixing security gates and
SIGTERM/state-machine integrity issues in the release-daemon.

Security gates (defense-in-depth on cwd that flows into spawnSync):
- isAllowedRepoPath now returns the realpath-resolved path; callers
  MUST use it for spawnSync cwd to prevent TOCTOU symlink-swap.
- Gate runs unconditionally inside processReleaseQueueRecord (was
  bypassable via direct-call or opts.processor injection).
- featureBranch validated against /^(?!-)[A-Za-z0-9._\/-]+$/ to block
  refspec injection (e.g. "evil:refs/heads/main" rewriting refs).
- worktreePath gated with the same allowlist + .git checks; planted
  records can no longer redirect /land-and-deploy via worktreePath.
- env override GSTACK_DAEMON_REPO_ALLOWLIST is additive (not
  replace), realpath-normalized, rejects root and empty prefixes.
- buildAllowlistWithRoot rejects extraRoot resolving to /.
- discoverQueuedRecords gates legacy records (no repoIdentity)
  BEFORE releaseQueueRecordId derivation — without this, a planted
  record could trigger `git remote get-url origin` in the rejected
  repo before the block fires. Block writes use a new runId-based
  blockReleaseQueueRecordByRunId helper that doesn't touch the
  rejected repo's .git/config.

SIGTERM revival correctness (state-machine integrity):
- reviveActiveRecordsForSignal re-reads disk before reviving, skips
  records already in terminal states (landed / blocked / abandoned).
  Prevents double-ship and silent erasure of landed CHANGELOG entries.
- Drift-repairing records have retries reset to 0 on revival, so the
  next daemon pass gets a fresh drift-repair budget.
- Heartbeat stops before signal-induced lock release to prevent
  re-push resurrection of the ref.
- Lock release callback uses gate-resolved path for cwd.
- Records register in activeRecords IMMEDIATELY after the claiming
  write — closes the window where SIGTERM during verifyQueued or
  acquireLock left the disk record stranded at "claiming".
- After signal revival, processor's safeUpdate guard prevents
  late-returning land/ship results from overwriting the queued
  state. Sub-agent dispatch checks isInterrupted at every await
  boundary so the dying process doesn't start new deploys after the
  signal handler released the lock.
- If shutdown happens during acquireLock and lock acquired, the
  pre-heartbeat interrupt path releases the lock manually (no
  finally cleanup is wired up yet).
- Marker verification failure checks interruption first — turns a
  retryable shutdown into a retry, not a manual-recovery hang.

Watch-loop correctness:
- --once now returns exit 1 when records get early-blocked for
  disallowed repoPath (was 0, masking poisoned queues from
  automation).

State-machine table:
- release-queue: landing→queued and drift_repairing→queued are now
  allowed transitions (signal-induced revival paths).

Tests: 30 → 55 in release-daemon.test.ts. 25 new regression tests
cover each round's fix. Full orchestrator suite: 1535 pass / 1
pre-existing fail (integration.test.ts release_queued patch
detection — confirmed pre-existing on origin/main).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anbangr anbangr merged commit a95d94f into main May 19, 2026
@anbangr anbangr deleted the fix/release-daemon-multi-repo-discovery branch May 19, 2026 22:47
anbangr added a commit that referenced this pull request May 20, 2026
…iting "landed" (#61)

The daemon used to classify a sub-agent result based ONLY on exit code:
exit 0 = "landed", non-zero/timed_out = "blocked". But the /land-and-deploy
sub-agent (Kimi / Claude / Codex) gracefully declines to merge in real
scenarios — failing CI, no PR found, pre-merge gate fails — and still
exits 0 (subprocess ran cleanly, verdict lives in prose).

Observed in production on 2026-05-19 (mitosis-control-plane release-queue):

  Queue says    GitHub reality
  PR garrytan#120 landed  OPEN, e2e failing
  PR garrytan#122 landed  MERGED ✓
  PR garrytan#124 landed  MERGED ✓
  PR garrytan#129 landed  OPEN, e2e failing

50% false-positive rate. The queue records are stuck on "landed" so
discoverQueuedRecords (which filters status === "queued") never re-fans
out for them — silently abandoned.

Fix: after the land sub-agent returns exit 0, ask GitHub for the
authoritative PR state. New helper `verifyPrMerged(prNumber, repoIdentity,
cwd)` runs `gh pr view <pr> --repo <owner/name> --json state -q .state`
and returns { merged: true } only on MERGED. Anything else (OPEN, CLOSED,
gh exited non-zero, repoIdentity unparseable) returns { merged: false,
reason: "..." } and the daemon writes "blocked" with a useful lastError.

The repoIdentity is parsed via a strict ^github.com/owner/repo$ regex
so a planted record can't sneak shell-specials through to gh — anything
that doesn't match the regex gets rejected before any subprocess runs.

Network/auth failures (gh exited non-zero) are reported as not-merged
rather than thrown — better to block the record and retry than crash
the daemon and leak the lock. The remediation for any blocked record
is the existing retryReleaseQueueRecord CLI.

Injection seam: opts.verifyMerged?: typeof verifyPrMerged so tests can
control the response. Defaults to the real verifyPrMerged.

Scope: pre-existing bug, not a regression from PR #57. PR garrytan#120 was
marked landed at 12:30Z on 2026-05-19, hours before PR #57's work
began. PR #57's deploy verification surfaced it by running through a
real PR (garrytan#129) with failing e2e.

Tests: 60/60 release-daemon.test.ts (5 new regressions covering the
production bug pattern, the happy path, network failures, identity
parsing, and the positive parse case). Full orchestrator suite: 1525
pass / 16 fail, all 16 pre-existing on origin/main (confirmed via
git stash).

Remediation for stranded records on production right now (PR#120,
PR#129 on mitosis-control-plane): once this lands, the daemon needs
manual revival of those two records. The retryReleaseQueueRecord CLI
only handles blocked→queued, not landed→queued. Suggest a one-shot
audit script or extending the retry CLI to support landed records
whose GitHub state disagrees. Out of scope for this PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant