fix(release-daemon): multi-repo discovery + SIGTERM lock release + security hardening#57
Merged
Merged
Conversation
Two related fixes that make the launchd-managed release daemon actually survive restarts and process the cross-repo queue it sees. **Multi-repo discovery.** The daemon was given a single --project-root but the release queue spans every repo the build orchestrator has queued PRs from (6 repos here: mitosis-control-plane, mitosis-prototype, mitosis-oasis, mitosis-cicd, agnt2-prototype, polis-mesh). Previously discoverQueuedRecords called discoverBuildQueuedPullRequests only for opts.repoPath, so PRs in any other repo were invisible unless they already had a local queue record. Now the daemon also discovers from every distinct repoPath in the local queue (deduped by repoIdentity), and the error message names the repo so log noise is actionable. **SIGTERM lock release.** processReleaseQueueRecord acquires a remote git-ref lock with a 2h TTL. If the daemon is killed mid-landing (launchctl unload, reboot, SIGTERM from any source), the finally block never runs and the lock lingers for the full TTL, blocking any other landing on the same base branch. The fix: - Module-level activeLockReleases registry of synchronous release callbacks. Each in-flight processReleaseQueueRecord registers its release on lock acquisition and deregisters in finally. - installReleaseDaemonSignalHandlers wires SIGTERM/SIGINT/SIGHUP handlers that drain the registry before process.exit, with logging for forensics on any unreleased lock. - Handlers only install when opts.watch is true. The one-shot path (used by test) skips installation to avoid MaxListeners warnings when tests call runReleaseDaemon many times. **Tests.** +4 unit tests covering: multi-repo discovery membership, repoIdentity dedup, non-queued-record skip, and on-completion lock release (indirect coverage of registry cleanup). 10/10 release-daemon tests pass.
…lti-repo-discovery
Three review findings addressed from the specialist pass:
1. Security: add `isAllowedRepoPath` allowlist gate before discoverQueuedRecords
fans out `gh pr list` calls to a record's repoPath. Previously a planted
JSON file in ~/.gstack/build-state/release-queue/ could cause gh/git to
run with a cwd inside an attacker-controlled directory containing a
malicious .git/config (core.sshCommand, core.fsmonitor) — git would
execute it on invocation. Allowlist defaults to ~/Documents, ~/code,
~/Development, ~/src, ~/.gstack/build-worktrees. Override via
GSTACK_DAEMON_REPO_ALLOWLIST. Gate skipped only when callers inject
opts.discoverRemote (tests opt out and take responsibility).
Path traversal via `..` is defeated by path.resolve() before comparison.
2. Maintainability: DRY violation. processReleaseQueueRecord's finally block
was re-resolving `opts.releaseLock ?? releaseRemoteReleaseLock` instead
of the already-resolved `releaseFn` captured at lock-registration time.
A future wrapper added to one path would silently diverge from the
other. Use releaseFn in both code paths now.
3. Maintainability: dedup key mismatch. The configured opts.repoPath used
key `__configured__:${path}` while queued records used `path:${path}`
when repoIdentity was absent. Same physical repo could produce two
discovery calls. Unified under a single `repoKey(identity, path)`
helper.
Tests: +10 cases (9 isAllowedRepoPath + 1 dedup-shared-path + 1 error-
isolation across repos). 22/22 pass.
…lti-repo-discovery
…lti-repo-discovery
Supports $, .field, [field], [*], [?(@.f == 'lit' && !@.g && @.h)]. Anything outside the grammar returns []. Safe by construction: no eval, no Function constructor, no script expressions. Used by the state_jsonpath matcherKind landing in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets investigator-proposed learned patterns target structural state shapes (e.g. "committed feature without completedAt"). Backed by safe-jsonpath. Malformed patterns return false (caught by applyLearnedPattern's existing try/catch) — no throw. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 2026-05-17 polis-bug detection gap. Detects features in state.features[] where status=="committed" with mergeSha+prNumber but missing completedAt — the signature of a hand-merged PR that bypassed the orchestrator's ship pipeline. Severity HIGH because the resume loop will try to re-process the feature and fail (branch deleted post-merge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Funnels detector hits into the halt-events queue so drain-faults (landing in PR 5) is the single sink. Existing SKILL_FAULT_DETECTED event shape is unchanged. Each detector fault produces one halt event with kind=PHASE_FAILED, severity mirrored from the fault, message tagged with the detector category. Both the registry path (newFaults only) and the back-compat path (every fault, every tick) dual-emit. Re-emits are safe: the queue file name is deterministic on faultId, and emitHaltEvent's atomic tmp+rename overwrites the same on-disk record idempotently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…URE regression Pins the 2026-05-17 polis-bug detection at the E2E level: detector fires HAND_MERGED_FEATURE on the polis state fixture; halt event queues, loads, and marks-investigated cleanly through the foundation APIs. Full investigator dispatch round-trip lands in PR 5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adversarial review surfaced five critical issues in the v1 allowlist implementation. The earlier commit gated only the remote-discovery fan-out; planted queue records bypassed it entirely. Plus symlink and SIGTERM-window issues. F1 — Local records now go through the allowlist gate too. Previously discoverQueuedRecords added every local record to byId unconditionally before the allowlist check. A planted JSON file in ~/.gstack/build-state/release-queue/ with a crafted repoPath would be selected as `next` by runReleaseDaemon and handed to processReleaseQueueRecord, which would `cd` into the attacker directory for `git fetch` — the exact attack the allowlist docstring claimed to defend. Now: runReleaseDaemon's candidate loop runs each record through isAllowedRepoPath; failures get status=blocked with a lastError naming the disallowed path; the daemon moves on. Tests inject opts.processor to bypass for synthetic paths. F2 — Symlink bypass via realpath. The lexical path.resolve() check could be defeated by a symlink at <allowlist>/evil-link → /tmp/attacker. The link path was lexically inside the allowlist but git would follow the symlink and read the attacker's .git/config. Fix: fs.realpathSync() resolves symlinks, and the allowlist check operates on the realpath. Test added that constructs a symlink pointing outside the allowlist and asserts rejection. F3 — SIGTERM mid-landing now revives records to status=queued. Previously the signal handler did process.exit() after releasing locks, leaving any in-flight record stuck in status=landing — invisible to discoverQueuedRecords (which filters status==="queued") and abandoned until manual retryReleaseQueueRecord. Now the handler walks an activeRecords registry first, rewrites each to status=queued with lastError="interrupted by signal X", then runs the lock-release loop. Next daemon start re-picks the record. F4 — Finally-block reorder closes SIGTERM race window. Previously activeLockReleases.delete() ran before releaseFn(). A SIGTERM landing in that one-statement gap ran neither path and the lock leaked for the full TTL. Now releaseFn() runs first, then the delete. Worst case is a harmless double-release (releaseRemoteReleaseLock is idempotent). F5 — Per-call budget in the signal-handler release loop. launchctl waits 5-10s between SIGTERM and SIGKILL. A slow git push could blow the budget. New SIGNAL_RELEASE_BUDGET_MS=4000 cap stops the loop after 4s and logs the remaining locks as "deferred (budget exhausted)". A leaked lock for 2h is better than half a lock-release truncated by SIGKILL. Bonus: exported _resetReleaseDaemonForTests() so future tests that exercise the watch path can clean module-level state. Tests: +3 (symlink-bypass + 2 for F1 enforcement). 25/25 pass.
…lti-repo-discovery
Two P2 informational findings from Codex structured review. Neither
blocking (gate passed), but both real concerns worth fixing before ship.
Codex-1: allowlist auto-includes opts.repoPath.
Previously a daemon installed from a path outside the hardcoded defaults
(~/Documents/, ~/code/, ~/Development/, ~/src/, ~/.gstack/build-worktrees/)
would reject its own configured project root and block every queued record
unless the user set GSTACK_DAEMON_REPO_ALLOWLIST manually. UX cliff for
non-standard install locations like /workspaces/repo/, mounted volumes,
docker bind mounts. Fix: new exported buildAllowlistWithRoot() prepends
the realpath of opts.repoPath to the default prefixes. Both call sites
in discoverQueuedRecords and runReleaseDaemon use the extended list.
realpath resolution is consistent with the symlink-bypass defense in
isAllowedRepoPath, so a symlinked configured root still gets resolved
to a single canonical entry. Dedup against existing prefixes prevents
duplicates when opts.repoPath happens to match a default. Missing or
invalid extraRoot falls back to defaults (no garbage entry).
Codex-2: signal handler no longer calls process.exit.
child-registry installs its own SIGTERM/SIGINT/SIGHUP handler that runs
killAllChildren("SIGTERM"), waits 2s for graceful shutdown, then
killAllChildren("SIGKILL") for survivors before exiting. My handler was
calling process.exit(143) synchronously after lock release, which aborts
child-registry's 2s setTimeout and any child process that ignores SIGTERM
gets reparented to init — exactly the failure child-registry exists to
prevent. Fix: remove process.exit from the daemon's signal handler. Both
handlers fire on the same signal; node runs them in registration order.
My work is sync (record revival + bounded lock release), child-registry's
is async (kill + sleep + kill + exit). They coordinate naturally — mine
finishes first then child-registry's path completes normally.
Tests: +5 unit tests for buildAllowlistWithRoot (defaults, prepend,
dedup, invalid path, integration with isAllowedRepoPath). 30/30 pass.
…lti-repo-discovery
… correctness) Round-5 of /review closed 14 findings from the specialist panel + Red Team + 12 rounds of Codex adversarial review, fixing security gates and SIGTERM/state-machine integrity issues in the release-daemon. Security gates (defense-in-depth on cwd that flows into spawnSync): - isAllowedRepoPath now returns the realpath-resolved path; callers MUST use it for spawnSync cwd to prevent TOCTOU symlink-swap. - Gate runs unconditionally inside processReleaseQueueRecord (was bypassable via direct-call or opts.processor injection). - featureBranch validated against /^(?!-)[A-Za-z0-9._\/-]+$/ to block refspec injection (e.g. "evil:refs/heads/main" rewriting refs). - worktreePath gated with the same allowlist + .git checks; planted records can no longer redirect /land-and-deploy via worktreePath. - env override GSTACK_DAEMON_REPO_ALLOWLIST is additive (not replace), realpath-normalized, rejects root and empty prefixes. - buildAllowlistWithRoot rejects extraRoot resolving to /. - discoverQueuedRecords gates legacy records (no repoIdentity) BEFORE releaseQueueRecordId derivation — without this, a planted record could trigger `git remote get-url origin` in the rejected repo before the block fires. Block writes use a new runId-based blockReleaseQueueRecordByRunId helper that doesn't touch the rejected repo's .git/config. SIGTERM revival correctness (state-machine integrity): - reviveActiveRecordsForSignal re-reads disk before reviving, skips records already in terminal states (landed / blocked / abandoned). Prevents double-ship and silent erasure of landed CHANGELOG entries. - Drift-repairing records have retries reset to 0 on revival, so the next daemon pass gets a fresh drift-repair budget. - Heartbeat stops before signal-induced lock release to prevent re-push resurrection of the ref. - Lock release callback uses gate-resolved path for cwd. - Records register in activeRecords IMMEDIATELY after the claiming write — closes the window where SIGTERM during verifyQueued or acquireLock left the disk record stranded at "claiming". - After signal revival, processor's safeUpdate guard prevents late-returning land/ship results from overwriting the queued state. Sub-agent dispatch checks isInterrupted at every await boundary so the dying process doesn't start new deploys after the signal handler released the lock. - If shutdown happens during acquireLock and lock acquired, the pre-heartbeat interrupt path releases the lock manually (no finally cleanup is wired up yet). - Marker verification failure checks interruption first — turns a retryable shutdown into a retry, not a manual-recovery hang. Watch-loop correctness: - --once now returns exit 1 when records get early-blocked for disallowed repoPath (was 0, masking poisoned queues from automation). State-machine table: - release-queue: landing→queued and drift_repairing→queued are now allowed transitions (signal-induced revival paths). Tests: 30 → 55 in release-daemon.test.ts. 25 new regression tests cover each round's fix. Full orchestrator suite: 1535 pass / 1 pre-existing fail (integration.test.ts release_queued patch detection — confirmed pre-existing on origin/main). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
anbangr
added a commit
that referenced
this pull request
May 20, 2026
…iting "landed" (#61) The daemon used to classify a sub-agent result based ONLY on exit code: exit 0 = "landed", non-zero/timed_out = "blocked". But the /land-and-deploy sub-agent (Kimi / Claude / Codex) gracefully declines to merge in real scenarios — failing CI, no PR found, pre-merge gate fails — and still exits 0 (subprocess ran cleanly, verdict lives in prose). Observed in production on 2026-05-19 (mitosis-control-plane release-queue): Queue says GitHub reality PR garrytan#120 landed OPEN, e2e failing PR garrytan#122 landed MERGED ✓ PR garrytan#124 landed MERGED ✓ PR garrytan#129 landed OPEN, e2e failing 50% false-positive rate. The queue records are stuck on "landed" so discoverQueuedRecords (which filters status === "queued") never re-fans out for them — silently abandoned. Fix: after the land sub-agent returns exit 0, ask GitHub for the authoritative PR state. New helper `verifyPrMerged(prNumber, repoIdentity, cwd)` runs `gh pr view <pr> --repo <owner/name> --json state -q .state` and returns { merged: true } only on MERGED. Anything else (OPEN, CLOSED, gh exited non-zero, repoIdentity unparseable) returns { merged: false, reason: "..." } and the daemon writes "blocked" with a useful lastError. The repoIdentity is parsed via a strict ^github.com/owner/repo$ regex so a planted record can't sneak shell-specials through to gh — anything that doesn't match the regex gets rejected before any subprocess runs. Network/auth failures (gh exited non-zero) are reported as not-merged rather than thrown — better to block the record and retry than crash the daemon and leak the lock. The remediation for any blocked record is the existing retryReleaseQueueRecord CLI. Injection seam: opts.verifyMerged?: typeof verifyPrMerged so tests can control the response. Defaults to the real verifyPrMerged. Scope: pre-existing bug, not a regression from PR #57. PR garrytan#120 was marked landed at 12:30Z on 2026-05-19, hours before PR #57's work began. PR #57's deploy verification surfaced it by running through a real PR (garrytan#129) with failing e2e. Tests: 60/60 release-daemon.test.ts (5 new regressions covering the production bug pattern, the happy path, network failures, identity parsing, and the positive parse case). Full orchestrator suite: 1525 pass / 16 fail, all 16 pre-existing on origin/main (confirmed via git stash). Remediation for stranded records on production right now (PR#120, PR#129 on mitosis-control-plane): once this lands, the daemon needs manual revival of those two records. The retryReleaseQueueRecord CLI only handles blocked→queued, not landed→queued. Suggest a one-shot audit script or extending the retry CLI to support landed records whose GitHub state disagrees. Out of scope for this PR. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes two operational gaps in the release-daemon that surfaced during a real session debugging the production daemon:
Multi-repo discovery. The daemon was started with a single
--project-rootbut the release-queue spans every repo the build orchestrator queues PRs from (mitosis-control-plane, mitosis-prototype, mitosis-oasis, mitosis-cicd, agnt2-prototype, polis-mesh — 6 repos in the queue at session start). PreviouslydiscoverQueuedRecordsonly calledgh pr listforopts.repoPath, so PRs from any other repo were invisible unless a local queue record already existed. Now the daemon discovers from every distinctrepoPathin the local queue, deduped byrepoIdentity. Per-repo error messages name the offending repo for actionable log noise.SIGTERM-safe lock release.
processReleaseQueueRecordacquires a remote git-ref lock with a 2h TTL. Before this PR, killing the daemon mid-landing (launchctl unload, reboot, manual SIGTERM) skipped thefinallyblock and the lock lingered for the full 2h, blocking any other landing on the same base branch. Now:activeLockReleasesSet tracks in-flight lock-release callbacks. The signal handler drains them on SIGTERM/SIGINT/SIGHUP.activeRecordsMap tracks in-flight records. Signal handler rewrites each tostatus=queuedbefore the lock-release pass, so SIGTERM mid-landing doesn't strand records inlandingstatus (invisible todiscoverQueuedRecordsand abandoned until manual retry).process.exit(). Coordinates withchild-registry's own SIGTERM handler (which runskillAllChildren("SIGTERM")→ 2s wait →killAllChildren("SIGKILL")→ exit). Both handlers fire; the daemon's sync work completes first, child-registry's async path completes after.git pushfrom blowing the launchctl SIGKILL timeout (~5-10s).Security: allowlist gate for
repoPath. Queue records are read from JSON files in~/.gstack/build-state/release-queue/. A planted record with a craftedrepoPathcould causegh/gitto run in a directory with a malicious.git/config(core.sshCommand,core.fsmonitor,core.editor) — git executes those on invocation. The newisAllowedRepoPathgate:~/Documents/,~/.gstack/build-worktrees/,~/code/,~/Development/,~/src/.GSTACK_DAEMON_REPO_ALLOWLISTenv var (colon-separated absolute paths).opts.repoPath's realpath so daemons installed from non-standard locations (e.g.,/workspaces/repo/) work without manual env-var setup.fs.realpathSyncto defeat symlink bypass (lexicalpath.resolvewould let<allowed>/evil-link → /tmp/attackerpass).repoPathinrunReleaseDaemonbefore the processor handles it — local records aren't given a free pass. Disallowed records are markedblockedwith alastErrornaming the violation.opts.processorto bypass the gate for synthetic paths.Test Coverage
Tests: 6 → 30 (+24 new).
Pre-Landing Review
Reviewed across three iterations (in-conversation):
Round 1 — Claude checklist (manual): 2 informational findings, both skipped as theoretical (re-entry concerns, slow-git-push under SIGKILL with multiple in-flight records).
Round 2 — Claude adversarial subagent (specialist): 13 findings, 5 critical:
repoPathinrunReleaseDaemonbefore the processor runs.path.resolvedefeated by symlinks. FIXED —fs.realpathSyncresolves before allowlist check; test added that constructs<allowed>/evil-link → /tmp/attacker.landingstatus. FIXED —activeRecordsregistry + revival-on-signal.delete(callback)andreleaseFn()in finally. FIXED — reordered:releaseFnfirst (idempotent), then delete.git pushunder SIGKILL timeout. FIXED —SIGNAL_RELEASE_BUDGET_MS=4000cap, log deferred releases.Round 3 — Codex structured review: 2 P2 informational findings, NO P1. Gate PASS:
opts.repoPath. FIXED —buildAllowlistWithRootprepends the realpath of the configured root.process.exit(), pre-empting child-registry's 2s SIGKILL fallback for children that ignore SIGTERM. FIXED — handler no longer exits; lets child-registry's existing async path complete.Final state: 30/30 release-daemon tests pass at HEAD.
Eval Results
No prompt-related files changed — evals skipped.
Scope Drift
Scope Check: CLEAN. Intent matches delivery: multi-repo discovery + SIGTERM safety + security hardening.
Plan Completion
No plan file detected.
Verification Results
Skipped — no plan verification section.
TODOS
No TODO items completed in this PR.
Test plan
bun test build/orchestrator/__tests__/release-daemon.test.ts)queued, (b) lock ref deleted on remote/workspaces/<repo>/), confirm allowlist auto-includes it🤖 Generated with Claude Code