fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827
fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827diazMelgarejo wants to merge 2 commits into
Conversation
…f repo (garrytan#1802) /sync-gbrain's memory-ingest resume path adopted gbrain's import-checkpoint.json `dir` field as the staging directory without proving it was one. A poisoned checkpoint — `dir` = the repo root, written when an autopilot `gbrain import` was SIGTERM'd while CWD was the repo — was adopted on the next run and then recursively deleted by cleanupStagingDir(), destroying the user's working tree. Root cause is a TRUST failure, not path math: code deleted a path it never proved it owned. Changes: - NEW lib/staging-guard.ts: checkOwnedStagingDir() — single fail-closed proof of "safe to recurse-delete / resume into". Requires ALL of: realpath resolves; target is a directory; it is a direct child of $GSTACK_HOME named .staging-ingest-*; contains no .git (tripwire); and carries a .gstack-staging marker that is a REGULAR FILE (lstat, so a dir/symlink can't impersonate it). - makeStagingDir(): writes the .gstack-staging marker at creation. - cleanupStagingDir(): refuses rm -rf unless the guard passes (single deletion chokepoint — covers the finally block AND the SIGTERM handler). - decideResume(gstackHome = GSTACK_HOME): pure decision; returns the rejection reason so the caller logs "refused: <why>" instead of the misleading "gone" (the poisoned path often still exists on disk). - gstack-memory-ingest second entry point: the binary is runnable directly, so it independently guards GSTACK_INGEST_RESUME_DIR with checkOwnedStagingDir rather than trusting the env it receives. - test/regression-1611: marker minted in the resume test; added the garrytan#1802 poison matrix + a checkOwnedStagingDir unit matrix (9→32 assertions; +23). Compatibility: decideResume() gains an optional gstackHome param (default GSTACK_HOME); the sole caller is unchanged. 55 tests green across the two affected files. Deliberate non-change (documented): pre-marker staging dirs from an interrupted run made BEFORE this upgrade are refused (restage from scratch — seconds, no data loss) rather than accepted via a legacy name-based path, which would reintroduce the exact name-trust vulnerability this fix removes. Reviewed by a 4-model design steelman (Gemini, Codex, gpt-4o, qwen3.5-27b) plus an in-repo eng pass (Codex + an independent code-review agent on the actual diff). All converged that the durable fix is upstream in gbrain (staging-first checkpointing) — filed as a companion issue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Companion root-cause issue filed upstream: garrytan/gbrain#1728 — |
…ross upgrades
gstack/gbrain `upgrade` overwrites the install dir and silently reverts any
not-yet-merged upstream fix — for #1802 that re-introduces the repo-deleting
rm -rf. This re-applies our fork patches after any upgrade, until they merge.
- scripts/fork-patches/apply-fork-patches.sh: registry-driven, detection-gated,
fail-closed driver. Detect (MARKERS grep OR `git apply --reverse --check`) →
no-op if present (we patched it OR upstream merged it). Else apply
(`--check` then `--3way` for drift; atomic) → VERIFY (bun test) → roll back
the patch's files on failure. Never half-patches; never blind-overwrites.
- patches/gstack-1802-staging-guard.{patch,meta}: first registered patch
(garrytan/gstack#1827). Retire by deleting the pair once it merges.
- ~/.zshrc shell-start trigger (silent no-op when patched).
- Folded into the mcp-install skill (Fork Self-Heal section) — no new skill.
- LESSONS + README capture the method and the fail-closed trust-boundary principle.
Proven: re-applies + passes bun test (55) on an unpatched worktree; idempotent
no-op on the patched live install. Fixed a worktree-gitlink detection bug found
by the re-apply test (.git is a file in a worktree → use rev-parse).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ross upgrades
gstack/gbrain `upgrade` overwrites the install dir and silently reverts any
not-yet-merged upstream fix — for #1802 that re-introduces the repo-deleting
rm -rf. This re-applies our fork patches after any upgrade, until they merge.
- scripts/fork-patches/apply-fork-patches.sh: registry-driven, detection-gated,
fail-closed driver. Detect (MARKERS grep OR `git apply --reverse --check`) →
no-op if present (we patched it OR upstream merged it). Else apply
(`--check` then `--3way` for drift; atomic) → VERIFY (bun test) → roll back
the patch's files on failure. Never half-patches; never blind-overwrites.
- patches/gstack-1802-staging-guard.{patch,meta}: first registered patch
(garrytan/gstack#1827). Retire by deleting the pair once it merges.
- ~/.zshrc shell-start trigger (silent no-op when patched).
- Folded into the mcp-install skill (Fork Self-Heal section) — no new skill.
- LESSONS + README capture the method and the fail-closed trust-boundary principle.
Proven: re-applies + passes bun test (55) on an unpatched worktree; idempotent
no-op on the patched live install. Fixed a worktree-gitlink detection bug found
by the re-apply test (.git is a file in a worktree → use rev-parse).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Validated on On
One question for the maintainer, not a blocker: since the ownership marker is minted by |
|
Thank you @diazMelgarejo — this is a clean, well-reasoned fix. Moving it to a base-repo branch so the eval/E2E CI gets secrets (fork PRs don't), and bundling three resume-machinery bugs an eng review + Codex pass surfaced in the same code (remote-http cleanup collision, timeout-resume deletion, resume failure-mapping) plus TOCTOU/marker hardening. Your commit is the first of six and you're credited in the CHANGELOG. Continues in #1856. |
… fixes (#1802) (#1856) * fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802) Adopts community fix #1827 by @diazMelgarejo (cyre). New lib/staging-guard.ts exports checkOwnedStagingDir(), the single fail-closed predicate for 'safe to recurse-delete or resume into', wired at cleanupStagingDir() (the deletion chokepoint), decideResume(), the ingest entry point, and makeStagingDir() (mints the .gstack-staging marker). Fixes #1802. Co-Authored-By: cyre <diazMelgarejo@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): don't route the remote-http persistent transcript dir through cleanup (#1802) The ingest finally ran cleanupStagingDir() unconditionally, but in remote-http mode stagingDir is the PERSISTENT transcript dir (~/.gstack/transcripts/) that gstack-brain-sync push must consume. The remote-http branch documents the intent to skip cleanup, but a finally runs on its return. Gate the call on !remoteHttpMode so the ownership guard only ever sees .staging-ingest-* dirs. Pre-gate this dir was deleted outright (broken artifacts handoff); post-#1827 it produced a false 'prevent data loss' warning every sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): preserve staging dir on internal import timeout (#1802 C3) The import-timeout branch printed 'checkpoint preserved' but the finally then deleted the staging dir: the SIGTERM forwarder's preserve branch only runs when the PARENT is signalled, and an internal runGbrainImport timeout kills just the child and returns normally. So #1611 resume-after-timeout never actually worked. Mirror the forwarder in the timeout branch: set preserveStaging only when gbrain checkpointed against this dir (finally then skips cleanup); otherwise clean up and tell the user it restages instead of falsely promising a resume. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): resume must not mark failed files as ingested (#1802 C4) On resume, stagedPathToSource was rebuilt as an empty Map, so readNewFailures() could not map gbrain's per-file failures back to source paths. Every failure fell through to state recording — failed files were silently marked ingested and never retried. Reconstruct the map from the prepared pages via a shared stagedRelPath() helper (single source of truth with writeStaged, so the keys can never drift). Exports stagedRelPath + readNewFailures for a behavioral test proving the reconstructed map recovers the failure the empty map dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * harden(sync): close staging-guard TOCTOU + fail hard on marker write (#1802 C5) checkOwnedStagingDir() now returns the realpath-resolved canonicalPath on a pass, and cleanupStagingDir() rmSync's that instead of the raw input — closing the gap where the input is a symlink swapped between the ownership check and the delete. makeStagingDir() tears down the partial dir and rethrows if the marker write fails, so a marker-less dir (which the guard would refuse forever) can never leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: v1.56.1.0 — staging-dir ownership guard + resume-correctness fixes (#1802) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci: grant the eval report job issues:write so PR comment upsert stops 401ing Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: cyre <diazMelgarejo@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #1802.
The bug
/sync-gbrain's memory-ingest resume path adopted gbrain's~/.gbrain/import-checkpoint.jsondirfield as the staging directory without proving it was one. A poisoned checkpoint —dir= the repo root, written when an autopilotgbrain importwas SIGTERM'd while CWD was the repo — was adopted on the next run and then recursively deleted bycleanupStagingDir()→rmSync(repoRoot, { recursive: true, force: true }), destroying the user's working tree. (Independently reproduced; anfs_usagetrace caughtbun … unlinkat …/<repo>.)Root cause is a trust failure, not path math: the code deleted a path it never proved it owned.
The fix — one predicate, two call sites, fail-closed
New
lib/staging-guard.tsexportscheckOwnedStagingDir(dir, gstackHome)— the single definition of "safe to recurse-delete or resume into." Ownership requires all of:realpathSyncsucceeds (collapses../symlinks first).statSync(canon).isDirectory().$GSTACK_HOMEnamed.staging-ingest-*..gitinside (screaming last-line tripwire)..gstack-stagingmarker that is a regular file (lstat, so a dir/symlink can't impersonate it), written bymakeStagingDir().Wired in at:
cleanupStagingDir()— the single deletion chokepoint (covers both thefinallyblock and the SIGTERM handler). Refusesrm -rfunless the guard passes.decideResume()— refuses to resume an unprovable checkpoint dir; returns the rejection reason so the caller logs "staging dir not usable: " instead of the misleading "gone" (the poisoned path often still exists).gstack-memory-ingestentry point — this binary is runnable directly, so it independently guardsGSTACK_INGEST_RESUME_DIRrather than trusting the env it receives.makeStagingDir()— mints the.gstack-stagingmarker at creation.Tests
test/regression-1611-gbrain-sync-resume.test.tsgrows 9 → 32 assertions: the #1802 poison matrix (repo-root +.git,/, staging-named-without-marker, nested, symlink escape) plus acheckOwnedStagingDirunit matrix. 55 tests green across the two affected files (bun test test/regression-1611-gbrain-sync-resume.test.ts test/gstack-memory-ingest.test.ts).Compatibility
decideResume()gains an optionalgstackHomeparam (defaultGSTACK_HOME) for testability; the sole caller is unchanged.Deliberate non-change (disclosed)
Pre-marker staging dirs from a run interrupted before this upgrade are refused (restage from scratch — seconds, no data loss) rather than accepted via a legacy name-based path. A legacy path would reintroduce the exact name-trust vulnerability this PR removes; the fail-closed cost is a one-time re-stage in a narrow window.
Scope
This guards gstack's own
rm -rfboundary. The durable fix is upstream in gbrain — checkpointdirshould always be a gbrain-minted staging dir, never CWD — filed as a companion issue ongarrytan/gbrain.Review
Design steelmanned by a 4-model panel (Gemini, Codex, gpt-4o, qwen3.5-27b); the in-repo eng pass (Codex + an independent code-review agent on the actual diff) hardened the marker check (regular-file), added the second-entry-point guard, and fixed the misleading log.
🤖 Generated with Claude Code
Cross-links
dircan resolve to CWD/repo root on SIGTERM — enforce staging-first checkpointing gbrain#1728