fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802) by diazMelgarejo · Pull Request #1827 · garrytan/gstack

diazMelgarejo · 2026-06-01T09:42:36Z

The bug

/sync-gbrain's memory-ingest resume path adopted gbrain's ~/.gbrain/import-checkpoint.json dir field as the staging directory without proving it was one. A poisoned checkpoint — dir = the repo root, written when an autopilot gbrain import was SIGTERM'd while CWD was the repo — was adopted on the next run and then recursively deleted by cleanupStagingDir() → rmSync(repoRoot, { recursive: true, force: true }), destroying the user's working tree. (Independently reproduced; an fs_usage trace caught bun … unlinkat …/<repo>.)

Root cause is a trust failure, not path math: the code deleted a path it never proved it owned.

The fix — one predicate, two call sites, fail-closed

New lib/staging-guard.ts exports checkOwnedStagingDir(dir, gstackHome) — the single definition of "safe to recurse-delete or resume into." Ownership requires all of:

Resolvable — realpathSync succeeds (collapses ../symlinks first).
A directory — statSync(canon).isDirectory().
Structural — canonical path is a direct child of $GSTACK_HOME named .staging-ingest-*.
Not a repo — no .git inside (screaming last-line tripwire).
Minted by us — a .gstack-staging marker that is a regular file (lstat, so a dir/symlink can't impersonate it), written by makeStagingDir().

Wired in at:

cleanupStagingDir() — the single deletion chokepoint (covers both the finally block and the SIGTERM handler). Refuses rm -rf unless the guard passes.
decideResume() — refuses to resume an unprovable checkpoint dir; returns the rejection reason so the caller logs "staging dir not usable: " instead of the misleading "gone" (the poisoned path often still exists).
gstack-memory-ingest entry point — this binary is runnable directly, so it independently guards GSTACK_INGEST_RESUME_DIR rather than trusting the env it receives.
makeStagingDir() — mints the .gstack-staging marker at creation.

Tests

test/regression-1611-gbrain-sync-resume.test.ts grows 9 → 32 assertions: the #1802 poison matrix (repo-root + .git, /, staging-named-without-marker, nested, symlink escape) plus a checkOwnedStagingDir unit matrix. 55 tests green across the two affected files (bun test test/regression-1611-gbrain-sync-resume.test.ts test/gstack-memory-ingest.test.ts).

Compatibility

decideResume() gains an optional gstackHome param (default GSTACK_HOME) for testability; the sole caller is unchanged.

Deliberate non-change (disclosed)

Pre-marker staging dirs from a run interrupted before this upgrade are refused (restage from scratch — seconds, no data loss) rather than accepted via a legacy name-based path. A legacy path would reintroduce the exact name-trust vulnerability this PR removes; the fail-closed cost is a one-time re-stage in a narrow window.

Scope

This guards gstack's own rm -rf boundary. The durable fix is upstream in gbrain — checkpoint dir should always be a gbrain-minted staging dir, never CWD — filed as a companion issue on garrytan/gbrain.

Review

Design steelmanned by a 4-model panel (Gemini, Codex, gpt-4o, qwen3.5-27b); the in-repo eng pass (Codex + an independent code-review agent on the actual diff) hardened the marker check (regular-file), added the second-entry-point guard, and fixed the misleading log.

🤖 Generated with Claude Code

Cross-links

…f repo (garrytan#1802) /sync-gbrain's memory-ingest resume path adopted gbrain's import-checkpoint.json `dir` field as the staging directory without proving it was one. A poisoned checkpoint — `dir` = the repo root, written when an autopilot `gbrain import` was SIGTERM'd while CWD was the repo — was adopted on the next run and then recursively deleted by cleanupStagingDir(), destroying the user's working tree. Root cause is a TRUST failure, not path math: code deleted a path it never proved it owned. Changes: - NEW lib/staging-guard.ts: checkOwnedStagingDir() — single fail-closed proof of "safe to recurse-delete / resume into". Requires ALL of: realpath resolves; target is a directory; it is a direct child of $GSTACK_HOME named .staging-ingest-*; contains no .git (tripwire); and carries a .gstack-staging marker that is a REGULAR FILE (lstat, so a dir/symlink can't impersonate it). - makeStagingDir(): writes the .gstack-staging marker at creation. - cleanupStagingDir(): refuses rm -rf unless the guard passes (single deletion chokepoint — covers the finally block AND the SIGTERM handler). - decideResume(gstackHome = GSTACK_HOME): pure decision; returns the rejection reason so the caller logs "refused: <why>" instead of the misleading "gone" (the poisoned path often still exists on disk). - gstack-memory-ingest second entry point: the binary is runnable directly, so it independently guards GSTACK_INGEST_RESUME_DIR with checkOwnedStagingDir rather than trusting the env it receives. - test/regression-1611: marker minted in the resume test; added the garrytan#1802 poison matrix + a checkOwnedStagingDir unit matrix (9→32 assertions; +23). Compatibility: decideResume() gains an optional gstackHome param (default GSTACK_HOME); the sole caller is unchanged. 55 tests green across the two affected files. Deliberate non-change (documented): pre-marker staging dirs from an interrupted run made BEFORE this upgrade are refused (restage from scratch — seconds, no data loss) rather than accepted via a legacy name-based path, which would reintroduce the exact name-trust vulnerability this fix removes. Reviewed by a 4-model design steelman (Gemini, Codex, gpt-4o, qwen3.5-27b) plus an in-repo eng pass (Codex + an independent code-review agent on the actual diff). All converged that the durable fix is upstream in gbrain (staging-first checkpointing) — filed as a companion issue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

diazMelgarejo · 2026-06-01T09:46:06Z

Companion root-cause issue filed upstream: garrytan/gbrain#1728 — import-checkpoint.json dir resolving to CWD/repo-root on SIGTERM (staging-first checkpointing). This PR is gstack's own fail-closed rm -rf boundary; garrytan/gbrain#1728 makes the bug class impossible at the source.

…ross upgrades gstack/gbrain `upgrade` overwrites the install dir and silently reverts any not-yet-merged upstream fix — for #1802 that re-introduces the repo-deleting rm -rf. This re-applies our fork patches after any upgrade, until they merge. - scripts/fork-patches/apply-fork-patches.sh: registry-driven, detection-gated, fail-closed driver. Detect (MARKERS grep OR `git apply --reverse --check`) → no-op if present (we patched it OR upstream merged it). Else apply (`--check` then `--3way` for drift; atomic) → VERIFY (bun test) → roll back the patch's files on failure. Never half-patches; never blind-overwrites. - patches/gstack-1802-staging-guard.{patch,meta}: first registered patch (garrytan/gstack#1827). Retire by deleting the pair once it merges. - ~/.zshrc shell-start trigger (silent no-op when patched). - Folded into the mcp-install skill (Fork Self-Heal section) — no new skill. - LESSONS + README capture the method and the fail-closed trust-boundary principle. Proven: re-applies + passes bun test (55) on an unpatched worktree; idempotent no-op on the patched live install. Fixed a worktree-gitlink detection bug found by the re-apply test (.git is a file in a worktree → use rev-parse). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jbetala7 · 2026-06-02T06:06:09Z

Validated on main (3bef43bc): the destructive vector is real and the guard is correctly fail-closed.

On main the resume path adopts import-checkpoint.json's dir as the staging directory without proving it is one, and cleanupStagingDir() then rm -rfs it — so a checkpoint whose dir is the repo root (written when an autopilot gbrain import is SIGTERM'd while CWD is the repo) leads to the repo being recursively deleted. Your regression test #1802 checkpoint.dir = repo root with .git -> stale-staging-missing (not resumed) proves the point: important.py survives the verdict.

checkOwnedStagingDir is sound — realpath resolution, the STAGING_PREFIX basename check, the ownership marker, and the .git tripwire, with any failure => refuse. Fail-closed is the right posture for an rm -rf guard, and no competing open PR touches this path.

One question for the maintainer, not a blocker: since the ownership marker is minted by makeStagingDir, in-flight checkpoints written before this change won't carry the marker and will be refused on the next resume (a one-time re-import, never a wrong delete) — if I'm reading the precedence right. Worth a line in the PR body so the behavior on upgrade isn't surprising. Otherwise LGTM.

garrytan · 2026-06-03T14:41:45Z

Thank you @diazMelgarejo — this is a clean, well-reasoned fix. Moving it to a base-repo branch so the eval/E2E CI gets secrets (fork PRs don't), and bundling three resume-machinery bugs an eng review + Codex pass surfaced in the same code (remote-http cleanup collision, timeout-resume deletion, resume failure-mapping) plus TOCTOU/marker hardening. Your commit is the first of six and you're credited in the CHANGELOG. Continues in #1856.

@diazMelgarejo

… fixes (#1802) (#1856) * fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802) Adopts community fix #1827 by @diazMelgarejo (cyre). New lib/staging-guard.ts exports checkOwnedStagingDir(), the single fail-closed predicate for 'safe to recurse-delete or resume into', wired at cleanupStagingDir() (the deletion chokepoint), decideResume(), the ingest entry point, and makeStagingDir() (mints the .gstack-staging marker). Fixes #1802. Co-Authored-By: cyre <diazMelgarejo@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): don't route the remote-http persistent transcript dir through cleanup (#1802) The ingest finally ran cleanupStagingDir() unconditionally, but in remote-http mode stagingDir is the PERSISTENT transcript dir (~/.gstack/transcripts/) that gstack-brain-sync push must consume. The remote-http branch documents the intent to skip cleanup, but a finally runs on its return. Gate the call on !remoteHttpMode so the ownership guard only ever sees .staging-ingest-* dirs. Pre-gate this dir was deleted outright (broken artifacts handoff); post-#1827 it produced a false 'prevent data loss' warning every sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): preserve staging dir on internal import timeout (#1802 C3) The import-timeout branch printed 'checkpoint preserved' but the finally then deleted the staging dir: the SIGTERM forwarder's preserve branch only runs when the PARENT is signalled, and an internal runGbrainImport timeout kills just the child and returns normally. So #1611 resume-after-timeout never actually worked. Mirror the forwarder in the timeout branch: set preserveStaging only when gbrain checkpointed against this dir (finally then skips cleanup); otherwise clean up and tell the user it restages instead of falsely promising a resume. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(sync): resume must not mark failed files as ingested (#1802 C4) On resume, stagedPathToSource was rebuilt as an empty Map, so readNewFailures() could not map gbrain's per-file failures back to source paths. Every failure fell through to state recording — failed files were silently marked ingested and never retried. Reconstruct the map from the prepared pages via a shared stagedRelPath() helper (single source of truth with writeStaged, so the keys can never drift). Exports stagedRelPath + readNewFailures for a behavioral test proving the reconstructed map recovers the failure the empty map dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * harden(sync): close staging-guard TOCTOU + fail hard on marker write (#1802 C5) checkOwnedStagingDir() now returns the realpath-resolved canonicalPath on a pass, and cleanupStagingDir() rmSync's that instead of the raw input — closing the gap where the input is a symlink swapped between the ownership check and the delete. makeStagingDir() tears down the partial dir and rethrows if the marker write fails, so a marker-less dir (which the guard would refuse forever) can never leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: v1.56.1.0 — staging-dir ownership guard + resume-correctness fixes (#1802) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci: grant the eval report job issues:write so PR comment upsert stops 401ing Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: cyre <diazMelgarejo@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

diazMelgarejo mentioned this pull request Jun 1, 2026

sync-gbrain memory stage rm -rf'd my repo root: resume path trusts checkpoint.dir without verifying it's a gstack staging dir #1802

Closed

diazMelgarejo mentioned this pull request Jun 1, 2026

import-checkpoint.json dir can resolve to CWD/repo root on SIGTERM — enforce staging-first checkpointing garrytan/gbrain#1728

Open

Merge origin/main into staging ownership guard

f8fcf45

diazMelgarejo mentioned this pull request Jun 1, 2026

fix(import): make checkpoints staging-first (#1728) garrytan/gbrain#1731

Open

garrytan mentioned this pull request Jun 3, 2026

v1.56.1.0 fix(sync): staging-dir ownership guard + resume-correctness fixes (#1802) #1856

Merged

garrytan closed this Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827

fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827
diazMelgarejo wants to merge 2 commits into
garrytan:mainfrom
diazMelgarejo:fix/1802-staging-ownership-guard

diazMelgarejo commented Jun 1, 2026 •

edited

Loading

Uh oh!

diazMelgarejo commented Jun 1, 2026

Uh oh!

jbetala7 commented Jun 2, 2026

Uh oh!

garrytan commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

diazMelgarejo commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The bug

The fix — one predicate, two call sites, fail-closed

Tests

Compatibility

Deliberate non-change (disclosed)

Scope

Review

Cross-links

Uh oh!

diazMelgarejo commented Jun 1, 2026

Uh oh!

jbetala7 commented Jun 2, 2026

Uh oh!

garrytan commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

diazMelgarejo commented Jun 1, 2026 •

edited

Loading