Skip to content

fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827

Closed
diazMelgarejo wants to merge 2 commits into
garrytan:mainfrom
diazMelgarejo:fix/1802-staging-ownership-guard
Closed

fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)#1827
diazMelgarejo wants to merge 2 commits into
garrytan:mainfrom
diazMelgarejo:fix/1802-staging-ownership-guard

Conversation

@diazMelgarejo

@diazMelgarejo diazMelgarejo commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Fixes #1802.

The bug

/sync-gbrain's memory-ingest resume path adopted gbrain's ~/.gbrain/import-checkpoint.json dir field as the staging directory without proving it was one. A poisoned checkpoint — dir = the repo root, written when an autopilot gbrain import was SIGTERM'd while CWD was the repo — was adopted on the next run and then recursively deleted by cleanupStagingDir()rmSync(repoRoot, { recursive: true, force: true }), destroying the user's working tree. (Independently reproduced; an fs_usage trace caught bun … unlinkat …/<repo>.)

Root cause is a trust failure, not path math: the code deleted a path it never proved it owned.

The fix — one predicate, two call sites, fail-closed

New lib/staging-guard.ts exports checkOwnedStagingDir(dir, gstackHome) — the single definition of "safe to recurse-delete or resume into." Ownership requires all of:

  1. ResolvablerealpathSync succeeds (collapses ../symlinks first).
  2. A directorystatSync(canon).isDirectory().
  3. Structural — canonical path is a direct child of $GSTACK_HOME named .staging-ingest-*.
  4. Not a repo — no .git inside (screaming last-line tripwire).
  5. Minted by us — a .gstack-staging marker that is a regular file (lstat, so a dir/symlink can't impersonate it), written by makeStagingDir().

Wired in at:

  • cleanupStagingDir() — the single deletion chokepoint (covers both the finally block and the SIGTERM handler). Refuses rm -rf unless the guard passes.
  • decideResume() — refuses to resume an unprovable checkpoint dir; returns the rejection reason so the caller logs "staging dir not usable: " instead of the misleading "gone" (the poisoned path often still exists).
  • gstack-memory-ingest entry point — this binary is runnable directly, so it independently guards GSTACK_INGEST_RESUME_DIR rather than trusting the env it receives.
  • makeStagingDir() — mints the .gstack-staging marker at creation.

Tests

test/regression-1611-gbrain-sync-resume.test.ts grows 9 → 32 assertions: the #1802 poison matrix (repo-root + .git, /, staging-named-without-marker, nested, symlink escape) plus a checkOwnedStagingDir unit matrix. 55 tests green across the two affected files (bun test test/regression-1611-gbrain-sync-resume.test.ts test/gstack-memory-ingest.test.ts).

Compatibility

decideResume() gains an optional gstackHome param (default GSTACK_HOME) for testability; the sole caller is unchanged.

Deliberate non-change (disclosed)

Pre-marker staging dirs from a run interrupted before this upgrade are refused (restage from scratch — seconds, no data loss) rather than accepted via a legacy name-based path. A legacy path would reintroduce the exact name-trust vulnerability this PR removes; the fail-closed cost is a one-time re-stage in a narrow window.

Scope

This guards gstack's own rm -rf boundary. The durable fix is upstream in gbrain — checkpoint dir should always be a gbrain-minted staging dir, never CWD — filed as a companion issue on garrytan/gbrain.

Review

Design steelmanned by a 4-model panel (Gemini, Codex, gpt-4o, qwen3.5-27b); the in-repo eng pass (Codex + an independent code-review agent on the actual diff) hardened the marker check (regular-file), added the second-entry-point guard, and fixed the misleading log.

🤖 Generated with Claude Code

Cross-links

…f repo (garrytan#1802)

/sync-gbrain's memory-ingest resume path adopted gbrain's import-checkpoint.json
`dir` field as the staging directory without proving it was one. A poisoned
checkpoint — `dir` = the repo root, written when an autopilot `gbrain import`
was SIGTERM'd while CWD was the repo — was adopted on the next run and then
recursively deleted by cleanupStagingDir(), destroying the user's working tree.

Root cause is a TRUST failure, not path math: code deleted a path it never
proved it owned.

Changes:
- NEW lib/staging-guard.ts: checkOwnedStagingDir() — single fail-closed proof of
  "safe to recurse-delete / resume into". Requires ALL of: realpath resolves;
  target is a directory; it is a direct child of $GSTACK_HOME named
  .staging-ingest-*; contains no .git (tripwire); and carries a .gstack-staging
  marker that is a REGULAR FILE (lstat, so a dir/symlink can't impersonate it).
- makeStagingDir(): writes the .gstack-staging marker at creation.
- cleanupStagingDir(): refuses rm -rf unless the guard passes (single deletion
  chokepoint — covers the finally block AND the SIGTERM handler).
- decideResume(gstackHome = GSTACK_HOME): pure decision; returns the rejection
  reason so the caller logs "refused: <why>" instead of the misleading "gone"
  (the poisoned path often still exists on disk).
- gstack-memory-ingest second entry point: the binary is runnable directly, so
  it independently guards GSTACK_INGEST_RESUME_DIR with checkOwnedStagingDir
  rather than trusting the env it receives.
- test/regression-1611: marker minted in the resume test; added the garrytan#1802 poison
  matrix + a checkOwnedStagingDir unit matrix (9→32 assertions; +23).

Compatibility: decideResume() gains an optional gstackHome param (default
GSTACK_HOME); the sole caller is unchanged. 55 tests green across the two
affected files.

Deliberate non-change (documented): pre-marker staging dirs from an interrupted
run made BEFORE this upgrade are refused (restage from scratch — seconds, no
data loss) rather than accepted via a legacy name-based path, which would
reintroduce the exact name-trust vulnerability this fix removes.

Reviewed by a 4-model design steelman (Gemini, Codex, gpt-4o, qwen3.5-27b) plus
an in-repo eng pass (Codex + an independent code-review agent on the actual
diff). All converged that the durable fix is upstream in gbrain
(staging-first checkpointing) — filed as a companion issue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@diazMelgarejo

Copy link
Copy Markdown
Contributor Author

Companion root-cause issue filed upstream: garrytan/gbrain#1728import-checkpoint.json dir resolving to CWD/repo-root on SIGTERM (staging-first checkpointing). This PR is gstack's own fail-closed rm -rf boundary; garrytan/gbrain#1728 makes the bug class impossible at the source.

diazMelgarejo added a commit to diazMelgarejo/orama-system that referenced this pull request Jun 1, 2026
…ross upgrades

gstack/gbrain `upgrade` overwrites the install dir and silently reverts any
not-yet-merged upstream fix — for #1802 that re-introduces the repo-deleting
rm -rf. This re-applies our fork patches after any upgrade, until they merge.

- scripts/fork-patches/apply-fork-patches.sh: registry-driven, detection-gated,
  fail-closed driver. Detect (MARKERS grep OR `git apply --reverse --check`) →
  no-op if present (we patched it OR upstream merged it). Else apply
  (`--check` then `--3way` for drift; atomic) → VERIFY (bun test) → roll back
  the patch's files on failure. Never half-patches; never blind-overwrites.
- patches/gstack-1802-staging-guard.{patch,meta}: first registered patch
  (garrytan/gstack#1827). Retire by deleting the pair once it merges.
- ~/.zshrc shell-start trigger (silent no-op when patched).
- Folded into the mcp-install skill (Fork Self-Heal section) — no new skill.
- LESSONS + README capture the method and the fail-closed trust-boundary principle.

Proven: re-applies + passes bun test (55) on an unpatched worktree; idempotent
no-op on the patched live install. Fixed a worktree-gitlink detection bug found
by the re-apply test (.git is a file in a worktree → use rev-parse).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cursor Bot pushed a commit to diazMelgarejo/orama-system that referenced this pull request Jun 1, 2026
…ross upgrades

gstack/gbrain `upgrade` overwrites the install dir and silently reverts any
not-yet-merged upstream fix — for #1802 that re-introduces the repo-deleting
rm -rf. This re-applies our fork patches after any upgrade, until they merge.

- scripts/fork-patches/apply-fork-patches.sh: registry-driven, detection-gated,
  fail-closed driver. Detect (MARKERS grep OR `git apply --reverse --check`) →
  no-op if present (we patched it OR upstream merged it). Else apply
  (`--check` then `--3way` for drift; atomic) → VERIFY (bun test) → roll back
  the patch's files on failure. Never half-patches; never blind-overwrites.
- patches/gstack-1802-staging-guard.{patch,meta}: first registered patch
  (garrytan/gstack#1827). Retire by deleting the pair once it merges.
- ~/.zshrc shell-start trigger (silent no-op when patched).
- Folded into the mcp-install skill (Fork Self-Heal section) — no new skill.
- LESSONS + README capture the method and the fail-closed trust-boundary principle.

Proven: re-applies + passes bun test (55) on an unpatched worktree; idempotent
no-op on the patched live install. Fixed a worktree-gitlink detection bug found
by the re-apply test (.git is a file in a worktree → use rev-parse).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jbetala7

jbetala7 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Validated on main (3bef43bc): the destructive vector is real and the guard is correctly fail-closed.

On main the resume path adopts import-checkpoint.json's dir as the staging directory without proving it is one, and cleanupStagingDir() then rm -rfs it — so a checkpoint whose dir is the repo root (written when an autopilot gbrain import is SIGTERM'd while CWD is the repo) leads to the repo being recursively deleted. Your regression test #1802 checkpoint.dir = repo root with .git -> stale-staging-missing (not resumed) proves the point: important.py survives the verdict.

checkOwnedStagingDir is sound — realpath resolution, the STAGING_PREFIX basename check, the ownership marker, and the .git tripwire, with any failure => refuse. Fail-closed is the right posture for an rm -rf guard, and no competing open PR touches this path.

One question for the maintainer, not a blocker: since the ownership marker is minted by makeStagingDir, in-flight checkpoints written before this change won't carry the marker and will be refused on the next resume (a one-time re-import, never a wrong delete) — if I'm reading the precedence right. Worth a line in the PR body so the behavior on upgrade isn't surprising. Otherwise LGTM.

@garrytan

garrytan commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Thank you @diazMelgarejo — this is a clean, well-reasoned fix. Moving it to a base-repo branch so the eval/E2E CI gets secrets (fork PRs don't), and bundling three resume-machinery bugs an eng review + Codex pass surfaced in the same code (remote-http cleanup collision, timeout-resume deletion, resume failure-mapping) plus TOCTOU/marker hardening. Your commit is the first of six and you're credited in the CHANGELOG. Continues in #1856.

@garrytan garrytan closed this Jun 3, 2026
garrytan added a commit that referenced this pull request Jun 7, 2026
… fixes (#1802) (#1856)

* fix(sync): fail-closed staging-dir ownership guard — prevent rm -rf of repo (#1802)

Adopts community fix #1827 by @diazMelgarejo (cyre). New lib/staging-guard.ts
exports checkOwnedStagingDir(), the single fail-closed predicate for 'safe to
recurse-delete or resume into', wired at cleanupStagingDir() (the deletion
chokepoint), decideResume(), the ingest entry point, and makeStagingDir()
(mints the .gstack-staging marker).

Fixes #1802.

Co-Authored-By: cyre <diazMelgarejo@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sync): don't route the remote-http persistent transcript dir through cleanup (#1802)

The ingest finally ran cleanupStagingDir() unconditionally, but in remote-http
mode stagingDir is the PERSISTENT transcript dir (~/.gstack/transcripts/) that
gstack-brain-sync push must consume. The remote-http branch documents the intent
to skip cleanup, but a finally runs on its return. Gate the call on
!remoteHttpMode so the ownership guard only ever sees .staging-ingest-* dirs.
Pre-gate this dir was deleted outright (broken artifacts handoff); post-#1827 it
produced a false 'prevent data loss' warning every sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sync): preserve staging dir on internal import timeout (#1802 C3)

The import-timeout branch printed 'checkpoint preserved' but the finally then
deleted the staging dir: the SIGTERM forwarder's preserve branch only runs when
the PARENT is signalled, and an internal runGbrainImport timeout kills just the
child and returns normally. So #1611 resume-after-timeout never actually worked.
Mirror the forwarder in the timeout branch: set preserveStaging only when gbrain
checkpointed against this dir (finally then skips cleanup); otherwise clean up
and tell the user it restages instead of falsely promising a resume.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sync): resume must not mark failed files as ingested (#1802 C4)

On resume, stagedPathToSource was rebuilt as an empty Map, so readNewFailures()
could not map gbrain's per-file failures back to source paths. Every failure
fell through to state recording — failed files were silently marked ingested and
never retried. Reconstruct the map from the prepared pages via a shared
stagedRelPath() helper (single source of truth with writeStaged, so the keys
can never drift). Exports stagedRelPath + readNewFailures for a behavioral test
proving the reconstructed map recovers the failure the empty map dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* harden(sync): close staging-guard TOCTOU + fail hard on marker write (#1802 C5)

checkOwnedStagingDir() now returns the realpath-resolved canonicalPath on a
pass, and cleanupStagingDir() rmSync's that instead of the raw input — closing
the gap where the input is a symlink swapped between the ownership check and the
delete. makeStagingDir() tears down the partial dir and rethrows if the marker
write fails, so a marker-less dir (which the guard would refuse forever) can
never leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: v1.56.1.0 — staging-dir ownership guard + resume-correctness fixes (#1802)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: grant the eval report job issues:write so PR comment upsert stops 401ing

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: cyre <diazMelgarejo@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sync-gbrain memory stage rm -rf'd my repo root: resume path trusts checkpoint.dir without verifying it's a gstack staging dir

3 participants