v0.22.11 feat: storage tiering — db_tracked vs db_only directories#494
Merged
v0.22.11 feat: storage tiering — db_tracked vs db_only directories#494
Conversation
Brain repos scaling to 200K+ files. Bulk data (tweets, articles, transcripts) bloats git repos and slows operations. New storage config in gbrain.yml lets users declare git-tracked and supabase-only directories. Changes: - New config: storage.git_tracked and storage.supabase_only in gbrain.yml - gbrain sync auto-manages .gitignore for supabase-only paths - gbrain export --restore-only restores missing supabase-only files from DB - New gbrain storage status command shows tier breakdown - Config validation warns on conflicts - 8 tests passing, full docs at docs/storage-tiering.md Backward compatible — systems without gbrain.yml work unchanged.
Single source of truth for "what brain repo are we operating against?" Replaces ad-hoc raw SQL in storage.ts:38 (Issue #3 of eng review). Used by both gbrain storage status and gbrain export --restore-only. Returns null on miss, throws on DB error. Composes with the existing resolveSourceId chain so it honors --source flag / GBRAIN_SOURCE env / .gbrain-source dotfile / longest-prefix CWD match / brain-level default. 4 new test cases covering happy path, missing local_path, DB error propagation, and CWD-prefix resolution priority. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original storage-config.ts called gray-matter on a delimiter-less YAML
file. Gray-matter only parses YAML inside `---` frontmatter blocks; without
delimiters, it returns `{data: {}}`. Result: loadStorageConfig() always
returned null, the entire feature was a silent no-op for every user.
Original eng review's P0 confidence-9 finding (Issue #1).
Replaces gray-matter with a small dedicated parser for the gbrain.yml shape
(top-level `storage:` section, two array-valued nested keys). Yaml-lite was
considered first, but its flat key:value design doesn't handle nested
arrays. The dedicated parser is ~50 lines and trades expressiveness for
zero-dep, predictable parsing of a file format we control.
Adds the Issue #1B sanity warning (locked B): when gbrain.yml exists but
has no storage section (or empty arrays), warn once-per-process so the
user sees their config didn't take. The single test that would have caught
the original P0 — write a real gbrain.yml, call loadStorageConfig, assert
non-null — now exists.
Also tightens loadStorageConfig per D36: distinguishes "absent" (silent
null) from "unreadable" (throws). The previous code silently swallowed
read errors, hiding broken installs.
8 new test cases: real-disk happy path, comments + blank lines, quoted
values, missing storage section warning, empty section warning,
once-per-process warning suppression, unreadable file behavior, and the
existing helper tests (validation, tier matching, edge cases) all still
pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vendor-specific names "supabase_only" and "git_tracked" hardcoded a backend (Supabase) into the config schema. gbrain ships two engines — PGLite and Postgres-via-Supabase. The canonical distinction is "lives in the brain DB only" vs "lives in the brain DB and on disk under git." Both work on either engine. Renamed throughout (Issue #4 of eng review): git_tracked → db_tracked supabase_only → db_only isGitTracked() → isDbTracked() isSupabaseOnly() → isDbOnly() StorageTier 'git_tracked'/'supabase_only' → 'db_tracked'/'db_only' Backward compatibility (D3 lock): loadStorageConfig accepts both shapes. Loader resolution order per the eng-review pass-2 finding: parse YAML → if canonical keys present use them, else if deprecated keys present map to canonical AND emit once-per-process deprecation warning → THEN run validation. Validation always sees the canonical shape so error messages reference db_tracked/db_only regardless of which keys the user wrote. The deprecation warning suggests `gbrain doctor --fix` for an automated rename (D72 — fix path lands in step 7). When both shapes coexist in one file, canonical wins and a stronger warning fires ("deprecated keys ignored — remove them"). Aliases isGitTracked/isSupabaseOnly kept for now to avoid churning the sync.ts / export.ts / storage.ts call sites in this commit; they'll be removed in a follow-up step. Storage.ts's tier-bucket initializers and output strings updated. ASCII output replaces unicode box-drawing per D10. gbrain.yml example file updated to canonical keys with explanatory comments. 2 new test cases: deprecated-key fallback (asserts both shapes load correctly with warning), canonical-wins-over-deprecated (asserts the "both shapes coexist" path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #13 of the eng review: storage.ts and export.ts loaded every page in the brain (limit: 1_000_000) to check tier membership. On the 200K-page brains this feature targets, that's the wall-clock and memory landmine the feature exists to fix. Adds an optional `slugPrefix` field to PageFilters. Both engines implement it as `WHERE slug LIKE prefix || '%' ESCAPE '\'`, with literal escaping of LIKE metacharacters (%, _, \) so user-supplied prefixes like `media/x/` are treated as exact string prefixes. Performance: the (source_id, slug) UNIQUE constraint on the pages table gives both engines a btree index that supports LIKE-prefix range scans. An EXPLAIN on Postgres confirms the index range scan rather than a seq scan. PGLite has the same index shape via pglite-schema.ts. Consumers updated: - export.ts: --slug-prefix flag now goes engine-side (no in-memory .filter(...)). The --restore-only path queries each db_only directory with slugPrefix in a loop instead of one full-table scan, with seen-set deduplication and disk-existence check inline. - storage.ts: keeps the full-scan path because storage-status needs the "unspecified" bucket count, which can't be computed without enumerating every page. Comment notes that step 5 (single-walk filesystem scan) will reduce per-page disk syscall cost. 2 new test cases on PGLiteEngine: slugPrefix happy path (3 tier dirs, asserts only matching slugs return) and metacharacter escape regression (asserts safe/ doesn't match unrelated slugs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #14 of the eng review: storage.ts called existsSync + statSync per-page in a synchronous loop. On a 200K-page brain that's 400K syscalls serialized. Wall-clock landmine. Adds src/core/disk-walk.ts with walkBrainRepo(repoPath) — one recursive readdirSync walk, builds a Map<slug, {size, mtimeMs}>. Storage.ts looks up each DB page in the map (O(1)) instead of stat-checking on demand. Slug derivation matches the pages-table convention: people/alice.md on disk becomes people/alice as the map key. Skipped during walk: - dot-directories (.git, .gbrain, .vscode, etc) — not part of the brain namespace - node_modules — guards against accidentally walking into imported repos - non-.md files (sidecar JSON, binaries) — tracked by the brain through the files table, not by slug Reusable: future commands (gbrain doctor's storage_tiering check, the optional autopilot tier-fix path) get the same walk for free. 9 new test cases: empty dir, nonexistent dir, top-level files, nested dirs, dot-dir skipping, node_modules skipping, non-.md filtering, size capture, mtimeMs capture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #5 + D6 of the eng review: tier matching used slug.startsWith(dir), which falsely matches 'media/xerox/foo' against 'media/x' if a user wrote the directory without a trailing slash. The new matcher requires the configured directory to end with `/` and treats it as a canonical path-segment ancestor: media/x/ matches media/x/tweet-1 ✓ media/x/ doesn't media/xerox/foo ✗ media/x refused media/x/tweet-1 (matcher requires trailing /) Non-canonical input (no trailing slash) is refused outright. Step 7's auto-normalizing validator converts user-written 'media/x' → 'media/x/' on load, so the matcher never sees non-canonical input from real configs. The behavior tested here is the strict matcher's contract. Regression test pins the media/xerox collision case explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
D7+D8 of the eng review: validation was warnings-only. Users miss warnings.
Now:
- Cosmetic: missing trailing slash auto-corrected, one-time info note
showing what changed ("normalized 2 storage paths: 'people' →
'people/', 'media/x' → 'media/x/'"). Once-per-process to keep noise low.
- Semantic: same directory in both tiers throws StorageConfigError.
Ambiguous routing — does media/ win as db_tracked or db_only? — is a
real bug the user must fix. Caller propagates to the CLI for a clean
exit-1 with actionable message.
loadStorageConfig now applies normalize+validate after merging deprecated
keys, so the path-segment matcher (step 6) only ever sees canonical
trailing-slash directories.
The pure validateStorageConfig kept for callers who want the warnings list
without the auto-fix side effects (gbrain doctor's reporting path).
2 new test cases: auto-normalize round-trip with warning text assertion,
overlap throws StorageConfigError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #2 of the eng review: manageGitignore was defined and never invoked. Docs claimed "auto-managed by gbrain" — false. Users hit a .gitignore that never updated and committed db_only directories anyway. Wire-up: runSync now calls manageGitignore after each successful performSync return, in both watch and one-shot modes. Eng review pass-2 finding #1: skip on dry_run AND blocked_by_failures status. A sync that aborted partway has stale state; mutating .gitignore based on a partially-loaded config invites drift. Failure-skip test added (uses .gitignore-as-a-directory to simulate write failure; asserts warning fired and disk wasn't corrupted). Hardened manageGitignore itself with three additional behaviors: - GBRAIN_NO_GITIGNORE=1 escape hatch (D23) for shared-repo setups where a maintainer wants gbrain to leave .gitignore alone. - Submodule detection (D49). When repoPath/.git is a regular file (gitdir: ... pointer), the repo is a git submodule. Submodule .gitignore changes don't survive parent submodule updates, so we skip with an actionable warning ("add db_only directories to your parent repo's .gitignore manually"). - Graceful failure (D9). Read errors, write errors, and StorageConfigError (overlap from step 7) all log a warning and return — sync's primary job (moving data) shouldn't die because of a side-effect on .gitignore. manageGitignore is now exported (previously private) so the storage-sync test file can hit it directly without spinning up sync. 9 new test cases: no-op without gbrain.yml, no-op with empty db_only, happy-path append, idempotency (run twice, single entry), preservation of user-written rules, GBRAIN_NO_GITIGNORE skip, submodule skip, .git-directory normal path, write-failure graceful warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…9/15) D5 of the eng review: gbrain export --restore-only without --repo silently fell through to the regular export path, dumping every page in the database to the wrong directory. Hard regression risk. Now exits 1 with an actionable message when --restore-only has no --repo AND no configured default source. Resolution order: 1. Explicit --repo flag 2. Typed sources.getDefault() (reuses step 1's accessor) 3. Hard error — never fall through to cwd storage.ts:38 also bypassed BrainEngine with raw SQL and a bare try/catch (Issue #3 + Issue #9). Replaced with the same typed getDefaultSourcePath() — single source of truth, errors propagate cleanly to the user, no silent cwd fallback. Regular export (no --restore-only) keeps its current behavior per D26: exports include everything, --repo is optional. 4 new test cases on PGLite in-memory: - hard-errors with no --repo + no default - explicit --repo wins - falls back to sources default local_path - non-restore export does not require --repo Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…step 10/15) Issue #10 of the eng review: getStorageStatus and runStorageStatus mixed data gathering, JSON serialization, and human-readable output in one function. Hard to test, hard to reuse, mismatched the orphans.ts pattern that CLAUDE.md cites as the precedent. Now three pure functions + a thin dispatcher: getStorageStatus(engine, repoPath) — async, returns StorageStatusResult. Side effects: engine.listPages + one walkBrainRepo (Issue #14). Exported so MCP exposure (D14) and gbrain doctor (D13) can consume the same data without re-running the loop. formatStorageStatusJson(result) — pure, returns indented JSON. Stable contract on the StorageStatusResult shape, suitable for orchestrators. formatStorageStatusHuman(result) — pure, returns ASCII text (D10 — no unicode box-drawing). Composable into other commands later. runStorageStatus(engine, args) — thin dispatcher: parses --repo / --json, calls getStorageStatus, picks a formatter, prints. 8 new test cases on the formatters: JSON parse round-trip, null-config fallback, missing-files capped at 10 with rollup, ASCII-only assertion (D10 regression guard), warnings inline, configuration listing, disk- usage block omitted when zero bytes. The StorageStatusResult interface is now exported as a public type, so gbrain doctor's storage_tiering check can build its own findings from the same shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #11 of the eng review: pagesByTier (page counts) and diskUsageByTier (byte totals) shared the same structural type (Record<StorageTier, number>). Both are tier-keyed numeric maps but carry semantically different units. A future bug that swaps them at a call site (e.g., displaying disk bytes where the count belongs) wouldn't trip the compiler. Replaced with distinct nominal types via a brand field. Structurally identical at runtime (no overhead) but compile-time disjoint — TypeScript catches accidental cross-assignment. PageCountsByTier { db_tracked, db_only, unspecified } : numbers (count) DiskUsageByTier { db_tracked, db_only, unspecified } : numbers (bytes) Both initialized in getStorageStatus, both threaded into StorageStatusResult, both consumed by formatStorageStatusHuman / formatStorageStatusJson without further changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
D4: storage tiering on PGLite is a partial feature. The "DB" the pages
live in IS the local file gbrain uses for everything else, so "db_only"
has no real offload effect. The .gitignore management still helps
(keeps bulk content out of git history), so we warn and proceed —
not refuse.
Two warning sites (once-per-process each via module-local flags):
- storage status: warns at runStorageStatus entry
- sync: warns inside manageGitignore when engineKind='pglite' and
config has db_only entries
Both phrased actionably ("To get full tiering, migrate to Postgres
with `gbrain migrate --to supabase`").
manageGitignore signature now takes an optional `engineKind` param.
runSync passes engine.kind. Stand-alone callers (tests, future
gbrain doctor --fix path) can omit it.
New test: test/storage-pglite.test.ts — D8 + D4 lifecycle. 6 cases:
engine.kind assertion, getStorageStatus loading gbrain.yml + reporting
tier counts, manageGitignore PGLite-warn (once per process), Postgres
no-warn, slugPrefix on PGLite, end-to-end (config + putPage + status
+ gitignore).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #7 of the eng review: all four new files in the original storage-tiering branch lacked POSIX trailing newlines. Linters complain, git diffs phantom-flag every future edit. We've been adding newlines as each file landed; this commit catches the regression class. scripts/check-trailing-newline.sh: - sibling to check-jsonb-pattern.sh / check-progress-to-stdout.sh per CLAUDE.md's CI guard pattern - portable to bash 3.2 (macOS default; no mapfile, no associative arrays) - covers src/**, test/**, gbrain.yml, top-level *.md - reports each missing file by path and exits 1 Wired into `bun run test` between progress-to-stdout and typecheck. Also fixed docs/storage-tiering.md (pre-existing missing newline from the original branch — caught by the new guard on first run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ering # Conflicts: # package.json # src/cli.ts
…g.md (step 15/15)
VERSION → 0.23.0 (minor bump for new feature surface).
CHANGELOG entry in Garry voice with the canonical format:
- Two-line bold headline ("Storage tiering, finally working...")
- Lead paragraph naming what was broken before and what users get now
- "Numbers that matter" before/after table for the 6 things that
actually changed
- "What this means for your brain" closer
- "To take advantage of v0.23.0" self-repair block (per CLAUDE.md
convention) — 6 numbered steps users can follow
- Itemized changes split into critical fixes / new+renamed surface /
architecture cleanup / tests + CI guards
CLAUDE.md "Key files" gains four new entries: storage-config.ts,
disk-walk.ts, the v0.23.0 storage.ts shape, and gbrain.yml itself.
README.md gains a new "Storage tiering" section between Skillify and
Getting Data In with the canonical example + commands + link to the
full guide.
docs/storage-tiering.md rewritten end-to-end with canonical key names
(db_tracked / db_only), v0.23.0 hardening details (idempotency,
submodule detection, GBRAIN_NO_GITIGNORE, dry-run gating), the
resolution chain for --restore-only, the auto-normalize +
throw-on-overlap validator, and the PGLite engine note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the v0.23.0 plan: full lifecycle E2E against real Postgres.
- engine.kind === 'postgres' assertion
- Full lifecycle: write 4 pages (1 db_tracked, 2 db_only, 1 unspecified)
→ getStorageStatus reports correct tier counts → human formatter
renders → manageGitignore writes managed block → idempotency check
→ getDefaultSourcePath() resolves the configured local_path.
- Container restart simulation: 2 db_only pages in DB, files missing
on disk → status.missingFiles.length === 2 → slugPrefix engine
filter on Postgres returns exactly the tier slugs.
- slugPrefix index-based range scan regression: 50 media/x/* + 50
people/p-* pages → slugPrefix='media/x/' returns exactly 50.
- getDefaultSourcePath returns null when default source has no
local_path (the hard-error path that replaces the original silent
cwd fallback).
- manageGitignore on Postgres engine does NOT emit the PGLite
soft-warn (cross-engine assertion).
Skips gracefully when DATABASE_URL is unset, per CLAUDE.md E2E pattern.
Run via: DATABASE_URL=... bun test test/e2e/storage-tiering.test.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the minor bump back to a patch-style version on the v0.22 line. Storage tiering ships within the v0.22.x train alongside the recent fix waves. Updates VERSION, package.json, CHANGELOG header + body refs, CLAUDE.md Key files annotations, README.md section heading, and the docs/storage-tiering.md backward-compat note.
Sibling workspaces claimed v0.22.10 in the queue. This branch advances to v0.22.11 to keep the version monotonic on master. Updates VERSION, package.json, CHANGELOG header + body refs, CLAUDE.md Key files annotations, README.md section heading, and the docs/storage-tiering.md backward-compat note.
Codex found 4 real issues during pre-landing review of v0.22.11 diff:
[P0] export --restore-only fell through to full export when
storageConfig was null (no gbrain.yml present). On older or
misconfigured brains, the recovery command would silently dump the
entire database. src/commands/export.ts now refuses with an actionable
error before any page query fires — matches the D5 lock spirit
("never silently fall through").
[P1] manageGitignore wire-up only fired when --repo was passed
explicitly. performSync resolves the repo from sync.repo_path or
sources.local_path, so the common `gbrain sync` path (after
setup, no flag) never updated .gitignore. src/commands/sync.ts now
uses the same source-resolver chain as the rest of /ship: opts.repoPath
→ getDefaultSourcePath → null. Fires in both watch and one-shot modes.
[P2] getDefaultSourcePath only consulted sources.local_path, missing
the legacy global sync.repo_path config key that pre-v0.18 brains use.
Added a fallback to engine.getConfig('sync.repo_path') when the
sources row has NULL local_path. Pre-v0.18 brains now work without
forcing a `gbrain sources add . --path .` migration.
[P2] sync --all multi-source loop never called manageGitignore even
though src.local_path was already known. Each source now gets its own
gitignore update on successful sync.
Tests:
- test/storage-export.test.ts: replaced the old "falls through to
full export" test with one that asserts the new refusal path
(storage-tiering config required for --restore-only).
- test/source-resolver.test.ts: added a fallback test exercising the
legacy sync.repo_path code path for pre-v0.18 brains.
- All 78 storage-tiering tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CLAUDE.md: "Run `bun run build:llms` after adding a new doc." The README's new Storage tiering section + the rewritten docs/storage-tiering.md changed the inlined bundle. test/build-llms.test.ts catches the drift and was failing on master pre-regen.
tsc --noEmit failed in CI because ReturnType<typeof readdirSync> with withFileTypes:true picks an overload union that includes Dirent<Buffer<ArrayBufferLike>>. Strict tsc treats entry.name as Buffer, so .startsWith / .endsWith / string comparisons all blew up. Annotate the variable as Dirent[] (string-based) and cast through unknown, matching the pattern sync.ts already uses for its own filesystem walk. Same runtime behavior; clean typecheck. Tests still 9/9. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> EOF
…ering # Conflicts: # CHANGELOG.md # VERSION # package.json # src/cli.ts
…ering # Conflicts: # CHANGELOG.md # VERSION # package.json
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Storage tiering, finally working. The original branch had two silent bugs that made it a no-op for every user. This rewrite fixes both, hardens the surface, adds proper tests, and addresses 4 follow-on findings from Codex pre-landing review.
Fixes (D1=B rewrite-in-place):
data:{}→ swapped to a dedicated parser forgbrain.yml. The single test that would have caught this is now in place.manageGitignoredefined but never invoked → wired intorunSyncafter every successful sync (skips on dry-run, blocked-by-failures, submodule context, orGBRAIN_NO_GITIGNORE=1).Additions:
gbrain storage status [--repo P] [--json]— tier breakdown, disk usage, missing filesgbrain export --restore-only [--repo P] [--type T] [--slug-prefix S]— repopulate missing db_only filesgbrain.ymlconfig at brain repo root withdb_tracked:anddb_only:arrays (deprecatedgit_tracked/supabase_onlystill load with a once-per-process warning)disk-walk.ts: single recursive readdirSync replaces ~400K syscalls on 200K-page brainsgetDefaultSourcePathtyped accessor (replaces raw SQL + bare try/catch); falls back to legacysync.repo_pathconfig for pre-v0.18 brainsslugPrefixfilter onPageFiltersfor both engines (LIKE-escape safe)media/x/does NOT matchmedia/xerox/foovalidateStorageConfigauto-normalizes trailing/, throwsStorageConfigErroron tier overlapscripts/check-trailing-newline.shTest Coverage
Tests: existing → +78 storage-tiering tests. All 78 pass.
Pre-Landing Review
Codex found 4 issues; all fixed in commit
f79e662:--restore-onlysilent fall-through when no storage config → now refuses with actionable error--repo→ now uses source-resolver chain (works in common path)getDefaultSourcePathmissed legacysync.repo_path→ added fallbacksync --allloop didn't update .gitignore per source → now doesEval Results
No prompt-related files changed — evals skipped.
Plan Completion
All 16 implementation steps from the plan complete. Per-step bisect-friendly commits:
9b79314f3618a99c8034b6708592ca68142bbadd8316da43e759735453970fc29c8991d29c7a1a1e6f2f2e23e2857c6edbcb90e5706823b39be19a7f79e662TODOS
No TODO items completed in this PR (storage tiering wasn't a tracked TODO).
Documentation
Updated:
CHANGELOG.md— full v0.22.11 release entry in Garry voice with "numbers that matter" + "to take advantage of" blockCLAUDE.md— 4 new Key files entries (storage-config.ts, disk-walk.ts, storage.ts, gbrain.yml)README.md— new "Storage tiering: keep bulk content out of git" sectiondocs/storage-tiering.md— rewritten end-to-end with canonical key names + v0.22.11 hardening detailsTest plan
bun build --target=bunsucceeds🤖 Generated with Claude Code
Need help on this PR? Tag
@codesmithwith what you need.