v0.22.0 feat: source-aware search ranking — curated pages win, swamp dampened #439

Merged
garrytan merged 8 commits into master from garrytan/better-search on Apr 26, 2026

@garrytan (Owner) commented Apr 26, 2026

Summary

Search stops getting swamped by chat logs. Curated pages win by default.

8 commits on top of v0.21.0 (Cathedral II) ship source-aware retrieval. Multi-word topic queries against a real brain previously returned chat-log pages at #1/#2 because chat is 50KB and contains every topic; the actual article you wrote was buried at #5+. v0.22.0 fixes that at the SQL layer — ts_rank on chunk-grain FTS and the new HNSW-safe two-stage CTE in searchVector both multiply by a longest-prefix-match source-factor CASE. Plus four hard-exclude prefixes (test/, archive/, attachments/, .raw/) filter at the chunk-rank stage so they never enter the candidate set.

Core changes (8 commits):

  • Plumbing: SearchOpts gains exclude_slug_prefixes (additive) + include_slug_prefixes (subtractive opt-back-in).
  • Helpers: src/core/search/source-boost.ts (default boost map + env parsers + resolvers) and src/core/search/sql-ranking.ts (pure SQL string builders with three-meta-char LIKE escape, single-quote SQL-literal doubling, longest-prefix-match CASE, detail-gate temporal-bypass).
  • Engine wiring (Postgres): searchKeyword, searchKeywordChunks, searchVector all multiply ts_rank (or raw_score) by source-factor. searchVector becomes a two-stage CTE — pure-distance HNSW inner ORDER BY, source-boost re-rank in the outer SELECT, innerLimit = offset + max(limit*5, 100) to preserve pagination contract, p.source_id carried through.
  • Engine wiring (PGLite): mirrors Postgres. CTE aliased as hc to disambiguate the staleness correlated subquery.
  • Tests: 39 unit cases + 20 E2E (search-swamp, search-exclude, engine-parity).
  • Adversarial fixes: loose-string detail normalization (catches "HIGH" / "high " from MCP boundary) + PGLite CTE alias.
  • Docs: CHANGELOG, CLAUDE.md, README, llms-full.txt all updated for v0.22.0. Test infra fix for GBRAIN_ALLOW_SHELL_JOBS in Postgres minions-shell test.

Test Coverage

96% AI-assessed, 2 minor structural gaps (deferred to TODOS).

source-boost.ts                              [████████████████████] 100%
├─ DEFAULT_SOURCE_BOOSTS map      ★★★ E2E swamp confirms map values active
├─ DEFAULT_HARD_EXCLUDES          ★★★ E2E exclude confirms defaults applied
├─ parseSourceBoostEnv            ★★★ malformed/factor=0/negative/last-colon
├─ parseHardExcludesEnv           ★★  parse/undefined/whitespace
├─ resolveBoostMap                ★★★ defaults/override/env-only-add
└─ resolveHardExcludes            ★★★ caller-add/include-opt-in/env-add/sub

sql-ranking.ts                               [████████████████████] 100%
├─ escapeLikePattern              ★★★ %, _, \, combo, no-op
├─ escapeSqlLiteral               ★★★ apostrophe + injection-inert
├─ buildLikePrefixLiteral         ★★★ trailing-%/meta/quote
├─ buildSourceFactorCase          ★★★ detail=high bypass + loose-string,
│                                       longest-prefix sort, NaN/Inf/<0 reject
└─ buildHardExcludeClause         ★★★ empty/OR-chain/backslash/injection

postgres-engine.ts methods (3)               [██████████████████░░] 90%
├─ searchKeyword (CTE+source-boost)★★★ parity DATABASE_URL-gated
├─ searchKeywordChunks            ★★  parity covers code path indirectly
└─ searchVector (two-stage CTE)   ★★★ parity top-result + hard-exclude parity

pglite-engine.ts methods (3)                 [████████████████████] 100%
├─ searchKeyword                  ★★★ swamp + 4 exclude cases
├─ searchKeywordChunks            ★★  exercised via search paths
└─ searchVector                   ★★★ swamp + vector exclude + opt-in

types.ts SearchOpts new fields                [████████████████████] 100%
├─ exclude_slug_prefixes          ★★★ unit + E2E additive
└─ include_slug_prefixes          ★★★ unit + E2E subtractive

Gaps: (1) explicit pagination case for offset > 100 on searchVector — innerLimit math is structurally documented but not asserted by an explicit case; (2) language/symbolKind passthrough on the new sql.unsafe builds — covered indirectly by parity tests, no dedicated case.
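
Gap (1) could be closed with a case along these lines. This is only a sketch: the formula is quoted from the summary above, but `innerLimit` as a standalone function and the `coversPage` helper are hypothetical names, not helpers confirmed to exist in the repo.

```typescript
// Formula quoted from the PR summary: innerLimit = offset + max(limit*5, 100).
// Hypothetical standalone form; the engines may compute this inline.
function innerLimit(limit: number, offset: number): number {
  return offset + Math.max(limit * 5, 100);
}

// Hypothetical invariant: the inner HNSW candidate pool must be at least as
// large as offset + limit, otherwise a deep page would come back short.
function coversPage(limit: number, offset: number): boolean {
  return innerLimit(limit, offset) >= offset + limit;
}

// Worked values, including the offset > 100 case named in the gap:
//   limit=10, offset=0    gives 0   + max(50, 100)  = 100
//   limit=10, offset=150  gives 150 + max(50, 100)  = 250
//   limit=50, offset=200  gives 200 + max(250, 100) = 450
```

Because `max(limit*5, 100)` is never smaller than `limit` for non-negative limits, the pool always covers the requested page; what the formula does not guarantee is that enough candidates survive the outer re-rank on very sparse corpora.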

Pre-Landing Review

0 CRITICAL, 0 INFORMATIONAL (auto-fix). All SQL injection vectors covered by three-meta-char LIKE escape + single-quote SQL-literal doubling. slugColumn parameter is engine-supplied ('p.slug' / 'hc.slug' / 'slug'), never user-controllable. Numeric factors validated Number.isFinite && >= 0. Two-stage CTE preserves HNSW (inner CTE pure-distance ORDER BY, source-boost only in outer SELECT).

Adversarial Review

Claude adversarial subagent reported 5 findings (2 fixed, 3 deferred):

  • FIXED: Loose-string detail over MCP boundary ("HIGH", "high " now normalized)
  • FIXED: PGLite searchVector correlated subquery scope ambiguity (CTE aliased as hc)
  • DEFERRED: HNSW + hard-exclude planner behavior on real Postgres (needs EXPLAIN on a 50K+ chunk Supabase corpus)
  • DEFERRED: searchKeywordChunks pagination pool growth (would change v0.21.0 contract)
  • DEFERRED: resolveBoostMap re-reads process.env per call (intentional — enables mid-process env reload for tuning)

Plan Completion

13 DONE, 1 CHANGED (engine-parity test landed at test/e2e/ instead of test/ — same intent), 1 DEFERRED to companion PR (BrainBench Cat 13a qrels in gbrain-evals merged to main as garrytan/gbrain-evals#1).

BrainBench Cat 13b — Source Swamp Resistance (companion PR in gbrain-evals)

| gbrain version | Top-1 hit | Top-3 hit | Swamp@top |
|---|---|---|---|
| v0.20.4 (pre-Cathedral II) | 90.0% | 100.0% | 10.0% |
| v0.21.0 (Cathedral II two-pass) | 90.0% | 100.0% | 10.0% |
| v0.22.0 (this release) | 93.3% | 100.0% | 6.7% |

v0.21.0's two-pass retrieval is orthogonal to source-swamp — v0.22.0 adds +3.3pts top-1 / -3.3pts swamp on top of either base.

Test plan

  • bun run test — 2714 pass / 0 fail
  • bun run test:e2e (with test DB) — 225 pass / 0 fail across 24 files
  • Adversarial fixes verified — 137 pass / 0 fail across affected files
  • gbrain search "<multi-word topic>" against ~/git/brain/ returns curated content at #1
  • gbrain search "<phrase>" --detail high lets chat re-surface

🤖 Generated with Claude Code

garrytan and others added 8 commits April 25, 2026 21:23
…archOpts

The two new fields plumb prefix-based hard-exclude through the search API.
exclude_slug_prefixes is additive over the engine's default hard-exclude set
(test/, archive/, attachments/, .raw/) and the GBRAIN_SEARCH_EXCLUDE env var.
include_slug_prefixes subtracts entries from the resolved set so callers can
opt back into directories that are hidden by default.

Stand-alone change — no engine wiring yet (lands in subsequent commits).
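
The merge semantics described above (defaults, plus env, plus caller additions, minus caller opt-ins) can be sketched as follows. This is an illustration, not the repo's `resolveHardExcludes`: the function names ending in `Sketch`, the `ExcludeOpts` type, and the comma-separated env format are all assumptions.

```typescript
// Default set quoted from the commit message.
const DEFAULT_HARD_EXCLUDES: string[] = ["test/", "archive/", "attachments/", ".raw/"];

// Hypothetical shape covering only the two new SearchOpts fields.
interface ExcludeOpts {
  exclude_slug_prefixes?: string[];
  include_slug_prefixes?: string[];
}

// ASSUMPTION: GBRAIN_SEARCH_EXCLUDE is a comma-separated list of prefixes.
// The real format is not stated in this PR.
function parseExcludeEnvSketch(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// The env value is passed in explicitly so the function stays pure;
// a caller would hand it process.env.GBRAIN_SEARCH_EXCLUDE.
function resolveHardExcludesSketch(opts: ExcludeOpts, envValue?: string): string[] {
  // Additive layers: defaults, then env, then caller excludes.
  const resolved = new Set<string>([
    ...DEFAULT_HARD_EXCLUDES,
    ...parseExcludeEnvSketch(envValue),
    ...(opts.exclude_slug_prefixes ?? []),
  ]);
  // Subtractive layer: the caller opts back into hidden directories.
  for (const prefix of opts.include_slug_prefixes ?? []) {
    resolved.delete(prefix);
  }
  return Array.from(resolved);
}
```

In this sketch the subtraction runs last, so an `include_slug_prefixes` entry wins even over a prefix added through the env var or `exclude_slug_prefixes`; whether the real resolver orders it the same way is not confirmed by the commit message.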

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new modules + unit tests. Pure functions, zero engine dependencies.

source-boost.ts:
  - DEFAULT_SOURCE_BOOSTS map (originals/ 1.5, concepts/ 1.3, writing/ 1.4,
    people/ 1.2, daily/ 0.8, media/x/ 0.7, wintermute/chat/ 0.5, etc.) —
    grounded in the composition of the canonical brain.
  - DEFAULT_HARD_EXCLUDES = ['test/', 'archive/', 'attachments/', '.raw/'].
  - GBRAIN_SOURCE_BOOST + GBRAIN_SEARCH_EXCLUDE env-var parsers, malformed
    entries skipped silently.
  - resolveBoostMap / resolveHardExcludes merge defaults + env + caller opts.

sql-ranking.ts:
  - buildSourceFactorCase emits a CASE expression for the source factor.
    Returns literal '1.0' when detail==='high' so temporal queries bypass
    source-boost (matches the COMPILED_TRUTH_BOOST gate in hybrid.ts).
    Prefixes sorted by length desc so longest-match wins.
  - buildHardExcludeClause emits NOT (col LIKE 'p1%' OR col LIKE 'p2%').
    NOT a NOT LIKE ALL/ANY array — those quantifiers don't express
    set-exclusion correctly for multi-pattern LIKE.
  - LIKE meta-character escape covers all three: %, _, AND \. Backslash
    coverage matters because it's Postgres LIKE's default escape char —
    a literal backslash in a user env prefix would otherwise be
    interpreted as 'escape the next char' and silently match wrong rows.
  - SQL string literals get single-quote doubling so injection-style
    inputs render as inert text inside the quoted string.

39 unit tests cover escape behavior, longest-prefix-match, detail-gate
bypass, malformed env, factor=0 (legal), negative-factor rejection,
SQL-injection-as-literal, and resolver merge semantics.
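
A minimal sketch of the builders described above, to make the escape order and the longest-prefix sort concrete. The function names follow the commit message, but the bodies are a guess at the behavior rather than the code in src/core/search/sql-ranking.ts, and the sketch assumes Postgres runs with standard_conforming_strings on (the default), so a backslash inside a quoted literal reaches LIKE unchanged.

```typescript
// Escape the three LIKE meta-characters in one pass: %, _ and backslash.
function escapeLikePattern(s: string): string {
  return s.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

// Double single quotes so the value stays inert inside a quoted SQL literal.
function escapeSqlLiteral(s: string): string {
  return s.replace(/'/g, "''");
}

// LIKE-escape first, then quote-escape, then append the trailing wildcard.
function buildLikePrefixLiteral(prefix: string): string {
  return "'" + escapeSqlLiteral(escapeLikePattern(prefix)) + "%'";
}

function buildSourceFactorCase(
  slugColumn: string, // engine-supplied, never user input
  boosts: Record<string, number>,
  detail?: string
): string {
  // detail === 'high' bypasses source-boost entirely.
  if (detail === "high") return "1.0";
  const prefixes = Object.keys(boosts)
    .filter((p) => Number.isFinite(boosts[p]) && boosts[p] >= 0)
    .sort((a, b) => b.length - a.length); // longest prefix first, so it wins
  if (prefixes.length === 0) return "1.0";
  const whens = prefixes.map(
    (p) => `WHEN ${slugColumn} LIKE ${buildLikePrefixLiteral(p)} THEN ${boosts[p]}`
  );
  return `(CASE ${whens.join(" ")} ELSE 1.0 END)`;
}

function buildHardExcludeClause(slugColumn: string, prefixes: string[]): string {
  if (prefixes.length === 0) return "";
  const ors = prefixes.map((p) => `${slugColumn} LIKE ${buildLikePrefixLiteral(p)}`);
  return `NOT (${ors.join(" OR ")})`;
}
```

A CASE expression returns the first matching branch, which is why sorting by prefix length descending is enough to make `media/x/` beat `media/` without any extra logic in SQL.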

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
search-swamp.test.ts: reproduces the v3-plan headline case. Seeds a
curated originals/talks/article-outline-fat-code page against two
wintermute/chat/ pages stuffed with 'fat code thin harness' repetitions.
Asserts the article wins both keyword and vector ranking, and that
detail=high lets the chat swamp re-surface (temporal-query workflow
preserved). Also asserts source_id passes through the two-stage CTE.

search-exclude.test.ts: verifies test/ + archive/ pages are hidden by
default, that include_slug_prefixes opts back in, and that
exclude_slug_prefixes adds to defaults.

engine-parity.test.ts: codex flagged that searchKeyword's structural
behavior differs between engines (Postgres ranks pages then picks best
chunk; PGLite returns chunks directly). Without parity coverage the fix
could pass on PGLite and silently fail on Postgres. Seeds identical
corpus into both engines, runs identical queries, asserts top-result +
result-set match. Includes a vector-search parity case and a hard-exclude
parity case. Skips gracefully when DATABASE_URL is unset, per the
CLAUDE.md E2E lifecycle pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d + searchKeywordChunks + two-stage searchVector

Layers source-aware ranking on top of v0.21.0's Cathedral II
chunk-grain FTS architecture, in both Postgres and PGLite engines.

postgres-engine.ts:
  - searchKeyword (chunk-grain CTE → DISTINCT ON page dedup): the inner
    ranked_chunks CTE multiplies ts_rank by the source-factor CASE
    expression, hard-exclude prefixes (test/, archive/, attachments/,
    .raw/ by default + env + caller) become a NOT-LIKE OR-chain on
    the WHERE clause, language/symbol-kind filters preserved.
  - searchKeywordChunks (chunk-grain anchor primitive used by two-pass
    Layer 7): same source-boost treatment so the anchor pool that
    feeds two-pass retrieval is also dampened on chat/daily/x dirs.
  - searchVector becomes a two-stage CTE: inner CTE keeps pure
    HNSW ORDER BY (folding source-boost into it would force a
    sequential scan over every chunk), outer SELECT re-ranks by
    raw_score × source-factor. innerLimit scales with offset to
    preserve pagination contract. p.source_id passes through
    inner→outer for v0.18 multi-source callers.
  - All three methods stay inside sql.begin + SET LOCAL
    statement_timeout from v0.19+ (transaction-scoped GUC; bare SET
    leaks onto pooled connections, documented DoS vector).

pglite-engine.ts: mirrors the same three methods. Same SQL shape,
same source-factor + hard-exclude. Two-stage CTE also lifts stale-flag
computation into the outer SELECT (it referenced p.updated_at which
now lives only inside the inner CTE).

Detail-gate (`detail !== 'high'`) inherited from buildSourceFactorCase:
temporal queries bypass source-boost so chat surfaces normally for
date-framed lookups. Same gate pattern as the existing
COMPILED_TRUTH_BOOST in hybrid.ts.

Tests: 142 pass across pglite-engine, postgres-engine, sql-ranking,
search-swamp E2E, search-exclude E2E.
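
The two-stage shape for searchVector can be sketched as a pure string builder. This is not the engine's actual query: the table names (`content_chunks`, `pages`), the `embedding` and `chunk_text` columns, the cosine-similarity `raw_score` expression, and the `buildTwoStageVectorSql` name are assumptions for illustration. Only the `hnsw_candidates` CTE name, the `hc` alias, `raw_score`, `source_id`, and the inner/outer split come from this PR.

```typescript
interface TwoStageArgs {
  sourceFactorCase: string;   // pre-built CASE expression over hc.slug
  hardExcludeClause: string;  // pre-built NOT (...) over p.slug, or ""
  innerLimit: number;         // computed by the caller from limit and offset
  limit: number;
  offset: number;
}

function assertNonNegInt(name: string, n: number): void {
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer`);
  }
}

// Hypothetical builder; $1 is the query embedding parameter.
function buildTwoStageVectorSql(a: TwoStageArgs): string {
  assertNonNegInt("innerLimit", a.innerLimit);
  assertNonNegInt("limit", a.limit);
  assertNonNegInt("offset", a.offset);
  const where = a.hardExcludeClause ? `  WHERE ${a.hardExcludeClause}` : null;
  const lines = [
    "WITH hnsw_candidates AS (",
    "  SELECT p.slug, p.source_id, c.page_id, c.chunk_text,",
    "         1 - (c.embedding <=> $1::vector) AS raw_score",
    "  FROM content_chunks c JOIN pages p ON p.id = c.page_id",
    where,
    // Inner ORDER BY is pure distance so the HNSW index stays usable.
    "  ORDER BY c.embedding <=> $1::vector",
    `  LIMIT ${a.innerLimit}`,
    ")",
    "SELECT hc.slug, hc.source_id, hc.page_id, hc.chunk_text,",
    // Source-boost is applied only here, over the small candidate pool.
    `       hc.raw_score * ${a.sourceFactorCase} AS score`,
    "FROM hnsw_candidates hc",
    "ORDER BY score DESC",
    `LIMIT ${a.limit} OFFSET ${a.offset}`,
  ];
  return lines.filter((l): l is string => l !== null).join("\n");
}
```

The design point is that the boost multiplication never appears inside the CTE: the inner query orders by the raw distance operator alone, and the re-rank touches at most `innerLimit` rows.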

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…master)

CHANGELOG: new v0.22.0 entry above v0.21.0 (Cathedral II). Headline
positions v0.22.0 as additive on top of v0.21.0's two-pass retrieval:
a different mechanism, +3.3pts top-1 / -3.3pts swamp on the new
Cat 13b benchmark in the sibling gbrain-evals repo.

CLAUDE.md:
  - postgres-engine.ts entry mentions all three updated methods
    (searchKeyword, searchKeywordChunks, searchVector) and the
    two-stage CTE for searchVector specifically.
  - pglite-engine.ts entry parallels the Postgres notes.
  - src/core/search/ entry calls out source-aware ranking +
    hard-exclude defaults + detail-gate parity with COMPILED_TRUTH_BOOST.
  - Added entries for src/core/search/source-boost.ts and
    src/core/search/sql-ranking.ts in the Key Files section.
  - Added test/sql-ranking.test.ts and the three new E2E test
    files (search-swamp, search-exclude, engine-parity) to the
    test listings.

README.md: SEARCH PIPELINE diagram in the "many strategies in concert"
section gains two lines for source-aware ranking and hard-exclude
filtering.

VERSION: 0.21.0 → 0.22.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test fixes uncovered while running the full bun run test + E2E
suite at zero defects.

test/e2e/engine-parity.test.ts: BrainEngine was being imported from
src/core/types.ts but it's actually exported from src/core/engine.ts;
the import was silently working under bare `bun test` but failing
typecheck. Fixed the import path and annotated 6 implicit-any
SearchResult callbacks. (No behavior change; typecheck only.)

test/e2e/minions-shell.test.ts: the Postgres minions-shell test was
missing the `GBRAIN_ALLOW_SHELL_JOBS=1` env-var setup that the
PGLite sibling test in test/e2e/minions-shell-pglite.test.ts already
has. Without it the shell handler short-circuits and the job lands
in `dead`, not `completed`. The env var is the operator-trust gate
for the shell handler, separate from the trusted-add
allowProtectedSubmit flag. Adding the same beforeAll/afterAll
setup-and-restore pattern from the PGLite sibling brings the test
to green.
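
The setup-and-restore pattern mentioned above can be sketched with a small helper. `setEnvVar` is a hypothetical name, not something taken from the PGLite sibling test; the env object is passed in so the sketch runs without a test runner.

```typescript
type Env = Record<string, string | undefined>;

// Set env[key] and return a function that restores the previous state,
// including the "was not set at all" case.
function setEnvVar(env: Env, key: string, value: string): () => void {
  const had = Object.prototype.hasOwnProperty.call(env, key);
  const previous = env[key];
  env[key] = value;
  return () => {
    if (had) {
      env[key] = previous;
    } else {
      delete env[key];
    }
  };
}

// Intended use inside a test file (assumed shape, shown as a comment so the
// block stays runnable on its own):
//
//   let restore: () => void;
//   beforeAll(() => { restore = setEnvVar(process.env, "GBRAIN_ALLOW_SHELL_JOBS", "1"); });
//   afterAll(() => restore());
```

Restoring in afterAll matters because the variable is a trust gate: leaving it set could let later test files run shell jobs they were not meant to run.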

Both bugs were latent on master: bare `bun test` skipped the
typecheck, and the minions-shell E2E was a pre-existing flake
(documented as such in an earlier branch summary).

Verified: full unit suite 2714 pass / 0 fail (`bun run test`),
full E2E suite 225 pass / 0 fail across 24 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the v0.22.0 entries added to CLAUDE.md (source-boost.ts,
sql-ranking.ts, three new E2E test files, postgres/pglite engine
search-method updates). The build-llms.test.ts regen-drift guard
was failing because the committed bundle didn't match the current
generator output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CTE alias

Two FIXABLE findings from /ship's adversarial subagent pass:

1. **buildSourceFactorCase: tolerate loose-string `detail` over the MCP
   boundary.** TypeScript narrows the typed callers, but agents passing
   JSON across MCP can send `"HIGH"` (uppercase) or `"high "` (trailing
   space). Before this change, those values silently fell through the
   `detail === 'high'` strict-equality check and got boosted ranking
   instead of the temporal bypass — the opposite of what the agent asked
   for. Now the gate normalizes `String(detail).trim().toLowerCase()`
   before comparing. Three new test cases cover `"HIGH"`, `"high "`, and
   `"  High  "`.

2. **PGLite searchVector: alias the hnsw_candidates CTE as `hc` and
   qualify the correlated subquery.** The prior shape had
   `WHERE te.page_id = page_id` in the staleness subquery — unqualified
   `page_id` resolved by lexical-scope fallback to
   `hnsw_candidates.page_id`, but if the inner column is ever renamed or
   the parser changes, it would silently bind to `te.page_id` itself
   (always true) and every result returns `stale=true`. Aliasing the CTE
   as `hc` and qualifying both `hc.page_id` and `hc.slug` (via building
   the source-factor CASE with `'hc.slug'`) eliminates the ambiguity.
   Postgres `searchVector` was already safe — it uses `false AS stale`
   (no correlated subquery) — so no symmetric change needed there.
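
The normalization in finding 1 amounts to the following. `isHighDetail` is a hypothetical helper name; the commit message only states that the gate applies `String(detail).trim().toLowerCase()` before comparing.

```typescript
// Accept untyped input, since JSON arriving over the MCP boundary is not
// narrowed by TypeScript at runtime.
function isHighDetail(detail: unknown): boolean {
  if (detail === undefined || detail === null) return false;
  return String(detail).trim().toLowerCase() === "high";
}

// The three cases named in the commit message, plus two negatives.
const cases: Array<[unknown, boolean]> = [
  ["HIGH", true],
  ["high ", true],
  ["  High  ", true],
  ["low", false],
  [undefined, false],
];
for (const [input, expected] of cases) {
  if (isHighDetail(input) !== expected) {
    throw new Error(`unexpected result for ${JSON.stringify(input)}`);
  }
}
```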

Three INVESTIGATE findings deferred:
- HNSW + hard-exclude planner behavior on real Postgres (needs EXPLAIN on
  a 50K+ chunk Supabase corpus, not reproducible on PGLite)
- searchKeywordChunks pagination pool growth (would change the v0.21.0
  contract; inherits the original Cathedral II shape)
- resolveBoostMap re-reads process.env per call (cheap, intentional —
  enables mid-process env reload for tuning)

Verified: 137 pass / 0 fail across sql-ranking + pglite-engine +
search-swamp + search-exclude tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 172b55b into master Apr 26, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 26, 2026
Catches up to v0.22.0 source-aware ranking (#439) and bumps the
upgrade-hardening wave to v0.22.5. The bootstrap now layers cleanly
on top of v0.22.0 — no functional conflict between source-aware
search and the pre-schema bootstrap.

CHANGELOG entries reordered: v0.22.5 (this wave) on top, v0.22.0
(master) below, prior entries below that. Version references in the
v0.22.5 entry, in CLAUDE.md, and in package.json all updated together.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>