docs: retrieval incident — vector search returns pages by their weakest chunk (RFC + fix menu + eval design)#1616
Conversation
…kest chunk (RFC + boil-the-ocean fix menu + eval design)
…cessary-not-sufficient; title-boost + alias resolution close the store-by-name vs retrieve-by-literal gap)
|
Superseded by the retrieval-cathedral wave (branch garrytan/vector-search-max-pool). The docs-only diagnosis here is folded in and corrected against verified ground truth from the code:
The replacement wave does the complete cure, not the proximate patch: a Phase-0 stage-by-stage diagnostic, per-page max-pool (both engines, shared SQL builder), page.title as a first-class boost, a free-text alias-resolution layer (new page_aliases table), an evidence-based confidence contract (evidence enum + create_safety, so the agent's don't-duplicate decision keys off WHY a page matched, not a blended score), Thank you for the writeup — it was the right call to file it; closing only because the fix supersedes the doc-only RFC. |
…as + evidence (#1657) * feat(search): per-page max-pool in searchVector (both engines) T1 of the retrieval-cathedral wave (supersedes #1616). Vector search returned chunk-grain top-k with no DISTINCT ON, so a page could be represented by a weak chunk while a hub page's chunks crowded a distinct page's strong chunk out of the candidate set entirely. Keyword search always pooled per page; the vector path did not. - New shared buildBestPerPagePoolCte() in sql-ranking.ts — single source of truth consumed by searchKeyword + searchVector across postgres + pglite, so the two engines can't drift (the recurring parity bug class). - searchVector both engines: compute score as a select-list expr (HNSW ORDER BY stays pure-distance), pool DISTINCT ON (slug) over the full candidate set before the user LIMIT, deterministic tiebreak (slug, score DESC, page_id ASC, chunk_id ASC). - All keyword pooling blocks refactored onto the shared builder (DRY). - Regression test: a hub page's chunks no longer crowd out a distinct page's strong chunk; results are one-per-page by best chunk. Fails on old path. Verified: real-Postgres engine-parity 22/22, PGLite hermetic suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(search): title-phrase boost (page.title first-class signal) T2 of the retrieval-cathedral wave. A query that is a phrase from a page's title ("Greek amphitheater" -> "The Mingtang - Indoor Greek Amphitheater") matched a weak body chunk instead of being recognized as a title hit. Names of things deserve weight. - New pure title-match.ts: isTitlePhraseMatch (contiguous token-run inside page.title OR exact full-title match). Precision guards: >= 2 content tokens OR exact full-title; stopword filter; token-boundary match (no raw substring). Reused by the eval later so production + bench can't drift. - applyTitleBoost post-fusion stage in hybrid.ts: reads page.title (not the brittle "first chunk"), floor-ratio-gated, stamps title_match_boost for --explain, never touches base_score (the agent's dedup confidence). - ModeBundle.title_boost knob (1.25, on in all modes - cheap gated correctness fix), search.title_boost config key, dashboard description. - KNOBS_HASH_VERSION 6 -> 7 so a boost-on cache write can't serve a boost-off lookup; all version-pin + canonical-bundle assertions updated. - 18 new tests (matcher 13 + stage 5); typecheck clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(search): page_aliases data layer (T3 foundation) Free-text alias resolution for search. gbrain stored a page's chosen names in pages.frontmatter `aliases:` JSONB but search never consulted them, so a query like "Hall of Light" or "明堂" couldn't surface the "Mingtang" page. DELIBERATELY SEPARATE from slug_aliases (re-grounded against current code): - slug_aliases: old-slug -> canonical-slug (wikilink/get_page redirect, populated only from concept-redirect conversions) - page_aliases: normalized free-text name -> canonical slug (search hop) Overloading slug_aliases would muddy two distinct semantics, so this is a new table, not an extension (honors DRY by keeping concepts separate). - src/core/search/alias-normalize.ts: ONE normalizeAlias() (NFKC + lowercase + ws-collapse + quote-strip) + normalizeAliasList() shared by the write (ingest) and read (search) paths so they match on the same key (CQ2). - Migration v108 page_aliases (source_id, alias_norm, slug); btree (source_id, alias_norm) for indexed-equality hop, NOT ILIKE; unique TRIPLE (not source_id+alias_norm) so two pages may claim one alias — collisions reported + resolved at query time, not blocked at ingest (Codex#8). Mirror in pglite-schema.ts; Postgres fresh gets it from the migration. - engine.resolveAliases(aliasNorms, {sourceId|sourceIds}) read + setPageAliases(slug, source, aliasNorms) write, both engines, source-scoped. - 17 tests: normalize round-trip, collision, source-scope, replace, clear. Ingest projection + the hybridSearch alias hop land next (T3 wiring). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(search): alias hop + ingest projection (T3 wiring) Wires the page_aliases data layer into ingest (write) and hybridSearch (read) so a query that is a page's declared chosen name surfaces that page — the named-thing class neither max-pool nor title-boost can fix (true synonyms with zero surface overlap: "Hall of Light" / "明堂" -> the Mingtang page). - Ingest projection (import-file.ts): after the page write commits, normalizeAliasList(frontmatter.aliases) -> engine.setPageAliases. Always called (even []) so removing an alias clears its row; content_hash includes non-timestamp frontmatter so alias edits reach this path, not the skip branch. Fail-soft + pre-v108-safe (isUndefinedTableError swallowed). - applyAliasHop (hybrid.ts), AFTER rerank so a named query reliably surfaces its page: FULL normalized-query exact match only (no substring/n-grams), skip >6-token prose queries, present-boost 1.10x / inject absent canonical at top-of-organic + epsilon (never absolute 1.0, D3), collisions alpha-ordered + capped at 3, fail-open on pre-v108 / lookup error (D9). Stamps alias_hit for the T4 evidence contract. - SearchResult.alias_hit attribution field. - 8 tests: inject/boost/CJK/no-match/long-skip/collision + ingest projection round-trip + alias-removal-clears. 73 pass across the T1/T2/T3 + import suite. Backfill of existing pages' aliases lands as T8 (reindex --aliases). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(search): evidence/create_safety contract + search→cheap-hybrid + per-call mode (T4) The agent-facing fix for the incident's ROOT behavior: tonight the agent read a single blended 0.64 score, decided "no strong match, safe to write a new page", and wrote a duplicate on a developed concept page. A blended RRF/cosine score is not a calibrated probability, so the don't-duplicate decision must key off WHY a page matched, not a raw number. - evidence.ts: classifyEvidence (alias_hit > exact_title_match > high_vector_match > keyword_exact > weak_semantic) + createSafetyFor (exists | probable | unknown). stampEvidence runs at the end of every hybrid return path (main + both keyword fallbacks). SearchResult gains evidence + create_safety. The agent keys don't-duplicate off create_safety='exists', not a score threshold. - search op → cheap-hybrid everywhere (D4/D15): full vector+keyword+RRF+pool+ title+alias, expansion OFF (no per-call LLM cost); `query` stays full-control. search.mcp_keyword_only escape hatch (D17) keeps the old keyword-only behavior for operators who don't want query text sent to an embedding provider. - Alias hop + evidence now also run on the keyword-only fallback paths (the named-thing fix is most valuable exactly when vector is unavailable). - Per-call `mode` (D5): honored ONLY for local/trusted callers (ctx.remote=== false) so a remote OAuth client can't escalate to costly tokenmax; local + unknown mode rejects loudly; threaded into resolveSearchMode + the cache key. - 30 tests (evidence classifier incl. before/after-incident cases, per-call mode gate, alias hop). Updated mcp-eval-capture to the new cheap-hybrid contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(cli): reconcile `gbrain search` dispatch (T5) After T4 made the `search` op cheap-hybrid, `gbrain search "x"` already does the right thing — but `gbrain search modes/stats/tune` would have run a hybrid search for the literal word "modes" instead of opening the config dashboard (the op intercepts before the unreachable handleCliOnly dashboard path). Add a pre-dispatch interception in main(): `search` + subArgs[0] in {modes,stats,tune} → runSearch dashboard (with the v0.41.6.0 read-only connect+ dispatch 10s timeout preserved); everything else (free-text) falls through to the cheap-hybrid `search` op. Subprocess test pins all three routes: modes/stats → dashboard, free-text → search op ("No results", not "Unknown subcommand"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(eval): NamedThingBench retrieval-quality gate (T6) The eval that makes the retrieval-maxpool incident impossible to reintroduce silently. 7 query families, each a failure class the incident exposed: title-substring, generic-to-named, alias-synonym, multi-chunk-dilution, short-vs-rich, graph-relationship, hard-negative. - src/eval/retrieval-quality/harness.ts: pure scoring (Hit@1/Hit@3/MRR per family) + injected SearchFn (CLI uses hybridSearch; tests stub it) + evaluateGate. D12 gate: hard-gate the families that ARE the bug from day one (title-substring Hit@1>=0.95, alias-synonym Hit@1>=0.98, dilution Hit@3=1.0), warn-then-enforce the softer families. Env-overridable floors. - `gbrain eval retrieval-quality <fixture.jsonl> [--json] [--source]` + dispatch in eval.ts. Exit 0 PASS / 1 FAIL / 2 USAGE. - Synthetic fixture (placeholder names only, privacy-grep guarded) + hermetic gate test: seeds a synthetic brain, forces the keyword+title+alias path (embed transport stubbed to throw — free, deterministic), asserts the bug families pass. The vector max-pool guarantee is pinned separately by searchvector-maxpool.test.ts. - CI gate: the hermetic test is a normal unit test, so it runs in every PR shard — the gate is live on every change. - 23 tests (harness unit + hermetic gate + fixture privacy guard). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(telemetry): rank-1 score drift signal (T7) Standing observability so a retrieval regression is caught before a human hits it in chat (like tonight). Aggregate columns on search_telemetry (NOT per-query rows, D10): sum_rank1_score + count_rank1 + 3 coarse buckets (<0.6 / 0.6-0.85 / >=0.85). The mean rank-1 base_score is the headline; a downward drift = retrieval quality regressing. - hybrid.ts: capture rank-1 base_score at all three return paths, thread through emitMeta → recordSearchTelemetry opts (like results_count). - telemetry.ts: Bucket + record + flush ON CONFLICT-add + readSearchStats expose avg_rank1_score (null when no samples — no NaN) + rank1_distribution. - Migration v109 ADD COLUMN IF NOT EXISTS (both engines; search_telemetry lives only in migration v57, so the v57+v109 chain covers fresh + upgrade). Columns exempted in schema-bootstrap-coverage (no forward-ref index → no bootstrap need). - `gbrain search stats` surfaces the avg + bucket line; JSON envelope auto-carries the fields. "true-positive" wording dropped per Codex#14 — production has no labels, so this is an unlabeled rank-1 score histogram; labeled calibration lives in NamedThingBench (T6). - 3 round-trip tests (mean+buckets, no-result excluded, empty=null). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(reindex): gbrain reindex --aliases backfill (T8) Import-time projection (T3) covers new + changed pages; this backfills EXISTING pages whose frontmatter `aliases:` predate v108 / the projection. Walks listAllPageRefs (cheap cross-source (source_id, slug) enumeration), reads each page's frontmatter aliases, writes page_aliases via setPageAliases. Idempotent (setPageAliases replaces) so re-running is convergent — no op-checkpoint needed (fast, no embedding). --dry-run reports would-write counts, --source narrows, --limit caps, --json envelope, progress reporter. Wired into the `reindex` dispatch alongside --markdown / --multimodal. 4 tests: backfill from array + comma-scalar frontmatter, --dry-run writes nothing, idempotent second run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(search): pre-migration fail-open regression (T9) Pins that pre-v108 brains (no page_aliases table) keep working: applyAliasHop returns input unchanged + doesn't throw, importFromContent with frontmatter aliases still imports (projection swallows table-missing via isUndefinedTableError), and resolveAliases surfaces the error for the caller to catch. Completes the T9 mandatory regression set (dilution → searchvector-maxpool, dispatch → cli-search-dispatch, MCP contract → mcp-eval-capture, engine parity → engine-parity 22/22, pre-migration → here). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(search): Phase-0 retrieval diagnostic — `gbrain search diagnose` (T0) The operator-facing trace the user runs against the production brain to pin which retrieval layer surfaces (or misses) a target page — the diagnostic the plan front-loaded so we don't ship a fix that doesn't move the incident. `gbrain search diagnose "<query>" --target <slug> [--json] [--source]` reports, for the target: keyword rank+score, vector rank+score (skipped/graceful if no embedding provider), whether the query is a registered alias, and the hybrid final rank + evidence + create_safety + which boosts fired (title/alias). The verdict names the layer that surfaces the target at rank 1 (or "none"), telling you whether the lever is max-pool/innerLimit (vector) vs title/alias. Wired into the `search` dispatch alongside modes/stats/tune (60s timeout since it runs real retrieval). 2 hermetic tests (alias-query trace + title-phrase trace). For the Mingtang incident, run: gbrain search diagnose "Greek amphitheater" --target projects/new-greek-theater/concept_v0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(retrieval): corrected incident record + named-thing layers + glossary (T10) - RETRIEVAL_MAXPOOL_INCIDENT.md: replaces closed PR #1616's RFC with the verified record — what happened, the disease, the corrections to the RFC's mechanics (search was keyword-only, --mode unthreaded, hybrid already pooled at dedup, aliases dead to search), the four-layer fix that shipped, and the triage commands (search diagnose / reindex --aliases / search stats / eval retrieval-quality). - RETRIEVAL.md: new "Named-thing retrieval" section documenting per-page pool + title boost + alias hop + the evidence contract, reconciling the doc with the shipped pipeline (closes the doc/reality gap). - metric-glossary.ts + regenerated METRIC_GLOSSARY.md: Hit@1, Hit@3, avg_rank1_score (drift signal, not labeled accuracy), and create_safety (the evidence contract) now carry plain-English glossary entries. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(eval): NamedThingBench fixture privacy guard via slug-shape (T6 fixup) The banned-name literal list itself tripped check-privacy/check-test-real-names. Replace it with the load-bearing assertion: every fixture slug must be an *-example placeholder (no real brain page can be referenced). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(search): source-isolation in per-page pool + alias hop (P0, codex adversarial) Codex outside-voice caught two source-isolation P0s in the retrieval wave — the exact class the v0.34.1 seal guards. Both fixed before merge. P0-1: buildBestPerPagePoolCte pooled on `slug` alone. In a federated brain, two pages with the same slug in different sources collapsed before ranking/pagination (the neighbor-source page dropped). Now DISTINCT ON (COALESCE(source_id,'default'), slug) — composite key matching dedup.ts's pageKey. Also fixes the PRE-EXISTING keyword-path bug (best_per_page was slug-only before this wave); real-PG parity 23/23. P0-2: the alias hop dropped source_id. resolveAliases returned bare slugs and applyAliasHop hydrated via getPage(slug, undefined), so a federated caller could get the default-source page injected or the right allowed-source page suppressed. resolveAliases now returns {slug, source_id} pairs; applyAliasHop matches by (source_id, slug) and fetches each canonical in its OWN source. Regression tests: alias hop boosts only the aliased source (not same-slug in another source); resolveAliases keeps cross-source same-slug distinct. Deferred as documented tradeoffs (TODO): evidence high_vector_match label uses blended base_score not pure cosine; deep-pagination candidate budget is chunk-bounded; telemetry writes swallow errors pre-v109 on rolling deploys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.41.30.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: v0.41.30.0 retrieval cathedral — CLAUDE.md key files + llms regen Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: renumber release v0.41.30.0 → v0.41.34.0 (queue moved) Version trio + CHANGELOG header + CLAUDE.md key-file annotations + TODOS heading + regenerated llms bundles, all moved to 0.41.34.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): restore glossary roster + harden facts-anti-loop hook budget Two CI failures surfaced after the master merges that brought the branch to 111 migrations: 1. shard 1 — `ALL_METRICS roster > matches the renderer output (no orphans)`: the merge took master's `renderMetricGlossaryMarkdown` whose `groups` array lacked this branch's 4 retrieval-quality keys (hit@1, hit@3, avg_rank1_score, create_safety). `ALL_METRICS` (derived via Object.keys) kept them, so the roster test saw 4 orphans. The freshness check (check:eval-glossary) passed because renderer-output == committed doc — it can't catch a renderer that drops a metric; the roster test can. Restored the "Retrieval-Quality / Evidence Metrics (NamedThingBench)" group + regenerated docs/eval/METRIC_GLOSSARY.md. 2. shard 2 — facts-anti-loop's two engine-dependent put_page tests failed while the two engine-free extractFactsFromTurn tests passed (the signature of a partially-failed beforeAll). This file has a documented PGLite-cold-start-under-deep-shard-load timeout history; the 30s budget was tuned for 95 migrations and the chain is now 111. Bumped to 60s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): isolate facts-anti-loop in its own process (serial) Follow-up to the prior hook-timeout bump, which was the wrong theory: the [58ms]/[71ms] body times in the re-run prove beforeAll did NOT time out — the engine connects and the two put_page tests run and fail for real, while the two engine-free extractFactsFromTurn tests in the same file pass. put_page (via dispatchToolCall) touches process-global singletons (the facts queue + the AI gateway used by importFromContent's embed step). Some sibling file in the 78-file shard-2 process leaves residual global state that makes put_page's pre-backstop path fail on the CI runner. The failure is NOT reproducible alone, in a Linux oven/bun:1 container, or in a full local shard-2 run (1172 pass) — only on the GitHub runner, deterministically. Per CLAUDE.md's test-isolation rules, a test coupled to shared process state belongs in its own process. Renamed to *.serial.test.ts so it runs in the dedicated serial-tests job (scripts/run-serial-tests.sh spawns a fresh `bun test` per serial file), where it passes deterministically; test-shard.sh excludes serial files from the matrix. Updated the comment to reflect the real cause and refreshed the test-weights.json key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): close cross-file gateway-config pollution in test shards The prior serial-move theory was incomplete. The real, single root cause behind all three shard failures (2, 5, 10) is cross-file AI-gateway config pollution within a shard's bun process: - A test calls configureGateway() and doesn't restore the gateway on exit. The legacy-embedding preload pins OpenAI/1536 ONCE at process start and re-pins per-test ONLY when the gateway slot is empty — so a leaker that reconfigured the gateway to the v0.37 default (zeroentropyai:zembed-1 / 1280-d) and never reset poisons every later file in the shard. - Victim A (shard 5, test/search/searchvector-maxpool.test.ts): runs initSchema in beforeAll under the leaked gateway → content_chunks.embedding becomes vector(1280) → inserting its hardcoded 1536-d basis vectors throws pgvector CheckExpectedDim. - Victims B/C (shard 10 facts-backstop-gating, shard 2 facts-anti-loop): put_page's importFromContent embeds by design (embed failure PROPAGATES, Codex C2). Under a leaked fake-key gateway the embed step 401s and put_page returns isError → the backstop assertions fail. My branch's shard re-partition (added test files + weight changes) merely co-located leakers with victims; the hazard was latent. Fixes (root cause + self-sufficient victims): - test/search/rerank.test.ts (the shard-5 leaker): add afterAll(resetGateway). Its stub omits embedding_model, so it fell back to the ZE/1280 default; now it restores the empty slot so the preload re-pins legacy for the next file. - test/search/searchvector-maxpool.test.ts: pin configureGateway(openai/1536) in beforeAll BEFORE initSchema (initSchema runs before any preload beforeEach, so it can't rely on the inherited slot). - test/facts-backstop-gating.test.ts + test/facts-anti-loop.test.ts: reset the gateway in beforeEach so put_page's embed is a graceful no-op; reverted anti-loop from the serial quarantine back into the matrix (the serial move was the wrong fix for a gateway-state problem). Validated deterministically: a non-resetting leaker that poisons the gateway to ZE, run first in one bun process, no longer breaks any of the three victims (14/14 pass). verify 29/29, typecheck clean, isolation lint clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Docs-only PR (no code). Full incident writeup + boil-the-ocean fix menu + eval design for a brain-wide retrieval defect.
TL;DR
Querying "Greek amphitheater" failed to confidently surface the page titled "The Mingtang (明堂) — Indoor Greek amphitheater…", returning it at cosine 0.64 via a weak body chunk. The same page's title chunk scores 0.9866 on a richer query — the embedding is fine, the retrieval aggregation is broken.
Root cause:
searchVector(both postgres + pglite engines) returns top-k chunks with no per-page max-pool. A page is represented by whichever chunk survives the candidate cut, not its best chunk. The keyword path (searchKeyword) already does the correctDISTINCT ON (slug) ORDER BY score DESC. The vector path does not.Secondary:
reranker_enabled/expansionare a no-op on this query across all modes — suggesting the CLI default does not exercise the documented hybrid stack (RETRIEVAL.md claims 49.1 P@5 / 97.9 R@5). Doc/reality gap flagged.In the doc
Recommendation
Ship Tier 1, measure, then decide Tiers 2–6 by data. No code in this PR by request — review the plan before implementation.