Skip to content

docs: retrieval incident — vector search returns pages by their weakest chunk (RFC + fix menu + eval design)#1616

Closed
garrytan-agents wants to merge 2 commits into
garrytan:masterfrom
garrytan-agents:docs/retrieval-maxpool-incident
Closed

docs: retrieval incident — vector search returns pages by their weakest chunk (RFC + fix menu + eval design)#1616
garrytan-agents wants to merge 2 commits into
garrytan:masterfrom
garrytan-agents:docs/retrieval-maxpool-incident

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Docs-only PR (no code). Full incident writeup + boil-the-ocean fix menu + eval design for a brain-wide retrieval defect.

TL;DR

Querying "Greek amphitheater" failed to confidently surface the page titled "The Mingtang (明堂) — Indoor Greek amphitheater…", returning it at cosine 0.64 via a weak body chunk. The same page's title chunk scores 0.9866 on a richer query — the embedding is fine, the retrieval aggregation is broken.

Root cause: searchVector (both postgres + pglite engines) returns top-k chunks with no per-page max-pool. A page is represented by whichever chunk survives the candidate cut, not its best chunk. The keyword path (searchKeyword) already does the correct DISTINCT ON (slug) ORDER BY score DESC. The vector path does not.

Secondary: reranker_enabled / expansion are a no-op on this query across all modes — suggesting the CLI default does not exercise the documented hybrid stack (RETRIEVAL.md claims 49.1 P@5 / 97.9 R@5). Doc/reality gap flagged.

In the doc

  • Incident narrative + reproduction with exact CLI output
  • Code-level root cause with line refs (searchVector vs searchKeyword asymmetry, innerLimit, pglite parity)
  • Boil-the-ocean fix menu: Tier 1 (per-page max-pool — the actual bug, ~1 day) → Tier 6 (graph-assisted named-entity recall)
  • Eval design ("NamedThingBench") on the existing eval/code-retrieval harness + baseline→compare→gate: title-substring, generic-label→named-entity, alias/synonym, multi-chunk dilution, short-vs-rich pairs, hard negatives; Hit@1/3, MRR, nDCG@5, score calibration vs the 0.85 dedup threshold; CI gate + engine parity

Recommendation

Ship Tier 1, measure, then decide Tiers 2–6 by data. No code in this PR by request — review the plan before implementation.

Wintermute added 2 commits May 29, 2026 05:21
…kest chunk (RFC + boil-the-ocean fix menu + eval design)
…cessary-not-sufficient; title-boost + alias resolution close the store-by-name vs retrieve-by-literal gap)
@garrytan

Copy link
Copy Markdown
Owner

Superseded by the retrieval-cathedral wave (branch garrytan/vector-search-max-pool).

The docs-only diagnosis here is folded in and corrected against verified ground truth from the code:

  • gbrain search is keyword-only (ts_rank); the 0.64/0.98 cosine scores came from the hybrid query/MCP path Wintermute actually hit — the repro command was mislabeled.
  • --mode is not a param of either op; mode resolves server-side from the search.mode config key, which is why all three 'modes' returned identical results (the flag is silently dropped; thorough isn't a real mode). Rerank/expansion also fail-open to no-ops when no provider is configured.
  • hybridSearch ALREADY max-pools per page at the dedup layer (top-3 → cap-2, score-sorted), so the per-page max-pool fix's real win is candidate-set page recall, not final-rank — and it is necessary-but-not-sufficient (the title chunk may simply score below the body chunk on a 2-word query, or fall outside the candidate pool).

The replacement wave does the complete cure, not the proximate patch: a Phase-0 stage-by-stage diagnostic, per-page max-pool (both engines, shared SQL builder), page.title as a first-class boost, a free-text alias-resolution layer (new page_aliases table), an evidence-based confidence contract (evidence enum + create_safety, so the agent's don't-duplicate decision keys off WHY a page matched, not a blended score), gbrain search→hybrid with a remote-provider policy + escape hatch, and NamedThingBench (a hard CI gate). The narrative here will be carried into a corrected docs/architecture/RETRIEVAL_MAXPOOL_INCIDENT.md with Co-Authored-By preserved.

Thank you for the writeup — it was the right call to file it; closing only because the fix supersedes the doc-only RFC.

@garrytan garrytan closed this May 30, 2026
garrytan added a commit that referenced this pull request May 30, 2026
…as + evidence (#1657)

* feat(search): per-page max-pool in searchVector (both engines)

T1 of the retrieval-cathedral wave (supersedes #1616). Vector search returned
chunk-grain top-k with no DISTINCT ON, so a page could be represented by a
weak chunk while a hub page's chunks crowded a distinct page's strong chunk
out of the candidate set entirely. Keyword search always pooled per page; the
vector path did not.

- New shared buildBestPerPagePoolCte() in sql-ranking.ts — single source of
  truth consumed by searchKeyword + searchVector across postgres + pglite, so
  the two engines can't drift (the recurring parity bug class).
- searchVector both engines: compute score as a select-list expr (HNSW
  ORDER BY stays pure-distance), pool DISTINCT ON (slug) over the full
  candidate set before the user LIMIT, deterministic tiebreak
  (slug, score DESC, page_id ASC, chunk_id ASC).
- All keyword pooling blocks refactored onto the shared builder (DRY).
- Regression test: a hub page's chunks no longer crowd out a distinct page's
  strong chunk; results are one-per-page by best chunk. Fails on old path.

Verified: real-Postgres engine-parity 22/22, PGLite hermetic suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(search): title-phrase boost (page.title first-class signal)

T2 of the retrieval-cathedral wave. A query that is a phrase from a page's
title ("Greek amphitheater" -> "The Mingtang - Indoor Greek Amphitheater")
matched a weak body chunk instead of being recognized as a title hit. Names
of things deserve weight.

- New pure title-match.ts: isTitlePhraseMatch (contiguous token-run inside
  page.title OR exact full-title match). Precision guards: >= 2 content
  tokens OR exact full-title; stopword filter; token-boundary match (no raw
  substring). Reused by the eval later so production + bench can't drift.
- applyTitleBoost post-fusion stage in hybrid.ts: reads page.title (not the
  brittle "first chunk"), floor-ratio-gated, stamps title_match_boost for
  --explain, never touches base_score (the agent's dedup confidence).
- ModeBundle.title_boost knob (1.25, on in all modes - cheap gated
  correctness fix), search.title_boost config key, dashboard description.
- KNOBS_HASH_VERSION 6 -> 7 so a boost-on cache write can't serve a
  boost-off lookup; all version-pin + canonical-bundle assertions updated.
- 18 new tests (matcher 13 + stage 5); typecheck clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(search): page_aliases data layer (T3 foundation)

Free-text alias resolution for search. gbrain stored a page's chosen names
in pages.frontmatter `aliases:` JSONB but search never consulted them, so a
query like "Hall of Light" or "明堂" couldn't surface the "Mingtang" page.

DELIBERATELY SEPARATE from slug_aliases (re-grounded against current code):
  - slug_aliases:  old-slug -> canonical-slug (wikilink/get_page redirect,
    populated only from concept-redirect conversions)
  - page_aliases:  normalized free-text name -> canonical slug (search hop)
Overloading slug_aliases would muddy two distinct semantics, so this is a
new table, not an extension (honors DRY by keeping concepts separate).

- src/core/search/alias-normalize.ts: ONE normalizeAlias() (NFKC + lowercase
  + ws-collapse + quote-strip) + normalizeAliasList() shared by the write
  (ingest) and read (search) paths so they match on the same key (CQ2).
- Migration v108 page_aliases (source_id, alias_norm, slug); btree
  (source_id, alias_norm) for indexed-equality hop, NOT ILIKE; unique TRIPLE
  (not source_id+alias_norm) so two pages may claim one alias — collisions
  reported + resolved at query time, not blocked at ingest (Codex#8).
  Mirror in pglite-schema.ts; Postgres fresh gets it from the migration.
- engine.resolveAliases(aliasNorms, {sourceId|sourceIds}) read +
  setPageAliases(slug, source, aliasNorms) write, both engines, source-scoped.
- 17 tests: normalize round-trip, collision, source-scope, replace, clear.

Ingest projection + the hybridSearch alias hop land next (T3 wiring).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(search): alias hop + ingest projection (T3 wiring)

Wires the page_aliases data layer into ingest (write) and hybridSearch (read)
so a query that is a page's declared chosen name surfaces that page — the
named-thing class neither max-pool nor title-boost can fix (true synonyms with
zero surface overlap: "Hall of Light" / "明堂" -> the Mingtang page).

- Ingest projection (import-file.ts): after the page write commits,
  normalizeAliasList(frontmatter.aliases) -> engine.setPageAliases. Always
  called (even []) so removing an alias clears its row; content_hash includes
  non-timestamp frontmatter so alias edits reach this path, not the skip branch.
  Fail-soft + pre-v108-safe (isUndefinedTableError swallowed).
- applyAliasHop (hybrid.ts), AFTER rerank so a named query reliably surfaces
  its page: FULL normalized-query exact match only (no substring/n-grams),
  skip >6-token prose queries, present-boost 1.10x / inject absent canonical at
  top-of-organic + epsilon (never absolute 1.0, D3), collisions alpha-ordered +
  capped at 3, fail-open on pre-v108 / lookup error (D9). Stamps alias_hit for
  the T4 evidence contract.
- SearchResult.alias_hit attribution field.
- 8 tests: inject/boost/CJK/no-match/long-skip/collision + ingest projection
  round-trip + alias-removal-clears. 73 pass across the T1/T2/T3 + import suite.

Backfill of existing pages' aliases lands as T8 (reindex --aliases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(search): evidence/create_safety contract + search→cheap-hybrid + per-call mode (T4)

The agent-facing fix for the incident's ROOT behavior: tonight the agent read a
single blended 0.64 score, decided "no strong match, safe to write a new page",
and wrote a duplicate on a developed concept page. A blended RRF/cosine score is
not a calibrated probability, so the don't-duplicate decision must key off WHY a
page matched, not a raw number.

- evidence.ts: classifyEvidence (alias_hit > exact_title_match > high_vector_match
  > keyword_exact > weak_semantic) + createSafetyFor (exists | probable | unknown).
  stampEvidence runs at the end of every hybrid return path (main + both keyword
  fallbacks). SearchResult gains evidence + create_safety. The agent keys
  don't-duplicate off create_safety='exists', not a score threshold.
- search op → cheap-hybrid everywhere (D4/D15): full vector+keyword+RRF+pool+
  title+alias, expansion OFF (no per-call LLM cost); `query` stays full-control.
  search.mcp_keyword_only escape hatch (D17) keeps the old keyword-only behavior
  for operators who don't want query text sent to an embedding provider.
- Alias hop + evidence now also run on the keyword-only fallback paths (the
  named-thing fix is most valuable exactly when vector is unavailable).
- Per-call `mode` (D5): honored ONLY for local/trusted callers (ctx.remote===
  false) so a remote OAuth client can't escalate to costly tokenmax; local +
  unknown mode rejects loudly; threaded into resolveSearchMode + the cache key.
- 30 tests (evidence classifier incl. before/after-incident cases, per-call mode
  gate, alias hop). Updated mcp-eval-capture to the new cheap-hybrid contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cli): reconcile `gbrain search` dispatch (T5)

After T4 made the `search` op cheap-hybrid, `gbrain search "x"` already does the
right thing — but `gbrain search modes/stats/tune` would have run a hybrid search
for the literal word "modes" instead of opening the config dashboard (the op
intercepts before the unreachable handleCliOnly dashboard path).

Add a pre-dispatch interception in main(): `search` + subArgs[0] in
{modes,stats,tune} → runSearch dashboard (with the v0.41.6.0 read-only connect+
dispatch 10s timeout preserved); everything else (free-text) falls through to the
cheap-hybrid `search` op. Subprocess test pins all three routes:
modes/stats → dashboard, free-text → search op ("No results", not "Unknown
subcommand").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(eval): NamedThingBench retrieval-quality gate (T6)

The eval that makes the retrieval-maxpool incident impossible to reintroduce
silently. 7 query families, each a failure class the incident exposed:
title-substring, generic-to-named, alias-synonym, multi-chunk-dilution,
short-vs-rich, graph-relationship, hard-negative.

- src/eval/retrieval-quality/harness.ts: pure scoring (Hit@1/Hit@3/MRR per
  family) + injected SearchFn (CLI uses hybridSearch; tests stub it) +
  evaluateGate. D12 gate: hard-gate the families that ARE the bug from day one
  (title-substring Hit@1>=0.95, alias-synonym Hit@1>=0.98, dilution Hit@3=1.0),
  warn-then-enforce the softer families. Env-overridable floors.
- `gbrain eval retrieval-quality <fixture.jsonl> [--json] [--source]` +
  dispatch in eval.ts. Exit 0 PASS / 1 FAIL / 2 USAGE.
- Synthetic fixture (placeholder names only, privacy-grep guarded) + hermetic
  gate test: seeds a synthetic brain, forces the keyword+title+alias path
  (embed transport stubbed to throw — free, deterministic), asserts the bug
  families pass. The vector max-pool guarantee is pinned separately by
  searchvector-maxpool.test.ts.
- CI gate: the hermetic test is a normal unit test, so it runs in every PR
  shard — the gate is live on every change.
- 23 tests (harness unit + hermetic gate + fixture privacy guard).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): rank-1 score drift signal (T7)

Standing observability so a retrieval regression is caught before a human hits
it in chat (like tonight). Aggregate columns on search_telemetry (NOT per-query
rows, D10): sum_rank1_score + count_rank1 + 3 coarse buckets (<0.6 / 0.6-0.85 /
>=0.85). The mean rank-1 base_score is the headline; a downward drift = retrieval
quality regressing.

- hybrid.ts: capture rank-1 base_score at all three return paths, thread through
  emitMeta → recordSearchTelemetry opts (like results_count).
- telemetry.ts: Bucket + record + flush ON CONFLICT-add + readSearchStats expose
  avg_rank1_score (null when no samples — no NaN) + rank1_distribution.
- Migration v109 ADD COLUMN IF NOT EXISTS (both engines; search_telemetry lives
  only in migration v57, so the v57+v109 chain covers fresh + upgrade). Columns
  exempted in schema-bootstrap-coverage (no forward-ref index → no bootstrap need).
- `gbrain search stats` surfaces the avg + bucket line; JSON envelope auto-carries
  the fields. "true-positive" wording dropped per Codex#14 — production has no
  labels, so this is an unlabeled rank-1 score histogram; labeled calibration
  lives in NamedThingBench (T6).
- 3 round-trip tests (mean+buckets, no-result excluded, empty=null).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(reindex): gbrain reindex --aliases backfill (T8)

Import-time projection (T3) covers new + changed pages; this backfills EXISTING
pages whose frontmatter `aliases:` predate v108 / the projection. Walks
listAllPageRefs (cheap cross-source (source_id, slug) enumeration), reads each
page's frontmatter aliases, writes page_aliases via setPageAliases.

Idempotent (setPageAliases replaces) so re-running is convergent — no op-checkpoint
needed (fast, no embedding). --dry-run reports would-write counts, --source
narrows, --limit caps, --json envelope, progress reporter. Wired into the
`reindex` dispatch alongside --markdown / --multimodal.

4 tests: backfill from array + comma-scalar frontmatter, --dry-run writes
nothing, idempotent second run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(search): pre-migration fail-open regression (T9)

Pins that pre-v108 brains (no page_aliases table) keep working: applyAliasHop
returns input unchanged + doesn't throw, importFromContent with frontmatter
aliases still imports (projection swallows table-missing via isUndefinedTableError),
and resolveAliases surfaces the error for the caller to catch.

Completes the T9 mandatory regression set (dilution → searchvector-maxpool,
dispatch → cli-search-dispatch, MCP contract → mcp-eval-capture, engine parity
→ engine-parity 22/22, pre-migration → here).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(search): Phase-0 retrieval diagnostic — `gbrain search diagnose` (T0)

The operator-facing trace the user runs against the production brain to pin
which retrieval layer surfaces (or misses) a target page — the diagnostic the
plan front-loaded so we don't ship a fix that doesn't move the incident.

`gbrain search diagnose "<query>" --target <slug> [--json] [--source]` reports,
for the target: keyword rank+score, vector rank+score (skipped/graceful if no
embedding provider), whether the query is a registered alias, and the hybrid
final rank + evidence + create_safety + which boosts fired (title/alias). The
verdict names the layer that surfaces the target at rank 1 (or "none"), telling
you whether the lever is max-pool/innerLimit (vector) vs title/alias.

Wired into the `search` dispatch alongside modes/stats/tune (60s timeout since
it runs real retrieval). 2 hermetic tests (alias-query trace + title-phrase
trace). For the Mingtang incident, run:
  gbrain search diagnose "Greek amphitheater" --target projects/new-greek-theater/concept_v0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(retrieval): corrected incident record + named-thing layers + glossary (T10)

- RETRIEVAL_MAXPOOL_INCIDENT.md: replaces closed PR #1616's RFC with the
  verified record — what happened, the disease, the corrections to the RFC's
  mechanics (search was keyword-only, --mode unthreaded, hybrid already pooled
  at dedup, aliases dead to search), the four-layer fix that shipped, and the
  triage commands (search diagnose / reindex --aliases / search stats / eval
  retrieval-quality).
- RETRIEVAL.md: new "Named-thing retrieval" section documenting per-page pool +
  title boost + alias hop + the evidence contract, reconciling the doc with the
  shipped pipeline (closes the doc/reality gap).
- metric-glossary.ts + regenerated METRIC_GLOSSARY.md: Hit@1, Hit@3,
  avg_rank1_score (drift signal, not labeled accuracy), and create_safety
  (the evidence contract) now carry plain-English glossary entries.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(eval): NamedThingBench fixture privacy guard via slug-shape (T6 fixup)

The banned-name literal list itself tripped check-privacy/check-test-real-names.
Replace it with the load-bearing assertion: every fixture slug must be an
*-example placeholder (no real brain page can be referenced).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(search): source-isolation in per-page pool + alias hop (P0, codex adversarial)

Codex outside-voice caught two source-isolation P0s in the retrieval wave — the
exact class the v0.34.1 seal guards. Both fixed before merge.

P0-1: buildBestPerPagePoolCte pooled on `slug` alone. In a federated brain, two
pages with the same slug in different sources collapsed before ranking/pagination
(the neighbor-source page dropped). Now DISTINCT ON (COALESCE(source_id,'default'),
slug) — composite key matching dedup.ts's pageKey. Also fixes the PRE-EXISTING
keyword-path bug (best_per_page was slug-only before this wave); real-PG parity 23/23.

P0-2: the alias hop dropped source_id. resolveAliases returned bare slugs and
applyAliasHop hydrated via getPage(slug, undefined), so a federated caller could
get the default-source page injected or the right allowed-source page suppressed.
resolveAliases now returns {slug, source_id} pairs; applyAliasHop matches by
(source_id, slug) and fetches each canonical in its OWN source.

Regression tests: alias hop boosts only the aliased source (not same-slug in
another source); resolveAliases keeps cross-source same-slug distinct.

Deferred as documented tradeoffs (TODO): evidence high_vector_match label uses
blended base_score not pure cosine; deep-pagination candidate budget is
chunk-bounded; telemetry writes swallow errors pre-v109 on rolling deploys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.30.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: v0.41.30.0 retrieval cathedral — CLAUDE.md key files + llms regen

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: renumber release v0.41.30.0 → v0.41.34.0 (queue moved)

Version trio + CHANGELOG header + CLAUDE.md key-file annotations + TODOS
heading + regenerated llms bundles, all moved to 0.41.34.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): restore glossary roster + harden facts-anti-loop hook budget

Two CI failures surfaced after the master merges that brought the branch to
111 migrations:

1. shard 1 — `ALL_METRICS roster > matches the renderer output (no orphans)`:
   the merge took master's `renderMetricGlossaryMarkdown` whose `groups`
   array lacked this branch's 4 retrieval-quality keys (hit@1, hit@3,
   avg_rank1_score, create_safety). `ALL_METRICS` (derived via Object.keys)
   kept them, so the roster test saw 4 orphans. The freshness check
   (check:eval-glossary) passed because renderer-output == committed doc —
   it can't catch a renderer that drops a metric; the roster test can.
   Restored the "Retrieval-Quality / Evidence Metrics (NamedThingBench)"
   group + regenerated docs/eval/METRIC_GLOSSARY.md.

2. shard 2 — facts-anti-loop's two engine-dependent put_page tests failed
   while the two engine-free extractFactsFromTurn tests passed (the
   signature of a partially-failed beforeAll). This file has a documented
   PGLite-cold-start-under-deep-shard-load timeout history; the 30s budget
   was tuned for 95 migrations and the chain is now 111. Bumped to 60s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): isolate facts-anti-loop in its own process (serial)

Follow-up to the prior hook-timeout bump, which was the wrong theory: the
[58ms]/[71ms] body times in the re-run prove beforeAll did NOT time out —
the engine connects and the two put_page tests run and fail for real, while
the two engine-free extractFactsFromTurn tests in the same file pass.

put_page (via dispatchToolCall) touches process-global singletons (the
facts queue + the AI gateway used by importFromContent's embed step). Some
sibling file in the 78-file shard-2 process leaves residual global state
that makes put_page's pre-backstop path fail on the CI runner. The failure
is NOT reproducible alone, in a Linux oven/bun:1 container, or in a full
local shard-2 run (1172 pass) — only on the GitHub runner, deterministically.

Per CLAUDE.md's test-isolation rules, a test coupled to shared process
state belongs in its own process. Renamed to *.serial.test.ts so it runs
in the dedicated serial-tests job (scripts/run-serial-tests.sh spawns a
fresh `bun test` per serial file), where it passes deterministically;
test-shard.sh excludes serial files from the matrix. Updated the comment
to reflect the real cause and refreshed the test-weights.json key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): close cross-file gateway-config pollution in test shards

The prior serial-move theory was incomplete. The real, single root cause
behind all three shard failures (2, 5, 10) is cross-file AI-gateway config
pollution within a shard's bun process:

- A test calls configureGateway() and doesn't restore the gateway on exit.
  The legacy-embedding preload pins OpenAI/1536 ONCE at process start and
  re-pins per-test ONLY when the gateway slot is empty — so a leaker that
  reconfigured the gateway to the v0.37 default (zeroentropyai:zembed-1 /
  1280-d) and never reset poisons every later file in the shard.
- Victim A (shard 5, test/search/searchvector-maxpool.test.ts): runs
  initSchema in beforeAll under the leaked gateway → content_chunks.embedding
  becomes vector(1280) → inserting its hardcoded 1536-d basis vectors throws
  pgvector CheckExpectedDim.
- Victims B/C (shard 10 facts-backstop-gating, shard 2 facts-anti-loop):
  put_page's importFromContent embeds by design (embed failure PROPAGATES,
  Codex C2). Under a leaked fake-key gateway the embed step 401s and put_page
  returns isError → the backstop assertions fail.

My branch's shard re-partition (added test files + weight changes) merely
co-located leakers with victims; the hazard was latent.

Fixes (root cause + self-sufficient victims):
- test/search/rerank.test.ts (the shard-5 leaker): add afterAll(resetGateway).
  Its stub omits embedding_model, so it fell back to the ZE/1280 default;
  now it restores the empty slot so the preload re-pins legacy for the next
  file.
- test/search/searchvector-maxpool.test.ts: pin configureGateway(openai/1536)
  in beforeAll BEFORE initSchema (initSchema runs before any preload
  beforeEach, so it can't rely on the inherited slot).
- test/facts-backstop-gating.test.ts + test/facts-anti-loop.test.ts: reset
  the gateway in beforeEach so put_page's embed is a graceful no-op; reverted
  anti-loop from the serial quarantine back into the matrix (the serial move
  was the wrong fix for a gateway-state problem).

Validated deterministically: a non-resetting leaker that poisons the gateway
to ZE, run first in one bun process, no longer breaks any of the three
victims (14/14 pass). verify 29/29, typecheck clean, isolation lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants