Skip to content

feat(#251 follow-up): Layer 2 ablation eval + tuned multipliers + floor gate#260

Merged
jayzalowitz merged 2 commits into
mainfrom
jayzalowitz/251-layer2-eval
May 12, 2026
Merged

feat(#251 follow-up): Layer 2 ablation eval + tuned multipliers + floor gate#260
jayzalowitz merged 2 commits into
mainfrom
jayzalowitz/251-layer2-eval

Conversation

@jayzalowitz

Copy link
Copy Markdown
Owner

Summary

Built the labeled-retrieval ablation that PR #259 was waiting on, ran it, and the eval immediately found two real issues with the shipped Layer 2. Both are now fixed.

Closes (partial): #251 Layer 2 validation gate. Layer 2 stays beta / opt-in (the eval is now the gate, not the result). Default-on rollout remains blocked on real OpenAI embeddings + the tier-aware exclude UI.

The eval

packages/memory-gbrain/src/__tests__/tier-ablation-eval.test.ts runs the same query set against the same corpus twice (flag off vs on) and prints R@5 / P@5 / MRR-of-primary side-by-side, broken down by query class. Three classes:

  • user_behavior — queries where tier weighting should lift the user-authored variant.
  • received_content — queries about specific received content; must not collapse.
  • neutral — entity lookups, must not break.

Hand-built fixture: 17 labeled signals across 7 queries (each query has both authored + received variants matching the same text), plus 40 realistic-mix distractors (12% authored / 40% personal / 15% broadcast / 20% newsletter / 13% automated — matches real Gmail volume).

What the numbers say

Hash-trick embeddings, 47-signal corpus:

Metric pure-RRF tier-weighted
user_behavior MRR (n=3) 0.667 1.000
received_content MRR (n=3) 1.000 0.542
neutral MRR (n=1) 1.000 1.000
aggregate R@5 1.000 0.857
aggregate MRR primary 0.857 0.804

Two real issues found and fixed

Issue 1: Aggressive demote weights pushed legitimate primary hits below distractors

Original normal-band multipliers (newsletter 0.4×, automated 0.2×) were strong enough that the rank-1 "AWS billing alert" notification, demoted to 0.2×, fell BELOW the distractor pool entirely. Layer 2 became unusable for any query about specific received content.

Fixed by rebalancing to promote-strong / demote-soft:

normal (was → now)
user_sent_originated 1.5 → 1.5
user_sent_reply 1.2 → 1.2
inbox_personal 1.0 → 1.0
inbox_broadcast 0.8 → 0.9
inbox_newsletter 0.4 → 0.85
inbox_automated 0.2 → 0.8

The boost-authored side stays; the demote-received side relaxes significantly. The product intent of Layer 2 was "prefer authored on ambiguous queries," not "suppress received."

Issue 2: Multiplicative weighting without a gate let weak matches leapfrog strong ones

RRF scores decay slowly: 1 / (60 + rank). A rank-1 page scores 0.0164; a rank-30 page scores 0.0111. With multiplicative weights, a rank-30 authored distractor at 0.0111 × 1.5 = 0.0167 beat a rank-1 received primary at 0.0164 × 0.8 = 0.0131 — even with the gentlest possible demote.

Fixed by adding RrfFoldOptions.tierWeightFloorRatio to the RRF fold (default 0.85). Only pages whose raw rrfScore is at least 85% of the top page's raw score are eligible for the multiplier. The tail of the candidate pool keeps its unweighted score. Layer 2's intent is to rerank plausible candidates, never to promote noise.

What remains

received_content MRR landed at 0.542 — better than the 0.0 we'd have had without the floor gate, but still down from 1.0. The honest read: when a received_content query has an authored secondary (q4: AWS billing alert vs the user's reply about it), Layer 2 surfaces the user's reply first. This is sometimes-right (the user usually wants their own response) and sometimes-wrong (sometimes you really want the raw alert).

Most of the gap is from hash-trick spurious overlap — unrelated authored content gets enough phantom token-collision to enter the top candidate pool. With real OpenAI embeddings the spurious matches should largely disappear and received_content MRR should improve materially.

Layer 2 stays opt-in. The eval test is now the gate. Default-on rollout is still blocked on running this with OpenAI embeddings against a real-traffic corpus + the tier-aware exclude UI (privacy sub-issue, separate from this PR).

Test plan

  • pnpm --filter @skytwin/memory-gbrain test — 95 pass, 1 skipped (no regression)
  • pnpm --filter @skytwin/memory-gbrain-crdb-adapter test — 70 pass, 6 skipped (DB-gated)
  • pnpm build --concurrency=1 — 35/35 packages
  • pnpm test — 70/70 turbo tasks, all green
  • The ablation test passes its guardrail bars (user_behavior must lift, received_content ≥ 0.40, neutral must not regress, aggregate R@5 ≥ 0.7)
  • Existing tier-weighted-retrieval.test.ts rewritten to match the rebalanced multipliers — brief-reply downweight test now compares brief vs full authored (the actual mechanism), not brief vs newsletter (which only worked under the old extreme demote)

Files

  • rrf.ts — floor-ratio gate + defensive multiplier validation.
  • tier-weights.ts — rebalanced sparse/normal/dense tables.
  • tier-ablation-corpus.ts (new) — purpose-built labeled fixture.
  • tier-ablation-eval.test.ts (new) — the ablation + side-by-side report.
  • Existing tier tests updated to the new weight values.

🤖 Generated with Claude Code

…or gate

Built a labeled-retrieval ablation that runs the same query set against
the same corpus twice (Layer 2 off vs on) and reports R@5, P@5, and
MRR-of-primary side-by-side, broken down by query class:

  user_behavior     0.667 → 1.000   (intended: tier weighting lifts these)
  received_content  1.000 → 0.542   (acceptable: see below)
  neutral           1.000 → 1.000   (must not regress)

The eval immediately surfaced two real issues in the shipped Layer 2:

1. Aggressive demote weights (newsletter 0.4x, automated 0.2x in the
   normal band) pushed primary-hit notifications BELOW the distractor
   pool. "Find my AWS billing alert" queries lost the alert entirely.
   Fixed by rebalancing to promote-strong/demote-soft: newsletter 0.85x,
   automated 0.8x. Authored side stays at 1.5x/1.2x.

2. Multiplicative weighting without a gate let weak-match distractors
   leapfrog strong primary hits. A rank-30 authored distractor at
   1/(60+30) * 1.5 = 0.0167 beat a rank-1 received primary at
   1/(60+1) * 0.8 = 0.0131. The RRF score curve decays slowly enough
   that a 50% raw-score drop still leaves room for a 50% boost to
   overtake it.

   Fixed by adding `RrfFoldOptions.tierWeightFloorRatio` (default 0.85)
   to the RRF fold. Only pages whose raw rrfScore is at least 85% of
   the top page's raw score get the multiplier; the tail of the
   candidate pool keeps its unweighted score. The intent of Layer 2
   is "rerank plausible candidates" — never "promote noise."

Files:
- `rrf.ts`: floor-ratio gate, NaN/Infinity coercion, negative → 0 clamp.
- `tier-weights.ts`: rebalanced sparse/normal/dense multiplier tables.
- `tier-ablation-corpus.ts` (new): 17 labeled + 40 realistic-mix
  distractor signals. Mix matches real Gmail volume (12% authored, etc.)
  to avoid the earlier-fixture pitfall where 40% authored gave Layer 2
  artificial wins.
- `tier-ablation-eval.test.ts` (new): the ablation, prints side-by-side
  table on every run as a tuning artifact, asserts a guardrail bar that
  catches regression without enforcing impossible-with-hash-trick perfection.
- `tier-weights.test.ts` / `tier-weighted-retrieval.test.ts`: updated to
  match rebalanced weights. Brief-reply downweight test now compares
  brief vs full-authored (the actual mechanism) instead of brief vs
  newsletter (which only worked because the old demote was extreme).

Layer 2 stays beta / opt-in. The 0.542 received_content MRR is the
honest hash-trick floor; OpenAI embeddings should improve it materially
by reducing spurious overlap in the candidate pool. The eval test
remains in CI as a permanent regression guardrail.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 14:39

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a labeled Layer 2 ablation eval and uses it to retune tier multipliers plus introduce an RRF “floor gate” so tier weighting only reranks near-top candidates (intended to preserve received-content recall while still boosting authored content on ambiguous queries).

Changes:

  • Added a new labeled retrieval ablation test + purpose-built corpus fixture to compare pure RRF vs tier-weighted retrieval.
  • Retuned tier multiplier tables (promote authored strongly, demote received softly) and updated existing tests to match.
  • Added tierWeightFloorRatio gating in rrfFold to restrict when multipliers are applied.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
packages/memory-gbrain/src/tests/tier-weighted-retrieval.test.ts Updates brief-reply downweight assertion to match the new multiplier regime.
packages/memory-gbrain/src/tests/tier-ablation-eval.test.ts New ablation eval that runs flag-off vs flag-on and asserts class-level guardrails.
packages/memory-gbrain/src/tests/fixtures/tier-ablation-corpus.ts New hand-built labeled corpus + distractors for the ablation eval.
packages/memory-gbrain-crdb-adapter/src/tier-weights.ts Rebalances sparse/normal/dense tier multiplier tables.
packages/memory-gbrain-crdb-adapter/src/rrf.ts Adds tierWeightFloorRatio to gate multiplier application in the fold.
packages/memory-gbrain-crdb-adapter/src/tests/tier-weights.test.ts Updates expected multipliers and override composition tests.
CHANGELOG.md Documents the ablation eval, tuning results, and the new floor gate.
Comments suppressed due to low confidence (1)

packages/memory-gbrain-crdb-adapter/src/rrf.ts:120

  • tierWeightFloorRatio gating skips calling tierWeight for hits below the threshold (if (hit.rrfScore < threshold) continue;). This breaks the documented userOverride: 'hidden' → 0.0× (drop from results) behavior (and any other “drop” semantics implemented via returning 0), because low-scoring hidden pages will never be weighted to 0 and will remain in results. Consider always evaluating tierWeight to enforce drop/validation, and only applying non-zero multipliers when rrfScore >= threshold, or split explicit overrides (hidden/pinned) from tier weighting so overrides apply regardless of the floor gate.
  if (options.tierWeight) {
    const weight = options.tierWeight;
    const floorRatio = options.tierWeightFloorRatio ?? 0.85;
    // Determine the gating threshold: only pages whose raw rrfScore is
    // at least `floorRatio * topRawScore` get tier-weighted. Pages below
    // the threshold are kept at their unweighted rrfScore — they can
    // neither boost above strong matches nor get demoted below noise.
    // This is the load-bearing change for the labeled-retrieval ablation.
    let topRawScore = 0;
    for (const hit of entries) {
      if (hit.rrfScore > topRawScore) topRawScore = hit.rrfScore;
    }
    const threshold = topRawScore * floorRatio;

    for (const hit of entries) {
      if (hit.rrfScore < threshold) continue;
      // Coerce non-finite or non-number multipliers to 1.0 (identity) so a
      // misbehaving callback can't poison rrfScore into NaN/Infinity. Clamp
      // negatives to 0 — they share the same "drop the page" semantics as
      // userOverride: 'hidden'.
      const raw = weight(hit.page.metadata);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CHANGELOG.md
@@ -1,5 +1,51 @@
All notable changes to SkyTwin will be documented in this file.

## [unreleased] — Layer 2 ablation eval + tuned multipliers + floor-ratio gate (#251 follow-up)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed by design — the existing CHANGELOG.md already has multiple stacked [unreleased] entries (Piper TTS, Mobile voice recording, per-Lifebook briefing prose, Provenance graph filter, etc.). This is the project's established convention; entries get squashed on release. Matching the convention here rather than restructuring. Happy to discuss separately if the convention itself is up for revisit.

Comment thread CHANGELOG.md Outdated

1. **Aggressive demote weights were structurally wrong.** Original normal-band multipliers (newsletter 0.4×, automated 0.2×) pushed primary-hit notifications BELOW the distractor pool. Layer 2 became unusable for "find my AWS billing alert" queries. **Fixed** by rebalancing to promote-strong/demote-soft: newsletter 0.85×, automated 0.8× in the normal band.

2. **Multiplicative weighting without a gate lets weak matches leapfrog strong ones.** A rank-30 distractor with an `authored` tier (×1.5 = 0.024) could beat a rank-1 primary with an `automated` tier (×0.8 = 0.013) — because the RRF score curve decays slowly enough that a 50% drop in raw score still leaves room for a 50% boost to overtake it. **Fixed** by adding `tierWeightFloorRatio` (default 0.85) to the RRF fold: only pages whose raw rrfScore is ≥ 85% of the top score are eligible for the multiplier. Tail-of-pool candidates stay at their unweighted score.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 79f137a. The 0.024 was wrong — actual is 0.0111 × 1.5 = 0.0167. Rewrote the math to show the real rank-1 vs rank-30 raw scores at rrfK=60 so the example is verifiable. Qualitative point (soft-demote + boost can flip the order on a shallow curve) still holds.

Comment on lines +4 to +8
* Purpose-built corpus where each labeled query has paired authored +
* received variants matching the same query text. The tier multiplier
* is the actual deciding factor for ranking — text + vector overlap is
* tuned so the variants land at adjacent ranks without weighting, and
* the multiplier is what flips the order.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 79f137a. Updated the docstring — q5 (GitHub CI failure) and q6 (newsletter) are deliberately received-only because no plausible authored sibling exists for those. The docstring now reads "MOST labeled queries" and explains why q5/q6 are single-variant.


// ── Findings & assertions ──────────────────────────────────────
//
// The numbers above reflect *hash-trick embeddings on a 47-signal

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 79f137a. Updated to 52-signal (12 labeled + 40 distractors). Also fixed the matching reference in CHANGELOG.md.

// ─────────────────────────────────────────────────────────────────────────

// Distractor topics deliberately avoid keywords from the labeled queries
// ("investor", "hiring", "billing", "Friday doc", "newsletter", "Github",

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 79f137a.

Four documentation / comment issues from Copilot, all valid:

1. CHANGELOG RRF math example was wrong. Wrote "0.0111 × 1.5 = 0.024"
   when the actual product is 0.0167. The qualitative point — soft demote
   + 1.5× boost lets a rank-30 distractor leapfrog a rank-1 primary —
   still holds; updated to show the correct arithmetic and the actual
   rank-1 vs rank-30 raw scores at rrfK=60.
2. Fixture docstring said "each labeled query has paired authored +
   received variants" but q5 (GitHub Actions failure) and q6 (newsletter)
   are deliberately received-only — there's no plausible authored sibling
   for a CI-bot notification. Updated to "MOST queries" and explained why
   q5/q6 are single-variant.
3. tier-ablation-eval comment said "47-signal corpus" but the fixture
   builds 12 labeled + 40 distractors = 52. Updated to 52, also reflected
   in the CHANGELOG.
4. Typo: "Github" → "GitHub" in the keyword-avoidance list comment.

NOT addressed: the "multiple [unreleased] sections" comment. The existing
CHANGELOG already has multiple stacked [unreleased] entries (Piper TTS,
mobile voice, per-Lifebook briefing prose, etc.) — this is the project
convention and changes squash on release. Matching the convention here
rather than restructuring.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jayzalowitz jayzalowitz merged commit ace230a into main May 12, 2026
14 of 15 checks passed
jayzalowitz added a commit that referenced this pull request May 12, 2026
…-sender (#270)

* feat(#251 follow-up): tier-aware privacy controls — pin / hide / hide-sender

Adds the privacy guardrail that gates Layer 2's default-on rollout.
Users can now mark individual indexed pages as pinned (×2 score) or
hidden (dropped from search entirely), and bulk-hide every page from a
given sender in one click.

Engine:
- buildPageMetadata stamps lower-cased `fromAddress` on every page
  whose signal carries `data.from`. The connector inlines a 3-line
  display-name stripper to avoid pulling @skytwin/connectors into the
  memory layer's dep graph.
- `updatePageMetadata(userId, pageId, patch)` repository helper:
  JSONB-merges a partial patch into brain_pages.metadata, scoped by
  user_id so a guessable id can't be used to mutate another user's
  rows. Returns affected count; 0 → 404 at the route layer.
- `hideAllPagesFromSender(userId, fromAddress)` repository helper:
  bulk UPDATE on every page where `metadata->>'fromAddress'` matches.
  Lower-cases the input to match the stamped form.
- In-memory mirrors of both for tests.

API:
- POST /api/memory-config/pages/:pageId/override with body
  { override: 'pinned' | 'hidden' | null }. 404 on missing or
  foreign-owned page (deliberately collapsed so a caller can't probe).
- POST /api/memory-config/senders/hide with body { fromAddress }.
  Returns { ok, fromAddress, hidden: <count> }.
- /api/memory-config/dashboard `pages.recent[]` now includes
  `fromAddress` so the UI knows what to send to the sender route.

Web:
- Per-row actions on the Recent pages table: Pin/Unpin, Hide/Unhide,
  Hide sender. Buttons swap action based on current state. Sender
  button only appears when a fromAddress exists (i.e. email-derived).
  The bulk action confirms before firing.

Tests: +5 in-memory unit, +3 embedded-port, +6 routes. All green.

This unblocks the eventual Layer 2 default-on rollout — combined with
the labeled-retrieval eval guardrail from PR #260, users now have both
a working multiplier and a path to disclaim what they don't want
amplified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#251 post-/review): address Copilot findings on PR #270

Two findings on the privacy/exclude UI, both valid:

1. JSONB null-clear comment was actually wrong. `jsonb_a || jsonb_b`
   does NOT strip null values — it stores them as JSON null. The
   pin/hide path used `{ userOverride: null }` and the resulting
   metadata column would have ended up with `{"userOverride": null}`
   instead of the key being absent. tierMultiplier treats both as
   "no override," so functionally it worked, but the column shape
   would have drifted.

   Fixed the helper itself rather than just the comment. `updatePageMetadata`
   now splits the patch into "set" (non-null values, applied via JSONB ||)
   and "drop" (null values, applied via JSONB - 'key'). In-memory mirror
   matches. New unit test verifies the cleared key is *absent*, not
   present-and-null.

2. fromAddress trimmed for validation but not normalized for the
   adapter call. Inputs like "  Spam@Vendor.Example.com  " passed
   validation but never matched the trimmed + lowered stored values,
   leading to a confusing `hidden: 0`. Normalize once at the route
   boundary (trim + lower) and use the canonical form for the adapter
   call, the response, and any logged context. Existing test updated
   to assert the trim-and-lower path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 13, 2026
)

* feat(#251 follow-up): real-embedding ablation result + opt-in test

Validates the working hypothesis that hash-trick embeddings were
inflating the `received_content` MRR regression observed in PR #260.

Result: the hypothesis was wrong. With Ollama + nomic-embed-text (a
real semantic model), received_content MRR = ~0.54, essentially
identical to the hash-trick floor of 0.542. The regression is
structural to the multiplicative weighting approach:

- 1.5× authored vs 0.8× automated = 1.875× swing. Authored content
  within 53% of the top raw score leapfrogs strong-but-demoted primary
  hits.
- The 0.85 floor-ratio gate isn't enough: with real semantic
  embeddings, authored content from unrelated queries (e.g. q1 Series
  B pitch) has non-trivial similarity to q5 ("GitHub Actions CI
  failed") — lands in the candidate pool above threshold, gets the
  1.5× boost, beats the legitimate primary.

Diagnostic dump from the eval confirms: q5's primary lands at rank 8/9
with tier-on, behind three authored pages from unrelated queries plus
several distractors.

Decision: Layer 2 stays opt-in. Default-on is blocked on a structural
fix (switch to additive bonuses, target received_content MRR ≥ 0.95).
That's a separate sub-issue.

What ships:
- New opt-in test branch in tier-ablation-eval.test.ts, gated on
  RUN_REAL_EMBEDDING_EVAL=1. Reproducible with any local Ollama or
  OpenAI key — defaults respect OPENAI_EMBEDDING_BASE_URL /
  OPENAI_EMBEDDING_MODEL / OPENAI_EMBEDDING_API_KEY.
- runOneMode helper now takes an optional embedding provider so the
  same harness drives both modes.
- Diagnostic dump in printReport now triggers for queries that degrade
  by more than 2 ranks (not just "missing"), so future tuning has
  better signal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#251 post-/review): address Copilot findings on PR #272

Three small but valid finds:

1. Real-embedding section header said the eval "validates the working
   hypothesis" but the actual result was the opposite. Rewrote the
   header to describe what the eval actually is now — a permanent
   reproducible artifact for the next person tuning Layer 2 weighting.

2. Off-by-one between comment and condition for the diagnostic dump.
   Comment said "more than 3 ranks"; code was `> 2` (i.e. 3 or more).
   Aligned to the more useful semantics: `>= 3`, "3 or more rank
   degradation".

3. CHANGELOG had "out of scope for tonight" — time-relative phrasing
   that won't read well later. Changed to "out of scope for this PR".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 18, 2026
…g + floor-ratio gate)

CLAUDE.md:102 said the package "multiplies fused scores by per-tier weights"
— stale since #260/#272 flipped the implementation to additive bonuses (the
multiplicative cut had a structural leapfrog regression on real dense
embedders). Updated to describe the actual current behavior + the opt-in
`floorRatio` gate aligned with gbrain v0.35.6.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jayzalowitz added a commit that referenced this pull request May 18, 2026
…/ PR #1129 (#334)

* v0.6.52.0 sync(memory): align floor-ratio gate with gbrain v0.35.6.0 / PR #1129

Our contribution PR #1091 was closed in favor of upstream's reworked shape
that merged yesterday as #1129. The codex outside-voice review caught three
defensive gaps in the original shape; port the fixes here and align naming
with `SearchOpts.floorRatio` / `search.floor_ratio`.

Hardened in `packages/memory-gbrain-crdb-adapter/src/rrf.ts`:

- No-positive-signal inputs (all-negative, all-NaN, empty) disable the gate
  via `Number.NEGATIVE_INFINITY` threshold. Prior `topRawScore = 0` init
  would silently reject every entry against `r.score < 0`.
- Out-of-range `floorRatio` (NaN, Infinity, negative, > 1) disables the
  gate. Defense in depth so a malformed config value never gates anything.
- NaN-score skip in the bonus loop. `NaN < threshold` is `false` in JS, so
  a NaN-scored hit would slip past the gate check and have the bonus added
  on top — poisoning the sort. Now an explicit `Number.isFinite` check
  skips the bonus stage for non-finite scores.

New surface:

- `RrfFoldOptions.floorRatio` (deprecated alias `tierWeightFloorRatio`
  preserved; new name wins when both are set).
- `computeFloorThreshold(entries, floorRatio)` exported helper, mirrors
  gbrain's same-named function for cross-port mental-model consistency.
- `DEFAULT_FLOOR_RATIO = 0.85` exported as a named constant.

Tests:

- 12 new cases pinning the defensive guards (out-of-range / NaN / Infinity /
  empty / negative-top / all-NaN / mixed), the precedence rule between the
  new and deprecated option names, and an updated strong-vs-tail RRF setup
  that actually exercises the gate (RRF flatness means rank-1 vs rank-2
  don't differ enough — you need rank-20+ in a single list).
- 130/130 RRF tests pass; 100/100 `@skytwin/memory-gbrain` tests pass; the
  realistic-retrieval ablation reports `mean R@5 1.000 pure-RRF / 0.929
  tier-on`, unchanged.

Upstream feature triage (filed for follow-up, not in this PR):

- #897 search-lite (token budget + semantic query cache + intent weighting)
  — pursue first, ~2 days. Token budget addresses Claude API limits.
- #1008 zerank-2 reranker — pursue second, ~1.5 days. Slots between RRF
  fold and tier-weight bonus.
- #996 federated_read — skip (one brain per user).
- #1131 temporal trajectory — defer (entity-time-series shape not our fit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rrf): codex T2/T3 + clarify floorRatio:0 test (post-/review)

Three findings from /review's codex outside-voice pass + one nit from
the structured Pass-1/Pass-2 review.

T2 — Invalid floorRatio bypassing legacy guard. Old precedence
`options.floorRatio ?? options.tierWeightFloorRatio ?? DEFAULT_FLOOR_RATIO`
meant `floorRatio: NaN` (e.g. from buggy config parse) won the chain and
disabled the gate, even if the caller had `tierWeightFloorRatio: 0.85`
working. A partially migrated caller piping a malformed new option
silently nullified the legacy guard. New `pickValidFloorRatio` helper
walks the candidates and uses the first finite value in [0, 1]; invalid
falls through to the alias, then to `DEFAULT_FLOOR_RATIO`.

T3 — NaN/+Infinity rrfScores surviving the sort. The comment claimed
non-finite scores "sort to the end," but `b.rrfScore - a.rrfScore`
returns `NaN` for any NaN side, which JS sort treats as 0 (equal) —
leaving NaN-scored hits in insertion order, where they can land in
top-k via `slice(0, k)`. `+Infinity` sorts to the top of every query.
Reachable when a caller passes `rrfK: NaN` (which makes every
`1 / (rrfK + rank)` NaN). Fix: the post-loop filter drops ALL
non-finite-scored entries (was: only `-Infinity` hidden sentinel),
and a mirror filter applies on the pure-RRF path so corrupted scores
never reach the comparator. Sort now operates only on finite scores
and produces a deterministic ranking.

Nit — Test `floorRatio: 0 disables the bonus completely` name + comment
contradicted the test's own assertions (which confirm the bonus IS
applied for every positive-score hit). Renamed to match the actual
behavior: `floorRatio: 0` is a valid in-range value, threshold computes
to 0, every positive-score hit passes the gate. Distinct from the
`undefined`/out-of-range disable path even though they're observationally
equivalent for positive-score inputs.

5 new test cases pin the codex fixes:
- invalid floorRatio falls back to deprecated alias when alias is valid
- invalid floorRatio + invalid alias falls back to DEFAULT_FLOOR_RATIO
- floorRatio: undefined falls through to alias when alias is valid
- rrfK: NaN corrupts all contributions → all hits dropped (output [])
- partial corruption (some finite hits) survives the filter intact

CHANGELOG updated to reflect: 135 tests (was 130), and the codex review
fixes are called out under a "Codex review fixes (post-review)" subsection
so the audit trail is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rrf-tests): address Copilot review — strengthen invalid-floorRatio test setup

Copilot caught a coverage gap in three new tests: the rank-1-vs-rank-2
text-only setup doesn't actually exercise the gate, because rank-2 rrfScore
(1/62 ≈ 0.0161) is above the default 0.85 × 1/61 (≈ 0.0139) — so the
assertion passes regardless of whether `floorRatio: NaN` (or -0.5, or 1.5)
correctly disabled the gate, fell back to default, or did anything at all.
Same class as the gap I caught and fixed in the back-compat tests; missed
updating these three.

Rewritten to use the `strongVsTail` helper (rank-1-in-both + rank-21-text-only)
so the assertions distinguish "gate at 0.85" from "gate disabled" — the weak
hit's rrfScore is 1/81 (well below 0.85 × 2/61 = 0.0279), so the bonus only
applies if the gate is genuinely disabled.

Note: the test semantics also flipped because of the codex T2 fix landed in
89fd6be. Pre-T2, invalid `floorRatio` disabled the gate. Post-T2, invalid
falls back to the alias then to DEFAULT_FLOOR_RATIO. So the renamed tests
now assert "falls back to DEFAULT_FLOOR_RATIO" rather than "disables gate."
The test rationale comment block calls this out explicitly so a future
maintainer doesn't try to revert to the pre-T2 expectations.

CHANGELOG test count corrected: 22 new test cases (was 17, originally 12 —
my mistake; the count drifted across each round of review fixes).

Copilot's other two comments were already addressed in 89fd6be:
- "Test name `floorRatio: 0 disables bonus` contradicts assertions" → fixed
- "NaN-score skip leaves non-finite rrfScore in entries; sort can be poisoned"
  → fixed (post-loop `isFinite` filter drops non-finite scores before sort).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify memory-gbrain-crdb-adapter description (additive scoring + floor-ratio gate)

CLAUDE.md:102 said the package "multiplies fused scores by per-tier weights"
— stale since #260/#272 flipped the implementation to additive bonuses (the
multiplicative cut had a structural leapfrog regression on real dense
embedders). Updated to describe the actual current behavior + the opt-in
`floorRatio` gate aligned with gbrain v0.35.6.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants