feat(#251 follow-up): Layer 2 ablation eval + tuned multipliers + floor gate by jayzalowitz · Pull Request #260 · jayzalowitz/skytwin

jayzalowitz · 2026-05-12T14:39:57Z

Summary

Built the labeled-retrieval ablation that PR #259 was waiting on, ran it, and the eval immediately found two real issues with the shipped Layer 2. Both are now fixed.

Closes (partial): #251 Layer 2 validation gate. Layer 2 stays beta / opt-in (the eval is now the gate, not the result). Default-on rollout remains blocked on real OpenAI embeddings + the tier-aware exclude UI.

The eval

packages/memory-gbrain/src/__tests__/tier-ablation-eval.test.ts runs the same query set against the same corpus twice (flag off vs on) and prints R@5 / P@5 / MRR-of-primary side-by-side, broken down by query class. Three classes:

user_behavior — queries where tier weighting should lift the user-authored variant.
received_content — queries about specific received content; must not collapse.
neutral — entity lookups, must not break.

Hand-built fixture: 17 labeled signals across 7 queries (each query has both authored + received variants matching the same text), plus 40 realistic-mix distractors (12% authored / 40% personal / 15% broadcast / 20% newsletter / 13% automated — matches real Gmail volume).

What the numbers say

Hash-trick embeddings, 47-signal corpus:

Metric	pure-RRF	tier-weighted
user_behavior MRR (n=3)	0.667	1.000
received_content MRR (n=3)	1.000	0.542
neutral MRR (n=1)	1.000	1.000
aggregate R@5	1.000	0.857
aggregate MRR primary	0.857	0.804

Two real issues found and fixed

Issue 1: Aggressive demote weights pushed legitimate primary hits below distractors

Original normal-band multipliers (newsletter 0.4×, automated 0.2×) were strong enough that the rank-1 "AWS billing alert" notification, demoted to 0.2×, fell BELOW the distractor pool entirely. Layer 2 became unusable for any query about specific received content.

Fixed by rebalancing to promote-strong / demote-soft:

	normal (was → now)
user_sent_originated	1.5 → 1.5
user_sent_reply	1.2 → 1.2
inbox_personal	1.0 → 1.0
inbox_broadcast	0.8 → 0.9
inbox_newsletter	0.4 → 0.85
inbox_automated	0.2 → 0.8

The boost-authored side stays; the demote-received side relaxes significantly. The product intent of Layer 2 was "prefer authored on ambiguous queries," not "suppress received."

Issue 2: Multiplicative weighting without a gate let weak matches leapfrog strong ones

RRF scores decay slowly: 1 / (60 + rank). A rank-1 page scores 0.0164; a rank-30 page scores 0.0111. With multiplicative weights, a rank-30 authored distractor at 0.0111 × 1.5 = 0.0167 beat a rank-1 received primary at 0.0164 × 0.8 = 0.0131 — even with the gentlest possible demote.

Fixed by adding RrfFoldOptions.tierWeightFloorRatio to the RRF fold (default 0.85). Only pages whose raw rrfScore is at least 85% of the top page's raw score are eligible for the multiplier. The tail of the candidate pool keeps its unweighted score. Layer 2's intent is to rerank plausible candidates, never to promote noise.

What remains

received_content MRR landed at 0.542 — better than the 0.0 we'd have had without the floor gate, but still down from 1.0. The honest read: when a received_content query has an authored secondary (q4: AWS billing alert vs the user's reply about it), Layer 2 surfaces the user's reply first. This is sometimes-right (the user usually wants their own response) and sometimes-wrong (sometimes you really want the raw alert).

Most of the gap is from hash-trick spurious overlap — unrelated authored content gets enough phantom token-collision to enter the top candidate pool. With real OpenAI embeddings the spurious matches should largely disappear and received_content MRR should improve materially.

Layer 2 stays opt-in. The eval test is now the gate. Default-on rollout is still blocked on running this with OpenAI embeddings against a real-traffic corpus + the tier-aware exclude UI (privacy sub-issue, separate from this PR).

Test plan

pnpm --filter @skytwin/memory-gbrain test — 95 pass, 1 skipped (no regression)
pnpm --filter @skytwin/memory-gbrain-crdb-adapter test — 70 pass, 6 skipped (DB-gated)
pnpm build --concurrency=1 — 35/35 packages
pnpm test — 70/70 turbo tasks, all green
The ablation test passes its guardrail bars (user_behavior must lift, received_content ≥ 0.40, neutral must not regress, aggregate R@5 ≥ 0.7)
Existing tier-weighted-retrieval.test.ts rewritten to match the rebalanced multipliers — brief-reply downweight test now compares brief vs full authored (the actual mechanism), not brief vs newsletter (which only worked under the old extreme demote)

Files

rrf.ts — floor-ratio gate + defensive multiplier validation.
tier-weights.ts — rebalanced sparse/normal/dense tables.
tier-ablation-corpus.ts (new) — purpose-built labeled fixture.
tier-ablation-eval.test.ts (new) — the ablation + side-by-side report.
Existing tier tests updated to the new weight values.

🤖 Generated with Claude Code

…or gate Built a labeled-retrieval ablation that runs the same query set against the same corpus twice (Layer 2 off vs on) and reports R@5, P@5, and MRR-of-primary side-by-side, broken down by query class: user_behavior 0.667 → 1.000 (intended: tier weighting lifts these) received_content 1.000 → 0.542 (acceptable: see below) neutral 1.000 → 1.000 (must not regress) The eval immediately surfaced two real issues in the shipped Layer 2: 1. Aggressive demote weights (newsletter 0.4x, automated 0.2x in the normal band) pushed primary-hit notifications BELOW the distractor pool. "Find my AWS billing alert" queries lost the alert entirely. Fixed by rebalancing to promote-strong/demote-soft: newsletter 0.85x, automated 0.8x. Authored side stays at 1.5x/1.2x. 2. Multiplicative weighting without a gate let weak-match distractors leapfrog strong primary hits. A rank-30 authored distractor at 1/(60+30) * 1.5 = 0.0167 beat a rank-1 received primary at 1/(60+1) * 0.8 = 0.0131. The RRF score curve decays slowly enough that a 50% raw-score drop still leaves room for a 50% boost to overtake it. Fixed by adding `RrfFoldOptions.tierWeightFloorRatio` (default 0.85) to the RRF fold. Only pages whose raw rrfScore is at least 85% of the top page's raw score get the multiplier; the tail of the candidate pool keeps its unweighted score. The intent of Layer 2 is "rerank plausible candidates" — never "promote noise." Files: - `rrf.ts`: floor-ratio gate, NaN/Infinity coercion, negative → 0 clamp. - `tier-weights.ts`: rebalanced sparse/normal/dense multiplier tables. - `tier-ablation-corpus.ts` (new): 17 labeled + 40 realistic-mix distractor signals. Mix matches real Gmail volume (12% authored, etc.) to avoid the earlier-fixture pitfall where 40% authored gave Layer 2 artificial wins. - `tier-ablation-eval.test.ts` (new): the ablation, prints side-by-side table on every run as a tuning artifact, asserts a guardrail bar that catches regression without enforcing impossible-with-hash-trick perfection. - `tier-weights.test.ts` / `tier-weighted-retrieval.test.ts`: updated to match rebalanced weights. Brief-reply downweight test now compares brief vs full-authored (the actual mechanism) instead of brief vs newsletter (which only worked because the old demote was extreme). Layer 2 stays beta / opt-in. The 0.542 received_content MRR is the honest hash-trick floor; OpenAI embeddings should improve it materially by reducing spurious overlap in the candidate pool. The eval test remains in CI as a permanent regression guardrail. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a labeled Layer 2 ablation eval and uses it to retune tier multipliers plus introduce an RRF “floor gate” so tier weighting only reranks near-top candidates (intended to preserve received-content recall while still boosting authored content on ambiguous queries).

Changes:

Added a new labeled retrieval ablation test + purpose-built corpus fixture to compare pure RRF vs tier-weighted retrieval.
Retuned tier multiplier tables (promote authored strongly, demote received softly) and updated existing tests to match.
Added tierWeightFloorRatio gating in rrfFold to restrict when multipliers are applied.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
packages/memory-gbrain/src/tests/tier-weighted-retrieval.test.ts	Updates brief-reply downweight assertion to match the new multiplier regime.
packages/memory-gbrain/src/tests/tier-ablation-eval.test.ts	New ablation eval that runs flag-off vs flag-on and asserts class-level guardrails.
packages/memory-gbrain/src/tests/fixtures/tier-ablation-corpus.ts	New hand-built labeled corpus + distractors for the ablation eval.
packages/memory-gbrain-crdb-adapter/src/tier-weights.ts	Rebalances sparse/normal/dense tier multiplier tables.
packages/memory-gbrain-crdb-adapter/src/rrf.ts	Adds `tierWeightFloorRatio` to gate multiplier application in the fold.
packages/memory-gbrain-crdb-adapter/src/tests/tier-weights.test.ts	Updates expected multipliers and override composition tests.
CHANGELOG.md	Documents the ablation eval, tuning results, and the new floor gate.

Comments suppressed due to low confidence (1)

packages/memory-gbrain-crdb-adapter/src/rrf.ts:120

tierWeightFloorRatio gating skips calling tierWeight for hits below the threshold (if (hit.rrfScore < threshold) continue;). This breaks the documented userOverride: 'hidden' → 0.0× (drop from results) behavior (and any other “drop” semantics implemented via returning 0), because low-scoring hidden pages will never be weighted to 0 and will remain in results. Consider always evaluating tierWeight to enforce drop/validation, and only applying non-zero multipliers when rrfScore >= threshold, or split explicit overrides (hidden/pinned) from tier weighting so overrides apply regardless of the floor gate.

  if (options.tierWeight) {
    const weight = options.tierWeight;
    const floorRatio = options.tierWeightFloorRatio ?? 0.85;
    // Determine the gating threshold: only pages whose raw rrfScore is
    // at least `floorRatio * topRawScore` get tier-weighted. Pages below
    // the threshold are kept at their unweighted rrfScore — they can
    // neither boost above strong matches nor get demoted below noise.
    // This is the load-bearing change for the labeled-retrieval ablation.
    let topRawScore = 0;
    for (const hit of entries) {
      if (hit.rrfScore > topRawScore) topRawScore = hit.rrfScore;
    }
    const threshold = topRawScore * floorRatio;

    for (const hit of entries) {
      if (hit.rrfScore < threshold) continue;
      // Coerce non-finite or non-number multipliers to 1.0 (identity) so a
      // misbehaving callback can't poison rrfScore into NaN/Infinity. Clamp
      // negatives to 0 — they share the same "drop the page" semantics as
      // userOverride: 'hidden'.
      const raw = weight(hit.page.metadata);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jayzalowitz · 2026-05-12T14:55:58Z

@@ -1,5 +1,51 @@
 All notable changes to SkyTwin will be documented in this file.

+## [unreleased] — Layer 2 ablation eval + tuned multipliers + floor-ratio gate (#251 follow-up)


Not addressed by design — the existing CHANGELOG.md already has multiple stacked [unreleased] entries (Piper TTS, Mobile voice recording, per-Lifebook briefing prose, Provenance graph filter, etc.). This is the project's established convention; entries get squashed on release. Matching the convention here rather than restructuring. Happy to discuss separately if the convention itself is up for revisit.

jayzalowitz · 2026-05-12T14:56:00Z

+
+1. **Aggressive demote weights were structurally wrong.** Original normal-band multipliers (newsletter 0.4×, automated 0.2×) pushed primary-hit notifications BELOW the distractor pool. Layer 2 became unusable for "find my AWS billing alert" queries. **Fixed** by rebalancing to promote-strong/demote-soft: newsletter 0.85×, automated 0.8× in the normal band.
+
+2. **Multiplicative weighting without a gate lets weak matches leapfrog strong ones.** A rank-30 distractor with an `authored` tier (×1.5 = 0.024) could beat a rank-1 primary with an `automated` tier (×0.8 = 0.013) — because the RRF score curve decays slowly enough that a 50% drop in raw score still leaves room for a 50% boost to overtake it. **Fixed** by adding `tierWeightFloorRatio` (default 0.85) to the RRF fold: only pages whose raw rrfScore is ≥ 85% of the top score are eligible for the multiplier. Tail-of-pool candidates stay at their unweighted score.


Addressed in 79f137a. The 0.024 was wrong — actual is 0.0111 × 1.5 = 0.0167. Rewrote the math to show the real rank-1 vs rank-30 raw scores at rrfK=60 so the example is verifiable. Qualitative point (soft-demote + boost can flip the order on a shallow curve) still holds.

jayzalowitz · 2026-05-12T14:56:02Z

+ * Purpose-built corpus where each labeled query has paired authored +
+ * received variants matching the same query text. The tier multiplier
+ * is the actual deciding factor for ranking — text + vector overlap is
+ * tuned so the variants land at adjacent ranks without weighting, and
+ * the multiplier is what flips the order.


Addressed in 79f137a. Updated the docstring — q5 (GitHub CI failure) and q6 (newsletter) are deliberately received-only because no plausible authored sibling exists for those. The docstring now reads "MOST labeled queries" and explains why q5/q6 are single-variant.

jayzalowitz · 2026-05-12T14:56:04Z

+
+    // ── Findings & assertions ──────────────────────────────────────
+    //
+    // The numbers above reflect *hash-trick embeddings on a 47-signal


Addressed in 79f137a. Updated to 52-signal (12 labeled + 40 distractors). Also fixed the matching reference in CHANGELOG.md.

jayzalowitz · 2026-05-12T14:56:05Z

+// ─────────────────────────────────────────────────────────────────────────
+
+// Distractor topics deliberately avoid keywords from the labeled queries
+// ("investor", "hiring", "billing", "Friday doc", "newsletter", "Github",


Addressed in 79f137a.

Four documentation / comment issues from Copilot, all valid: 1. CHANGELOG RRF math example was wrong. Wrote "0.0111 × 1.5 = 0.024" when the actual product is 0.0167. The qualitative point — soft demote + 1.5× boost lets a rank-30 distractor leapfrog a rank-1 primary — still holds; updated to show the correct arithmetic and the actual rank-1 vs rank-30 raw scores at rrfK=60. 2. Fixture docstring said "each labeled query has paired authored + received variants" but q5 (GitHub Actions failure) and q6 (newsletter) are deliberately received-only — there's no plausible authored sibling for a CI-bot notification. Updated to "MOST queries" and explained why q5/q6 are single-variant. 3. tier-ablation-eval comment said "47-signal corpus" but the fixture builds 12 labeled + 40 distractors = 52. Updated to 52, also reflected in the CHANGELOG. 4. Typo: "Github" → "GitHub" in the keyword-avoidance list comment. NOT addressed: the "multiple [unreleased] sections" comment. The existing CHANGELOG already has multiple stacked [unreleased] entries (Piper TTS, mobile voice, per-Lifebook briefing prose, etc.) — this is the project convention and changes squash on release. Matching the convention here rather than restructuring. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-sender (#270) * feat(#251 follow-up): tier-aware privacy controls — pin / hide / hide-sender Adds the privacy guardrail that gates Layer 2's default-on rollout. Users can now mark individual indexed pages as pinned (×2 score) or hidden (dropped from search entirely), and bulk-hide every page from a given sender in one click. Engine: - buildPageMetadata stamps lower-cased `fromAddress` on every page whose signal carries `data.from`. The connector inlines a 3-line display-name stripper to avoid pulling @skytwin/connectors into the memory layer's dep graph. - `updatePageMetadata(userId, pageId, patch)` repository helper: JSONB-merges a partial patch into brain_pages.metadata, scoped by user_id so a guessable id can't be used to mutate another user's rows. Returns affected count; 0 → 404 at the route layer. - `hideAllPagesFromSender(userId, fromAddress)` repository helper: bulk UPDATE on every page where `metadata->>'fromAddress'` matches. Lower-cases the input to match the stamped form. - In-memory mirrors of both for tests. API: - POST /api/memory-config/pages/:pageId/override with body { override: 'pinned' | 'hidden' | null }. 404 on missing or foreign-owned page (deliberately collapsed so a caller can't probe). - POST /api/memory-config/senders/hide with body { fromAddress }. Returns { ok, fromAddress, hidden: <count> }. - /api/memory-config/dashboard `pages.recent[]` now includes `fromAddress` so the UI knows what to send to the sender route. Web: - Per-row actions on the Recent pages table: Pin/Unpin, Hide/Unhide, Hide sender. Buttons swap action based on current state. Sender button only appears when a fromAddress exists (i.e. email-derived). The bulk action confirms before firing. Tests: +5 in-memory unit, +3 embedded-port, +6 routes. All green. This unblocks the eventual Layer 2 default-on rollout — combined with the labeled-retrieval eval guardrail from PR #260, users now have both a working multiplier and a path to disclaim what they don't want amplified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#251 post-/review): address Copilot findings on PR #270 Two findings on the privacy/exclude UI, both valid: 1. JSONB null-clear comment was actually wrong. `jsonb_a || jsonb_b` does NOT strip null values — it stores them as JSON null. The pin/hide path used `{ userOverride: null }` and the resulting metadata column would have ended up with `{"userOverride": null}` instead of the key being absent. tierMultiplier treats both as "no override," so functionally it worked, but the column shape would have drifted. Fixed the helper itself rather than just the comment. `updatePageMetadata` now splits the patch into "set" (non-null values, applied via JSONB ||) and "drop" (null values, applied via JSONB - 'key'). In-memory mirror matches. New unit test verifies the cleared key is *absent*, not present-and-null. 2. fromAddress trimmed for validation but not normalized for the adapter call. Inputs like " Spam@Vendor.Example.com " passed validation but never matched the trimmed + lowered stored values, leading to a confusing `hidden: 0`. Normalize once at the route boundary (trim + lower) and use the canonical form for the adapter call, the response, and any logged context. Existing test updated to assert the trim-and-lower path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

) * feat(#251 follow-up): real-embedding ablation result + opt-in test Validates the working hypothesis that hash-trick embeddings were inflating the `received_content` MRR regression observed in PR #260. Result: the hypothesis was wrong. With Ollama + nomic-embed-text (a real semantic model), received_content MRR = ~0.54, essentially identical to the hash-trick floor of 0.542. The regression is structural to the multiplicative weighting approach: - 1.5× authored vs 0.8× automated = 1.875× swing. Authored content within 53% of the top raw score leapfrogs strong-but-demoted primary hits. - The 0.85 floor-ratio gate isn't enough: with real semantic embeddings, authored content from unrelated queries (e.g. q1 Series B pitch) has non-trivial similarity to q5 ("GitHub Actions CI failed") — lands in the candidate pool above threshold, gets the 1.5× boost, beats the legitimate primary. Diagnostic dump from the eval confirms: q5's primary lands at rank 8/9 with tier-on, behind three authored pages from unrelated queries plus several distractors. Decision: Layer 2 stays opt-in. Default-on is blocked on a structural fix (switch to additive bonuses, target received_content MRR ≥ 0.95). That's a separate sub-issue. What ships: - New opt-in test branch in tier-ablation-eval.test.ts, gated on RUN_REAL_EMBEDDING_EVAL=1. Reproducible with any local Ollama or OpenAI key — defaults respect OPENAI_EMBEDDING_BASE_URL / OPENAI_EMBEDDING_MODEL / OPENAI_EMBEDDING_API_KEY. - runOneMode helper now takes an optional embedding provider so the same harness drives both modes. - Diagnostic dump in printReport now triggers for queries that degrade by more than 2 ranks (not just "missing"), so future tuning has better signal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#251 post-/review): address Copilot findings on PR #272 Three small but valid finds: 1. Real-embedding section header said the eval "validates the working hypothesis" but the actual result was the opposite. Rewrote the header to describe what the eval actually is now — a permanent reproducible artifact for the next person tuning Layer 2 weighting. 2. Off-by-one between comment and condition for the diagnostic dump. Comment said "more than 3 ranks"; code was `> 2` (i.e. 3 or more). Aligned to the more useful semantics: `>= 3`, "3 or more rank degradation". 3. CHANGELOG had "out of scope for tonight" — time-relative phrasing that won't read well later. Changed to "out of scope for this PR". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…g + floor-ratio gate) CLAUDE.md:102 said the package "multiplies fused scores by per-tier weights" — stale since #260/#272 flipped the implementation to additive bonuses (the multiplicative cut had a structural leapfrog regression on real dense embedders). Updated to describe the actual current behavior + the opt-in `floorRatio` gate aligned with gbrain v0.35.6.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…/ PR #1129 (#334) * v0.6.52.0 sync(memory): align floor-ratio gate with gbrain v0.35.6.0 / PR #1129 Our contribution PR #1091 was closed in favor of upstream's reworked shape that merged yesterday as #1129. The codex outside-voice review caught three defensive gaps in the original shape; port the fixes here and align naming with `SearchOpts.floorRatio` / `search.floor_ratio`. Hardened in `packages/memory-gbrain-crdb-adapter/src/rrf.ts`: - No-positive-signal inputs (all-negative, all-NaN, empty) disable the gate via `Number.NEGATIVE_INFINITY` threshold. Prior `topRawScore = 0` init would silently reject every entry against `r.score < 0`. - Out-of-range `floorRatio` (NaN, Infinity, negative, > 1) disables the gate. Defense in depth so a malformed config value never gates anything. - NaN-score skip in the bonus loop. `NaN < threshold` is `false` in JS, so a NaN-scored hit would slip past the gate check and have the bonus added on top — poisoning the sort. Now an explicit `Number.isFinite` check skips the bonus stage for non-finite scores. New surface: - `RrfFoldOptions.floorRatio` (deprecated alias `tierWeightFloorRatio` preserved; new name wins when both are set). - `computeFloorThreshold(entries, floorRatio)` exported helper, mirrors gbrain's same-named function for cross-port mental-model consistency. - `DEFAULT_FLOOR_RATIO = 0.85` exported as a named constant. Tests: - 12 new cases pinning the defensive guards (out-of-range / NaN / Infinity / empty / negative-top / all-NaN / mixed), the precedence rule between the new and deprecated option names, and an updated strong-vs-tail RRF setup that actually exercises the gate (RRF flatness means rank-1 vs rank-2 don't differ enough — you need rank-20+ in a single list). - 130/130 RRF tests pass; 100/100 `@skytwin/memory-gbrain` tests pass; the realistic-retrieval ablation reports `mean R@5 1.000 pure-RRF / 0.929 tier-on`, unchanged. Upstream feature triage (filed for follow-up, not in this PR): - #897 search-lite (token budget + semantic query cache + intent weighting) — pursue first, ~2 days. Token budget addresses Claude API limits. - #1008 zerank-2 reranker — pursue second, ~1.5 days. Slots between RRF fold and tier-weight bonus. - #996 federated_read — skip (one brain per user). - #1131 temporal trajectory — defer (entity-time-series shape not our fit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rrf): codex T2/T3 + clarify floorRatio:0 test (post-/review) Three findings from /review's codex outside-voice pass + one nit from the structured Pass-1/Pass-2 review. T2 — Invalid floorRatio bypassing legacy guard. Old precedence `options.floorRatio ?? options.tierWeightFloorRatio ?? DEFAULT_FLOOR_RATIO` meant `floorRatio: NaN` (e.g. from buggy config parse) won the chain and disabled the gate, even if the caller had `tierWeightFloorRatio: 0.85` working. A partially migrated caller piping a malformed new option silently nullified the legacy guard. New `pickValidFloorRatio` helper walks the candidates and uses the first finite value in [0, 1]; invalid falls through to the alias, then to `DEFAULT_FLOOR_RATIO`. T3 — NaN/+Infinity rrfScores surviving the sort. The comment claimed non-finite scores "sort to the end," but `b.rrfScore - a.rrfScore` returns `NaN` for any NaN side, which JS sort treats as 0 (equal) — leaving NaN-scored hits in insertion order, where they can land in top-k via `slice(0, k)`. `+Infinity` sorts to the top of every query. Reachable when a caller passes `rrfK: NaN` (which makes every `1 / (rrfK + rank)` NaN). Fix: the post-loop filter drops ALL non-finite-scored entries (was: only `-Infinity` hidden sentinel), and a mirror filter applies on the pure-RRF path so corrupted scores never reach the comparator. Sort now operates only on finite scores and produces a deterministic ranking. Nit — Test `floorRatio: 0 disables the bonus completely` name + comment contradicted the test's own assertions (which confirm the bonus IS applied for every positive-score hit). Renamed to match the actual behavior: `floorRatio: 0` is a valid in-range value, threshold computes to 0, every positive-score hit passes the gate. Distinct from the `undefined`/out-of-range disable path even though they're observationally equivalent for positive-score inputs. 5 new test cases pin the codex fixes: - invalid floorRatio falls back to deprecated alias when alias is valid - invalid floorRatio + invalid alias falls back to DEFAULT_FLOOR_RATIO - floorRatio: undefined falls through to alias when alias is valid - rrfK: NaN corrupts all contributions → all hits dropped (output []) - partial corruption (some finite hits) survives the filter intact CHANGELOG updated to reflect: 135 tests (was 130), and the codex review fixes are called out under a "Codex review fixes (post-review)" subsection so the audit trail is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rrf-tests): address Copilot review — strengthen invalid-floorRatio test setup Copilot caught a coverage gap in three new tests: the rank-1-vs-rank-2 text-only setup doesn't actually exercise the gate, because rank-2 rrfScore (1/62 ≈ 0.0161) is above the default 0.85 × 1/61 (≈ 0.0139) — so the assertion passes regardless of whether `floorRatio: NaN` (or -0.5, or 1.5) correctly disabled the gate, fell back to default, or did anything at all. Same class as the gap I caught and fixed in the back-compat tests; missed updating these three. Rewritten to use the `strongVsTail` helper (rank-1-in-both + rank-21-text-only) so the assertions distinguish "gate at 0.85" from "gate disabled" — the weak hit's rrfScore is 1/81 (well below 0.85 × 2/61 = 0.0279), so the bonus only applies if the gate is genuinely disabled. Note: the test semantics also flipped because of the codex T2 fix landed in 89fd6be. Pre-T2, invalid `floorRatio` disabled the gate. Post-T2, invalid falls back to the alias then to DEFAULT_FLOOR_RATIO. So the renamed tests now assert "falls back to DEFAULT_FLOOR_RATIO" rather than "disables gate." The test rationale comment block calls this out explicitly so a future maintainer doesn't try to revert to the pre-T2 expectations. CHANGELOG test count corrected: 22 new test cases (was 17, originally 12 — my mistake; the count drifted across each round of review fixes). Copilot's other two comments were already addressed in 89fd6be: - "Test name `floorRatio: 0 disables bonus` contradicts assertions" → fixed - "NaN-score skip leaves non-finite rrfScore in entries; sort can be poisoned" → fixed (post-loop `isFinite` filter drops non-finite scores before sort). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify memory-gbrain-crdb-adapter description (additive scoring + floor-ratio gate) CLAUDE.md:102 said the package "multiplies fused scores by per-tier weights" — stale since #260/#272 flipped the implementation to additive bonuses (the multiplicative cut had a structural leapfrog regression on real dense embedders). Updated to describe the actual current behavior + the opt-in `floorRatio` gate aligned with gbrain v0.35.6.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 12, 2026 14:39

Copilot started reviewing on behalf of jayzalowitz May 12, 2026 14:40 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

jayzalowitz merged commit ace230a into main May 12, 2026
14 of 15 checks passed

jayzalowitz mentioned this pull request May 12, 2026

feat(#251 follow-up): tier-aware privacy controls — pin / hide / hide-sender #270

Merged

3 tasks

jayzalowitz mentioned this pull request May 13, 2026

feat(#251 follow-up): real-embedding ablation result + opt-in test #272

Merged

3 tasks

jayzalowitz mentioned this pull request May 13, 2026

docs: sync README + CLAUDE.md with v0.6.18-0.6.21 merge sweep #273

Merged

3 tasks

jayzalowitz mentioned this pull request May 18, 2026

v0.6.52.0 sync(memory): align floor-ratio gate with gbrain v0.35.6.0 / PR #1129 #334

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(#251 follow-up): Layer 2 ablation eval + tuned multipliers + floor gate#260

feat(#251 follow-up): Layer 2 ablation eval + tuned multipliers + floor gate#260
jayzalowitz merged 2 commits into
mainfrom
jayzalowitz/251-layer2-eval

jayzalowitz commented May 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

jayzalowitz May 12, 2026

Uh oh!

jayzalowitz May 12, 2026

Uh oh!

jayzalowitz May 12, 2026

Uh oh!

jayzalowitz May 12, 2026

Uh oh!

jayzalowitz May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -1,5 +1,51 @@
		All notable changes to SkyTwin will be documented in this file.

		## [unreleased] — Layer 2 ablation eval + tuned multipliers + floor-ratio gate (#251 follow-up)


		1. Aggressive demote weights were structurally wrong. Original normal-band multipliers (newsletter 0.4×, automated 0.2×) pushed primary-hit notifications BELOW the distractor pool. Layer 2 became unusable for "find my AWS billing alert" queries. Fixed by rebalancing to promote-strong/demote-soft: newsletter 0.85×, automated 0.8× in the normal band.

		2. Multiplicative weighting without a gate lets weak matches leapfrog strong ones. A rank-30 distractor with an `authored` tier (×1.5 = 0.024) could beat a rank-1 primary with an `automated` tier (×0.8 = 0.013) — because the RRF score curve decays slowly enough that a 50% drop in raw score still leaves room for a 50% boost to overtake it. Fixed by adding `tierWeightFloorRatio` (default 0.85) to the RRF fold: only pages whose raw rrfScore is ≥ 85% of the top score are eligible for the multiplier. Tail-of-pool candidates stay at their unweighted score.

Conversation

jayzalowitz commented May 12, 2026

Summary

The eval

What the numbers say

Two real issues found and fixed

Issue 1: Aggressive demote weights pushed legitimate primary hits below distractors

Issue 2: Multiplicative weighting without a gate let weak matches leapfrog strong ones

What remains

Test plan

Files

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jayzalowitz May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jayzalowitz May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jayzalowitz May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jayzalowitz May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jayzalowitz May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants