Commit ea5d567

jphein and claude committed
feat(search): RRF vs convex-blend A/B run + findings (#162)
Closes #162. The harness landed in PR #247; this is the measurement and writeup it gated.

Harness change: eval_fusion_ab.py gains a self-contained ``--mine-corpus PATH`` mode mirroring scripts/eval_multi_encoder_rrf.py. It mines PATH into a temp local-Chroma palace and runs the convex-vs-RRF A/B against it, without touching the daemon or sharing GPU capacity. The ``--i-know-the-backfill-is-done`` gate stays on the ``--palace-path`` mode for future daemon-side wiring.

Loader change: _load_probes now accepts both the v2 dict shape used by scripts/probes_v2_git_derived.json ({"_meta": ..., "probes": [...]}) and the legacy list-of-lists shape. This matches eval_multi_encoder_rrf.py's loader, so probe sets are interchangeable.

Run: mempalace/ corpus (3413 drawers, identical setup to the 2026-05-15 multi-encoder eval), probes_v2_git_derived.json (200 probes), candidate_strategy="hybrid", top-K=10.

Finding: RRF underperforms the convex blend on this corpus:

- MRR 0.4075 → 0.3758 (Δ -0.0318)
- Recall@5 47.0% → 45.0% (Δ -2 pp)
- Recall@10 52.0% → 47.0% (Δ -5 pp)
- 10 improved / 20 regressed / 170 tied

The regressions are severe: 14 of the 20 drop out of the top 10 entirely under RRF (4 of them from rank 1 under convex), and 5 more slip from rank 1 to rank 2. The convex blend's vector-heavy 0.6/0.4 weighting (plus the 0.9 floor for bm25_postgres/bm25_sqlite-surfaced rows) is doing real work that RRF's score-scale-agnostic treatment discards. Consistent with #82's through-pipeline finding.

Recommendation: convex stays the default; ``fusion_mode="rrf"`` stays shipping as the explicit opt-in for callers who want a score-agnostic fusion (multi-list paths, encoder ensembles). The harness stays in the tree so the next "should we flip?" gets a numeric answer.

Writeup at docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md; companion raw JSON at the .json sibling. 29 tests in tests/test_eval_fusion_ab.py cover the new loader shapes + main-mode guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5025d2f commit ea5d567
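
To make the finding in the commit message concrete, here is a minimal sketch of the two fusion modes it compares. It is not the project's implementation: the function names, the toy scores, and the assumption that both score lists are already normalized to [0, 1] are illustrative only, and the 0.9 floor for BM25-surfaced rows is left out. The 0.6/0.4 weights and k=60 are the values the commit cites.

```python
from typing import Dict, List


def convex_blend(vec: Dict[str, float], bm25: Dict[str, float],
                 w_vec: float = 0.6, w_bm25: float = 0.4) -> List[str]:
    """Rank ids by a weighted sum of scores (assumed normalized to [0, 1])."""
    ids = set(vec) | set(bm25)
    fused = {i: w_vec * vec.get(i, 0.0) + w_bm25 * bm25.get(i, 0.0) for i in ids}
    # Highest fused score first; ties broken by id for determinism.
    return sorted(ids, key=lambda i: (-fused[i], i))


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per id."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda i: (-fused[i], i))


# Hypothetical scores: "a" is a very strong vector hit that BM25 never surfaces.
vec_scores = {"a": 0.95, "b": 0.40, "c": 0.35}
bm25_scores = {"c": 0.55, "b": 0.50}

convex_order = convex_blend(vec_scores, bm25_scores)
rrf_order = rrf([
    sorted(vec_scores, key=vec_scores.get, reverse=True),    # ["a", "b", "c"]
    sorted(bm25_scores, key=bm25_scores.get, reverse=True),  # ["c", "b"]
])
print(convex_order)  # ['a', 'b', 'c']
print(rrf_order)     # ['c', 'b', 'a']
```

In this toy case the convex blend keeps "a" at rank 1 because its vector score is large, while RRF sees only that "a" appears in one list and ranks it last. With a top-K cutoff, that kind of demotion can turn a rank-1 hit into a miss.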

8 files changed

Lines changed: 891 additions & 194 deletions

FORK_CHANGELOG.md

Lines changed: 50 additions & 0 deletions
@@ -18,6 +18,56 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
---

## [2026-05-28]

### Added

- **RRF vs convex-blend rerank: A/B measurement on our corpus (#162)** ([`TBD`](https://github.com/techempower-org/mempalace/commit/TBD))
  Closes the measurement promised by [#162][162] (the harness landed
  in #247; this PR runs it). Adds a self-contained ``--mine-corpus``
  mode to ``scripts/eval_fusion_ab.py`` that mines the named
  directory into a fresh local ChromaDB palace and runs the
  convex-vs-RRF A/B against it. Doesn't touch the production daemon
  or share GPU capacity with real callers; the prior
  ``--i-know-the-backfill-is-done`` gate stays on the
  ``--palace-path`` mode for future daemon-side wiring.

  The probe-set loader now accepts both the v2 dict shape
  (``{"_meta": ..., "probes": [{query, expected, why}, ...]}``,
  which the only checked-in probe file ``scripts/probes_v2_git_derived.json``
  uses) and the legacy list-of-lists shape. This matches the
  multi-encoder eval harness's loader so probe sets are
  interchangeable between the two.

  Findings + per-probe data in
  ``docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md``
  (the human-readable summary) and the companion ``.json`` (raw
  ranks + deltas for follow-up analysis). **One-line:** RRF
  underperforms the convex blend on this corpus: MRR 0.4075 →
  0.3758 (−0.0318), Recall@10 52% → 47% (−5 pp), 20 regressions
  to 10 improvements. The convex blend's vector-heavy weighting
  (0.6/0.4) is doing real work; RRF's score-scale-agnostic
  treatment loses strong vector signal in the rank-1-under-convex
  cases. Convex stays the default; ``fusion_mode="rrf"`` stays
  shipping as the explicit opt-in for callers who want it.

  Not in scope:

  * Running against the production palace. The daemon's
    ``/search/hybrid`` hard-codes ``candidate_strategy="hybrid"``
    and doesn't forward a ``fusion_mode`` body field, so RRF can't
    be driven remotely today. Filed forward as a palace-daemon
    change if the A/B result motivates it.
  * Sweeping the RRF ``k`` smoothing constant. Default 60 (Cormack
    2009); a sweep is a follow-up if RRF is competitive enough to
    be worth refining.

  *Tests:* 29 in tests/test_eval_fusion_ab.py (probe-loader accepts v2 dict + legacy list shapes, rejects scalars and dicts-without-probes-list; main rejects --mine-corpus + --palace-path together; main refuses --palace-path without --i-know-the-backfill-is-done; pre-existing run-orchestration + scoring-math tests retained)
  *Files:* `scripts/eval_fusion_ab.py`, `tests/test_eval_fusion_ab.py`, `docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md`, `docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.json`

## [2026-05-27]
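
The two main-mode guards listed under *Tests* could be sketched roughly as follows. The three flag names come from the changelog entry; the function name `parse_args`, the error messages, and the choice of `parser.error` to reject bad combinations are assumptions for illustration, not the script's actual code.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical argument handling for the two eval modes."""
    parser = argparse.ArgumentParser(prog="eval_fusion_ab")
    parser.add_argument("--mine-corpus", metavar="PATH",
                        help="mine PATH into a temp local palace and run the A/B there")
    parser.add_argument("--palace-path", metavar="PATH",
                        help="run against an existing palace (gated)")
    parser.add_argument("--i-know-the-backfill-is-done", action="store_true",
                        help="acknowledge the backfill precondition for --palace-path")
    args = parser.parse_args(argv)

    # Guard 1: the two modes cannot be combined.
    if args.mine_corpus and args.palace_path:
        parser.error("--mine-corpus and --palace-path cannot be used together")
    # Guard 2: the existing-palace mode stays behind the explicit gate.
    if args.palace_path and not args.i_know_the_backfill_is_done:
        parser.error("--palace-path requires --i-know-the-backfill-is-done")
    return args


if __name__ == "__main__":
    print(parse_args(["--mine-corpus", "mempalace/"]))
```

`parser.error` prints a usage message and exits with status 2, which is easy to assert on in tests via `SystemExit`.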

README.md

Lines changed: 74 additions & 73 deletions
Large diffs are not rendered by default.

docs/fork-changes.yaml

Lines changed: 52 additions & 0 deletions
@@ -24,6 +24,58 @@

entries:

  - id: rrf-vs-hybrid-rerank-ab-run
    date: 2026-05-28
    bucket: Added
    commit: TBD
    area: Search
    summary: "RRF vs convex-blend rerank: A/B measurement on our corpus (#162)"
    tests: "29 in tests/test_eval_fusion_ab.py (probe-loader accepts v2 dict + legacy list shapes, rejects scalars and dicts-without-probes-list; main rejects --mine-corpus + --palace-path together; main refuses --palace-path without --i-know-the-backfill-is-done; pre-existing run-orchestration + scoring-math tests retained)"
    files:
      - scripts/eval_fusion_ab.py
      - tests/test_eval_fusion_ab.py
      - docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md
      - docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.json
    body: |
      Closes the measurement promised by [#162][162] (the harness landed
      in #247; this PR runs it). Adds a self-contained ``--mine-corpus``
      mode to ``scripts/eval_fusion_ab.py`` that mines the named
      directory into a fresh local ChromaDB palace and runs the
      convex-vs-RRF A/B against it. Doesn't touch the production daemon
      or share GPU capacity with real callers; the prior
      ``--i-know-the-backfill-is-done`` gate stays on the
      ``--palace-path`` mode for future daemon-side wiring.

      The probe-set loader now accepts both the v2 dict shape
      (``{"_meta": ..., "probes": [{query, expected, why}, ...]}``,
      which the only checked-in probe file ``scripts/probes_v2_git_derived.json``
      uses) and the legacy list-of-lists shape. This matches the
      multi-encoder eval harness's loader so probe sets are
      interchangeable between the two.

      Findings + per-probe data in
      ``docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md``
      (the human-readable summary) and the companion ``.json`` (raw
      ranks + deltas for follow-up analysis). **One-line:** RRF
      underperforms the convex blend on this corpus: MRR 0.4075 →
      0.3758 (−0.0318), Recall@10 52% → 47% (−5 pp), 20 regressions
      to 10 improvements. The convex blend's vector-heavy weighting
      (0.6/0.4) is doing real work; RRF's score-scale-agnostic
      treatment loses strong vector signal in the rank-1-under-convex
      cases. Convex stays the default; ``fusion_mode="rrf"`` stays
      shipping as the explicit opt-in for callers who want it.

      Not in scope:

      * Running against the production palace. The daemon's
        ``/search/hybrid`` hard-codes ``candidate_strategy="hybrid"``
        and doesn't forward a ``fusion_mode`` body field, so RRF can't
        be driven remotely today. Filed forward as a palace-daemon
        change if the A/B result motivates it.
      * Sweeping the RRF ``k`` smoothing constant. Default 60 (Cormack
        2009); a sweep is a follow-up if RRF is competitive enough to
        be worth refining.

  - id: cli-bulk-move-relocation
    date: 2026-05-27
    bucket: Added
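
The dual-shape probe loading described in the entry body might look something like the sketch below. The v2 keys (`query`, `expected`, `why`) and the two rejection cases come from the entry. The layout of a legacy row (`[query, expected]`), the normalized `(query, expected)` tuple output, and the names `normalize_probes` / `load_probes` are assumptions and may differ from the real `_load_probes`.

```python
import json
from typing import Any, List, Tuple

Probe = Tuple[str, str]  # (query, expected): assumed normalized form


def normalize_probes(data: Any) -> List[Probe]:
    """Accept the v2 dict shape or a legacy list of rows; reject anything else."""
    if isinstance(data, dict):
        probes = data.get("probes")
        if not isinstance(probes, list):
            raise ValueError("v2 probe file must contain a 'probes' list")
        # v2 rows are dicts; 'why' is carried in the file but not needed for scoring.
        return [(p["query"], p["expected"]) for p in probes]
    if isinstance(data, list):
        # Legacy rows are assumed to be [query, expected, ...] sequences.
        return [(row[0], row[1]) for row in data]
    raise ValueError(f"unsupported probe file shape: {type(data).__name__}")


def load_probes(path: str) -> List[Probe]:
    """Read a probe file from disk and normalize it."""
    with open(path, encoding="utf-8") as fh:
        return normalize_probes(json.load(fh))


v2 = {"_meta": {"source": "demo"},
      "probes": [{"query": "q1", "expected": "d1", "why": "demo"}]}
legacy = [["q2", "d2"]]
print(normalize_probes(v2), normalize_probes(legacy))
```

Normalizing both shapes to one internal form at load time keeps the scoring code unaware of which file format was supplied.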
docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.json

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
{
  "label_a": "convex",
  "label_b": "rrf",
  "metrics_a": {
    "n_probes": 200,
    "mrr": 0.4075,
    "recall_at_5": 0.47,
    "recall_at_10": 0.52,
    "found": 104
  },
  "metrics_b": {
    "n_probes": 200,
    "mrr": 0.3758,
    "recall_at_5": 0.45,
    "recall_at_10": 0.47,
    "found": 94
  },
  "delta": {
    "mrr": -0.0318,
    "recall_at_5": -0.02,
    "recall_at_10": -0.05
  },
  "improved": [
    {
      "query": "Plainto_tsquery + ILIKE fallback for underscore identifiers",
      "rank_a": 2,
      "rank_b": 1
    },
    {
      "query": "Add --mode session for per-session manifest drawers",
      "rank_a": 3,
      "rank_b": 2
    },
    {
      "query": "Quote ChromaBackend annotation for Python 3.9 compatibility",
      "rank_a": null,
      "rank_b": 9
    },
    {
      "query": "Add hook_verbatim_mode toggle for transcript ingest",
      "rank_a": null,
      "rank_b": 5
    },
    {
      "query": "Atomic write to prevent partial corruption on crash",
      "rank_a": 5,
      "rank_b": 3
    },
    {
      "query": "Reconfigure stdio to UTF-8 on Windows",
      "rank_a": 6,
      "rank_b": 5
    },
    {
      "query": "Avoid false hnsw divergence fallback",
      "rank_a": 4,
      "rank_b": 3
    },
    {
      "query": "Harden HNSW startup preflight",
      "rank_a": null,
      "rank_b": 2
    },
    {
      "query": "Add multi-agent support roadmap (Claude Code, OpenCode, Cursor, Aider, Gemini CLI, Codex CLI, Warp)",
      "rank_a": 5,
      "rank_b": 3
    },
    {
      "query": "Paginate closet_llm col.get (#1073)",
      "rank_a": null,
      "rank_b": 6
    }
  ],
  "regressed": [
    {
      "query": "Mempalace rooms subcommand + example config.yaml",
      "rank_a": 10,
      "rank_b": null
    },
    {
      "query": "Emit only canonical rooms per hybrid-search-taxonomy spec",
      "rank_a": 6,
      "rank_b": null
    },
    {
      "query": "Use pg_available_extensions.name (not extname) in preflight",
      "rank_a": 1,
      "rank_b": 2
    },
    {
      "query": "MEMPALACE_KG_BACKEND routing — sqlite default, age opt-in (Phase 2.4)",
      "rank_a": 1,
      "rank_b": 2
    },
    {
      "query": "Identify lock holder + exit non-zero on contention",
      "rank_a": 6,
      "rank_b": null
    },
    {
      "query": "Retry tool_search once on Chroma \"Error finding id\" transient (#1315)",
      "rank_a": 1,
      "rank_b": null
    },
    {
      "query": "Fix ruff bugbear and silent-except findings",
      "rank_a": 7,
      "rank_b": null
    },
    {
      "query": "Clamp similarity to [0,1] to avoid negative values",
      "rank_a": 10,
      "rank_b": null
    },
    {
      "query": "Handle null JSON-RPC request payloads safely",
      "rank_a": 5,
      "rank_b": null
    },
    {
      "query": "Clamp effective_distance to valid cosine range [0, 2]",
      "rank_a": 1,
      "rank_b": null
    },
    {
      "query": "Preflight SQLite integrity before rebuild",
      "rank_a": 3,
      "rank_b": 4
    },
    {
      "query": "Verify write roundtrip before bailout",
      "rank_a": 3,
      "rank_b": null
    },
    {
      "query": "Don't write chunking defaults in cfg.init()",
      "rank_a": 3,
      "rank_b": null
    },
    {
      "query": "Extract Windows UTF-8 reconfigure into shared helper",
      "rank_a": 1,
      "rank_b": 2
    },
    {
      "query": "Per-stream stdio errors policy on Windows",
      "rank_a": 7,
      "rank_b": null
    },
    {
      "query": "Address Copilot review on #1306",
      "rank_a": 1,
      "rank_b": null
    },
    {
      "query": "Basename source_file in tool_get_drawer responses",
      "rank_a": 9,
      "rank_b": null
    },
    {
      "query": "Close active backend before rollback restore",
      "rank_a": 1,
      "rank_b": 2
    },
    {
      "query": "Split get_or_create_collection on reopen (follow-up to #1262)",
      "rank_a": 1,
      "rank_b": null
    },
    {
      "query": "Address release review feedback",
      "rank_a": 1,
      "rank_b": 2
    }
  ],
  "candidate_strategy": "hybrid",
  "n_results": 10,
  "timing_secs": {
    "convex": 32.21,
    "rrf": 29.51
  },
  "palace_path": "/tmp/fusion_ab_palace_sqz0u5_f",
  "mined_corpus": "/home/jp/Projects/memorypalace/.claude/worktrees/feat-162-hybrid-rrf-ab/mempalace",
  "drawer_count": 3413
}
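
For readers working with this raw JSON, the summary metrics can be recomputed from per-probe ranks along the following lines. This sketch assumes the conventions visible in the file (1-based ranks, `null` for a miss within the top-K) and uses standard definitions of MRR and Recall@k. The function names and the four-probe toy data are hypothetical, and the harness's own scoring code may differ in detail.

```python
from typing import List, Optional

Rank = Optional[int]  # 1-based rank of the expected drawer, or None for a miss


def mrr(ranks: List[Rank]) -> float:
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at(ranks: List[Rank], k: int) -> float:
    """Fraction of probes whose expected drawer appears at rank <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def compare(rank_a: Rank, rank_b: Rank) -> str:
    """Classify one probe as improved, regressed, or tied going from A to B."""
    # Treat a miss as infinitely deep so that hit -> miss counts as a regression.
    a = float("inf") if rank_a is None else rank_a
    b = float("inf") if rank_b is None else rank_b
    if b < a:
        return "improved"
    if b > a:
        return "regressed"
    return "tied"


# Toy data, not taken from the run above.
ranks_a = [1, 2, None, 10]
ranks_b = [2, None, 5, 10]
labels = [compare(a, b) for a, b in zip(ranks_a, ranks_b)]
print(mrr(ranks_a), mrr(ranks_b), labels)
```

Applied to the lists above, the same classification accounts for the `found` totals: 14 regressed probes go from a hit to `null` and 4 improved probes go from `null` to a hit, a net loss of 10 that matches 104 → 94.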
