Simplify DiskBBQ dynamic visit ratio to linear#142784
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
benwtrent
left a comment
There was a problem hiding this comment.
visitratio goes from 0-100.
This calculation just won't work.
One other thing to consider is moving this "dynamic visit percentage" based on num_candidates to be "per-segment" instead of globally calculated. This may help as well.
|
Based on the possibility of very large segments, the capped linear formula won't work, so we changed it to a sublinear option and also added a more modern dataset. New formulas:
Property Comparison
Benchmarking: main vs α=1.5
GIST-1M (N=1,000,000)
|
| numCands | Main recall/visited/ms | α=1.5 recall/visited/ms | vs main visited |
|---|---|---|---|
| 15 | 0.94 / 3,806 / 0.24 | 0.93 / 3,382 / 0.20 | 0.89× |
| 50 | 0.94 / 3,806 / 0.23 | 0.93 / 3,382 / 0.22 | 0.89× |
| 100 | 0.94 / 7,410 / 0.37 | 0.94 / 5,815 / 0.31 | 0.78× |
| 200 | 0.96 / 14,650 / 0.63 | 0.95 / 10,168 / 0.47 | 0.69× |
| 500 | 0.97 / 36,246 / 1.47 | 0.96 / 21,225 / 0.88 | 0.59× |
| 1000 | 0.97 / 72,263 / 2.89 | 0.97 / 36,563 / 1.49 | 0.51× |
| 5000 | 0.97 / 360,347 / 14.13 | 0.97 / 122,424 / 4.87 | 0.34× |
| 10000 | 0.97 / 720,354 / 27.49 | 0.97 / 198,647 / 7.87 | 0.28× |
Wiki-Cohere (N=934,024)
| numCands | Main recall/visited/ms | α=1.5 recall/visited/ms | vs main visited |
|---|---|---|---|
| 15 | 0.60 / 3,882 / 0.19 | 0.56 / 3,400 / 0.19 | 0.88× |
| 50 | 0.60 / 3,882 / 0.20 | 0.56 / 3,400 / 0.20 | 0.88× |
| 100 | 0.72 / 7,456 / 0.30 | 0.67 / 5,869 / 0.29 | 0.79× |
| 200 | 0.81 / 14,540 / 0.52 | 0.76 / 10,136 / 0.40 | 0.70× |
| 500 | 0.89 / 35,974 / 1.17 | 0.83 / 20,964 / 0.75 | 0.58× |
| 1000 | 0.92 / 71,580 / 2.37 | 0.89 / 36,099 / 1.25 | 0.50× |
| 5000 | 0.96 / 356,804 / 11.15 | 0.94 / 120,135 / 4.03 | 0.34× |
| 10000 | 0.96 / 713,239 / 21.83 | 0.95 / 194,276 / 6.44 | 0.27× |
|
From the reported numbers, it seems like we are exploring less, but the reduced exploration translates into recall drops by a similar multiplicative factor, at least at the lower end.
|
@ah89 I need to see the actual visit percentages per segment and segment count, etc. Visiting only 1% at maximum with 100M vectors seems like a horrible side effect.
@benwtrent The implementation is per-segment.

GIST-1M — HNSW (31 segments, ~33.3k vectors each)
GIST-1M — IVF (29 segments, ~33.3k vectors each)
Wiki-Cohere — HNSW (24 segments, ~39.5k vectors each)
Wiki-Cohere — IVF (22 segments, ~42.5k vectors each)
|
|
Gist-1M (1M, 1 segment, k=10, MIP)
Wiki-Cohere (934k, 1 segment, k=10, MIP)
|
```java
if (numCands > 0) {
    // Two-signal model: smooth log-based visit ratio from num_candidates/k ratio and k.
    // Produces visitRatio in [V_MIN, V_MAX] which, after the SOAR 2× multiplier,
    // gives actual visit percentage in [~1%, ~10%].
    // The nc/k ratio (capped at R_MAX=100) drives 85% of the signal via log scaling,
    // while k itself (capped at K_MAX=10000) contributes the remaining 15%.
    final float V_MIN = 0.005f;
    final float V_MAX = 0.05f;
    final double R_MAX = 100.0;
    final double K_MAX = 10_000.0;
    double r = (double) numCands / Math.max(k, 1);
    double x = Math.log1p(r) / Math.log1p(R_MAX);
    double y = Math.log1p(k) / Math.log1p(K_MAX);
    x = Math.max(0.0, Math.min(1.0, x));
    y = Math.max(0.0, Math.min(1.0, y));
    double z = 0.85 * x + 0.15 * y;
    visitRatio = (float) (V_MIN + (V_MAX - V_MIN) * z);
} else {
    // Fallback when called without IVFKnnSearchStrategy (e.g. checkIndex).
    // Use log-based heuristic for reasonable default behavior.
    final float V_MIN = 0.005f;
    final float V_MAX = 0.05f;
    double r = 1.0; // assume nc == k
    double x = Math.log1p(r) / Math.log1p(100.0);
    double y = Math.log1p(k) / Math.log1p(10_000.0);
    x = Math.max(0.0, Math.min(1.0, x));
    y = Math.max(0.0, Math.min(1.0, y));
    double z = 0.85 * x + 0.15 * y;
    visitRatio = (float) (V_MIN + (V_MAX - V_MIN) * z);
}
```
lots of unnecessary duplicates, the only thing that changes is the r?
Also Math.log1p(R_MAX) and its partner should just be private static final values. Same for all the other constants (why are they local instances???)
Additionally, do we need to divide by 2 to account for the 2× application downstream?
Fixed — constants are now private static final, log1p values are pre-computed, and the duplicate log/clamp is extracted into a logScale helper.
Regarding the ÷2 question, I don't think we actually need to halve it. If I understand correctly, the 2× downstream acts more as a SOAR overlap budget (since the same vector might appear in two posting lists), rather than a true multiplier on the unique vectors visited. Let me know if I'm looking at this the wrong way!
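For context, a minimal self-contained sketch of the deduplicated computation described above (class name, `logScale`, and the single-method structure are illustrative assumptions, not the actual patch):

```java
// Illustrative sketch of the deduplicated Two-Signal visit-ratio computation.
// Constants match the earlier snippet in this thread; names are assumptions.
public final class VisitRatioSketch {
    private static final float V_MIN = 0.005f;
    private static final float V_MAX = 0.05f;
    private static final double LOG1P_R_MAX = Math.log1p(100.0);
    private static final double LOG1P_K_MAX = Math.log1p(10_000.0);
    private static final double RATIO_WEIGHT = 0.85;
    private static final double K_WEIGHT = 0.15;

    // log1p-normalized value, clamped to [0, 1].
    private static double logScale(double value, double log1pMax) {
        return Math.max(0.0, Math.min(1.0, Math.log1p(value) / log1pMax));
    }

    // numCands <= 0 falls back to r = 1 (i.e. assume numCands == k), so the
    // two duplicated branches of the original collapse into one.
    static float computeDynamicVisitRatio(int numCands, int k) {
        double r = numCands > 0 ? (double) numCands / Math.max(k, 1) : 1.0;
        double z = RATIO_WEIGHT * logScale(r, LOG1P_R_MAX) + K_WEIGHT * logScale(k, LOG1P_K_MAX);
        return (float) (V_MIN + (V_MAX - V_MIN) * z);
    }

    public static void main(String[] args) {
        System.out.println(computeDynamicVisitRatio(100, 10)); // r = 10
        System.out.println(computeDynamicVisitRatio(0, 10));   // fallback path, r = 1
    }
}
```

Only `r` differs between the two original branches, so folding the fallback into the `r` computation removes the duplication while keeping the constants `private static final` with pre-computed `log1p` denominators.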
|
Please disregard the previous results, as they followed the legacy path: Wiki-Cohere — 934,024 vectors, 768-dim, 1 segment, k=10, MIP
Gist-1M — 1,000,000 vectors, 960-dim, 1 segment, k=10, MIP
DBpedia-Arctic — 4,635,922 vectors, 768-dim, 1 segment, k=10, MIP
|
Updated default
|
|
Full analysis on 4 tables: Wiki-Cohere 768d (934K docs, 1 segment, IVF BBQ-4bit)
GIST-1M 960d (1.0M docs, 1 segment, IVF BBQ-4bit)
DBpedia-arctic 768d (4.6M docs, 1 segment, IVF BBQ-4bit)
MSMARCO-v2 1024d (72M docs, 1 segment, IVF BBQ-4bit)
|
|
FWIW: My opinion is that for the dynamic visit ratio, we should compute the value for each segment separately, taking into account only the size of that segment, instead of computing one value for all segments and then dividing the work evenly.
Yes — the Two-Signal formula computes a visit ratio (not an absolute count), and each segment independently applies that ratio to its own size. Check IVFVectorsReader.java ~line 327: `long maxVectorVisited = (long) (2.0 * visitRatio * numVectors);` where `numVectors` is that segment's own vector count.
|
Oh. I thought that if we are moving to computing this ratio on a per-segment basis, there is not really a point in computing it in the query; we should instead do it in the codec (I hate that we had this code in two places). I see you have not changed the implementation here: are you planning to do something with this?
|
I see that maxVectorVisited is dependent on the size of the segment, but the visit ratio is the same for each segment. My experience is that the visit ratio can be smaller as the segment gets bigger while still hitting a specific recall.
|
Yes, every segment gets the same visit ratio, regardless of whether it contains 100K or 70M vectors. That makes sense given the uniform treatment of segments when no prior information is provided. If, however, you mean that larger segments should have a smaller ratio because of IVF clustering, I can add a segment-size bias.
|
I think this linear transformation we have in this PR is a good starting point. @ah89 please do adjust the other places where we are using the old formula. What do y'all think of having a small and limited bias that adjusts the final percentage according to segment size? It would have an asymptotic limit (to prevent bad scenarios like a 0.00001% visit), and start near zero.
|
I think a segment-size bias makes sense — large segments have better-populated IVF clusters so they need a smaller visit fraction, and tiny segments (<10K) with poor clustering could benefit from visiting more. That said, our Two-Signal formula already produces very small ratios (0.3%–4%), so the effect of an additive bias near 0 for small segments may be minimal. It also adds a tuning dimension that's hard to calibrate without benchmarks across many segment sizes. I'm leaning towards adding this as a follow-up once we have more data, but I'll leave it to @benwtrent @iverase to decide. |
|
I am fine leaving it as a follow up. My suggestion is to index a fixed number of documents and force merge into one segment. Then compute how much visit percentage is needed to achieve a target recall. Try the same thing for different number of documents so we can get a nice curve segment-size / visit percentage. If we do it for a few datasets we can get a good idea on how visit percentage should change with the segment size? |
benwtrent
left a comment
Please verify IVFVectorsReader#search. It is possible that this format is called without the IVF strategy and thus your logic won't work in that scenario.
Good catch — verified and fixed. When called without the IVF strategy, numCands is 0 and the two-signal computation had no ratio signal. Fixed by splitting the dynamic fallback: when numCands is 0, we fall back to assuming numCands == k. Do you have a better idea?
```java
    segmentInfo.append(segVectors);
}
segmentInfo.append("]");
logger.info("Segment layout: {}", segmentInfo);
```
do we really want this to be logged via info all the time?
This seems like a debug option.
Good point — changed to debug. This is only useful when troubleshooting, not needed in normal runs.
```java
logger.debug(
    "IVF search segment: numVectors=[{}], visitRatio=[{}], maxVectorVisited=[{}], numCands=[{}], k=[{}]",
    numVectors,
    visitRatio,
    maxVectorVisited,
    numCands,
    k
);
```
Let's remove all this logging, it isn't necessary. If anything it's "trace", and I don't think it provides useful information.
```java
}
float percentFiltered = Math.max(0f, Math.min(1f, approximateCost / numVectors));
int numCands = 0;
int k = knnCollector.k();
```
If we know k but don't know numCands, we know for a fact that numCands >= k. So let's just lean into that: do the new logic with numCands == k.
```java
// Two-Signal Model constants for dynamic visit ratio
private static final float V_MIN = 0.003f;
private static final float V_MAX = 0.04f;
private static final double LOG1P_R_MAX = Math.log1p(10.0);
private static final double LOG1P_K_MAX = Math.log1p(10_000.0);
private static final double RATIO_WEIGHT = 0.85;
private static final double K_WEIGHT = 0.15;
```
you might want to have this logic shareable so that the format can use it AND the query. Possibly in the IVFFormat file as a static method.
Done — the logic now lives only in IVFVectorsReader.computeDynamicVisitRatio(). The query side (AbstractIVFKnnVectorQuery) just passes visitRatio=0.0f through to the codec and doesn't compute anything itself, so there's nothing to share.
benwtrent
left a comment
Looks good! Please follow up with some experiments as Ignacio suggests and augment with some additional bias towards segment size.
@iverase @benwtrent Here are the segment-size vs visit-percentage curves. Setup: same dataset indexed at different sizes (10K, 50K, 100K, 250K, 500K, full), force-merged to 1 segment, IVF BBQ-4bit, k=10, nc=1000, sweeping visit% from 0.1% to 20%.

GIST-1M (960d, euclidean) — recall at each visit percentage
Minimum actual visit% for target recall

Wiki-Cohere (768d, MIP) — recall at each visit percentage
Minimum actual visit% for target recall
Key observations & Possibilities
The data clearly shows our current Two-Signal model (fixed ratio regardless of segment size) works well for segments ≥100K, but under-visits small segments. This supports @benwtrent's segment-size bias idea. Given that the data is now available and the shape of the curve is clear, do you think we should add the segment-size bias in this PR, or keep it as a follow-up? |
|
10K vectors went from 0.77 → 0.94 recall (+0.17), 100K went from 0.86 → 0.98 (+0.12). The bias is applied per-segment, which means it fires for ALL sizes here.

New results with bias:
|
|
@ah89 this shows me that in a realistic scenario, adding the bias the way you did is expensive with no realistic benefit. The bias that Ignacio was focusing on was REDUCING the visit percentage as segments get larger, not significantly increasing the visit percentage on smaller segments. I think the answer there is simply because the clusters are static and larger. We should not optimize for tiny segments, that just isn't realistic. Things will be merged and we will have a diversity of segment sizes. |
@benwtrent This change is addressing your earlier point on small segments (<10k or so). The bias improves recall, but it increases visit % significantly and adds ~1–2ms latency while applying to all segment sizes (N_REF=500k). So it's still a tradeoff with limited benefit in the scenarios we care about, and it's different from Ignacio's focus on larger segments. My point on his suggestion is the same as before: beyond uniformity, this starts to feel like an ad hoc, data-dependent solution.
Replace the log10(N)^2 * clampedNumCands / N formula in AbstractIVFKnnVectorQuery.rewrite with a simple linear clampedNumCands / N mapping. This makes num_candidates directly represent the number of vectors to traverse, giving users a predictable, size-independent tuning knob. Also: - Extract computeDynamicVisitRatio as a testable static method - Add numCands to equals/hashCode since it now directly drives visit ratio when providedVisitRatio == 0 - Rename maxVectorVisited to maxExpectedTraversed in IVFVectorsReader for clarity on pre-filter vs post-filter semantics Closes elastic#142617
… improved visit ratio calculation
Introduces a new two-signal log-based heuristic for controlling IVF visit ratios, aiming for smoother scaling and more predictable recall across varying parameter ranges. Enhances logging to capture segment layouts and vector distributions, and tracks actual visit percentages for better observability of search behavior. Updates reporting to include segment counts and visit metrics for more comprehensive benchmarking.
Switches from a custom log-based model to a more principled two-signal approach for estimating the visit ratio, improving consistency and maintainability. Centralizes the two-signal logic and constants, reducing duplication and potential inconsistencies. Enhances the robustness of vector search parameterization, especially when candidate values are not explicitly provided.
Updates the logic for determining the approximate cost of accepted documents, correctly handling the case when all documents are accepted to avoid misleading cost estimations. Improves accuracy of candidate filtering and prevents potential downstream errors.
Relocates the Two-Signal Model visit ratio calculation from the query layer (AbstractIVFKnnVectorQuery) into the codec layer (IVFVectorsReader) where it can leverage per-segment context. Removes verbose debug logging from IVF search and downgrades segment layout logging from info to debug level to reduce log noise. Initializes numCands to k instead of 0 so the dynamic ratio computation has a meaningful default when no search strategy overrides it.
Replace the original log10(N)^2 * k / N visit-ratio formula with a Two-Signal model that derives the visit ratio from both the num_candidates/k ratio (85% weight) and k magnitude (15% weight), bounded to [0.3%, 4.0%]. Add a segment-size power-law cap (0.045 * (1M/N)^0.35) that limits visit ratio for large segments (~10% at 100K, ~4.5% at 1M, ~2% at 10M) while leaving small segments governed by the two-signal model alone. The final visit ratio per segment is min(two_signal, cap).
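The power-law cap from the commit message above (`0.045 * (1M/N)^0.35`) can be sketched as follows; the class and method names are assumptions, not the actual patch:

```java
// Sketch of the segment-size power-law cap described in the commit message:
// cap(N) = 0.045 * (1_000_000 / N)^0.35. Names are assumptions.
public final class SegmentCapSketch {
    static double computeSegmentSizeCap(long numVectors) {
        return 0.045 * Math.pow(1_000_000.0 / Math.max(numVectors, 1), 0.35);
    }

    public static void main(String[] args) {
        // ~10% at 100K, 4.5% at 1M, ~2% at 10M, matching the stated examples.
        System.out.printf("100K -> %.4f%n", computeSegmentSizeCap(100_000));
        System.out.printf("1M   -> %.4f%n", computeSegmentSizeCap(1_000_000));
        System.out.printf("10M  -> %.4f%n", computeSegmentSizeCap(10_000_000));
    }
}
```

With the final per-segment ratio taken as min(two_signal, cap), small segments stay governed by the two-signal bounds while large segments are throttled by the cap.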
Add segment-size-aware visit ratio cap to IVF search

Motivation (as per @iverase suggestion): the Two-Signal model computes a dynamic visit ratio from the query parameters nc and k only, independent of segment size.

Experiment design: we ran controlled benchmarks across four datasets to measure recall as a function of visit% and segment size:

- GIST & Wiki-Cohere sweep: force-merged to 6 segment sizes (10K, 50K, 100K, 250K, 500K, full) × 11 visit percentages (0.1%–20%), k=10, nc=100, IVF 4-bit BBQ.
- DBpedia segment sweep: 7 segment counts (1, 2, 4, 6, 8, 10, 12 segments → ~515K to 4.6M docs/segment) × 10 visit percentages (0.1%–20%).
- DBpedia ceiling + oversampling test: tested visit% up to 70% to find the recall ceiling (0.90–0.92 without oversampling), then tested oversample=5x and 10x (5x sufficient, 10x identical).

Formula: fitted a power-law curve to the empirical data, where N = number of vectors in the segment and targetRecall = 0.9 (default). Example cap values (targetRecall=0.9):

Per-segment application: the cap is computed independently for each segment at search time:

`visitRatio = Math.min(computeDynamicVisitRatio(numCands, k), computeSegmentSizeCap(numVectors));`

This means in a multi-segment index, each segment gets a different final visit ratio based on its size. A 100K-doc segment gets cap≈10% (two-signal dominates since its max is 4%), while a 5M-doc segment gets cap≈2.6% (cap dominates). The two-signal value is the same across all segments (it depends only on the query parameters nc and k), but the cap varies per segment.

Comparison: origin/main vs branch (Two-Signal + Cap). All runs: IVF BBQ-4bit, k=10.

Wiki-Cohere (934K docs, 2 segments, ~467K docs/seg)
GIST-1M (1M docs, 3 segments, ~333K docs/seg)
DBpedia-Arctic (4.6M docs, 9 segments, ~515K docs/seg)
Key Observations
|
tteofili
left a comment
nice looking numbers! LGTM
…rics
* upstream/main: (21 commits)
Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:external-basic.topSnippetsFunction} elastic#145353
Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:external-basic.scoreFunction} elastic#145352
[DiskBBQ] Fix bug in NeighborQueue#popRawAndAddRaw (elastic#145324)
Fix dense_vector default index options when using BFLOAT16 (elastic#145202)
Use checked exceptions in entitlement constructor rules (elastic#145234)
ESQL: DS: datasource file plugins should not return TEXT types (elastic#145334)
Plumb DLM error store through to DlmFrozenTransition classes (elastic#145243)
Make Settings.Builder.remove() fluent (elastic#145294)
Add FLS tests for METRICS_INFO and TS_INFO (elastic#145211)
Fix flaky SecurityFeatureResetTests (elastic#145063)
[DOCS] Fix conflict markers in ESQL processing command list (elastic#145338)
Skip certain metric assertions on Windows (elastic#144933)
[ES|QL] Add schema reconciliation for multi-file external sources (elastic#145220)
Simplify DiskBBQ dynamic visit ratio to linear (elastic#142784)
ESQL: Disallow unmapped_fields=load with partial non-KEYWORD (elastic#144109)
[Transform] Track Linked Projects (elastic#144399)
Fix bulk scoring to process last batch instead of falling through to scalar tail (elastic#145316)
Clean up TickerScheduleEngineTests (elastic#145303)
[CI] ShardBulkInferenceActionFilterIT testRestart - Ensuring that secrets-inference index is available after full restart and unmuting test (elastic#145317)
Add CRUD doc to the DistributedArchitectureGuide (elastic#144710)
...
* Simplify DiskBBQ dynamic visit ratio to linear Replace the log10(N)^2 * clampedNumCands / N formula in AbstractIVFKnnVectorQuery.rewrite with a simple linear clampedNumCands / N mapping. This makes num_candidates directly represent the number of vectors to traverse, giving users a predictable, size-independent tuning knob. Also: - Extract computeDynamicVisitRatio as a testable static method - Add numCands to equals/hashCode since it now directly drives visit ratio when providedVisitRatio == 0 - Rename maxVectorVisited to maxExpectedTraversed in IVFVectorsReader for clarity on pre-filter vs post-filter semantics Closes elastic#142617 * Enhance IVFKnnSearchStrategy to include numCands and k parameters for improved visit ratio calculation * [CI] Auto commit changes from spotless * Improves IVF search logging and visit ratio calc Introduces a new two-signal log-based heuristic for controlling IVF visit ratios, aiming for smoother scaling and more predictable recall across varying parameter ranges. Enhances logging to capture segment layouts and vector distributions, and tracks actual visit percentages for better observability of search behavior. Updates reporting to include segment counts and visit metrics for more comprehensive benchmarking. * Refines dynamic visit ratio calc for IVF search Switches from a custom log-based model to a more principled two-signal approach for estimating the visit ratio, improving consistency and maintainability. Centralizes the two-signal logic and constants, reducing duplication and potential inconsistencies. Enhances the robustness of vector search parameterization, especially when candidate values are not explicitly provided. * Retune Two-Signal visit ratio formula * Fixes approximate cost calculation for doc filter Updates the logic for determining the approximate cost of accepted documents, correctly handling the case when all documents are accepted to avoid misleading cost estimations. 
Improves accuracy of candidate filtering and prevents potential downstream errors. * Move dynamic visit ratio computation into codec layer Relocates the Two-Signal Model visit ratio calculation from the query layer (AbstractIVFKnnVectorQuery) into the codec layer (IVFVectorsReader) where it can leverage per-segment context. Removes verbose debug logging from IVF search and downgrades segment layout logging from info to debug level to reduce log noise. Initializes numCands to k instead of 0 so the dynamic ratio computation has a meaningful default when no search strategy overrides it. * Replace dynamic visit ratio with Two-Signal model and cap Replace the original log10(N)^2 * k / N visit-ratio formula with a Two-Signal model that derives the visit ratio from both the num_candidates/k ratio (85% weight) and k magnitude (15% weight), bounded to [0.3%, 4.0%]. Add a segment-size power-law cap (0.045 * (1M/N)^0.35) that limits visit ratio for large segments (~10% at 100K, ~4.5% at 1M, ~2% at 10M) while leaving small segments governed by the two-signal model alone. The final visit ratio per segment is min(two_signal, cap). --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Solution Design
Replace the `log10(N)² × min(10_000, max(numCands, 5*k)) / N` formula in `AbstractIVFKnnVectorQuery.rewrite` with a simple linear `min(10_000, max(numCands, 5*k)) / N` mapping — removing only the `log10(N)²` multiplier. The floor (`max(numCands, 5*k)`) and ceiling (`min(..., 10_000)`) are unchanged. This makes `num_candidates` directly represent the number of vectors to traverse, giving users a predictable, size-independent tuning knob.

Also:

- Extract `computeDynamicVisitRatio` as a testable static method
- Add `numCands` to equals/hashCode since it now directly drives the visit ratio when `providedVisitRatio == 0`
- Rename `maxVectorVisited` to `maxExpectedTraversed` in `IVFVectorsReader` for clarity on pre-filter vs post-filter semantics

Benchmarking: `main` (log₁₀²) vs `fix` (linear)

Here are the comparison results on GIST-1M (1M vectors, 960 dims, euclidean, IVF 1-bit BBQ, force-merged to 1 segment):
Key observations
The old formula massively over-visits. At `num_candidates=100` on 1M vectors, `main` visits 7,595 vectors (0.76%), while `fix` visits 672 (0.07%). The old `log₁₀(N)² ≈ 36×` multiplier inflates the visit budget by ~11× at this scale.

Recall difference is modest. Going from 7,595 visited → 672 only drops recall from 0.34 → 0.13. The old formula burned 10× more vectors for only +0.21 recall. At higher num_candidates the gap narrows: at 10K, it's 0.41 vs 0.38.

The linear formula is dramatically faster. At `num_candidates=10000`: 0.44ms vs 13.39ms — a 30× speedup — because it visits 20K vectors instead of 720K.

Diminishing returns on `main`. The old formula already tops out at recall ~0.41 around `num_candidates=500` and can't improve further even at 10K (visiting 72% of all vectors). This suggests GIST-1M is a hard dataset where recall is bounded by quantization quality, not visit budget.

The fix makes `num_candidates` a direct, predictable knob. With the linear formula, `num_candidates` directly controls how many vectors are traversed. Users who need higher recall can increase it, understanding the cost linearly.

Closes #142617
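For reference, the linear mapping in this solution design can be sketched as a standalone method (class and method names are assumptions; the clamp constants are the ones stated above):

```java
// Sketch of the final linear mapping: min(10_000, max(numCands, 5*k)) / N,
// per the solution design above. Names are assumptions, not the actual patch.
public final class LinearVisitRatioSketch {
    static float computeDynamicVisitRatio(long numVectors, int numCands, int k) {
        // Floor at 5*k, ceiling at 10_000, then divide by segment size N.
        long clamped = Math.min(10_000L, Math.max((long) numCands, 5L * k));
        return (float) ((double) clamped / Math.max(numVectors, 1));
    }

    public static void main(String[] args) {
        // num_candidates=100 on 1M vectors -> budget of 100 vectors (0.01%).
        System.out.println(computeDynamicVisitRatio(1_000_000, 100, 10));
    }
}
```

Because the budget is linear in `num_candidates` (within the floor and ceiling), doubling it doubles the traversal cost, which is the predictability the description claims.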