fix(#4334): convert JVector EUCLIDEAN similarity to L2² distance in all search paths#4343
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses issue #4334 where the EUCLIDEAN K-NN search incorrectly returned the worst matches first. The fix correctly converts JVector's similarity score back to squared L2 distance across five different search paths in LSMVectorIndex.java. Additionally, a comprehensive regression test suite and documentation have been added. The review feedback suggests refactoring the duplicated similarity-to-distance conversion logic into a single centralized helper method to improve maintainability and code quality.
| final float distance = switch (metadata.similarityFunction) { | ||
| case COSINE -> 2.0f * (1.0f - score); | ||
| case EUCLIDEAN -> score; | ||
| case EUCLIDEAN -> score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE; | ||
| case DOT_PRODUCT -> -score; | ||
| default -> score; | ||
| }; |
There was a problem hiding this comment.
The logic to convert similarity scores to distances is duplicated across five different methods in LSMVectorIndex.java:
mergeWithDeltaScan(lines 2733–2738)bruteForceScan(lines 2779–2784)findNeighborsFromVector(lines 2971–2982)findNeighborsFromVectorGrouped(lines 3169–3174)findNeighborsFromVectorApproximate(lines 3318–3327)
To improve maintainability, reduce code duplication, and ensure consistency, this conversion logic should be refactored into a single private helper method. For example:
private float scoreToDistance(final VectorSimilarityFunction similarityFunction, final float score) {
return switch (similarityFunction) {
case COSINE -> 2.0f * (1.0f - score);
case EUCLIDEAN -> score > 0.0f ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
case DOT_PRODUCT -> -score;
default -> score;
};
}This centralizes the formula and its associated comments, making future updates to similarity functions much safer and cleaner.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 5 |
🟢 Coverage 92.86% diff coverage
Metric Results Coverage variation Report missing for a68f6631 Diff coverage ✅ 92.86% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (a68f663) Report Missing Report Missing Report Missing Head commit (2eb2449) 159317 105635 66.30% Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4343) 14 13 92.86% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%1 Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
|
Test comment from claude bot - please ignore |
|
deleted |
Code Review - PR #4343The fix is mathematically correct and well-targeted. Applied consistently across all five code paths in Issues to address1. r.setProperty("score", 1.0f - distance);Before this fix: EUCLIDEAN distance was raw similarity in (0,1], so After this fix: EUCLIDEAN distance is L2^2 in [0, inf), so This conversion is correct only for COSINE (where 2. Delta-scan and brute-force paths not covered by regression tests (minor) The two new tests commit their vectors before searching, so JVector routes them through the HNSW graph. That exercises the fix at
The PR claims all five paths are covered, but the tests only directly exercise one code path (primary graph search). 3. Breaking change in returned distance values (documentation only) Users of
The ordering fix is correct, but this is a breaking API change for any user who stores or thresholds on the raw distance values. Worth calling out in release notes. Minor nibThe Summary: The core fix is mathematically correct and should be merged. The |
|
Code review posted - see details below |
Code ReviewSummaryThis fix correctly addresses the root cause: JVector's Positives
Issues and Suggestions1. Javadoc comment violates project style
2. The switch has 3. Delta-scan path not covered by new tests Both new tests call 4. Minor:
VerdictThe fix is correct and well-structured. Items 1 and 2 are the most actionable before merge; item 3 is a nice-to-have for test completeness. |
Code ReviewSummary: This PR fixes a correctness bug where JVector's What works wellMathematical correctness: The EUCLIDEAN inversion formula Refactoring to a helper: Extracting Removing Test coverage: The four test methods cover the fix at multiple layers - the helper formula directly, the index API, and the full Cypher query path. Issues to address1. Mixed similarity functions across shards (DbIndexVectorQueryNodes.java:127)
2. COSINE score can be negative in DbIndexVectorQueryNodes (pre-existing, newly exposed) For COSINE,
The test 3. Documentation file should not be committed
Minor notes
Verdict: The core bug fix is correct and the refactoring is clean. Address items 1-3 before merging - especially item 2 (COSINE negative score) and item 3 (remove the docs file). |
|
Code Review for fix/#4334: EUCLIDEAN KNN ordering The fix is correct. JVector EUCLIDEAN.compare() returns similarity = 1/(1+L2sq), not distance. scoreToDistance correctly inverts it to L2sq = 1/similarity - 1, so ascending sort returns nearest first. The refactor eliminates 5 identical switch blocks cleanly. Issues to consider:
Minor nit: docs/4334-euclidean-knn-wrong-order.md has useful root-cause docs, but the Review cycles and Cycle N declined items sections are process bookkeeping that will become stale. Trimming those before merge would leave a cleaner artifact. Verdict: The core fix is correct and the refactoring is clean. Recommend opening a follow-up for the COSINE/DOT_PRODUCT inversion inconsistency in DbIndexVectorQueryNodes. |
…dexVectorQueryNodes, make switch exhaustive, trim Javadoc
ab18995 to
2eb2449
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4343 +/- ##
============================================
- Coverage 65.37% 64.62% -0.76%
- Complexity 0 432 +432
============================================
Files 1610 1647 +37
Lines 123789 127639 +3850
Branches 26776 27360 +584
============================================
+ Hits 80925 82484 +1559
- Misses 31429 33504 +2075
- Partials 11435 11651 +216 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Since engine PR #4343, findNeighborsFromVector exposes a squared Euclidean distance (lower is better) for the EUCLIDEAN metric, matching COSINE and DOT_PRODUCT which already return distances. The Python binding still treated EUCLIDEAN as a similarity: it sorted EUCLIDEAN results in reverse (returning the worst matches first) and test_lsm_euclidean_distance asserted a 1/(1+d²) similarity. Drop the EUCLIDEAN reverse-sort special case so every metric sorts ascending by distance, update the class docstring, and assert the squared-distance values (origin -> 0.0, point -> 25.0) in the test. Verified against freshly built 26.6.1 engine jars: the full tests/test_vector.py suite passes (38 passed, 4 skipped).
Closes #4334
JVector's
VectorSimilarityFunction.EUCLIDEAN.compare(u, v)returns a similarity value of the form1/(1+L2²)(larger = closer), not a distance. All five distance-conversion switch expressions inLSMVectorIndexwere usingscoreas-is and then sorting ascending, which placed the least similar (farthest) candidates first.The fix converts similarity back to squared L2 distance before sorting:
Since
similarity = 1/(1+L2²), inverting givesL2² = 1/similarity - 1. Ascending sort on L2² correctly returns nearest neighbours first. Thescore > 0guard is defensive; JVector's EUCLIDEAN score is always positive.Affected paths fixed (all in
LSMVectorIndex.java):mergeWithDeltaScan- delta vectors inserted since last graph rebuildbruteForceScan- brute-force fallback when graph returns too few resultsfindNeighborsFromVector- primary graph search pathfindNeighborsFromVectorGroupBy- grouped search pathTest plan
LSMVectorIndexEuclideanOrderingTest.euclideanKnnReturnClosestFirst- k=1 with a near and far vector: verifies the near vector is returnedLSMVectorIndexEuclideanOrderingTest.euclideanKnnOrdering- k=3 with three vectors at known L2² distances: verifies results are in non-decreasing distance order and nearest is firstLSMVectorIndex*test suite: 139 tests, 0 failuresCypherCallVectorNeighborsTest,SelectVectorTest: no regressions