Skip to content

fix(#4334): convert JVector EUCLIDEAN similarity to L2² distance in all search paths#4343

Merged
robfrank merged 4 commits into
mainfrom
fix/4334-euclidean-knn-wrong-order
May 26, 2026
Merged

fix(#4334): convert JVector EUCLIDEAN similarity to L2² distance in all search paths#4343
robfrank merged 4 commits into
mainfrom
fix/4334-euclidean-knn-wrong-order

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4334

JVector's VectorSimilarityFunction.EUCLIDEAN.compare(u, v) returns a similarity value of the form 1/(1+L2²) (larger = closer), not a distance. All five distance-conversion switch expressions in LSMVectorIndex were using score as-is and then sorting ascending, which placed the least similar (farthest) candidates first.

The fix converts similarity back to squared L2 distance before sorting:

case EUCLIDEAN -> score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE;

Since similarity = 1/(1+L2²), inverting gives L2² = 1/similarity - 1. Ascending sort on L2² correctly returns nearest neighbours first. The score > 0 guard is defensive; JVector's EUCLIDEAN score is always positive.

Affected paths fixed (all in LSMVectorIndex.java):

  • mergeWithDeltaScan - delta vectors inserted since last graph rebuild
  • bruteForceScan - brute-force fallback when graph returns too few results
  • findNeighborsFromVector - primary graph search path
  • findNeighborsFromVectorGroupBy - grouped search path
  • Zero-disk-I/O PQ search path

Test plan

  • LSMVectorIndexEuclideanOrderingTest.euclideanKnnReturnClosestFirst - k=1 with a near and far vector: verifies the near vector is returned
  • LSMVectorIndexEuclideanOrderingTest.euclideanKnnOrdering - k=3 with three vectors at known L2² distances: verifies results are in non-decreasing distance order and nearest is first
  • Full LSMVectorIndex* test suite: 139 tests, 0 failures
  • CypherCallVectorNeighborsTest, SelectVectorTest: no regressions

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4334 where the EUCLIDEAN K-NN search incorrectly returned the worst matches first. The fix correctly converts JVector's similarity score back to squared L2 distance across five different search paths in LSMVectorIndex.java. Additionally, a comprehensive regression test suite and documentation have been added. The review feedback suggests refactoring the duplicated similarity-to-distance conversion logic into a single centralized helper method to improve maintainability and code quality.

Comment on lines 2733 to 2738
final float distance = switch (metadata.similarityFunction) {
case COSINE -> 2.0f * (1.0f - score);
case EUCLIDEAN -> score;
case EUCLIDEAN -> score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
case DOT_PRODUCT -> -score;
default -> score;
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to convert similarity scores to distances is duplicated across five different methods in LSMVectorIndex.java:

  1. mergeWithDeltaScan (lines 2733–2738)
  2. bruteForceScan (lines 2779–2784)
  3. findNeighborsFromVector (lines 2971–2982)
  4. findNeighborsFromVectorGrouped (lines 3169–3174)
  5. findNeighborsFromVectorApproximate (lines 3318–3327)

To improve maintainability, reduce code duplication, and ensure consistency, this conversion logic should be refactored into a single private helper method. For example:

private float scoreToDistance(final VectorSimilarityFunction similarityFunction, final float score) {
  return switch (similarityFunction) {
    case COSINE -> 2.0f * (1.0f - score);
    case EUCLIDEAN -> score > 0.0f ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
    case DOT_PRODUCT -> -score;
    default -> score;
  };
}

This centralizes the formula and its associated comments, making future updates to similarity functions much safer and cleaner.

@codacy-production

codacy-production Bot commented May 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 5 complexity

Metric Results
Complexity 5

View in Codacy

🟢 Coverage 92.86% diff coverage

Metric Results
Coverage variation Report missing for a68f6631
Diff coverage 92.86% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (a68f663) Report Missing Report Missing Report Missing
Head commit (2eb2449) 159317 105635 66.30%

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4343) 14 13 92.86%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

1 Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Test comment from claude bot - please ignore

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

deleted

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review - PR #4343

The fix is mathematically correct and well-targeted. VectorSimilarityFunction.EUCLIDEAN.compare() returns similarity = 1/(1+L2^2), so inverting to L2^2 via (1/score) - 1 and sorting ascending correctly places the nearest candidates first. The defensive score > 0 guard is appropriate.

Applied consistently across all five code paths in LSMVectorIndex.java
Tests use AssertJ as per project conventions ✓
PR description clearly documents the root cause and fix ✓


Issues to address

1. DbIndexVectorQueryNodes.java:135 - score computation breaks for EUCLIDEAN (medium)

r.setProperty("score", 1.0f - distance);

Before this fix: EUCLIDEAN distance was raw similarity in (0,1], so 1.0f - distance was bounded in [0,1). The computation was semantically odd but numerically harmless.

After this fix: EUCLIDEAN distance is L2^2 in [0, inf), so 1.0f - L2^2 can be hugely negative for any non-trivial vector pair. Callers of db.index.vector.queryNodes with EUCLIDEAN indexes will now receive scores like -199.0, breaking any downstream logic that expects a similarity in [0,1].

This conversion is correct only for COSINE (where distance = 1 - cos, so 1 - distance = cos). A per-similarity-function conversion is needed here too, analogous to what was just fixed in LSMVectorIndex.java. For example, 1.0f / (1.0f + distance) maps L2^2 back to the familiar (0, 1] similarity scale.

2. Delta-scan and brute-force paths not covered by regression tests (minor)

The two new tests commit their vectors before searching, so JVector routes them through the HNSW graph. That exercises the fix at findNeighborsFromVector lines 2971-2978. However:

  • mergeWithDeltaScan (line 2735) runs for vectors inserted since the last graph rebuild. A test that inserts a vector without triggering a rebuild, then queries, would target this path directly.
  • bruteForceScan (line 2781) fires only when the graph returns far fewer results than expected - harder to trigger deterministically.
  • findNeighborsFromVectorGrouped and findNeighborsFromVectorApproximate are also untested by the new tests.

The PR claims all five paths are covered, but the tests only directly exercise one code path (primary graph search).

3. Breaking change in returned distance values (documentation only)

Users of vectorNeighbors() SQL function or Gremlin vectorNeighbors will see different numerical distance values for EUCLIDEAN after this fix:

  • Before: distance in (0, 1] (raw similarity - wrong ordering)
  • After: distance in [0, inf) (L2^2 - correct ordering, different scale)

The ordering fix is correct, but this is a breaking API change for any user who stores or thresholds on the raw distance values. Worth calling out in release notes.


Minor nib

The default -> score arm in all five switch expressions passes through the raw score unchanged. Any future similarity-returning function added to JVector would silently get wrong sort order again. Pre-existing issue, but worth a comment to make the invariant explicit.


Summary: The core fix is mathematically correct and should be merged. The DbIndexVectorQueryNodes score regression needs to be addressed before or alongside the merge. The test coverage gap for the delta/brute-force paths is a nice-to-have follow-up.

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code review posted - see details below

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review

Summary

This fix correctly addresses the root cause: JVector's VectorSimilarityFunction.EUCLIDEAN.compare() returns a similarity value 1/(1+L2) (larger = closer), not a distance. All five search paths in LSMVectorIndex were using the raw score for ascending sort, placing the farthest vectors first. The fix and refactoring are sound.


Positives

  • Correct math. (1/score) - 1 properly recovers L2 from JVector's EUCLIDEAN similarity. The score > 0 guard is defensive but harmless.
  • Good refactoring. Consolidating five identical switch expressions into scoreToDistance eliminates copy-paste bugs and is the right design.
  • Test quality. The two regression tests are clear, use distinct known distances, and assert the right invariants.
  • All five paths fixed. Delta-scan, brute-force, graph, grouped, and PQ approximate paths are all covered by the refactor.

Issues and Suggestions

1. Javadoc comment violates project style

LSMVectorIndex.java lines 2705-2710: project guidelines disallow multi-line comment blocks (one short line max). The six-line Javadoc block should be trimmed to a single-line comment:

// JVector returns similarity (larger=closer); convert to distance for ascending sort.

2. default branch silences future enum additions

The switch has default -> score (returning raw similarity), inherited from the old per-site code. If JVector adds a fourth similarity function, this silently returns the wrong value. Java 21+ exhaustive switches produce a compile error for missing cases, so removing default is safer (assuming VectorSimilarityFunction has exactly COSINE, EUCLIDEAN, DOT_PRODUCT):

return switch (similarityFunction) {
  case COSINE      -> 2.0f * (1.0f - score);
  case EUCLIDEAN   -> score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
  case DOT_PRODUCT -> -score;
};

3. Delta-scan path not covered by new tests

Both new tests call findNeighborsFromVector, exercising the graph path. The mergeWithDeltaScan path fires only when vectors were inserted after the last graph rebuild (the deltaVectors buffer is non-empty). A test that inserts, queries before triggering a rebuild, and checks ascending order would close this coverage gap.

4. Minor: docs/ file duplicates the PR description

docs/4334-euclidean-knn-wrong-order.md follows the established repo pattern (4317, 4320, etc.) so it is consistent. No blocking concern.


Verdict

The fix is correct and well-structured. Items 1 and 2 are the most actionable before merge; item 3 is a nice-to-have for test completeness.

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review

Summary: This PR fixes a correctness bug where JVector's EUCLIDEAN.compare() returns a similarity value 1/(1+L2^2) (larger = closer), but the code was treating it as a distance and sorting ascending - returning farthest candidates first. The fix extracts a scoreToDistance helper and applies the correct inversion formula across all five affected search paths.


What works well

Mathematical correctness: The EUCLIDEAN inversion formula score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE is correct. Given similarity = 1/(1+L2^2), solving for L2^2 gives exactly (1/similarity) - 1. The defensive guard for zero similarity producing Float.MAX_VALUE is appropriate.

Refactoring to a helper: Extracting scoreToDistance removes five copies of the same (previously buggy) switch expression. The test class can call it directly because it is in the same package - a clean design.

Removing default -> score: The original switch expressions had a silent default -> score fallback that would have returned raw similarity as a distance for any future enum value added to VectorSimilarityFunction. The new exhaustive switch is strictly safer - the compiler will require explicit handling of new enum values.

Test coverage: The four test methods cover the fix at multiple layers - the helper formula directly, the index API, and the full Cypher query path. scoreToDistanceHelperRoundsTripEuclidean is a particularly clean unit test for verifying the math.


Issues to address

1. Mixed similarity functions across shards (DbIndexVectorQueryNodes.java:127)

resolveVectorIndexes can return multiple LSMVectorIndex instances (one per bucket/shard), and the code uses only the first one's similarity function to convert all distances back to scores. If two shards somehow have different similarity functions, the output scores for results from non-first shards will be computed with the wrong formula. In practice this is unlikely since all shards on the same type/property share the same metadata, but it would be more defensive to validate that all resolved indexes agree on a single VectorSimilarityFunction and throw clearly if not - analogous to the existing validation in extractQueryVector.

2. COSINE score can be negative in DbIndexVectorQueryNodes (pre-existing, newly exposed)

For COSINE, scoreToDistance returns 2*(1-rawScore) which ranges in [0, 2]. The current fallback 1.0f - distance maps that back to [-1, 1], meaning two near-opposite vectors could produce a negative score through the Cypher API. Since the PR already touches this code block, it is a natural place to fix. A full switch mirroring the inverse of scoreToDistance would be consistent:

  • COSINE: 1.0f - distance / 2.0f (maps [0,2] to [0,1])
  • DOT_PRODUCT: 1.0f + distance (maps negated dot product back)
  • EUCLIDEAN: existing formula 1.0f / (1.0f + distance)

The test queryNodesEuclideanScoreIsPositiveSimilarity only validates the EUCLIDEAN path; a corresponding COSINE test would catch this.

3. Documentation file should not be committed

docs/4334-euclidean-knn-wrong-order.md duplicates the PR description and contains ephemeral details (line numbers, status checkboxes). The context belongs in the PR description, the commit message, and inline Javadoc on scoreToDistance - not as a tracked file in docs/.


Minor notes

  • scoreToDistanceHelperRoundsTripEuclidean uses isEqualTo for float comparison. The specific values chosen (L2^2=4.0) happen to round-trip exactly in IEEE 754, but isCloseTo(..., offset(1e-5f)) would be more robust for future edits.
  • Tests are correctly placed in package com.arcadedb.index.vector to access the package-private scoreToDistance without widening its visibility.
  • No @Tag("slow") annotation needed - the tests are lightweight enough for standard CI.

Verdict: The core bug fix is correct and the refactoring is clean. Address items 1-3 before merging - especially item 2 (COSINE negative score) and item 3 (remove the docs file).

@claude

claude Bot commented May 26, 2026

Copy link
Copy Markdown

Code Review for fix/#4334: EUCLIDEAN KNN ordering

The fix is correct. JVector EUCLIDEAN.compare() returns similarity = 1/(1+L2sq), not distance. scoreToDistance correctly inverts it to L2sq = 1/similarity - 1, so ascending sort returns nearest first. The refactor eliminates 5 identical switch blocks cleanly.

Issues to consider:

  1. Score reconstruction in DbIndexVectorQueryNodes remains inconsistent for COSINE and DOT_PRODUCT (pre-existing, out of scope here). After the fix: EUCLIDEAN uses 1/(1+distance) which is correct. The else branch uses 1 - distance for both COSINE (scoreToDistance gives distance = 2*(1-rawScore), so correct inverse is 1 - distance/2, not 1 - distance) and DOT_PRODUCT (scoreToDistance gives distance = -rawScore, so 1 - distance = 1 + rawScore which exceeds 1.0). These were already wrong before this PR. A follow-up issue tracking the COSINE/DOT_PRODUCT inversion would prevent this from being lost.

  2. Multi-shard similarity function assumption is implicit. The code takes the similarity function only from vectorIndexes.getFirst(). The claim that all shards share one LSMVectorIndexMetadata is likely correct by construction, but a one-line comment naming that invariant would make silent corruption impossible to miss if it ever breaks.

  3. Tests only exercise bruteForceScan and mergeWithDeltaScan paths (2-3 vectors is too small to trigger HNSW graph construction). The graph search path and grouped search path are covered indirectly only. Acceptable as a regression test; a follow-up test with enough vectors to force graph build would give stronger path coverage.

Minor nit: docs/4334-euclidean-knn-wrong-order.md has useful root-cause docs, but the Review cycles and Cycle N declined items sections are process bookkeeping that will become stale. Trimming those before merge would leave a cleaner artifact.

Verdict: The core fix is correct and the refactoring is clean. Recommend opening a follow-up for the COSINE/DOT_PRODUCT inversion inconsistency in DbIndexVectorQueryNodes.

@robfrank robfrank force-pushed the fix/4334-euclidean-knn-wrong-order branch from ab18995 to 2eb2449 Compare May 26, 2026 12:22
@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.62%. Comparing base (128872a) to head (2eb2449).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...java/com/arcadedb/index/vector/LSMVectorIndex.java 88.88% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4343      +/-   ##
============================================
- Coverage     65.37%   64.62%   -0.76%     
- Complexity        0      432     +432     
============================================
  Files          1610     1647      +37     
  Lines        123789   127639    +3850     
  Branches      26776    27360     +584     
============================================
+ Hits          80925    82484    +1559     
- Misses        31429    33504    +2075     
- Partials      11435    11651     +216     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit 8317db4 into main May 26, 2026
28 of 40 checks passed
robfrank added a commit that referenced this pull request May 27, 2026
… contract

After PR #4343 (issue #4334), findNeighborsFromVector returns L2 squared
distance for EUCLIDEAN, not the raw JVector similarity. Update the stale
assertion so identical vectors compare to 0.0 instead of 1.0.
@robfrank robfrank added this to the 26.6.1 milestone May 27, 2026
@robfrank robfrank self-assigned this May 27, 2026
robfrank added a commit that referenced this pull request May 28, 2026
Since engine PR #4343, findNeighborsFromVector exposes a squared
Euclidean distance (lower is better) for the EUCLIDEAN metric, matching
COSINE and DOT_PRODUCT which already return distances. The Python
binding still treated EUCLIDEAN as a similarity: it sorted EUCLIDEAN
results in reverse (returning the worst matches first) and
test_lsm_euclidean_distance asserted a 1/(1+d²) similarity.

Drop the EUCLIDEAN reverse-sort special case so every metric sorts
ascending by distance, update the class docstring, and assert the
squared-distance values (origin -> 0.0, point -> 25.0) in the test.

Verified against freshly built 26.6.1 engine jars: the full
tests/test_vector.py suite passes (38 passed, 4 skipped).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LSMVectorIndex treats JVector's EUCLIDEAN return as distance — K-NN returns the worst matches first

1 participant