LSMVectorIndex treats JVector's EUCLIDEAN return as distance — K-NN returns the worst matches first #4334

@ruispereira


Affected version: 26.4.2 (confirmed still present in 26.5.1)
Component: com.arcadedb.index.vector.LSMVectorIndex

Summary

JVector's VectorSimilarityFunction.EUCLIDEAN.compare(...) returns a similarity of the form 1/(1+L2²) (larger = closer), not a distance.
ArcadeDB stores the value as-is and sorts ascending — placing the least similar candidates first.
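The inversion can be reproduced without ArcadeDB or JVector. The sketch below is standalone: `EuclideanOrderingDemo` and its methods are hypothetical names, and the similarity is computed by hand on the assumption that it has exactly the 1/(1+L2²) form described above. It is not a call into either library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of ArcadeDB or JVector.
final class EuclideanOrderingDemo {

  // Squared L2 distance between two vectors of equal length.
  static float squaredL2(final float[] a, final float[] b) {
    float sum = 0.0f;
    for (int i = 0; i < a.length; i++) {
      final float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Assumed similarity form from the summary: 1 / (1 + L2^2). Larger means closer.
  static float euclideanSimilarity(final float[] a, final float[] b) {
    return 1.0f / (1.0f + squaredL2(a, b));
  }

  // Ranks candidate indices ascending by the raw similarity,
  // i.e. as if the similarity were a distance.
  static List<Integer> rankAscendingByScore(final float[] query, final float[][] candidates) {
    final List<Integer> order = new ArrayList<>();
    for (int i = 0; i < candidates.length; i++)
      order.add(i);
    order.sort(Comparator.comparingDouble(i -> euclideanSimilarity(query, candidates[i])));
    return order;
  }

  public static void main(final String[] args) {
    final float[] query = {0.0f, 0.0f};
    // Index 2 is the nearest neighbour, index 1 the farthest.
    final float[][] candidates = {{1.0f, 0.0f}, {3.0f, 0.0f}, {0.1f, 0.0f}};
    System.out.println(rankAscendingByScore(query, candidates));
  }
}
```

With these three candidates the ranking comes out as `[1, 0, 2]`: the farthest vector first and the true nearest neighbour last, which matches the behaviour reported here.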

Code

engine/com/arcadedb/index/vector/LSMVectorIndex.java:2920–2931
(same wrong handling at :2682–2687 and :2728–2733)

final float distance = switch (metadata.similarityFunction) {
  case COSINE      -> 2.0f * (1.0f - score);
  case EUCLIDEAN   -> score;            // ← comment claims "already the distance"
  case DOT_PRODUCT -> -score;
  …
};
results.add(new Pair<>(bindRid(loc.rid), distance));
…
results.sort((a, b) -> Float.compare(a.getSecond(), b.getSecond()));

Impact

EUCLIDEAN K-NN queries return wrong results whenever the delta-merge or brute-force fallback paths run, i.e. whenever the vector index has an uncommitted delta or has not been compacted recently, which is the common case.

Suggested fix

- case EUCLIDEAN -> score;
+ case EUCLIDEAN -> score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
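As a sanity check on the expression above, here is a minimal standalone sketch. `EuclideanScoreConversion` and `toSquaredDistance` are hypothetical names, and the code assumes the score is exactly 1/(1+L2²). Under that assumption, inverting the score yields the squared L2 distance rather than L2 itself. Ascending order is still correct because squaring is monotonic for non-negative values. A caller that needs the true distance could take the square root.

```java
// Hypothetical helper; not part of ArcadeDB or JVector.
final class EuclideanScoreConversion {

  // Inverts score = 1 / (1 + d^2) to recover the squared L2 distance d^2.
  // A non-positive score has no finite inverse, so it maps to Float.MAX_VALUE
  // and sorts last in an ascending sort.
  static float toSquaredDistance(final float score) {
    return score > 0 ? (1.0f / score) - 1.0f : Float.MAX_VALUE;
  }

  public static void main(final String[] args) {
    // A closer vector has a higher score and must map to a smaller distance.
    final float near = toSquaredDistance(0.5f); // score for squared distance 1
    final float far = toSquaredDistance(0.1f);  // score for squared distance ~9
    System.out.println(near < far);
  }
}
```

After this conversion the existing ascending `Float.compare` sort orders candidates nearest first, so no change to the sort itself is needed.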
