Description
First, kudos on the default vector index (JVector) — the speedups are substantial.
We compared the default implementation against the legacy hnswlib index on both small (~10k vectors) and large (~86k vectors) datasets:
- Index build
  - JVector: ~0.6–0.7s for ~86k vectors
  - hnswlib: ~25–30 min for the same data
- Query latency
  - JVector (after warmup): ~0.002–0.004s
  - hnswlib: ~0.02–0.05s
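Performance-wise, JVector is clearly faster: roughly 10x on query latency and several orders of magnitude on index build. One caveat: if JVector defers graph construction to the first query (see the hypothesis below), the two build figures may not measure the same work.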
Differences in retrieval behavior
However, we consistently observe differences in neighbor selection compared to hnswlib.
These differences:
- Appear on both the small and large MovieLens datasets
- Are not limited to distance scaling or ordering: for some queries the returned neighbor sets themselves differ
- Persist after warmup (though the results stabilize across repeated queries)
Concrete examples
Query: Jurassic Park (1993)
For the same embedding model and query vector:
JVector
1. Jurassic World (2015) (distance: 0.0952)
2. Jurassic Attack (2012) (distance: 0.0992)
3. Jurassic World Dominion (2022) (distance: 0.1018)
4. Jurassic Hunt (2021) (distance: 0.1153)
5. The Jurassic Games (2018) (distance: 0.1183)
hnswlib
1. Jurassic World (2015) (distance: 0.1904)
2. Jurassic Attack (2012) (distance: 0.1983)
3. Jurassic World Dominion (2022) (distance: 0.2036)
4. Jurassic Hunt (2021) (distance: 0.2307)
5. The Jurassic Games (2018) (distance: 0.2366)
Here both indexes return the same five neighbors in the same order. Only the reported distances differ, by a factor of about 2 in every row, which points to a different distance normalization rather than a different neighbor selection.
Query: Forrest Gump (1994)
A second example, again using the same query vector and embedding model for both indexes:
JVector
1. Favor, The (1994) (distance: 0.1848)
2. What Happened Was... (1994) (distance: 0.1882)
3. Pillertrillaren (1994) (distance: 0.1915)
4. In Custody (1994) (distance: 0.1927)
5. Pulp Fiction (1994) (distance: 0.1928)
hnswlib
1. War, The (1994) (distance: 0.3277)
2. In the Army Now (1994) (distance: 0.3507)
3. This Means War (2012) (distance: 0.3533)
4. Fortunes of War (1994) (distance: 0.3554)
5. Men of War (1994) (distance: 0.3559)
Here the two top-5 lists share no titles. Both lists still look semantically plausible, which suggests that both indexes retrieve valid neighbors but select them differently. We do not expect strict equivalence with hnswlib, but we would like to understand what recall and determinism guarantees the default index is intended to provide.
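The two comparisons above can be quantified with a short script. This is a minimal sketch using only the top-5 lists quoted in this issue. The helper names `overlap_at_k` and `distance_ratios` are made up for illustration and are not part of any library.

```python
# Top-5 results copied from the examples above: (title, reported distance).
jurassic_jvector = [
    ("Jurassic World (2015)", 0.0952),
    ("Jurassic Attack (2012)", 0.0992),
    ("Jurassic World Dominion (2022)", 0.1018),
    ("Jurassic Hunt (2021)", 0.1153),
    ("The Jurassic Games (2018)", 0.1183),
]
jurassic_hnswlib = [
    ("Jurassic World (2015)", 0.1904),
    ("Jurassic Attack (2012)", 0.1983),
    ("Jurassic World Dominion (2022)", 0.2036),
    ("Jurassic Hunt (2021)", 0.2307),
    ("The Jurassic Games (2018)", 0.2366),
]
gump_jvector = [
    ("Favor, The (1994)", 0.1848),
    ("What Happened Was... (1994)", 0.1882),
    ("Pillertrillaren (1994)", 0.1915),
    ("In Custody (1994)", 0.1927),
    ("Pulp Fiction (1994)", 0.1928),
]
gump_hnswlib = [
    ("War, The (1994)", 0.3277),
    ("In the Army Now (1994)", 0.3507),
    ("This Means War (2012)", 0.3533),
    ("Fortunes of War (1994)", 0.3554),
    ("Men of War (1994)", 0.3559),
]


def overlap_at_k(a, b):
    """Fraction of titles in list `a` that also appear in list `b`."""
    titles_a = {title for title, _ in a}
    titles_b = {title for title, _ in b}
    return len(titles_a & titles_b) / len(titles_a)


def distance_ratios(a, b):
    """Ratio (distance in b) / (distance in a) for titles present in both lists."""
    dist_b = dict(b)
    return {title: dist_b[title] / d for title, d in a if title in dist_b}


print(overlap_at_k(jurassic_jvector, jurassic_hnswlib))  # 1.0: same neighbor set
print(distance_ratios(jurassic_jvector, jurassic_hnswlib))  # every ratio is close to 2
print(overlap_at_k(gump_jvector, gump_hnswlib))  # 0.0: disjoint neighbor sets

# Assumption (unverified): the ~2x scale factor also applies to the second query.
# Under that assumption, compare JVector's best hit with hnswlib's worst hit.
rescaled_best_jvector = 2 * gump_jvector[0][1]
worst_hnswlib = gump_hnswlib[-1][1]
print(rescaled_best_jvector > worst_hnswlib)
```

If the same scale factor holds for the Forrest Gump query, JVector's best hit would sit farther from the query than all five hnswlib hits. That would point to a recall gap and not a tie between equally close neighbors. We have not verified that assumption.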
Additional observations / hypothesis
Based on logs and behavior, it seems likely that the differences are influenced by:
- Lazy graph construction in JVector (first query triggers graph build)
- Graph rebuilds from pages, where:
  - Deleted / obsolete entries are skipped
  - Active vector count may differ slightly from total records
- Different graph traversal and entry-point heuristics compared to classic HNSW
In particular, early queries (issued before the graph is fully materialized) can return different neighbors than later queries, and even after warmup the neighbor sets are not identical to hnswlib's.
This looks like a design tradeoff rather than a correctness bug, but it would be helpful to understand the guarantees.
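Since both indexes are approximate, neither can serve as ground truth for the other. A brute-force scan gives an exact reference against which the recall of each index can be measured. The sketch below assumes cosine distance (the metric is not stated in this issue) and uses toy vectors. The function names are illustrative only.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity. The metric is an assumption, not confirmed here."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_top_k(query, vectors, k):
    """Exact k nearest ids by scanning every vector (vectors: id -> list of floats)."""
    ranked = sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data standing in for the real embeddings.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.05]

truth = exact_top_k(query, vectors, k=2)
print(truth)  # ['a', 'b']

# Pretend an approximate index returned "a" and "c" for the same query.
print(recall_at_k(["a", "c"], truth))  # 0.5
```

Running this against the real embeddings for a sample of queries would show whether JVector, hnswlib, or both fall short of the exact neighbors, which is a more direct measure than comparing the two indexes with each other.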
Questions
- Is the difference in neighbor selection expected by design given JVector’s hybrid / lazy graph construction?
- Is there a way to:
  - Force eager graph construction
  - Avoid query-time rebuilds
  - Improve determinism or recall guarantees?
- Are there documented expectations around equivalence vs classic HNSW behavior?
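On the determinism question, one way to check it empirically is to repeat the same query and record when the results stop changing. The sketch below is illustrative only: `search` is a hypothetical callable of the form `search(query, k) -> list of ids`, and `FakeLazyIndex` is a stand-in that imitates a lazily built index. Neither reflects the real index API.

```python
def repeated_results(search, query, k, repeats):
    """Run the same query `repeats` times and collect each ordered result."""
    return [tuple(search(query, k)) for _ in range(repeats)]


def first_stable_run(runs):
    """Index of the earliest run from which every later run equals the last one."""
    if not runs:
        raise ValueError("runs must not be empty")
    i = len(runs) - 1
    while i > 0 and runs[i - 1] == runs[-1]:
        i -= 1
    return i


class FakeLazyIndex:
    """Stand-in only: the first call returns a different result than later calls."""

    def __init__(self):
        self.calls = 0

    def search(self, query, k):
        self.calls += 1
        ids = [1, 2, 9, 4, 5] if self.calls == 1 else [1, 2, 3, 4, 5]
        return ids[:k]


index = FakeLazyIndex()
runs = repeated_results(index.search, query=[0.1, 0.2], k=3, repeats=4)
print(runs)  # [(1, 2, 9), (1, 2, 3), (1, 2, 3), (1, 2, 3)]
print(first_stable_run(runs))  # 1: results are stable from the second query on
```

A result of 0 would mean the index answered identically from the first query. A larger value gives the number of warmup queries needed before the results settled.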
The performance gains are excellent; we mainly want to understand the expected correctness and equivalence guarantees of the default vector index.
Happy to provide a minimal reproducer if useful.