Vector search is much faster than hnswlib, but returns different neighbors

First, kudos on the default vector index (JVector) — the **speedups are substantial**.

We compared the default implementation against the legacy hnswlib index on both **small (~10k vectors)** and **large (~86k vectors)** datasets:

* **Index build**

  * JVector: ~0.6–0.7s for ~86k vectors
  * hnswlib: ~25–30 min for the same data
* **Query latency**

  * JVector (after warmup): ~0.002–0.004s
  * hnswlib: ~0.02–0.05s

On these numbers, JVector builds the index roughly three orders of magnitude faster and answers queries about an order of magnitude faster.

---

## Differences in retrieval behavior

However, we consistently observe **differences in neighbor selection** compared to hnswlib.

These differences:

* Appear on **both small and large MovieLens datasets**
* Are **not limited to distance scaling or ordering**: for some queries the returned neighbor sets themselves differ
* Persist after warmup (though results stabilize)

### Concrete examples

#### Query: Jurassic Park (1993)

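For the same embedding model and query vector, both indexes return the same five titles in the same order. Only the reported distances differ, with hnswlib's values roughly twice JVector's: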

**JVector**

```
1. Jurassic World (2015) (distance: 0.0952)
2. Jurassic Attack (2012) (distance: 0.0992)
3. Jurassic World Dominion (2022) (distance: 0.1018)
4. Jurassic Hunt (2021) (distance: 0.1153)
5. The Jurassic Games (2018) (distance: 0.1183)
```

**hnswlib**

```
1. Jurassic World (2015) (distance: 0.1904)
2. Jurassic Attack (2012) (distance: 0.1983)
3. Jurassic World Dominion (2022) (distance: 0.2036)
4. Jurassic Hunt (2021) (distance: 0.2307)
5. The Jurassic Games (2018) (distance: 0.2366)
```
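
To make the scaling in this example explicit, the short check below divides each hnswlib distance by the JVector distance at the same rank. The values are copied from the two lists above; nothing else is assumed. A near-constant ratio would be consistent with, for example, one index reporting `1 - cos` and the other `(1 - cos) / 2`, but that is only a guess about the cause, not something we have confirmed.

```python
# Distances copied from the Jurassic Park (1993) results above.
jvector = [0.0952, 0.0992, 0.1018, 0.1153, 0.1183]
hnswlib = [0.1904, 0.1983, 0.2036, 0.2307, 0.2366]

# Ratio of the two reported distances at each rank.
ratios = [h / j for j, h in zip(jvector, hnswlib)]
for rank, ratio in enumerate(ratios, start=1):
    print(f"rank {rank}: hnswlib / JVector = {ratio:.3f}")

# A small spread means both indexes agree on the ranking here and differ
# only in how the distance is scaled.
spread = max(ratios) - min(ratios)
print(f"ratio spread: {spread:.4f}")
```

For this query the difference looks like a pure distance-scaling effect rather than a difference in neighbor selection.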

#### Query: Forrest Gump (1994)

For the same embedding model and query vector, the two indexes return entirely different neighbor sets:

**JVector**

```
1. Favor, The (1994) (distance: 0.1848)
2. What Happened Was... (1994) (distance: 0.1882)
3. Pillertrillaren (1994) (distance: 0.1915)
4. In Custody (1994) (distance: 0.1927)
5. Pulp Fiction (1994) (distance: 0.1928)
```

**hnswlib**

```
1. War, The (1994) (distance: 0.3277)
2. In the Army Now (1994) (distance: 0.3507)
3. This Means War (2012) (distance: 0.3533)
4. Fortunes of War (1994) (distance: 0.3554)
5. Men of War (1994) (distance: 0.3559) 
```
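
As a quick way to quantify the disagreement, this sketch computes the overlap between the two top-5 lists above. The titles are copied from the results; the Jaccard score is just one possible overlap measure we picked for illustration.

```python
# Titles copied from the Forrest Gump (1994) results above.
jvector_titles = [
    "Favor, The (1994)",
    "What Happened Was... (1994)",
    "Pillertrillaren (1994)",
    "In Custody (1994)",
    "Pulp Fiction (1994)",
]
hnswlib_titles = [
    "War, The (1994)",
    "In the Army Now (1994)",
    "This Means War (2012)",
    "Fortunes of War (1994)",
    "Men of War (1994)",
]

# Titles returned by both indexes, and the Jaccard overlap of the two sets.
shared = set(jvector_titles) & set(hnswlib_titles)
union = set(jvector_titles) | set(hnswlib_titles)
jaccard = len(shared) / len(union)

print(f"shared titles: {len(shared)} of 5")
print(f"Jaccard overlap: {jaccard:.2f}")
```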

These results are **different but still semantically reasonable**, which suggests that both systems retrieve valid neighbors but with **different selection semantics**. We do not expect strict equivalence with hnswlib; we would like to understand the recall and determinism guarantees the default index is intended to provide.
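
Since both indexes are approximate, comparing them to each other cannot say which one is closer to the true neighbors. A minimal sketch of how we could measure that: compute exact neighbors by brute force and score each index with recall@k. Everything here is illustrative. The toy vectors, the ids, and the choice of `1 - cos` as the distance are assumptions; the real check would use our embeddings and the ids each index returns.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Brute-force search: ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))
    return ranked[:k]

def recall_at_k(returned_ids, exact_ids):
    """Fraction of the exact top-k that the index actually returned."""
    return len(set(returned_ids) & set(exact_ids)) / len(exact_ids)

# Toy data (hypothetical); replace with real embeddings and index output.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.05]

truth = exact_top_k(query, vectors, k=2)
index_result = ["a", "c"]  # pretend output of an approximate index
recall = recall_at_k(index_result, truth)
print(truth, recall)
```

Running this against both indexes for a sample of queries would show whether either one has low recall or whether both are returning near-ties from a dense neighborhood.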

---

## Additional observations / hypothesis

Based on logs and behavior, it seems likely that the differences are influenced by:

* **Lazy graph construction** in JVector (first query triggers graph build)
* **Graph rebuilds from pages**, where:

  * Deleted / obsolete entries are skipped
  * Active vector count may differ slightly from total records
* Different **graph traversal and entry-point heuristics** compared to classic HNSW

In particular, early queries (before full graph materialization) can return different neighbors than later queries, and even after warmup, the neighbor sets are not identical to hnswlib.
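
The warmup effect described above can be checked by issuing the same query several times and recording when the result list stops changing. This is only a sketch: `FakeLazyIndex` is a hypothetical stand-in that mimics an index whose first answer differs from later ones, and `search(query, k)` is an assumed callable shape, not the actual API of either index.

```python
def result_history(search, query, k, runs=5):
    """Run the same query repeatedly and record each returned id list."""
    return [list(search(query, k)) for _ in range(runs)]

def first_stable_run(history):
    """Index of the earliest run from which all later results are identical."""
    stable_from = len(history) - 1
    while stable_from > 0 and history[stable_from - 1] == history[-1]:
        stable_from -= 1
    return stable_from

class FakeLazyIndex:
    """Hypothetical index that answers differently until it is warmed up."""

    def __init__(self):
        self.calls = 0

    def search(self, query, k):
        self.calls += 1
        # The first call simulates a query served before the graph is ready.
        ids = ["x", "y", "z"] if self.calls == 1 else ["a", "b", "c"]
        return ids[:k]

index = FakeLazyIndex()
history = result_history(index.search, query=[0.1, 0.2], k=3)
stable_from = first_stable_run(history)
print(f"results stable from run {stable_from}")
```

Swapping the fake index for a real query call would show how many queries it takes before results settle, and whether they ever change again afterwards.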

This looks like a **design tradeoff rather than a correctness bug**, but it would be helpful to understand the guarantees.

---

## Questions

1. Is the difference in neighbor selection **expected by design** given JVector’s hybrid / lazy graph construction?
2. Is there a way to:

   * Force **eager graph construction**
   * Avoid query-time rebuilds
   * Improve determinism or recall guarantees?
3. Are there documented expectations around **equivalence vs classic HNSW** behavior?

The performance gains are excellent; we mainly want to understand the **expected correctness and equivalence guarantees** of the default vector index.

Happy to provide a minimal reproducer if useful.

