[DISCUSS] Should there be a threshold-based vector search API?

### Context

Almost all [vector search algorithms](https://ann-benchmarks.com/index.html#algorithms) focus on getting the `topK` highest-scoring results for a given query vector. This however, may not be the best objective to pursue because:

- Some queries are more suited for vector search: Indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query is present in a sparse part of the space, the `topK` results will be low scoring (and not relevant to the query). We may want some kind of score `threshold`, below which a vector is not considered as a "match"?
- Conversely, if the query is present in a dense part of the vector space, we may not want to limit vector search to `topK` when we have some more (slightly lower scoring) results, but still above the `threshold`. Considering a query and a vector as a "match" if their similarity score is above the `threshold`, we may want ALL possible matches?

I found the same thing being discussed in #11946 as well.. However, I'm not aware of any algorithms designed for this objective (or how common of a use-case this may be)

### Proposal

**IF** this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (as it was designed for getting the K-Nearest Neighbors). On a high-level, the current algorithm is:
- Maintain a queue of `topK` results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores
- Starting from the best (highest scoring) candidate, we insert it into the result queue if there are (1) fewer than `topK` results (2) it is better scoring than any existing result (also removing the least scoring result in this case)
- If a node is added as a result, consider all its neighbors as further candidates
- The search stops when the best scoring candidate is worse than the least scoring result

For the new objective, the underlying graphs created by HNSW may still be useful - where each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore further

Can we instead change the graph traversal (and result collection) criteria such that:
- Consider a node as a candidate if it exceeds a lower, graph-traversal specific score threshold
- Consider a candidate as a result if it exceeds the actual, query-result similarity score threshold

The lower graph-traversal threshold acts as a buffer, to retain results where all nodes along its path may not be above the actual `threshold` (and is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries)

The upper levels of search will remain the same (find the single-best entry point for the next layer) and this new criteria is only applicable to the lowest layer (with all nodes) to collect an unbounded number of results above the `threshold`

This will also eliminate the need to maintain results and candidates as priority queues, instead keeping them as simple lists (reducing the overhead of `heapify` calls that maintain the min-max heap property) and just sort them once at the end

**Looking for inputs on both the use-case and tweaked algorithm..**

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSS] Should there be a threshold-based vector search API? #12579

Context

Proposal

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[DISCUSS] Should there be a threshold-based vector search API? #12579

Description

Context

Proposal

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions