Add docs for kNN search endpoint#80378
Conversation
|
This is a "rough draft" and isn't ready for a full review. I opened it to get feedback on one point: there is a lot of overlap with the |
| `docvalue_fields`:: | ||
| (Optional, string) A comma-separated list of fields to return as the docvalue | ||
| representation of a field for each hit. | ||
| representation of a field for each hit. See <<docvalue-fields>>. |
There was a problem hiding this comment.
Some of these changes aren't closely related to kNN, but I saw the opportunity to improve the field retrieval docs a bit.
There was a problem hiding this comment.
These look great. I particularly like the use of "field pattern."
jrodewig
left a comment
There was a problem hiding this comment.
This looks good so far! I'd take the same approach you have.
Tag + reuse is probably a good fit for the query + request body params, but I think it's overkill for the response. We should just link users to the existing search API response body docs and concisely point out any differences.
|
Pinging @elastic/es-docs (Team:Docs) |
|
Pinging @elastic/es-search (Team:Search) |
jrodewig
left a comment
There was a problem hiding this comment.
This looks great. I left some comments and suggestions, but feel free to disregard as wanted. Nothing is blocking, aside from a syntax error in the stored_fields def. Thanks!
|
|
||
| experimental::[] | ||
|
|
||
| Performs a k-nearest neighbor search and returns the matching documents. |
There was a problem hiding this comment.
I'd include the acronym whenever we spell out k-nearest neighbor. I also don't think we need returns matching documents, but it's fine if you want to keep it.
| Performs a k-nearest neighbor search and returns the matching documents. | |
| Performs a k-nearest neighbor (kNN) search. |
There was a problem hiding this comment.
May be a better way to phrase it would be something like: "returns top K documents as found by k-nearest search".
There was a problem hiding this comment.
If it's okay with you, I'm going to go with "Performs a k-nearest neighbor (kNN) search and returns the matching documents." I think the wording is clearer and it helps clarify what the API does: kNN search just finds vectors, but the API actually returns documents.
| [source,console] | ||
| ---- | ||
| PUT my-vector-index | ||
| { | ||
| "mappings": { | ||
| "properties": { | ||
| "image_vector": { | ||
| "type": "dense_vector", | ||
| "dims": 3, | ||
| "index": true, | ||
| "similarity": "l2_norm" | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| GET /my-vector-index/_knn_search | ||
| { | ||
| "knn": { | ||
| "field": "image_vector", | ||
| "query_vector": [0.3, 0.1, 1.2], | ||
| "k": 10, | ||
| "num_candidates": 100 | ||
| }, | ||
| "_source": ["name", "date"] | ||
| } | ||
| ---- | ||
| // TEST[setup:my_index] |
There was a problem hiding this comment.
The // TEST[setup:my_index] comment isn't doing anything. It just creates some extra work for the cluster that runs the snippet tests.
I'd also hide the index creation/mapping setup since this reference is for the API. However, it's not a huge deal if you want to leave it in. I see benefits both ways. If we keep it, I'd index some data so users can test the API quickly.
Minor nit: We're trying to move away from a leading slash in our API/console examples. It's supposed to be easier if a user needs to copy/paste their own endpoint. However, that really, really minor. We still have examples of it everywhere.
Is there a reason we went with _source instead of fields? Not a huge deal, but I figured fields would be preferred.
| [source,console] | |
| ---- | |
| PUT my-vector-index | |
| { | |
| "mappings": { | |
| "properties": { | |
| "image_vector": { | |
| "type": "dense_vector", | |
| "dims": 3, | |
| "index": true, | |
| "similarity": "l2_norm" | |
| } | |
| } | |
| } | |
| } | |
| GET /my-vector-index/_knn_search | |
| { | |
| "knn": { | |
| "field": "image_vector", | |
| "query_vector": [0.3, 0.1, 1.2], | |
| "k": 10, | |
| "num_candidates": 100 | |
| }, | |
| "_source": ["name", "date"] | |
| } | |
| ---- | |
| // TEST[setup:my_index] | |
| //// | |
| [source,console] | |
| ---- | |
| PUT my-vector-index | |
| { | |
| "mappings": { | |
| "properties": { | |
| "image_vector": { | |
| "type": "dense_vector", | |
| "dims": 3, | |
| "index": true, | |
| "similarity": "l2_norm" | |
| } | |
| } | |
| } | |
| } | |
| ---- | |
| // TESTSETUP | |
| //// | |
| [source,console] | |
| ---- | |
| GET my-vector-index/_knn_search | |
| { | |
| "knn": { | |
| "field": "image_vector", | |
| "query_vector": [ 0.3, 0.1, 1.2 ], | |
| "k": 10, | |
| "num_candidates": 100 | |
| }, | |
| "_source": [ "name", "date" ] | |
| } | |
| ---- |
There was a problem hiding this comment.
Thanks for catching these issues! I ended up hiding the mapping set-up.
Is there a reason we went with _source instead of fields? Not a huge deal, but I figured fields would be preferred.
I initially had fields in some API examples, and one of the readers got very confused. Approaching the API for the first time, they thought fields referred to the vector fields for kNN. Unfortunately fields is not a widely-used parameter yet, so users would be learning two different concepts at the same time.
| `docvalue_fields`:: | ||
| (Optional, string) A comma-separated list of fields to return as the docvalue | ||
| representation of a field for each hit. | ||
| representation of a field for each hit. See <<docvalue-fields>>. |
There was a problem hiding this comment.
These look great. I particularly like the use of "field pattern."
| include::{es-repo-dir}/search/search.asciidoc[tag=fields-param-def] | ||
| include::{es-repo-dir}/search/search.asciidoc[tag=docvalue-fields-def] | ||
| include::{es-repo-dir}/search/search.asciidoc[tag=stored-fields-def] | ||
| include::{es-repo-dir}/search/search.asciidoc[tag=source-filtering-def] |
There was a problem hiding this comment.
We typically try to order parameters alphabetically, but I kinda like this sort.
There was a problem hiding this comment.
I can reorder these to be consistent. I was trying to emphasize that fields is what you usually want, so people don't accidentally start using docvalue_fields. But there's probably better ways to do this, like through examples and "how to" guides.
There was a problem hiding this comment.
Oh I just realized that if they were truly alphabetical, then fields and docvalue_fields would float above knn! I think that would really be unfortunate, as the knn section contains information that's important to understanding the API.
mayya-sharipova
left a comment
There was a problem hiding this comment.
@jtibshirani Thanks, a great addition! I've left a couple of small comments.
| * In <<query-dsl-script-score-query,`script_score`>> queries, to score | ||
| documents matching a filter | ||
| * In the <<knn-search, kNN search API>>, to find the _k_ most similar vectors | ||
| to a query vector |
There was a problem hiding this comment.
I don't usually put periods at the end of bulleted items unless they're complete sentences. It looks like most of our docs take this approach too (although we're not totally consistent).
|
|
||
| experimental::[] | ||
|
|
||
| Performs a k-nearest neighbor search and returns the matching documents. |
There was a problem hiding this comment.
May be a better way to phrase it would be something like: "returns top K documents as found by k-nearest search".
| }, | ||
| "_source": ["name", "date"] | ||
| } | ||
| ---- |
There was a problem hiding this comment.
// TEST[setup:my_index], adding to @jrodewig, also looks like the index name is different.
| (Required, string) The name of the vector field to search against. | ||
|
|
||
| `query_vector`:: | ||
| (Required, array of floats) The query vector. |
There was a problem hiding this comment.
May be we can add that query_vector must have the same dims as an indexed field it searches against. Although it is also kind of obvious, so not sure if it is worth to add.
| (Required, integer) The number of nearest neighbor candidates to consider per | ||
| shard. Increasing `num_candidates` tends to improve the accuracy of the final | ||
| `k` results. This value cannot exceed 10,000. |
There was a problem hiding this comment.
I agree with @jrodewig it would be nice to add more details here how this num_candidates works, for example: "{es} collects from each shard the top num_candidates results, and then merges collected from the shards results to get the top k results. Increasing num_candidates tends to improve the accuracy of the final k results..."
|
@jrodewig @mayya-sharipova thanks for the great comments. I tried to either respond or address them in the latest commits. I plan to merge later today unless you have more feedback. |
| to search. Supports wildcards (`*`). To search all data streams and indices, | ||
| use `*` or `_all`. | ||
|
|
||
| NOTE: The kNN search API does not support {ccs}. |
There was a problem hiding this comment.
Is it something that we forbid explicitly ? I didn't test but it should work transparently since we use the search action internally.
There was a problem hiding this comment.
It indeed should already work transparently. I wrote this because I didn't want to set a precedent that it supports CCS while we're still figuring out the execution strategy -- especially around how to combine kNN results with term-based results. However it's not something we forbid explicitly (given our strategy of translating at the REST layer, it's not very simple to do). What are your thoughts?
There was a problem hiding this comment.
IMO we're protected by the experimental status. CCS works out of the box and there's nothing that prevents us from removing the support later on. I don't see why we would though so I am not sure we need the NOTE at the moment.
There was a problem hiding this comment.
That reasoning makes sense to me, I'll remove the note to keep things simple.
This commit adds docs for the new `_knn_search` endpoint. It focuses on being an API reference and is light on details in terms of how exactly the kNN search works, and how the endpoint contrasts with `script_score` queries. We plan to add a high-level guide on kNN search that will explain this in depth. Relates to #78473.
This commit adds docs for the new
_knn_searchendpoint.It focuses on being an API reference and is light on details in terms of how
exactly the kNN search works, and how the endpoint contrasts with
script_scorequeries. We plan to add a high-level guide on kNN search thatwill explain this in depth.
Relates to #78473.