Skip to content

Add docs for kNN search endpoint#80378

Merged
jtibshirani merged 7 commits intoelastic:masterfrom
jtibshirani:knn-search-docs
Nov 9, 2021
Merged

Add docs for kNN search endpoint#80378
jtibshirani merged 7 commits intoelastic:masterfrom
jtibshirani:knn-search-docs

Conversation

@jtibshirani
Copy link
Copy Markdown
Contributor

@jtibshirani jtibshirani commented Nov 4, 2021

This commit adds docs for the new _knn_search endpoint.

It focuses on being an API reference and is light on details in terms of how
exactly the kNN search works, and how the endpoint contrasts with
script_score queries. We plan to add a high-level guide on kNN search that
will explain this in depth.

Relates to #78473.

@jtibshirani jtibshirani added >docs General docs changes :Search/Search Search-related issues that do not fall into other categories v8.0.0 v8.1.0 labels Nov 4, 2021
@jtibshirani
Copy link
Copy Markdown
Contributor Author

This is a "rough draft" and isn't ready for a full review. I opened it to get feedback on one point: there is a lot of overlap with the _search endpoint in terms of request and response sections. I started to go down the path of tagging several sections in search.asciidoc so I could include them in knn-search.asciidoc with no copying. Does this seem like a good direction? There are still a few left to tag (took, _shards, hits...)

@jtibshirani jtibshirani requested a review from jrodewig November 4, 2021 21:08
`docvalue_fields`::
(Optional, string) A comma-separated list of fields to return as the docvalue
representation of a field for each hit.
representation of a field for each hit. See <<docvalue-fields>>.
Copy link
Copy Markdown
Contributor Author

@jtibshirani jtibshirani Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these changes aren't closely related to kNN, but I saw the opportunity to improve the field retrieval docs a bit.

Copy link
Copy Markdown
Contributor

@jrodewig jrodewig Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look great. I particularly like the use of "field pattern."

@jtibshirani jtibshirani mentioned this pull request Nov 4, 2021
17 tasks
Copy link
Copy Markdown
Contributor

@jrodewig jrodewig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far! I'd take the same approach you have.

Tag + reuse is probably a good fit for the query + request body params, but I think it's overkill for the response. We should just link users to the existing search API response body docs and concisely point out any differences.

@jtibshirani jtibshirani marked this pull request as ready for review November 4, 2021 22:10
@elasticmachine elasticmachine added Team:Docs Meta label for docs team Team:Search Meta label for search team labels Nov 4, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Copy Markdown
Contributor

@jrodewig jrodewig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I left some comments and suggestions, but feel free to disregard as wanted. Nothing is blocking, aside from a syntax error in the stored_fields def. Thanks!


experimental::[]

Performs a k-nearest neighbor search and returns the matching documents.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd include the acronym whenever we spell out k-nearest neighbor. I also don't think we need returns matching documents, but it's fine if you want to keep it.

Suggested change
Performs a k-nearest neighbor search and returns the matching documents.
Performs a k-nearest neighbor (kNN) search.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a better way to phrase it would be something like: "returns top K documents as found by k-nearest search".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's okay with you, I'm going to go with "Performs a k-nearest neighbor (kNN) search and returns the matching documents." I think the wording is clearer and it helps clarify what the API does: kNN search just finds vectors, but the API actually returns documents.

Comment on lines +11 to +38
[source,console]
----
PUT my-vector-index
{
"mappings": {
"properties": {
"image_vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
}
}
}
}

GET /my-vector-index/_knn_search
{
"knn": {
"field": "image_vector",
"query_vector": [0.3, 0.1, 1.2],
"k": 10,
"num_candidates": 100
},
"_source": ["name", "date"]
}
----
// TEST[setup:my_index]
Copy link
Copy Markdown
Contributor

@jrodewig jrodewig Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The // TEST[setup:my_index] comment isn't doing anything. It just creates some extra work for the cluster that runs the snippet tests.

I'd also hide the index creation/mapping setup since this reference is for the API. However, it's not a huge deal if you want to leave it in. I see benefits both ways. If we keep it, I'd index some data so users can test the API quickly.

Minor nit: We're trying to move away from a leading slash in our API/console examples. It's supposed to be easier if a user needs to copy/paste their own endpoint. However, that really, really minor. We still have examples of it everywhere.

Is there a reason we went with _source instead of fields? Not a huge deal, but I figured fields would be preferred.

Suggested change
[source,console]
----
PUT my-vector-index
{
"mappings": {
"properties": {
"image_vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
}
}
}
}
GET /my-vector-index/_knn_search
{
"knn": {
"field": "image_vector",
"query_vector": [0.3, 0.1, 1.2],
"k": 10,
"num_candidates": 100
},
"_source": ["name", "date"]
}
----
// TEST[setup:my_index]
////
[source,console]
----
PUT my-vector-index
{
"mappings": {
"properties": {
"image_vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
}
}
}
}
----
// TESTSETUP
////
[source,console]
----
GET my-vector-index/_knn_search
{
"knn": {
"field": "image_vector",
"query_vector": [ 0.3, 0.1, 1.2 ],
"k": 10,
"num_candidates": 100
},
"_source": [ "name", "date" ]
}
----

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching these issues! I ended up hiding the mapping set-up.

Is there a reason we went with _source instead of fields? Not a huge deal, but I figured fields would be preferred.

I initially had fields in some API examples, and one of the readers got very confused. Approaching the API for the first time, they thought fields referred to the vector fields for kNN. Unfortunately fields is not a widely-used parameter yet, so users would be learning two different concepts at the same time.

`docvalue_fields`::
(Optional, string) A comma-separated list of fields to return as the docvalue
representation of a field for each hit.
representation of a field for each hit. See <<docvalue-fields>>.
Copy link
Copy Markdown
Contributor

@jrodewig jrodewig Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look great. I particularly like the use of "field pattern."

Comment on lines +111 to +114
include::{es-repo-dir}/search/search.asciidoc[tag=fields-param-def]
include::{es-repo-dir}/search/search.asciidoc[tag=docvalue-fields-def]
include::{es-repo-dir}/search/search.asciidoc[tag=stored-fields-def]
include::{es-repo-dir}/search/search.asciidoc[tag=source-filtering-def]
Copy link
Copy Markdown
Contributor

@jrodewig jrodewig Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically try to order parameters alphabetically, but I kinda like this sort.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reorder these to be consistent. I was trying to emphasize that fields is what you usually want, so people don't accidentally start using docvalue_fields. But there's probably better ways to do this, like through examples and "how to" guides.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just realized that if they were truly alphabetical, then fields and docvalue_fields would float above knn! I think that would really be unfortunate, as the knn section contains information that's important to understanding the API.

Copy link
Copy Markdown
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani Thanks, a great addition! I've left a couple of small comments.

* In <<query-dsl-script-score-query,`script_score`>> queries, to score
documents matching a filter
* In the <<knn-search, kNN search API>>, to find the _k_ most similar vectors
to a query vector
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. at the end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't usually put periods at the end of bulleted items unless they're complete sentences. It looks like most of our docs take this approach too (although we're not totally consistent).


experimental::[]

Performs a k-nearest neighbor search and returns the matching documents.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a better way to phrase it would be something like: "returns top K documents as found by k-nearest search".

},
"_source": ["name", "date"]
}
----
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TEST[setup:my_index], adding to @jrodewig, also looks like the index name is different.

(Required, string) The name of the vector field to search against.

`query_vector`::
(Required, array of floats) The query vector.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can add that query_vector must have the same dims as an indexed field it searches against. Although it is also kind of obvious, so not sure if it is worth to add.

Comment on lines +106 to +108
(Required, integer) The number of nearest neighbor candidates to consider per
shard. Increasing `num_candidates` tends to improve the accuracy of the final
`k` results. This value cannot exceed 10,000.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jrodewig it would be nice to add more details here how this num_candidates works, for example: "{es} collects from each shard the top num_candidates results, and then merges collected from the shards results to get the top k results. Increasing num_candidates tends to improve the accuracy of the final k results..."

@jtibshirani
Copy link
Copy Markdown
Contributor Author

jtibshirani commented Nov 8, 2021

@jrodewig @mayya-sharipova thanks for the great comments. I tried to either respond or address them in the latest commits. I plan to merge later today unless you have more feedback.

to search. Supports wildcards (`*`). To search all data streams and indices,
use `*` or `_all`.

NOTE: The kNN search API does not support {ccs}.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it something that we forbid explicitly ? I didn't test but it should work transparently since we use the search action internally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indeed should already work transparently. I wrote this because I didn't want to set a precedent that it supports CCS while we're still figuring out the execution strategy -- especially around how to combine kNN results with term-based results. However it's not something we forbid explicitly (given our strategy of translating at the REST layer, it's not very simple to do). What are your thoughts?

Copy link
Copy Markdown
Contributor

@jimczi jimczi Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we're protected by the experimental status. CCS works out of the box and there's nothing that prevents us from removing the support later on. I don't see why we would though so I am not sure we need the NOTE at the moment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reasoning makes sense to me, I'll remove the note to keep things simple.

@jtibshirani jtibshirani merged commit 8ca693b into elastic:master Nov 9, 2021
@jtibshirani jtibshirani deleted the knn-search-docs branch November 9, 2021 17:28
jtibshirani added a commit that referenced this pull request Nov 9, 2021
This commit adds docs for the new `_knn_search` endpoint.

It focuses on being an API reference and is light on details in terms of how
exactly the kNN search works, and how the endpoint contrasts with
`script_score` queries. We plan to add a high-level guide on kNN search that
will explain this in depth.

Relates to #78473.
@jtibshirani jtibshirani added :Search Relevance/Vectors Vector search and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>docs General docs changes :Search Relevance/Vectors Vector search Team:Docs Meta label for docs team Team:Search Meta label for search team v8.0.0-rc1 v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants