Documentation for semantic_text auto pre-filtering #139749
dimitris-athanasiou merged 6 commits into elastic:main
Adds documentation for automatic pre-filtering that was introduced in elastic#138989.
Pinging @elastic/search-relevance (Team:Search - Relevance)
Pinging @elastic/core-docs (Team:Docs)
| % TEST[skip:Requires {{infer}} endpoint]
|
| The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
I'm assuming that readers understand that a `match` query against a dense `semantic_text` field runs a kNN search under the hood. I wonder whether we need to add more explanation here.
I think so. Maybe we don't need to get technical: we can just say that, when needed, the filter is applied as a pre-filter so that the expected number of results is still returned.
| * `semantic_text` fields do not support [Cross-Cluster Search (CCS)](docs-content://explore-analyze/cross-cluster-search.md) in [ES|QL](/reference/query-languages/esql.md).
| * `semantic_text` fields do not support [Cross-Cluster Replication (CCR)](docs-content://deploy-manage/tools/cross-cluster-replication.md).
| * automatic pre-filtering in Query DSL does not apply on [Nested queries](/reference/query-languages/query-dsl/query-dsl-nested-query.md). Such queries will be applied as post-filters.
| * automatic pre-filtering in ES|QL does not apply on filters that are not translatable to Lucene. Such filters will be applied as post-filters.
"Not translatable to Lucene" is tricky. @carlosdelest, any suggestions on how to phrase this better?
Tricky indeed. I think we could add something like:
- * automatic pre-filtering in ES|QL does not apply on filters that are not translatable to Lucene. Such filters will be applied as post-filters.
+ * automatic pre-filtering in ES|QL does not apply on filters that use functions (like `WHERE TO_LOWER(my_field) == 'a'`). These filters will be applied as post-filters.
However, this is something we need to do. I've opened #139754 to track this.
carlosdelest left a comment
Looks good, thanks for documenting this!
I've added some suggestions, feel free to reject them
| Querying `semantic_text` fields that have dense vector embeddings automatically applies
| filters found in the Query DSL tree or [ES|QL](/reference/query-languages/esql.md) query
| as pre-filters in order to ensure the requested number of results is returned.
Let's add a link to prefiltering (https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-filtering)
| The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
| `dense_semantic_text_field`, only documents that matched the `term` query will be searched.
|
| If the `term` query was applied as a post-filter, which is the default behavior for such filters,
| the *knn* search would execute against all documents, and then the `term` query would filter out
| documents that did not match. This could mean that fewer than 10 documents are returned if there
| are more relevant documents that are not green.
Maybe something similar to:
- The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
- `dense_semantic_text_field`, only documents that matched the `term` query will be searched.
- If the `term` query was applied as a post-filter, which is the default behavior for such filters,
- the *knn* search would execute against all documents, and then the `term` query would filter out
- documents that did not match. This could mean that fewer than 10 documents are returned if there
- are more relevant documents that are not green.
+ In case the semantic_text uses dense_vector field embeddings, then the corresponding *knn* search executed on it will apply the term query as a pre-filter.
+ This allows to retrieve as many results as specified by the query.
+ If the `term` query was applied as a post-filter, which is the default behavior for such filters, the *knn* search would execute against all documents, and then the `term` query would filter out documents that did not match.
+ This could mean that fewer than 10 documents are returned if there are more relevant documents that are not green.
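To make the pre-filter versus post-filter difference discussed in this thread concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch; it uses a brute-force nearest-neighbour search over a toy corpus. The `color` field, the two-dimensional vectors, and all function names are assumptions made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(docs, query_vec, k):
    """Brute-force k nearest neighbours, most similar first."""
    ranked = sorted(docs, key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return ranked[:k]


def search_post_filter(docs, query_vec, k, predicate):
    """Run kNN over all documents, then drop the ones that fail the filter."""
    return [d for d in knn(docs, query_vec, k) if predicate(d)]


def search_pre_filter(docs, query_vec, k, predicate):
    """Restrict the candidate set first, then run kNN on what is left."""
    return knn([d for d in docs if predicate(d)], query_vec, k)


def is_green(doc):
    # Hypothetical stand-in for a `term` filter on a color field.
    return doc["color"] == "green"


# Toy corpus: the two documents closest to the query are not green.
docs = [
    {"id": 1, "color": "red", "vec": [1.0, 0.0]},
    {"id": 2, "color": "red", "vec": [0.9, 0.1]},
    {"id": 3, "color": "green", "vec": [0.7, 0.3]},
    {"id": 4, "color": "green", "vec": [0.5, 0.5]},
    {"id": 5, "color": "green", "vec": [0.0, 1.0]},
]
query_vec = [1.0, 0.0]

post_ids = [d["id"] for d in search_post_filter(docs, query_vec, 3, is_green)]
pre_ids = [d["id"] for d in search_pre_filter(docs, query_vec, 3, is_green)]

print(post_ids)  # [3]        fewer than k=3 results survive the post-filter
print(pre_ids)   # [3, 4, 5]  the pre-filter still returns k=3 results
```

With `k=3`, post-filtering returns a single document because two of the three nearest neighbours are red, while pre-filtering returns three green documents. This is the "fewer than 10 documents" effect described above, scaled down.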
| ::::{note}
| The queries in Query DSL that are used as pre-filters to `semantic_text` queries are all `must`,
| `filter`, and `must_not` queries that are within parent `bool` queries.
- `filter`, and `must_not` queries that are within parent `bool` queries.
+ `filter`, and `must_not` queries that are included in the parent `bool` queries.
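One way to read the note under discussion is sketched below in Python: walk a Query DSL tree, find the `match` query on the `semantic_text` field, and collect the sibling `must`, `filter`, and `must_not` clauses of every ancestor `bool` query. This is only an illustration of that reading, not Elasticsearch's implementation; `collect_prefilters` and its helpers are hypothetical names, and the `color`, `size`, and `status` fields are invented for the example.

```python
PREFILTER_CLAUSES = ("must", "filter", "must_not")
ALL_CLAUSES = ("must", "filter", "should", "must_not")


def as_list(clauses):
    """Query DSL accepts either a single clause object or a list of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def targets_field(query, field):
    """True if `query` is a `match` query on `field`."""
    return field in query.get("match", {})


def collect_prefilters(query, field, inherited=()):
    """Return (occurrence, clause) pairs taken from the ancestor `bool`
    queries of the `match` query on `field`, or None if it is not found."""
    if targets_field(query, field):
        return list(inherited)
    bool_query = query.get("bool")
    if bool_query is None:
        return None
    for occur in ALL_CLAUSES:
        for i, clause in enumerate(as_list(bool_query.get(occur))):
            # Sibling clauses at this level, excluding the branch we descend into.
            # `should` siblings are deliberately left out.
            siblings = [
                (name, other)
                for name in PREFILTER_CLAUSES
                for j, other in enumerate(as_list(bool_query.get(name)))
                if not (name == occur and j == i)
            ]
            found = collect_prefilters(clause, field, tuple(inherited) + tuple(siblings))
            if found is not None:
                return found
    return None


query = {
    "bool": {
        "must": {"match": {"dense_semantic_text_field": "some query text"}},
        "filter": {"term": {"color": "green"}},
        "should": {"term": {"size": "large"}},
        "must_not": [{"term": {"status": "archived"}}],
    }
}

prefilters = collect_prefilters(query, "dense_semantic_text_field")
print(prefilters)
```

In this sketch the `filter` and `must_not` clauses are collected, while the `should` clause is not, matching the list of clause types named in the note.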
@carlosdelest I have updated the PR taking into consideration your suggestions. |
carlosdelest left a comment
LGTM from my side! 💯
It would be great to have some docs review on this one 👍
Changes:
- **Title**: "Pre-filtering for dense vector queries"
- **Rewrote opening**: One clear sentence about automatic pre-filtering with link
- **Unified example intro**: Sets up both Query DSL and ES|QL examples
- **Created subsections**: "Query DSL example" and "ES|QL example"
- **Integrated note into prose**: `must`, `filter`, `must_not` explanation now in main text
- **Moved kNN caveat**: Now an "important" block after Query DSL example
- **Added MATCH footnote**: Explains automatic kNN behavior