Documentation for semantic_text auto pre-filtering#139749

Merged
dimitris-athanasiou merged 6 commits into elastic:main from dimitris-athanasiou:docs-auto-prefiltering
Dec 19, 2025

@dimitris-athanasiou
Contributor

Adds documentation for automatic pre-filtering that was introduced in #138989.

@dimitris-athanasiou dimitris-athanasiou added >docs General docs changes :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v9.3.0 labels Dec 18, 2025
@elasticsearchmachine elasticsearchmachine added v9.4.0 Team:Docs Meta label for docs team Team:Search - Relevance The Search organization Search Relevance team labels Dec 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Collaborator

Pinging @elastic/core-docs (Team:Docs)

% TEST[skip:Requires {{infer}} endpoint]


The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
Contributor Author

I'm making the assumption that people reading this understand that a match query against a dense semantic_text field is doing knn search under the hood. I wonder if we need to add more explanation here or not.

Member

I think so. Maybe we don't need to get technical - we can just say that in case it's needed, it will be applied as a pre-filter so we keep the expected number of results back.

* `semantic_text` fields do not support [Cross-Cluster Search (CCS)](docs-content://explore-analyze/cross-cluster-search.md) in [ES|QL](/reference/query-languages/esql.md).
* `semantic_text` fields do not support [Cross-Cluster Replication (CCR)](docs-content://deploy-manage/tools/cross-cluster-replication.md).
* automatic pre-filtering in Query DSL does not apply on [Nested queries](/reference/query-languages/query-dsl/query-dsl-nested-query.md). Such queries will be applied as post-filters.
* automatic pre-filtering in ES|QL does not apply on filters that are not translatable to Lucene. Such filters will be applied as post-filters.
Contributor Author

"not translatable to Lucene": this is tricky. @carlosdelest Any suggestions on how to phrase this better?

Member

Tricky indeed. I think we could add something like:

Suggested change
* automatic pre-filtering in ES|QL does not apply on filters that are not translatable to Lucene. Such filters will be applied as post-filters.
* automatic pre-filtering in ES|QL does not apply on filters that use functions (like `WHERE TO_LOWER(my_field) == 'a'`). These filters will be applied as post-filters.

However, this is something we need to do. I've opened #139754 to track this.

@dimitris-athanasiou dimitris-athanasiou changed the title Adds documentation for semantic_text auto pre-filtering Documentation for semantic_text auto pre-filtering Dec 18, 2025
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

Member

@carlosdelest carlosdelest left a comment

Looks good, thanks for documenting this!

I've added some suggestions, feel free to reject them


Querying `semantic_text` fields that have dense vector embeddings automatically applies
filters found in the Query DSL tree or [ES|QL](/reference/query-languages/esql.md) query
as pre-filters in order to ensure the requested number of results is returned.
Member

Comment on lines +204 to +210
The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
`dense_semantic_text_field`, only documents that matched the `term` query will be searched.

If the `term` query was applied as a post-filter, which is the default behavior for such filters,
the *knn* search would execute against all documents, and then the `term` query would filter out
documents that did not match. This could mean that fewer than 10 documents are returned if there
are more relevant documents that are not green.
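
For illustration, the difference between the two filter modes described above can be sketched in plain Python. This is a toy model, not Elasticsearch code: it uses brute-force cosine similarity rather than approximate kNN, and the document fields (`id`, `color`, `vec`) and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(docs, query_vec, k):
    """Brute-force top-k by cosine similarity (stand-in for a real kNN search)."""
    ranked = sorted(docs, key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return ranked[:k]


def search_post_filter(docs, query_vec, k, color):
    """Run kNN over all documents, then drop the ones that fail the filter."""
    return [d for d in knn(docs, query_vec, k) if d["color"] == color]


def search_pre_filter(docs, query_vec, k, color):
    """Apply the filter first, then run kNN over the matching documents only."""
    candidates = [d for d in docs if d["color"] == color]
    return knn(candidates, query_vec, k)


# Hypothetical data: the two closest documents are not green.
docs = [
    {"id": 1, "color": "red", "vec": [1.0, 0.0]},
    {"id": 2, "color": "red", "vec": [0.9, 0.1]},
    {"id": 3, "color": "green", "vec": [0.8, 0.2]},
    {"id": 4, "color": "green", "vec": [0.5, 0.5]},
    {"id": 5, "color": "green", "vec": [0.0, 1.0]},
]
query_vec = [1.0, 0.0]

post_ids = [d["id"] for d in search_post_filter(docs, query_vec, k=3, color="green")]
pre_ids = [d["id"] for d in search_pre_filter(docs, query_vec, k=3, color="green")]
print("post-filter:", post_ids)  # fewer than k results
print("pre-filter: ", pre_ids)   # k results, all green
```

With post-filtering, the non-green documents take up slots in the top 3 and are then discarded, so fewer than 3 results come back. With pre-filtering, all 3 slots go to green documents.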
Member

Maybe something similar to:

Suggested change
The `term` query will be applied as a pre-filter, meaning that when the *knn* search executes on
`dense_semantic_text_field`, only documents that matched the `term` query will be searched.
If the `term` query was applied as a post-filter, which is the default behavior for such filters,
the *knn* search would execute against all documents, and then the `term` query would filter out
documents that did not match. This could mean that fewer than 10 documents are returned if there
are more relevant documents that are not green.
In case the semantic_text uses dense_vector field embeddings, then the corresponding *knn* search executed on it will apply the term query as a pre-filter.
This allows to retrieve as many results as specified by the query.
If the `term` query was applied as a post-filter, which is the default behavior for such filters, the *knn* search would execute against all documents, and then the `term` query would filter out documents that did not match.
This could mean that fewer than 10 documents are returned if there are more relevant documents that are not green.


::::{note}
The queries in Query DSL that are used as pre-filters to `semantic_text` queries are all `must`,
`filter`, and `must_not` queries that are within parent `bool` queries.
Member

Suggested change
`filter`, and `must_not` queries that are within parent `bool` queries.
`filter`, and `must_not` queries that are included in the parent `bool` queries.
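
As a rough illustration of the note above, the following Python sketch walks a Query DSL body and collects the `must`, `filter`, and `must_not` clauses of the parent `bool` queries around a `match` query on a `semantic_text` field. It is one reading of the note, not the Elasticsearch implementation. The helper `collect_prefilters` and the fields `color`, `status`, and `featured` are hypothetical. The sketch also assumes that `should` clauses are not used as pre-filters, since the note does not list them.

```python
PREFILTER_CLAUSES = ("must", "filter", "must_not")
ALL_CLAUSES = ("must", "should", "filter", "must_not")


def as_list(value):
    """Query DSL accepts a single query or a list of queries per clause."""
    return value if isinstance(value, list) else [value]


def collect_prefilters(query, field):
    """Return (clause, query) pairs from parent bool queries of the match
    query on `field`, or None if no such match query exists under `query`."""
    if field in query.get("match", {}):
        return []  # found the semantic query itself; nothing to add here
    bool_body = query.get("bool")
    if bool_body is None:
        return None
    for clause in ALL_CLAUSES:
        for child in as_list(bool_body.get(clause, [])):
            inner = collect_prefilters(child, field)
            if inner is None:
                continue
            # Siblings of the branch that holds the semantic query.
            siblings = [
                (c, q)
                for c in PREFILTER_CLAUSES
                for q in as_list(bool_body.get(c, []))
                if q is not child
            ]
            return siblings + inner
    return None


# Hypothetical request body; only `dense_semantic_text_field` comes from the docs.
query = {
    "bool": {
        "must": [{"match": {"dense_semantic_text_field": "hot weather"}}],
        "filter": [{"term": {"color": "green"}}],
        "must_not": [{"term": {"status": "archived"}}],
        "should": [{"term": {"featured": True}}],
    }
}

prefilters = collect_prefilters(query, "dense_semantic_text_field")
for clause, q in prefilters:
    print(clause, q)
```

Here the `term` filter on `color` and the `must_not` clause on `status` are collected, while the `should` clause is left out.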

@dimitris-athanasiou
Contributor Author

@carlosdelest I have updated the PR taking into consideration your suggestions.

Member

@carlosdelest carlosdelest left a comment

LGTM from my side! 💯

It would be great to have some docs review on this one 👍

Changes:

- **Title**: "Pre-filtering for dense vector queries"
- **Rewrote opening**: One clear sentence about automatic pre-filtering with link
- **Unified example intro**: Sets up both Query DSL and ES|QL examples
- **Created subsections**: "Query DSL example" and "ES|QL example"
- **Integrated note into prose**: `must`, `filter`, `must_not` explanation now in main text
- **Moved kNN caveat**: Now an "important" block after Query DSL example
- **Added MATCH footnote**: Explains automatic kNN behavior
@dimitris-athanasiou dimitris-athanasiou merged commit 5d9ed4d into elastic:main Dec 19, 2025
12 checks passed
@dimitris-athanasiou dimitris-athanasiou deleted the docs-auto-prefiltering branch December 19, 2025 12:09
szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 19, 2025
* upstream/main: (25 commits)
  Add spec for project routing CRUD REST API endpoints (elastic#139634)
  Implement AllSupportedFIeldsTestCase for TDigest (elastic#139744)
  Mute elastic#139802 (elastic#139803)
  fix(logsdb): batch bulk indexing to prevent OOM in challenge tests (elastic#139770)
  Documentation for semantic_text auto pre-filtering (elastic#139749)
  Always do bulk scoring for rescoring when possible (elastic#139777)
  Optimize script sorts that do not require query scores (elastic#139748)
  Bump versions after 9.1.9 release
  Update branches.json for 9.1.9 release
  Bump versions after 9.2.3 release
  Prune changelogs after 8.19.9 release
  Bump versions after 8.19.9 release
  Update branches.json for 8.19.9 release
  Finalize docs for v9.2.3 release (elastic#139795)
  ESQL: Added timezone support to date_format and date_parse (elastic#138517)
  Update branches.json for 9.2.3 release
  Finalize docs for v9.1.9 release (elastic#139796)
  Switch inline stats to GA in docs (elastic#139753)
  Validate license in CPS (elastic#139105)
  FIPS 140-3 support with BC FIPS 2.0.x (elastic#139319)
  ...
Labels

>docs General docs changes :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Docs Meta label for docs team Team:Search - Relevance The Search organization Search Relevance team v9.4.0

4 participants