Match query support for semantic_text fields - POC DO NOT MERGE#112166

Closed
Mikep86 wants to merge 6 commits into elastic:main from Mikep86:semantic-text_match-query-support

Conversation

Contributor

@Mikep86 Mikep86 commented Aug 23, 2024

POC implementation of match query support for semantic_text fields. The technical goal of this POC is to update the match query rewrite logic so that the query is rewritten to a semantic query when we detect that it targets a semantic_text field.

There were two big challenges implementing this POC:

  • Rewriting to a semantic query must happen on the coordinating node, where mappings are not available
  • match query code lives in server, while semantic query code lives in the inference plugin

I solved these challenges in the POC by adding the name of the query that should be used to query the inference field to InferenceFieldMetadata. Since this is stored in index metadata, it is available on the coordinating node. I also added QueryBuilderService, which stores query builder creation functions for query builders that can be created using a field name and query string. QueryBuilderService is then populated using the getQueryBuilders plugin hook. I added QueryBuilderService to QueryRewriteContext, making it available to the match query during rewrite on the coordinating node. The query name from InferenceFieldMetadata is then used to create the proper query builder using QueryBuilderService.
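
The registry idea can be sketched roughly as follows. The names QueryBuilderService and getQueryBuilders come from the PR, but the signatures here are simplified assumptions, not the actual Elasticsearch interfaces; String stands in for QueryBuilder to keep the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Simplified sketch of the registry idea. The real QueryBuilderService is
// populated via a getQueryBuilders plugin hook and produces QueryBuilder
// instances; here String stands in for QueryBuilder.
class QueryBuilderService {
    private final Map<String, BiFunction<String, String, String>> factories = new HashMap<>();

    // Each plugin registers a factory for query builders that can be
    // created from just a field name and a query string.
    void register(String queryName, BiFunction<String, String, String> factory) {
        if (factories.putIfAbsent(queryName, factory) != null) {
            throw new IllegalArgumentException("Duplicate query name: " + queryName);
        }
    }

    // Invoked during rewrite on the coordinating node, using the query name
    // stored in InferenceFieldMetadata (which, unlike mappings, is available there).
    String build(String queryName, String fieldName, String queryText) {
        BiFunction<String, String, String> factory = factories.get(queryName);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown query name: " + queryName);
        }
        return factory.apply(fieldName, queryText);
    }
}
```

The key property is that lookup needs only data stored in index metadata, so the match query can resolve the right builder without ever touching mappings.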

Because of scoring incompatibility issues, the match query can be used to query semantic_text fields only when that is the only field type being queried. That is to say, you can't query a text field and a semantic_text field using the same match query.

Putting it all together, it means you can now do things like this:

PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT my-index-1
{
    "mappings": {
        "properties": {
            "inference_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            },
            "text_field": {
                "type": "text"
            }
        }
    }
}

PUT my-index-2
{
    "mappings": {
        "properties": {
            "inference_field": {
                "type": "text"
            },
            "text_field": {
                "type": "text"
            }
        }
    }
}

POST my-index-1/_doc/1
{
  "inference_field": "a test value",
  "text_field": "a test value"
}

POST my-index-2/_doc/1
{
  "inference_field": "a test value",
  "text_field": "a test value"
}

GET my-index-1/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

GET my-index-1/_search
{
  "query": {
    "match": {
      "text_field": "test"
    }
  }
}

// Fails due to combination of text & semantic_text fields
GET my-index-*/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

Contributor Author

We can probably rewrite this builder as a static build method in QueryBuilderService

if (inferenceFieldMetadata != null) {
    if (inferenceFieldQueryName != null
        && inferenceFieldMetadata.getQueryName().equals(inferenceFieldQueryName) == false) {
        throw new IllegalArgumentException("Detected incompatible inference field queries");

Contributor Author

These error messages leave much to be desired. Will iterate on them to make them more informative.

Member

@kderusso kderusso left a comment

I like this approach - it's clean overall.

Some early feedback:

  • I wonder if we should err on the side of being more permissive, rather than throwing an error for a lot of these queries. Maybe if we're trying to match on multiple query types it's OK, wrap in a dis_max query or use RRF or semantic reranking as a recommendation?

public final class InferenceFieldMetadata implements SimpleDiffable<InferenceFieldMetadata>, ToXContentFragment {
    private static final String INFERENCE_ID_FIELD = "inference_id";
    private static final String SOURCE_FIELDS_FIELD = "source_fields";
    private static final String QUERY_NAME_FIELD = "query_name";

Member

Nitpicky note: name overloaded here, because it refers to the keyed name of the query being run and also named queries. It would be good to differentiate these two concepts if possible.

}

public QueryBuilderService build() {
    Objects.requireNonNull(pluginsService);

Member

This validation should be in the constructor?

Contributor Author

I think we can remove the builder entirely actually and just have a static build method in QueryBuilderService instead

}

if (foundNonInferenceField && inferenceFieldQueryName != null) {
    throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time");

Member

Would it be better to dis max this and just return everything, knowing there will be better scores for semantic_text fields, than straight up erroring?

Contributor Author

The issue is actually the opposite: When querying a semantic_text field that uses a dense vector model, all the scores will be in the [0.0-1.0] range. BM25 scores are essentially guaranteed to be greater than this, so all the semantic_text results would be at the end of the result set.
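
Back-of-the-envelope numbers make the mismatch concrete. This is illustrative only, not Elasticsearch's actual scoring code, and the method names here are hypothetical: a cosine similarity normalized into [0, 1] versus a one-term BM25 score computed with Lucene's default k1 and b parameters.

```java
// Illustrative only: shows why naively mixing dense-vector scores with BM25
// scores in one result set pushes the dense results to the bottom.
class ScoreRanges {
    // Cosine similarity mapped from [-1, 1] into [0, 1]; 1.0 is a perfect match.
    static double denseScore(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (1 + dot / (Math.sqrt(normA) * Math.sqrt(normB))) / 2;
    }

    // One-term BM25 score; unbounded above as idf and tf grow.
    static double bm25(double tf, double idf, double docLen, double avgDocLen) {
        double k1 = 1.2, b = 0.75; // Lucene defaults
        return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }
}
```

Even a perfect dense-vector match caps at 1.0, while an unremarkable BM25 hit (tf = 3, idf = 6, average-length doc) lands around 9.4, so a naive dis_max over both would rank every text-field hit ahead of every semantic hit.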

Member

Ah, interesting, I ran through your script and

GET my-index-1/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

returned a score of 12.725867, while

GET my-index-1/_search
{
  "query": {
    "match": {
      "text_field": "test"
    }
  }
}

returned a score of 0.2876821.

Member

Ah, nevermind, you noted dense vector, not sparse vector - that makes sense. However, I think either using dis_max or just allowing it is potentially better than failing fast in this case.

Contributor Author

That's just one example though. We know that BM25 scores are unbounded and combining them naively with scores in a bounded range is going to result in bad relevance. IMO we shouldn't do something we know will result in bad relevance.

Contributor

I am strongly in favour of NOT throwing an error and using the default scores from the different types of queries. Currently it works this way: for example, if a user runs a match query on a numeric field in one index (which always produces scores of 1) and on a text field in another index, docs with their scores are combined without errors.

Member

I agree with @mayya-sharipova. We shouldn't fail here. It's a horrible experience. While it's weird to have things scored in different distributions, this technically already occurs with BM25 over two indices with vastly different term statistics.

Contributor Author

Good to see a consensus forming here. Given that the best dense vector scores will almost always be less than the worst BM25 scores, I think we should at least alert the user to the scoring mismatch in a more proactive way than just documentation. What does everyone think about a warning header in this scenario?

Member

I think a warning header might be overkill here.

Member

Agreed, no warning headers, no logging, no nothing. We just need to ensure that explain works.


    inferenceFieldQueryName = inferenceFieldMetadata.getQueryName();
} else {
    foundNonInferenceField = true;

Member

Can probably break out of the loop here once we get a single true value.

Contributor Author

Mikep86 commented Aug 23, 2024

@kderusso

Maybe if we're trying to match on multiple query types it's OK, wrap in a dis_max query or use RRF or semantic reranking as a recommendation?

Wrapping in a dis_max query doesn't work due to the score range differences described in this comment. We could update the error messages to suggest RRF or reranking to combine the result sets, which at least gives the user a thread to pull toward a solution.
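
Reciprocal rank fusion sidesteps the score-range problem entirely because it only consumes ranks, never raw scores. A minimal sketch (the class name is hypothetical, and k = 60 is the commonly used RRF default, not something this PR defines):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// RRF: score(doc) = sum over result lists of 1 / (k + rank), rank starting
// at 1. Rank-based, so the unbounded BM25 result set and the bounded
// dense-vector result set never have to share a score scale.
class RrfFusion {
    static Map<String, Double> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                // i is zero-based, so the rank contribution is 1 / (k + i + 1)
                fused.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return fused;
    }
}
```

A document that appears in both the lexical and the semantic list accumulates two contributions and outscores documents found by only one retriever, which is the behavior users usually want from hybrid search.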

@kderusso
Member

Wrapping in a dis_max query doesn't work due to score range differences described in #112166 (comment). We could update the error messages to suggest RRF or reranking to combine the result sets, that gives the user a thread for a solution at least.

Right - I get that, I think we should document RRF or reranking as a recommended solution but isn't returning matching results better than an error here?

Contributor Author

Mikep86 commented Aug 23, 2024

Right - I get that, I think we should document RRF or reranking as a recommended solution but isn't returning matching results better than an error here?

Not necessarily. The error at least tells the user what they are doing wrong. If we return results, how are they to know that anything is wrong? All they will know is that they get bad results out of the box, giving them a bad impression of Elasticsearch.

Member

@carlosdelest carlosdelest left a comment

This looks good - rewriting to a semantic query in the coordinator makes sense.

I have some thoughts for simplifying this, at least initially. LMKWYT!

protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) throws IOException {
    QueryBuilder rewritten = super.doRewrite(queryRewriteContext);

    if (rewritten == this && queryRewriteContext.getClass() == QueryRewriteContext.class) {

Member

I wonder if we can avoid the .getClass() comparison by doing something similar to the convertToXXXRewriteContext pattern that is used in other parts of the code, so the convert method returns null if it's not the expected QueryRewriteContext class 🤔 .

We're probably missing a rewrite phase for the coordinator - the current coordinator phase is really the can_match phase IIUC

Contributor Author

I think we can refactor to avoid the getClass() comparison. This was the quick & dirty approach.

And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.

Member

And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.

Understood - my comment referred to the current CoordinatorRewriteContext, which is really what runs on the can_match phase instead of actually being what is used in the coordinator.

}

if (foundNonInferenceField && inferenceFieldQueryName != null) {
    throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time");

Member

I understand your reasoning here @Mikep86. As there is no way to boost fields for specific indices in a match query, you prefer to error out instead of returning a surprising relevance result.

I think returning these results may be valuable though - there will be use cases where users will not be sorting by score. Users could also do some rescoring afterwards to adjust the scores via script_score.

How about adding some docs for match support on the semantic_text field type that address this concern, rather than limiting the user beforehand?
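
One shape such an opt-in rescore could take is a per-clause min-max normalization before combining. This is purely illustrative, not an API the PR adds, and the class name is hypothetical:

```java
import java.util.Arrays;

// Min-max normalize one clause's scores into [0, 1] so unbounded BM25
// scores and bounded dense-vector scores become comparable before they
// are combined.
class ScoreNormalizer {
    static double[] minMaxNormalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        if (max == min) {
            Arrays.fill(out, 1.0); // degenerate case: all scores equal
            return out;
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = (scores[i] - min) / (max - min);
        }
        return out;
    }
}
```

The caveat with any min-max scheme is that it is sensitive to outliers in the top-k window, which is part of why rank-based fusion like RRF is often preferred.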

}

if (inferenceFieldQueryName != null) {
    rewritten = queryRewriteContext.getQueryBuilderService()

Member

IIUC, you want to be prepared for potential future InferenceFieldMapper field types, and that's why you created the QueryBuilderService to help choosing the right QueryBuilder implementation.

My thoughts are that this is pretty complicated as of now - we only have a single InferenceFieldMapper which is semantic_text. Even if we have other InferenceFieldMapper subclasses, they would probably fit into using the semantic query.

Should we simplify this solution by assuming SemanticQueryBuilder is going to be used for any InferenceFieldMapper? We can always revisit this decision in the future when / if more inference field types are used.

Contributor Author

That's part of the motivation, but I also added QueryBuilderService in an attempt to add something that can help with use cases outside of semantic search. There is nothing specific to semantic search in the QueryBuilderService implementation; it can be used to make a query implemented in any plugin accessible in server or any other plugin without major refactoring.

I agree we can probably simplify this POC implementation. You are correct that I wrote this with the idea in mind that someday we will have more than one inference field and inference query type. Before simplifying though, I think we should go through a thought experiment of how we will be forward-compatible if/when we add more inference field types or inference query types.

Member

how we will be forward-compatible if/when we add more inference field types or inference query types.

IIUC, semantic query will be the one that checks the field type. If it does not support semantic query, it will fail.

We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.

We can't anticipate right now any refactoring that will need to use this, and we will be in a better position to review the solution once we actually need it.

I think it's better to have a simplified version for this, as we may still change this implementation if we need to support other queries in semantic_text.

Contributor Author

We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.

That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.

Let's discuss offline about how we can simplify. I agree there is room to do so, but it may not be as straightforward as it appears.

Member

That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.

Yep - my comment is not about detecting it on the coordinator node (which is handled by the InferenceFieldMetadata), but checking during the shard rewriting that it's an InferenceFieldMapper, and just invoking semanticQuery() on it without having to actually check the specific implementation.

Let's discuss offline about how we can simplify.

Happy to do so - I'm not opposed to this implementation, just trying to understand if we can keep the change smaller until (if) we have the need 👍

Contributor

mayya-sharipova commented Sep 3, 2024

@Mikep86 I have not studied the PR deeply, so forgive my naive questions:

  1. Why do we need to do the rewrite to a semantic query on the coordinating node? Why can't this rewrite be done when we are already on a shard and have all the mappings? Is it because we want to perform only a single inference call per request?
  2. Why do we need a queryName parameter for InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName)? Could there ever be a value other than semantic?

private final String queryName;

public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields) {
public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName) {

Member

Why are we adding this new metadata at all?

Contributor Author

For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.

Member

For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.

OK, this doesn't seem necessary and was strange to me. It would only be required for the edge case of a mixed-cluster environment where we try to allow "new code" to be used on an "old node". This overcomplicates things and we shouldn't do it.

Contributor

I agree, it seems overkill for the current need.

Contributor Author

Mikep86 commented Sep 3, 2024

@mayya-sharipova

why do we need to do a rewrite to a semantic query on a coordinating node? Why this rewrite can't it be done when we are already on a shard and have all mappings? Is it because we want to make only a single inference per request?

We must rewrite on the coordinating node because the semantic query performs inference on the coordinating node. If we were to wait and rewrite on the shards, we would have to perform inference N times (where N is the number of data nodes with shards for the index queried).

Why do we need a queryName parameter for InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName)? Could there be any different value from semantic?

Currently, no. This POC is written to be forward-compatible with any inference queries we may add in the future. @carlosdelest and I discussed (at a high level) how we could simplify in this comment thread. We should continue that discussion offline.

Contributor Author

Mikep86 commented Oct 21, 2024

Closing this PR as it has served its purpose of collecting feedback on the POC.
