Match query support for semantic_text fields - POC DO NOT MERGE#112166

Closed
Mikep86 wants to merge 6 commits into elastic:main from Mikep86:semantic-text_match-query-support

Conversation

Contributor

@Mikep86 Mikep86 commented Aug 23, 2024

POC implementation of match query support for semantic_text fields. The technical goal of this POC is to update the match query rewrite logic so that the query is rewritten to a semantic query when we detect that it targets a semantic_text field.

There were two big challenges implementing this POC:

  • Rewriting to a semantic query must happen on the coordinating node, where mappings are not available
  • match query code lives in server, while semantic query code lives in the inference plugin

I solved these challenges in the POC by adding the name of the query that should be used to query the inference field to InferenceFieldMetadata. Since this is stored in index metadata, it is available on the coordinating node. I also added QueryBuilderService, which stores query builder creation functions for query builders that can be created using a field name and query string. QueryBuilderService is then populated using the getQueryBuilders plugin hook. I added QueryBuilderService to QueryRewriteContext, making it available to the match query during rewrite on the coordinating node. The query name from InferenceFieldMetadata is then used to create the proper query builder using QueryBuilderService.
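
The registry idea can be sketched roughly as follows. The names QueryBuilderService and getQueryBuilders come from the PR, but the signatures here are simplified assumptions, not the actual Elasticsearch interfaces; String stands in for QueryBuilder to keep the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Simplified sketch of the registry idea. The real QueryBuilderService is
// populated via a getQueryBuilders plugin hook and produces QueryBuilder
// instances; here String stands in for QueryBuilder.
class QueryBuilderService {
    private final Map<String, BiFunction<String, String, String>> factories = new HashMap<>();

    // Each plugin registers a factory for query builders that can be
    // created from just a field name and a query string.
    void register(String queryName, BiFunction<String, String, String> factory) {
        if (factories.putIfAbsent(queryName, factory) != null) {
            throw new IllegalArgumentException("Duplicate query name: " + queryName);
        }
    }

    // Invoked during rewrite on the coordinating node, using the query name
    // stored in InferenceFieldMetadata (which, unlike mappings, is available there).
    String build(String queryName, String fieldName, String queryText) {
        BiFunction<String, String, String> factory = factories.get(queryName);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown query name: " + queryName);
        }
        return factory.apply(fieldName, queryText);
    }
}
```

The key property is that lookup needs only data stored in index metadata, so the match query can resolve the right builder without ever touching mappings.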

Because of scoring incompatibility issues, the match query can be used to query semantic_text fields only when that is the only field type being queried. That is to say, you can't query a text field and a semantic_text field using the same match query.

Putting it all together, it means you can now do things like this:

PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

PUT my-index-1
{
    "mappings": {
        "properties": {
            "inference_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            },
            "text_field": {
                "type": "text"
            }
        }
    }
}

PUT my-index-2
{
    "mappings": {
        "properties": {
            "inference_field": {
                "type": "text"
            },
            "text_field": {
                "type": "text"
            }
        }
    }
}

POST my-index-1/_doc/1
{
  "inference_field": "a test value",
  "text_field": "a test value"
}

POST my-index-2/_doc/1
{
  "inference_field": "a test value",
  "text_field": "a test value"
}

GET my-index-1/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

GET my-index-1/_search
{
  "query": {
    "match": {
      "text_field": "test"
    }
  }
}

// Fails due to combination of text & semantic_text fields
GET my-index-*/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

Contributor Author

We can probably rewrite this builder as a static build method in QueryBuilderService

if (inferenceFieldMetadata != null) {
    if (inferenceFieldQueryName != null
        && inferenceFieldMetadata.getQueryName().equals(inferenceFieldQueryName) == false) {
        throw new IllegalArgumentException("Detected incompatible inference field queries");

Contributor Author

These error messages leave much to be desired. Will iterate on them to make them more informative.

Member

@kderusso kderusso left a comment

I like this approach - it's clean overall.

Some early feedback:

  • I wonder if we should err on the side of being more permissive, rather than throwing an error for a lot of these queries. Maybe if we're trying to match on multiple query types it's OK, wrap in a dis_max query or use RRF or semantic reranking as a recommendation?

public final class InferenceFieldMetadata implements SimpleDiffable<InferenceFieldMetadata>, ToXContentFragment {
    private static final String INFERENCE_ID_FIELD = "inference_id";
    private static final String SOURCE_FIELDS_FIELD = "source_fields";
    private static final String QUERY_NAME_FIELD = "query_name";

Member

Nitpicky note: name overloaded here, because it refers to the keyed name of the query being run and also named queries. It would be good to differentiate these two concepts if possible.

}

public QueryBuilderService build() {
    Objects.requireNonNull(pluginsService);

Member

This validation should be in the constructor?

Contributor Author

I think we can remove the builder entirely actually and just have a static build method in QueryBuilderService instead

}

if (foundNonInferenceField && inferenceFieldQueryName != null) {
    throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time");

Member

Would it be better to dis max this and just return everything, knowing there will be better scores for semantic_text fields, than straight up erroring?

Contributor Author

The issue is actually the opposite: When querying a semantic_text field that uses a dense vector model, all the scores will be in the [0.0-1.0] range. BM25 scores are essentially guaranteed to be greater than this, so all the semantic_text results would be at the end of the result set.
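
Back-of-the-envelope numbers make the mismatch concrete. This is illustrative only, not Elasticsearch's actual scoring code, and the method names here are hypothetical: a cosine similarity normalized into [0, 1] versus a one-term BM25 score computed with Lucene's default k1 and b parameters.

```java
// Illustrative only: shows why naively mixing dense-vector scores with BM25
// scores in one result set pushes the dense results to the bottom.
class ScoreRanges {
    // Cosine similarity mapped from [-1, 1] into [0, 1]; 1.0 is a perfect match.
    static double denseScore(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (1 + dot / (Math.sqrt(normA) * Math.sqrt(normB))) / 2;
    }

    // One-term BM25 score; unbounded above as idf and tf grow.
    static double bm25(double tf, double idf, double docLen, double avgDocLen) {
        double k1 = 1.2, b = 0.75; // Lucene defaults
        return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }
}
```

Even a perfect dense-vector match caps at 1.0, while an unremarkable BM25 hit (tf = 3, idf = 6, average-length doc) lands around 9.4, so a naive dis_max over both would rank every text-field hit ahead of every semantic hit.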

Member

Ah, interesting, I ran through your script and

GET my-index-1/_search
{
  "query": {
    "match": {
      "inference_field": "test"
    }
  }
}

returned a score of 12.725867, while

GET my-index-1/_search
{
  "query": {
    "match": {
      "text_field": "test"
    }
  }
}

returned a score of 0.2876821.

Member

Ah, nevermind, you noted dense vector, not sparse vector - that makes sense. However, I think either using dis_max or just allowing it is potentially better than failing fast in this case.

Contributor Author

That's just one example though. We know that BM25 scores are unbounded and combining them naively with scores in a bounded range is going to result in bad relevance. IMO we shouldn't do something we know will result in bad relevance.

Contributor

I am strongly in favour of NOT throwing an error and using the default scores from the different types of queries. Currently it works this way: for example, if a user runs a match query on a numeric field in one index (which always produces scores of 1) and on a text field in another index, docs with their scores are combined without errors.

Member

I agree with @mayya-sharipova. We shouldn't fail here. It's a horrible experience. While it's weird to have things scored in different distributions, this technically already occurs with BM25 over two indices with vastly different term statistics.

Contributor Author

Good to see a consensus forming here. Given that the best dense vector scores will almost always be less than the worst BM25 scores, I think we should at least alert the user to the scoring mismatch in a more proactive way than just documentation. What does everyone think about a warning header in this scenario?

Member

I think a warning header might be overkill here.

Member

Agreed, no warning headers, no logging, no nothing. We just need to ensure that explain works.


    inferenceFieldQueryName = inferenceFieldMetadata.getQueryName();
} else {
    foundNonInferenceField = true;

Member

Can probably break out of the loop here once we get a single true value.

Contributor Author

Mikep86 commented Aug 23, 2024

@kderusso

Maybe if we're trying to match on multiple query types it's OK, wrap in a dis_max query or use RRF or semantic reranking as a recommendation?

Wrapping in a dis_max query doesn't work due to the score range differences described in this comment. We could update the error messages to suggest RRF or reranking to combine the result sets, which at least gives the user a thread to pull toward a solution.
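
Reciprocal rank fusion sidesteps the score-range problem entirely because it only consumes ranks, never raw scores. A minimal sketch (the class name is hypothetical, and k = 60 is the commonly used RRF default, not something this PR defines):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// RRF: score(doc) = sum over result lists of 1 / (k + rank), rank starting
// at 1. Rank-based, so the unbounded BM25 result set and the bounded
// dense-vector result set never have to share a score scale.
class RrfFusion {
    static Map<String, Double> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                // i is zero-based, so the rank contribution is 1 / (k + i + 1)
                fused.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return fused;
    }
}
```

A document that appears in both the lexical and the semantic list accumulates two contributions and outscores documents found by only one retriever, which is the behavior users usually want from hybrid search.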

@kderusso
Member

Wrapping in a dis_max query doesn't work due to score range differences described in #112166 (comment). We could update the error messages to suggest RRF or reranking to combine the result sets, that gives the user a thread for a solution at least.

Right - I get that, I think we should document RRF or reranking as a recommended solution but isn't returning matching results better than an error here?

Contributor Author

Mikep86 commented Aug 23, 2024

Right - I get that, I think we should document RRF or reranking as a recommended solution but isn't returning matching results better than an error here?

Not necessarily. The error at least tells the user what they are doing wrong. If we return results, how are they to know that anything is wrong? All they will know is that they get bad results out of the box, giving them a bad impression of Elasticsearch.

Member

@carlosdelest carlosdelest left a comment

This looks good - rewriting to a semantic query in the coordinator makes sense.

I have some thoughts for simplifying this, at least initially. LMKWYT!

protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) throws IOException {
    QueryBuilder rewritten = super.doRewrite(queryRewriteContext);

    if (rewritten == this && queryRewriteContext.getClass() == QueryRewriteContext.class) {

Member

I wonder if we can avoid the .getClass() comparison by doing something similar to the convertToXXXRewriteContext pattern that is used in other parts of the code, so the convert method returns null if it's not the expected QueryRewriteContext class 🤔 .

We're probably missing a rewrite phase for the coordinator - the current coordinator phase is really the can_match phase IIUC

Contributor Author

I think we can refactor to avoid the getClass() comparison. This was the quick & dirty approach.

And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.

Member

And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.

Understood - my comment referred to the current CoordinatorRewriteContext, which is really what runs on the can_match phase instead of actually being what is used in the coordinator.

}

if (foundNonInferenceField && inferenceFieldQueryName != null) {
    throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time");

Member

I understand your reasoning here @Mikep86. As there is no way to boost fields for specific indices in a match query, you prefer to error out instead of returning a surprising relevance result.

I think returning these results may be valuable though - there will be use cases where users will not be sorting by score. Users could also do some rescoring afterwards to adjust the scores via script_score.

How about adding some docs for match support on the semantic_text field type that address this concern, rather than limiting the user beforehand?
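
One shape such an opt-in rescore could take is a per-clause min-max normalization before combining. This is purely illustrative, not an API the PR adds, and the class name is hypothetical:

```java
import java.util.Arrays;

// Min-max normalize one clause's scores into [0, 1] so unbounded BM25
// scores and bounded dense-vector scores become comparable before they
// are combined.
class ScoreNormalizer {
    static double[] minMaxNormalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        if (max == min) {
            Arrays.fill(out, 1.0); // degenerate case: all scores equal
            return out;
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = (scores[i] - min) / (max - min);
        }
        return out;
    }
}
```

The caveat with any min-max scheme is that it is sensitive to outliers in the top-k window, which is part of why rank-based fusion like RRF is often preferred.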

}

if (inferenceFieldQueryName != null) {
    rewritten = queryRewriteContext.getQueryBuilderService()

Member

IIUC, you want to be prepared for potential future InferenceFieldMapper field types, and that's why you created the QueryBuilderService to help choosing the right QueryBuilder implementation.

My thoughts are that this is pretty complicated as of now - we only have a single InferenceFieldMapper which is semantic_text. Even if we have other InferenceFieldMapper subclasses, they would probably fit into using the semantic query.

Should we simplify this solution by assuming SemanticQueryBuilder is going to be used for any InferenceFieldMapper? We can always revisit this decision in the future when / if more inference field types are used.

Contributor Author

That's part of the motivation, but I also added QueryBuilderService in an attempt to add something that can help with use cases outside of semantic search. There is nothing specific to semantic search in the QueryBuilderService implementation; it can be used to make a query implemented in any plugin accessible in server or any other plugin without major refactoring.

I agree we can probably simplify this POC implementation. You are correct that I wrote this with the idea in mind that someday we will have more than one inference field and inference query type. Before simplifying though, I think we should go through a thought experiment of how we will be forward-compatible if/when we add more inference field types or inference query types.

Member

how we will be forward-compatible if/when we add more inference field types or inference query types.

IIUC, semantic query will be the one that checks the field type. If it does not support semantic query, it will fail.

We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.

We can't anticipate right now any refactoring that will need to use this, and we will be in a better position to review the solution once we actually need it.

I think it's better to have a simplified version for this, as we may still change this implementation if we need to support other queries in semantic_text.

Contributor Author

We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.

That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.

Let's discuss offline about how we can simplify. I agree there is room to do so, but it may not be as straightforward as it appears.

Member

That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.

Yep - my comment is not about detecting it on the coordinator node (which is handled by the InferenceFieldMetadata), but checking during the shard rewriting that it's an InferenceFieldMapper, and just invoking semanticQuery() on it without having to actually check the specific implementation.

Let's discuss offline about how we can simplify.

Happy to do so - I'm not opposed to this implementation, just trying to understand if we can keep the change smaller until (if) we have the need 👍

Contributor

mayya-sharipova commented Sep 3, 2024

@Mikep86 I have not studied the PR deeply, so forgive my naive questions:

  1. Why do we need to do the rewrite to a semantic query on the coordinating node? Why can't this rewrite be done when we are already on a shard and have all the mappings? Is it because we want to perform only a single inference call per request?
  2. Why do we need a queryName parameter for InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName)? Could there ever be a value other than semantic?

private final String queryName;

public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields) {
public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName) {

Member

Why are we adding this new metadata at all?

Contributor Author

For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.

Member

For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.

OK, this doesn't seem necessary and was strange to me. It would only be required for the edge case of a mixed-cluster environment where we try to allow "new code" to be used on an "old node". This overcomplicates things and we shouldn't do it.

Contributor

I agree, it seems overkill for the current need.

Contributor Author

Mikep86 commented Sep 3, 2024

@mayya-sharipova

why do we need to do a rewrite to a semantic query on a coordinating node? Why this rewrite can't it be done when we are already on a shard and have all mappings? Is it because we want to make only a single inference per request?

We must rewrite on the coordinating node because the semantic query performs inference on the coordinating node. If we were to wait and rewrite on the shards, we would have to perform inference N times (where N is the number of data nodes with shards for the index queried).

Why do we need a queryName parameter for InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName)? Could there be any different value from semantic?

Currently, no. This POC is written to be forward-compatible with any inference queries we may add in the future. @carlosdelest and I discussed (at a high level) how we could simplify in this comment thread. We should continue that discussion offline.

Contributor Author

Mikep86 commented Oct 21, 2024

Closing this PR as it has served its purpose of collecting feedback on the POC.
