Match query support for semantic_text fields - POC DO NOT MERGE#112166
Match query support for semantic_text fields - POC DO NOT MERGE#112166Mikep86 wants to merge 6 commits intoelastic:mainfrom
Conversation
There was a problem hiding this comment.
We can probably rewrite this builder as a static build method in QueryBuilderService
| if (inferenceFieldMetadata != null) { | ||
| if (inferenceFieldQueryName != null | ||
| && inferenceFieldMetadata.getQueryName().equals(inferenceFieldQueryName) == false) { | ||
| throw new IllegalArgumentException("Detected incompatible inference field queries"); |
There was a problem hiding this comment.
These error messages leave much to be desired. Will iterate on them to make them more informative.
kderusso
left a comment
There was a problem hiding this comment.
I like this approach - it's clean overall.
Some early feedback:
- I wonder if we should err on the side of being more permissive, rather than throwing an error for a lot of these queries. Maybe if we're trying to
matchon multiple query types it's OK, wrap in adis_maxquery or use RRF or semantic reranking as a recommendation?
| public final class InferenceFieldMetadata implements SimpleDiffable<InferenceFieldMetadata>, ToXContentFragment { | ||
| private static final String INFERENCE_ID_FIELD = "inference_id"; | ||
| private static final String SOURCE_FIELDS_FIELD = "source_fields"; | ||
| private static final String QUERY_NAME_FIELD = "query_name"; |
There was a problem hiding this comment.
Nitpicky note: name overloaded here, because it refers to the keyed name of the query being run and also named queries. It would be good to differentiate these two concepts if possible.
| } | ||
|
|
||
| public QueryBuilderService build() { | ||
| Objects.requireNonNull(pluginsService); |
There was a problem hiding this comment.
This validation should be in the constructor?
There was a problem hiding this comment.
I think we can remove the builder entirely actually and just have a static build method in QueryBuilderService instead
| } | ||
|
|
||
| if (foundNonInferenceField && inferenceFieldQueryName != null) { | ||
| throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time"); |
There was a problem hiding this comment.
Would it be better to dis max this and just return everything, knowing there will be better scores for semantic_text fields, than straight up erroring?
There was a problem hiding this comment.
The issue is actually the opposite: When querying a semantic_text field that uses a dense vector model, all the scores will be in the [0.0-1.0] range. BM25 scores are essentially guaranteed to be greater than this, so all the semantic_text results would be at the end of the result set.
There was a problem hiding this comment.
Ah, interesting, I ran through your script and
GET my-index-1/_search
{
"query": {
"match": {
"inference_field": "test"
}
}
}
returned a score of 12.725867, while
GET my-index-1/_search
{
"query": {
"match": {
"text_field": "test"
}
}
}
returned a score of 0.2876821.
There was a problem hiding this comment.
Ah, nevermind, you noted dense vector not sparse vector - that makes sense. However, I think the point of either using dis max or just allowing it is potentially better than failing fast in this case.
There was a problem hiding this comment.
That's just one example though. We know that BM25 scores are unbounded and combining them naively with scores in a bounded range is going to result in bad relevance. IMO we shouldn't do something we know will result in bad relevance.
There was a problem hiding this comment.
I am strongly in favour of NOT throwing an error and using the default scores from different types of queries. Currently it works this way: for example if a user runs match query on a numeric field in one index (which always produces scores of 1 ) and on a text field in another index, docs with their scores will be combined without errors.
There was a problem hiding this comment.
I agree with @mayya-sharipova . We shouldn't fail here. Its a horrible experience. While its weird to have things scored in different distributions, this is technically already occurring with BM25 over two indices with vastly different term statistics.
There was a problem hiding this comment.
Good to see a consensus forming here. Given that the best dense vector scores will almost always be less than
the worst BM25 scores, I think we should at least alert the user to the scoring mismatch in a more proactive way than just documentation. What does everyone think about a warning header in this scenario?
There was a problem hiding this comment.
I think a warning header might be overkill here.
There was a problem hiding this comment.
Agreed, no warning headers, no logging, no nothing. We just need to ensure that explain works.
|
|
||
| inferenceFieldQueryName = inferenceFieldMetadata.getQueryName(); | ||
| } else { | ||
| foundNonInferenceField = true; |
There was a problem hiding this comment.
Can probably break out of the loop here once we get a single true value.
Wrapping in a |
Right - I get that, I think we should document RRF or reranking as a recommended solution but isn't returning matching results better than an error here? |
Not necessarily. The error at least tells the user what they are doing wrong. If we return results, how are they to know that anything is wrong? All they will know is that they get bad results out of the box, giving them a bad impression of Elasticsearch. |
carlosdelest
left a comment
There was a problem hiding this comment.
This looks good - rewriting to a semantic query in the coordinator makes sense.
I have some thoughts for simplifying this, at least initially. LMKWYT!
| protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) throws IOException { | ||
| QueryBuilder rewritten = super.doRewrite(queryRewriteContext); | ||
|
|
||
| if (rewritten == this && queryRewriteContext.getClass() == QueryRewriteContext.class) { |
There was a problem hiding this comment.
I wonder if we can avoid the .getClass() comparison by doing something similar to the convertToXXXRewriteContext pattern that is used in other parts of the code, so the convert method returns null if it's not the expected QueryRewriteContext class 🤔 .
We're probably missing a rewrite phase for the coordinator - the current coordinator phase is really the can_match phase IIUC
There was a problem hiding this comment.
I think we can refactor to avoid the getClass() comparison. This was the quick & dirty approach.
And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.
There was a problem hiding this comment.
And just to clarify, this rewrite phase does not run in the can_match phase. It runs just before the can_match phase. It may seem like a pedantic detail, but it's important because the can_match phase is skipped for most search requests.
Understood - my comment referred to the current CoordinatorRewriteContext, which is really what runs on the can_match phase instead of actually being what is used in the coordinator.
| } | ||
|
|
||
| if (foundNonInferenceField && inferenceFieldQueryName != null) { | ||
| throw new IllegalArgumentException("Cannot query inference fields and non-inference fields at the same time"); |
There was a problem hiding this comment.
I understand your reasoning here @Mikep86 . As there is no way to boost fields for specific indices in match query, you prefer to error out instead of returning a surprising relevance result.
I think returning these results may be valuable though - there will be use cases where users will not be sorting by scoring. Users could also do some rescoring for adjusting the scores afterwards via script_score.
How about adding some docs for match support on the semantic_text field type that addresses this concern vs limiting the user beforehand?
| } | ||
|
|
||
| if (inferenceFieldQueryName != null) { | ||
| rewritten = queryRewriteContext.getQueryBuilderService() |
There was a problem hiding this comment.
IIUC, you want to be prepared for potential future InferenceFieldMapper field types, and that's why you created the QueryBuilderService to help choosing the right QueryBuilder implementation.
My thoughts are that this is pretty complicated as of now - we only have a single InferenceFieldMapper which is semantic_text. Even if we have other InferenceFieldMapper subclasses, they would probably fit into using the semantic query.
Should we simplify this solution by assuming SemanticQueryBuilder is going to be used for any InferenceFieldMapper? We can always revisit this decision in the future when / if more inference field types are used.
There was a problem hiding this comment.
That's part of the motivation, but I also added QueryBuilderService in an attempt to add something that can help with use cases outside of semantic search. There is nothing specific to semantic search in the QueryBuilderService implementation; it can be used to make a query implemented in any plugin accessible in server or any other plugin without major refactoring.
I agree we can probably simplify this POC implementation. You are correct that I wrote this with the idea in mind that someday we will have more than one inference field and inference query type. Before simplifying though, I think we should go through a thought experiment of how we will be forward-compatible if/when we add more inference field types or inference query types.
There was a problem hiding this comment.
how we will be forward-compatible if/when we add more inference field types or inference query types.
IIUC, semantic query will be the one that checks the field type. If it does not support semantic query, it will fail.
We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.
We can't anticipate as of now any refactoring that will need to use this, or will be in a better position to review the solution once we need it.
I think it's better to have a simplified version for this, as we still may change this implementation in case we need to support other queries in semantic_text.
There was a problem hiding this comment.
We could even move semanticQuery() method from the SemanticTextFieldMapper to the InferenceFieldMapper interface to signal the intent.
That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.
Let's discuss offline about how we can simplify. I agree there is room to do so, but it may not be as straightforward as it appears.
There was a problem hiding this comment.
That approach won't work because we don't have access to the mappings on the coordinating node. All we have access to is InferenceFieldMetadata, which is why I added the query name to that.
Yep - my comment is not about detecting it on the coordinator node (which is handled by the InferenceFieldMetadata), but checking during the shard rewriting that it's an InferenceFieldMapper, and just invoking semanticQuery() on it without having to actually check the specific implementation.
Let's discuss offline about how we can simplify.
Happy to do so - I'm not opposed to this implementation, just trying to understand if we can keep the change smaller until (if) we have the need 👍
|
@Mikep86 I have not studied the PR deeply, so forgive my naive questions:
|
server/src/main/java/org/elasticsearch/index/query/MatchQueryBuilder.java
Show resolved
Hide resolved
| private final String queryName; | ||
|
|
||
| public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields) { | ||
| public InferenceFieldMetadata(String name, String inferenceId, String[] sourceFields, String queryName) { |
There was a problem hiding this comment.
Why are we adding this new metadata at all?
There was a problem hiding this comment.
For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.
There was a problem hiding this comment.
For forward-compatibility with any potential new queries against inference fields. It's probably overkill, but I wanted to write the POC with that potential in mind. We can simplify for the production implementation.
OK, this doesn't seem necessary and was strange to me. It would only be required for the edge case of a mixed cluster environment & us trying to allow "new code" to be used on an "old node". This over complicates things and we shouldn't do this.
There was a problem hiding this comment.
I agree, it seems overkill for the current need.
We must rewrite on the coordinating node because the semantic query performs inference on the coordinating node. If we were to wait and rewrite on the shards, we would have to perform inference N times (where N is the number of data nodes with shards for the index queried).
Currently, no. This POC is written to be forward-compatible with any inference queries we may add in the future. @carlosdelest and I discussed (at a high level) how we could simplify in this comment thread. We should continue that discussion offline. |
|
Closing this PR as it has served its purpose for collecting feedback on the POC |
POC implementation of
matchquery support forsemantic_textfields. The technical goal of this POC is to update thematchquery rewrite logic such that it is rewritten to asemanticquery when we detect that we are querying asemantic_textfield.There were two big challenges implementing this POC:
semanticquery must happen on the coordinating node, where mappings are not availablematchquery code lives inserver, whilesemanticquery code lives in theinferencepluginI solved these challenges in the POC by adding the name of the query that should be used to query the inference field to
InferenceFieldMetadata. Since this is stored in index metadata, it is available on the coordinating node. I also addedQueryBuilderService, which stores query builder creation functions for query builders that can be created using a field name and query string.QueryBuilderServiceis then populated using thegetQueryBuildersplugin hook. I addedQueryBuilderServicetoQueryRewriteContext, making it available to thematchquery during rewrite on the coordinating node. The query name fromInferenceFieldMetadatais then used to create the proper query builder usingQueryBuilderService.Because of scoring incompatibility issues, the
matchquery can be used to querysemantic_textfields only when that is the only field type being queried. That is to say, you can't query atextfield andsemantic_textfield using the samematchquery.Putting it all together, it means you can now do things like this: