Conversation
| name 'x-pack-esql' | ||
| description 'The plugin that powers ESQL for Elasticsearch' | ||
| classname 'org.elasticsearch.xpack.esql.plugin.EsqlPlugin' | ||
| extendedPlugins = ['x-pack-esql-core', 'lang-painless', 'x-pack-ml'] |
There was a problem hiding this comment.
I needed to change build.gradle for both esql-core and esql because we need SemanticQueryBuilder.
I had to remove x-pack-ml because of jar hell.
The problem is that the categorize integration tests no longer work:
REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:qa:server:single-node:javaRestTest" --tests "org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT" -Dtests.method="test {categorize.Categorize ASYNC}"
[2024-11-18T19:19:08,448][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][search][T#47]], exiting java.lang.NoClassDefFoundError: org/elasticsearch/xpack/ml/job/categorization/CategorizationAnalyzer |
-- | --
| at org.elasticsearch.xpack.esql.expression.function.grouping.Categorize.toEvaluator(Categorize.java:112) |
| at org.elasticsearch.xpack.esql.evaluator.EvalMapper.toEvaluator(EvalMapper.java:55) |
| at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.planEval(LocalExecutionPlanner.java:410) |
| at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:197) |
| at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.planAggregation(LocalExecutionPlanner.java:242) |
| at org.elasticsearch.xpack.esql.planner.LocalExecutionPlanner.plan(LocalExecutionPlanner.java:189)
...
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.xpack.ml.job.categorization.CategorizationAnalyzer |
-- | --
| at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445) |
| at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:595) |
| at java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:870) |
| at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:528)
I don't know how to fix this - if I leave x-pack-ml here, we get Jar hell when Elasticsearch starts:
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:76)
Caused by: java.lang.IllegalStateException: jar hell!
class: com.ibm.icu.impl.Assert
jar1: /Users/ioana/projects/elasticsearch/distribution/archives/darwin-tar/build/install/elasticsearch-9.0.0-SNAPSHOT/modules/x-pack-ml/icu4j-68.2.jar
jar2: /Users/ioana/projects/elasticsearch/distribution/archives/darwin-tar/build/install/elasticsearch-9.0.0-SNAPSHOT/modules/x-pack-esql-core/icu4j-68.2.jar
at org.elasticsearch.base@9.0.0-SNAPSHOT/org.elasticsearch.jdk.JarHell.checkClass(JarHell.java:316)
at org.elasticsearch.base@9.0.0-SNAPSHOT/org.elasticsearch.jdk.JarHell.checkJarHell(JarHell.java:234)
at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.plugins.PluginsUtils.checkBundleJarHell(PluginsUtils.java:319)
... 9 more
inference and ml seem to have many common dependencies and AFAIK esql is the only plugin that seems to need both.
There was a problem hiding this comment.
Solved this by no longer using SemanticQueryBuilder - this means we no longer need to extend from x-pack-inference.
There was a problem hiding this comment.
I'm not sure how to make both search and ml work here, we should figure out how to deal with the dependency other than deferring it. @jan-elastic @ChrisHegarty do you have any idea whether it is ok to remove x-pack-ml here, or how to make it work?
There was a problem hiding this comment.
it is ok to remove x-pack-ml here, or how to make it work?
It's okay to remove it because we added it to x-pack-esql-core - x-pack-esql-core now extends x-pack-ml and x-pack-esql extends x-pack-esql-core.
When I simply removed x-pack-ml and replaced it x-pack-inference, the CATEGORIZE tests would start to fail. So I know this should be safe, because I can see CATEGORIZE works as expected (none of the tests for it fail).
we should figure out how to deal with the dependency other than deferring it.
I agree - I went back and forth on this and this is the best we could come up with:
- we could move
SemanticQueryBuilderinx-pack-core- this would mean moving some additional classes likeSemanticTextFieldMapper, maybe more, I haven't unravelled this one. - however even if we can use
SemanticQueryBuilderin ES|QL this would only be temporary - the reason for this is that for _search we want the match query to actually supportsemantic_text(Match query support for semantic_text fields - POC DO NOT MERGE #112166 - a POC and the actual implementation will be done in the following weeks). - if you take a look at the match query POC it looks like
MatchQueryBuilderhassemantic_textsupport -MatchQueryBuilderis in server so we should be able to use without dependency issues (we already use it).
So my take is that moving SemanticQueryBuilder does not gain us much if support in MatchQueryBuilder is coming soon. And I think that once that's coming, we might be able to revert build.gradle to what it was originally.
I captured that in a bullet point in #115103
carlosdelest
left a comment
There was a problem hiding this comment.
This LGTM - great work! 💯
We should get eyes from the AE team
| } | ||
|
|
||
| private String embeddingsFieldName() { | ||
| return name.concat(".inference.chunks.embeddings"); |
There was a problem hiding this comment.
It's a pity we can't access this constant from the ML plugin 😓
There was a problem hiding this comment.
What prevents us from accessing this constant from the ML plugin? Is it related to the dependency issue?
There was a problem hiding this comment.
These are not from the ML plugin - they are from the inference plugin:
We can import things from the ML plugin, but once we extend the inference plugin we end up with jar hell.
.../esql-core/src/main/java/org/elasticsearch/xpack/esql/core/querydsl/query/SemanticQuery.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/InferenceUtils.java
Outdated
Show resolved
Hide resolved
| - match: { columns: [{name: name, type: keyword}] } | ||
|
|
||
| --- | ||
| semantic_text declared in mapping: |
There was a problem hiding this comment.
I removed this test - it failed for serverless tests on mixed version clusters with Invalid EsField type [SemanticTextEsField]. However:
-
this test is a kind of a hack - it puts the full internal representation for the semantic_text field which will change, so this test will need change to use the test inference plugin so that the embeddings are internally generated - but then we have the same problem as Add mixed mode BWC testing support for test-specific plugins #115166
-
the amount of CSV tests we have already cover the fact that for ESQL the block loader is capable of only returning the original text content
-
semantic_textis only available in snapshots - so not available for serverless - the fact that this test fails has no meaning for serverless prod
There was a problem hiding this comment.
Having a yml test for semantic_text is better, if serverless is affected, can this be muted on serverless? Better keep it for stateful if possible.
There was a problem hiding this comment.
I reverted the removal and muted the test on serverless.
Also keeping track in #115103 to not forget to unmute this test in serverless once the changes here make it in 9.0.0 snapshot
fang-xing-esql
left a comment
There was a problem hiding this comment.
Thanks for putting these together @ioanatia ! I added some comments, and made a summary of my thoughts so far:
- This is a much more complicated feature compared to
KQL, as it touches a lot of more areas, and it worth spending more time on it. - I was looking for design docs, however I have no luck finding one that help me clarify the questions in my mind yet. There are a few areas that I think they worth further discussions:
- The workflow of inference resolution in
EsqlSession, I wonder if any alternative approaches were being considered or discussed. Have we considered implementing inference resolution in a similar way as index or enrich policy resolver? Making Inference resolution independent of them seems more clean if possible, so that we can minimize unnecessary interface changes or changes to tests that are irrelevant to inference resolution. - What is the best way to tell Analyzer that there is remote cluster involved? And how to deal with it within the scope of match?
- How to deal with semantic_text fields with multiple inference ids/results(this is allowed in the SemanticTextEsField definition), is it necessary to deal with them when there is no match function/operator in the query? Resolving inference adds extra cost, it is better to do it when we know that we will need it. It is likely multiple inference ids on the same semantic text field happens when there are multiple indices that match a common index pattern in this case, to make it more complicated, the data type could be different given the same field name, do we know how they behave?
- The workflow of inference resolution in
- More negative and positive tests will be appreciated, as they are the only way to exercise all possible code paths.
| name 'x-pack-esql' | ||
| description 'The plugin that powers ESQL for Elasticsearch' | ||
| classname 'org.elasticsearch.xpack.esql.plugin.EsqlPlugin' | ||
| extendedPlugins = ['x-pack-esql-core', 'lang-painless', 'x-pack-ml'] |
There was a problem hiding this comment.
I'm not sure how to make both search and ml work here, we should figure out how to deal with the dependency other than deferring it. @jan-elastic @ChrisHegarty do you have any idea whether it is ok to remove x-pack-ml here, or how to make it work?
...ck/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/type/SemanticTextEsFieldTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/InferenceUtils.java
Outdated
Show resolved
Hide resolved
| FieldAttribute field = (FieldAttribute) match.field(); | ||
| SemanticTextEsField esField = (SemanticTextEsField) field.field(); | ||
| if (esField.inferenceIds().size() == 1) { | ||
| result.add(new SemanticQuery(field.sourceText(), match.query().sourceText(), esField.inferenceIds().iterator().next())); |
There was a problem hiding this comment.
The first inferenceId is used here, match works with one inferenceId only, however multiple inferenceId can be defined in SemanticTextEsField. it will be appreciated if there are some comments here, and if we don't deal with multiple inference ids, does it make sense to keep just one inference id in SemanticTextEsField?
In case there are multiple inferenceIds, do we want to throw an exception here? Similar with what is done in Verifier.
There was a problem hiding this comment.
In case there are multiple inferenceIds, do we want to throw an exception here? Similar with what is done in Verifier.
I guess it does not hurt to throw an exception - but at this point in the execution, the Verifier already checked and returned an error if we are dealing with multiple inference ids.
| } else { | ||
|
|
||
| // This should never happen, but we handle it here either way | ||
| throw new IllegalStateException("Cannot handle inference results"); |
There was a problem hiding this comment.
Is IllegalStateException a 400 or 500?
| public Set<String> inferenceIdsForField( | ||
| Map<String, IndexMetadata> indexMetadata, | ||
| FieldCapabilitiesResponse fieldCapsResponse, | ||
| String name | ||
| ) { | ||
| Set<String> inferenceIds = new HashSet<>(); | ||
|
|
||
| for (FieldCapabilitiesIndexResponse ir : fieldCapsResponse.getIndexResponses()) { | ||
| String indexName = ir.getIndexName(); | ||
| InferenceFieldMetadata inferenceFieldMetadata = indexMetadata.get(indexName).getInferenceFields().get(name); | ||
| if (inferenceFieldMetadata != null) { | ||
| inferenceIds.add(inferenceFieldMetadata.getInferenceId()); | ||
| } | ||
| } | ||
| return inferenceIds; | ||
| } | ||
| } |
There was a problem hiding this comment.
Please consider decoupling this from IndexResolver and make an InferenceResolver to handle inference related tasks.
| 10043 | Yishay | Tzvieli | ||
| ; | ||
|
|
||
| testMatchWithSemanticText |
There was a problem hiding this comment.
More positive tests will be appreciated. How does it work with multi-valued fields? More combinations(conjunction) with other functions and operators, before/after stats/eval when it works, for the completeness. There could be surprises, without testing it out, it is hard to tell.
| 8 | William Faulkner | ||
| ; | ||
|
|
||
| testMatchWithSemanticText |
There was a problem hiding this comment.
Same as match-function, more variety of queries will be appreciated.
| ); | ||
| } | ||
|
|
||
| public void testMatchWithSemanticTextWithCCQ() { |
There was a problem hiding this comment.
More negative tests will be appreciated, we have code to detect errors, without having negative tests to cover those code paths, it is hard to tell whether they work as expected. Here are a few scenarios that I can think of:
Matchonsemantic_textfields with multipleinference_idsMatchon multiple indices that have a common index pattern, the field names is the same, however they may have different data types -text,keyword,semantic_textQSTRorKQLwith semantic_text field in the index mapping, they are search functions as well.- Include
semantic_textinMatchTestsandMatchOperatorTests.
| - match: { columns: [{name: name, type: keyword}] } | ||
|
|
||
| --- | ||
| semantic_text declared in mapping: |
There was a problem hiding this comment.
Having a yml test for semantic_text is better, if serverless is affected, can this be muted on serverless? Better keep it for stateful if possible.
…f semantic_text" This reverts commit 8d8efa9.
|
Thank you @fang-xing-esql for your review and thanks for meeting with me to discuss this.
Looking at ways to simplify this I am thinking to:
If the
I think I also need to add a comment in
Will follow up on this tomorrow. thanks! |
|
I started to do some experiments with the branch, and came across an issue when we have multiple indices with the same semantic_text field name, but different inference ids, here are the steps to reproduce it. |
fang-xing-esql
left a comment
There was a problem hiding this comment.
Thank you for fixing the issue with multiple inference ids @ioanatia! I got chance to try more today and add some comments with my findings.
| Set<String> indexNames = indexNames(analyzedPlan); | ||
|
|
||
| analyzedPlan.forEachExpressionDown(Match.class, match -> { | ||
| if (match.field().dataType() == DataType.SEMANTIC_TEXT && match.field() instanceof FieldAttribute field) { |
There was a problem hiding this comment.
Could you add this case if it hasn't been covered yet? Schema is here.
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx00 | eval x = textfield | where match(x, \"live\")"
}
'
{
"error" : {
"root_cause" : [
{
"type" : "null_pointer_exception",
"reason" : "Cannot invoke \"org.elasticsearch.inference.InferenceResults.getClass()\" because \"this.inferenceResults\" is null"
}
],
"type" : "null_pointer_exception",
"reason" : "Cannot invoke \"org.elasticsearch.inference.InferenceResults.getClass()\" because \"this.inferenceResults\" is null",
"suppressed" : [
{
"type" : "exception",
"reason" : "1 further exceptions were dropped"
},
{
"type" : "task_cancelled_exception",
"reason" : "parent task was cancelled [cancelled on failure]"
}
]
},
"status" : 500
}
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx04 | eval x = textfield | where match(x, \"live\")"
}
'
textfield | textfield.keyword | x
-------------------------------------+-------------------------------------+-------------------------------------
live long and prosper 04 text keyword|live long and prosper 04 text keyword|live long and prosper 04 text keyword
| ) | ||
| ); | ||
| } else { | ||
| failures.add( |
There was a problem hiding this comment.
After thinking about it more, I wonder if we should allow multiple inference ids on the same field, if the inference ids come from different indices, and each index has only one inference id on this field. This is more like a usability issue.
If I have multiple indices that have the same semantic text field name, I couldn't find a way to query them in at once. Does it look like a valid use case? Schema is here.
Thanks for fixing this!
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx00, testidx01 | where match(textfield, \"live\")"
}
'
{
"error" : {
"root_cause" : [
{
"type" : "verification_exception",
"reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield] because it is configured with multiple inference IDs."
}
],
"type" : "verification_exception",
"reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield] because it is configured with multiple inference IDs."
},
"status" : 400
}
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx00, testidx01 | where match(textfield::keyword, \"live\") | eval k = textfield::keyword | keep k"
}
'
{
"error" : {
"root_cause" : [
{
"type" : "verification_exception",
"reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield::keyword], which is not a field from an index mapping"
}
],
"type" : "verification_exception",
"reason" : "Found 1 problem\nline 1:41: [MATCH] function cannot operate on [textfield::keyword], which is not a field from an index mapping"
},
"status" : 400
}
There was a problem hiding this comment.
@fang-xing-esql , the problem with multiple inference IDs is relevance. It does not make sense to mix multiple inference IDs from the scoring point of view, as different models will get different scores that can't directly be combined. That's the main reason it is not allowed on the _search endpoint.
In _search, users can combine different inference ids using different query clauses in _search, that can the be further adjusted via boosting.
Even if it's technically possible to do this from a filtering PoV, it won't make sense to do it once we combine with score.
There was a problem hiding this comment.
Can we take this as a follow up item?
Since this restriction is also present in _search, for all the reasons @carlosdelest mentioned, I thought having it in ES|QL is reasonable for now.
The time where it becomes relevant to lift this restriction is when we introduce RRF, in that case we might want to allow RRF between two subqueries that query the same field on different indices: WHERE where match(textfield, "live") AND _index == "testidx00" and WHERE where match(textfield, "live") AND _index == "testidx01".
There was a problem hiding this comment.
Yes, this can be addressed as a follow up.
| match.inferenceResults() | ||
| ); | ||
| } | ||
|
|
There was a problem hiding this comment.
There is another scenario that I can think of - there are two indices with the same field name, one is semantic text, the other is a text field for example. If I query each of them individually, I got expected results, however if I query both of them together, I got empty results. It is similar to an issue that I brought up to Carlos on his PR, however, semantic text has a different type of query for now, perhaps it has to be taken care of in a different way.
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx00 | where match(textfield, \"live\")"
}
'
textfield
------------------------------
live long and prosper 00 elser
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx04 | where match(textfield, \"live\")"
}
'
textfield | textfield.keyword
-------------------------------------+-------------------------------------
live long and prosper 04 text keyword|live long and prosper 04 text keyword
+ curl -u elastic:password -v -X POST 'localhost:9200/_query?format=txt&pretty' -H 'Content-Type: application/json' '-d
{
"query": "from testidx00, testidx04 | where match(textfield::keyword, \"live\") | eval k = textfield::keyword | keep k"
}
'
k
---------------
There was a problem hiding this comment.
Good find! I added this as a follow up of things to do before we take this out of snapshot in #115103.
|
closed in favour of #118676 |
tracked in: #115103
This brings the ability to execute semantic queries using the
matchfunction onsemantic_textfields.How to test:
POST /_license/start_trial?acknowledge=true.Machine Learning > Trained Modelsyou can check the status of the ELSER model:Execution
What we do in _search:
In _search when we execute a semantic query, we have 2 rewrite phases:
live long- the inference results are the embeddings we will use to querying thesemantic_textfield. Retrieving the embeddings is done on the coordinator, because we do not want to do an inference call on every shard. The inference endpoint might be configured to work with an external, paid service like Azure OpenAI/Amazon Bedrock.For ES|QL we kept the same execution flow:
matchfunction that has asemantic_textargument. We collect these semantic query arguments and call the inference service to retrieve the embeddings. We then embed the inference results in theMatchobjects that represent semantic queries(InferenceUtils).Matchfunction to a query builder as part of pushing the filters to Lucene, we do the same type of query rewriting that is done in _search depending on whether the inference results are dense or sparse vectors.I I initially wanted to useSemanticQueryBuilderbut that resulted in a dependency hell. NOTE: this is only temporary, very soon we should have support forsemantic_textinMatchQueryBuilder, so this logic will be simplified.For licensing:
I did not add any extra licensing checks. After consulting with team, it seems those are not needed.
SemanticQueryBuilderalso does not do any licensing checks. It relies on the inference APIs to have proper licensing handling.Checklist:
semanticqueries in_searcheither - but we need to check we don't fail with a funny error, but with a meaningful messagesemantic_textfield that has differentinference_ids - validations/tests are needed, this is not supported in_searchtoo