Semantic search with query builder rewrite#118676
Conversation
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Show resolved
Hide resolved
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| assertEquals(404, re.getResponse().getStatusLine().getStatusCode()); | ||
| } | ||
|
|
||
| @Before |
There was a problem hiding this comment.
Would @BeforeClass / @AfterClass be better in the setup / teardown methods as no data is modified?
There was a problem hiding this comment.
I am having a bit of trouble using AfterClass/BeforeClass even if I switch to adminClient():
SemanticMatchIT > classMethod FAILED
java.lang.NullPointerException: Cannot invoke "org.elasticsearch.client.RestClient.performRequest(org.elasticsearch.client.Request)" because the return value of "org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.adminClient()" is null
at __randomizedtesting.SeedInfo.seed([9885A2F1F0E88028]:0)
at org.elasticsearch.xpack.esql.qa.rest.SemanticMatchTestCase.wipeData(SemanticMatchTestCase.java:105)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1763)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:909)
at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:48)
I see that most integration tests use @Before and @After, so I will revert to that.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/match-function.csv-spec
Show resolved
Hide resolved
| ; | ||
|
|
||
| _id:keyword | _score:double | ||
| 2 | 1.2879333961116942E19 |
There was a problem hiding this comment.
I wonder if score should be checked via YAML tests:
- It would allow to check that it's just greater than zero
- We could check if it's the same score as returned by an equivalent search
No need to do anything on this PR, just thinking out loud 🙂
There was a problem hiding this comment.
Yes - I am not sure either - for now I followed the pattern we already have for testing with _score.
Happy to change this, but not as part of this PR.
I think it's a bit more complicated than that - if we could just make sure we use the same sharding strategy consistently in the scoring tests, this shouldn't be a problem anymore.
The sharding does not only affect the value of the scores, but as we have seen from some of the failing tests, it can also affect the order of the documents, with some document getting scored higher than other depending on the sharding strategy. So to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.
There was a problem hiding this comment.
Happy to change this, but not as part of this PR.
++
to fix this scoring tests, it is not sufficient to just check that the scores are greater than 0.
What worked for me to test knn rescoring is to get the results from an actual search and compare them directly to the ones returned from another, both scoring and the results themselves. We can check on some follow up on the strategy
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java
Show resolved
Hide resolved
| // Check for every possible query data type | ||
| for (DataType fieldDataType : fieldDataTypes) { | ||
| // TODO: semantic_text is not present in mapping-all-types.json so we skip it for now | ||
| if (fieldDataType == DataType.SEMANTIC_TEXT) { |
There was a problem hiding this comment.
because I have not fully mapped how mapping-all-types.json is being used - I will follow up on this separately.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/QueryBuilderResolver.java
Show resolved
Hide resolved
| callback.accept(plan, listener); | ||
| return; | ||
| } | ||
| QueryRewriteContext ctx = queryRewriteContext(indexNames); |
There was a problem hiding this comment.
I wonder if QueryRewriteContext an intermediate data structure to carry ResolvedIndices to get the inference results? Can we just provide ResolvedIndices instead, without having to create a QueryRewriteContext?
There was a problem hiding this comment.
I am not sure I am following - we need a QueryRewriteContext, because what Rewritable.rewriteAndFetch does (among other things) is to call the rewrite method from QueryBuilder which needs a QueryRewriteContext:
Many implementations of the QueryBuilder interface will then have their own rewrite logic.
| } | ||
|
|
||
| @Override | ||
| protected Collection<Class<? extends Plugin>> nodePlugins() { |
There was a problem hiding this comment.
The KqlFunctionIT tests would fail with:
java.lang.AssertionError: Unknown NamedWriteable [org.elasticsearch.index.query.QueryBuilder][kql] |
-- | --
| at __randomizedtesting.SeedInfo.seed([7E86655D8AA148E0]:0) |
| at org.elasticsearch.common.io.stream.NamedWriteableRegistry.throwOnUnknownWritable(NamedWriteableRegistry.java:150) |
| at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:126) |
| at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:109)
this is because we needed to explicitly require the KqlPlugin so that the named writable for the KqlQueryBuilder is registered.
note that this was just a nit for this particular test (and other ES|QL integration tests also override this method), and that single and multi node tests for KQL continue to work with no required changes.
| | keep _id | ||
| ; | ||
|
|
||
| _id:keyword |
There was a problem hiding this comment.
I dropped the score column for now because I am getting different values for single vs multi node tests 🤷♀️ .
It's a separate problem which we'll need to look into.
Example:
Actual: | ----------------------------> these are the values generated with single node
-- | --
| _id:keyword \| _score:double |
| 2 \| 5.603396578413904E18 |
| 3 \| 2.156063961865257E18 |
| 1 \| 8.411017936759685E17 |
| |
| Expected: | -------------------------------> these are the generated values for multi node
| _id:keyword \| _score:double |
| 2 \| 5.603396E18 |
| 3 \| 2.156063E18 |
| 1 \| 8.411017E17
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
* Semantic search with query builder rewrite * Address review feedback * Add feature behind snapshot * Use after/before instead of afterClass/beforeClass * Call onFailure instead of throwing exception * Fix KqlFunctionIT by requiring KqlPlugin * Update scoring tests now that they are enabled * Drop the score column for now
tracked in #115103
actual implementation of the prototype from #118106
This adds a query builder rewrite phase on the coordinator to ES|QL.
Since the
MatchQueryBuildernow supportssemantic_text(#117839), with the new query builder rewrite phase it becomes very easy to add support for queryingsemantic_textin ES|QL.Still needs more tests - for now I added CSV tests (for positive cases) and integration tests to show how we capture and return errors that can happen during the query builder rewrite.