Skip to content

ES|QL MATCH command#111147

Merged
carlosdelest merged 30 commits intoelastic:mainfrom
carlosdelest:carlosdelest/esql-match-command
Aug 6, 2024
Merged

ES|QL MATCH command#111147
carlosdelest merged 30 commits intoelastic:mainfrom
carlosdelest:carlosdelest/esql-match-command

Conversation

@carlosdelest
Copy link
Copy Markdown
Member

@carlosdelest carlosdelest commented Jul 22, 2024

Adds an initial MATCH command to ES|QL, that uses a query string query:

FROM sample_data
| MATCH "message: Connecte*" 
| SORT @timestamp

For now, the query string is opaque - we don't analyze it. A Filter is created from it using a StringQueryPredicate.

The command is only included in snapshot builds to allow follow up work.

Test snippet
PUT sample_data
{
  "mappings": {
    "properties": {
      "client_ip": {
        "type": "ip"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

POST sample_data/_doc
{"@timestamp": "2023-10-23T12:15:03.360Z", "client_ip": "172.21.2.162", "message": "Connected to 10.1.0.3", "event_duration": 3450233}

POST sample_data/_doc
{"@timestamp": "2023-10-23T12:15:03.360Z", "client_ip": "172.21.2.162", "message": "Connected to 10.1.0.3", "event_duration": 3450233}

POST sample_data/_doc
{"@timestamp": "2023-10-23T12:27:28.948Z", "client_ip": "172.21.2.113", "message": "Connected to 10.1.0.2", "event_duration": 2764889}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:33:34.937Z", "client_ip": "172.21.0.5", "message": "Disconnected", "event_duration": 1232382}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:51:54.732Z", "client_ip": "172.21.3.15", "message": "Connection error", "event_duration": 725448}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:52:55.015Z", "client_ip": "172.21.3.15", "message": "Connection error", "event_duration": 8268153}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:53:55.832Z", "client_ip": "172.21.3.15", "message": "Connection error", "event_duration": 5033755}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:55:01.543Z", "client_ip": "172.21.3.15", "message": "Connected to 10.1.0.1", "event_duration": 1756467}

POST /_query?format=txt
{
  "query": "FROM sample_data | MATCH \"message: Connecte*\" | SORT @timestamp"
}

@Override
protected PhysicalPlan rule(QueryStringFilterExec queryStringFilterExec, LocalPhysicalOptimizerContext ctx) {
PhysicalPlan plan = queryStringFilterExec;
if (queryStringFilterExec.child() instanceof EsQueryExec queryExec) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushes down query string to ES query filters

);
}
} else if (p instanceof QueryStringFilterExec queryStringFilterExec) {
failures.add(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, there's no usage of MATCH directly on non-index based sources - we can implement that separately

private PhysicalOperation planQueryStringFilter(QueryStringFilterExec queryStringFilter, LocalExecutionPlannerContext context) {
PhysicalOperation source = plan(queryStringFilter.child(), context);
// TODO: Do we need a QueryStringFilterOperator here? Can we apply query string filter directly on the Page?
return source;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we don't apply query string filters on Page directly

} else if (exp instanceof SpatialRelatesFunction bc) {
return bc.canPushToSource(LocalPhysicalPlanOptimizer::isAggregatable);
} else if (exp instanceof StringQueryPredicate) {
return true;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to parse the query string in order to retrieve the fields - can be done as a follow up once we agree on the approach

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. At the moment it's just an opaque string.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd have to do that much earlier than the Local PhysicalPlanOptimizer I think.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ES|QL "retrieval" of fields and how they are resolved (what type they have, if the name is correct or not, if there are conflicts between data types or not) is done in a different layer.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're still not retrieving the fields that are involved in the query string, is there something we need to do at this stage besides always trying to push it to source?

My thinking was to check that afterwards. using maybe the LogicalVerifier to double check that there were no Filter plans with unsupported Expression conditions.

Would that approach work, or should any other approach work better for an (as of now) opaque query string?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogicalVerifier has a different role: it checks that every logicalPlan has the correct set of inputs and outputs (that it doesn't create un-tracked columns or loses references along the way).

You need a check similar to what Ioana did with the match operator in the Verifier.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying @astefan !

I don't think we can provide something similar until the fields from the query string are parsed in follow up iterations.

Do y'all think there's something specific we can do before that happens to prevent an invalid use of the command?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about queries where match comes after some commands that you know for sure it cannot be ok. For example, after stats for sure it's not ok because stats aggregates the results; doing a match afterwards is pointless.

Copy link
Copy Markdown
Member Author

@carlosdelest carlosdelest Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a closer look, I think there's only a couple of commands where it would be safe to use MATCH after (without field information): WHERE and SORT: ed1852e and 42751b7

We're very conservative there, but it's probably better that way until we can decide on optimizations and validations.

I've added some tests as well, they're incomplete - pending validation of this approach.

WDYAT?

Copy link
Copy Markdown
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the direction here - to start off as a filter and use StringQueryPredicate to support Elasticsearch Query String syntax only. We can then built up from there.

} else if (exp instanceof SpatialRelatesFunction bc) {
return bc.canPushToSource(LocalPhysicalPlanOptimizer::isAggregatable);
} else if (exp instanceof StringQueryPredicate) {
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. At the moment it's just an opaque string.

@carlosdelest carlosdelest changed the title [WIP] ES|QL MATCH command ES|QL MATCH command Jul 23, 2024
@carlosdelest carlosdelest marked this pull request as ready for review July 23, 2024 10:53
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 23, 2024
@carlosdelest carlosdelest added :Analytics/ES|QL AKA ESQL and removed needs:triage Requires assignment of a team area label labels Jul 23, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 23, 2024
@astefan
Copy link
Copy Markdown
Contributor

astefan commented Jul 30, 2024

This breaks the definition of a command in ES|QL: every command acts on the outputs it receives as inputs from the previous command(s). ES|QL doesn't make such an exception anywhere else in its commands. Using "opaque strings" is error prone, it's making the command confusing and will have an unexpected behavior.

This is the first step towards query string support - we can continue adding parsing support so we can add these features in follow up iterations.

This issue combined with this other one will create follow up work that is an order of magnitude greater than what is in this PR so far.

That is understood - this is just a first, demoable step with no production value yet. It is meant to have follow up iterations.

Doing iterations will allow gathering earlier feedback and have progress over perfection. It also helps me ramp up on the project and understand pieces a few at a time.

Given that this is behind a snapshot build check, are there any specific issues that you would like addressed to get it merged, so we can iterate on it?

Sure. This comment needs to be addressed #111147 (comment).

@carlosdelest
Copy link
Copy Markdown
Member Author

@elasticsearchmachine update branch

@ioanatia ioanatia mentioned this pull request Jul 30, 2024
14 tasks
Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this a bit. While the code here in this PR is ok and can be regarded as an acceptable version of a match command (while very primitive) there are some issues that sooner or later must be addressed and as I mentioned the work of these followups is an order of magnitude greater that the simple code here:

  • a match command and a match operator is confusing
  • there are no specific optimizations
    • here first they need to be identified and this is hard
    • secondly, it can extend the list of supported queries which currently are not that many
  • the match command accepting a string that is opaque is a problem
    • the field names inside are not under the control of the parser and any layer that deals with fields in ES|QL; this has a lot of implications
    • one example of field names transparency issue has already been mentioned by @ioanatia
    • another one, which is not obvious:

test1 index

            "message": {
                "type": "text"
            }

test2 index

            "message": {
                "type": "integer"
            }

and use this data

{"index": {"_index":"test1","_id":1}}
{"message": "FOO bar 123"}
{"index": {"_index":"test2","_id":8}}
{"whatever":100}
{"index": {"_index":"test2","_id":9}}
{"message": 123}

with query from test* metadata _index | match "123"

you get this information

    message    |   whatever    |    _index     
---------------+---------------+---------------
null           |null           |test2          
null           |null           |test1          

Meaning, something matched but users will not see why things matched.

@carlosdelest
Copy link
Copy Markdown
Member Author

I've tested this a bit. While the code here in this PR is ok and can be regarded as an acceptable version of a match command (while very primitive)

Agreed that this is meant as a first step in that direction 👍

there are some issues that sooner or later must be addressed and as I mentioned the work of these followups is an order of magnitude greater that the simple code here:

Totally agreed - we can add them as follow up and prioritise them

  • a match command and a match operator is confusing

Good feedback. Names can still change as we're still in an early phase and this is not exposed to users.

  • there are no specific optimizations
    • here first they need to be identified and this is hard

There are no optimizations yet - I guess first we need to get fields information before applying them.

  • secondly, it can extend the list of supported queries which currently are not that many

Can you please clarify what do you mean by this?

  • the match command accepting a string that is opaque is a problem

    • the field names inside are not under the control of the parser and any layer that deals with fields in ES|QL; this has a lot of implications
    • one example of field names transparency issue has already been mentioned by @ioanatia
    • another one, which is not obvious:

I agree. Bot issues are due to not parsing field information from the query, is that correct?

We can start working on properly parsing the query string (and extract field information) if y'all think that's the next step to do.

Some follow up work that we can discuss and prioritize:

  • Parsing the actual query string, so it stops being opaque and we get fields information
    • We can be iterative on this - start with an all fields query, then add support for:
      • field names
      • operators (AND, OR, NOT)
      • Wildcards
      • Boosting
        ...
  • Validations that take into account field names
  • Optimizations
  • Rinse and repeat

Is there anything that should be addressed before merging this PR, given the intention of addressing the above?

…ch-command

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@carlosdelest
Copy link
Copy Markdown
Member Author

@astefan I've created #111487 for follow up work. We can break down the tasks further if needed.

Copy link
Copy Markdown
Member

@ioanatia ioanatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work!

The Verifier validations you have added have should address the comments from my initial PR review.
Queries like this one should not be supported:

  FROM search-movies
  | KEEP plot
  | MATCH "title:harry*"

It also addresses this valid comment from Andrei about a fundamental aspect of ES|QL:

This breaks the definition of a command in ES|QL: every command acts on the outputs it receives as inputs from the previous command(s). ES|QL doesn't make such an exception anywhere else in its commands. Using #111147 (comment) is error prone, it's making the command confusing and will have an unexpected behavior.

There is one point from Andrei regarding matching on fields with conflicting types that's not been addressed. We should follow up on that.

There are a lot of comments on the PR so at a first glance it might seem that they are left unaddressed, but AFAICS they have been addressed.

I think we are at a state where we can merge this - the support for query string is just the initial step. This is also only available for snapshot builds - but having it merged makes it easier to test and get feedback.

@carlosdelest
Copy link
Copy Markdown
Member Author

Thanks @ioanatia !

There is one point from Andrei regarding matching on fields with conflicting types that's not been addressed. We should follow up on that.

I don't think there's a way of addressing it without field information. On #111487 we will retrieve fields information so we can unblock work on this.

@ChrisHegarty
Copy link
Copy Markdown
Contributor

...
There are a lot of comments on the PR so at a first glance it might seem that they are left unaddressed, but AFAICS they have been addressed.

I think we are at a state where we can merge this - the support for query string is just the initial step. This is also only available for snapshot builds - but having it merged makes it easier to test and get feedback.

++ LGTM

@costin costin requested a review from fang-xing-esql August 2, 2024 14:41
Copy link
Copy Markdown
Member

@fang-xing-esql fang-xing-esql left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good start to open up more possibilities in ES|QL to support full text search. Thank you for all the effort @carlosdelest!

@carlosdelest
Copy link
Copy Markdown
Member Author

@nik9000 , you requested some changes but I think that was due to your unintentional approval. Can you please review so I can merge this? Thanks! 🙏

Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning, something matched but users will not see why things matched.

Noting another facet of the above and the (current) lack of query string analysis: a query like from hosts | match "_index : hosts" will match on a field not exposed / enabled in the source command.

secondly, it can extend the list of supported queries which currently are not that many

Can you please clarify what do you mean by this?

I supposed this has to do with the restrictions enforced in the Verifier, many (most?) of which could be lifted once the query string is parsed.

Has this been addressed? I think it would be good to have tests (integration and spec) that chain multiple MATCHes, possibly interweaved with other allowed commands. Possibly for a next iteration.

// StringQueryPredicate
plan.forEachDown(LogicalPlan.class, lp -> {
if ((lp instanceof Filter || lp instanceof OrderBy || lp instanceof EsRelation) == false) {
failures.add(fail(plan, "MATCH cannot be used after {}", lp.sourceText().split(" ")[0].toUpperCase(Locale.ROOT)));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this check remains in place later as well: the user won't know what to do next, we usually try to provide guiding error messages (even if they're long[er]). In this case, we can add that this command can only be used past those other commands (WHERE, SORT or FROM).
Can be picked by next iteration, I understand that this is a small work fragment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! This is a guardrail for now. We aim to provide better interoperability / error messages, plus docs guidance about limitations if needed.

LIKE: 'like';
LP : '(';
MATCH: 'match';
MATCH_OPERATOR: 'match';
Copy link
Copy Markdown
Contributor

@bpintea bpintea Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another very minor note for next work: from hosts | matc will error with: "reason": "line 7:15: mismatched input 'matc' expecting {'dissect', 'drop', 'enrich', 'eval', 'grok', 'inlinestats', 'keep', 'limit', 'lookup', 'mv_expand', 'rename', 'sort', 'stats', 'where', MATCH}",
Edit: location is skewed / to be ignored, focus is the way the MATCH command is listed / standing out.

Comment on lines +378 to +383
* LimitExec[1000[INTEGER]]
* \_ExchangeExec[[],false]
* \_ProjectExec[[_meta_field{f}#8, emp_no{f}#2, first_name{f}#3, gender{f}#4, job{f}#9, job.raw{f}#10, languages{f}#5, last_na
* me{f}#6, long_noidx{f}#11, salary{f}#7]]
* \_FieldExtractExec[_meta_field{f}#8, emp_no{f}#2, first_name{f}#3]
* \_EsQueryExec[test], indexMode[standard], query[{"query_string":{"query":"\"last_name: Smith\""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😊 Yes! Thanks for the catch!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected in d102419

@carlosdelest
Copy link
Copy Markdown
Member Author

Meaning, something matched but users will not see why things matched.

Noting another facet of the above and the (current) lack of query string analysis: a query like from hosts | match "_index : hosts" will match on a field not exposed / enabled in the source command.

Yes - until we don't have query analysis we can't have that.

Has this been addressed? I think it would be good to have tests (integration and spec) that chain multiple MATCHes, possibly interweaved with other allowed commands. Possibly for a next iteration.

I did add a test in f1b8dc0. I didn't add a whole lot more as this is a restriction that I expect to have lifted soon - we can add more test cases once we approach the final version of the command.

LMK if you think there's more test cases that should be added for this first iteration!

@bpintea
Copy link
Copy Markdown
Contributor

bpintea commented Aug 6, 2024

I did add a test in f1b8dc0.
LMK if you think there's more test cases that should be added for this first iteration!

What mean is smth that Ioana mentioned above as well: multiple MATCHes, not just multiple filters. But unless there will be any other changes to go in before merging, it doesn't have to be now.

  FROM search-movies
  | MATCH "harry"
  | SORT title
  | LIMIT 10
  | MATCH "harry potter"

Note: the above won't work now as is, but once the limit is removed, it should.

@carlosdelest
Copy link
Copy Markdown
Member Author

multiple MATCHes, not just multiple filters.

I see! Added a test in 52e8619

@carlosdelest carlosdelest merged commit 25d89ae into elastic:main Aug 6, 2024
@carlosdelest
Copy link
Copy Markdown
Member Author

I'm merging this as it has approvals already. We can follow up on #111487

mhl-b pushed a commit that referenced this pull request Aug 8, 2024
@bvader
Copy link
Copy Markdown

bvader commented Aug 15, 2024

Hi @carlosdelest Just playing with this....

Your test snippet dataset is good but gives the wrong "impression" that this is a "starts with" text search.

You should add a couple of messages with an example that shows the search matches mid string like

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:51:54.732Z", "client_ip": "172.21.3.15", "message": "Error - Failed Connection", "event_duration": 725448}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:52:55.015Z", "client_ip": "172.21.3.15", "message": "Timed Out - Not Connected", "event_duration": 8268153}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:53:55.832Z", "client_ip": "172.21.3.15", "message": "Successfully Connected", "event_duration": 5033755}

POST sample_data/_doc
{"@timestamp": "2023-10-23T13:55:01.543Z", "client_ip": "172.21.3.15", "message": "Successful Connection to 10.1.0.1", "event_duration": 1756467}


POST /_query?format=txt
{
  "query": "FROM sample_data | MATCH \"message: Connecte*\" | SORT @timestamp"
}

#! No limit defined, adding default limit of [1000]
       @timestamp       |   client_ip   |event_duration |         message         
------------------------+---------------+---------------+-------------------------
2023-10-23T12:15:03.360Z|172.21.2.162   |3450233        |Connected to 10.1.0.3    
2023-10-23T12:15:03.360Z|172.21.2.162   |3450233        |Connected to 10.1.0.3    
2023-10-23T12:27:28.948Z|172.21.2.113   |2764889        |Connected to 10.1.0.2    
2023-10-23T13:52:55.015Z|172.21.3.15    |8268153        |Timed Out - Not Connected
2023-10-23T13:53:55.832Z|172.21.3.15    |5033755        |Successfully Connected   
2023-10-23T13:55:01.543Z|172.21.3.15    |1756467        |Connected to 10.1.0.1    

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants