ESQL: Compute support for filtering ungrouped aggs by nik9000 · Pull Request #112717 · elastic/elasticsearch

nik9000 · 2024-09-10T18:55:53Z

Adds support to the compute engine for filtering which positions are processed by ungrouped aggs. This should allow syntax like:

| STATS
       success = COUNT(*) WHERE 200 <= response_code AND response_code < 300,
      redirect = COUNT(*) WHERE 300 <= response_code AND response_code < 400,
    client_err = COUNT(*) WHERE 400 <= response_code AND response_code < 500,
    server_err = COUNT(*) WHERE 500 <= response_code AND response_code < 600,
   total_count = COUNT(*)

We could translate the WHERE expression into an ExpressionEvaluator and run it, then plug it into the filtering support added in this PR.

The actual filtering is done by creating a FilteredAggregatorFunction which wraps a regular AggregatorFunction. It first executes the filter against the incoming Page and then passes the resulting mask to the wrapped AggregatorFunction. We've also added a mask to AggregatorFunction#process, which each aggregation function must use for filtering.
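A minimal sketch of the wrapping idea, not the code in this PR: `SimpleAgg`, `CountAgg`, and `FilteredAgg` are hypothetical stand-ins that use plain arrays instead of `Page` and `BooleanVector`, and a `LongPredicate` instead of an `ExpressionEvaluator`. ANDing the incoming mask with the filter result is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

public class FilteredAggSketch {
    /** Hypothetical stand-in for an ungrouped aggregator that accepts a per-position mask. */
    public interface SimpleAgg {
        void process(long[] page, boolean[] mask);

        long result();
    }

    /** Counts the positions whose mask entry is true. */
    public static final class CountAgg implements SimpleAgg {
        private long count;

        @Override
        public void process(long[] page, boolean[] mask) {
            for (int p = 0; p < page.length; p++) {
                if (mask[p]) {
                    count++;
                }
            }
        }

        @Override
        public long result() {
            return count;
        }
    }

    /** Wraps another aggregator: evaluates the filter per position, then delegates with the resulting mask. */
    public static final class FilteredAgg implements SimpleAgg {
        private final SimpleAgg next;
        private final LongPredicate filter;

        public FilteredAgg(SimpleAgg next, LongPredicate filter) {
            this.next = next;
            this.filter = filter;
        }

        @Override
        public void process(long[] page, boolean[] mask) {
            boolean[] filtered = new boolean[page.length];
            for (int p = 0; p < page.length; p++) {
                // Assumption of this sketch: combine the incoming mask with the filter result.
                filtered[p] = mask[p] && filter.test(page[p]);
            }
            next.process(page, filtered);
        }

        @Override
        public long result() {
            return next.result();
        }
    }

    public static void main(String[] args) {
        long[] responseCodes = {200, 204, 301, 404, 500, 503};
        boolean[] noMasking = new boolean[responseCodes.length];
        Arrays.fill(noMasking, true);

        // Roughly: success = COUNT(*) WHERE 200 <= response_code AND response_code < 300
        SimpleAgg success = new FilteredAgg(new CountAgg(), c -> 200 <= c && c < 300);
        // Roughly: total_count = COUNT(*)
        SimpleAgg total = new CountAgg();

        success.process(responseCodes, noMasking);
        total.process(responseCodes, noMasking);
        System.out.println("success=" + success.result() + " total=" + total.result());
    }
}
```

The wrapped aggregator never sees the filter itself, only a mask, so one wrapper type can serve every aggregation function.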

We keep the unfiltered behavior by sending a constant block with true in it. Each agg detects this and takes an "unfiltered" path, preserving the original performance.
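The fast path could look roughly like the sketch below. `Mask` is a hypothetical, simplified stand-in for a boolean vector that is either constant or per-position, and `min` is an illustrative aggregation rather than the generated code. The point is only that a constant mask is checked once per page, so the unfiltered loop has no per-position branch on the mask.

```java
public class MaskFastPathSketch {
    /** Hypothetical mask: one constant value for all positions, or one boolean per position. */
    public static final class Mask {
        private final boolean[] values; // null when the mask is constant
        private final boolean constantValue;

        private Mask(boolean[] values, boolean constantValue) {
            this.values = values;
            this.constantValue = constantValue;
        }

        public static Mask constant(boolean value) {
            return new Mask(null, value);
        }

        public static Mask of(boolean... values) {
            return new Mask(values.clone(), false);
        }

        public boolean isConstant() {
            return values == null;
        }

        public boolean get(int position) {
            return values == null ? constantValue : values[position];
        }
    }

    /** Minimum of the unmasked positions, or Long.MAX_VALUE if none are selected. */
    public static long min(long[] page, Mask mask) {
        long min = Long.MAX_VALUE;
        if (mask.isConstant()) {
            if (mask.get(0) == false) {
                return min; // every position is filtered out
            }
            // Unfiltered path: the same tight loop as before masks existed.
            for (long v : page) {
                min = Math.min(min, v);
            }
            return min;
        }
        // Filtered path: consult the mask at every position.
        for (int p = 0; p < page.length; p++) {
            if (mask.get(p)) {
                min = Math.min(min, page[p]);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        long[] page = {5, 3, 9};
        System.out.println(min(page, Mask.constant(true)));
        System.out.println(min(page, Mask.of(true, false, true)));
    }
}
```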

Importantly, when you don't turn this on it doesn't affect performance:

 (blockType)  (grouping)   (op)  Score    Error -> Score    Error  Units
vector_longs        none  count  0.007 ±  0.001 -> 0.007 ±  0.001  ns/op
vector_longs        none    min  0.123 ±  0.004 -> 0.128 ±  0.005  ns/op
vector_longs       longs  count  4.311 ±  0.192 -> 4.218 ±  0.053  ns/op
vector_longs       longs    min  5.476 ±  0.077 -> 5.451 ±  0.074  ns/op


elasticsearchmachine · 2024-09-10T18:56:16Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

ivancea

Looks good!

ivancea · 2024-09-11T08:35:37Z

.../test/java/org/elasticsearch/xpack/esql/expression/function/AbstractAggregationTestCase.java

-                try {
-                    aggregator.processPage(inputPage);
+                try (
+                    BooleanVector noMasking = driverContext().blockFactory().newConstantBooleanVector(true, inputPage.getPositionCount())


As of this PR, we should be able to test masking here, maybe by adding another test for it.
Should we do it now, or in another PR?

Yeah! I was going to do it in a follow-up. But, yeah. Soon!

ivancea · 2024-09-11T08:37:50Z

...ugin/esql/compute/gen/src/main/java/org/elasticsearch/compute/gen/AggregatorImplementer.java

+            builder.beginControlFlow("if (vector != null)").addStatement("addRawVector(vector)");
+            builder.nextControlFlow("else").addStatement("addRawBlock(block)").endControlFlow();


nit: Maybe it's just me, but I think this is easier to read if every statement is on its own line, so you can "read" the code within the quotes from top to bottom.

builder.beginControlFlow("if (vector != null)").addStatement("addRawVector(vector)");
builder.nextControlFlow("else").addStatement("addRawBlock(block)").endControlFlow();

VS

builder.beginControlFlow("if (vector != null)");
builder.addStatement("addRawVector(vector)");
builder.nextControlFlow("else");
builder.addStatement("addRawBlock(block)");
builder.endControlFlow();

I can do that!


elasticsearchmachine · 2024-09-11T19:42:06Z

💚 Backport successful

Status	Branch	Result
✅	8.x




nik9000 added >non-issue :Analytics/ES|QL AKA ESQL v8.16.0 labels Sep 10, 2024

nik9000 requested a review from ivancea September 10, 2024 18:55

nik9000 requested a review from a team as a code owner September 10, 2024 18:55

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 10, 2024

ivancea approved these changes Sep 11, 2024

View reviewed changes

mark-vieira added v9.0.0 and removed v8.16.0 labels Sep 11, 2024

Format

2cf7085

nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 11, 2024

nik9000 mentioned this pull request Sep 11, 2024

ESQL: Add pre and post filter for grouping operator #111439

Open

nik9000 added auto-backport-and-merge v8.16.0 labels Sep 11, 2024

elasticsearchmachine merged commit d7cc407 into elastic:main Sep 11, 2024

nik9000 deleted the esql_filter_aggs branch September 11, 2024 19:41

nik9000 mentioned this pull request Sep 11, 2024

[8.x] ESQL: Compute support for filtering ungrouped aggs (#112717) #112763

Merged

nik9000 mentioned this pull request Sep 12, 2024

Add CircuitBreaker to TDigest, Step 1: Raw arrays to Arrays wrapper #112810

Merged

elasticsearchmachine merged 2 commits into elastic:main from nik9000:esql_filter_aggs
