ESQL: Fix synthetic attribute pruning by alex-spies · Pull Request #111413 · elastic/elasticsearch

alex-spies · 2024-07-29T15:04:38Z

Fix ESQL: ProjectAwayColumns rule not handling synthetic attribute correctly #105821 and mark synthetically introduced Aliases as synthetic again.
Simplify ProjectAwayColumns.
Simplify DependencyConsistency.
Make AggregateExec track the intermediate attributes it actually outputs/requires.

ProjectAwayColumns is an optimization, run on the coordinator node, which determines which columns/attributes are required from the data nodes; unused columns/attributes are projected out by adding a new projection to the PhysicalPlan's Fragment, i.e. the logical plan sent to data nodes.

The current approach tries to find all references by traversing the plan nodes and, roughly, assumes that each NamedExpression (usually an Attribute or Alias) must be provided by the current plan node's children, unless it's a NamedExpression that the current node generates itself. E.g. for EVAL x = 2*field, y = x + 1, the plan node references field and x, but x is generated in the same node, so what's required from the child plan node is actually just field.
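To make the EVAL example concrete, here is a minimal sketch in plain Java. It is not the ESQL code: `EvalSketch`, `requiredFromChild`, and the use of plain strings as column names are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an EVAL node: each alias name maps to the
// column names that its expression reads.
final class EvalSketch {
    // Columns the EVAL needs from its child: everything its expressions
    // reference, minus everything the EVAL defines itself.
    static Set<String> requiredFromChild(Map<String, List<String>> fields) {
        Set<String> referenced = new LinkedHashSet<>();
        for (List<String> refs : fields.values()) {
            referenced.addAll(refs);
        }
        referenced.removeAll(fields.keySet());
        return referenced;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("x", List.of("field")); // x = 2*field
        fields.put("y", List.of("x"));     // y = x + 1
        System.out.println(requiredFromChild(fields)); // prints [field]
    }
}
```

Because the sketch identifies columns by name only, it does not model shadowing such as EVAL x = x + 1, where the x that is read and the x that is generated are different attributes.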

However, this is brittle, as plan nodes can contain all kinds of NamedExpressions, and it is also very difficult to reason about this optimization rule's correctness. #99188 added a workaround for synthetic attributes accidentally not being accounted for correctly. That workaround prevented us from using synthetic attributes in many situations, because synthetic attributes couldn't be sent from data to coordinator node anymore; however, sending them is required for union types and some push down optimization rules.

This PR fixes and simplifies ProjectAwayColumns by observing that the minimum set of columns needed to run a given logical plan node while still producing the same output can be computed from 3 sets:

1. the expected output (QueryPlan.output())
2. the attributes generated by this plan node (subtract the children's input from output())
3. any references required for this node specifically, e.g. for EVAL x = 2*field, y = x + 1 this is only field.

To obtain 3., this PR fixes QueryPlan.references, which used to just mechanically traverse all expressions of a plan node and look for any attributes it could find. Because 3. is also implicitly computed in DependencyConsistency to check whether a plan node's children provide all required attributes, this PR also simplifies DependencyConsistency.
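The way the three sets combine while walking down the plan can be sketched as follows. This is an illustration under stated assumptions, not the optimizer rule itself: `PruneSketch`, `Node`, and `requiredFromSource` are hypothetical names, columns are plain strings, and the plan is a simple chain ordered from the top node to the bottom one.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PruneSketch {
    // Hypothetical plan node: what it outputs, what its child provides (input),
    // and what the node itself reads from the child (requiredInput).
    record Node(String name, Set<String> output, Set<String> input, Set<String> requiredInput) {}

    // Walks the chain top-down and returns the columns the bottom node's child must provide.
    static Set<String> requiredFromSource(Set<String> finalOutput, List<Node> topDown) {
        Set<String> required = new LinkedHashSet<>(finalOutput);
        for (Node node : topDown) {
            Set<String> added = new LinkedHashSet<>(node.output());
            added.removeAll(node.input());          // attributes generated by this node
            required.removeAll(added);              // the child cannot provide those
            required.addAll(node.requiredInput());  // but it must provide what the node reads
        }
        return required;
    }

    public static void main(String[] args) {
        // Hypothetical query: FROM idx (columns a, b, field) | EVAL x = 2*field, y = x + 1 | KEEP y
        Set<String> source = Set.of("a", "b", "field");
        Set<String> afterEval = Set.of("a", "b", "field", "x", "y");
        Node keep = new Node("KEEP y", Set.of("y"), afterEval, Set.of("y"));
        Node eval = new Node("EVAL", afterEval, source, Set.of("field"));
        System.out.println(requiredFromSource(Set.of("y"), List.of(keep, eval))); // prints [field]
    }
}
```

In this example only field has to be requested from the source, so a and b can be projected away.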

alex-spies · 2024-07-29T15:26:16Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizer.java

+                } else {
+                    AttributeSet childOutput = currentPlanNode.inputSet();
+                    AttributeSet addedAttributes = currentPlanNode.outputSet().subtract(childOutput);
+                    requiredAttributes.set(requiredAttributes.get().subtract(addedAttributes).combine(currentPlanNode.requiredInputSet()));


The main change; the way we collected all attributes that occurred and then removed generated attributes is fully abstracted away here.

alex-spies · 2024-07-29T15:27:54Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizer.java

-                            // skip synthetically added attributes (the ones from AVG), see LogicalPlanOptimizer.SubstituteSurrogates
-                            if (attr.synthetic() == false && aliases.containsKey(attr) == false) {


Synthetic attributes were skipped because we had to solve an NPE; IMHO this was not a correct long term fix, as it made heavy assumptions on where synthetic attributes are used.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/AggregateExec.java

elasticsearchmachine · 2024-07-30T09:31:01Z

Hi @alex-spies, I've created a changelog YAML for you.

astefan

I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).

astefan · 2024-07-31T14:45:19Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Drop.java

    }

+    @Override
+    public AttributeSet requiredInputSet() {


Isn't drop different here? Its usage is to capture what the query wants to remove from projections and, in essence, a drop doesn't "live" too long, being transformed into a projection in the Analyzer.

True. Drop essentially is not a real logical plan node, it's an AST node. The same is true for Lookup and Keep.

We can have this throw UnsupportedOperationException instead; although I think returning references() is correct, too; this contains the UnresolvedAttributes that this is trying to drop, excluding any wildcards.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java

x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/plan/QueryPlan.java

astefan · 2024-08-01T14:06:22Z

Minor suggestion: just for debugging purposes, to see in the trace logging which attributes in the plan are also synthetic. Maybe something like return qualifiedName() + "{" + label() + "}" + (synthetic() ? "{s}" : "") + "#" + id(); to be used here?
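For illustration, this is roughly what the suggested string would look like in trace output. The record below is a hypothetical stand-in, not the real Attribute class; the label "f" and the ids are made up for the example.

```java
// Hypothetical minimal attribute; the real Attribute class carries more state.
final class AttrSketch {
    record Attr(String qualifiedName, String label, boolean synthetic, long id) {
        @Override
        public String toString() {
            // Same shape as the suggestion: name{label}, an optional {s} marker, then #id.
            return qualifiedName + "{" + label + "}" + (synthetic ? "{s}" : "") + "#" + id;
        }
    }

    public static void main(String[] args) {
        // prints $$multi_typed_field$converted_to$keyword{f}{s}#12
        System.out.println(new Attr("$$multi_typed_field$converted_to$keyword", "f", true, 12));
        // prints salary{f}#7
        System.out.println(new Attr("salary", "f", false, 7));
    }
}
```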

alex-spies · 2024-08-19T11:58:02Z

I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).

I turned attributes synthetic that were meant to be synthetic in this commit.

For union types specifically, we turn each expression like TO_STRING(multi_typed_field) into a synthetic field attribute $$multi_typed_field$converted_to$keyword. This was previously impossible, because they would get projected away before reaching the coordinator node, which is required e.g. to compute the top n over TO_STRING(multi_typed_field).

costin

Looks pretty good - thanks for incorporating the feedback and left another round.

...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java

...k/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/Attribute.java

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizer.java

costin · 2024-08-21T18:03:11Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/QueryPlan.java

        return lazyReferences;
    }

+    protected abstract AttributeSet computeReferences();


Make the default implementation Expressions.references(expressions()) like before since it applies to most implementations.

I would actually prefer not to give this a default implementation.

As long as we add new commands, I'd prefer each plan node to be forced to explicitly declare what it requires from its children. This is more verbose, but consistent with output() - and the fact that references had the mentioned default implementation led to many plan nodes having wrong implementations, and we built a bunch of workarounds to account for that.

Case in point: when the Join node was first added, its references implementation was wrong and needed fixing, but this was not immediately apparent at all.

Let me know how you'd like us to proceed. I'm fine with either way, but wanted to lay out my reasoning first.

tl;dr - I've read your arguments, however I'd prefer to use a default implementation, similar to inputSet, etc...

references and output are different.
The input of each set is well determined - the output set of its children. The same pattern makes sense for references(), expressions() etc. Not so with the output since each input had its own particularities.

As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.

At the end of the day, it's an arbitrary choice. The style of the code base is trusting towards the author, that is, non-defensive: for example, collections are not copied for input/returns by default, method calls don't perform null checks for every parameter, etc...

Hence why I opt towards a default implementation that is useful and guiding - adding a new command requires the author to know what they're doing to begin with so I'd use this mindset, at least at this stage of the project.

As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.

The problem was having a default implementation. The default implementation just didn't apply for most query plans but we kept using it. The default implementation was exactly Expressions.references(expressions()), that's why I'm hesitant to revert back to how it was before.

But it's probably fine either way and I'll revert it to using the default impl. Since we now actually rely on correct references() implementations in ProjectAwayColumns and the dependency checkers, there's now a higher chance to notice that the default implementation needed to be overridden if it didn't apply.
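The outcome of this thread, a cached references() backed by an overridable default, can be sketched as below. This is a simplified illustration, not the actual QueryPlan code: `ReferencesSketch`, `Plan`, `Filter`, `Eval`, and `expressionRefs` are hypothetical, and columns are plain strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ReferencesSketch {
    abstract static class Plan {
        private Set<String> lazyReferences;

        // Column names mentioned anywhere in this node's expressions.
        abstract List<String> expressionRefs();

        // Cached accessor; computeReferences() runs at most once per node.
        final Set<String> references() {
            if (lazyReferences == null) {
                lazyReferences = computeReferences();
            }
            return lazyReferences;
        }

        // Default: everything the expressions mention. Nodes that generate
        // attributes themselves are expected to override this.
        protected Set<String> computeReferences() {
            return new LinkedHashSet<>(expressionRefs());
        }
    }

    // WHERE salary > 10: the default applies unchanged.
    static final class Filter extends Plan {
        @Override
        List<String> expressionRefs() {
            return List.of("salary");
        }
    }

    // EVAL x = 2*field, y = x + 1: the default would wrongly include x.
    static final class Eval extends Plan {
        @Override
        List<String> expressionRefs() {
            return List.of("field", "x");
        }

        @Override
        protected Set<String> computeReferences() {
            Set<String> refs = super.computeReferences();
            refs.removeAll(Set.of("x", "y")); // drop what this node generates itself
            return refs;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Filter().references()); // prints [salary]
        System.out.println(new Eval().references());   // prints [field]
    }
}
```

With a default in place, a node like Eval silently reports x as a reference if the author forgets the override, which is the risk raised above; the counterargument is that rules relying on references() now make such a mistake more likely to be noticed.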

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Aggregate.java

costin · 2024-08-21T19:10:23Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Eval.java

+    public static AttributeSet requiredAttributesFromChild(List<Alias> fields) {
+        AttributeSet generated = new AttributeSet(asAttributes(fields));
+        return Expressions.references(fields).subtract(generated);
+    }


Same remark - is this used elsewhere externally?

Yup, in EvalExec!

Any reason why the methods across Eval and Aggregate don't have the same name? Something like doComputeReferences() or determineReferences(), etc... required is a misnomer in this context.

I'll just overload computeReferences (and output for Aggregate), that should be nice and consistent.

...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java

...in/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizerTests.java

...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java

astefan · 2024-08-22T14:30:35Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/UnresolvedRelation.java

-    @Override
-    public AttributeSet references() {
-        AttributeSet refs = super.references();
-        if (indexMode == IndexMode.TIME_SERIES) {


Why did you make the move to EsqlSession? Was the fieldNames the only place where metrics specific metadata attributes were needed?

I think fieldNames is the only place where we would somehow call UnresolvedRelation.references() and do something with the output in case of time series indices; after the first analysis steps, there's not supposed to be any UnresolvedRelations anymore.

Removing this code altogether makes the tests in k8s-metrics fail, whereas it currently runs green; also, the only place in the whole (non-test) code base where IndexMode.TIME_SERIES is set on an UnresolvedRelation is LogicalPlanBuilder.visitMetricsCommand.

I'll double-check if moving this code to EsqlSession.fieldNames is the best course of action; if so, I'll add a test case to IndexResolverFieldNamesTests.

Ok, it seems like it's reasonable to have a special case for METRICS in EsqlSession.fieldNames. I added a test.

I also added a comment and made this look for Aggregate of Metrics type rather than looking for unresolved relations with timeseries mode. It's the Aggregate that needs to consume the @timestamp field if it's in Metrics mode, which I think is easier to grasp than "that's an unresolved time series, so probably we need the @timestamp field somewhere".

Better to create an issue for this - adding a separate node (or extending Aggregate) is a more elegant way for evolving this requirement.

I opened #112473; I'm not sure I captured your intent @costin , so feel free to comment on or update that.

alex-spies · 2024-08-28T14:29:24Z

Ok, this should be ready to go now.

@costin would you like to take another look?

astefan

LGTM with a final remark regarding metrics usage in EsqlSession.
Thanks @alex-spies

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

costin

Left a few comments, LGTM otherwise.

alex-spies · 2024-09-04T08:19:54Z

CI actually went through and was green; no idea why elasticsearch-ci/part-4 is still showing as pending.

alex-spies · 2024-09-04T08:22:02Z

Thanks for your reviews @astefan and @costin !

- Fix ProjectAwayColumns to handle synthetic attributes and mark synthetically introduced Aliases as synthetic again.
- Fix QueryPlan.references()
- Simplify ProjectAwayColumns.
- Simplify DependencyConsistency.
- Make AggregateExec track the intermediate attributes it actually outputs/requires.

alex-spies added 6 commits July 29, 2024 11:45

Use AggregatorMode instead of AggregateExec.Mode

205dfad

Make AggregateExec track intermediateAttributes

1cabef2

Implement QueryPlan.requiredInputSet

54e0436

Simplify and correct ProjectAwayColumns

890701e

Small refactor

bb0c926

Squash some minor mistakes

f22f3a1

elasticsearchmachine added the v8.16.0 label Jul 29, 2024

alex-spies commented Jul 29, 2024

Update tests

dbc9199

alex-spies added the test-full-bwc Trigger full BWC version matrix tests label Jul 30, 2024

alex-spies commented Jul 30, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/AggregateExec.java

alex-spies added 5 commits July 30, 2024 09:35

Simplify DependencyConsistency

741dbd7

Small refactor

9e4dab5

Tiny refactor

d78f073

Start using synthetic=true again

9695190

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

c343812

…bute-pruning

alex-spies added >bug :Analytics/ES|QL AKA ESQL labels Jul 30, 2024

alex-spies added 3 commits July 30, 2024 11:31

Update docs/changelog/111413.yaml

c5e74be

Do not filter out <no-fields> in the Analyzer

ac2d6e4

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

54e1847

…bute-pruning

alex-spies requested a review from astefan July 30, 2024 09:38

astefan reviewed Aug 1, 2024

alex-spies added 3 commits August 14, 2024 16:16

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

7db76d8

…bute-pruning

Fix AggregateExec ser/de tests, hash, equals

01d4c5f

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

5443fef

…bute-pruning

costin requested changes Aug 21, 2024

alex-spies added 4 commits August 22, 2024 10:41

Reduce noise via static imports

d4dc892

Synthetic label: {f$} instead of {f:s}

0505dc4

Make synth attr prefix inaccessible

6f36f70

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

ec67cf2

…bute-pruning

alex-spies requested a review from costin August 22, 2024 09:02

astefan reviewed Aug 22, 2024

alex-spies added 5 commits August 22, 2024 18:35

Make FieldAttribute ctor with type package private

259d93f

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

7849978

…bute-pruning

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

3582136

…bute-pruning

Refactor fieldNames for METRICS + add test

d1f55e3

Update comment

776ed78

alex-spies mentioned this pull request Aug 29, 2024

ESQL: Reorganize optimizer rules #112338

Merged

astefan approved these changes Aug 29, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

alex-spies added 3 commits August 29, 2024 16:48

Undo refactor to fieldNames; rely on index mode

77431f6

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

7c3aaf5

…bute-pruning

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

e557eb4

…bute-pruning

costin approved these changes Sep 2, 2024

alex-spies added 3 commits September 3, 2024 18:48

Merge remote-tracking branch 'upstream/main' into fix-synthetic-attri…

cb69899

…bute-pruning

Use default implementation for computeReferences

f65747d

Rename static helpers in Aggregate and Eval

a7b8715

alex-spies mentioned this pull request Sep 3, 2024

ESQL: Make explicit when METRICS Aggregate requires @timestamp field #112473

Open

alex-spies merged commit 9b96665 into elastic:main Sep 4, 2024

alex-spies deleted the fix-synthetic-attribute-pruning branch September 4, 2024 08:22

alex-spies mentioned this pull request Sep 4, 2024

ESQL: ENRICH attribute handling inconsistent with EVAL, GROK, ... #105807

Closed
