ESQL: Fix synthetic attribute pruning#111413
Conversation
| } else { | ||
| AttributeSet childOutput = currentPlanNode.inputSet(); | ||
| AttributeSet addedAttributes = currentPlanNode.outputSet().subtract(childOutput); | ||
| requiredAttributes.set(requiredAttributes.get().subtract(addedAttributes).combine(currentPlanNode.requiredInputSet())); |
There was a problem hiding this comment.
The main change; the way we collected all attributes that occurred and then removed generated attributes is fully abstracted away here.
| // skip synthetically added attributes (the ones from AVG), see LogicalPlanOptimizer.SubstituteSurrogates | ||
| if (attr.synthetic() == false && aliases.containsKey(attr) == false) { |
There was a problem hiding this comment.
Synthetic attributes were skipped because we had to solve an NPE; IMHO this was not a correct long term fix, as it made heavy assumptions on where synthetic attributes are used.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/AggregateExec.java
Show resolved
Hide resolved
|
Hi @alex-spies, I've created a changelog YAML for you. |
astefan
left a comment
There was a problem hiding this comment.
I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).
| } | ||
|
|
||
| @Override | ||
| public AttributeSet requiredInputSet() { |
There was a problem hiding this comment.
Isn't drop different here? It's usage is to capture what the query wants to remove from projections and, in essence, a drop doesn't "live" too long, being transformed in a projection in the Analyzer.
There was a problem hiding this comment.
True. Drop essentially is not a real logical plan node, it's an AST node. The same is true for Lookup and Keep.
We can have this throw UnsupportedOperationException instead; although I think returning references() is correct, too; this contains the UnresolvedAttributes that this is trying to drop, excluding any wildcards.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Show resolved
Hide resolved
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/plan/QueryPlan.java
Outdated
Show resolved
Hide resolved
|
Minor suggestion: just for debugging purposes, to see the in trace logging which attributes in the plan are also synthetic. Maybe something like |
I turned attributes synthetic that were meant to be synthetic in this commit. For union types specifically, we turn each expression like |
costin
left a comment
There was a problem hiding this comment.
Looks pretty good - thanks for incorporating the feedback and left another round.
...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/Attribute.java
Outdated
Show resolved
Hide resolved
.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizer.java
Show resolved
Hide resolved
| return lazyReferences; | ||
| } | ||
|
|
||
| protected abstract AttributeSet computeReferences(); |
There was a problem hiding this comment.
Make the default implementation Expressions.references(expressions()) like before since it applies to most implementations.
There was a problem hiding this comment.
I would actually prefer not to give this a default implementation.
As long as we add new commands, I'd prefer each plan node to be forced to explicitly declare what it requires from its children. This is more verbose, but consistent with output() - and the fact that references had the mentioned default implementation led to many plan nodes having wrong implementations, and we built a bunch of workarounds to account for that.
Case in point: when the Join node was first added, its references implementation was wrong and needed fixing, but this was not immediately apparent at all.
Let me know how you'd like us to proceed. I'm fine with either way, but wanted to lay out my reasoning first.
There was a problem hiding this comment.
tl;dr - I've read your arguments however I'd prefer to use a default implementations, similar to inputSet, etc...
references and output are different.
The input of each set is well determined - the output set of its children. The same patter makes sense for references(), expressions() etc.. Not so with the output since each input had its own particularities.
As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.
At the end of the day, it's an arbitrary choice. The style of the code-base is trustful towards the author, that is non-defensive: (for example collections are not copied for input/returns by default), method calls don't perform null checks for every params, etc...
Hence why I opt towards a default implementation that is useful and guiding - adding a new command requires the author to know what they're doing to begin with so I'd use this mindset, at least at this stage of the project.
There was a problem hiding this comment.
As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.
The problem was having a default implementation. The default implementation just didn't apply for most query plans but we kept using it. The default implementation was exactly Expressions.references(expressions()), that's why I'm hesitant to revert back to how it was before.
But it's probably fine either way and I'll revert it to using the default impl. Since we now actually rely on correct references() implementations in ProjectAwayColumns and the dependency checkers, there's now a higher chance to notice that the default implementation needed to be overridden if it didn't apply.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Aggregate.java
Outdated
Show resolved
Hide resolved
| public static AttributeSet requiredAttributesFromChild(List<Alias> fields) { | ||
| AttributeSet generated = new AttributeSet(asAttributes(fields)); | ||
| return Expressions.references(fields).subtract(generated); | ||
| } |
There was a problem hiding this comment.
Same remark - is this used elsewhere externally?
There was a problem hiding this comment.
Yup, in EvalExec!
There was a problem hiding this comment.
Any reason why the methods across Eval and Aggregate don't have the same name? Something like doComputeReferences() or determineReferences(), etc... required is a misnomer in this context.
There was a problem hiding this comment.
I'll just overload computeReferences (and output for Aggregate), that should be nice and consistent.
...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java
Show resolved
Hide resolved
...in/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizerTests.java
Show resolved
Hide resolved
...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public AttributeSet references() { | ||
| AttributeSet refs = super.references(); | ||
| if (indexMode == IndexMode.TIME_SERIES) { |
There was a problem hiding this comment.
Why did you make the move to EsqlSession? Was the fieldNames the only place where metrics specific metadata attributes were needed?
There was a problem hiding this comment.
I think fieldNames is the only place where we would somehow call UnresolvedRelation.references() and do something with the output in case of time series indices; after the first analysis steps, there's not supposed to be any UnresolvedRelations anymore.
Removing this code altogether makes the tests in k8s-metrics fail, whereas it currently runs green; also, the only place in the whole (non-test) code base where IndexMode.TIME_SERIES is set on an UnresolvedRelation is LogicalPlanBuilder.visitMetricsCommand.
I'll double-check if moving this code to EsqlSession.fieldNames is the best course of action; if so, I'll add a test case to IndexResovlerFieldNamesTests.
There was a problem hiding this comment.
Ok, it seems like it's reasonable to have a special case for METRICS in EsqlSession.fieldNames. I added a test.
I also added a comment and made this look for Aggregate of Metrics type rather than looking for unresolved relations with timeseries mode. It's the Aggregate that needs to consume the @timestamp field if it's in Metrics mode, which I think is easier to grasp than "that's an unresolved time series, so probably we need the @timestamp field somewhere".
There was a problem hiding this comment.
Better to create an issue for this - adding a separate node (or extending Aggregate) is a more elegant way for evolving this requirement.
|
Ok, this should be ready to go now. @costin would you like to take another look? |
astefan
left a comment
There was a problem hiding this comment.
LGTM with a final remark regarding metrics usage in EsqlSession.
Thanks @alex-spies
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Show resolved
Hide resolved
costin
left a comment
There was a problem hiding this comment.
Left a few comments, LGTM otherwise.
costin
left a comment
There was a problem hiding this comment.
Left a few comments, LGTM otherwise.
|
CI actually went through and was green; no idea why elasticsearch-ci/part-4 is still showing as pending. |
- Fix ProjectAwayColumns to handle synthetic attributes and mark synthetically introduced Aliases as synthetic again. - Fix QueryPlan.references() - Simplify ProjectAwayColumns. - Simplify DependencyConsistency. - Make AggregateExec track the intermediate attributes it actually outputs/requries.
syntheticagain.ProjectAwayColumns.DependencyConsistency.AggregateExectrack the intermediate attributes it actually outputs/requries.ProjectAwayColumns is an optimization, run on the coordinator node, which determines which columns/attributes are required from the data nodes; unused columns/attributes are projected out by adding a new projection to the PhysicalPlan's Fragment, i.e. the logical plan sent to data nodes.
The current approach tries to find all references by traversing the plan nodes, and, roughly, assuming each NamedExpression (that's usually an Attribute or Alias) is going to be required to be provided by the current plan nodes's children - unless it's a NamedExpression that the current node generates itself. E.g. for
EVAL x = 2*field, y = x + 1, the plan node referencesfieldandx, butxis generated in the same node, so what's required from the child plan node is actually justfield.However, this is brittle, as plan nodes can contain all kinds of NamedExpressions, and it is also very difficult to reason about this optimization rule's correctness. #99188 added a workaround to synthetic attributes accidentally not being accounted for correctly, which prevented us from using synthetic attributes in many situations because synthetic attributes couldn't be sent from data to coordinator node anymore; however, this is required for union types and some push down optimization rules.
This PR fixes and simplifies
ProjectAwayColumnsby observing that determining the minimum set of columns to run a given logical plan node while still creating the same output can be computed from 3 sets:QueryPlan.output())output())EVAL x = 2*field, y = x + 1this is onlyfield.To obtain 3., this PR fixes
QueryPlan.references, which used to just mechanically traverse all expressions of a plan node and look for any attributes it could find. Because 3. is also implicitly computed inDependencyConsistencyto check whether a plan node's children provide all required attributes, this PR also simplifiesDependencyConsistency.