-
Notifications
You must be signed in to change notification settings - Fork 190
[FEATURE] Enhancement on agg with dependent bucket fields #4681
Description
Is your feature request related to a problem?
Clickbench Query q36:
source=hits | eval `ClientIP - 1` = ClientIP - 1, `ClientIP - 2` = ClientIP - 2, `ClientIP - 3` = ClientIP - 3 | stats count() as c by `ClientIP`, `ClientIP - 1`, `ClientIP - 2`, `ClientIP - 3` | sort - c | head 10
currently gets plan:
Explain: {
"calcite": {
"logical": "LogicalSystemLimit(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10000], type=[QUERY_SIZE_LIMIT])
LogicalSort(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10])
LogicalProject(c=[$4], ClientIP=[$0], ClientIP - 1=[$1], ClientIP - 2=[$2], ClientIP - 3=[$3])
LogicalAggregate(group=[{0, 1, 2, 3}], c=[COUNT()])
LogicalProject(ClientIP=[$76], ClientIP - 1=[-($76, 1)], ClientIP - 2=[-($76, 2)], ClientIP - 3=[-($76, 3)])
CalciteLogicalIndexScan(table=[[OpenSearch, hits]])
",
"physical": "EnumerableLimit(fetch=[10000])
EnumerableCalc(expr#0..4=[{inputs}], c=[$t4], ClientIP=[$t0], ClientIP - 1=[$t1], ClientIP - 2=[$t2], ClientIP - 3=[$t3])
EnumerableLimit(fetch=[10])
EnumerableSort(sort0=[$4], dir0=[DESC-nulls-last])
CalciteEnumerableIndexScan(table=[[OpenSearch, hits]], PushDownContext=[[AGGREGATION->rel#6613:LogicalAggregate.NONE.[](input=RelSubset#6612,group={0, 1, 2, 3},c=COUNT())], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"ClientIP\":{\"terms\":{\"field\":\"ClientIP\",\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\"}}},{\"ClientIP - 1\":{\"terms\":{\"script\":
...
...
,\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\"}}}]},\"aggregations\":{\"c\":{\"value_count\":{\"field\":\"_index\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])
"
}
}
which contains multiple buckets with script.
However, the bucket fields ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 should be equivalent on group splitting, which means it should make no difference between using any one of them as the group field with using all of them. If we optimize the plan to use only 1 bucket field, e.g. ClientIP, it could avoid script and multiple-level buckets, which both contribute to better performance.
What solution would you like?
Implement a new rule to simplify plan with single field ClientIP in agg and create a project upon original agg to add other bucket fields back. The plan could be presented by query like
source=hits | stats stats count() as c by `ClientIP`| eval ClientIP - 1` = ClientIP - 1, `ClientIP - 2` = ClientIP - 2, `ClientIP - 3` = ClientIP - 3 | ...
What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.
Do you have any additional context?
Add any other context or screenshots about the feature request here.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status