Skip to content

[FEATURE] Enhancement on agg with dependent bucket fields #4681

@qianheng-aws

Description

@qianheng-aws

Is your feature request related to a problem?
Clickbench Query q36:

source=hits | eval `ClientIP - 1` = ClientIP - 1, `ClientIP - 2` = ClientIP - 2, `ClientIP - 3` = ClientIP - 3 | stats count() as c by `ClientIP`, `ClientIP - 1`, `ClientIP - 2`, `ClientIP - 3` | sort - c | head 10

currently gets plan:

Explain: {
  "calcite": {
    "logical": "LogicalSystemLimit(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10000], type=[QUERY_SIZE_LIMIT])
  LogicalSort(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10])
    LogicalProject(c=[$4], ClientIP=[$0], ClientIP - 1=[$1], ClientIP - 2=[$2], ClientIP - 3=[$3])
      LogicalAggregate(group=[{0, 1, 2, 3}], c=[COUNT()])
        LogicalProject(ClientIP=[$76], ClientIP - 1=[-($76, 1)], ClientIP - 2=[-($76, 2)], ClientIP - 3=[-($76, 3)])
          CalciteLogicalIndexScan(table=[[OpenSearch, hits]])
",
    "physical": "EnumerableLimit(fetch=[10000])
  EnumerableCalc(expr#0..4=[{inputs}], c=[$t4], ClientIP=[$t0], ClientIP - 1=[$t1], ClientIP - 2=[$t2], ClientIP - 3=[$t3])
    EnumerableLimit(fetch=[10])
      EnumerableSort(sort0=[$4], dir0=[DESC-nulls-last])
        CalciteEnumerableIndexScan(table=[[OpenSearch, hits]], PushDownContext=[[AGGREGATION->rel#6613:LogicalAggregate.NONE.[](input=RelSubset#6612,group={0, 1, 2, 3},c=COUNT())], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"ClientIP\":{\"terms\":{\"field\":\"ClientIP\",\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\"}}},{\"ClientIP - 1\":{\"terms\":{\"script\":
...
...
,\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\"}}}]},\"aggregations\":{\"c\":{\"value_count\":{\"field\":\"_index\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])
"
  }
}

which contains multiple buckets with script.

However, the bucket fields ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 should be equivalent on group splitting, which means it should make no difference between using any one of them as the group field with using all of them. If we optimize the plan to use only 1 bucket field, e.g. ClientIP, it could avoid script and multiple-level buckets, which both contribute to better performance.

What solution would you like?
Implement a new rule to simplify plan with single field ClientIP in agg and create a project upon original agg to add other bucket fields back. The plan could be presented by query like

source=hits | stats stats count() as c by `ClientIP`| eval ClientIP - 1` = ClientIP - 1, `ClientIP - 2` = ClientIP - 2, `ClientIP - 3` = ClientIP - 3 | ...

What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context?
Add any other context or screenshots about the feature request here.

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagecalcitecalcite migration releatedclickbenchenhancementNew feature or request

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions