@seawinde
Contributor

@seawinde seawinde commented Jul 9, 2025

What problem does this PR solve?

If an aggregate table uses random distribution, an aggregate node is added on top of the scan.
The alias created for the aggregate function is given the same expr id as the child expression of that alias.

For example, take the query select * from db1.tagg.
Its plan is shown below. In sum(b#1) AS b#1, the alias expr id is the same as the expr id of its child (both are 1).
Two different expressions sharing one expr id can cause hidden problems.

LogicalResultSink[30] ( outputExprs=[a#0, b#1] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#1], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042452160, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )

This PR fixes that by giving the alias a new expr id, so the expression becomes sum(b#1) AS b#4:

LogicalResultSink[30] ( outputExprs=[a#0, b#4] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#4], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042065062, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )
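To illustrate why a shared id is risky, here is a minimal sketch, not the Doris implementation. It assumes that some later step keys a lookup by expr id and expects each id to have exactly one producer. The names `ExprIdCollisionSketch`, `IdGenerator`, `register` and `planHasUniqueIds` are hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ExprIdCollisionSketch {
    // Hypothetical stand-in for a statement-scoped id generator (not the Doris class).
    static final class IdGenerator {
        private final AtomicInteger next;

        IdGenerator(int firstFreeId) {
            this.next = new AtomicInteger(firstFreeId);
        }

        int newExprId() {
            return next.getAndIncrement();
        }
    }

    // Records which expression produces an expr id; returns false if the id is already taken.
    static boolean register(Map<Integer, String> producers, int exprId, String producer) {
        return producers.putIfAbsent(exprId, producer) == null;
    }

    // Builds the id table for the plan in the description, using the given alias id.
    static boolean planHasUniqueIds(int aliasExprId) {
        Map<Integer, String> producers = new HashMap<>();
        boolean ok = register(producers, 0, "scan column a");
        ok &= register(producers, 1, "scan column b");
        ok &= register(producers, aliasExprId, "sum(b#1) AS b");
        return ok;
    }

    public static void main(String[] args) {
        // Before the fix: the alias reuses id 1, which the scan column b already owns.
        System.out.println(planHasUniqueIds(1)); // prints "false"

        // After the fix: the alias takes a fresh id. Starting at 4 is an assumption
        // chosen only to mirror b#4 in the plan above.
        IdGenerator generator = new IdGenerator(4);
        System.out.println(planHasUniqueIds(generator.newExprId())); // prints "true"
    }
}
```

With the reused id, the alias and the scan column cannot be told apart by id alone, so any id-keyed replacement or lookup could pick the wrong one.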

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jul 9, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@seawinde
Contributor Author

seawinde commented Jul 9, 2025

run buildall

  }
- Alias alias = new Alias(exprId, ImmutableList.of(function), col.getName(),
-         olapScan.qualified(), true);
+ Alias alias = new Alias(StatementScopeIdGenerator.newExprId(), ImmutableList.of(function),
Contributor

Add a new constructor for Alias that accepts four arguments: children, name, qualifier, and nameFromChild.
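
One possible shape for that suggestion, as a rough sketch only: the four-argument constructor delegates to the five-argument one and always supplies a fresh id, so callers cannot accidentally pass an existing expr id. Everything here is simplified and hypothetical. The real Alias takes expression objects rather than strings, and `NEXT_ID` merely stands in for whatever id generator the codebase uses.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class AliasCtorSketch {
    // Hypothetical id source; stands in for a statement-scoped generator.
    static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    // Simplified stand-in for an alias expression; children are plain strings here.
    static final class Alias {
        final int exprId;
        final List<String> children;
        final String name;
        final List<String> qualifier;
        final boolean nameFromChild;

        // Five-argument form: the caller chooses the expr id.
        Alias(int exprId, List<String> children, String name,
                List<String> qualifier, boolean nameFromChild) {
            this.exprId = exprId;
            this.children = children;
            this.name = name;
            this.qualifier = qualifier;
            this.nameFromChild = nameFromChild;
        }

        // Four-argument form: a fresh expr id is generated on every call.
        Alias(List<String> children, String name,
                List<String> qualifier, boolean nameFromChild) {
            this(NEXT_ID.getAndIncrement(), children, name, qualifier, nameFromChild);
        }
    }

    public static void main(String[] args) {
        List<String> qualifier = List.of("internal", "db1", "tagg");
        Alias first = new Alias(List.of("sum(b#1)"), "b", qualifier, true);
        Alias second = new Alias(List.of("sum(b#1)"), "b", qualifier, true);
        // Two aliases built through the four-argument form never share an id.
        System.out.println(first.exprId != second.exprId); // prints "true"
    }
}
```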

@doris-robot

TPC-H: Total hot run time: 33745 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bf922e703b4553d8c3b34d106fbef86efe16d694, data reload: false

------ Round 1 ----------------------------------
q1	17591	5121	5117	5117
q2	1934	286	183	183
q3	10310	1299	759	759
q4	10229	1047	569	569
q5	7660	2355	2426	2355
q6	182	165	131	131
q7	913	775	622	622
q8	9314	1393	1136	1136
q9	6807	5079	5132	5079
q10	6917	2351	1959	1959
q11	491	288	288	288
q12	341	364	232	232
q13	17774	3678	3204	3204
q14	223	229	220	220
q15	555	476	476	476
q16	442	431	397	397
q17	607	879	349	349
q18	7631	7282	7084	7084
q19	1218	1108	578	578
q20	337	346	223	223
q21	4035	3268	2483	2483
q22	364	339	301	301
Total cold run time: 105875 ms
Total hot run time: 33745 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5151	5299	5138	5138
q2	245	327	212	212
q3	2157	2699	2267	2267
q4	1408	1820	1349	1349
q5	4282	4422	4577	4422
q6	213	170	123	123
q7	2055	1970	1755	1755
q8	2709	2473	2695	2473
q9	7382	7267	7224	7224
q10	3112	3290	2860	2860
q11	581	553	485	485
q12	743	781	658	658
q13	3793	3961	3362	3362
q14	288	315	282	282
q15	515	472	522	472
q16	491	520	454	454
q17	1178	1561	1457	1457
q18	7892	8114	7530	7530
q19	853	889	1011	889
q20	2069	1962	1806	1806
q21	4905	4414	4351	4351
q22	627	628	555	555
Total cold run time: 52649 ms
Total hot run time: 50124 ms

@doris-robot

TPC-DS: Total hot run time: 186196 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bf922e703b4553d8c3b34d106fbef86efe16d694, data reload: false

query1	1007	391	386	386
query2	6497	1662	1659	1659
query3	6736	207	208	207
query4	26174	23306	23604	23306
query5	4359	576	440	440
query6	314	217	196	196
query7	4622	493	285	285
query8	269	224	221	221
query9	8611	2653	2659	2653
query10	481	324	268	268
query11	15682	15082	14881	14881
query12	152	108	105	105
query13	1654	521	402	402
query14	9245	5858	5833	5833
query15	209	190	171	171
query16	7668	446	267	267
query17	1302	717	596	596
query18	2011	411	314	314
query19	198	191	163	163
query20	129	120	116	116
query21	210	121	102	102
query22	4128	4274	4143	4143
query23	34139	33267	33024	33024
query24	8473	2328	2375	2328
query25	531	472	411	411
query26	1232	258	146	146
query27	2776	495	340	340
query28	4289	2122	2118	2118
query29	725	571	441	441
query30	285	221	196	196
query31	982	820	750	750
query32	71	66	64	64
query33	551	328	290	290
query34	803	848	528	528
query35	606	655	550	550
query36	950	982	889	889
query37	115	103	73	73
query38	4178	4199	4180	4180
query39	1498	1424	1454	1424
query40	214	119	106	106
query41	56	54	55	54
query42	125	116	119	116
query43	517	501	494	494
query44	1299	824	824	824
query45	173	169	171	169
query46	837	1027	628	628
query47	1751	1832	1718	1718
query48	377	417	324	324
query49	747	481	403	403
query50	645	691	419	419
query51	4136	4290	4237	4237
query52	105	113	97	97
query53	225	251	182	182
query54	568	557	510	510
query55	83	76	84	76
query56	301	316	305	305
query57	1220	1198	1128	1128
query58	259	262	264	262
query59	2502	2657	2527	2527
query60	322	326	310	310
query61	129	128	125	125
query62	804	713	631	631
query63	224	186	188	186
query64	4400	1180	854	854
query65	4256	4178	4165	4165
query66	1072	405	312	312
query67	15960	15713	15550	15550
query68	8317	873	527	527
query69	497	306	275	275
query70	1174	1113	1048	1048
query71	490	347	309	309
query72	5654	4708	4725	4708
query73	697	576	348	348
query74	8911	9112	8968	8968
query75	3859	3173	2686	2686
query76	3701	1141	690	690
query77	767	355	277	277
query78	10984	11054	10139	10139
query79	2329	876	593	593
query80	605	511	438	438
query81	472	262	222	222
query82	467	124	97	97
query83	248	251	235	235
query84	245	105	86	86
query85	824	367	315	315
query86	387	311	290	290
query87	4444	4452	4346	4346
query88	3503	2313	2274	2274
query89	385	316	289	289
query90	1897	217	211	211
query91	140	143	111	111
query92	76	57	55	55
query93	1673	947	587	587
query94	676	317	203	203
query95	376	298	295	295
query96	490	568	279	279
query97	2691	2762	2599	2599
query98	231	217	219	217
query99	1445	1423	1301	1301
Total cold run time: 275824 ms
Total hot run time: 186196 ms

@doris-robot

ClickBench: Total hot run time: 29.67 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit bf922e703b4553d8c3b34d106fbef86efe16d694, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.24	0.07	0.07
query4	1.60	0.11	0.11
query5	0.43	0.42	0.43
query6	1.16	0.68	0.66
query7	0.02	0.02	0.02
query8	0.04	0.04	0.04
query9	0.62	0.53	0.51
query10	0.57	0.57	0.56
query11	0.16	0.10	0.11
query12	0.16	0.12	0.11
query13	0.63	0.60	0.61
query14	0.79	0.81	0.82
query15	0.92	0.89	0.86
query16	0.39	0.39	0.41
query17	1.08	1.08	1.07
query18	0.23	0.21	0.21
query19	2.00	1.87	1.87
query20	0.02	0.02	0.02
query21	15.38	0.88	0.53
query22	0.76	1.15	0.62
query23	15.01	1.40	0.65
query24	6.98	1.36	0.77
query25	0.45	0.15	0.11
query26	0.63	0.18	0.13
query27	0.07	0.06	0.06
query28	9.57	0.88	0.46
query29	12.58	3.93	3.32
query30	0.24	0.09	0.07
query31	2.83	0.60	0.39
query32	3.24	0.55	0.50
query33	3.03	3.11	3.11
query34	16.09	5.36	4.76
query35	4.78	4.86	4.84
query36	0.72	0.51	0.50
query37	0.10	0.07	0.07
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.14	0.14
query41	0.08	0.03	0.03
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 104.04 s
Total hot run time: 29.67 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 18, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@starocean999 starocean999 merged commit bc72f9c into apache:master Jul 18, 2025
31 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Jul 18, 2025
…able with random distribute (#52993)

If agg table is random hash distribute, would add aggregate node on
scan.
The aggregate function alias expr id is same to the child expr id of
alias.

such as query sql is `select * from db1.tagg`
the query plan is as following, and the `sum(b#1) AS `b`#1`, alias expr
id is same to the child expr id of alias, the id is 1
this would cause hidden problems
```sql
LogicalResultSink[30] ( outputExprs=[a#0, b#1] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#1], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042452160, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )
```

the pr fix this, and the expression  change to `sum(b#1) AS `b`#4`  
```sql
LogicalResultSink[30] ( outputExprs=[a#0, b#4] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#4], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042065062, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )
```
seawinde added a commit to seawinde/doris that referenced this pull request Jul 21, 2025
…able with random distribute (apache#52993)

If agg table is random hash distribute, would add aggregate node on
scan.
The aggregate function alias expr id is same to the child expr id of
alias.

such as query sql is `select * from db1.tagg`
the query plan is as following, and the `sum(b#1) AS `b`#1`, alias expr
id is same to the child expr id of alias, the id is 1
this would cause hidden problems
```sql
LogicalResultSink[30] ( outputExprs=[a#0, b#1] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#1], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042452160, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )
```

the pr fix this, and the expression  change to `sum(b#1) AS `b`#4`  
```sql
LogicalResultSink[30] ( outputExprs=[a#0, b#4] )
+--LogicalAggregate[29] ( groupByExpr=[a#0], outputExpr=[a#0, sum(b#1) AS `b`#4], hasRepeat=false, stats=1 )
   +--LogicalOlapScan ( qualified=internal.db1.tagg, indexName=<index_not_selected>, selectedIndexId=1752042065062, preAgg=ON, operativeCol=[a#0, b#1], stats=1 )
```
morrySnow pushed a commit that referenced this pull request Jul 21, 2025
…r when agg table with random distribute #52993 (#53621)

cherry picked from #52993
yiguolei pushed a commit that referenced this pull request Aug 7, 2025
dataroaring pushed a commit that referenced this pull request Aug 12, 2025
…r when agg table with random distribute #52993 (#53619)

picked from #52993
@gavinchou gavinchou mentioned this pull request Sep 1, 2025
8 participants