@924060929 924060929 commented Mar 8, 2023

Proposed changes

This PR refactors column pruning to use a visitor. The benefits:

  1. It is easy to add column pruning for a new plan: implement the interface OutputPrunable if the plan has an output field, or do nothing if it does not. There is no need to add a new rule like PruneXxxChildColumns. Only a few scenarios need to override the visit function with special logic, such as pruning LogicalSetOperation and Aggregate.
  2. It supports shrinking the output field of some plans, which skips useless operations and so improves performance.
    Example:
select id 
from (
  select id, sum(age)
  from student
  group by id
)a

We should prune the useless sum(age) in the aggregate.
Before the refactor:

LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
   +--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0, sum(age#2) AS `sum(age)`#4], hasRepeat=false )
      +--LogicalProject ( distinct=false, projects=[id#0, age#2], excepts=[], canEliminate=true )
         +--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )

After the refactor:

LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
   +--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0], hasRepeat=false )
      +--LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
         +--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
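
The pruning described above can be sketched as a single top-down pass. This is a minimal illustration, not the code in this PR: only the name OutputPrunable comes from the description, while its methods, the Plan, Expr and OutputNode classes, and the string rendering are hypothetical stand-ins for the real Nereids plan types. It assumes every plan has at most one child and that columns are identified by name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ColumnPruneSketch {

    /** A named output expression plus the child columns it reads. */
    static final class Expr {
        final String name;
        final Set<String> inputs;

        Expr(String name, String... inputs) {
            this.name = name;
            this.inputs = new HashSet<>(Arrays.asList(inputs));
        }
    }

    /** Hypothetical plan node with at most one child (null for a scan). */
    static class Plan {
        final String label;
        final Plan child;

        Plan(String label, Plan child) {
            this.label = label;
            this.child = child;
        }
    }

    /** Plans with an output field opt in to pruning; these methods are assumptions. */
    interface OutputPrunable {
        List<Expr> outputs();

        /** Child columns needed regardless of the kept outputs, e.g. group-by keys. */
        Set<String> alwaysRequired();

        Plan withOutputs(List<Expr> kept, Plan newChild);
    }

    /** Stands in for both a project and an aggregate. */
    static final class OutputNode extends Plan implements OutputPrunable {
        final List<Expr> outputs;
        final Set<String> alwaysRequired;

        OutputNode(String label, List<Expr> outputs, Set<String> alwaysRequired, Plan child) {
            super(label, child);
            this.outputs = outputs;
            this.alwaysRequired = alwaysRequired;
        }

        public List<Expr> outputs() { return outputs; }
        public Set<String> alwaysRequired() { return alwaysRequired; }
        public Plan withOutputs(List<Expr> kept, Plan newChild) {
            return new OutputNode(label, kept, alwaysRequired, newChild);
        }
    }

    /** Top-down pass: keep only the outputs the parent requires, then recurse. */
    static Plan prune(Plan plan, Set<String> required) {
        if (plan == null) {
            return null;
        }
        if (!(plan instanceof OutputPrunable)) {
            // No output field: forward the parent's requirement unchanged.
            return new Plan(plan.label, prune(plan.child, required));
        }
        OutputPrunable prunable = (OutputPrunable) plan;
        List<Expr> kept = new ArrayList<>();
        Set<String> childRequired = new HashSet<>(prunable.alwaysRequired());
        for (Expr e : prunable.outputs()) {
            if (required.contains(e.name)) {
                kept.add(e);
                childRequired.addAll(e.inputs);
            }
        }
        return prunable.withOutputs(kept, prune(plan.child, childRequired));
    }

    /** Renders the plan chain from root to leaf on one line. */
    static String describe(Plan p) {
        String self = p.label;
        if (p instanceof OutputPrunable) {
            List<String> names = new ArrayList<>();
            for (Expr e : ((OutputPrunable) p).outputs()) {
                names.add(e.name);
            }
            self += names;
        }
        return p.child == null ? self : self + " <- " + describe(p.child);
    }

    /** Builds the plan for the example query, before pruning. */
    static Plan example() {
        Set<String> none = Collections.emptySet();
        Plan scan = new Plan("scan(student)", null);
        Plan project = new OutputNode("project",
                Arrays.asList(new Expr("id", "id"), new Expr("age", "age")), none, scan);
        Plan agg = new OutputNode("aggregate",
                Arrays.asList(new Expr("id", "id"), new Expr("sum(age)", "age")),
                Collections.singleton("id"), project);
        Plan alias = new Plan("alias(a)", agg);
        return new OutputNode("project", Arrays.asList(new Expr("id", "id")), none, alias);
    }

    public static void main(String[] args) {
        Plan before = example();
        System.out.println(describe(before));
        System.out.println(describe(prune(before, Collections.singleton("id"))));
    }
}
```

In this sketch the alias node needs no pruning code at all because it has no output field, which mirrors the "do nothing" case in point 1. The aggregate keeps its group-by key as a required child column even when the parent does not ask for it, which is the kind of special logic the description mentions for Aggregate.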

Checklist(Required)

  • Does it affect the original behavior
  • Has unit tests been added
  • Has document been added or modified
  • Does it need to update dependencies
  • Does this PR support rollback (If NO, please explain WHY)

@924060929
Contributor Author

run buildall

@hello-stephen
Contributor

hello-stephen commented Mar 8, 2023

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 34.49 seconds
stream load tsv: 457 seconds loaded 74807831229 Bytes, about 156 MB/s
stream load json: 23 seconds loaded 2358488459 Bytes, about 97 MB/s
stream load orc: 74 seconds loaded 1101869774 Bytes, about 14 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230323145712_clickbench_pr_119711.html

@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch 2 times, most recently from 0dc3b93 to daca07b Compare March 10, 2023 05:11
@924060929
Contributor Author

run p0

@924060929 924060929 force-pushed the refactor_column_prune branch from daca07b to 651f64f Compare March 10, 2023 11:17
@924060929
Contributor Author

run buildall

@924060929 924060929 marked this pull request as ready for review March 10, 2023 11:21
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from 651f64f to b008264 Compare March 13, 2023 04:51
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from b008264 to 0ebdbb0 Compare March 14, 2023 02:26
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from 0ebdbb0 to fa9f569 Compare March 15, 2023 06:07
@github-actions github-actions bot added area/planner Issues or PRs related to the query planner kind/test labels Mar 15, 2023
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from 91daa1a to c441dec Compare March 15, 2023 07:55
@924060929
Contributor Author

run buildall

@924060929
Contributor Author

@qzsee PTAL

@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from 751c940 to e12b20c Compare March 16, 2023 03:48
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from e12b20c to eea820d Compare March 16, 2023 07:52
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from eea820d to ddcc855 Compare March 16, 2023 08:01
@924060929
Contributor Author

run buildall

@924060929 924060929 force-pushed the refactor_column_prune branch from 7f6eb12 to 54f417f Compare March 23, 2023 11:20
@924060929
Contributor Author

run buildall

@924060929
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 23, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@morrySnow morrySnow merged commit d3e7f12 into apache:master Mar 24, 2023
gnehil pushed a commit to gnehil/doris that referenced this pull request Apr 21, 2023

Labels: approved, area/nereids, area/planner, area/vectorization, kind/test, reviewed
