[feature](reader) Optimize Complex Type Column Reading with Column Pruning #57204
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR.

The branch was force-pushed several times during review (1c99dc6 → d47ffd5, 5642997 → 3fc502e, 3627661 → 3647221, 34a95f7 → 087f4e0, 0d12c7d → 33c5e80 → f059d14 → bb96ea9), each push followed by a `run buildall` CI trigger. CI posted increment line coverage reports (Cloud UT, FE UT, and FE Regression) after each run, with ClickBench total hot run times of 29.15 s, 28.24 s, and 27.8 s across runs.
… (apache#58765)

### What problem does this PR solve?

Related PR: apache#57204

Problem Summary: This pull request refactors and improves the `PushDownProject` rule in the Nereids optimizer, mainly focusing on the logic for pushing down projections through `UNION` operations. It also introduces a comprehensive unit test to verify the new logic, making the relevant methods more testable and robust.

**Refactoring and Logic Improvements:**

* Refactored the `pushThroughUnion` logic by extracting it into a new static method, making it easier to test and use independently. The main logic now takes explicit arguments instead of relying on the context object.
* Improved the handling of projections and child outputs when pushing down through `UNION`, ensuring correct mapping and replacement of slots. This includes using regulator outputs for children and constant expressions, and making the slot replacement logic static for better testability.

**Testing Enhancements:**

* Added a new unit test class `PushDownProjectTest` to rigorously test the pushdown logic in various scenarios, including unions with and without children. The tests verify both the structure and the correctness of the rewritten plans.

**Code Quality Improvements:**

* Added the `@VisibleForTesting` annotation and imported necessary dependencies to clarify method visibility and intent for testing.
* Replaced some usages of `Collection` with `List` for better type safety and clarity in projection handling.

These changes make the projection pushdown logic more modular, testable, and robust, and provide strong test coverage for future maintenance.
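The slot-mapping idea behind pushing a projection through a `UNION` can be sketched as follows. This is an illustrative model only, not the Nereids implementation: the function name `push_through_union` and the tuple-based expression encoding are assumptions.

```python
# Illustrative sketch (not Doris code): pushing a projection through a UNION
# by re-expressing each projection over a child's own output slots.
def push_through_union(projections, union_outputs, child_outputs_list):
    """For each UNION child, rewrite every projection expression by
    replacing the union's output slots with that child's own slots."""
    pushed = []
    for child_outputs in child_outputs_list:
        # Map each union output slot to the corresponding child slot.
        slot_map = dict(zip(union_outputs, child_outputs))
        child_projs = [
            tuple(slot_map.get(term, term) for term in expr)
            for expr in projections
        ]
        pushed.append(child_projs)
    return pushed

# UNION outputs u1, u2; children expose (a1, a2) and (b1, b2).
result = push_through_union(
    projections=[("struct_element", "u1", "city"), ("u2",)],
    union_outputs=["u1", "u2"],
    child_outputs_list=[["a1", "a2"], ["b1", "b2"]],
)
print(result)
```

After the rewrite, each child computes the projection over its own slots, so the union no longer needs to carry the full complex column upward.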
…uning (apache#57204)

Problem Summary: Optimize Complex Type Column Reading with Column Pruning

This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.

**Key changes:**

- **FE (Frontend)**: Added column access path calculation and type pruning
  - Collects and analyzes access paths for complex type fields
  - Performs type pruning based on access paths
  - Implements projection pushdown for complex types
- **BE (Backend)**: Added selective column reading
  - Uses the columnAccessPath array from the FE to identify required sub-columns
  - Implements selective reading to skip unnecessary sub-columns

**Performance Improvement**: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.

**TODO & Future Optimizations:**

- **Lazy Materialization for Complex Type Sub-columns**: Defer materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates to the storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for `array_size()` operations
- **Null Check Optimization**: Read only offset and null values for `!= null` checks

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
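The FE-side "type pruning based on access paths" can be sketched with a small model. This is a hedged illustration under assumptions: struct types are modeled as plain dicts of field name → sub-type, and `prune_type` is a hypothetical name, not an actual Doris class or method.

```python
# Hedged sketch of access-path-based struct pruning (not the actual FE code).
# A struct type is a dict of field name -> sub-type; scalar types are strings.
def prune_type(col_type, access_paths):
    """Keep only sub-fields reachable from the given access paths.
    An empty path set means the whole column is needed as-is."""
    if not isinstance(col_type, dict) or not access_paths:
        return col_type
    pruned = {}
    for path in access_paths:
        head, rest = path[0], path[1:]
        sub = prune_type(col_type[head], [rest] if rest else [])
        if head in pruned and isinstance(pruned[head], dict) and isinstance(sub, dict):
            # Merge when two access paths share a struct prefix.
            pruned[head].update(sub)
        else:
            pruned[head] = sub
    return pruned

s = {"a": "int", "b": "int", "c": {"x": "string", "y": "double"}}
print(prune_type(s, [("a",), ("c", "x")]))  # b and c.y are pruned away
```

The BE would then read only the sub-columns surviving in the pruned type, which is where the I/O savings for wide structs come from.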
optimize push down project; this can reduce the scan bytes and shuffle bytes by pruning nested columns. #57204 related

the sql:

```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2 on t1.id = t2.id
```

original plan:

```
Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
        |
  Join(t1.id=t2.id)
   /              \
Project(t1.id, t1.s)   Project(t2.id)
   |                      |
Scan(t1)               Scan(t2)
```

optimized plan:

```
Project(coalesce(slot#3, 'beijing'))
        |
  Join(t1.id=t2.id)
   /              \
Project(t1.id, struct_element(t1.s, 'city')#3)   Project(t2.id)
   |                                                |
Scan(t1)                                         Scan(t2)
```

(cherry picked from commit c30c0ff)
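The plan rewrite above can be sketched as a simple substitution: a sub-expression that depends only on the left side of the join is computed below the join and replaced by a slot reference in the top projection, so only the small extracted value is shuffled instead of the whole struct. The names here (`extract_below_join`, the string-encoded expressions, the slot numbering) are hypothetical.

```python
# Toy model of the pushdown shown in the plans above (not optimizer code).
def extract_below_join(top_exprs, left_side_exprs):
    """Replace any sub-expression already computed in the left child's
    projection with its slot, so the join shuffles only that slot."""
    slot_of = {expr: f"slot#{i}" for i, expr in enumerate(left_side_exprs, start=3)}
    return [slot_of.get(e, e) for e in top_exprs]

top = extract_below_join(
    top_exprs=["struct_element(t1.s, 'city')", "t2.id"],
    left_side_exprs=["struct_element(t1.s, 'city')"],
)
print(top)  # the struct_element expression becomes slot#3
```

Combined with type pruning, the scan then only needs `s.city` rather than all of `t1.s`, which is where both the scan-bytes and shuffle-bytes reductions come from.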
…rning (#59286)

### What problem does this PR solve?

Problem Summary:

### Release note

Cherry-pick #58370 #58354 #59043 #58851 #58485 #58682 #58614 #58373 #57204 #58719 #58471 #58573 #58657

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
  - [ ] Regression test
  - [ ] Unit Test
  - [ ] Manual test (add detailed scripts or steps below)
  - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
    - [ ] Previous test can cover this change.
    - [ ] No code files have been changed.
    - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
  - [ ] No.
  - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
  - [ ] No.
  - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->

---------

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
Co-authored-by: Jerry Hu <hushenggang@selectdb.com>
Co-authored-by: lihangyu <lihangyu@selectdb.com>
…s are missing after schema evolution (#59586)

### What problem does this PR solve?

- relate pr: #57204

**Problem Summary:**

When querying struct fields in Iceberg tables after schema evolution, if all queried struct fields are missing in old Parquet files, the code fails with error:

```
File column name 'removed' not found in struct children
```

**Root Cause:**

When all queried struct sub-fields are missing in the old Parquet file (e.g., newly added fields after schema evolution), the code needs to find a reference column from the file schema to get repetition level (RL) and definition level (DL) information. However, if the reference column (e.g., `removed`) was dropped from the table schema, calling `root_node->get_children_node_by_file_column_name()` will fail because the column doesn't exist in `root_node`.

**Scenario:**

1. Create a table with a struct containing: `removed`, `rename`, `keep`, `drop_and_add`
2. Insert data (creates a Parquet file with these fields)
3. Perform schema evolution: DROP `a_struct.removed`, DROP then ADD `a_struct.drop_and_add` (gets a new field ID), ADD `a_struct.added`
4. Query `struct_element(a_struct, 'drop_and_add')` or `struct_element(a_struct, 'added')` on the old file
5. The query fails because:
   - All queried fields (`drop_and_add`, `added`) are missing in the old file
   - The code tries to use `removed` as the reference column (it exists in the file but was dropped from the table schema)
   - Accessing `removed` via `root_node` fails because it doesn't exist in the table schema

### Solution

Use `TableSchemaChangeHelper::ConstNode::get_instance()` instead of looking up from `root_node` for the reference column. Since the reference column is only used to get RL/DL information (not for schema mapping), using `ConstNode` is safe and avoids the issue where the reference column doesn't exist in `root_node`.

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
  - [ ] Regression test
  - [ ] Unit Test
  - [ ] Manual test (add detailed scripts or steps below)
  - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
    - [ ] Previous test can cover this change.
    - [ ] No code files have been changed.
    - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
  - [ ] No.
  - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
  - [ ] No.
  - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
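The reference-column logic described in the scenario above can be sketched in miniature. This is a hedged illustration, not the BE code: `pick_reference_column`, `CONST_NODE`, and the dict-based nodes are stand-ins for the real `ConstNode` / schema-node machinery.

```python
# Sketch of the fix: when every queried struct sub-field is missing from an
# old Parquet file, any existing file column can supply RL/DL info, but it
# must be paired with a constant node rather than a table-schema lookup.
CONST_NODE = {"kind": "const"}  # stand-in for ConstNode::get_instance()

def pick_reference_column(file_fields, queried_fields):
    present = [f for f in queried_fields if f in file_fields]
    if present:
        return present[0], {"kind": "table", "name": present[0]}
    # All queried sub-fields are missing from the file: fall back to any file
    # column for RL/DL. It may have been dropped from the table schema (like
    # `removed`), so pair it with CONST_NODE instead of resolving it there.
    return file_fields[0], CONST_NODE

ref, node = pick_reference_column(
    file_fields=["removed", "rename", "keep", "drop_and_add"],
    queried_fields=["added"],
)
print(ref, node["kind"])
```

The key design point mirrors the PR: because the fallback column is only consulted for RL/DL, it never needs a table-schema node at all.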
…pache#58776) support pruning nested columns through lateral view with the functions: explode, explode_outer, explode_map, explode_map_outer, posexplode, posexplode_outer. apache#57204 related (cherry picked from commit c28afa3)
What problem does this PR solve?
Problem Summary:
Release note
Optimize Complex Type Column Reading with Column Pruning
Description
This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.
Key changes:
FE (Frontend): Added column access path calculation and type pruning
BE (Backend): Added selective column reading
Why
Performance Improvement: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely.
Technical Benefits: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.
TODO & Future Optimizations
- Array Size Optimization: read only offset and null values for `array_size()` operations
- Null Check Optimization: read only offset and null values for `!= null` checks
Check List (For Author)
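The planned Array Size Optimization rests on a simple observation: with an array column's offsets and null map alone, `array_size()` is computable without ever touching the element data. A minimal sketch, assuming a cumulative-offsets layout (the actual Doris storage format may differ):

```python
# Compute per-row array sizes from offsets and the null map only;
# the element data is never read.
def array_sizes(offsets, null_map):
    """offsets[i] is the cumulative end offset of row i (row 0 starts at 0);
    null rows yield None."""
    sizes, prev = [], 0
    for end, is_null in zip(offsets, null_map):
        sizes.append(None if is_null else end - prev)
        prev = end
    return sizes

# Rows: [1, 2, 3], NULL, [], [7]
print(array_sizes([3, 3, 3, 4], [False, True, False, False]))
```

The same offsets-plus-nulls reading strategy would serve the planned `!= null` check optimization, since nullness is fully determined by the null map.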
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)