feat(parquet): Support page‑level pruning#14214

Closed
zhli1142015 wants to merge 15 commits into facebookincubator:main from zhli1142015:index_page

Conversation

@zhli1142015
Contributor

@zhli1142015 zhli1142015 commented Jul 24, 2025

This PR implements Parquet page pruning:
• ColumnPageIndex: Implements parsing of column index pages and offset index
pages, and the function to convert relevant metadata into
dwio::common::ColumnStatistics.
• RowRanges: New type that represents pushdown filter evaluation results
for variably sized data pages, with added support in MetadataFilter.
• Parquet Reader: Implements index page reading, merges filtering results
across columns, and generates final RowRanges. Skips unneeded rows during
data reading based on computed RowRanges.
• ParquetData: Uses column index statistics to apply pushdown filters for page
skipping. In enqueueRowGroup, loads only the data pages required by the final
RowRanges.
• PageReader: Skips unneeded pages using offset index information.
Fixes: #14195
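
For readers unfamiliar with page-level pruning, the flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `RowRange`, `PageInfo`, `selectPages`, `intersect`, and `canSkipPage` are hypothetical names, statistics are assumed to be plain `int64_t` min/max values with no nulls, and the filter is assumed to be a closed range `[lo, hi]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not the API added by this PR.

// Half-open row interval [begin, end) inside one row group.
struct RowRange {
  int64_t begin;
  int64_t end;
};

// Sorted, non-overlapping list of row intervals.
using RowRangeList = std::vector<RowRange>;

// Per-page metadata: min/max as found in the column index, firstRow as
// found in the offset index (first row of the page within the row group).
struct PageInfo {
  int64_t firstRow;
  int64_t min;
  int64_t max;
};

// Keeps the rows of every page whose [min, max] overlaps the filter range
// [lo, hi]. Adjacent surviving pages are merged into a single interval.
RowRangeList selectPages(
    const std::vector<PageInfo>& pages,
    int64_t numRows,
    int64_t lo,
    int64_t hi) {
  RowRangeList out;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    if (pages[i].max < lo || pages[i].min > hi) {
      continue; // No value in this page can pass the filter.
    }
    const int64_t begin = pages[i].firstRow;
    const int64_t end =
        i + 1 < pages.size() ? pages[i + 1].firstRow : numRows;
    if (!out.empty() && out.back().end == begin) {
      out.back().end = end;
    } else {
      out.push_back({begin, end});
    }
  }
  return out;
}

// Intersects two sorted range lists. This corresponds to AND-ing filters on
// two columns whose pages have different row boundaries.
RowRangeList intersect(const RowRangeList& a, const RowRangeList& b) {
  RowRangeList out;
  std::size_t i = 0;
  std::size_t j = 0;
  while (i < a.size() && j < b.size()) {
    const int64_t begin = std::max(a[i].begin, b[j].begin);
    const int64_t end = std::min(a[i].end, b[j].end);
    if (begin < end) {
      out.push_back({begin, end});
    }
    // Advance whichever interval ends first.
    if (a[i].end < b[j].end) {
      ++i;
    } else {
      ++j;
    }
  }
  return out;
}

// Returns true if the page covering [pageBegin, pageEnd) holds no selected
// row, so a reader could skip loading and decoding it.
bool canSkipPage(
    const RowRangeList& ranges,
    int64_t pageBegin,
    int64_t pageEnd) {
  for (const auto& r : ranges) {
    if (r.begin < pageEnd && pageBegin < r.end) {
      return false;
    }
  }
  return true;
}
```

With two columns whose pages start at rows {0, 100, 200} and {0, 150, 250} in a 300-row row group, each filter yields its own range list, and only the intersection needs to be read. Pages of any projected column that fall entirely outside the intersection can then be skipped.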

@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 24, 2025
@zhli1142015
Contributor Author

@majetideepak @rui-mo @Yuhta could you please help review this PR?
For more details, please see #14195.

Comment thread velox/dwio/common/MetadataFilter.h
@zhli1142015 zhli1142015 force-pushed the index_page branch 2 times, most recently from 83f0664 to 9e5c920 July 28, 2025 07:24
Contributor

@rui-mo rui-mo left a comment

Thanks for this nice work! Added some comments.

Comment thread velox/dwio/parquet/reader/ParquetData.h Outdated
Comment thread velox/dwio/parquet/reader/ParquetData.cpp
Comment thread velox/dwio/parquet/reader/ParquetData.cpp
Comment thread velox/dwio/parquet/reader/ParquetData.cpp
Comment thread velox/dwio/parquet/reader/ParquetData.cpp
Comment thread velox/dwio/parquet/reader/ParquetData.cpp
Comment thread velox/dwio/parquet/reader/Metadata.h Outdated
@zhli1142015 zhli1142015 force-pushed the index_page branch 2 times, most recently from a55dd79 to 6decdbf July 31, 2025 05:44
@zhli1142015 zhli1142015 requested review from Yuhta and rui-mo August 1, 2025 03:24
@zhli1142015
Contributor Author

@rui-mo and @Yuhta, I resolved the comments. Could you continue reviewing this PR?

@zhli1142015 zhli1142015 force-pushed the index_page branch 3 times, most recently from 20408d9 to 2108eb3 August 14, 2025 13:19
Contributor

@rui-mo rui-mo left a comment

Thanks. Just added some nits.

Comment thread velox/dwio/common/RowRanges.h Outdated
Comment thread velox/dwio/common/RowRanges.h Outdated
Comment thread velox/dwio/parquet/reader/ColumnPageIndex.h Outdated
Comment thread velox/dwio/parquet/reader/ColumnPageIndex.h
Comment thread velox/dwio/parquet/reader/ColumnPageIndex.h
Comment thread velox/dwio/parquet/reader/Metadata.h Outdated
Comment thread velox/dwio/parquet/reader/ParquetReader.cpp Outdated
Contributor

@rui-mo rui-mo left a comment

Thanks.

@zhli1142015
Contributor Author

@majetideepak could you help review this change? Thanks.

@majetideepak
Collaborator

@zhli1142015 I will take a look today.

@majetideepak
Collaborator

@zhli1142015 I started reviewing this. I need a couple more days to complete. Thanks.

@zhli1142015
Contributor Author

The unit test failure is unrelated to this change; see #15093.

@zhli1142015 zhli1142015 force-pushed the index_page branch 2 times, most recently from d37ab69 to a4b9237 October 15, 2025 11:10
@stale

stale Bot commented Feb 15, 2026

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale Bot added the stale label Feb 15, 2026
@stale stale Bot closed this Mar 2, 2026
@FelixYBW

If the filter column is sorted or z-ordered, the benefit is obvious:

| Metric | read filter column | read filter column | read 2 column | read 2 column | read 3 column | read 3 column |
| --- | --- | --- | --- | --- | --- | --- |
| page skip | Y | N | Y | N | Y | N |
| Elapsed time | 18s | 35s | 36s | 64s | 51s | 115s |
| data source add split time | 2.0 m; min: 0 ms; med: 306 ms; max: 911 ms (150) | 49.0 s; min: 0 ms; med: 110 ms; max: 519 ms (524) | 1.6 m; min: 0 ms; med: 257 ms; max: 875 ms (895) | 43.2 s; min: 0 ms; med: 114 ms; max: 485 ms (1110) | 1.1 m; min: 0 ms; med: 179 ms; max: 1.1 s (1724) | 46.0 s; min: 0 ms; med: 121 ms; max: 577 ms (2159) |
| data source read time | 4.9 m; min: 297 ms; med: 823 ms; max: 1.2 s (219) | 3.4 m; min: 236 ms; med: 566 ms; max: 849 ms (606) | 6.3 m; min: 544 ms; med: 1.0 s; max: 1.5 s (850) | 1.8 m; min: 112 ms; med: 285 ms; max: 677 ms (1359) | 3.7 m; min: 296 ms; med: 579 ms; max: 1.4 s (1461) | 1.3 m; min: 87 ms; med: 206 ms; max: 437 ms (2162) |
| io wait time | 51.4 m; min: 3.2 s; med: 8.4 s; max: 10.7 s (1) | 31.7 m; min: 2.0 s; med: 5.3 s; max: 7.4 s (364) | 59.9 m; min: 3.9 s; med: 9.9 s; max: 15.5 s (727) | 1.04 h; min: 3.6 s; med: 10.2 s; max: 14.6 s (1121) | 1.65 h; min: 5.1 s; med: 16.2 s; max: 22.9 s (1453) | 1.44 h; min: 4.8 s; med: 14.1 s; max: 18.6 s (1816) |
| page load time | 4.8 m; min: 290 ms; med: 802 ms; max: 1.2 s (219) | 3.3 m; min: 229 ms; med: 543 ms; max: 825 ms (606) | 4.8 m; min: 473 ms; med: 802 ms; max: 1.2 s (1012) | 6.5 m; min: 446 ms; med: 1.1 s; max: 1.8 s (1113) | 8.6 m; min: 378 ms; med: 1.4 s; max: 2.4 s (1594) | 8.5 m; min: 644 ms; med: 1.4 s; max: 2.2 s (1847) |
| time of scan | 41.2 m; min: 2.5 s; med: 6.8 s; max: 9.3 s (1) | 24.5 m; min: 1.5 s; med: 4.1 s; max: 6.2 s (364) | 49.0 m; min: 3.1 s; med: 8.1 s; max: 12.5 s (727) | 51.4 m; min: 2.9 s; med: 8.4 s; max: 12.3 s (1090) | 1.41 h; min: 4.3 s; med: 13.9 s; max: 19.3 s (1453) | 1.22 h; min: 3.8 s; med: 12.0 s; max: 16.1 s (1816) |
| time of scan and filter | 7.3 m; min: 547 ms; med: 1.2 s; max: 1.9 s (60) | 4.4 m; min: 428 ms; med: 715 ms; max: 1.0 s (378) | 8.4 m; min: 723 ms; med: 1.4 s; max: 2.2 s (850) | 8.6 m; min: 737 ms; med: 1.4 s; max: 2.4 s (1110) | 15.3 m; min: 876 ms; med: 2.5 s; max: 3.6 s (1515) | 11.8 m; min: 839 ms; med: 2.0 s; max: 2.8 s (1847) |
| number of input bytes | 30.9 GiB; min: 29.4 MiB; med: 88.9 MiB; max: 89.3 MiB (0) | 30.9 GiB; min: 29.4 MiB; med: 88.9 MiB; max: 89.3 MiB (363) | 160.6 GiB; min: 149.3 MiB; med: 452.2 MiB; max: 536.3 MiB (756) | 93.2 GiB; min: 88.6 MiB; med: 268.1 MiB; max: 269.1 MiB (1316) | 213.7 GiB; min: 199.7 MiB; med: 605.0 MiB; max: 627.0 MiB (1482) | 155.4 GiB; min: 147.8 MiB; med: 447.3 MiB; max: 449.0 MiB (2042) |
| number of input vectors | 676,321 | 676,321 | 3,522,122 | 676,321 | 3,522,122 | 676,321 |
| number of memory allocations | 4,308,001 | 4,308,001 | 12,088,039 | 8,111,587 | 27,155,091 | 11,898,597 |
| number of output bytes | 30.9 GiB; min: 29.4 MiB; med: 88.9 MiB; max: 89.3 MiB (0) | 30.9 GiB; min: 29.4 MiB; med: 88.9 MiB; max: 89.3 MiB (363) | 160.6 GiB; min: 149.3 MiB; med: 452.2 MiB; max: 536.3 MiB (756) | 93.2 GiB; min: 88.6 MiB; med: 268.1 MiB; max: 269.1 MiB (1316) | 213.7 GiB; min: 199.7 MiB; med: 605.0 MiB; max: 627.0 MiB (1482) | 155.4 GiB; min: 147.8 MiB; med: 447.3 MiB; max: 449.0 MiB (2042) |
| number of output vectors | 676,321 | 676,321 | 3,522,122 | 676,321 | 3,522,122 | 676,321 |
| number of preloaded splits | 7,911 | 8,164 | 8,175 | 8,510 | 8,602 | 8,587 |
| number of processed pages | 8,673 | 8,673 | 1,442,968 | 33,234 | 2,885,936 | 60,015 |
| number of processed row groups | 8,673 | 8,673 | 9,064 | 8,673 | 9,064 | 8,673 |
| number of raw input bytes | 2.2 GiB; min: 2.3 MiB; med: 6.0 MiB; max: 12.8 MiB (26) | 2.2 GiB; min: 2.3 MiB; med: 6.0 MiB; max: 12.8 MiB (389) | 26.3 GiB; min: 25.0 MiB; med: 75.2 MiB; max: 76.3 MiB (787) | 24.0 GiB; min: 24.7 MiB; med: 67.9 MiB; max: 76.2 MiB (1426) | 79.2 GiB; min: 74.8 MiB; med: 226.6 MiB; max: 228.2 MiB (1452) | 52.9 GiB; min: 55.8 MiB; med: 150.0 MiB; max: 172.4 MiB (2152) |
| number of raw input rows | 28,799,983,563 | 28,799,983,563 | 28,799,983,549 | 28,799,983,563 | 28,799,983,549 | 28,799,983,563 |
| number of skipped pages | 20,595 | 20,595 | N/A | 55,472 | N/A | 125,090 |
| peak memory bytes | 762.4 MiB; min: 1725.9 KiB; med: 2.2 MiB; max: 2.3 MiB (146) | 734.4 MiB; min: 1660.1 KiB; med: 2.0 MiB; max: 2.3 MiB (382) | 9.0 GiB; min: 17.6 MiB; med: 26.5 MiB; max: 27.1 MiB (1028) | 9.0 GiB; min: 21.0 MiB; med: 26.3 MiB; max: 26.5 MiB (1143) | 28.8 GiB; min: 49.9 MiB; med: 85.1 MiB; max: 87.2 MiB (1541) | 20.7 GiB; min: 42.4 MiB; med: 60.4 MiB; max: 60.8 MiB (1824) |
| storage read bytes | 2.2 GiB; min: 2.3 MiB; med: 6.1 MiB; max: 12.8 MiB (29) | 2.2 GiB; min: 2.3 MiB; med: 6.1 MiB; max: 12.8 MiB (392) | 26.7 GiB; min: 25.4 MiB; med: 76.3 MiB; max: 77.5 MiB (787) | 24.0 GiB; min: 24.7 MiB; med: 68.0 MiB; max: 76.3 MiB (1426) | 79.7 GiB; min: 75.2 MiB; med: 228.1 MiB; max: 229.7 MiB (1452) | 53.0 GiB; min: 55.8 MiB; med: 150.1 MiB; max: 172.5 MiB (2152) |
| storage read counts | 26,321 | 27,103 | 34,994 | 35,776 | 43,667 | 44,449 |

Labels

CLA Signed, stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Page‑Level Pruning Using Page Index in Velox Parquet Reader

6 participants