Skip to content

feat:Add function for row alignment with page mask#1791

Merged
tustvold merged 13 commits into
apache:masterfrom
Ted-Jiang:range_filter
Jun 6, 2022
Merged

feat:Add function for row alignment with page mask#1791
tustvold merged 13 commits into
apache:masterfrom
Ted-Jiang:range_filter

Conversation

@Ted-Jiang

@Ted-Jiang Ted-Jiang commented Jun 5, 2022

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #1790.

Rationale for this change

For now row group filter in datafusion pass a closure to arrow-rs

fn build_row_group_predicate(
    pruning_predicate: &PruningPredicate,
    metrics: ParquetFileMetrics,
) -> Box<dyn FnMut(&RowGroupMetaData, usize) -> bool> {

https://github.com/apache/arrow-datafusion/blob/585bc3a629b92ea7a86ebfe8bf762dbef4155710/datafusion/core/src/physical_plan/file_format/parquet.rs#L559-L562

So for page filter in datafusion, define filter_predicate

 Box<dyn FnMut(&[pageIndex], &[pageLocation], usize) -> &[bool]>

datafusion will send a mask(&[bool]) to arrow-rs,
then use mask call compute_row_ranges to construct RowRanges : row ranges in a row-group (one col) if col is sorted vec size will be 1.
For multi filter combine:
if there are two filters use and connect,use RowRanges::intersection to get the final rowRange; two filters use or connect,use RowRanges::union to get the final rowRange.

After this PR: i will working on column_page_reader, enable read specific row ranges record.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 5, 2022
@codecov-commenter

codecov-commenter commented Jun 5, 2022

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.26549% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.45%. Comparing base (c1a91dc) to head (81292d5).

Files with missing lines Patch % Lines
parquet/src/file/page_index/range.rs 90.26% 22 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1791      +/-   ##
==========================================
+ Coverage   83.39%   83.45%   +0.05%     
==========================================
  Files         198      199       +1     
  Lines       56142    56553     +411     
==========================================
+ Hits        46821    47197     +376     
- Misses       9321     9356      +35     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@Ted-Jiang

Copy link
Copy Markdown
Member Author

@tustvold @alamb @viirya PTAL 😊

@tustvold

tustvold commented Jun 5, 2022

Copy link
Copy Markdown
Contributor

Thank you, I will review this tomorrow (GMT)

Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
}

/// Return the row ranges `Vec(start, len)` of all the selected pages
pub fn compute_row_ranges(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this be used?

Do you need to make something like with_predicate in ReadOptionsBuilder to take a closure for filtering pages based on the ranges?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, basically add a closure called page_predicates in ReadOptionsBuilder like

Box<dyn FnMut(&[pageIndex], &[pageLocation], usize) -> &[bool]>

to generate mask then call this function compute_row_ranges .

I'm thinking about doing this at the end for testing with datafusion.

@tustvold tustvold left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left some high level comments. Will review more thoroughly tomorrow

num_rows: i64,
total_byte_size: i64,
schema_descr: SchemaDescPtr,
// Todo add filter result -> row range

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this the more I wonder whether the metadata structs are the right place to put the index information. They're parsed and interpreted separately from the main metadata, and so I think it makes sense for them to be stored separately?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. page index stored in file-meta level.
My thought is read less pageIndex after rowgroup filter

let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
for (i, rg_meta) in row_groups.into_iter().enumerate() {
let mut keep = true;
for predicate in &mut predicates {
if !predicate(&rg_meta, i) {
keep = false;
break;
}
}
if keep {
filtered_row_groups.push(rg_meta);
}
}

metadata: ParquetMetaData::new(
metadata.file_metadata().clone(),
filtered_row_groups,
),

So i want to read index here and insert it into RowGroupMetaData.
It was just a simple idea at first, maybe we can find a better way in the process of implementation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

page index stored in file-meta level.

It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅

It was just a simple idea at first, maybe we can find a better way in the process of implementation

Provided we take care to ensure we keep things pub(crate) so we don't break APIs, this seems like a good strategy 👍

@Ted-Jiang Ted-Jiang Jun 6, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅

yes separately from RowGroup, before the footer !😂

Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs
Comment thread parquet/src/file/page_index/mod.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated

@tustvold tustvold left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, I am, however, a bit confused as to how it will be used... I had expected something like the following.

For each not-pruned row group:

  • Use Index to identify covering set of rows based on predicates
  • Pass row selection down to RecordReader
  • Add a skip_next_page to PageReader
  • Add a skip_values to ColumnValueDecoder
  • Have RecordReader use a combination of the above to skip pages and rows during decode based on the row selection

This would also naturally extend to #1191

I'm not entirely sure where the datastructure added in this PR would fit into this?

Edit: is the intention to use this for the first step?

Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Comment thread parquet/src/file/page_index/range.rs Outdated
Ted-Jiang and others added 2 commits June 6, 2022 20:27
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@Ted-Jiang

Ted-Jiang commented Jun 6, 2022

Copy link
Copy Markdown
Member Author

Edit: is the intention to use this for the first step?

Basically, yes ! 👍

Use Index to identify covering set of rows based on predicates
Pass row selection down to RecordReader
Add a skip_next_page to PageReader
Add a skip_values to ColumnValueDecoder
Have RecordReader use a combination of the above to skip pages and rows during decode based on the row selection

First step contains:

  1. Use SerializedFileReader read fileMeta get file_metadata(optional on page_index)
  2. Use min_max on row group filter filter useless group (already exist)
  3. Read specific row group page_index construct Index(Support reading PageIndex from column metadata #1761)
  4. Use both index and filter generate pagemask (this one)
    ....

@tustvold I will follow up in PageReader
Hope will get big performance improvement in selective query!

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@Ted-Jiang Ted-Jiang changed the title feat:Implement page filtering with Row Alignment feat:Add function for row alignment with page mask Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add function for row alignment with page mask

4 participants