
Conversation

@thinkharderdev
Contributor

Which issue does this PR close?

Closes #2853

Rationale for this change

On fallback, ByteArrayEncoder wasn't tracking the number of values written, so once the dictionary page hit its size limit and the encoder fell back, all remaining data was written into a single data page.

What changes are included in this PR?

Make sure ByteArrayEncoder tracks the number of encoded values after it falls back to the fallback encoder.
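
For context, a minimal sketch of the idea (field and type names here are illustrative stand-ins, not the exact arrow-rs internals):

struct EncoderSketch {
    dict_indices: Option<Vec<u64>>, // Some(..) while dictionary encoding is still active
    fallback: Vec<u64>,             // stand-in for the fallback encoder
    num_values: usize,              // values buffered for the current data page
}

impl EncoderSketch {
    fn encode(&mut self, indices: &[u64]) {
        match &mut self.dict_indices {
            Some(dict) => dict.extend_from_slice(indices),
            None => {
                // The fix: keep counting values after falling back, so the column
                // writer can tell when the current data page is large enough to flush.
                self.num_values += indices.len();
                self.fallback.extend_from_slice(indices);
            }
        }
    }
}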

Are there any user-facing changes?

This will change the way data pages are laid out in some cases.


@github-actions bot added the parquet (Changes to the parquet crate) label on Oct 9, 2022
Contributor

@tustvold left a comment

Thank you 👍


#[test]
fn arrow_writer_page_size() {
    let mut rng = thread_rng();
Contributor

I think we should either seed this, or loosen the assert below. Otherwise I worry that, depending on what values are generated, we may end up with more or fewer pages (the dictionary page will only spill once it has seen sufficiently many distinct values, which technically could occur at any point).
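
For instance, seeding would make the generated values (and hence the spill point) deterministic. A sketch, assuming the rand crate:

use rand::{rngs::StdRng, SeedableRng};

// A fixed seed makes each run generate the same values, so the number of
// pages written no longer depends on chance.
let mut rng = StdRng::seed_from_u64(42);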


let props = WriterProperties::builder()
    .set_max_row_group_size(usize::MAX)
    .set_data_pagesize_limit(256)
Contributor

You could potentially set the dictionary page size smaller to verify that as well, but up to you
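
For example, a sketch of how that might look, extending the builder above (the 512-byte value is arbitrary):

let props = WriterProperties::builder()
    .set_max_row_group_size(usize::MAX)
    .set_data_pagesize_limit(256)
    // also exercise the dictionary page size limit (value chosen arbitrarily)
    .set_dictionary_pagesize_limit(512)
    .build();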

Contributor Author

So I think there are still some issues here. It is still ignoring the size limit. It is at least respecting the write_batch_size though.

Contributor

That is expected and I believe consistent with other parquet writers. The limit is best effort.
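
As I understand it, those limits are only checked between write batches, which is part of why they are best effort. A sketch of tuning that granularity (the value is arbitrary):

let props = WriterProperties::builder()
    // smaller batches mean the page size limit is checked more often
    .set_write_batch_size(64)
    .build();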

  Some(dict_encoder) => dict_encoder.encode(values, indices),
- None => encoder.fallback.encode(values, indices),
+ None => {
+     encoder.num_values += indices.len();
Contributor

@tustvold Oct 9, 2022

I'm guessing the problem was that whilst the estimated_data_page_size would increase, the lack of any values would cause it to erroneously not try to flush the page?

In particular https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L567

Contributor Author

yep, exactly
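
For reference, a paraphrased sketch of that check (the real logic lives in the column writer and the exact names may differ):

// If the encoder reports zero buffered values, the size comparison is never
// reached, so a fallback encoder that forgot to count values would never
// trigger a data page flush.
fn should_add_data_page(num_buffered_values: usize, estimated_page_size: usize, limit: usize) -> bool {
    num_buffered_values > 0 && estimated_page_size >= limit
}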

  Some(dict_encoder) => dict_encoder.encode(values, indices),
- None => encoder.fallback.encode(values, indices),
+ None => {
+     encoder.num_values += indices.len();
Contributor

@tustvold Oct 9, 2022

Should we be doing this regardless of whether we've fallen back? I think currently this will fail to flush a dictionary-encoded data page even if it has reached sufficient size?

Contributor Author

Maybe, but when we do it that way it causes a panic, which may also be a bug:

General("Must flush data pages before flushing dictionary")'

Contributor

I think we need to reset num_values to 0 when we flush a data page

Contributor Author

I think it already does that, right?

num_values: std::mem::take(&mut self.num_values),
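
std::mem::take replaces the field with its default value (0 for usize) and returns the previous value, so building the page resets the counter. A minimal illustration:

let mut num_values: usize = 128;
let for_page = std::mem::take(&mut num_values); // returns 128, leaves 0 behind
assert_eq!(for_page, 128);
assert_eq!(num_values, 0); // counter is reset for the next page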

Member

@Ted-Jiang left a comment

👍

@tustvold merged commit 0268bba into master on Oct 10, 2022
@ursabot

ursabot commented Oct 10, 2022

Benchmark runs are scheduled for baseline = c3aac93 and contender = 0268bba. 0268bba is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
