GH-41032: [C++][Parquet] Bugfixes and more tests in boolean arrow decoding#41037
Conversation
5aec277 to
6f7f796
Compare
6f7f796 to
4b811f2
Compare
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Only true_prob = 0.999 but not 1 trigger the bug in issue, because if true_prob = 1, it goes to fastpath without hittin a all 1 block
4b811f2 to
825bb90
Compare
| // set current_index_in_batch to current_batch_size means | ||
| // the whole batch is totally consumed. | ||
| current_index_in_batch = current_batch_size; |
There was a problem hiding this comment.
This is another bug I found by testing. "Release" will not trouble by this, but DCHECK might failed in next_boolean_batch
There was a problem hiding this comment.
What's the point of current_index_in_batch exactly? It always ends up having the same value.
There was a problem hiding this comment.
It's only useful in per-value next when null_count != 0. next_boolean_batch will check all values in this batch is consumed (.i.e current_index_in_batch == batch_size). When null_count != 0, it's useless actually, but it uses next_boolean_batch here.
|
Sorry for add these two untested path, I'll add more test rather than only benchmarks later |
cpp/src/parquet/encoding.cc
Outdated
| if (ARROW_PREDICT_FALSE(!bit_reader_->Advance(values_decoded))) { | ||
| ParquetException::EofException(); | ||
| ParquetException::EofException( | ||
| "PlainDecoder doesn't have enough values in bit_reader_"); |
There was a problem hiding this comment.
IMO, we should not expose variable name to user.
There was a problem hiding this comment.
You're right. I change bit_reader_ to page now
|
@github-actions crossbow submit -g cpp |
cpp/src/parquet/encoding_test.cc
Outdated
|
|
||
| // Initialize input_data_ for the encoder from the expected_array_ values | ||
| const auto& boolean_array = | ||
| static_cast<const ::arrow::BooleanArray&>(*expected_dense_); |
|
Revision: 9c7e01c Submitted crossbow builds: ursacomputing/crossbow @ actions-bbcae774eb |
cpp/src/parquet/encoding_test.cc
Outdated
| for (double true_prob : true_probabilities_) { | ||
| for (int read_batch_size : this->read_batch_sizes_) { | ||
| // Resume the state of decoder | ||
| InitTestCase(encoding, np, true_prob); |
There was a problem hiding this comment.
This is regenerating the input data for each batch size, can we avoid this?
cpp/src/parquet/encoding_test.cc
Outdated
| for (auto true_prob : true_probabilities_) { | ||
| for (int read_batch_size : this->read_batch_sizes_) { | ||
| // Resume the decoder | ||
| InitTestCase(encoding, /*null_probability=*/0, true_prob); |
cpp/src/parquet/encoding_test.cc
Outdated
| std::shared_ptr<Buffer> buffer_; | ||
| }; | ||
|
|
||
| TEST_F(TestBooleanArrowDecoding, CheckDecodeArrowUsingDenseBuilderPlain) { |
There was a problem hiding this comment.
What is a "dense builder"? Can we rename this? For example DecodeArrowPlain.
6bfa3c4 to
f5cafc4
Compare
|
Thanks a lot for the quick submission @mapleFU ! |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f8784ac. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ow decoding (apache#41037) ### Rationale for this change Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug ### What changes are included in this PR? 1. Add tests for DecodeArrow with nulls and without nulls 2. Fix 2 bugs ### Are these changes tested? Yes ### Are there any user-facing changes? no * GitHub Issue: apache#41032 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <anmmscs_maple@qq.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Previous patches ( #40876 , #40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
no