Skip to content

GH-41032: [C++][Parquet] Bugfixes and more tests in boolean arrow decoding#41037

Merged
pitrou merged 6 commits intoapache:mainfrom
mapleFU:more-tests-in-boolean-arrow-decoding
Apr 8, 2024
Merged

GH-41032: [C++][Parquet] Bugfixes and more tests in boolean arrow decoding#41037
pitrou merged 6 commits intoapache:mainfrom
mapleFU:more-tests-in-boolean-arrow-decoding

Conversation

@mapleFU
Copy link
Copy Markdown
Member

@mapleFU mapleFU commented Apr 5, 2024

Rationale for this change

Previous patches ( #40876 , #40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

What changes are included in this PR?

  1. Add tests for DecodeArrow with nulls and without nulls
  2. Fix 2 bugs

Are these changes tested?

Yes

Are there any user-facing changes?

no

@mapleFU mapleFU requested a review from wgtmac as a code owner April 5, 2024 16:14
@mapleFU mapleFU changed the title GH-41032: [C++][Parquet] Bugfixes More tests in boolean arrow decoding GH-41032: [C++][Parquet] Bugfixes and more tests in boolean arrow decoding Apr 5, 2024
@github-actions github-actions bot added the awaiting review Awaiting review label Apr 5, 2024
@mapleFU
Copy link
Copy Markdown
Member Author

mapleFU commented Apr 5, 2024

cc @pitrou @wgtmac

@mapleFU mapleFU force-pushed the more-tests-in-boolean-arrow-decoding branch from 5aec277 to 6f7f796 Compare April 5, 2024 16:16
@mapleFU mapleFU added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 5, 2024
@mapleFU mapleFU force-pushed the more-tests-in-boolean-arrow-decoding branch from 6f7f796 to 4b811f2 Compare April 5, 2024 16:17
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only true_prob = 0.999 but not 1 trigger the bug in issue, because if true_prob = 1, it goes to fastpath without hittin a all 1 block

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 5, 2024
@mapleFU mapleFU requested a review from pitrou April 5, 2024 16:20
@mapleFU mapleFU force-pushed the more-tests-in-boolean-arrow-decoding branch from 4b811f2 to 825bb90 Compare April 5, 2024 16:21
Comment on lines +3182 to +3184
// set current_index_in_batch to current_batch_size means
// the whole batch is totally consumed.
current_index_in_batch = current_batch_size;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another bug I found by testing. "Release" will not trouble by this, but DCHECK might failed in next_boolean_batch

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of current_index_in_batch exactly? It always ends up having the same value.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only useful in per-value next when null_count != 0. next_boolean_batch will check all values in this batch is consumed (.i.e current_index_in_batch == batch_size). When null_count != 0, it's useless actually, but it uses next_boolean_batch here.

@mapleFU
Copy link
Copy Markdown
Member Author

mapleFU commented Apr 5, 2024

Sorry for add these two untested path, I'll add more test rather than only benchmarks later

if (ARROW_PREDICT_FALSE(!bit_reader_->Advance(values_decoded))) {
ParquetException::EofException();
ParquetException::EofException(
"PlainDecoder doesn't have enough values in bit_reader_");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should not expose variable name to user.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I change bit_reader_ to page now

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 8, 2024

@github-actions crossbow submit -g cpp


// Initialize input_data_ for the encoder from the expected_array_ values
const auto& boolean_array =
static_cast<const ::arrow::BooleanArray&>(*expected_dense_);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: checked_cast?

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 8, 2024

Revision: 9c7e01c

Submitted crossbow builds: ursacomputing/crossbow @ actions-bbcae774eb

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

for (double true_prob : true_probabilities_) {
for (int read_batch_size : this->read_batch_sizes_) {
// Resume the state of decoder
InitTestCase(encoding, np, true_prob);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is regenerating the input data for each batch size, can we avoid this?

for (auto true_prob : true_probabilities_) {
for (int read_batch_size : this->read_batch_sizes_) {
// Resume the decoder
InitTestCase(encoding, /*null_probability=*/0, true_prob);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

std::shared_ptr<Buffer> buffer_;
};

TEST_F(TestBooleanArrowDecoding, CheckDecodeArrowUsingDenseBuilderPlain) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "dense builder"? Can we rename this? For example DecodeArrowPlain.

@mapleFU mapleFU force-pushed the more-tests-in-boolean-arrow-decoding branch from 6bfa3c4 to f5cafc4 Compare April 8, 2024 13:35
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 8, 2024

Thanks a lot for the quick submission @mapleFU !

@pitrou pitrou merged commit f8784ac into apache:main Apr 8, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Apr 8, 2024
@mapleFU mapleFU deleted the more-tests-in-boolean-arrow-decoding branch April 8, 2024 14:40
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f8784ac.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: C++ Component: Parquet Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants