-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-14923: [C++][Parquet] Fix DELTA_BINARY_PACKED problem on reading the last block with malford bit-width #15241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
|
|
Wait, it's still some bug, sorry. Let me fix it |
|
I change the code to for (uint32_t i = num_miniblocks; i < mini_blocks_per_block_; i++) {
bit_width_data[i] = random() % 128;
}And run multiple times with ASAN, test still passed. Though I don't know how to mock in release, seens we cannot use gmock when |
|
Oh, when running release, still bug occur in ASAN (https://github.com/apache/arrow/actions/runs/3862492798/jobs/6584098451), seems a ub or SIMD related bug. I'll work on it |
|
Well, it's a bit weird. TEST_F(DeltaBitPackEncoding, MalfordMiniblockBitWidth) {
std::shared_ptr<ColumnDescriptor> descr_ = ExampleDescr<Int32Type>();
auto decoder = MakeTypedDecoder<Int32Type>(Encoding::DELTA_BINARY_PACKED, descr_.get());
using c_type = parquet::Int32Type::c_type;
unsigned char good_data[] = "\200\001\004A\237\224\316\362\r\242\220\203- ";
int encode_buffer_size = 273;
int num_values = 65;
std::vector<uint8_t> output_bytes = std::vector<uint8_t>(num_values * sizeof(c_type));
auto decode_buf = reinterpret_cast<c_type*>(output_bytes.data());
decoder->SetData(num_values, &good_data[0], encode_buffer_size);@rok How do you build the |
|
Well, seems that both buffer is 273B. And I need to padding the |
|
This time it failed because unpack32_avx512 meets an ASAN issue. Maybe I need to know how input data is generated and do some adjustions. Go to sleep now. |
|
(By the way, my PC using AMD's CPU, which I can't reproduce unpack32_avx512 problem. So it's important for me to know how test data generated...) |
|
Hey @mapleFU, sorry for not replying earlier, I am preoccupied with the Jira->Github migration :). |
|
Sorry for being crazy and at you for multiple times. By using a self-written I guess the problem is from the data is truncated from first And by the way, min-delta in miniblock might be truncated, so the data is "incomplete". I will generate a new one here instead. |
|
Test regenerated, and representing using Hex. |
|
@pitrou @wjones127 Mind take a look? Since I made lots of naming changes( described #15241 (comment) ), and update a Hex testing format. Don't know whether it's ok. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me why you're making so many changes to the decoder. Was the suggested fix in #14923 (comment) not sufficient?
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this different from the other data above? IIUC, the trailing bytes should be different, but they seem identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other data is equal. (In fact, they're all generated by this->Execute(65, 1) )
|
In most case, #14923 (comment) is right. But we may have the case above: if (total_value_count_ != 1) {
InitBlock();
}Here handling it would becoming trickey. And update so much code because some name really makes me feel confusing. But maybe I can make it simplier |
|
I don't understand, what would be tricky? |
|
Let's go back to the original code: if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
if (ARROW_PREDICT_FALSE(!block_initialized_)) {
buffer[i++] = last_value_;
DCHECK_EQ(i, 1); // we're at the beginning of the page
if (ARROW_PREDICT_FALSE(i == max_values)) {
// When block is uninitialized and i reaches max_values we have two
// different possibilities:
// 1. total_value_count_ == 1, which means that the page may have only
// one value (encoded in the header), and we should not initialize
// any block.
// 2. total_value_count_ != 1, which means we should initialize the
// incoming block for subsequent reads.
if (total_value_count_ != 1) {
InitBlock();
}
break;
}
InitBlock();
}And assume the page has 33 values and 32 values per-mini-block:
But here, it should be |
|
Or maybe I can just add a void InitBlock(bool is_init_page) {
int current_miniblock_values = is_init_page ? 1 : 0;
...
} |
|
Why not, but can you add a test that would reproduce the issue? |
Yes, but adding test case is trickey... I cannot use |
|
Perhaps you can use the encoder and manually change the last bitwidths in the output? |
Sure, it's easy to write that in my personal develop enviroment. But I cannot submit these "mock" code to the codebase, which make writing unit test a bit more trickey. In my local machine, I can just use for (uint32_t i = num_miniblocks; i < mini_blocks_per_block_; i++) {
- bit_width_data[i] = 0;
+ // bit_width_data[i] = 0;
+ bit_width_data[i] = static_cast<uint8_t>(random());
}and run roundtrip tests. But in unit test, I should submit a test file or a string. And I cannot test too much cases, because I need to paste the whole string here. By the way, do you think using Hex to represent binary is ok? And if not ok, is there some good example? |
You can write the code that mutates the encoder output directly in the unit test. Schematically: std::vector<uint8_t> encoded = ...;
for (auto i = encoded.length() - 33; i < encoded.length() - 2; ++i) {
encoded[i] = 0xFF;
}
That works, yes. |
|
Well, I found that rok's testing data already covers that case: And assume the page has 65 values and 32 values per-mini-block: When changing to while (i < max_values) {
... decode values from miniblocks
}
total_values_remaining_ -= max_values;So, I think there are some fixing:
while (i < max_values) {
... decode values from miniblocks, when i increase, decrease remaining
... or passing i into InitBlock
}
CHECK_EQ(i, max_values);@pitrou what's your optionion here? Maintaining a (Personally I'd like to call |
…ead of <= for corner case )
- `total_value_count_` would be changed on every read - Add renaming to `total_value_count_` and `values_count_current_mini_block`
|
Ok, I was not satisfied with the test and implementation, so I ended up rewriting them. |
|
@shanhuuang Do you want to take a look? |
|
Great job, The code look good to me now. |
|
I update the description here: #15241 (comment) If you like, you can edit it for your new changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Hmm, interestingly, there are test failures... |
|
On my MacOs: The problem is that, for zigzag uleb 128, it may consume 2Bytes if lower bound is -63. Because the min-delta would be less than -63, and consuming 2B. I changed this to 31. |
|
Thanks @mapleFU ! |
|
@wjones127 Mind take a look :) ? |
|
@pitrou would you mind merge this patch? Or I should wait @wjones127 to take a look? |
|
I'll wait a bit for @wjones127 to chime in. Otherwise I'll probably merge tomorrow. |
|
Benchmark runs are scheduled for baseline = 4e439f6 and contender = e837f73. e837f73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Problem is mentioned here: #14923
This patch fixes that issue. And the code is a bit complex.
block_initialized_->first_block_initialized_, because it's a mask that indicates that if the first block in page is initialized.total_value_count_->total_values_remaining_. Because it's nottotal values within a page, it meansremaing values to be decoded within a pagevalues_count_current_mini_block_->values_remaining_current_mini_block_, dittototal_value_count_: the total value numbers within a page.InitBlock()toInitBlock()andInitMiniBlockInitBlock()andInitMiniBlock32 32 165 165.And personally, I use the code here for testing:
The code works well in both debug and release mode.