GH-39560: [C++][Parquet] Add integration test for BYTE_STREAM_SPLIT (#39570)
pitrou merged 1 commit into apache:main
Conversation
ASSERT_EQ(values[1], 0.4001572f);
ASSERT_EQ(values[kNumRows - 2], -0.39944902f);
So we didn't need to check the whole file?
Probably not? We don't want to hardcode the 300 values, I think.
To elaborate a bit on this: if the first and last values decode correctly, it is very unlikely for other values to be incorrect, given the structure of the encoding.
Also, we have roundtrip tests to ensure internal consistency.
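To illustrate the point about the encoding's structure: BYTE_STREAM_SPLIT scatters the k-th byte of every value into the k-th contiguous stream, so a decoder offset or indexing bug tends to corrupt values throughout the column rather than just a few in the middle. A minimal pure-Python sketch of the idea (illustrative only, not Parquet's actual implementation; the function names are made up):

```python
import struct

def byte_stream_split_encode(values):
    # Pack each float32 into 4 little-endian bytes, then gather the k-th
    # byte of every value into stream k (4 streams for float32).
    raw = [struct.pack('<f', v) for v in values]
    return b''.join(bytes(b[k] for b in raw) for k in range(4))

def byte_stream_split_decode(data, n):
    # Reverse the transform: byte k of value i lives at offset k * n + i.
    return [struct.unpack('<f', bytes(data[k * n + i] for k in range(4)))[0]
            for i in range(n)]

values = [1.7640524, 0.4001572, -0.39944902]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded, len(values))
assert all(abs(a - b) < 1e-6 for a, b in zip(values, decoded))
```

Because every decoded value draws one byte from each stream, an off-by-one in any stream offset would shift bytes for (nearly) all values at once, which is why checking values near both ends of the column is a reasonable smoke test.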
I think it's OK. One caveat is that the file was generated by pyarrow, which uses our own C++ encoder...
file_reader->RowGroup(row_group)->Column(column));
std::vector<ValueType> values(expected_values_read);
int64_t values_read;
auto levels_read = column_reader->ReadBatch(expected_values_read, nullptr, nullptr,
I don't know if passing nullptr for the levels gives the right result... Maybe not?
I'm not sure what you mean, but this is used on data without definition levels currently.
The schema is:
required group field_id=-1 schema {
optional float field_id=-1 f32;
optional double field_id=-1 f64;
}
So the data does have def-levels, and this call reads 300 values without querying the levels. I don't know if that is expected. Should we make the batch size larger and read with def-levels, to check that the column only has 300 values? Or is this exactly what we need (just read 300 values here)?
I see. Well, it may have definition levels but it certainly has zero nulls, given how it was generated :-)
https://github.com/apache/parquet-testing/blob/master/data/README.md#byte-stream-split
If I generate the data directly, I get the same values too:
>>> np.random.seed(0)
>>> table = pa.Table.from_pydict({
...     'f32': np.random.normal(size=300).astype(np.float32),
...     'f64': np.random.normal(size=300).astype(np.float64),
... })
>>> table
pyarrow.Table
f32: float
f64: double
----
f32: [[1.7640524,0.4001572,0.978738,2.2408931,1.867558,...,1.1368914,0.09772497,0.5829537,-0.39944902,0.37005588]]
f64: [[-1.3065268517353166,1.658130679618188,-0.11816404512856976,-0.6801782039968504,0.6663830820319143,...,0.37923553353558676,-0.4700328827008748,-0.21673147057553863,-0.9301565025243212,-0.17858909208732915]]
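As a cross-check, the two values spot-checked in the C++ test can be reproduced from the same seed (this assumes NumPy's legacy `np.random.seed` / `RandomState` generator, whose bit stream NumPy guarantees to be stable across versions):

```python
import numpy as np

# Same seed as the parquet-testing generation script.
np.random.seed(0)
f32 = np.random.normal(size=300).astype(np.float32)

# The C++ test asserts values[1] and values[kNumRows - 2], with kNumRows == 300.
assert f32[1] == np.float32(0.4001572)
assert f32[298] == np.float32(-0.39944902)
```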
Hmmm, I'm OK with not checking this. Just making sure it's what we need.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8149c39. There was 1 benchmark result with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
…PLIT (apache#39570)

* Closes: apache#39560

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
In apache/parquet-testing#45 , an integration file for BYTE_STREAM_SPLIT was added to the parquet-testing repo.
What changes are included in this PR?
Add a test reading that file and ensuring the decoded values are as expected.
Are these changes tested?
By definition.
Are there any user-facing changes?
No.