
GH-39560: [C++][Parquet] Add integration test for BYTE_STREAM_SPLIT #39570

Merged

pitrou merged 1 commit into apache:main from pitrou:gh39560-byte-stream-split-integration on Jan 11, 2024

Conversation

Member
@pitrou pitrou commented Jan 11, 2024

Rationale for this change

In apache/parquet-testing#45 , an integration file for BYTE_STREAM_SPLIT was added to the parquet-testing repo.

What changes are included in this PR?

Add a test reading that file and ensuring the decoded values are as expected.

Are these changes tested?

By definition.

Are there any user-facing changes?

No.

@pitrou pitrou requested a review from wgtmac as a code owner January 11, 2024 17:25
@pitrou pitrou requested review from mapleFU and removed request for wgtmac January 11, 2024 17:25
@pitrou pitrou force-pushed the gh39560-byte-stream-split-integration branch from c277c21 to 250e3a1 on January 11, 2024 17:27
Comment on lines +1508 to +1509
ASSERT_EQ(values[1], 0.4001572f);
ASSERT_EQ(values[kNumRows - 2], -0.39944902f);
Member
So we didn't need to check the whole file?

Member Author

Probably not? We don't want to hardcode the 300 values, I think.

Member Author

To elaborate a bit on this: if the first and last values decode correctly, it is very unlikely for other values to be incorrect, given the structure of the encoding.

Also, we have roundtrip tests to ensure internal consistency.
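To make the "structure of the encoding" argument concrete: BYTE_STREAM_SPLIT scatters byte k of every value into the k-th of N byte streams (N = value width), so a decoder bug such as a misplaced stream boundary corrupts values at the ends of the batch first. The following is a minimal pure-Python sketch of the transform for float32 (an illustration, not the Arrow C++ implementation):

```python
import struct

def byte_stream_split_encode(values):
    """Encode float32 values: gather byte k of every value into stream k."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    width = 4  # bytes per float32
    streams = [raw[k::width] for k in range(width)]
    return b"".join(streams)

def byte_stream_split_decode(data, n):
    """Decode n float32 values by re-interleaving the 4 byte streams."""
    width = 4
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    # Byte j of value i lives at streams[j][i]; re-interleave them.
    raw = bytes(b for i in range(n) for b in (s[i] for s in streams))
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]

vals = [1.5, -0.25, 3.75, 100.0]
assert byte_stream_split_decode(byte_stream_split_encode(vals), len(vals)) == vals
```

With this layout, the first decoded value draws on byte 0 of every stream and the last draws on the final byte of every stream, so spot-checking both ends of the batch exercises all stream boundaries.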

Member

I think it's OK. One caveat, though, is that the file was generated by pyarrow, which uses our own C++ encoder...

file_reader->RowGroup(row_group)->Column(column));
std::vector<ValueType> values(expected_values_read);
int64_t values_read;
auto levels_read = column_reader->ReadBatch(expected_values_read, nullptr, nullptr,
Member

#39381

I don't know if passing nullptr for the levels gets the right levels... Maybe not?

Member Author

I'm not sure what you mean, but this is used on data without definition levels currently.

Member

The schema is:

required group field_id=-1 schema {
  optional float field_id=-1 f32;
  optional double field_id=-1 f64;
}

So the data has def-levels, and reading without def-levels forces it to read 300 values without querying the levels. I don't know if this is expected. Should we make the batch size larger and read with def-levels, to check that there are only 300 values? Or is this exactly what we need (just reading 300 values here)?

Member Author

I see. Well, it may have definition levels but it certainly has zero nulls, given how it was generated :-)
https://github.com/apache/parquet-testing/blob/master/data/README.md#byte-stream-split

If I generate the data directly, I get the same values too:

>>> import numpy as np
>>> import pyarrow as pa
>>> np.random.seed(0)
>>> table = pa.Table.from_pydict({
...     'f32': np.random.normal(size=300).astype(np.float32),
...     'f64': np.random.normal(size=300).astype(np.float64),
... })
>>> table
pyarrow.Table
f32: float
f64: double
----
f32: [[1.7640524,0.4001572,0.978738,2.2408931,1.867558,...,1.1368914,0.09772497,0.5829537,-0.39944902,0.37005588]]
f64: [[-1.3065268517353166,1.658130679618188,-0.11816404512856976,-0.6801782039968504,0.6663830820319143,...,0.37923553353558676,-0.4700328827008748,-0.21673147057553863,-0.9301565025243212,-0.17858909208732915]]
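The constants spot-checked in the C++ test fall straight out of this recipe: they are positions 1 and -2 of the seeded float32 draw. A standalone reproduction (assuming numpy is available; no pyarrow needed):

```python
import numpy as np

# Same recipe as the parquet-testing README for the BYTE_STREAM_SPLIT file:
# 300 samples from a seeded standard normal, narrowed to float32/float64.
np.random.seed(0)
f32 = np.random.normal(size=300).astype(np.float32)
f64 = np.random.normal(size=300).astype(np.float64)

# The values asserted in the test are simply entries 1 and -2 of f32:
assert f32[1] == np.float32(0.4001572)
assert f32[-2] == np.float32(-0.39944902)
```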

Member

@mapleFU mapleFU Jan 11, 2024

Hmmm, I'm OK with not checking this. Just making sure it's what we need.
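The def-level point above can be illustrated with a small sketch (plain Python mimicking ReadBatch semantics, not the real Parquet C++ API): for an optional column containing zero nulls, every definition level equals the maximum level, so levels_read == values_read and skipping the level output loses no information.

```python
def read_batch(def_levels, stored_values, batch_size, max_def_level=1,
               want_levels=True):
    """Mimic ReadBatch: return (levels_read, values_read, values, levels).

    A value is present only where its def level equals max_def_level;
    when want_levels is False the levels are consumed but not returned.
    """
    levels = def_levels[:batch_size]
    values = []
    vi = 0
    for lvl in levels:
        if lvl == max_def_level:  # non-null slot: a stored value exists
            values.append(stored_values[vi])
            vi += 1
    return len(levels), len(values), values, (levels if want_levels else None)

# Optional column, 300 rows, zero nulls: every def level is 1.
def_levels = [1] * 300
stored = list(range(300))
levels_read, values_read, values, levels = read_batch(def_levels, stored, 300)
assert levels_read == values_read == 300
assert values == stored
```

Only when nulls are present do the two counts diverge (e.g. levels [1, 0, 1] yield 3 levels but 2 values), which is when discarding the levels would actually drop information.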

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 11, 2024
@mapleFU mapleFU (Member) left a comment:

+1

@pitrou pitrou merged commit 8149c39 into apache:main Jan 11, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 11, 2024
@pitrou pitrou deleted the gh39560-byte-stream-split-integration branch January 11, 2024 18:41
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8149c39.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…PLIT (apache#39570)

* Closes: apache#39560

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

Successfully merging this pull request may close these issues.

[C++][Parquet] Add integration test for BYTE_STREAM_SPLIT

2 participants