
GH-39560: [C++][Parquet] Add integration test for BYTE_STREAM_SPLIT #39570

Merged

pitrou merged 1 commit into apache:main from pitrou:gh39560-byte-stream-split-integration on Jan 11, 2024

Conversation

Member
@pitrou pitrou commented Jan 11, 2024

Rationale for this change

In apache/parquet-testing#45 , an integration file for BYTE_STREAM_SPLIT was added to the parquet-testing repo.

What changes are included in this PR?

Add a test reading that file and ensuring the decoded values are as expected.

Are these changes tested?

By definition.

Are there any user-facing changes?

No.

@pitrou pitrou requested a review from wgtmac as a code owner January 11, 2024 17:25
@pitrou pitrou requested review from mapleFU and removed request for wgtmac January 11, 2024 17:25
@pitrou pitrou force-pushed the gh39560-byte-stream-split-integration branch from c277c21 to 250e3a1 on January 11, 2024 17:27
Comment on lines +1508 to +1509
ASSERT_EQ(values[1], 0.4001572f);
ASSERT_EQ(values[kNumRows - 2], -0.39944902f);
Member
So we didn't need to check the whole file?

Member Author

Probably not? We don't want to hardcode the 300 values, I think.

Member Author

To elaborate a bit on this: if the first and last values decode correctly, it is very unlikely for other values to be incorrect, given the structure of the encoding.

Also, we have roundtrip tests to ensure internal consistency.
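To make the "structure of the encoding" argument concrete: BYTE_STREAM_SPLIT scatters byte k of every value into the k-th of N byte streams (N = value width), so a decoder bug such as a misplaced stream boundary corrupts values at the ends of the batch first. The following is a minimal pure-Python sketch of the transform for float32 (an illustration, not the Arrow C++ implementation):

```python
import struct

def byte_stream_split_encode(values):
    """Encode float32 values: gather byte k of every value into stream k."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    width = 4  # bytes per float32
    streams = [raw[k::width] for k in range(width)]
    return b"".join(streams)

def byte_stream_split_decode(data, n):
    """Decode n float32 values by re-interleaving the 4 byte streams."""
    width = 4
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    # Byte j of value i lives at streams[j][i]; re-interleave them.
    raw = bytes(b for i in range(n) for b in (s[i] for s in streams))
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]

vals = [1.5, -0.25, 3.75, 100.0]
assert byte_stream_split_decode(byte_stream_split_encode(vals), len(vals)) == vals
```

With this layout, the first decoded value draws on byte 0 of every stream and the last draws on the final byte of every stream, so spot-checking both ends of the batch exercises all stream boundaries.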

Member

I think it's OK. One caveat, though, is that the file was generated by pyarrow, which uses our own C++ encoder...

file_reader->RowGroup(row_group)->Column(column));
std::vector<ValueType> values(expected_values_read);
int64_t values_read;
auto levels_read = column_reader->ReadBatch(expected_values_read, nullptr, nullptr,
Member

#39381

I don't know if passing nullptr for the levels gets the right levels... Maybe not?

Member Author

I'm not sure what you mean, but this is used on data without definition levels currently.

Member

The schema is:

required group field_id=-1 schema {
  optional float field_id=-1 f32;
  optional double field_id=-1 f64;
}

So the data has def-levels, and reading without def-levels forces it to read 300 values without querying the levels. I don't know if this is expected. Should we make the batch size larger and read with def-levels, to check that there are only 300 values? Or is this exactly what we need (just reading 300 values here)?

Member Author

I see. Well, it may have definition levels but it certainly has zero nulls, given how it was generated :-)
https://github.com/apache/parquet-testing/blob/master/data/README.md#byte-stream-split

If I generate the data directly, I get the same values too:

>>> import numpy as np
>>> import pyarrow as pa
>>> np.random.seed(0)
>>> table = pa.Table.from_pydict({
...     'f32': np.random.normal(size=300).astype(np.float32),
...     'f64': np.random.normal(size=300).astype(np.float64),
... })
>>> table
pyarrow.Table
f32: float
f64: double
----
f32: [[1.7640524,0.4001572,0.978738,2.2408931,1.867558,...,1.1368914,0.09772497,0.5829537,-0.39944902,0.37005588]]
f64: [[-1.3065268517353166,1.658130679618188,-0.11816404512856976,-0.6801782039968504,0.6663830820319143,...,0.37923553353558676,-0.4700328827008748,-0.21673147057553863,-0.9301565025243212,-0.17858909208732915]]
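The constants spot-checked in the C++ test fall straight out of this recipe: they are positions 1 and -2 of the seeded float32 draw. A standalone reproduction (assuming numpy is available; no pyarrow needed):

```python
import numpy as np

# Same recipe as the parquet-testing README for the BYTE_STREAM_SPLIT file:
# 300 samples from a seeded standard normal, narrowed to float32/float64.
np.random.seed(0)
f32 = np.random.normal(size=300).astype(np.float32)
f64 = np.random.normal(size=300).astype(np.float64)

# The values asserted in the test are simply entries 1 and -2 of f32:
assert f32[1] == np.float32(0.4001572)
assert f32[-2] == np.float32(-0.39944902)
```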

Member

@mapleFU mapleFU Jan 11, 2024

Hmmm, I'm OK with not checking this. Just making sure it's what we need.
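The def-level point above can be illustrated with a small sketch (plain Python mimicking ReadBatch semantics, not the real Parquet C++ API): for an optional column containing zero nulls, every definition level equals the maximum level, so levels_read == values_read and skipping the level output loses no information.

```python
def read_batch(def_levels, stored_values, batch_size, max_def_level=1,
               want_levels=True):
    """Mimic ReadBatch: return (levels_read, values_read, values, levels).

    A value is present only where its def level equals max_def_level;
    when want_levels is False the levels are consumed but not returned.
    """
    levels = def_levels[:batch_size]
    values = []
    vi = 0
    for lvl in levels:
        if lvl == max_def_level:  # non-null slot: a stored value exists
            values.append(stored_values[vi])
            vi += 1
    return len(levels), len(values), values, (levels if want_levels else None)

# Optional column, 300 rows, zero nulls: every def level is 1.
def_levels = [1] * 300
stored = list(range(300))
levels_read, values_read, values, levels = read_batch(def_levels, stored, 300)
assert levels_read == values_read == 300
assert values == stored
```

Only when nulls are present do the two counts diverge (e.g. levels [1, 0, 1] yield 3 levels but 2 values), which is when discarding the levels would actually drop information.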

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 11, 2024
@mapleFU mapleFU (Member) left a comment:

+1

@pitrou pitrou merged commit 8149c39 into apache:main Jan 11, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 11, 2024
@pitrou pitrou deleted the gh39560-byte-stream-split-integration branch January 11, 2024 18:41
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8149c39.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…PLIT (apache#39570)

* Closes: apache#39560

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

Successfully merging this pull request may close these issues.

[C++][Parquet] Add integration test for BYTE_STREAM_SPLIT

2 participants