Skip to content

PARQUET-2261 Size Statistics#14000

Merged
rapids-bot[bot] merged 173 commits intorapidsai:branch-24.02from
etseidl:size_statistics
Dec 6, 2023
Merged

PARQUET-2261 Size Statistics#14000
rapids-bot[bot] merged 173 commits intorapidsai:branch-24.02from
etseidl:size_statistics

Conversation

@etseidl
Copy link
Copy Markdown
Contributor

@etseidl etseidl commented Aug 29, 2023

Description

Adds Parquet size statistics introduced in apache/parquet-format#197.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Aug 29, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 29, 2023
@PointKernel PointKernel self-requested a review December 5, 2023 18:24
@ttnghia ttnghia self-requested a review December 5, 2023 23:05
{
T v;
bool const res = parquet_field_struct<T>(field(), v).operator()(cpr, field_type);
bool const res = parquet_field_struct<T>{field(), v}(cpr, field_type);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank @vuule :D

Comment thread cpp/src/io/parquet/page_enc.cu Outdated

__syncthreads();
// page fragment size must fit in a 32-bit signed integer
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, compare signed vs unsigned.

Suggested change
if (s->frag.fragment_data_size > std::numeric_limits<int32_t>::max()) {
if (s->frag.fragment_data_size > static_cast<uint32_t>(std::numeric_limits<int32_t>::max())) {

Copy link
Copy Markdown
Contributor Author

@etseidl etseidl Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL std::numeric_limits<T>::max() returns a T 😅

But I don't think the cast is necessary since the result of max() will be implicitly converted to unsigned int per the "usual arithmetic conversions" anyway.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that the compiler can issue a warning which turn into error for this issue. Currently, it doesn't but that can change in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* This value should not be written if max_repetition_level is 0.
*/
thrust::optional<std::vector<int64_t>> repetition_level_histogram;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to mix thrust:: with std::? Why don't just use std::optional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compact protocol reader/writer uses thrust::optional because some optional fields need to be usable on the device.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how thrust::optional<std::vector>> can be used on device? Or this is just for consistency with other thrust::optional<plain_type>?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, consistency. LogicalType, for instance, is optional as well, and is referenced in device code. Since that one had to be thrust::optional, I figured it was easier for them all to be.

@etseidl etseidl requested a review from ttnghia December 6, 2023 15:50
@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Dec 6, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 6, 2023
@raydouglass raydouglass removed the request for review from a team December 6, 2023 20:31
@raydouglass
Copy link
Copy Markdown
Contributor

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@vuule vuule removed the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 6, 2023
@vuule
Copy link
Copy Markdown
Contributor

vuule commented Dec 6, 2023

/merge

@rapids-bot rapids-bot Bot merged commit b136d8b into rapidsai:branch-24.02 Dec 6, 2023
@etseidl etseidl deleted the size_statistics branch December 6, 2023 23:52
rapids-bot Bot pushed a commit that referenced this pull request Dec 9, 2023
While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14598
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Adds Parquet size statistics introduced in apache/parquet-format#197.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#14000
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14598
rapids-bot Bot pushed a commit that referenced this pull request Mar 7, 2024
#14000 added the ability to write new page statistics to the Parquet writer. This PR uses these new statistics to avoid some string size computations. Benchmarks show an improvement in read times of up to 20%.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14973
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants