Skip to content

Conversation

@codephage2020
Copy link
Contributor

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

What changes are included in this PR?

  • add simdutf8 dependency for parquet-variant
  • add a fn extract_and_validate_utf8_slice

Are these changes tested?

We typically require tests for all PRs in order to:

  1. Prevent the code from being accidentally broken by subsequent changes
  2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

Signed-off-by: codephage2020 <tingwangyan2020@163.com>
Signed-off-by: codephage2020 <tingwangyan2020@163.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
Signed-off-by: codephage2020 <tingwangyan2020@163.com>
@codephage2020 codephage2020 marked this pull request as ready for review July 11, 2025 16:56
@codephage2020
Copy link
Contributor Author

codephage2020 commented Jul 11, 2025

CC @alamb @friendlymatthew @Dandandan .
Please review when you have time. Let me know if any adjustments are needed!

Copy link
Contributor

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @codephage2020. I just have a few comments

Signed-off-by: codephage2020 <tingwangyan2020@163.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2025
Signed-off-by: codephage2020 <tingwangyan2020@163.com>
@codephage2020
Copy link
Contributor Author

I just have a few comments

hi, @friendlymatthew . Thanks for your review! I've addressed all your comments in the latest update.

Copy link
Contributor

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @codephage2020. I had some comments but overall LGTM

Signed-off-by: codephage2020 <tingwangyan2020@163.com>
Signed-off-by: codephage2020 <tingwangyan2020@163.com>
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tho I question the utility of one bit of complexity this PR adds.

Comment on lines 85 to 88
let offset_buffer = match offset {
0 => slice_from_slice(slice, range)?,
_ => slice_from_slice_at_offset(slice, offset, range)?,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branch to test for zero probably costs more than applying an offset of zero?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could get a benchmark and show

I agree that unless we have a benchmark that shows this optimization makes a measurable difference, it would be better to go with the simpler code

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 85 to 88
let offset_buffer = match offset {
0 => slice_from_slice(slice, range)?,
_ => slice_from_slice_at_offset(slice, offset, range)?,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could get a benchmark and show

I agree that unless we have a benchmark that shows this optimization makes a measurable difference, it would be better to go with the simpler code

@codephage2020
Copy link
Contributor Author

CC @alamb @scovich .
I conducted a simple benchmark test locally.

result:

 direct_slice_at_offset_zero
                        time:   [1.0067 ns 1.0092 ns 1.0122 ns]
                        change: [−0.9043% −0.2643% +0.3090%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

branched_slice_at_offset_zero
                        time:   [1.4050 ns 1.4100 ns 1.4157 ns]
                        change: [−0.3173% +0.0921% +0.4766%] (p = 0.65 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

@scovich The results show that you are right. And I will delete the branch for testing zero.

Below is the code for the benchmark test. Please let me know if there are any issues. This is a temporary test and I won't submit it.

const LARGE_DATA_SIZE: usize = 1 << 20;

fn setup_data() -> Vec<u8> {
    let mut rng = StdRng::seed_from_u64(42);
    (0..LARGE_DATA_SIZE).map(|_| rng.random()).collect()
}

fn bench_simple_direct(c: &mut Criterion) {
    let data = setup_data();
    let range = 0..1000;
    
    c.bench_function("direct_slice_at_offset_zero", |b| {
        b.iter(|| {
            let _ = black_box(slice_from_slice_at_offset(&data, 0, range.clone()));
        })
    });
}

fn bench_with_branch_zero_offset(c: &mut Criterion) {
    let data = setup_data();
    let range = 0..1000;
    let offset = 0;
    
    c.bench_function("branched_slice_at_offset_zero", |b| {
        b.iter(|| {
            let _ = black_box(match offset {
                0 => slice_from_slice(&data, range.clone()),
                _ => slice_from_slice_at_offset(&data, offset, range.clone()),
            });
        })
    });
}

@codephage2020 codephage2020 requested review from alamb and scovich July 14, 2025 14:03
@codephage2020
Copy link
Contributor Author

Hi @alamb ,
Just checking in - could you help merge this PR when you have a moment? Thanks for your review!

@alamb alamb merged commit 52fd59c into apache:main Jul 14, 2025
29 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 14, 2025

Sorry for the delay -- thank you @codephage2020 , @scovich and @friendlymatthew

@codephage2020
Copy link
Contributor Author

Sorry for the delay

All good! Thank you @alamb, @scovich and @friendlymatthew for reviewing and merging this!
Let's move this project forward🚀 !

@codephage2020 codephage2020 deleted the feat/variant-use-simdutf8-validation branch July 15, 2025 02:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Offer simdutf8 as an optional dependency when validating metadata

4 participants