GH-39747: [C++][Parquet] Make BYTE_STREAM_SPLIT routines type-agnostic by pitrou · Pull Request #39748 · apache/arrow

pitrou · 2024-01-22T17:21:38Z

Rationale for this change

The low-level BYTE_STREAM_SPLIT routines currently reference the logical type they are operating on (float or double). However, the BYTE_STREAM_SPLIT encoding is type-agnostic and only cares about the type width. Removing references to logical types makes these routines easier to reuse.
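To illustrate what "only cares about the type width" means, here is a minimal scalar sketch. It is not the code from this PR: the function names and the `kWidth` template parameter are hypothetical, and the real Arrow routines (including the SIMD variants) may be organised differently. The sketch only assumes the usual BYTE_STREAM_SPLIT layout, where stream `b` holds byte `b` of every value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical scalar sketch, parameterised on the value width in bytes
// (4 for float, 8 for double) rather than on the logical type.
template <int kWidth>
void ByteStreamSplitEncodeScalarSketch(const uint8_t* raw_values, size_t num_values,
                                       uint8_t* out) {
  // The output is kWidth contiguous streams of num_values bytes each;
  // stream b receives byte b of every input value.
  for (size_t i = 0; i < num_values; ++i) {
    for (int b = 0; b < kWidth; ++b) {
      out[b * num_values + i] = raw_values[i * kWidth + b];
    }
  }
}

template <int kWidth>
void ByteStreamSplitDecodeScalarSketch(const uint8_t* data, size_t num_values,
                                       uint8_t* out) {
  // Inverse of the encode step: gather byte b of value i from stream b.
  for (size_t i = 0; i < num_values; ++i) {
    for (int b = 0; b < kWidth; ++b) {
      out[i * kWidth + b] = data[b * num_values + i];
    }
  }
}

// Hypothetical helper: the logical type T only appears here, at the call
// site, where it is reduced to sizeof(T) and raw bytes.
template <typename T>
bool RoundTripsSketch(const std::vector<T>& values) {
  const size_t n = values.size();
  if (n == 0) return true;
  constexpr int kWidth = static_cast<int>(sizeof(T));
  std::vector<uint8_t> encoded(n * sizeof(T));
  std::vector<T> decoded(n);
  ByteStreamSplitEncodeScalarSketch<kWidth>(
      reinterpret_cast<const uint8_t*>(values.data()), n, encoded.data());
  ByteStreamSplitDecodeScalarSketch<kWidth>(
      encoded.data(), n, reinterpret_cast<uint8_t*>(decoded.data()));
  return std::memcmp(values.data(), decoded.data(), n * sizeof(T)) == 0;
}
```

With this shape, `RoundTripsSketch(std::vector<float>{...})` and `RoundTripsSketch(std::vector<double>{...})` instantiate the same two routines with widths 4 and 8, and any other fixed-width type could reuse them unchanged.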

Are these changes tested?

Yes, including more exhaustive SIMD tests.

Are there any user-facing changes?

No. These routines are internal.

Closes: [C++][Parquet] Make BYTE_STREAM_SPLIT routines type-agnostic #39747

github-actions · 2024-01-22T17:22:08Z

⚠️ GitHub issue #39747 has been automatically assigned in GitHub to PR creator.

pitrou · 2024-01-22T17:22:16Z

@mapleFU @wgtmac @cyb70289 Does one of you have access to an AVX512 machine? Can you build this PR with -DARROW_SIMD_LEVEL=AVX512 and run the tests?

pitrou · 2024-01-22T17:27:41Z

@github-actions crossbow submit -g cpp

github-actions · 2024-01-22T17:30:09Z

Revision: 267dae4

Submitted crossbow builds: ursacomputing/crossbow @ actions-4c99dc6a39

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-38-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading

cyb70289 · 2024-01-22T18:01:24Z

All arrow-utility-test tests passed.

[----------] 3 tests from TestByteStreamSplitSpecialized/0, where TypeParam = float
[ RUN      ] TestByteStreamSplitSpecialized/0.RoundtripSmall
[       OK ] TestByteStreamSplitSpecialized/0.RoundtripSmall (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/0.RoundtripMidsized
[       OK ] TestByteStreamSplitSpecialized/0.RoundtripMidsized (1 ms)
[ RUN      ] TestByteStreamSplitSpecialized/0.PiecewiseDecode
[       OK ] TestByteStreamSplitSpecialized/0.PiecewiseDecode (0 ms)
[----------] 3 tests from TestByteStreamSplitSpecialized/0 (1 ms total)

[----------] 3 tests from TestByteStreamSplitSpecialized/1, where TypeParam = double
[ RUN      ] TestByteStreamSplitSpecialized/1.RoundtripSmall
[       OK ] TestByteStreamSplitSpecialized/1.RoundtripSmall (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/1.RoundtripMidsized
[       OK ] TestByteStreamSplitSpecialized/1.RoundtripMidsized (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/1.PiecewiseDecode
[       OK ] TestByteStreamSplitSpecialized/1.PiecewiseDecode (0 ms)
[----------] 3 tests from TestByteStreamSplitSpecialized/1 (0 ms total)

Looks like most AVX512 benchmarks are slower than AVX2.

BM_ByteStreamSplitDecode_Float_Sse2/1024            385 ns          385 ns      1817970 bytes_per_second=9.90691G/s
BM_ByteStreamSplitDecode_Float_Sse2/4096           1578 ns         1578 ns       443296 bytes_per_second=9.66871G/s
BM_ByteStreamSplitDecode_Float_Sse2/32768         17828 ns        17828 ns        39266 bytes_per_second=6.84725G/s
BM_ByteStreamSplitDecode_Float_Sse2/65536         36005 ns        36004 ns        19381 bytes_per_second=6.78086G/s
BM_ByteStreamSplitDecode_Double_Sse2/1024          1481 ns         1481 ns       473181 bytes_per_second=5.15284G/s
BM_ByteStreamSplitDecode_Double_Sse2/4096          6152 ns         6152 ns       113963 bytes_per_second=4.96074G/s
BM_ByteStreamSplitDecode_Double_Sse2/32768        50235 ns        50233 ns        10000 bytes_per_second=4.86012G/s
BM_ByteStreamSplitDecode_Double_Sse2/65536       108497 ns       108491 ns         6467 bytes_per_second=4.50067G/s
BM_ByteStreamSplitEncode_Float_Sse2/1024            506 ns          506 ns      1382482 bytes_per_second=7.53754G/s
BM_ByteStreamSplitEncode_Float_Sse2/4096           2092 ns         2092 ns       334971 bytes_per_second=7.29386G/s
BM_ByteStreamSplitEncode_Float_Sse2/32768         21095 ns        21095 ns        32815 bytes_per_second=5.7868G/s
BM_ByteStreamSplitEncode_Float_Sse2/65536         42673 ns        42672 ns        16394 bytes_per_second=5.72127G/s
BM_ByteStreamSplitEncode_Double_Sse2/1024          1504 ns         1504 ns       469782 bytes_per_second=5.07356G/s
BM_ByteStreamSplitEncode_Double_Sse2/4096          6695 ns         6695 ns       104530 bytes_per_second=4.55845G/s
BM_ByteStreamSplitEncode_Double_Sse2/32768        53906 ns        53906 ns        13078 bytes_per_second=4.52898G/s
BM_ByteStreamSplitEncode_Double_Sse2/65536       110635 ns       110631 ns         6351 bytes_per_second=4.41362G/s

BM_ByteStreamSplitDecode_Float_Avx2/1024            245 ns          245 ns      2861683 bytes_per_second=15.5934G/s
BM_ByteStreamSplitDecode_Float_Avx2/4096           1025 ns         1025 ns       681538 bytes_per_second=14.8871G/s
BM_ByteStreamSplitDecode_Float_Avx2/32768         10257 ns        10257 ns        68326 bytes_per_second=11.9015G/s
BM_ByteStreamSplitDecode_Float_Avx2/65536         20694 ns        20694 ns        33842 bytes_per_second=11.7977G/s
BM_ByteStreamSplitDecode_Double_Avx2/1024          1016 ns         1016 ns       688647 bytes_per_second=7.50591G/s
BM_ByteStreamSplitDecode_Double_Avx2/4096          4500 ns         4500 ns       155458 bytes_per_second=6.78202G/s
BM_ByteStreamSplitDecode_Double_Avx2/32768        36439 ns        36438 ns        19211 bytes_per_second=6.70008G/s
BM_ByteStreamSplitDecode_Double_Avx2/65536        78390 ns        78389 ns         8948 bytes_per_second=6.22898G/s
BM_ByteStreamSplitEncode_Float_Avx2/1024            835 ns          835 ns       838644 bytes_per_second=4.57004G/s
BM_ByteStreamSplitEncode_Float_Avx2/4096           3312 ns         3312 ns       211295 bytes_per_second=4.60756G/s
BM_ByteStreamSplitEncode_Float_Avx2/32768         30086 ns        30086 ns        23273 bytes_per_second=4.05734G/s
BM_ByteStreamSplitEncode_Float_Avx2/65536         60202 ns        60201 ns        11628 bytes_per_second=4.05541G/s
BM_ByteStreamSplitEncode_Double_Avx2/1024          1516 ns         1516 ns       462231 bytes_per_second=5.03305G/s
BM_ByteStreamSplitEncode_Double_Avx2/4096          6741 ns         6741 ns       103251 bytes_per_second=4.52713G/s
BM_ByteStreamSplitEncode_Double_Avx2/32768        54044 ns        54043 ns        12981 bytes_per_second=4.51749G/s
BM_ByteStreamSplitEncode_Double_Avx2/65536       111468 ns       111465 ns         6257 bytes_per_second=4.38058G/s

BM_ByteStreamSplitDecode_Float_Avx512/1024          311 ns          311 ns      2253479 bytes_per_second=12.2817G/s
BM_ByteStreamSplitDecode_Float_Avx512/4096         1331 ns         1331 ns       524762 bytes_per_second=11.4655G/s
BM_ByteStreamSplitDecode_Float_Avx512/32768       12203 ns        12203 ns        57141 bytes_per_second=10.0036G/s
BM_ByteStreamSplitDecode_Float_Avx512/65536       24593 ns        24592 ns        28474 bytes_per_second=9.92753G/s
BM_ByteStreamSplitDecode_Double_Avx512/1024        1105 ns         1105 ns       633441 bytes_per_second=6.90407G/s
BM_ByteStreamSplitDecode_Double_Avx512/4096        5196 ns         5196 ns       134568 bytes_per_second=5.8736G/s
BM_ByteStreamSplitDecode_Double_Avx512/32768      42411 ns        42411 ns        16505 bytes_per_second=5.75654G/s
BM_ByteStreamSplitDecode_Double_Avx512/65536      89219 ns        89126 ns         7834 bytes_per_second=5.47857G/s
BM_ByteStreamSplitEncode_Float_Avx512/1024          578 ns          578 ns      1210908 bytes_per_second=6.59961G/s
BM_ByteStreamSplitEncode_Float_Avx512/4096         2495 ns         2495 ns       280623 bytes_per_second=6.11606G/s
BM_ByteStreamSplitEncode_Float_Avx512/32768       28142 ns        28142 ns        24881 bytes_per_second=4.33771G/s
BM_ByteStreamSplitEncode_Float_Avx512/65536       56378 ns        56378 ns        12384 bytes_per_second=4.3304G/s
BM_ByteStreamSplitEncode_Double_Avx512/1024        1034 ns         1034 ns       677564 bytes_per_second=7.38006G/s
BM_ByteStreamSplitEncode_Double_Avx512/4096        7255 ns         7255 ns        96388 bytes_per_second=4.20642G/s
BM_ByteStreamSplitEncode_Double_Avx512/32768      62212 ns        62211 ns        11207 bytes_per_second=3.92438G/s
BM_ByteStreamSplitEncode_Double_Avx512/65536     126019 ns       126013 ns         5535 bytes_per_second=3.87484G/s

pitrou · 2024-01-22T18:04:24Z

> Looks like most AVX512 benchmarks are slower than AVX2.

Amusing. Which CPU is it?

cyb70289 · 2024-01-22T20:17:01Z

> > Looks like most AVX512 benchmarks are slower than AVX2.
>
> Amusing. Which CPU is it?

It's Intel Cascade Lake.
https://www.intel.com/content/www/us/en/products/sku/192444/intel-xeon-gold-5218-processor-22m-cache-2-30-ghz/specifications.html

pitrou · 2024-01-22T20:37:50Z

This might be because the CPU clock frequency slows down when executing AVX-512 instructions.

cyb70289 · 2024-01-22T22:45:02Z

Tested with perf: the CPU frequency (cycles / task-clock) is 2.8 GHz for the SSE2 and AVX2 tests, which matches the system's maximum frequency.
For the AVX512 test, the reported CPU frequency drops to 2.3 GHz, and IPC drops as well.

# SSE2, ~2.8GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Sse2/1024"
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Sse2/1024        385 ns          385 ns      1818400 bytes_per_second=9.90927G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Sse2/1024':

     3,177,983,141      cycles                    #    2.792 GHz
          1,138.11 msec task-clock                #    0.967 CPUs utilized
    11,062,559,384      instructions              #    3.48  insn per cycle


# AVX2, ~2.8GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Avx2/1024"
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Avx2/1024        257 ns          257 ns      2725100 bytes_per_second=14.8485G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Avx2/1024':

     2,778,881,859      cycles                    #    2.793 GHz
            995.09 msec task-clock                #    0.966 CPUs utilized
     8,314,230,560      instructions              #    2.99  insn per cycle


# AVX512, ~2.3GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Avx512/1024"
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Avx512/1024        322 ns          322 ns      2173349 bytes_per_second=11.8434G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Avx512/1024':

     2,454,008,893      cycles                    #    2.298 GHz
          1,067.73 msec task-clock                #    0.965 CPUs utilized
     5,877,374,076      instructions              #    2.40  insn per cycle
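
For anyone who wants to recheck the derived columns, here is a small sketch of the arithmetic behind perf's "GHz" and "insn per cycle" figures, using the counter values pasted above. The struct and function names are made up for illustration; only the numbers come from the perf output.

```cpp
// Hypothetical container for the three counters reported by perf stat above.
struct PerfSampleSketch {
  double cycles;
  double task_clock_msec;
  double instructions;
};

// Counter values copied from the three perf runs above.
constexpr PerfSampleSketch kSse2Sketch = {3177983141.0, 1138.11, 11062559384.0};
constexpr PerfSampleSketch kAvx2Sketch = {2778881859.0, 995.09, 8314230560.0};
constexpr PerfSampleSketch kAvx512Sketch = {2454008893.0, 1067.73, 5877374076.0};

// Effective frequency: cycles per second of task-clock, expressed in GHz.
// cycles / (msec * 1e-3 s) / 1e9 == cycles / (msec * 1e6).
constexpr double EffectiveGHz(const PerfSampleSketch& s) {
  return s.cycles / (s.task_clock_msec * 1e6);
}

// Instructions retired per cycle (IPC).
constexpr double InsnPerCycle(const PerfSampleSketch& s) {
  return s.instructions / s.cycles;
}
```

`EffectiveGHz` gives roughly 2.79 for the SSE2 and AVX2 runs and roughly 2.30 for the AVX512 run, and `InsnPerCycle` gives roughly 3.48, 2.99 and 2.40, matching the values perf printed. These are whole-process totals over different iteration counts, so the raw cycle counts should not be compared across runs directly.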

wgtmac · 2024-01-23T02:00:15Z

> Looks like most AVX512 benchmarks are slower than AVX2.

We've seen the same performance on Intel Xeon.

pitrou · 2024-01-23T08:37:20Z

Perhaps we should open a separate issue to remove the AVX512 variants then?

pitrou · 2024-01-23T08:38:29Z

@wgtmac @mapleFU Do you want to review this PR?

mapleFU

+1

cyb70289 · 2024-01-23T15:31:37Z

> Perhaps we should open a separate issue to remove the AVX512 variants then?

Agreed

conbench-apache-arrow · 2024-01-23T19:23:40Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 78ec4dc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…gnostic (apache#39748)

### Rationale for this change

The low-level BYTE_STREAM_SPLIT routines currently reference the logical type they are operating on (float or double). However, the BYTE_STREAM_SPLIT encoding is type-agnostic and only cares about the type width. Removing references to logical types makes these routines easier to reuse.

### Are these changes tested?

Yes, including more exhaustive SIMD tests.

### Are there any user-facing changes?

No. These routines are internal.

* Closes: apache#39747

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

apacheGH-39747: [C++][Parquet] Make BYTE_STREAM_SPLIT routines type-a…

267dae4

…gnostic

pitrou requested a review from wgtmac as a code owner January 22, 2024 17:21

github-actions bot added Component: Parquet Component: C++ labels Jan 22, 2024

github-actions bot added the awaiting review Awaiting review label Jan 22, 2024

mapleFU approved these changes Jan 23, 2024

View reviewed changes

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 23, 2024

pitrou merged commit 78ec4dc into apache:main Jan 23, 2024

pitrou removed the awaiting committer review Awaiting committer review label Jan 23, 2024

pitrou deleted the gh39747-byte-stream-split-type-agnostic branch January 23, 2024 09:36
