
GH-39747: [C++][Parquet] Make BYTE_STREAM_SPLIT routines type-agnostic #39748

Merged
pitrou merged 1 commit into apache:main from pitrou:gh39747-byte-stream-split-type-agnostic
Jan 23, 2024


@pitrou
Member

@pitrou pitrou commented Jan 22, 2024

Rationale for this change

The low-level BYTE_STREAM_SPLIT routines currently reference the logical type they are operating on (float or double). However, the BYTE_STREAM_SPLIT encoding is type-agnostic and only cares about the type width. Removing references to logical types makes these routines easier to reuse.
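
To illustrate why only the type width matters, here is a minimal scalar sketch of the encoding, assuming nothing beyond the description above: byte k of every value is written to stream k, and the streams are stored back to back. The function names are hypothetical and are not the routines touched by this PR; the actual Arrow code also has SIMD variants that are not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the Arrow implementation.
// Encode: scatter byte k of each value into stream k.
// `width` is the value width in bytes; no logical type is referenced.
inline void ByteStreamSplitEncodeSketch(const uint8_t* raw, int width,
                                        int64_t num_values, uint8_t* out) {
  assert(width > 0 && num_values >= 0);
  for (int64_t i = 0; i < num_values; ++i) {
    for (int k = 0; k < width; ++k) {
      out[k * num_values + i] = raw[i * width + k];
    }
  }
}

// Decode: gather byte k of each value back from stream k.
inline void ByteStreamSplitDecodeSketch(const uint8_t* data, int width,
                                        int64_t num_values, uint8_t* out) {
  assert(width > 0 && num_values >= 0);
  for (int64_t i = 0; i < num_values; ++i) {
    for (int k = 0; k < width; ++k) {
      out[i * width + k] = data[k * num_values + i];
    }
  }
}
```

With width 4 and two values whose bytes are 1 2 3 4 and 5 6 7 8, the encoded buffer is 1 5 2 6 3 7 4 8. The same two functions handle 8-byte values by passing width 8, which is the sense in which the encoding is type-agnostic.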

Are these changes tested?

Yes, including more exhaustive SIMD tests.

Are there any user-facing changes?

No. These routines are internal.

@pitrou pitrou requested a review from wgtmac as a code owner January 22, 2024 17:21
@github-actions

⚠️ GitHub issue #39747 has been automatically assigned in GitHub to PR creator.

@pitrou
Member Author

pitrou commented Jan 22, 2024

@mapleFU @wgtmac @cyb70289 Does one of you have access to an AVX512 machine? Can you build this PR with -DARROW_SIMD_LEVEL=AVX512 and run the tests?

@github-actions github-actions bot added the awaiting review Awaiting review label Jan 22, 2024
@pitrou
Member Author

pitrou commented Jan 22, 2024

@github-actions crossbow submit -g cpp

@github-actions

Revision: 267dae4

Submitted crossbow builds: ursacomputing/crossbow @ actions-4c99dc6a39

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-38-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@cyb70289
Contributor

All tests in arrow-utility-test passed.

[----------] 3 tests from TestByteStreamSplitSpecialized/0, where TypeParam = float
[ RUN      ] TestByteStreamSplitSpecialized/0.RoundtripSmall
[       OK ] TestByteStreamSplitSpecialized/0.RoundtripSmall (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/0.RoundtripMidsized
[       OK ] TestByteStreamSplitSpecialized/0.RoundtripMidsized (1 ms)
[ RUN      ] TestByteStreamSplitSpecialized/0.PiecewiseDecode
[       OK ] TestByteStreamSplitSpecialized/0.PiecewiseDecode (0 ms)
[----------] 3 tests from TestByteStreamSplitSpecialized/0 (1 ms total)

[----------] 3 tests from TestByteStreamSplitSpecialized/1, where TypeParam = double
[ RUN      ] TestByteStreamSplitSpecialized/1.RoundtripSmall
[       OK ] TestByteStreamSplitSpecialized/1.RoundtripSmall (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/1.RoundtripMidsized
[       OK ] TestByteStreamSplitSpecialized/1.RoundtripMidsized (0 ms)
[ RUN      ] TestByteStreamSplitSpecialized/1.PiecewiseDecode
[       OK ] TestByteStreamSplitSpecialized/1.PiecewiseDecode (0 ms)
[----------] 3 tests from TestByteStreamSplitSpecialized/1 (0 ms total)

It looks like most AVX512 benchmarks are slower than AVX2.

BM_ByteStreamSplitDecode_Float_Sse2/1024            385 ns          385 ns      1817970 bytes_per_second=9.90691G/s
BM_ByteStreamSplitDecode_Float_Sse2/4096           1578 ns         1578 ns       443296 bytes_per_second=9.66871G/s
BM_ByteStreamSplitDecode_Float_Sse2/32768         17828 ns        17828 ns        39266 bytes_per_second=6.84725G/s
BM_ByteStreamSplitDecode_Float_Sse2/65536         36005 ns        36004 ns        19381 bytes_per_second=6.78086G/s
BM_ByteStreamSplitDecode_Double_Sse2/1024          1481 ns         1481 ns       473181 bytes_per_second=5.15284G/s
BM_ByteStreamSplitDecode_Double_Sse2/4096          6152 ns         6152 ns       113963 bytes_per_second=4.96074G/s
BM_ByteStreamSplitDecode_Double_Sse2/32768        50235 ns        50233 ns        10000 bytes_per_second=4.86012G/s
BM_ByteStreamSplitDecode_Double_Sse2/65536       108497 ns       108491 ns         6467 bytes_per_second=4.50067G/s
BM_ByteStreamSplitEncode_Float_Sse2/1024            506 ns          506 ns      1382482 bytes_per_second=7.53754G/s
BM_ByteStreamSplitEncode_Float_Sse2/4096           2092 ns         2092 ns       334971 bytes_per_second=7.29386G/s
BM_ByteStreamSplitEncode_Float_Sse2/32768         21095 ns        21095 ns        32815 bytes_per_second=5.7868G/s
BM_ByteStreamSplitEncode_Float_Sse2/65536         42673 ns        42672 ns        16394 bytes_per_second=5.72127G/s
BM_ByteStreamSplitEncode_Double_Sse2/1024          1504 ns         1504 ns       469782 bytes_per_second=5.07356G/s
BM_ByteStreamSplitEncode_Double_Sse2/4096          6695 ns         6695 ns       104530 bytes_per_second=4.55845G/s
BM_ByteStreamSplitEncode_Double_Sse2/32768        53906 ns        53906 ns        13078 bytes_per_second=4.52898G/s
BM_ByteStreamSplitEncode_Double_Sse2/65536       110635 ns       110631 ns         6351 bytes_per_second=4.41362G/s

BM_ByteStreamSplitDecode_Float_Avx2/1024            245 ns          245 ns      2861683 bytes_per_second=15.5934G/s
BM_ByteStreamSplitDecode_Float_Avx2/4096           1025 ns         1025 ns       681538 bytes_per_second=14.8871G/s
BM_ByteStreamSplitDecode_Float_Avx2/32768         10257 ns        10257 ns        68326 bytes_per_second=11.9015G/s
BM_ByteStreamSplitDecode_Float_Avx2/65536         20694 ns        20694 ns        33842 bytes_per_second=11.7977G/s
BM_ByteStreamSplitDecode_Double_Avx2/1024          1016 ns         1016 ns       688647 bytes_per_second=7.50591G/s
BM_ByteStreamSplitDecode_Double_Avx2/4096          4500 ns         4500 ns       155458 bytes_per_second=6.78202G/s
BM_ByteStreamSplitDecode_Double_Avx2/32768        36439 ns        36438 ns        19211 bytes_per_second=6.70008G/s
BM_ByteStreamSplitDecode_Double_Avx2/65536        78390 ns        78389 ns         8948 bytes_per_second=6.22898G/s
BM_ByteStreamSplitEncode_Float_Avx2/1024            835 ns          835 ns       838644 bytes_per_second=4.57004G/s
BM_ByteStreamSplitEncode_Float_Avx2/4096           3312 ns         3312 ns       211295 bytes_per_second=4.60756G/s
BM_ByteStreamSplitEncode_Float_Avx2/32768         30086 ns        30086 ns        23273 bytes_per_second=4.05734G/s
BM_ByteStreamSplitEncode_Float_Avx2/65536         60202 ns        60201 ns        11628 bytes_per_second=4.05541G/s
BM_ByteStreamSplitEncode_Double_Avx2/1024          1516 ns         1516 ns       462231 bytes_per_second=5.03305G/s
BM_ByteStreamSplitEncode_Double_Avx2/4096          6741 ns         6741 ns       103251 bytes_per_second=4.52713G/s
BM_ByteStreamSplitEncode_Double_Avx2/32768        54044 ns        54043 ns        12981 bytes_per_second=4.51749G/s
BM_ByteStreamSplitEncode_Double_Avx2/65536       111468 ns       111465 ns         6257 bytes_per_second=4.38058G/s

BM_ByteStreamSplitDecode_Float_Avx512/1024          311 ns          311 ns      2253479 bytes_per_second=12.2817G/s
BM_ByteStreamSplitDecode_Float_Avx512/4096         1331 ns         1331 ns       524762 bytes_per_second=11.4655G/s
BM_ByteStreamSplitDecode_Float_Avx512/32768       12203 ns        12203 ns        57141 bytes_per_second=10.0036G/s
BM_ByteStreamSplitDecode_Float_Avx512/65536       24593 ns        24592 ns        28474 bytes_per_second=9.92753G/s
BM_ByteStreamSplitDecode_Double_Avx512/1024        1105 ns         1105 ns       633441 bytes_per_second=6.90407G/s
BM_ByteStreamSplitDecode_Double_Avx512/4096        5196 ns         5196 ns       134568 bytes_per_second=5.8736G/s
BM_ByteStreamSplitDecode_Double_Avx512/32768      42411 ns        42411 ns        16505 bytes_per_second=5.75654G/s
BM_ByteStreamSplitDecode_Double_Avx512/65536      89219 ns        89126 ns         7834 bytes_per_second=5.47857G/s
BM_ByteStreamSplitEncode_Float_Avx512/1024          578 ns          578 ns      1210908 bytes_per_second=6.59961G/s
BM_ByteStreamSplitEncode_Float_Avx512/4096         2495 ns         2495 ns       280623 bytes_per_second=6.11606G/s
BM_ByteStreamSplitEncode_Float_Avx512/32768       28142 ns        28142 ns        24881 bytes_per_second=4.33771G/s
BM_ByteStreamSplitEncode_Float_Avx512/65536       56378 ns        56378 ns        12384 bytes_per_second=4.3304G/s
BM_ByteStreamSplitEncode_Double_Avx512/1024        1034 ns         1034 ns       677564 bytes_per_second=7.38006G/s
BM_ByteStreamSplitEncode_Double_Avx512/4096        7255 ns         7255 ns        96388 bytes_per_second=4.20642G/s
BM_ByteStreamSplitEncode_Double_Avx512/32768      62212 ns        62211 ns        11207 bytes_per_second=3.92438G/s
BM_ByteStreamSplitEncode_Double_Avx512/65536     126019 ns       126013 ns         5535 bytes_per_second=3.87484G/s
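
As a reading aid for the numbers above, the following sketch recomputes throughput and relative slowdown from the reported times. It assumes, as an inference from the figures rather than something stated in the thread, that the benchmark argument is the number of values and that "G/s" means 2^30 bytes per second; under those assumptions the recomputed values agree with the reported ones to within rounding of the times. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in GiB/s for `num_values` values of `width` bytes
// processed in `nanos` nanoseconds.
// Assumption: "G/s" in the benchmark output means 2^30 bytes per second.
inline double GibPerSecond(int64_t num_values, int width, double nanos) {
  assert(nanos > 0);
  const double bytes = static_cast<double>(num_values) * width;
  return bytes / (nanos * 1e-9) / (1024.0 * 1024.0 * 1024.0);
}

// Ratio of two timings; a result above 1 means the candidate is slower
// than the baseline.
inline double Slowdown(double candidate_nanos, double baseline_nanos) {
  assert(baseline_nanos > 0);
  return candidate_nanos / baseline_nanos;
}
```

For instance, `GibPerSecond(1024, 4, 385)` is about 9.91, close to the 9.90691G/s reported for BM_ByteStreamSplitDecode_Float_Sse2/1024, and `Slowdown(311, 245)` is about 1.27 for the AVX512 versus AVX2 float decode at 1024 values. The float encode rows go the other way: `Slowdown(578, 835)` is below 1, which is consistent with "most" rather than "all".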

@pitrou
Member Author

pitrou commented Jan 22, 2024

> It looks like most AVX512 benchmarks are slower than AVX2.

Amusing. Which CPU is it?

@cyb70289
Contributor

> > It looks like most AVX512 benchmarks are slower than AVX2.
>
> Amusing. Which CPU is it?

It's Intel Cascade Lake.
https://www.intel.com/content/www/us/en/products/sku/192444/intel-xeon-gold-5218-processor-22m-cache-2-30-ghz/specifications.html

@pitrou
Member Author

pitrou commented Jan 22, 2024

This might be because the CPU clock frequency slows down when executing AVX-512 instructions.

@cyb70289
Contributor

Tested with perf: the CPU frequency (cycles / task-clock) is 2.8 GHz for the SSE2 and AVX2 tests, which matches the system's maximum frequency.
For the AVX512 test, however, the reported CPU frequency is 2.3 GHz, and the IPC also drops.

# SSE2, ~2.8GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Sse2/1024"
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Sse2/1024        385 ns          385 ns      1818400 bytes_per_second=9.90927G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Sse2/1024':

     3,177,983,141      cycles                    #    2.792 GHz
          1,138.11 msec task-clock                #    0.967 CPUs utilized
    11,062,559,384      instructions              #    3.48  insn per cycle


# AVX2, ~2.8GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Avx2/1024"
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Avx2/1024        257 ns          257 ns      2725100 bytes_per_second=14.8485G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Avx2/1024':

     2,778,881,859      cycles                    #    2.793 GHz
            995.09 msec task-clock                #    0.966 CPUs utilized
     8,314,230,560      instructions              #    2.99  insn per cycle


# AVX512, ~2.3GHz
$ sudo perf stat -e cycles,task-clock,instructions -- release/parquet-encoding-benchmark --benchmark_filter="BM_ByteStreamSplitDecode_Float_Avx512/1024"
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Avx512/1024        322 ns          322 ns      2173349 bytes_per_second=11.8434G/s

 Performance counter stats for 'release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitDecode_Float_Avx512/1024':

     2,454,008,893      cycles                    #    2.298 GHz
          1,067.73 msec task-clock                #    0.965 CPUs utilized
     5,877,374,076      instructions              #    2.40  insn per cycle
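
The derived figures in the perf output can be reproduced from the raw counters. The sketch below recomputes frequency (cycles / task-clock) and IPC (instructions / cycles) for the three runs, and adds a rough estimate of cycles per benchmark iteration (iteration time multiplied by frequency). That last estimate is an assumption on my part: it treats the process-wide average frequency as the frequency inside the benchmark loop. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Raw counters copied from the perf stat output above, plus the
// per-iteration time reported by the benchmark in the same run.
struct PerfSample {
  double cycles;
  double task_clock_msec;
  double instructions;
  double bench_nanos;
};

// Average clock frequency in GHz: cycles divided by elapsed CPU time.
inline double FrequencyGHz(const PerfSample& s) {
  assert(s.task_clock_msec > 0);
  return s.cycles / (s.task_clock_msec * 1e6);
}

// Instructions retired per cycle.
inline double Ipc(const PerfSample& s) {
  assert(s.cycles > 0);
  return s.instructions / s.cycles;
}

// Rough estimate of cycles per benchmark iteration (see assumption above).
inline double CyclesPerIteration(const PerfSample& s) {
  return s.bench_nanos * FrequencyGHz(s);
}

const PerfSample kSse2{3177983141.0, 1138.11, 11062559384.0, 385.0};
const PerfSample kAvx2{2778881859.0, 995.09, 8314230560.0, 257.0};
const PerfSample kAvx512{2454008893.0, 1067.73, 5877374076.0, 322.0};
```

Under that assumption, the estimate comes out at roughly 1075 cycles per iteration for SSE2, 718 for AVX2, and 740 for AVX512. If the estimate holds, the lower clock accounts for most of the gap between AVX512 and AVX2 in this benchmark, but possibly not all of it.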

@wgtmac
Member

wgtmac commented Jan 23, 2024

> It looks like most AVX512 benchmarks are slower than AVX2.

We've seen the same behavior on Intel Xeon.

@pitrou
Member Author

pitrou commented Jan 23, 2024

Perhaps we should open a separate issue to remove the AVX512 variants then?

@pitrou
Member Author

pitrou commented Jan 23, 2024

@wgtmac @mapleFU Do you want to review this PR?

Member

@mapleFU mapleFU left a comment

+1

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 23, 2024
@pitrou pitrou merged commit 78ec4dc into apache:main Jan 23, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 23, 2024
@pitrou pitrou deleted the gh39747-byte-stream-split-type-agnostic branch January 23, 2024 09:36
@cyb70289
Contributor

> Perhaps we should open a separate issue to remove the AVX512 variants then?

Agreed

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 78ec4dc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…gnostic (apache#39748)

* Closes: apache#39747

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>