Optimize PFCOUNT, PFMERGE command by SIMD acceleration#13558
Optimize PFCOUNT, PFMERGE command by SIMD acceleration#13558ShooterIT merged 16 commits intoredis:unstablefrom
Conversation
ShooterIT
left a comment
There was a problem hiding this comment.
Thank you for you PR.
I think some CPU instruction sets may significantly improve performance of redis for some computing scenarios, also maybe for crc or hash algorithms.
But actually i lacks this knowledge, i will learn it. Hi @filipecosta90, could you also take a look.
CE Performance Automation : step 1 of 2 (build) DONE.This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link |
CE Performance Automation : step 2 of 2 (benchmark) FINISHED.This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2024-11-13 19:19:28.887445 and took 48079.879437 seconds to finish. In total will run 141 benchmarks. |
There was a problem hiding this comment.
I read it carefully, thank you for your impressive PR which improves a lot for HLL, it is exciting. And the code looks very good to me, maybe we just need to update comments to let people understand it easily.
small optimization, just skip 8 bytes at the first as below
unstable : Pfcounts 20097.37 9.93275 11.00700 16.31900 18.94300 1079.45
current : Pfcounts 79993.65 2.49851 2.46300 4.92700 5.82300 4296.53
cr : Pfcounts 80441.25 2.48559 2.43100 4.86300 6.14300 4320.57
unstable : Pfmerges 15863.74 12.56612 14.84700 18.55900 20.99100 991.48
current : Pfmerges 148759.06 1.34511 1.31900 2.63900 4.25500 9297.44
cr. : Pfmerges 152308.85 1.31122 1.28700 2.57500 4.25500 9519.30
|
currently this PR looks good to me, please also have a look @sundb @fcostaoliveira @YaacovHazan |
|
fully CI on arm64: https://github.com/redis/redis-extra-ci/actions/runs/11586127029 |
Co-authored-by: debing.sun <debing.sun@redis.com>
|
@Nugine merged, thank you |
|
Hi @ShooterIT @sundb |
@Nugine you have the right to apply this PR to valkey. |
The bug was introduced in #13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from valkey-io/valkey#1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>
|
Note that after merging this PR, building the code with clang 18 gives the following error - |
|
@alonre24 Have you seen it on github action? This complaint is not caused by this, but because ubuntu-latest is using 24.04 in the ci of some users, I want to wait until this repository ubuntu is upgraded before fixing it. you can try locally with ubuntu 24.04: |
|
@sundb not only in github actions, I reproduced it locally with ubuntu 22. From what I see, it happens when using clang-18 (older clang are ok). ? |
|
cool, @alonre24 could you make a PR for this issue? |
|
@alonre24 yeah, you're right. |
|
one of the alternative is forbiding attribute deprecated for clang. |
This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths. Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated. AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6). When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version. ``` pfcount key1 key2 key3 pfmerge keyall key1 key2 key3 ``` ``` ====================================================================================================== Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ------------------------------------------------------------------------------------------------------ PFCOUNT-scalar 5570.09 35.89060 32.51100 65.27900 69.11900 299.17 PFCOUNT-avx2 72604.92 2.82072 2.73500 5.50300 7.13500 3899.68 ------------------------------------------------------------------------------------------------------ PFMERGE-scalar 7879.13 25.52156 24.19100 46.33500 48.38300 492.45 PFMERGE-avx2 126448.64 1.58120 1.53500 3.08700 4.89500 7903.04 ------------------------------------------------------------------------------------------------------ scalar: redis:unstable 7f38c7b avx2: Nugine:hll-simd 02e09f8 CPU: 13th Gen Intel® Core™ i9-13900H × 20 Memory: 32.0 GiB OS: Ubuntu 22.04.5 LTS ``` Experiment repo: https://github.com/Nugine/redis-hyperloglog Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp resolves redis#13551 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com>
The bug was introduced in redis#13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from valkey-io/valkey#1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>
This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).
When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp
resolves #13551