-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Add 2K software prefetch to improve BITCOUNT performance #14103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CE Performance Automation : step 1 of 2 (build) DONE.This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link |
CE Performance Automation : step 2 of 2 (benchmark) FINISHED.This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2025-06-21 21:46:11.741325 and took 11296.191358 seconds to finish. In total will run 176 benchmarks. |
🎉 Snyk checks have passed. No issues have been found so far.✅ security/snyk check is complete. No issues have been found. (View Details) ✅ license/snyk check is complete. No issues have been found. (View Details) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a 2 KiB software prefetch into the scalar loop of redisPopcount to hide memory latency and boost BITCOUNT throughput by up to 41.6%.
- Inserted
redis_prefetch_read(p + 2048);in the popcount loop - Added a comment explaining the 2 KiB stride rationale
Comments suppressed due to low confidence (1)
src/bitops.c:72
- [nitpick] Clarify the comment by using "2 KiB" (instead of "2K") and specify "L3 cache miss latency" for precision.
/* Prefetch with 2K stride is just enough to overlap L3 miss latency effectively
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. Using platform named: intel64-ubuntu22.04-redis-clx1 to do the comparison. In summary:
You can check a comparison in detail via the grafana link Comparison between unstable and filipecosta90:prefetch.popcount.Time Period from 5 months ago. (environment used: oss-standalone) Unstable Table
Unstable test regexp names: memtier_benchmark-100Kkeys-hash-hgetall-50-fields-100B-values|memtier_benchmark-100Kkeys-load-hash-50-fields-with-100B-values|memtier_benchmark-10Kkeys-load-hash-50-fields-with-10000B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10|memtier_benchmark-1Mkeys-100B-expire-use-case|memtier_benchmark-1Mkeys-10B-expire-use-case|memtier_benchmark-1Mkeys-1KiB-expire-use-case|memtier_benchmark-1Mkeys-4KiB-expire-use-case|memtier_benchmark-1Mkeys-bitmap-getbit-pipeline-10|memtier_benchmark-1Mkeys-generic-expire-pipeline-10|memtier_benchmark-1Mkeys-generic-expireat-pipeline-10|memtier_benchmark-1Mkeys-generic-pexpire-pipeline-10|memtier_benchmark-1Mkeys-generic-scan-pipeline-10|memtier_benchmark-1Mkeys-generic-touch-pipeline-10|memtier_benchmark-1Mkeys-generic-ttl-pipeline-10|memtier_benchmark-1Mkeys-hash-hexists|memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values|memtier_benchmark-1Mkeys-hash-hgetall-50-fields-10B-values|memtier_benchmark-1Mkeys-hash-hmget-5-fields-with-100B-values-pipeline-10|memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values|memtier_benchmark-1Mkeys-list-lpop-rpop-with-10B-values|memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10|memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values|memtier_benchmark-1Mkeys-load-list-with-100B-values|memtier_benchmark-1Mkeys-load-list-with-1KiB-values|memtier_benchmark-1Mkeys-load-set-intset-with-100-elements-pipeline-10|memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values|memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values-pipeline-10|memtier_benchmark-1Mkeys-load-string-with-100B-values|memtier_benchmark-1Mkeys-load-string-with-10B-values|memtier_benchmark-1Mkeys-load-string-with-10B-values-pipeline-10|memtier_benchmark-1Mkeys-load-string-with-1KiB-values|memtier_benchmark-1Mkeys-load-string-with-20KiB-values|memtier_benchmark-1Mkeys-load-zset-with-10-elements-double-score|memtier_benchmark-1Mkeys-load-zset-with-10-elements-int-score|memtier_benchmark-1Mkeys-string-append-1-100B|memtier_benchmark-1Mkeys-string-get-100B|memtier_benchmark-1Mkeys-string-get-100B-pipeline-10|memtier_benchmark-1Mkeys-string-get-10B|memtier_benchmark-1Mkeys-string-get-10B-pipeline-10|memtier_benchmark-1Mkeys-string-get-1KiB|memtier_benchmark-1Mkeys-string-incrby-pipeline-10|memtier_benchmark-1Mkeys-string-incrbyfloat|memtier_benchmark-1Mkeys-string-incrbyfloat-pipeline-10|memtier_benchmark-1Mkeys-string-setex-100B-pipeline-10|memtier_benchmark-1Mkeys-string-setrange-100B-pipeline-10|memtier_benchmark-1key-100M-bits-bitmap-bitcount|memtier_benchmark-1key-1Billion-bits-bitmap-bitcount|memtier_benchmark-1key-geo-2-elements-geopos|memtier_benchmark-1key-geo-60M-elements-geodist|memtier_benchmark-1key-geo-60M-elements-geodist-pipeline-10|memtier_benchmark-1key-geo-60M-elements-geohash|memtier_benchmark-1key-geo-60M-elements-geohash-pipeline-10|memtier_benchmark-1key-geo-60M-elements-geopos|memtier_benchmark-1key-geo-60M-elements-geopos-pipeline-10|memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat|memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-pipeline-10|memtier_benchmark-1key-hash-hscan-50-fields-10B-values|memtier_benchmark-1key-list-10-elements-lrange-all-elements|memtier_benchmark-1key-list-100-elements-lrange-all-elements|memtier_benchmark-1key-list-10K-elements-lindex-integer|memtier_benchmark-1key-list-10K-elements-lindex-string|memtier_benchmark-1key-list-10K-elements-linsert-lrem-integer|memtier_benchmark-1key-list-10K-elements-lpos-integer|memtier_benchmark-1key-set-10-elements-smembers|memtier_benchmark-1key-set-10-elements-smismember|memtier_benchmark-1key-set-100-elements-sismember-not-a-member|memtier_benchmark-1key-set-1K-elements-smembers|memtier_benchmark-1key-set-200K-elements-sadd-constant|memtier_benchmark-1key-set-2M-elements-sadd-increasing|memtier_benchmark-1key-zrevrangebyscore-256K-elements-pipeline-1|memtier_benchmark-1key-zset-10-elements-zrange-all-elements|memtier_benchmark-1key-zset-10-elements-zrange-all-elements-long-scores|memtier_benchmark-1key-zset-100-elements-zrange-all-elements|memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements|memtier_benchmark-1key-zset-100-elements-zscan|memtier_benchmark-1key-zset-1M-elements-zcard-pipeline-10|memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements|memtier_benchmark-1key-zset-1M-elements-zscore-pipeline-10|memtier_benchmark-2keys-lua-eval-hset-expire|memtier_benchmark-2keys-lua-evalsha-hset-expire|memtier_benchmark-2keys-set-10-100-elements-sdiff|memtier_benchmark-2keys-set-10-100-elements-sinter|memtier_benchmark-2keys-set-10-100-elements-sunion|memtier_benchmark-2keys-stream-5-entries-xread-all-entries|memtier_benchmark-2keys-stream-5-entries-xread-all-entries-pipeline-10|memtier_benchmark-2keys-zset-300-elements-skiplist-encoded-zunionstore|memtier_benchmark-connection-hello Regressions Table
Regressions test regexp names: memtier_benchmark-1Mkeys-load-list-with-100B-values|memtier_benchmark-1Mkeys-load-set-intset-with-100-elements|memtier_benchmark-1Mkeys-string-mget-1KiB|memtier_benchmark-1Mkeys-string-setrange-100B|memtier_benchmark-1key-geo-60M-elements-geohash-pipeline-10|memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-bybox|memtier_benchmark-1key-list-10K-elements-linsert-lrem-integer|memtier_benchmark-1key-list-10K-elements-linsert-lrem-string|memtier_benchmark-1key-list-10K-elements-lpos-string|memtier_benchmark-1key-list-1K-elements-lrange-all-elements|memtier_benchmark-1key-set-10-elements-smismember|memtier_benchmark-1key-set-100-elements-sismember-is-a-member|memtier_benchmark-1key-set-100-elements-sismember-not-a-member|memtier_benchmark-1key-set-100-elements-smembers|memtier_benchmark-1key-set-1M-elements-sismember-50pct-chance|memtier_benchmark-1key-zrevrangebyscore-256K-elements-pipeline-1|memtier_benchmark-1key-zset-10-elements-zrange-all-elements-long-scores|memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements-long-scores|memtier_benchmark-2keys-stream-5-entries-xread-all-entries|memtier_benchmark-3Mkeys-load-string-with-512B-values Improvements Table
Improvements test regexp names: latency-rate-limited-10000_qps-memtier_benchmark-1key-set-1K-elements-smembers|memtier_benchmark-100Kkeys-load-hash-50-fields-with-10B-values|memtier_benchmark-10Kkeys-load-hash-50-fields-with-10000B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10|memtier_benchmark-1Mkeys-100B-expire-use-case|memtier_benchmark-1Mkeys-generic-touch-pipeline-10|memtier_benchmark-1Mkeys-hash-hincrby|memtier_benchmark-1Mkeys-hash-hmget-5-fields-with-100B-values-pipeline-10|memtier_benchmark-1Mkeys-hash-transactions-multi-exec-pipeline-20|memtier_benchmark-1Mkeys-list-lpop-rpop-with-1KiB-values|memtier_benchmark-1Mkeys-load-list-with-1KiB-values|memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values|memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10|memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10|memtier_benchmark-1Mkeys-load-zset-listpack-with-100-elements-double-score|memtier_benchmark-1Mkeys-string-decr|memtier_benchmark-1Mkeys-string-get-1KiB|memtier_benchmark-1Mkeys-string-incrby|memtier_benchmark-1Mkeys-string-incrbyfloat|memtier_benchmark-1Mkeys-string-setex-100B-pipeline-10|memtier_benchmark-1key-100M-bits-bitmap-bitcount|memtier_benchmark-1key-geo-2-elements-geosearch-fromlonlat-withcoord|memtier_benchmark-1key-geo-60M-elements-geopos|memtier_benchmark-1key-hash-hscan-50-fields-10B-values|memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10|memtier_benchmark-1key-list-10K-elements-lindex-integer|memtier_benchmark-1key-list-10K-elements-lindex-string|memtier_benchmark-1key-list-10K-elements-lpos-integer|memtier_benchmark-1key-list-1K-elements-lrange-all-elements-pipeline-10|memtier_benchmark-1key-list-2K-elements-quicklist-lrange-all-elements-longs|memtier_benchmark-1key-pfadd-4KB-values-pipeline-10|memtier_benchmark-1key-set-10-elements-smembers-pipeline-10|memtier_benchmark-1key-set-100-elements-sscan|memtier_benchmark-1key-set-10M-elements-sismember-50pct-chance|memtier_benchmark-1key-set-2M-elements-sadd-increasing|memtier_benchmark-1key-zrem-5M-elements-pipeline-1|memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements|memtier_benchmark-1key-zset-1K-elements-zrange-all-elements|memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements|memtier_benchmark-2keys-set-10-100-elements-sinter|memtier_benchmark-2keys-set-10-100-elements-sunion|memtier_benchmark-2keys-stream-5-entries-xread-all-entries-pipeline-10 Full Results table:
|
…14309) This PR introduces **vectorized implementations of `BITCOUNT`** for x86_64 targets with AVX2 and AVX512 support. - **AVX2 path**: processes 32B at a time, using unrolled POPCNT on 64-bit lanes with independent accumulators to reduce data dependencies. - **AVX512 path**: leverages `VPOPCNTDQ` on 64B chunks with `_mm512_reduce_add_epi64` to efficiently aggregate results across 512-bit vectors. - Both paths include cache prefetching hints to better overlap memory fetches with computation. This was proved to matter in #14103. - Fallbacks to the scalar implementation if hardware support is unavailable. --------- Co-authored-by: debing.sun <debing.sun@redis.com>
Summary
Adds a software prefetch with a 2K stride to the scalar popcount loop in redisPopcount().
Prefetching improved BITCOUNT throughput by up to 41.6%, reduced p50 latency by up to 43.9%, and significantly lowered L3 memory stalls, confirming effective mitigation of memory-bound bottlenecks, with no negative impact on L1/L2 usage or cache pollution (confirmed with HW counters).
Note: The 2K stride was the best starting from 128,256,512,1024,2048,4096.
4K gave the same outcome so it's best to avoid larger strides without reason.
Performance gains on client side
Performance gains on HW
Replicate the results
Benchmark
preload
benchmark
HW counters
you can replicate the hw collection with the following command on the server side while benchmarking (you need a metal VM of physical HW access)
Full benchmark suite