Optimize stream ID comparison and endian conversion hot paths #14480
sundb merged 6 commits into redis:unstable
@filipecosta90 is there a benchmark for it?
Co-authored-by: debing.sun <debing.sun@redis.com>
Automated performance analysis summary

This comment was automatically generated given there is performance data available.

Using platform named: x86-aws-m7i.metal-24xl for both baseline and comparison.
Using triggering environment: ci for both baseline and comparison.

In summary:

You can check a comparison in detail via the grafana link.

Comparison between d4307af and intrinsics.bswap. Time Period from 5 months ago. (environment used: oss-standalone)

By GROUP change csv: command_group,min_change,q1_change,median_change,q3_change,max_change
By COMMAND change csv: command,min_change,q1_change,median_change,q3_change,max_change

Improvements test regexp names: memtier_benchmark-stream-10M-entries-xreadgroup-count-100

Full Results table:
yes. just added the official run in TLDR: ~10% bump on XREADGROUP.
…n overhead in stream propagation (#14516)

As seen in the following flamegraph, even after PR #14480, there is a lot of redundant work when propagating multiple XCLAIMs within an XREADGROUP. This PR refactors streamPropagateXCLAIM to add a new static inline variant, `streamPropagateXCLAIMCopyFree()`, which accepts pre-created `robj*` arguments. This enables reusing argument objects across multiple XCLAIM propagations, reducing repeated creation and destruction costs during high-throughput consumer group operations.
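The reuse idea described in that commit message can be illustrated with a minimal sketch. This is not the Redis code: `example_arg`, `example_propagate_naive()` and `example_propagate_reuse()` are hypothetical stand-ins, the argument list is shortened to four entries, and an allocation counter is added only to make the saving visible.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an argument object (Redis uses robj). */
typedef struct {
    char *buf;
} example_arg;

/* Counts how many argument objects were created (illustration only). */
static size_t example_alloc_count = 0;

static example_arg *example_arg_create(const char *s) {
    size_t n = strlen(s) + 1;
    example_arg *o = malloc(sizeof(*o));
    if (o == NULL) abort();
    o->buf = malloc(n);
    if (o->buf == NULL) abort();
    memcpy(o->buf, s, n);
    example_alloc_count++;
    return o;
}

static void example_arg_free(example_arg *o) {
    free(o->buf);
    free(o);
}

/* Stand-in for handing the argument vector to the replication layer. */
static void example_propagate(example_arg **argv, int argc) {
    (void)argv;
    (void)argc;
}

/* Naive variant: every propagation creates and destroys all arguments. */
static size_t example_propagate_naive(const char *key, const char *group,
                                      const char **ids, int n) {
    size_t before = example_alloc_count;
    for (int i = 0; i < n; i++) {
        example_arg *argv[4] = {
            example_arg_create("XCLAIM"), example_arg_create(key),
            example_arg_create(group), example_arg_create(ids[i])
        };
        example_propagate(argv, 4);
        for (int j = 0; j < 4; j++) example_arg_free(argv[j]);
    }
    return example_alloc_count - before;
}

/* Reuse variant: constant arguments are created once and shared by all
 * propagations; only the per-entry ID is created inside the loop. */
static size_t example_propagate_reuse(const char *key, const char *group,
                                      const char **ids, int n) {
    size_t before = example_alloc_count;
    example_arg *cmd = example_arg_create("XCLAIM");
    example_arg *k = example_arg_create(key);
    example_arg *g = example_arg_create(group);
    for (int i = 0; i < n; i++) {
        example_arg *id = example_arg_create(ids[i]);
        example_arg *argv[4] = { cmd, k, g, id };
        example_propagate(argv, 4);
        example_arg_free(id);
    }
    example_arg_free(cmd);
    example_arg_free(k);
    example_arg_free(g);
    return example_alloc_count - before;
}
```

Both functions return the number of argument objects they created. With N entries the naive variant creates 4N objects and the reuse variant creates 3 + N, so the gap widens as more entries are claimed in one call.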
This is the General Availability release of Redis 8.4 in Redis Open Source.

### Major changes compared to 8.2

- `DIGEST`, `DELEX`; `SET` extensions - atomic compare-and-set and compare-and-delete for string keys
- `MSETEX` - atomically set multiple string keys and update their expiration
- `XREADGROUP` - new `CLAIM` option for reading both idle pending and incoming stream entries
- `CLUSTER MIGRATION` - atomic slot migration
- `CLUSTER SLOT-STATS` - per-slot usage metrics: key count, CPU time, and network I/O
- Redis query engine: `FT.HYBRID` - hybrid search and fused scoring
- Redis query engine: I/O threading with performance boost for search and query commands (FT.*)
- I/O threading: substantial throughput increase (e.g. >30% for caching use cases (10% `SET`, 90% `GET`), 4 cores)
- JSON: substantial memory reduction for homogeneous arrays (up to 91%)

### Binary distributions

- Alpine and Debian Docker images - https://hub.docker.com/_/redis
- Install using snap - see https://github.com/redis/redis-snap
- Install using brew - see https://github.com/redis/homebrew-redis
- Install using RPM - see https://github.com/redis/redis-rpm
- Install using Debian APT - see https://github.com/redis/redis-debian

### Operating systems we test Redis 8.4 on

- Ubuntu 22.04 (Jammy Jellyfish), 24.04 (Noble Numbat)
- Rocky Linux 8.10, 9.5
- AlmaLinux 8.10, 9.5
- Debian 12 (Bookworm), Debian 13 (Trixie)
- macOS 13 (Ventura), 14 (Sonoma), 15 (Sequoia)

### Bug fixes (compared to 8.4-RC1)

- #14524 `XREADGROUP CLAIM` returns strings instead of integers
- #14529 Add variable key-spec flags to SET IF* and DELEX
- #P928 Potential memory leak (MOD-11484)
- #T1801, #T1805 macOS build failures (MOD-12293)
- #J1438 `JSON.NUMINCRBY` - wrong result on integer array with non-integer increment (MOD-12282)
- #J1437 Thread safety issue related to ASM and shared strings (MOD-12013)

### Performance and resource utilization improvements (compared to 8.4-RC1)

- #14480, #14516 Optimize `XREADGROUP`

### Known bugs and limitations

- When executing `FT.SEARCH`, `FT.AGGREGATE`, `FT.CURSOR`, `FT.HYBRID`, `TS.MGET`, `TS.MRANGE`, `TS.MREVRANGE` and `TS.QUERYINDEX` while an atomic slot migration process is in progress, the results may be partial or contain duplicates
- `FT.PROFILE`, `FT.EXPLAIN` and `FT.EXPLAINCLI` don’t contain the `FT.HYBRID` option
- Metrics from the `FT.HYBRID` command aren’t displayed on `FT.INFO` and `INFO`
- Options `EXPLAINSCORE`, `SHARD_K_RATIO`, `YIELD_DISTANCE_AS` and `WITHCURSOR` with `FT.HYBRID` are not available
- Post-filtering (after the `COMBINE` step) using FILTER is not available
- Currently the default response format considers only `key_id` and `score`; this may change to deliver the entire document content
This PR improves stream performance in the range iteration and reply generation paths, benefiting xadd, xrange, xrevrange, and xreadgroup.

- ull2string memcpy optimization
- streamID struct + streamCompareID
- streamID2string + reply path
- getClientType inline + cache locality

Inspired by the high-level description (not the code) of redis/redis#14480.

---------

Signed-off-by: Ernesto Alejandro Santana Hidalgo <ernesto.alejandrosantana@gmail.com>
The logic added in #14402 introduced overhead to XREADGROUP even when the new feature is not used.
This PR tries to mitigate it by removing unnecessary streamEncodeID() calls and redundant byte-swapping operations from the stream iterator hot path.
By comparing stream IDs directly in native-endian form, we eliminate repeated encoding and memcmp() calls that were responsible for a significant portion of total CPU time during stream iteration.
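To make the difference concrete, here is a minimal sketch of the two comparison strategies. It is not the code from this PR: `example_stream_id` and the `example_*` functions are hypothetical names, and the only assumption is that an ID consists of two unsigned 64-bit fields (milliseconds and sequence) ordered by milliseconds first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a stream ID: milliseconds plus sequence. */
typedef struct {
    uint64_t ms;
    uint64_t seq;
} example_stream_id;

/* Slow path, step 1: serialize both fields big-endian into 16 bytes so
 * that byte-wise memcmp() order matches numeric order. */
static void example_encode_id(unsigned char buf[16],
                              const example_stream_id *id) {
    for (int i = 0; i < 8; i++) {
        buf[i]     = (unsigned char)(id->ms  >> (56 - 8 * i));
        buf[8 + i] = (unsigned char)(id->seq >> (56 - 8 * i));
    }
}

/* Slow path, step 2: encode both IDs, then compare the buffers. */
static int example_compare_encoded(const example_stream_id *a,
                                   const example_stream_id *b) {
    unsigned char ea[16], eb[16];
    example_encode_id(ea, a);
    example_encode_id(eb, b);
    int c = memcmp(ea, eb, sizeof(ea));
    return (c > 0) - (c < 0);
}

/* Fast path: compare the native integers directly, with no encoding,
 * no byte swapping and no memcmp() call. */
static int example_compare_native(const example_stream_id *a,
                                  const example_stream_id *b) {
    if (a->ms != b->ms) return a->ms > b->ms ? 1 : -1;
    if (a->seq != b->seq) return a->seq > b->seq ? 1 : -1;
    return 0;
}

/* Sanity helper: both strategies must agree on the ordering. */
static void example_check_agreement(const example_stream_id *a,
                                    const example_stream_id *b) {
    assert(example_compare_encoded(a, b) == example_compare_native(a, b));
}
```

Both functions return -1, 0 or 1 and agree on every input; the native variant simply skips the per-comparison serialization, which is the kind of work the description identifies as the hot spot.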
A sample VTune profile of streamEncodeID:
Additionally, endian conversion helpers are modernized to leverage compiler-provided intrinsics (__builtin_bswap*) for single-instruction byte-swaps on supported compilers.
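As an illustration of this pattern, the sketch below pairs the GCC/Clang `__builtin_bswap32`/`__builtin_bswap64` intrinsics with a portable shift-and-mask fallback. The function names (`example_bswap32`, `example_bswap64` and their `_portable` variants) and the compiler check are assumptions made for this example, not the helpers changed in the PR.

```c
#include <stdint.h>

/* Portable fallbacks: reverse the bytes with shifts and masks. */
static inline uint32_t example_bswap32_portable(uint32_t v) {
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

static inline uint64_t example_bswap64_portable(uint64_t v) {
    /* Swap the 32-bit halves, then 16-bit pairs, then adjacent bytes. */
    v = (v << 32) | (v >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) |
        ((v >> 16) & 0x0000FFFF0000FFFFULL);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) |
        ((v >> 8) & 0x00FF00FF00FF00FFULL);
    return v;
}

/* Use the compiler intrinsic when GCC or Clang is detected; it typically
 * compiles to a single byte-swap instruction. Otherwise fall back to the
 * portable version. */
static inline uint32_t example_bswap32(uint32_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#else
    return example_bswap32_portable(v);
#endif
}

static inline uint64_t example_bswap64(uint64_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);
#else
    return example_bswap64_portable(v);
#endif
}
```

Keeping the fallback means the helpers still build on compilers without the intrinsics, while supported compilers get the shorter instruction sequence.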
Improvements Table
Altogether, this leads to a ~10% improvement compared to the latest unstable, as seen below: