
Optimize stream ID comparison and endian conversion hot paths #14480

Merged
sundb merged 6 commits into redis:unstable from filipecosta90:intrinsics.bswap on Nov 7, 2025

Conversation

@fcostaoliveira
Collaborator

@fcostaoliveira fcostaoliveira commented Oct 28, 2025

The logic added in #14402 introduced overhead in XREADGROUP even when the new feature is not used.

This PR aims to mitigate that overhead by removing unnecessary streamEncodeID() calls and redundant byte-swapping operations from the stream iterator hot path.
Comparing stream IDs directly in native-endian form eliminates the repeated encoding and memcmp() calls that accounted for a significant portion of total CPU time during stream iteration.
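The difference between the two comparison strategies can be sketched as follows. This is a minimal illustration, not the code from this PR: the `example_*` names and the struct layout are assumptions made for the example, and only the general idea (encode to big-endian bytes and `memcmp()` versus comparing the native integers) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a stream ID: milliseconds plus sequence number. */
typedef struct {
    uint64_t ms;
    uint64_t seq;
} example_stream_id;

/* Encode as 16 big-endian bytes so that memcmp() order matches numeric order. */
void example_encode_id(unsigned char out[16], const example_stream_id *id) {
    for (int i = 0; i < 8; i++) {
        out[i]     = (unsigned char)(id->ms  >> (56 - 8 * i));
        out[8 + i] = (unsigned char)(id->seq >> (56 - 8 * i));
    }
}

/* Slow path: encode both IDs, then compare the resulting byte strings. */
int example_compare_encoded(const example_stream_id *a, const example_stream_id *b) {
    unsigned char ea[16], eb[16];
    example_encode_id(ea, a);
    example_encode_id(eb, b);
    int r = memcmp(ea, eb, sizeof(ea));
    return (r > 0) - (r < 0); /* normalize to -1, 0, 1 */
}

/* Fast path: compare the native integers directly, with no encoding step. */
int example_compare_native(const example_stream_id *a, const example_stream_id *b) {
    if (a->ms != b->ms) return a->ms > b->ms ? 1 : -1;
    if (a->seq != b->seq) return a->seq > b->seq ? 1 : -1;
    return 0;
}

/* Sanity check: both paths must agree on ordering. */
void example_check_equivalence(void) {
    example_stream_id lo = {1, 5}, hi = {256, 0};
    assert(example_compare_native(&lo, &hi) == example_compare_encoded(&lo, &hi));
    assert(example_compare_native(&hi, &lo) == example_compare_encoded(&hi, &lo));
    assert(example_compare_native(&lo, &lo) == example_compare_encoded(&lo, &lo));
}
```

Both functions return the same ordering, but the native comparison touches two integers per ID instead of building and scanning two 16-byte buffers, which is the kind of work an iterator hot path would want to avoid.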

A sample VTune profile for streamEncodeID:

| Function Stack | CPU Time: Total | CPU Time: Self | Module | Function (Full) | Source File | Start Address |
| --- | --- | --- | --- | --- | --- | --- |
| streamEncodeID | 7.3% | 0.229s | redis-server | streamEncodeID | t_stream.c | 0x220d80 |
| &nbsp;&nbsp;intrev64 | 4.2% | 0s | redis-server | intrev64 | endianconv.c | 0x220d90 |

Additionally, endian conversion helpers are modernized to leverage compiler-provided intrinsics (__builtin_bswap*) for single-instruction byte-swaps on supported compilers.
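A byte-swap helper of this kind might look like the sketch below. It is not the code merged in this PR: the `example_*` names and the exact preprocessor guard are assumptions. `__builtin_bswap64` is the GCC/Clang builtin, and the shift-and-mask version is a generic portable fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: reverse the byte order with shifts and masks. */
uint64_t example_bswap64_portable(uint64_t v) {
    v = ((v & 0x00000000FFFFFFFFULL) << 32) | (v >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    return v;
}

/* Use the compiler builtin where available (assumed guard: GCC or Clang). */
uint64_t example_bswap64(uint64_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);
#else
    return example_bswap64_portable(v);
#endif
}

/* Sanity check: the builtin and the fallback must produce identical results. */
void example_bswap64_selfcheck(void) {
    uint64_t v = 0x0102030405060708ULL;
    assert(example_bswap64(v) == example_bswap64_portable(v));
    assert(example_bswap64(example_bswap64(v)) == v);
}
```

On targets that have a byte-swap instruction, compilers typically lower the builtin to it, whereas a function defined in another translation unit may not be inlined at all.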

Improvements Table

Altogether, this leads to a ~10% improvement compared to the latest unstable, as seen below:

| Test Case | Baseline /redis d4307af (median obs. +- std.dev) | Comparison filipecosta90/redis intrinsics.bswap (median obs. +- std.dev) | % change (higher-better) | Note |
| --- | --- | --- | --- | --- |
| memtier_benchmark-stream-10M-entries-xreadgroup-count-100 | 5739 | 6325 +- 4.3% (2 datapoints) | 10.2% | Comparison is Better |

@sundb
Collaborator

sundb commented Oct 29, 2025

@filipecosta90 is there a benchmark for it?

@sundb sundb moved this from Todo to In Review in Redis 8.4 Oct 30, 2025
Co-authored-by: debing.sun <debing.sun@redis.com>
@fcostaoliveira
Collaborator Author

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

Using platform named: x86-aws-m7i.metal-24xl for both baseline and comparison.

Using triggering environment: ci for both baseline and comparison.

In summary:

  • Detected a total of 1 stable tests between versions.
  • Detected a total of 1 improvements above the improvement water line (Comparison is Better).
    • The median improvement (Comparison is Better) was 10.2%, with values ranging from 10.2% to 10.2%.
    • Quartile distribution: P25=10.2%, P50=10.2%, P75=10.2%.

You can check a comparison in detail via the grafana link

Comparison between d4307af and intrinsics.bswap.

Time Period from 5 months ago. (environment used: oss-standalone)

By GROUP change csv:

command_group,min_change,q1_change,median_change,q3_change,max_change
stream,-0.157,2.434,5.025,7.616,10.208

By COMMAND change csv:

command,min_change,q1_change,median_change,q3_change,max_change

#### Improvements Table
| Test Case | Baseline /redis d4307af (median obs. +- std.dev) | Comparison filipecosta90/redis intrinsics.bswap (median obs. +- std.dev) | % change (higher-better) | Note |
| --- | --- | --- | --- | --- |
| memtier_benchmark-stream-10M-entries-xreadgroup-count-100 | 5739 | 6325 +- 4.3% (2 datapoints) | 10.2% | Comparison is Better |

Improvements test regexp names: memtier_benchmark-stream-10M-entries-xreadgroup-count-100

Full Results table:
| Test Case | Baseline /redis d4307af (median obs. +- std.dev) | Comparison filipecosta90/redis intrinsics.bswap (median obs. +- std.dev) | % change (higher-better) | Note |
| --- | --- | --- | --- | --- |
| memtier_benchmark-stream-10M-entries-xreadgroup-count-100 | 5739 | 6325 +- 4.3% (2 datapoints) | 10.2% | Comparison is Better |
| memtier_benchmark-stream-10M-entries-xreadgroup-count-100-noack | 37851 | 37791 +- 0.4% (3 datapoints) | -0.2% | No Change |
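For readers who want to check the "% change" column, it appears to be the relative difference between the comparison and baseline medians. The helper below is an assumption about how that figure is derived, not the benchmark tool's actual code, and `example_pct_change` is a hypothetical name.

```c
#include <assert.h>

/* Relative change of `comparison` versus `baseline`, in percent.
 * Positive values mean the comparison is higher (better for throughput). */
double example_pct_change(double baseline, double comparison) {
    assert(baseline != 0.0); /* a zero baseline has no meaningful relative change */
    return (comparison - baseline) / baseline * 100.0;
}
```

Applied to the rounded medians in the table, `example_pct_change(5739, 6325)` gives roughly 10.2 and `example_pct_change(37851, 37791)` gives roughly -0.16, which is consistent with the 10.2% and -0.2% shown.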

@fcostaoliveira
Collaborator Author

fcostaoliveira commented Nov 5, 2025

> @filipecosta90 is there a benchmark for it?

Yes, I just added the official run in
#14480 (comment)

TLDR: ~10% bump on XREADGROUP.
Updated the main comment to include it.

Collaborator

@sggeorgiev sggeorgiev left a comment


Looks good.

@sundb sundb merged commit b9ad4f6 into redis:unstable Nov 7, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in Redis 8.4 Nov 7, 2025
sundb pushed a commit that referenced this pull request Nov 7, 2025
…n overhead in stream propagation (#14516)

As seen in the following flamegraph, even after PR #14480, there is a lot
of redundant work when propagating multiple XCLAIMs within an
XREADGROUP.

This PR refactors streamPropagateXCLAIM to add a new static inline
variant, `streamPropagateXCLAIMCopyFree()`, which accepts pre-created
`robj*` arguments.
This enables reusing argument objects across multiple XCLAIM
propagations, reducing repeated creation and destruction costs during
high-throughput consumer group operations.
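The reuse pattern described in that commit message can be illustrated with a small self-contained sketch. Everything here is hypothetical: `example_arg` stands in for Redis's `robj`, `example_send` stands in for the propagation call, and the allocation counter exists only to make the saving visible. It does not reproduce the real `streamPropagateXCLAIM` signature.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical argument object; counts how many times one is created. */
typedef struct { char *text; } example_arg;
static size_t example_allocs = 0;
static size_t example_sent = 0;

example_arg *example_arg_create(const char *s) {
    example_arg *a = malloc(sizeof(*a));
    assert(a != NULL);
    size_t n = strlen(s) + 1;
    a->text = malloc(n);
    assert(a->text != NULL);
    memcpy(a->text, s, n);
    example_allocs++;
    return a;
}

void example_arg_free(example_arg *a) { free(a->text); free(a); }

/* Stand-in for propagating one command; it only counts calls. */
void example_send(example_arg *cmd, example_arg *key, example_arg *group,
                  unsigned long long entry) {
    (void)cmd; (void)key; (void)group; (void)entry;
    example_sent++;
}

/* Naive variant: rebuild the constant arguments for every entry. */
void example_propagate_naive(const unsigned long long *ids, size_t n) {
    for (size_t i = 0; i < n; i++) {
        example_arg *cmd = example_arg_create("XCLAIM");
        example_arg *key = example_arg_create("mystream");
        example_arg *group = example_arg_create("mygroup");
        example_send(cmd, key, group, ids[i]);
        example_arg_free(cmd); example_arg_free(key); example_arg_free(group);
    }
}

/* Reuse variant: create the constant arguments once, pass them to each call. */
void example_propagate_reuse(const unsigned long long *ids, size_t n) {
    example_arg *cmd = example_arg_create("XCLAIM");
    example_arg *key = example_arg_create("mystream");
    example_arg *group = example_arg_create("mygroup");
    for (size_t i = 0; i < n; i++) example_send(cmd, key, group, ids[i]);
    example_arg_free(cmd); example_arg_free(key); example_arg_free(group);
}
```

With four entries, the naive variant creates twelve argument objects while the reuse variant creates three, and the gap grows linearly with the number of propagated entries.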
@sundb sundb mentioned this pull request Nov 18, 2025
sundb added a commit that referenced this pull request Nov 18, 2025
This is the General Availability release of Redis 8.4 in Redis Open
Source.

### Major changes compared to 8.2

- `DIGEST`, `DELEX`; `SET` extensions - atomic compare-and-set and
compare-and-delete for string keys
- `MSETEX` - atomically set multiple string keys and update their
expiration
- `XREADGROUP` - new `CLAIM` option for reading both idle pending and
incoming stream entries
- `CLUSTER MIGRATION` - atomic slot migration
- `CLUSTER SLOT-STATS` - per-slot usage metrics: key count, CPU time,
and network I/O
- Redis query engine: `FT.HYBRID` - hybrid search and fused scoring
- Redis query engine: I/O threading with performance boost for search
and query commands (FT.*)
- I/O threading: substantial throughput increase (e.g. >30% for caching
use cases (10% `SET`, 90% `GET`), 4 cores)
- JSON: substantial memory reduction for homogenous arrays (up to 91%)

### Binary distributions

- Alpine and Debian Docker images - https://hub.docker.com/_/redis
- Install using snap - see https://github.com/redis/redis-snap
- Install using brew - see https://github.com/redis/homebrew-redis
- Install using RPM - see https://github.com/redis/redis-rpm
- Install using Debian APT - see https://github.com/redis/redis-debian


### Operating systems we test Redis 8.4 on

- Ubuntu 22.04 (Jammy Jellyfish), 24.04 (Noble Numbat)
- Rocky Linux 8.10, 9.5
- AlmaLinux 8.10, 9.5
- Debian 12 (Bookworm), Debian 13 (Trixie)
- macOS 13 (Ventura), 14 (Sonoma), 15 (Sequoia)

### Bug fixes (compared to 8.4-RC1)

- #14524 `XREADGROUP CLAIM` returns strings instead of integers
- #14529 Add variable key-spec flags to SET IF* and DELEX
- #P928 Potential memory leak (MOD-11484)
- #T1801, #T1805 macOS build failures (MOD-12293)
- #J1438 `JSON.NUMINCRBY` - wrong result on integer array with
non-integer increment (MOD-12282)
- #J1437 Thread safety issue related to ASM and shared strings
(MOD-12013)


### Performance and resource utilization improvements (compared to 8.4-RC1)

- #14480, #14516 Optimize `XREADGROUP`

### Known bugs and limitations

- When executing `FT.SEARCH`, `FT.AGGREGATE`, `FT.CURSOR`, `FT.HYBRID`,
`TS.MGET`, `TS.MRANGE`, `TS.MREVRANGE` and `TS.QUERYINDEX` while an
atomic slot migration process is in progress, the results may be partial
or contain duplicates
- `FT.PROFILE`, `FT.EXPLAIN` and `FT.EXPLAINCLI` don’t contain the
`FT.HYBRID` option
- Metrics from `FT.HYBRID` command aren’t displayed on `FT.INFO` and
`INFO`
- Options `EXPLAINSCORE`, `SHARD_K_RATIO`, `YIELD_DISTANCE_AS` and
`WITHCURSOR` with `FT.HYBRID` are not available
- Post-filtering (after `COMBINE` step) using FILTER is not available
- Currently the default response format considers only `key_id` and
`score`, this may change for delivering entire document content
zuiderkwast pushed a commit to valkey-io/valkey that referenced this pull request Mar 13, 2026
This PR improves stream performance in the range iteration and reply
generation paths, which benefits xadd, xrange, xrevrange, and xreadgroup.

- ull2string memcpy optimization
- streamID struct + streamCompareID
- streamID2string + reply path
- getClientType inline + cache locality

Inspired by the high-level description (not the code) of
redis/redis#14480.

---------

Signed-off-by: Ernesto Alejandro Santana Hidalgo <ernesto.alejandrosantana@gmail.com>
