Conversation

@pizhenwei
Contributor


In production environments, RDMA is becoming popular and common in
networking hardware, so add RDMA as a transport-layer protocol to
improve performance and reduce cost.

Note that this feature is ONLY implemented/tested on Linux.

Several steps of the full job:
1. Support RDMA connections between client and server. This is the
   most important part, and luckily it is implemented in this patch.
   Add a new config "rdma-port" for the server side to listen on
   an RDMA port. Both redis-cli and redis-benchmark work fine with a
   new argument '--rdma'. The "REPLICAOF" command launches an RDMA
   client if "rdma-replication" is enabled, and it works fine.

2. Support RDMA cluster mode.

3. Implement async read/write for the client side. Because RDMA does
   NOT support the POLLOUT event, implementing the async I/O mechanism
   for hiredis is somewhat difficult.

The test result is quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

====== PING_INLINE ======
 TCP: QPS: 159017   AVG LAT: 0.183
RDMA: QPS: 523944   AVG LAT: 0.054

====== PING_MBULK ======
 TCP: QPS: 162256   AVG LAT: 0.179
RDMA: QPS: 509839   AVG LAT: 0.056

====== SET ======
 TCP: QPS: 154700   AVG LAT: 0.187
RDMA: QPS: 492368   AVG LAT: 0.058

====== GET ======
 TCP: QPS: 159022   AVG LAT: 0.182
RDMA: QPS: 525099   AVG LAT: 0.054

====== LPUSH (needed to benchmark LRANGE) ======
 TCP: QPS: 142537   AVG LAT: 0.207
RDMA: QPS: 395038   AVG LAT: 0.073

====== LRANGE_100 (first 100 elements) ======
 TCP: QPS:  36171   AVG LAT: 0.657
RDMA: QPS:  55266   AVG LAT: 0.412

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@kukey
Contributor

kukey commented Jun 29, 2021

cool

@pizhenwei
Contributor Author

@oranagra @yossigo @soloestoy
Hi, sorry about that; maybe I should have created an issue to discuss this feature before pushing a new PR.
About a month ago, while debugging iSCSI/iSER, I noticed that storage virtualization performance improved a lot when using RDMA. But I wasn't sure whether Redis could achieve the same gains (because storage I/O sizes are always aligned to 4K, while KV sizes are of indefinite length). So I implemented this feature to test the idea, and luckily the results are almost triple those of TCP.
What should my next step be?

@oranagra
Member

@pizhenwei thank you for the significant contribution.
we are already looking into this PR, please hold on until we publish our feedback.
this is certainly very interesting.
p.s. it's nice to see the TLS / connection abstraction project paying off again 8-)

@yossigo
Collaborator

yossigo commented Jul 8, 2021

@pizhenwei Thanks for this contribution! After a discussion with @oranagra and the rest of the core team, we have reached a few conclusions.

First, RDMA is not (yet?) a commodity technology so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today.

We also know very little about it, but we can assess that the surface area for officially supporting it is pretty big and involves many additional aspects, such as:

  • Formally defining the Redis over RDMA protocol (including security, RESP2/RESP3 considerations, consider how replication is done, cluster bus, etc.).
  • Consider how to make that accessible to different clients (can they directly use a lower level RDMA library? create a new standard low-level lib that will serve as the C binding?).
  • Complete the implementation for replication, cluster bus, Sentinel, etc.

Because of this, we don't think it's possible at this point in time to accept this contribution and make it an integral part of the Redis core.

What we can and want to do is use this opportunity to move forward with ideas we have already discussed in the past around first-party modules and modularizing Redis. One idea we discussed in the past was being able to have the TLS connection capability implemented as a standalone optional module, so users can use Redis with or without it, or even load alternative connection modules with different TLS implementations.

If we pursue this, RDMA support could be (mostly) an external module which can be developed and maintained separately.

@FujiZ

FujiZ commented Jul 8, 2021

  • Formally defining the Redis over RDMA protocol (including security, RESP2/RESP3 considerations, consider how replication is done, cluster bus, etc.).

+1 for this. I think defining a protocol over RDMA is especially challenging, since RDMA (ibverbs) offers many design choices for implementing the same function. For example, to transmit a bulk of data, one can use the SEND/RECV primitives, similar to UDP; or use the RDMA WRITE primitive to write the data directly into the peer's memory, bypassing the peer's CPU; or let the peer read from a local buffer using the RDMA READ primitive. These design choices may yield different performance characteristics.

In addition, the performance of RDMA is sensitive to the platform it runs on (NIC model, CPU model, etc.). The design choice that performs well on one platform may yield bad performance on another one, so finding an ideal design is even more challenging with so many platforms available.

@pizhenwei
Contributor Author

Closing this PR, but keeping the pizhenwei:feature-rdma branch for performance testing and similar purposes.

Issues & suggestions are welcome!

@pizhenwei pizhenwei closed this Jul 23, 2021
pizhenwei added a commit to pizhenwei/redis that referenced this pull request Jul 23, 2021
In production environments, RDMA is becoming popular and common in
networking hardware, so add RDMA as a transport-layer protocol to
improve performance and reduce cost.

Note that this feature is ONLY implemented/tested on Linux.

Actually, this is the v2 implementation. The v1 used the low-level IB
verbs API directly; for the code and discussion, see PR:
    redis#9161

Instead of the low-level API, v2 uses rsocket, which is implemented
by rdma-core, to simplify the work in Redis.

The test result is quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

====== PING_INLINE ======
    TCP: QPS: 159017   AVG LAT: 0.183
v1 RDMA: QPS: 523944   AVG LAT: 0.054
v2 RDMA: QPS: 492683   AVG LAT: 0.052

====== PING_MBULK ======
    TCP: QPS: 162256   AVG LAT: 0.179
v1 RDMA: QPS: 509839   AVG LAT: 0.056
v2 RDMA: QPS: 532226   AVG LAT: 0.048

====== SET ======
    TCP: QPS: 154700   AVG LAT: 0.187
v1 RDMA: QPS: 492368   AVG LAT: 0.058
v2 RDMA: QPS: 295534   AVG LAT: 0.095

====== GET ======
    TCP: QPS: 159022   AVG LAT: 0.182
v1 RDMA: QPS: 525099   AVG LAT: 0.054
v2 RDMA: QPS: 411488   AVG LAT: 0.065

====== LPUSH (needed to benchmark LRANGE) ======
    TCP: QPS: 142537   AVG LAT: 0.207
v1 RDMA: QPS: 395038   AVG LAT: 0.073
v2 RDMA: QPS: 353232   AVG LAT: 0.079

====== LRANGE_100 (first 100 elements) ======
    TCP: QPS:  36171   AVG LAT: 0.657
v1 RDMA: QPS:  55266   AVG LAT: 0.412
v2 RDMA: QPS:  52228   AVG LAT: 0.468

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@rleon

rleon commented Jul 25, 2021

First, RDMA is not (yet?) a commodity technology so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today.

This sentence, at least, has not been true for years.
Azure has had it since 2015: https://azure.microsoft.com/en-us/blog/azure-linux-rdma-hpc-available/
https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc - RDMA-capable instances
Amazon: https://aws.amazon.com/hpc/efa/
Alicloud: https://www.alibabacloud.com/blog/using-rdma-on-container-service-for-kubernetes_594462

So the more accurate sentence is: "Almost all HPC, AI and hyper-scale clouds use RDMA as a base for their network fabric".

Thanks

@jue-jiang

Hi @yossigo:
I have gone through the Redis protocol specification. TCP is simply used as-is; there is no "Redis over TCP" protocol. If there is one, please correct me.
From the RDMA side, application developers should just use RDMA the same way they use TCP and obtain better performance. So I think defining a Redis over RDMA protocol is not the first priority, and I agree that a clean way to switch between TCP and RDMA should be done first.

@yossigo
Collaborator

yossigo commented Sep 2, 2021

Hi @jue-jiang,

Redis over TCP is not formally specified, but there is a de-facto implementation that makes several assumptions, like:

  • Working on top of a full-duplex reliable stream
  • How an endpoint address (equivalent to IP+port) is represented, e.g. when a replica announces its visible address to a master, or in cluster bus messages

I assumed (and may be wrong) that some of those assumptions are not necessarily true for RDMA and that some work is required to analyze and define that.

@pizhenwei
Contributor Author

pizhenwei commented Sep 3, 2021

Hi @jue-jiang,

Redis over TCP is not formally specified, but there is a de-facto implementation that makes several assumptions, like:

  • Working on top of a full-duplex reliable stream
  • How an endpoint address (equivalent to IP+port) is represented, e.g. when a replica announces its visible address to a master, or in cluster bus messages

I assumed (and may be wrong) that some of those assumptions are not necessarily true for RDMA and that some work is required to analyze and define that.

@jue-jiang
Redis supports a connection abstract layer(Ref struct ConnectionType), and it's designed as stream semantics. Let's look at the basic difference between TCP and RDMA:

  • TCP: stream semantics, you can write/read uncertain length of data, the TCP protocol could send/receive correctly.
  • RDMA: message semantics. Before write/read data with remote side, we must allocate memory, and set memory as RDMA memory region with a fixed size.

For the Redis scenario, when client side runs 'get key', the client side does NOT know the length of response data, and client side could read stream data and parse 'Redis protocol'. But RDMA could only send/receive fixed size of data.
To support Redis over RDMA, the main job is 'emulate stream semantics by RDMA' to compact Redis connection abstract layer.
So defining Redis over RDMA protocol is definitely the first priority.

zuiderkwast pushed a commit to valkey-io/valkey that referenced this pull request Jul 15, 2024
Adds an option to build RDMA support as a module:

    make BUILD_RDMA=module

To start valkey-server with RDMA, use a command line like the following:

    ./src/valkey-server --loadmodule src/valkey-rdma.so \
        port=6379 bind=xx.xx.xx.xx

* Implement the server side of the connection module only; this means
  we can *NOT* compile RDMA support as a built-in.

* Add the necessary information to README.md

* Support 'CONFIG SET/GET', for example 'CONFIG SET rdma.port 6380',
  then check this with 'rdma res show cm_id' and valkey-cli (with RDMA
  support, not implemented in this patch).

* The full list of listeners looks like:

      listener0:name=tcp,bind=*,bind=-::*,port=6379
      listener1:name=unix,bind=/var/run/valkey.sock
      listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
      listener3:name=tls,bind=*,bind=-::*,port=16379

Because of the lack of RDMA support in TCL, a simple C program is used
to test Valkey Over RDMA (under tests/rdma/). This is a fairly raw
version with basic library dependencies: libpthread, libibverbs,
librdmacm. Run it using the script:
    ./runtest-rdma [ OPTIONS ]

To run RDMA in GitHub Actions, the RXE kernel module for emulated soft
RDMA needs to be installed. Its source code is fetched from a repo
containing only the RXE driver extracted from the Linux kernel, kept in
a separate repo to avoid cloning the whole Linux kernel tree.

----

Since 2021/06, when I created a [PR](redis/redis#9161) for the
*Redis Over RDMA* proposal, I have done some work to [fully abstract
connections and make TLS dynamically loadable](redis/redis#9320): since
Redis 7.2.0, a new connection type can be built into Redis statically,
or as a separate shared library loaded by Redis on startup.

Based on the new connection framework, I created a new
[PR](redis/redis#11182), which some
people (@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ)
noticed, played with and tested. However, because of the lack of time
and knowledge on the maintainers' side, that PR has been pending for
about 2 years.

Related doc: [Introduce the *Valkey Over RDMA*
specification](valkey-io/valkey-doc#123). (Same as for Redis; this
should stay the same.)

Changes in this PR:
- implement *Valkey Over RDMA* (adapted to the Valkey style).

Finally, if this feature is considered for merging, I volunteer to
maintain it.

---------

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>