Support RDMA as transport layer protocol #9161
Conversation
In production environments, RDMA has become a popular and common
networking-card capability, so support RDMA as a transport layer
protocol to improve performance and reduce cost.
Note that this feature is ONLY implemented/tested on Linux.
Several steps of the full job:
1. Support RDMA for the connection between client & server. This is the
   most important part, and luckily it's implemented in this patch.
   Add a new config "rdma-port" for the server side to listen on
   an RDMA port. Both redis-cli and redis-benchmark work fine with a
   new argument '--rdma'. The "REPLICAOF" command launches an RDMA client
   if "rdma-replication" is enabled, and it works fine.
2. Support RDMA cluster mode.
3. Implement async read/write for the client side. Because RDMA does
   NOT support the POLLOUT event, it's a little difficult to implement
   the async IO mechanism for hiredis (see the sketch below).
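A minimal sketch of the POLLOUT limitation mentioned in step 3, assuming
a completion channel `ch` already wired to a CQ (setup omitted): RDMA
completion events surface only as readability on the channel's fd, so
there is no writability notification to drive an async write path the
way POLLOUT does for a TCP socket:

    #include <poll.h>
    #include <infiniband/verbs.h>

    /* Hypothetical helper: block until a completion event arrives,
       then drain the CQ. */
    static int drain_completions(struct ibv_comp_channel *ch)
    {
        struct pollfd pfd = { .fd = ch->fd, .events = POLLIN };
        struct ibv_cq *cq;
        void *cq_ctx;
        struct ibv_wc wc;

        /* Only POLLIN exists here; the channel never reports POLLOUT. */
        if (poll(&pfd, 1, -1) <= 0) return -1;
        if (ibv_get_cq_event(ch, &cq, &cq_ctx)) return -1;
        ibv_ack_cq_events(cq, 1);
        if (ibv_req_notify_cq(cq, 0)) return -1;  /* re-arm notification */
        while (ibv_poll_cq(cq, 1, &wc) > 0)
            ;  /* inspect wc.status / wc.opcode here */
        return 0;
    }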
The test result is quite exciting (QPS = requests/sec, AVG LAT = average
latency in msec):
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
--threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
--threads 8 -d 512 -t ping,set,get,lrange_100
====== PING_INLINE ======
TCP: QPS: 159017 AVG LAT: 0.183
RDMA: QPS: 523944 AVG LAT: 0.054
====== PING_MBULK ======
TCP: QPS: 162256 AVG LAT: 0.179
RDMA: QPS: 509839 AVG LAT: 0.056
====== SET ======
TCP: QPS: 154700 AVG LAT: 0.187
RDMA: QPS: 492368 AVG LAT: 0.058
====== GET ======
TCP: QPS: 159022 AVG LAT: 0.182
RDMA: QPS: 525099 AVG LAT: 0.054
====== LPUSH (needed to benchmark LRANGE) ======
TCP: QPS: 142537 AVG LAT: 0.207
RDMA: QPS: 395038 AVG LAT: 0.073
====== LRANGE_100 (first 100 elements) ======
TCP: QPS: 36171 AVG LAT: 0.657
RDMA: QPS: 55266 AVG LAT: 0.412
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
cool

@oranagra @yossigo @soloestoy

@pizhenwei thank you for the significant contribution.
@pizhenwei Thanks for this contribution! After a discussion with @oranagra and the rest of the core team, we have reached a few conclusions. First, RDMA is not (yet?) a commodity technology, so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today. We also know very little about it, but we can assess that the surface area for officially supporting it is pretty big and involves many additional aspects, such as:
Because of this, we don't think it's possible at this point in time to accept this contribution and have it as an integral part of the Redis core. What we can and want to do is use this opportunity to move forward with ideas we have already discussed in the past around first-party modules and modularizing Redis. One idea we discussed in the past was being able to have the TLS connection capability implemented as a standalone optional module, so users can use Redis with or without it, or even load alternative connection modules with different TLS implementations. If we pursue this, RDMA support could be (mostly) an external module which can be developed and maintained separately.
+1 for this. I think defining a protocol over RDMA is especially challenging since there are many design choices in RDMA (ibverbs) to implement the same function. For example, to transmit a bulk of data, one can use the SEND/RECV primitives, similar to UDP; or use the RDMA WRITE primitive to directly write the data into the peer's memory, which bypasses the peer's CPU; or let the peer read from one's buffer using the RDMA READ primitive. These design choices may yield different performance characteristics (a sketch of the three follows below). In addition, the performance of RDMA is sensitive to the platform it runs on (NIC model, CPU model, etc.). A design choice that performs well on one platform may yield bad performance on another, so finding an ideal design is even more challenging with so many platforms available.
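A minimal sketch of the three options named above, assuming an
already-connected queue pair `qp`, a registered scatter/gather entry
`sge`, and the peer's `peer_addr`/`peer_rkey` (all hypothetical names):
the same payload is posted with a different opcode in each case.

    #include <infiniband/verbs.h>

    static int post_transfer(struct ibv_qp *qp, struct ibv_sge *sge,
                             uint64_t peer_addr, uint32_t peer_rkey,
                             enum ibv_wr_opcode op)
    {
        struct ibv_send_wr wr = {0}, *bad_wr = NULL;

        wr.opcode = op;      /* IBV_WR_SEND: two-sided, the peer must post
                                a matching RECV (UDP-like);
                                IBV_WR_RDMA_WRITE: one-sided write into
                                the peer's memory, peer CPU bypassed;
                                IBV_WR_RDMA_READ: one-sided read from the
                                peer's buffer. */
        wr.sg_list = sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        if (op != IBV_WR_SEND) {
            /* One-sided verbs need the remote address and rkey that the
               peer advertised out of band. */
            wr.wr.rdma.remote_addr = peer_addr;
            wr.wr.rdma.rkey = peer_rkey;
        }
        return ibv_post_send(qp, &wr, &bad_wr);
    }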
Closing this PR, but keeping the pizhenwei:feature-rdma branch for performance testing purposes and so on. Issues & suggestions are welcome!
In production environments, RDMA has become a popular and common
networking-card capability, so support RDMA as a transport layer
protocol to improve performance and reduce cost.
Note that this feature is ONLY implemented/tested on Linux.
Actually, this is the v2 implementation. The v1 used the low-level IB
verbs API directly; for the code and discussion, see PR:
redis#9161
Instead of the low-level API, v2 uses rsocket, which is implemented
by rdma-core, to simplify the work in Redis.
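To illustrate why rsocket simplifies the work: rsocket mirrors the BSD
socket API (rsocket/rconnect/rsend/rrecv/rclose from librdmacm), so the
RDMA-specific setup that v1 did with raw verbs largely disappears. A
minimal hypothetical client sketch, assuming `addr`/`addrlen` are
already resolved:

    #include <sys/socket.h>
    #include <rdma/rsocket.h>

    static int ping_server(const struct sockaddr *addr, socklen_t addrlen)
    {
        char buf[64];
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);  /* like socket() */

        if (fd < 0 || rconnect(fd, addr, addrlen) != 0)
            return -1;
        rsend(fd, "PING\r\n", 6, 0);                /* inline RESP command */
        ssize_t n = rrecv(fd, buf, sizeof(buf), 0); /* expect "+PONG\r\n" */
        rclose(fd);
        return n > 0 ? 0 : -1;
    }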
The test result is quite exciting (QPS = requests/sec, AVG LAT = average
latency in msec):
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
--threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
--threads 8 -d 512 -t ping,set,get,lrange_100
====== PING_INLINE ======
TCP: QPS: 159017 AVG LAT: 0.183
v1 RDMA: QPS: 523944 AVG LAT: 0.054
v2 RDMA: QPS: 492683 AVG LAT: 0.052
====== PING_MBULK ======
TCP: QPS: 162256 AVG LAT: 0.179
v1 RDMA: QPS: 509839 AVG LAT: 0.056
v2 RDMA: QPS: 532226 AVG LAT: 0.048
====== SET ======
TCP: QPS: 154700 AVG LAT: 0.187
v1 RDMA: QPS: 492368 AVG LAT: 0.058
v2 RDMA: QPS: 295534 AVG LAT: 0.095
====== GET ======
TCP: QPS: 159022 AVG LAT: 0.182
v1 RDMA: QPS: 525099 AVG LAT: 0.054
v2 RDMA: QPS: 411488 AVG LAT: 0.065
====== LPUSH (needed to benchmark LRANGE) ======
TCP: QPS: 142537 AVG LAT: 0.207
v1 RDMA: QPS: 395038 AVG LAT: 0.073
v2 RDMA: QPS: 353232 AVG LAT: 0.079
====== LRANGE_100 (first 100 elements) ======
TCP: QPS: 36171 AVG LAT: 0.657
v1 RDMA: QPS: 55266 AVG LAT: 0.412
v2 RDMA: QPS: 52228 AVG LAT: 0.468
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
This sentence has not been true for years now. A more accurate sentence would be: "Almost all HPC, AI and hyper-scale clouds use RDMA as a base for their network fabric". Thanks
Hi, @yossigo:
Hi @jue-jiang, Redis over TCP is not formally specified, but there's the de-facto implementation that makes several assumptions, like:
I assumed (and may have been wrong) that some of those assumptions are not necessarily true for RDMA and that some work is required to analyze and define that.
@jue-jiang
For the Redis scenario, when the client side runs 'get key', the client does NOT know the length of the response data, so the client reads stream data and parses the Redis protocol. But RDMA can only send/receive fixed-size chunks of data.
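A minimal sketch of the mismatch described above, with a hypothetical
chunk size and a hypothetical try_parse_resp() helper: each RDMA RECV
completes with one fixed-size chunk, so the client must accumulate
chunks and run a streaming parser over them, just as a TCP client
parses after read():

    #include <stdint.h>
    #include <string.h>

    #define CHUNK_SIZE 4096  /* size of each posted RECV buffer (illustrative) */

    struct stream {
        char buf[16 * CHUNK_SIZE];
        size_t len;
    };

    /* Called once per completed RECV work request. */
    static void on_recv_complete(struct stream *s,
                                 const char *chunk, size_t nbytes)
    {
        memcpy(s->buf + s->len, chunk, nbytes);  /* append this chunk */
        s->len += nbytes;
        /* try_parse_resp() (hypothetical) would consume as many complete
           RESP replies as the accumulated bytes contain:
           s->len -= try_parse_resp(s->buf, s->len); */
    }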
Adds an option to build RDMA support as a module:
make BUILD_RDMA=module
To start valkey-server with RDMA, use a command line like the following:
./src/valkey-server --loadmodule src/valkey-rdma.so \
port=6379 bind=xx.xx.xx.xx
* Implement the server side of the connection module only; this means we
  can *NOT* compile RDMA support as built-in.
* Add necessary information in README.md
* Support 'CONFIG SET/GET', for example, 'CONFIG SET rdma.port 6380', then
  check this by 'rdma res show cm_id' and valkey-cli (with RDMA support,
  but not implemented in this patch).
* The full listener list looks like:
listener0:name=tcp,bind=*,bind=-::*,port=6379
listener1:name=unix,bind=/var/run/valkey.sock
listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
listener3:name=tls,bind=*,bind=-::*,port=16379
Because of the lack of RDMA support in TCL, use a simple C program to test
Valkey Over RDMA (under tests/rdma/). This is a quite raw version with basic
library dependencies: libpthread, libibverbs, librdmacm. Run using the
script:
./runtest-rdma [ OPTIONS ]
To run RDMA in GitHub actions, a kernel module RXE for emulated soft RDMA
needs to be installed. The kernel module source code is fetched from a repo
containing only the RXE kernel driver from the Linux kernel, stored in a
separate repo to avoid cloning the whole Linux kernel repo.
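For reference, soft-RoCE can typically be enabled on a stock Linux box
with commands like the following (rxe0 and eth0 are placeholder names;
assumes the rdma_rxe module and the iproute2 'rdma' tool are available):

    sudo modprobe rdma_rxe
    sudo rdma link add rxe0 type rxe netdev eth0
    rdma link   # verify the rxe0 device shows up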
----
In June 2021, I created a
[PR](redis/redis#9161) for the *Redis Over RDMA*
proposal. Then I did some work to [fully abstract connection and make
TLS dynamically loadable](redis/redis#9320); since Redis 7.2.0, a new
connection type can be built into Redis statically, or as a separate
shared library (loaded by Redis on startup).
Based on the new connection framework, I created a new
[PR](redis/redis#11182), which some
people (@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ)
noticed, played with and tested. However, because of the lack of time
and knowledge on the maintainers' side, this PR has been pending for
about 2 years.
Related doc: [Introduce *Valkey Over RDMA*
specification](valkey-io/valkey-doc#123). (The protocol is the same as
Redis's, and should remain the same.)
Changes in this PR:
- Implement *Valkey Over RDMA* (adapted to the Valkey style).
Finally, if this feature is considered for merging, I volunteer to
maintain it.
---------
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>