
Conversation

@pizhenwei (Contributor) commented May 23, 2023

RDMA (remote direct memory access) is a technology that enables computers in a network to exchange main-memory data without involving the processor, cache, or operating system of either machine. As a result, RDMA performs better than TCP: our test results show Redis over RDMA achieves ~2.5x the QPS at lower latency.

In recent years, RDMA has become popular in data centers; in particular, the RoCE (RDMA over Converged Ethernet) architecture is now widely deployed.

This PR introduces the Redis over RDMA protocol as a new transport for Redis. For now, we define four commands:

  • GetServerFeature & SetClientFeature: these two commands negotiate features for future extension. No features are defined in this version; flow control and multi-buffer support may be added later, which will require feature negotiation.
  • Keepalive
  • RegisterXferMemory: the heart of the protocol, used to transfer the real payload.

The 'TX buffer' and 'RX buffer' are built on RDMA remote memory using RDMA write / write-with-immediate. The design is similar to (but not the same as) mechanisms introduced in several papers:

  • Socksdirect: datacenter sockets can be fast and compatible
    <https://dl.acm.org/doi/10.1145/3341302.3342071>
  • LITE Kernel RDMA Support for Datacenter Applications
    <https://dl.acm.org/doi/abs/10.1145/3132747.3132762>
  • FaRM: Fast Remote Memory
    <https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf>

With this version of the protocol, we achieve these goals:

  • a high-performance design for Redis
  • full support for current Redis operations/commands
  • good compatibility with future optimizations

Co-authored-by: Xinhao Kong <xinhao.kong@duke.edu>
Co-authored-by: Huaping Zhou <zhouhuaping.san@bytedance.com>
Co-authored-by: zhuo jiang <jiangzhuo.cs@bytedance.com>
Co-authored-by: Yiming Zhang <zhangyiming1201@bytedance.com>
Co-authored-by: Jianxi Ye <jianxi.ye@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@uvletter (Contributor)

Hello @pizhenwei, I'm very interested in your proposal, and the protocol seems quite novel compared to other RDMA implementations like brpc and NVMe-oF. I also have some questions about the proposal; I hope I haven't missed anything.

  1. What's the mapping between QPs and clients/servers? With many QPs per client it's somewhat wasteful, but with one QP per client some multiplexing mechanism may be needed, since some Redis commands are blocking and a request may block the following ones.
  2. Registering the memory region in batches benefits performance, but it also lowers memory utilization. Suppose a redis-server has 1000 clients, which is normal in a production environment, and reserves a 1 MB memory region for each client; then 1 GB is used, which is wasteful for an in-memory database.
  3. What about huge requests/responses, where the request/response is larger than the reserved memory region, e.g. a string larger than 1 MB? Will the protocol support interleaving write and register?

In general I think the protocol and implementation are neat and elegant; they deserve more attention, whether for research/study or production.

@pizhenwei (Contributor, Author) commented May 31, 2023

> Hello @pizhenwei, I'm very interested in your proposal, and the protocol seems quite novel compared to other RDMA implementations like brpc and NVMe-oF. I also have some questions about the proposal; I hope I haven't missed anything.

Hi, I tried to describe the differences compared to other protocols; please see link.

> 1. What's the mapping between QPs and clients/servers? With many QPs per client it's somewhat wasteful, but with one QP per client some multiplexing mechanism may be needed, since some Redis commands are blocking and a request may block the following ones.

Just imagine a QP (RC type) as a connection, like a TCP/TLS/Unix socket: if a client uses N sockets, it needs N QPs. (In fact, many sockets also waste resources in the kernel.)

> 2. Registering the memory region in batches benefits performance, but it also lowers memory utilization. Suppose a redis-server has 1000 clients, which is normal in a production environment, and reserves a 1 MB memory region for each client; then 1 GB is used, which is wasteful for an in-memory database.

Currently, only one memory region per QP is defined, and the protocol imposes no strict limit on memory region size. As far as I can see, for an engineering implementation:

  • the server side could use a configurable size for the 'RX' memory region (the more memory used, the higher the performance, so I expect a typical size will be found by testing real workloads). I have implemented a POC version; see PR.
  • the server side could use a small 'TX' memory region against a large client 'RX' memory region. (In my plan, but not implemented yet.)
> 3. What about huge requests/responses, where the request/response is larger than the reserved memory region, e.g. a string larger than 1 MB? Will the protocol support interleaving write and register?

For example, transferring a 10 MB string over a 1 MB memory region works like this:

Register 1 MB of memory, send bytes 0-1 MB; register 1 MB of memory, send bytes 1-2 MB; and so on.

> In general I think the protocol and implementation are neat and elegant; they deserve more attention, whether for research/study or production.

Thanks!
This can be tested using the repo.

Client (branch `feature-rdma-with-cli`):

```shell
make distclean; make BUILD_RDMA=yes -j
```

Server (branch `feature-rdma`):

```shell
make distclean; make BUILD_RDMA=module -j
```

@CLAassistant commented Mar 24, 2024

CLA assistant check
All committers have signed the CLA.

@pizhenwei (Contributor, Author)

So sad, years of waiting have made me lose patience and confidence.

@pizhenwei pizhenwei closed this Dec 18, 2024
@pizhenwei pizhenwei deleted the redis-over-rdma-protocol branch April 15, 2025 08:45