[High level design] Allow Read requests on I/O threads


**Overview**
We propose enhancing Valkey with multi-threaded command execution to increase throughput from the current 0.92M GET requests per second in cluster mode (CME) to over 3.1M GET requests per second. Our prototype shows that offloading read commands to the async threads can yield up to a 12.7x performance improvement for CPU-intensive operations.

**Background**
Valkey 8.0 introduced async threads to handle input/output operations while the main thread focuses on command execution. Despite this advancement, a critical bottleneck remains: all command execution still occurs sequentially in the single main thread, which becomes increasingly problematic as systems scale to more cores and memory. Furthermore, the current architecture brings little benefit to workloads dominated by CPU-intensive operations, since offloading only I/O tasks yields limited gains when command processing is the primary constraint.

**Proposed Solution**
The new dict-per-slot architecture introduced in Valkey 8.0 provides an opportunity for parallelization. Since each slot has its own dictionary, commands operating on different slots could theoretically execute in parallel without contention.
We suggest leveraging this feature and propose an architectural update to how the main thread interacts with other threads. In this design, the main thread still maintains its role as a central scheduler, distributing tasks among threads to ensure safe access to clients and data structures. The main thread communicates with the worker threads through lock-free producer-consumer queues.
![Image](https://github.com/user-attachments/assets/33f5dbc0-7a02-4f00-ac00-155da5e1dafc)
The key enhancement is expanding the job types beyond I/O operations to include execute-command jobs, allowing worker threads to independently process commands. By maintaining this centralized coordination model, we eliminate the need for complex synchronization mechanisms such as locks or atomic operations. Importantly, this enhancement requires only minimal changes to the existing codebase while significantly improving resource utilization and performance scalability.
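As an illustration only, the coordination model can be sketched in a few lines of Python. This is not the prototype's code: the names `JobType`, `Job`, `worker_loop` and `run_demo` are hypothetical, and `queue.Queue` (which uses locks internally) stands in for the lock-free producer-consumer queues described above. The main thread is the only producer of jobs, and workers report back through a completion queue.

```python
import queue
import threading
from dataclasses import dataclass
from enum import Enum, auto


class JobType(Enum):
    READ = auto()     # read and parse a request from a socket
    WRITE = auto()    # write a reply to a socket
    EXECUTE = auto()  # new job type in this proposal: run a read command


@dataclass(frozen=True)
class Job:
    type: JobType
    payload: str


def worker_loop(worker_id, jobs, completions):
    """Consume jobs from this worker's queue and report each completion."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            return
        # A real worker would do socket I/O or run the command here.
        completions.put((worker_id, job))


def run_demo(num_workers=2):
    # One job queue per worker (main -> worker), one completion queue (workers -> main).
    job_queues = [queue.Queue() for _ in range(num_workers)]
    completions = queue.Queue()
    workers = [
        threading.Thread(target=worker_loop, args=(i, job_queues[i], completions))
        for i in range(num_workers)
    ]
    for t in workers:
        t.start()

    # The main thread acts as the central scheduler and decides who does what.
    job_queues[0].put(Job(JobType.READ, "connection 18"))
    job_queues[1].put(Job(JobType.EXECUTE, "GET foo"))

    done = [completions.get() for _ in range(2)]

    for q in job_queues:
        q.put(None)
    for t in workers:
        t.join()
    return sorted(done, key=lambda item: item[0])
```

Calling `run_demo()` returns one completion per worker. Because only the main thread assigns work, the workers never need to coordinate with each other directly.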

**Read command execution**
The picture below depicts the execution of a GET command in the proposed architecture:
![Image](https://github.com/user-attachments/assets/f17e7cb1-0f57-415d-9353-c1dd00c0ce58)
In this example, thread #1 was assigned the epoll_wait job. Upon receiving the epoll job completion, the main thread checks the result of the epoll_wait call and finds a read event on connection 18. It then determines the thread in charge of reading from connection 18, thread #2, and places a read job on that thread's job queue. Thread #2 reads data from the socket, parses the GET command, determines the slot owning the key and sends a read-completed event to the main thread. On receiving the read-completed event, the main thread validates the legitimacy of the command, increments the number of pending requests on that slot, and posts an execute request to the queue of the thread in charge of executing commands for the key's slot, thread #3. Thread #3 executes the command, writes the response to the connection socket and sends an execute-completed event to the main thread, which in turn decrements the number of pending requests on the slot and summarizes the command execution.
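The "determines the slot owning the key" step can be sketched as follows. The slot computation follows the standard cluster key hashing rule (CRC16/XMODEM of the key, or of its hash tag, modulo 16384). The mapping from slot to executing thread is not specified in this proposal, so `worker_for_slot` below is a purely hypothetical static mapping.

```python
import binascii

NUM_SLOTS = 16384


def key_hash_slot(key: bytes) -> int:
    """Cluster hash slot: CRC16 (XMODEM) of the key, or of its hash tag, mod 16384."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx uses polynomial 0x1021; an initial value of 0 gives CRC16/XMODEM.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS


def worker_for_slot(slot: int, num_workers: int) -> int:
    """Hypothetical assignment of a slot to a worker thread (an assumption)."""
    return slot % num_workers


slot = key_hash_slot(b"foo")
print(slot, worker_for_slot(slot, 18))
```

Any deterministic mapping works for this design, as long as all commands for a given slot are routed to the same worker while reads on that slot are in flight.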

**Write command execution**
The picture below depicts the execution of a SET command in the proposed architecture:
![Image](https://github.com/user-attachments/assets/88e5f12c-0bea-4094-940e-26850625c433)
Here, after validating the command, the main thread determines whether it can execute the SET command immediately or has to wait until all active read commands on the slot have completed. In the latter case, it puts the client in a blocked state. When the last read-completed event for that slot has been received, the main thread unblocks the client, executes the command and sends a write job to thread #2.
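The per-slot accounting described above can be sketched as a small single-threaded model of the main thread's bookkeeping. The class and method names are hypothetical, and the sketch makes a simplifying assumption: a read that arrives while a write is blocked on the same slot is still offloaded, so fairness between readers and blocked writers is not addressed here.

```python
from collections import defaultdict, deque


class SlotScheduler:
    """Hypothetical model of the main thread's per-slot read/write gating."""

    def __init__(self):
        self.pending_reads = defaultdict(int)     # slot -> offloaded reads in flight
        self.blocked_writes = defaultdict(deque)  # slot -> writes waiting for reads
        self.log = []                             # decisions taken, for inspection

    def on_read_command(self, slot, cmd):
        # Reads are offloaded to a worker; the slot is marked busy.
        self.pending_reads[slot] += 1
        self.log.append(("offload", slot, cmd))

    def on_write_command(self, slot, cmd):
        if self.pending_reads[slot] == 0:
            # No reader is touching this slot: execute in the main thread.
            self.log.append(("execute", slot, cmd))
        else:
            # Readers are active: block the client until they drain.
            self.blocked_writes[slot].append(cmd)
            self.log.append(("block", slot, cmd))

    def on_execute_completed(self, slot):
        self.pending_reads[slot] -= 1
        if self.pending_reads[slot] == 0:
            # Last reader finished: unblock and run the waiting writes in order.
            while self.blocked_writes[slot]:
                cmd = self.blocked_writes[slot].popleft()
                self.log.append(("execute", slot, cmd))


s = SlotScheduler()
s.on_read_command(5, "GET a")
s.on_read_command(5, "GET b")
s.on_write_command(5, "SET a 1")  # blocked: two reads in flight on slot 5
s.on_write_command(7, "SET z 1")  # slot 7 is idle, runs immediately
s.on_execute_completed(5)
s.on_execute_completed(5)         # last read done, the blocked SET now runs
```

Writes to idle slots are never delayed, which is why the counter is kept per slot rather than globally.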

**Performance Evaluation**
We implemented the proposed design in a prototype to evaluate the performance gain in read-only and mixed read/write scenarios.

**Test Environment**
- Server: c7gn.16xlarge instance
- Clients: 3 x r7g.16xlarge instances
- Configuration: All instances in same placement group (IAD region)
- Valkey 8.1: 8 io-threads (adding more threads does not improve performance)
- Prototype: 20 threads (1 main thread, 18 worker threads, 1 epoll thread)
- Mode: All tests in cluster mode

**Dataset**
- Strings: 3M keys with 512-byte values; GET and SET as the read and write commands, respectively
- Hashes: 1M hashes, each with 50 fields (70 bytes/field); HGETALL and HSET as the read and write commands, respectively
- Sorted Sets: 1M sorted sets, each with 50 members (70 bytes/member); ZRANK and ZADD as the read and write commands, respectively
- Lists: 1M lists, each with ~50 elements (70 bytes/element); LINDEX and LSET as the read and write commands, respectively

**Benchmark Scenarios**
1. 100% read operations
2. 80% read / 20% write operations


**Benchmark Results** (requests per second)
String Operations:
| Workload     | Valkey 8.1 | PoC 16xlarge | PoC 4xlarge |
|--------------|------------|--------------|-------------|
| 100% Read    | 924K       | 3,100K       | 2,005K      |
| 80% R, 20% W | 900K       | 1,860K       | 1,855K      |
 
Hash Operations:
| Workload     | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 30 threads | PoC 4xlarge |
|--------------|------------|--------------|--------------------------|-------------|
| 100% Read    | 233K       | 2,138K       | 2,960K                   | 1,193K      |
| 80% R, 20% W | 266K       | 1,597K       | NA                       | 1,325K      |
 
Sorted Set Operations:
| Workload     | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 9 threads | PoC 4xlarge |
|--------------|------------|--------------|-------------------------|-------------|
| 100% Read    | 465K       | 2,837K       | NA                      | 1,612K      |
| 80% R, 20% W | 418K       | 962K         | 1,100K                  | 1,003K      |
 
List Operations:
| Workload     | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 11 threads | PoC 4xlarge |
|--------------|------------|--------------|--------------------------|-------------|
| 100% Read    | 817K       | 3,090K       | NA                       | 1,940K      |
| 80% R, 20% W | 807K       | 1,591K       | 1,901K                   | 1,816K      |
 
**Key Findings**

- String GET: 3.1 million GET requests per second, a 3.3x improvement
- HGETALL: 9.1x acceleration, up to 12.7x with 30 threads
- ZRANK: 6.1x acceleration
- Mixed workloads: roughly 2x to 6x the throughput on 16xlarge, even with non-parallelized writes
- Medium instances (16 cores): Significant improvements in all tests
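The speedup figures can be reproduced from the 100% read rows of the tables above. The short script below only divides prototype throughput by the Valkey 8.1 baseline; the dictionary and function names are ours, and the values are in thousands of requests per second.

```python
# Throughput in thousands of requests per second, copied from the 100% read rows.
baseline = {"GET": 924, "HGETALL": 233, "ZRANK": 465, "LINDEX": 817}
poc_16xlarge = {"GET": 3100, "HGETALL": 2138, "ZRANK": 2837, "LINDEX": 3090}
HGETALL_30_THREADS = 2960


def speedup(new: float, old: float) -> float:
    """Ratio of prototype throughput to baseline throughput."""
    return new / old


for cmd in baseline:
    print(f"{cmd}: {speedup(poc_16xlarge[cmd], baseline[cmd]):.2f}x")
print(f"HGETALL, 30 threads: {speedup(HGETALL_30_THREADS, baseline['HGETALL']):.2f}x")
```

The 3.3x and 9.1x figures in the list correspond to these ratios truncated to one decimal place.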

**Conclusion**
These results demonstrate that parallel read command execution can dramatically improve Valkey's performance, reaching over 3M requests per second on a single instance. The approach requires limited code changes while providing significant throughput improvements across various data structures, even when write commands are also invoked.

**Proposed roadmap**
We propose delivering this update in Valkey 9.0 with the following limitations:

- Offloading major commands only
- Cluster mode only
- Modules: module logic can have side effects that create contention with other async threads. To address this, module functions will be executed by the main thread in "exclusive mode", meaning no async thread will be executing commands at that time. New APIs will be introduced to allow declaring certain module components (commands, keyspace event callbacks, cron jobs) as "slot safe", which will enable module parallelization as well.

 

