Overview
We propose enhancing Valkey with multi-threaded command execution to increase throughput from the current 0.92M GET requests per second (cluster mode enabled, CME) to over 3.1M GET requests per second. Our prototype demonstrates that offloading read commands to the async threads can yield up to a 12.7x performance improvement for CPU-intensive operations.
Background
Valkey 8.0 introduced async threads to handle input/output operations while the main thread focuses on command execution. Despite this advancement, a critical bottleneck remains: all command execution still occurs sequentially in the single main thread, which becomes increasingly problematic as systems scale to more cores and memory. Furthermore, the current architecture's benefits are minimal for workloads dominated by CPU-intensive operations, since offloading only I/O tasks provides limited performance gains when processing is the primary constraint.
Proposed Solution
The new dict-per-slot architecture introduced in Valkey 8.0 provides an opportunity for parallelization. Since each slot has its own dictionary, commands operating on different slots could theoretically execute in parallel without contention.
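For illustration, the sketch below shows how a key maps to one of the 16384 cluster slots, which is what makes slot-level parallelism possible: the slot is the CRC16 (XMODEM variant) of the key, or of its hash tag when one is present, modulo 16384. It is written in Python for brevity and is not taken from the Valkey source; the function names `crc16_xmodem` and `key_slot` are ours.

```python
NUM_SLOTS = 16384  # number of hash slots in cluster mode


def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Return the slot owning `key`, honoring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % NUM_SLOTS


if __name__ == "__main__":
    # Keys sharing a hash tag land in the same slot, hence the same dictionary.
    print(key_slot(b"{user1000}.following") == key_slot(b"{user1000}.followers"))
```

Two commands whose keys resolve to different slots touch different dictionaries, which is the property the proposal relies on.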
We suggest leveraging this feature and propose an architectural update to how the main thread interacts with other threads. In this design, the main thread still maintains its role as a central scheduler, distributing tasks among threads to ensure safe access to clients and data structures. The main thread communicates with the worker threads through lock-free producer-consumer queues.

The key enhancement is expanding the job types beyond I/O operations to include execute-command jobs, allowing worker threads to process commands independently. By maintaining this centralized coordination model, we eliminate the need for complex synchronization mechanisms such as locks or atomic operations. Importantly, this enhancement requires only minimal changes to the existing codebase while significantly improving resource utilization and performance scalability.
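The coordination model can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the prototype's code: `JobType`, `Job`, `worker`, and `run_demo` are hypothetical names, and Python's `queue.Queue` (which locks internally) stands in for the lock-free producer-consumer queues described above.

```python
import queue
import threading
from dataclasses import dataclass
from enum import Enum, auto


class JobType(Enum):
    READ_SOCKET = auto()      # existing I/O job
    WRITE_SOCKET = auto()     # existing I/O job
    EXECUTE_COMMAND = auto()  # new job type in this proposal


@dataclass
class Job:
    type: JobType
    payload: object


def worker(jobs: queue.Queue, completions: queue.Queue, store: dict) -> None:
    """Consume jobs from this worker's queue and report completions."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            return
        if job.type is JobType.EXECUTE_COMMAND:
            key = job.payload
            completions.put((job.type, key, store.get(key)))  # a GET-like read
        else:
            completions.put((job.type, job.payload, None))


def run_demo() -> list:
    store = {"k1": "v1", "k2": "v2"}
    completions = queue.Queue()
    job_queues = [queue.Queue() for _ in range(2)]
    threads = [threading.Thread(target=worker, args=(q, completions, store))
               for q in job_queues]
    for t in threads:
        t.start()
    # The main thread is the only producer: it decides which worker runs what.
    job_queues[0].put(Job(JobType.EXECUTE_COMMAND, "k1"))
    job_queues[1].put(Job(JobType.EXECUTE_COMMAND, "k2"))
    results = sorted(completions.get()[1:] for _ in range(2))
    for q in job_queues:
        q.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

Each queue has a single producer (the main thread) and a single consumer (one worker), which is the shape that allows a lock-free queue in a real implementation.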
Read command execution
The picture below depicts the execution of a GET command in the proposed architecture:

In this example, thread #1 has been assigned the epoll_wait job. When the epoll job completes, the main thread checks the result of the epoll_wait call and finds a read event on connection 18. It then determines which thread is in charge of reading from connection 18 (thread #2) and places a read job on that thread's job queue. Thread #2 reads the data from the socket, parses the GET command, determines the slot that owns the key, and sends a read-completed event to the main thread. On receiving this event, the main thread validates the command, increments the number of pending requests on that slot, and posts an execute request to the queue of the thread in charge of executing commands for the key's slot (thread #3). Thread #3 executes the command, writes the response to the connection socket, and sends an execute-completed event to the main thread, which in turn decrements the number of pending requests on the slot and finalizes the command execution.
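The main thread's bookkeeping in this flow can be sketched as below. This is a single-threaded Python model under assumptions of ours: the class `ReadScheduler`, its method names, and the `slot % num_workers` assignment of slots to worker threads are hypothetical, since the text does not specify how a slot is mapped to its executing thread.

```python
from collections import defaultdict


class ReadScheduler:
    """Models only the main thread's view: counters and per-worker queues."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.pending_requests = defaultdict(int)  # slot -> in-flight commands
        self.worker_queues = [[] for _ in range(num_workers)]

    def worker_for_slot(self, slot: int) -> int:
        # Assumption: a static mapping so one slot is always served by one thread.
        return slot % self.num_workers

    def on_read_completed(self, client_id: int, slot: int, command: tuple) -> int:
        """Called when an I/O thread has parsed a command and resolved its slot."""
        self.pending_requests[slot] += 1
        target = self.worker_for_slot(slot)
        self.worker_queues[target].append((client_id, slot, command))
        return target

    def on_execute_completed(self, slot: int) -> None:
        """Called when the executing thread reports that the command is done."""
        self.pending_requests[slot] -= 1
        if self.pending_requests[slot] == 0:
            del self.pending_requests[slot]


if __name__ == "__main__":
    sched = ReadScheduler(num_workers=4)
    target = sched.on_read_completed(client_id=18, slot=866, command=("GET", "hello"))
    print("execute job queued on worker", target)
    sched.on_execute_completed(866)
```

Because only the main thread touches `pending_requests`, the counters need no locking in this model.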
Write command execution
The picture below depicts the execution of a SET command in the proposed architecture:

Here, after validating the command, the main thread determines whether it can execute the SET command immediately or must wait until all active read commands on the slot have completed. In the latter case, it puts the client in a blocked state. When the last read-completed event on that slot has been received, the main thread unblocks the client, executes the command, and sends a write job to thread #2.
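A minimal Python sketch of this gating rule follows. The class `WriteGate` and its method names are hypothetical, and the model is deliberately simplified: it does not address how reads that arrive while a write is blocked are ordered, which the text above leaves open.

```python
from collections import defaultdict, deque


class WriteGate:
    """Main-thread model: writes run only when their slot has no active reads."""

    def __init__(self) -> None:
        self.active_reads = defaultdict(int)      # slot -> offloaded reads in flight
        self.blocked_writes = defaultdict(deque)  # slot -> writes waiting to run
        self.store = {}

    def start_read(self, slot: int) -> None:
        self.active_reads[slot] += 1

    def submit_write(self, slot: int, key: str, value: str) -> str:
        if self.active_reads[slot] > 0:
            # The client is put in a blocked state until the slot drains.
            self.blocked_writes[slot].append((key, value))
            return "blocked"
        self.store[key] = value  # executed directly by the main thread
        return "executed"

    def read_completed(self, slot: int) -> None:
        self.active_reads[slot] -= 1
        if self.active_reads[slot] == 0:
            # Last read on the slot finished: unblock and run the waiting writes.
            while self.blocked_writes[slot]:
                key, value = self.blocked_writes[slot].popleft()
                self.store[key] = value


if __name__ == "__main__":
    gate = WriteGate()
    gate.start_read(5)
    print(gate.submit_write(5, "k", "v"))  # blocked while the read is active
    gate.read_completed(5)
    print(gate.store)
```

Writes to a slot with no active reads never block, so contention is limited to slots that are being read and written at the same time.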
Performance Evaluation
We implemented the proposed update in a prototype to evaluate the performance gain in read-only and mixed read/write scenarios.
Test Environment
- Server: c7gn.16xlarge instance
- Clients: 3 x r7g.16xlarge instances
- Configuration: All instances in same placement group (IAD region)
- Valkey 8.1: 8 io-threads (adding more threads did not improve performance)
- Prototype: 20 threads (1 main thread, 18 worker threads, 1 epoll thread)
- Mode: All tests in cluster mode
Dataset
- Strings: 3M keys with 512-byte values; GET as the read command and SET as the write command
- Hashes: 1M hashes, each with 50 fields (70 bytes/field); HGETALL as the read command and HSET as the write command
- Sorted Sets: 1M sorted sets, each with 50 members (70 bytes/member); ZRANK as the read command and ZADD as the write command
- Lists: 1M lists, each with ~50 elements (70 bytes/element); LINDEX as the read command and LSET as the write command
Benchmark Scenarios
- 100% read operations
- 80% read / 20% write operations
Benchmark Results (throughput in requests per second):
String Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 4xlarge |
|---|---|---|---|
| 100% Read | 924K | 3,100K | 2,005K |
| 80% R, 20% W | 900K | 1,860K | 1,855K |
Hash Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 30 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 233K | 2,138K | 2,960K | 1,193K |
| 80% R, 20% W | 266K | 1,597K | NA | 1,325K |
Sorted Set Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 9 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 465K | 2,837K | NA | 1,612K |
| 80% R, 20% W | 418K | 962K | 1,100K | 1,003K |
List Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 11 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 817K | 3,090K | NA | 1,940K |
| 80% R, 20% W | 807K | 1,591K | 1,901K | 1,816K |
Key Findings
- String GET: 3.1 million GET requests per second, a 3.3x improvement
- HGETALL: 9.1x acceleration, up to 12.7x with 30 threads
- ZRANK: 6.1x acceleration
- Mixed workloads: roughly 2x throughput or better, even with non-parallelized writes
- Medium instances (16 cores): Significant improvements in all tests
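The read speedups above follow directly from the 100% read rows of the benchmark tables. The short Python calculation below reproduces them; the dictionary names are ours, and the throughput figures (in thousands of requests per second) are copied from the Valkey 8.1 and PoC 16xlarge columns.

```python
# 100% read throughput in thousands of requests per second, from the tables.
baseline_k = {"GET": 924, "HGETALL": 233, "ZRANK": 465, "LINDEX": 817}
poc_16xl_k = {"GET": 3100, "HGETALL": 2138, "ZRANK": 2837, "LINDEX": 3090}
hgetall_30_threads_k = 2960  # PoC 16xl with 30 threads

# Speedup is PoC throughput divided by Valkey 8.1 throughput.
speedup = {cmd: poc_16xl_k[cmd] / baseline_k[cmd] for cmd in baseline_k}
hgetall_30_speedup = hgetall_30_threads_k / baseline_k["HGETALL"]

if __name__ == "__main__":
    for cmd, ratio in speedup.items():
        print(f"{cmd}: {ratio:.2f}x")
    print(f"HGETALL (30 threads): {hgetall_30_speedup:.2f}x")
```

The computed ratios are marginally higher than the figures quoted in the list, which appear to be truncated rather than rounded (for example, GET comes out at about 3.35x).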
Conclusion
The prototype demonstrates that parallel read command execution can dramatically improve Valkey's performance, reaching over 3M requests per second on a single instance. The approach requires limited code changes while providing significant throughput improvements across various data structures, even when write commands are also invoked.
Proposed roadmap
We propose delivering this update in Valkey 9.0 with the following limitations:
- Offloading major commands only
- Cluster Mode Only
- Modules: module logic can have side effects that conflict with commands running on other async threads. To address this, module functions will be executed by the main thread in "exclusive mode", meaning that no async thread will be executing commands at that time. New APIs will be introduced to allow certain module components (commands, keyspace event callbacks, cron jobs) to be declared "slot safe", which will enable module parallelization as well.