[High level design] Allow Read requests on I/O threads #2022

Description

@touitou-dan

Overview
We propose enhancing Valkey with multi-threaded command execution to increase throughput from the current 0.92M GET requests per second in cluster mode (CME) to over 3.1M GET requests per second. Our prototype demonstrates that offloading read commands to the async threads can yield up to a 12.7x performance improvement for CPU-intensive operations.

Background
Valkey 8.0 introduced async threads to handle input/output operations while the main thread focuses on command execution. Despite this advancement, a critical bottleneck remains: all command execution still occurs sequentially in the single main thread, which becomes increasingly problematic as systems scale to more cores and memory. Furthermore, the current architecture's benefits are minimal for workloads dominated by CPU-intensive operations, since offloading only I/O tasks provides limited performance gains when processing is the primary constraint.

Proposed Solution
The new dict-per-slot architecture introduced in Valkey 8.0 provides an opportunity for parallelization. Since each slot has its own dictionary, commands operating on different slots could theoretically execute in parallel without contention.
We suggest leveraging this feature and propose an architectural update to how the main thread interacts with other threads. In this design, the main thread still maintains its role as a central scheduler, distributing tasks among threads to ensure safe access to clients and data structures. The main thread communicates with the worker threads through lock-free producer-consumer queues.
Image
The key enhancement is expanding the job types beyond I/O operations to include execute-command jobs, allowing worker threads to process commands independently. By maintaining this centralized coordination model, we eliminate the need for complex synchronization mechanisms such as locks or atomic operations. Importantly, this enhancement requires only minimal changes to the existing codebase while significantly improving resource utilization and performance scalability.
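As a rough illustration of the job model described above, the sketch below pairs a job record with a bounded single-producer/single-consumer ring buffer. All names (`JobType`, `Job`, `SpscQueue`) are hypothetical and not taken from the Valkey codebase. A native implementation would also need the appropriate memory ordering on the indices; this Python version only shows the ownership rule that makes such a queue work without locks.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class JobType(Enum):
    """Hypothetical job kinds; EXECUTE_COMMAND is the new one proposed here."""
    EPOLL_WAIT = auto()
    READ = auto()
    WRITE = auto()
    EXECUTE_COMMAND = auto()


@dataclass
class Job:
    type: JobType
    client_id: int
    payload: Any = None


class SpscQueue:
    """Bounded ring buffer with exactly one producer and one consumer.

    Only the producer advances `_tail` and only the consumer advances
    `_head`, so neither side ever overwrites the other's index.
    """

    def __init__(self, capacity: int) -> None:
        # One spare cell distinguishes "full" from "empty".
        self._cells = [None] * (capacity + 1)
        self._head = 0  # next cell to pop (consumer-owned)
        self._tail = 0  # next cell to fill (producer-owned)

    def push(self, job: Job) -> bool:
        nxt = (self._tail + 1) % len(self._cells)
        if nxt == self._head:  # queue is full
            return False
        self._cells[self._tail] = job
        self._tail = nxt
        return True

    def pop(self) -> Optional[Job]:
        if self._head == self._tail:  # queue is empty
            return None
        job = self._cells[self._head]
        self._cells[self._head] = None
        self._head = (self._head + 1) % len(self._cells)
        return job
```

In this picture the main thread would own the producer side of one queue per worker, and each worker would own the producer side of a completion queue back to the main thread.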

Read commands execution
The picture below depicts the execution of a GET command in the proposed architecture:
Image
In this example, thread #1 was assigned the epoll_wait job. On receiving the epoll job completion, the main thread checks the result of the epoll_wait call and finds a read event on connection 18. It then determines which thread is in charge of reading from connection 18 (thread #2) and places a read job on that thread's job queue. Thread #2 reads data from the socket, parses the GET command, determines the slot owning the key, and sends a read-completed event to the main thread. On receiving the read-completed event, the main thread validates the legitimacy of the command, increments the number of pending requests on that slot, and posts an execute request to the queue of the thread in charge of executing commands for the key's slot (thread #3). Thread #3 executes the command, writes the response to the connection socket, and sends an execute-completed event to the main thread, which in turn decrements the number of pending requests on the slot and summarizes the command execution.
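The routing step in this flow can be sketched as follows. The key-to-slot mapping uses the cluster convention of CRC16 (XMODEM variant) modulo 16384, without hash-tag handling. The slot-to-thread mapping (`worker_for_slot`) is purely an assumption for illustration, since the proposal does not say how slots are assigned to worker threads.

```python
import binascii

NUM_SLOTS = 16384  # number of hash slots in cluster mode


def key_slot(key: bytes) -> int:
    """CRC16 (XMODEM) of the key, modulo the slot count.

    Hash tags ({...}) are deliberately ignored in this sketch.
    """
    return binascii.crc_hqx(key, 0) % NUM_SLOTS


def worker_for_slot(slot: int, num_workers: int) -> int:
    """Hypothetical static assignment: a slot always maps to the same worker."""
    return slot % num_workers


if __name__ == "__main__":
    slot = key_slot(b"user:1000")
    print(f"slot={slot} worker={worker_for_slot(slot, 18)}")
```

Any fixed mapping would do for the argument made here: as long as all commands for one slot go to the same worker, two workers never touch the same per-slot dictionary at the same time.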

Write command execution
The picture below depicts the execution of a SET command in the proposed architecture:
Image
Here, after validating the command, the main thread determines whether it can execute the SET command immediately or has to wait until all active read commands on the slot have completed. In the latter case, it puts the client in a blocked state. When the last read-completed event on that slot has been received, the main thread unblocks the client, executes the command, and sends a write job to thread #2.
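A minimal sketch of the per-slot bookkeeping this implies is shown below. The class and method names (`SlotGate`, `dispatch_read`, `try_write`, `read_completed`) are hypothetical. The sketch assumes that only the main thread touches these counters, which is why it contains no locking. How reads that arrive while a write is blocked should be ordered is not specified in the proposal and is not modeled here.

```python
from collections import defaultdict, deque


class SlotGate:
    """Main-thread-only tracking of in-flight reads and blocked writers per slot."""

    def __init__(self):
        self.pending_reads = defaultdict(int)     # slot -> reads running on workers
        self.blocked_writes = defaultdict(deque)  # slot -> client ids waiting to write

    def dispatch_read(self, slot):
        """Call when an execute job for a read is posted to a worker."""
        self.pending_reads[slot] += 1

    def try_write(self, slot, client_id):
        """Return True if the write may run now; otherwise block the client."""
        if self.pending_reads[slot] == 0:
            return True
        self.blocked_writes[slot].append(client_id)
        return False

    def read_completed(self, slot):
        """Call on an execute-completed event; return client ids to unblock."""
        self.pending_reads[slot] -= 1
        if self.pending_reads[slot] > 0:
            return []
        ready = list(self.blocked_writes[slot])
        self.blocked_writes[slot].clear()
        return ready
```

With two reads in flight on a slot, a `try_write` on that slot returns False, and the blocked client id is returned only by the second `read_completed` call. Writes to other slots are unaffected.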

Performance Evaluation
We implemented the proposed update in a prototype to evaluate the performance gain in read-only and mixed read/write scenarios.

Test Environment

  • Server: c7gn.16xlarge instance
  • Clients: 3 x r7g.16xlarge instances
  • Configuration: All instances in same placement group (IAD region)
  • Valkey 8.1: 8 io-threads (adding more threads does not improve performance)
  • Prototype: 20 threads (1 main thread, 18 worker threads, 1 epoll thread)
  • Mode: All tests in cluster mode

Dataset

  • Strings: 3M keys with 512-byte values - GET and SET as the read and write commands, respectively
  • Hashes: 1M hashes, each with 50 fields (70 bytes/field) - HGETALL and HSET as the read and write commands, respectively
  • Sorted Sets: 1M sorted sets, each with 50 members (70 bytes/member) - ZRANK and ZADD as the read and write commands, respectively
  • Lists: 1M lists, each with ~50 elements (70 bytes/element) - LINDEX and LSET as the read and write commands, respectively

Benchmark Scenarios

  1. 100% read operations
  2. 80% read / 20% write operations

Benchmark Results (requests per second):
String Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 4xlarge |
|---|---|---|---|
| 100% Read | 924K | 3,100K | 2,005K |
| 80% R, 20% W | 900K | 1,860K | 1,855K |

Hash Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 30 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 233K | 2,138K | 2,960K | 1,193K |
| 80% R, 20% W | 266K | 1,597K | NA | 1,325K |

Sorted Set Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 9 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 465K | 2,837K | NA | 1,612K |
| 80% R, 20% W | 418K | 962K | 1,100K | 1,003K |

List Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xlarge, 11 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 817K | 3,090K | NA | 1,940K |
| 80% R, 20% W | 807K | 1,591K | 1,901K | 1,816K |

Key Findings

  • String GET: 3.1 million GET requests per second, a 3.3x improvement
  • HGETALL: 9.1x acceleration, up to 12.7x with 30 threads
  • ZRANK: 6.1x acceleration
  • Mixed workloads: about 2x or more throughput even with non-parallelized writes
  • Medium instances (16 cores): Significant improvements in all tests
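The speedup figures in this list can be re-derived from the 100% read rows of the benchmark tables. The short script below does that arithmetic. The numbers are copied from the tables (in thousands of requests per second), and the variable names are only illustrative.

```python
# Throughput in thousands of requests per second, 100% read workload.
baseline = {"GET": 924, "HGETALL": 233, "ZRANK": 465, "LINDEX": 817}       # Valkey 8.1
poc_16xlarge = {"GET": 3100, "HGETALL": 2138, "ZRANK": 2837, "LINDEX": 3090}
hgetall_30_threads = 2960  # PoC 16xlarge with 30 threads


def speedup(new: float, old: float) -> float:
    """Ratio of prototype throughput to baseline throughput."""
    return new / old


if __name__ == "__main__":
    for cmd, old in baseline.items():
        print(f"{cmd}: {speedup(poc_16xlarge[cmd], old):.2f}x")
    print(f"HGETALL, 30 threads: {speedup(hgetall_30_threads, baseline['HGETALL']):.2f}x")
```

The ratios land close to the quoted figures; small differences in the last digit come down to rounding versus truncation.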

Conclusion
The prototype demonstrates that parallel read command execution can dramatically improve Valkey's performance, reaching over 3M requests per second on a single instance. This approach requires limited code changes while providing significant throughput improvements across various data structures, even when write commands are also invoked.

Proposed roadmap
We propose delivering this update in Valkey 9.0 with the following limitations:

  • Offloading major commands only
  • Cluster Mode Only
  • Modules - Modules’ logic can have side effects that conflict with work running on other async threads. To address this issue, module functions will be executed by the main thread in "exclusive mode", meaning that no async thread will be executing commands at that time. New APIs will be introduced to allow declaring certain module components (commands, keyspace event callbacks, cron jobs) as "slot safe", which will enable module parallelization as well.
