Cluster IO Offload by hpatro · Pull Request #4 · hpatro/valkey

hpatro · 2026-03-13T07:11:20Z

Cluster Bus I/O Offload

Overview

Cluster bus read, write, and TLS accept work can run on the shared I/O thread
pool, while all cluster state mutation stays on the main thread.

Jobs flow through:

io_shared_inbox (SPMC): main thread -> I/O workers
io_shared_outbox (MPSC): I/O workers -> main thread

If the pool is inactive or enqueue fails, the caller falls back to synchronous
I/O on the main thread and increments stat_cluster_io_sync_fallbacks.

Architecture

+---------------------------+       +-------------------------------+       +---------------------------+
|       Main Thread         |       |        Shared Queues          |       |     I/O Thread Pool       |
|                           |       |                               |       |                           |
| clusterReadOffloadHandler |------>| io_shared_inbox (SPMC)        |------>| clusterReadJob            |
| clusterWriteHandler       |       |                               |       | clusterWriteJob           |
| clusterAcceptHandler      |       |                               |       | clusterAcceptJob          |
| clusterSendMessage        |       +-------------------------------+       +---------------------------+
|                           |
| processIOThreadsResponses |<------ io_shared_outbox (MPSC) <------ sendToMainThread(...)
| clusterHandle*Completion  |
| clusterProcessPacket      |
| freeClusterLink           |
+---------------------------+

Key Design Decisions

Reuse the existing shared SPMC/MPSC queues; no cluster-specific threading primitives.
I/O threads do transport work only: connRead, connWrite, connAccept, framing, and result publication.
Main thread alone applies cluster packets and mutates cluster state.
Write offload uses a single canonical send_msg_queue plus a snapshot boundary:
io_last_send_block, io_head_offset, and io_nodes_sent.
Read offload uses rcvbuf snapshot boundaries: io_rcvbuf_snapshot_len and io_rcvbuf_snapshot_packets.
Read completion drains the queued rcvbuf snapshot fully; there is no bounded packet/time budget.
Link teardown is deferred with io_refs and async_close.
Accept offload is serialized per connection with CONN_FLAG_ACCEPT_OFFLOAD_PENDING.
ConnOwnerKind lets generic connection/TLS code distinguish client-owned and cluster-owned connections safely.

Data Flow: Read Path

[MAIN dispatch] clusterReadOffloadHandler
    |
    | drain queued snapshot
    | trySendClusterReadToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_READ
    v
[IO worker] clusterReadJob
    | connRead() into rcvbuf
    | grow rcvbuf if needed
    | snapshot complete packet prefix
    | set io_rcvbuf_snapshot_len / packets
    | set io_result
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_READ
    v
[MAIN completion] clusterHandleReadCompletion
    | set io_read_state = IDLE
    | io_refs--
    | reconcile memory
    | drain queued rcvbuf snapshot fully
    | shrink/free as needed

If dispatch returns C_ERR, the main thread falls back to clusterReadHandler(conn).

Data Flow: Write Path

[MAIN dispatch] clusterSendMessage / clusterWriteHandler
    |
    | append to send_msg_queue
    | snapshot io_last_send_block / io_head_offset
    | trySendClusterWriteToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_WRITE
    v
[IO worker] clusterWriteJob
    | write from queue head
    | stop at io_last_send_block
    | set io_nodes_sent
    | set io_head_offset
    | set io_result
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_WRITE
    v
[MAIN completion] clusterHandleWriteCompletion
    | set io_write_state = IDLE
    | io_refs--
    | pop sent nodes
    | apply head_msg_send_offset
    | reinstall handler or free on error

Important note:

New messages appended while a write job is in flight stay on send_msg_queue,
but the worker stops at io_last_send_block, so those new nodes are picked up
by a later dispatch.

If dispatch returns C_ERR, clusterWriteHandler(conn) falls back to the synchronous connWrite() loop.

Data Flow: Accept Path (TLS)

[MAIN dispatch] clusterAcceptHandler
    |
    | create accepted conn
    | set owner kind
    | allow accept offload
    | trySendClusterAcceptToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_ACCEPT
    v
[IO worker] clusterAcceptJob
    | connAccept(conn, NULL)
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_ACCEPT
    v
[MAIN completion] clusterHandleAcceptCompletion
    | clear ACCEPT_OFFLOAD_PENDING
    | finalize conn state / refs
    | if ACCEPTING: return
    | else clusterConnAcceptHandler()
    | create clusterLink on success

Important note:

TLS retries can re-enter the generic accept offload path. Cluster-owned
connections are routed back to trySendClusterAcceptToIOThreads, and the
pending flag ensures only one accept job is in flight for that connection.

If dispatch returns C_ERR, the main thread falls back to connAccept(conn, clusterConnAcceptHandler).

clusterLink I/O State

IDLE --dispatch read-->  READ_PENDING  --read completion-->  IDLE
IDLE --dispatch write--> WRITE_PENDING --write completion--> IDLE

If freeClusterLink() sees io_refs > 0:
  detach link from node
  remove handlers
  set async_close = 1
  defer final free until the last completion drops io_refs to 0

Read and write jobs are mutually exclusive per link.

Invariants

Link State

io_refs >= 0 always.
io_refs == 0 implies both io_read_state and io_write_state are IDLE.
Read and write jobs are mutually exclusive.
send_msg_queue_mem tracks the canonical send_msg_queue.
Link buffer-limit accounting currently uses send_msg_queue_mem + rcvbuf_len.
stat_cluster_links_memory is updated on the main thread.

I/O Thread MAY

Call connRead() / connWrite() / connAccept() on the offloaded connection.
Grow rcvbuf and update rcvbuf_len / rcvbuf_alloc.
Scan rcvbuf and publish io_rcvbuf_snapshot_len / io_rcvbuf_snapshot_packets.
Read send_msg_queue up to the snapped io_last_send_block.
Write io_result, io_nodes_sent, io_head_offset, and last_io_read_time.
Publish completions with sendToMainThread().

I/O Thread MUST NOT

Touch clusterNode, clusterState, slot ownership, epoch/failover state, or any other cluster-global state.
Call clusterProcessPacket() or any packet-application logic.
Modify io_read_state, io_write_state, io_refs, or async_close.
Pop queue nodes or update send_msg_queue_mem.
Update server.stat_cluster_links_memory.
Free clusterLink, clusterNode, or connection objects.

Main Thread MUST NOT While a Job Is In Flight

For read: touch rcvbuf, rcvbuf_len, rcvbuf_alloc, io_rcvbuf_snapshot_len, or io_rcvbuf_snapshot_packets.
For write: pop already-visible queue nodes or rewrite the tracked head send offset.
Dispatch a second read/write job for the same link.

Dispatch Contract

All three dispatch helpers (trySendClusterReadToIOThreads,
trySendClusterWriteToIOThreads, trySendClusterAcceptToIOThreads) follow the
same pattern:

connSetPostponeUpdateState(conn, 1) and connIncrRefs(conn) before enqueue.
Roll back state and refs on enqueue failure.
Finalize postponed connection state on completion.

Return values:

C_OK: work was offloaded, or an equivalent job is already pending.
C_ERR: pool inactive or enqueue failed, so caller may sync-fallback.

Deferred Teardown

If freeClusterLink(link) sees io_refs > 0:

Detach the link from node fields.
Remove read/write handlers so no new work is scheduled.
Set async_close = 1.
Return without closing/freeing immediately.

The actual connClose() and final free happen later, when the last completion
handler drops io_refs to 0.

Failure Detection

last_io_read_time is updated by the read worker on successful reads and is
included in the cluster node delay calculation.

data_delay = now - max(node->data_received,
                       node->link->last_io_read_time,
                       node->inbound_link->last_io_read_time)

The I/O thread stores last_io_read_time with release ordering; the main thread
loads it with acquire ordering.

Error Handling

Result / Condition	Handling
`CLUSTER_IO_BAD_HEADER` / `CLUSTER_IO_BAD_LENGTH`	Log warning, free link immediately
`CLUSTER_IO_READ_ERROR` / `CLUSTER_IO_EOF`	Log debug, drain queued `rcvbuf` snapshot, then free link
`CLUSTER_IO_WRITE_ERROR`	Log debug, free link
`CONN_STATE_ACCEPTING` after accept completion	Leave connection open; TLS event flow will retry
Pool inactive / queue full	Sync fallback + increment `stat_cluster_io_sync_fallbacks`

Follow-Up

Read completion currently drains the queued rcvbuf snapshot in one pass on the main thread and compacts the buffer after each packet.
Follow-up work: either reintroduce bounded completion with a correct continuation mechanism, or keep full drain semantics but switch to a front-offset model so we avoid repeated per-packet memmove() while a large burst is being applied.
Current behavior on CLUSTER_IO_BAD_HEADER / CLUSTER_IO_BAD_LENGTH is to close the link immediately, even if the same offloaded read also contained a valid packet prefix in the queued snapshot. This is accepted for now as an invalid-peer path, but if we ever need sync/offload parity here, read completion should drain the valid prefix before tearing the link down.
Create separate queue for cluster io jobs to avoid contention with client io jobs. This will allow us to prioritize different type of jobs and drain at different rate.

Implemented CLUSTERSCAN command for topology-aware scanning Unlike `SCAN` which is local to a single node, `CLUSTERSCAN` provides a mechanism that helps clients iterate across slot boundaries and handles `MOVED` redirections. **Key details** * Global cluster iteration via `fingerprint-{hashtag}-cursor` * Scan one slot at a time * Start the CLUSTERSCAN with 0 * SLOT argument for parallel scanning of multiple slots * Re-use scanGenericCommand for the response **Cursor format:** `fingerprint-{hashtag}-localcursor` - Fingerprint is a hash of the node's DB seed that identifies the current memory layout. On mismatch, scan restarts from cursor 0 rather than returning an error. - Fingerprint 0 indicates a cross slot cursor (e.g., initial cursor or slot transition) where validation is skipped. - Hashtag encodes the target slot - Local cursor tracks position within the slot **Usage:** ``` CLUSTERSCAN <cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number] ``` ``` CLUSTERSCAN 0 # Start scanning from slot 0 CLUSTERSCAN <cursor> # Continue from cursor CLUSTERSCAN 0 SLOT 1000 # Start scanning specific slot CLUSTERSCAN <cursor> MATCH user:* COUNT 100 ``` --------- Signed-off-by: nmvk <r@nmvk.com> Signed-off-by: Raghav <r@nmvk.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>

Before this change, if you built `valkey-server` (e.g. `make valkey-server`) the LUA module was not built. With this change, the LUA module is now a direct a dependency of `valkey-server` - (unless `BUILD_LUA=no` is passed) Added some colors to the Lua module & Lua lib `Makefile`s to they blend nicely in the build output. Signed-off-by: Eran Ifrah <eifrah@amazon.com>

…time_t (valkey-io#3252) Addresses issue valkey-io#2350 As noted in the issue, much of the expire code uses raw long long for timestamps, which provides no semantic meaning about the unit or purpose of the value. Valkey already defines mstime_t (milliseconds) and ustime_t (microseconds) typedefs — this PR replaces bare long long declarations with the appropriate typedef wherever the value represents an expiration timestamp or time duration. This PR only fixes a small subset in the codebase, but it is an incremental step toward fully replacing the bare long long references. --------- Signed-off-by: curious-george-rk <r.ebu@gmail.com> Co-authored-by: curious-george-rk <r.ebu@gmail.com>

Now we will be able to add a `run-cluster-benchmark` label to run a benchmark with cluster-mode enabled valkey-server It will use the config https://github.com/valkey-io/valkey/blob/unstable/.github/benchmark_configs/benchmark-config-arm.json modified for for cluster mode with a single clustermode enabled instance of valkey. It uses the same single instance for the benchmark as for run-benchmark. If both labels are used, they are sequential in the same concurrency group `group: ec2-al-2023-pr-benchmarking-arm64`. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

…ey-io#2746) This PR introduces a new config `cluster-message-gossip-perc` which allows an operator to modify the amount of gossip node information to be sent per ping/pong/meet message. It can be modified dynamically (Related to valkey-io#2291). The default value is 10% i.e. 10% of peer node information would be gossiped along with each ping/pong/meet packet. Users can tune this configuration, setting the value higher allows faster information dissemination whereas setting it lower would lead to direct PING messages if no information was received about a node with the `server.cluster_node_timeout/2` period. Note: the behavior for partially failed gossip nodes still remains intact where all the `pfail` nodes are part of the message for faster propagation of information and faster transition of `PFAIL` to `FAIL`. --------- Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

…o#3336) This fixes a flaky dual-channel replication integration test: https://github.com/valkey-io/valkey/actions/runs/22810251608/job/66165776198#step:8:7701 `INFO memory` field `used_memory_overhead` and `MEMORY STATS` field `overhead.total` can change during dual-channel sync if replica's pending replication buffer is still changing. This is probably more visible in slower environments. The test now collects `INFO` and `MEMORY STATS` in a single `MULTI/EXEC` on both the primary and replica, so the compared values come from the same snapshot. Passing here: https://github.com/sarthakaggarwal97/valkey/actions/runs/22864585326/job/66327772967 Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

…ation (valkey-io#3258) Observation around `dictGetSomeKeys(N)` performance is that it is around 10x faster than randomly picking `N` elements over N*3 attempts. Further, this avoids chances of not picking N elements due to duplicate element selection with individual random pick. Overall, it should help picking gossip node(s) for cluster bus message faster. --------- Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

I found that on 32-bit platforms, the gtest fails if I add a declaration `int aof_rewrite_for_replication;` between https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L1843 and https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L2108 The crash is caused by https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/unit/wrappers.h#L40 stripping the _Atomic qualifier in C++ builds. This makes struct valkeyServer have different alignment/field offsets in C-compiled code (libvalkey.a) vs C++-compiled tests (gtest). As a result, fields after these atomics are accessed at different offsets, leading to corrupted reads/writes and a segfault. I added VALKEY_ATOMIC, and it works as follows: - C builds (production): VALKEY_ATOMIC(uint64_t) -> _Atomic uint64_t (unchanged behavior) - C++ builds (tests via wrappers.h): VALKEY_ATOMIC(uint64_t) -> alignas(8) uint64_t (preserves 8-byte alignment) - All existing atomic_load_explicit / atomic_store_explicit calls in C code continue to work because _Atomic is present in C mode - C++ test code only does direct field access (never uses atomic operations), so alignas(N) type is sufficient --------- Signed-off-by: cjx-zar <jxchenczar@foxmail.com>

…-io#3121) It's a combination of MSET and EXPIRE, which allows us to set an expiration time simultaneously with the MSET operation. Syntax for the new MSETEX command: ``` /* MSETEX numkeys key value [key value ...] [NX | XX] * [EX seconds | PX milliseconds | * EXAT seconds-timestamp | PXAT milliseconds-timestamp | KEEPTTL] */ ``` The MSETEX command supports a set of options (like SET command): - EX seconds -- Set the specified expire time, in seconds (a positive integer). - PX milliseconds -- Set the specified expire time, in milliseconds (a positive integer). - EXAT timestamp-seconds -- Set the specified Unix time at which the keys will expire, in seconds (a positive integer). - PXAT timestamp-milliseconds -- Set the specified Unix time at which the keys will expire, in milliseconds (a positive integer). - KEEPTTL -- Retain the time to live associated with the keys. - NX -- Only set the keys if all keys do not already exist. - XX -- Only set the keys if all keys already exists. Return Value (like MSETNX command): - Integer reply: 0 if no key was set due to NX or XX options. - Integer reply: 1 if all the keys were set. Closes valkey-io#2592 --------- Signed-off-by: Binbin <binloveplay1314@qq.com>

…valkey-io#3156) Implemented a way to propagate az through gossip and obtain az information through the CLUSTER SHARDS and CLUSTER SLOTS commands. It will only be displayed if the node is configured with it. Closes valkey-io#3110 --------- Signed-off-by: Su Ko <rhtn1128@gmail.com>

`zdiffAlgorithm2()` can break out early once the destination cardinality reaches zero. In that path, a temporary SDS created by `zuiSdsFromValue()` was left dirty and never released, because that cleanup normally happens on the next iterator step. This patch explicitly discards the dirty value before the early `break`. ``` ==11724== 11 bytes in 1 blocks are definitely lost in loss record 36 of 1,442 ==11724== at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==11724== by 0x2FA974: ztrymalloc_usable_internal (zmalloc.c:156) ==11724== by 0x2FAAEA: zmalloc_usable (zmalloc.c:200) ==11724== by 0x282D25: _sdsnewlen (sds.c:102) ==11724== by 0x2830B2: sdsnewlen (sds.c:169) ==11724== by 0x2DB537: zuiSdsFromValue (t_zset.c:2290) ==11724== by 0x2DBE00: zdiffAlgorithm2 (t_zset.c:2502) ==11724== by 0x2DC085: zdiff (t_zset.c:2568) ==11724== by 0x2DD01F: zunionInterDiffGenericCommand (t_zset.c:2817) ==11724== by 0x2DD574: zdiffCommand (t_zset.c:2898) ==11724== by 0x29F5AE: call (server.c:3883) ==11724== by 0x2A1598: processCommand (server.c:4569) ``` Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

In flushall we will call killRDBChild to kill the RDB child, but we need to wait for checkChildrenDone and waitpid to kick in, so that we can call resetChildState to reset the state. The previous wait_for_condition 10 100 (1s) might not be enough, changed it to wait_for_condition 50 100 (5s). ``` [err]: Test FLUSHALL aborts bgsave in tests/integration/rdb.tcl bgsave not aborted ``` Signed-off-by: Binbin <binloveplay1314@qq.com>

…tCommandVector (valkey-io#3279) If the command is in the form of "SET key value EX/PX/EXAT ttl", then we don't need to rewrite the entire command vector. In addition to saving the rewriting of the command vector, we also save one command table lookup in this case ince we don't need to lookup SET command again. We already have many relevant test coverages in expire.tcl. This PR does not change anything around the propagate. Somehow like 2544755. Signed-off-by: Binbin <binloveplay1314@qq.com>

With current Makefile, `LDFLAGS` are not used for modules. This results in security options not applied. ``` $ annocheck /usr/lib64/valkey/modules/rdma.so annocheck: Version 12.99. Hardened: rdma.so: FAIL: bind-now test because not linked with -Wl,-z,now Hardened: Rerun annocheck with --verbose to see more information on the tests. Hardened: rdma.so: Overall: FAIL. ``` With this patch ``` $ annocheck /usr/lib64/valkey/modules/rdma.so annocheck: Version 12.99. Hardened: rdma.so: PASS. ``` Signed-off-by: Remi Collet <remi@remirepo.net>

There is now a port of fast_float in C. So instead of having an optional fast_float dependency, we can just use ffc instead, unconditionally. https://github.com/kolemannix/ffc.h It is a high quality port. The performance should be the same or improved. Note : I am the maintainer and main author of fast_float. --------- Signed-off-by: Daniel Lemire <daniel@lemire.me>

- Supports `VALKEYMODULE_CONFIG_UNSIGNED` and `UNSIGNED_CONFIG` for a wider range of configurations. - Supported the API for `ValkeyModule_RegisterUnsignedNumericConfig` - `string2ull` method supports the length parameter, making it more secure --------- Signed-off-by: artikell <739609084@qq.com> Signed-off-by: skyfirelee <739609084@qq.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>

Fixes the failing run: ``` [err]: Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl Expected '0.000000' to be more than '0' (context: type source line 49 file /Users/runner/work/valkey/valkey/tests/unit/io-threads.tcl cmd {assert_morethan $used_active_time 0} proc ::test) ``` The change deflakes the test by generating enough I/O-thread work. The test previously created 16 deferred clients but only sent 15 total `INCR` commands. On fast runners, that burst could be short enough that `used_active_time_io_thread_*` was still reported as `0.000000`, causing the test to fail even though I/O threads had run. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

…alkey-io#3352) `availability_zone` was added to the HELLO command in valkey-io#1487, and it was missing the reply schema and the test was wrongly marked with `logreqres:skip`. `availability-zone` was added to CLUSTER SHARDS and CLUSTER SLOTS commands in valkey-io#3156, and it was missing the reply schema. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>

…alkey-io#1901) When a replica finishes a **disk-based** full sync and both `appendonly yes` and `aof-use-rdb-preamble yes` are enabled, this PR reuses the received `dump.rdb` as the new AOF base file, instead of triggering `BGREWRITEAOF`. This avoids generating an almost identical base snapshot and removes redundant CPU/IO work. Before: 1. Replica receives RDB via disk-based full sync 2. Replica loads RDB into memory 3. Replica may delete synced RDB (`rdb-del-sync-files`) 4. `BGREWRITEAOF` runs and generates a new base snapshot 5. New snapshot becomes AOF base After: 1. Replica receives RDB via disk-based full sync 2. Replica loads RDB into memory 3. Synced RDB is renamed to the new AOF base file 4. A new incremental AOF is created 5. No `BGREWRITEAOF` is needed Tests: - [x] Disk-based sync + preamble on: RDB reused as AOF base - [x] New writes after sync: incr AOF works correctly - [x] Replica restart: reused AOF loads correctly - [x] Disk-based sync + preamble off: fallback to `BGREWRITEAOF` - [x] Diskless sync + preamble on: fallback to `BGREWRITEAOF` - [x] Diskless sync with stale local `dump.rdb`: stale file is not reused --------- Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>

…ations (valkey-io#3261) The "New Master down consecutively" test was sometimes failing under Valgrind by timing out. The new overrides match those used for the cluster in the first part of the file - see valkey-io#2672 Under Valgrind's 10-20x slowdown, a single failover requiring ~15 seconds of server time can exceed the test's 100-second wall-clock wait. Error text: ``` *** [err]: New Master down consecutively in tests/unit/cluster/slave-selection.tcl No failover detected when master 12 fails ``` Daily run failure: https://github.com/valkey-io/valkey/actions/runs/22421982161/job/64921545936#logs --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Rain Valentine <rainval@amazon.com>

…o#3263) Carries on from where valkey-io#3161 left off. The test-sanitizer-address-large-memory jobs were being OOM-killed on GitHub-hosted runners (15.6GB RAM) due to ASAN's 2-3x memory overhead. Changes: - Skip 4GB quicklist compression test under ASAN (requires ~16-24GB with dual buffers + ASAN overhead) - Reduce integration test sizes from 5GB to 4.1GB (preserves >4GB 32-bit boundary coverage) - Reduce XADD iterations from 10 to 3 - Add memory monitoring to track minimum free memory during CI runs Signed-off-by: Rain Valentine <rsg000@gmail.com>

hpatro · 2026-03-13T07:25:36Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28d410dc2c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sarthakaggarwal97 · 2026-03-13T12:48:01Z

PR Summary

The PR introduces cluster bus I/O offloading to the shared I/O thread pool. The core changes are:

Architecture: Cluster bus read, write, and TLS accept operations can now be dispatched as jobs to I/O worker threads via shared SPMC (main→workers) and MPSC (workers→main) queues. The main thread retains exclusive ownership of cluster state mutation; only raw I/O work is offloaded. If the thread pool is inactive or enqueue fails, the system falls back to synchronous I/O on the main thread, tracked by stat_cluster_io_sync_fallbacks.

cluster_legacy.c (+570/-26): Major changes include new offload handler functions (clusterReadOffloadHandler, clusterWriteJob, clusterReadJob, clusterAcceptJob), completion callbacks that run on the main thread to process results, and a receive-buffer snapshot mechanism (io_rcvbuf_snapshot_len, io_rcvbuf_snapshot_packets) for safe concurrent reads. freeClusterLink now returns int (instead of void) to indicate whether the link was freed or deferred due to an in-flight I/O job. New stat tracking for queued inbound packets is added.

cluster_legacy.h (+62/-1): New enums clusterLinkIOState (IDLE/PENDING) and clusterIOResult (OK/BAD_HEADER/BAD_LENGTH/READ_ERROR/EOF/WRITE_ERROR) are added. The clusterLink struct gains atomic fields for I/O state tracking, receive-buffer snapshot metadata, and result codes. Atomic type aliases (atomic_size_t, atomic_mstime_t) are defined with C++ compatibility guards.

connection.h (+27/-3): A new ConnOwnerKind enum distinguishes whether conn->private_data points to a client or a clusterLink, enabling safety assertions. A new flag CONN_FLAG_ACCEPT_OFFLOAD_PENDING tracks in-flight accept offload jobs. The ioctl field is widened.

queues.c: New source file added to the build, providing queue implementations used by the offload infrastructure.

Build system: queues.c added to VALKEY_SERVER_SRCS in SourceFiles.cmake.

Design document: A comprehensive design-docs/cluster-io-offload.md documents the architecture, job lifecycle, synchronization model, fallback behavior, and statistics.

Walkthrough

This PR offloads cluster bus I/O (reads, writes, and TLS accepts) from the main thread to the shared I/O thread pool. Jobs flow through SPMC/MPSC queues (io_shared_inbox/io_shared_outbox) between the main thread and worker threads, while all cluster state mutation remains on the main thread. A synchronous fallback path is used when the pool is inactive or enqueue fails. A new queues.c source file and supporting data structures (atomic fields, IO state enums, result codes) are introduced to support the concurrent job dispatch.

File Groups

Files	Summary
`src/cluster_legacy.c`	Core implementation of cluster bus I/O offloading: new job functions for read/write/accept offload, completion callbacks on main thread, receive-buffer snapshot mechanism, deferred link freeing when I/O is in-flight. `freeClusterLink` signature changed from `void` to `int` return. New stat `stat_cluster_io_sync_fallbacks` and `stat_cluster_queued_inbound_packets` tracking.
`src/cluster_legacy.h`	New enums `clusterLinkIOState` (IDLE/PENDING) and `clusterIOResult` (6 result codes). New atomic type aliases (`atomic_size_t`, `atomic_mstime_t`) with C++ compat. Extended `clusterLink` struct with I/O state, snapshot, and result fields.
`src/connection.h`	New `ConnOwnerKind` enum (CLIENT vs CLUSTER_LINK) for safe `private_data` casting. New `CONN_FLAG_ACCEPT_OFFLOAD_PENDING` flag. Minor field width changes.
`src/queues.c`, `cmake/Modules/SourceFiles.cmake`	New queue implementation source file added to the build system for SPMC/MPSC queues used by the offload infrastructure.
`design-docs/cluster-io-offload.md`	Comprehensive design document covering architecture, job flow (inbox SPMC / outbox MPSC), synchronization model, fallback behavior, lifecycle, and observability stats.

Release Notes

New Feature: Cluster bus I/O (reads, writes, TLS accepts) can now be offloaded to the shared I/O thread pool, reducing main-thread load in cluster deployments. Falls back to synchronous I/O when the pool is unavailable.
New Feature: New server stats stat_cluster_io_sync_fallbacks and stat_cluster_queued_inbound_packets for monitoring cluster I/O offload behavior.
Documentation: Added design-docs/cluster-io-offload.md describing the offload architecture and synchronization model.

valkey-io#3964) Database-level ACL valkey-io#2309 introduced `alldbs` rule that was explicit for all users and because of that previous versions no longer had the ability to parse ACL strings produced by later versions. Omit `alldbs` in `ACLDescribeSelector()`, that is used in `ACL SAVE/LOAD` and `CONFIG REWRITE` command paths so that downgrades would be possible if new feature was not used (`db=` and `resetdbs` rules). Keep `ACL GETUSER` command's output as is and return `alldbs` in `databases` field because of command's field-value format. Add test to check that `ACL LIST` omits implicit `alldbs` and add check to existing `ACL SAVE` and `CONFIG REWRITE` tests. Fixes valkey-io#3915 Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

@sarthakaggarwal97

Sarthak Aggarwal (@sarthakaggarwal97) has been granted write permissions by maintainer consensus. This adds him to the Current Committers list in MAINTAINERS.md. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

…alkey-io#3920) ## Problem A crafted zipmap entry can set the value length to a value near `UINT32_MAX` so that the `l + e` sum (value length + one-byte free space) wraps in `unsigned int` arithmetic. The wrapped sum advances the validation cursor by a tiny amount, leaving `p` inside the buffer, so the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly returns success. The field-length path has the same shape — advancing `p` by a ~4GB length wraps the pointer on 32-bit builds. `zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c` when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any client with `RESTORE` access can submit a payload that passes validation. On 32-bit platforms this leads to out-of-bounds access during the subsequent zipmap→listpack conversion. On 64-bit the downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept a malformed payload in the first place — this is the function whose sole job is to reject it. ## Fix Bounds-check the attacker-controlled length against the bytes remaining in the zipmap, in 64-bit space, **before** any pointer arithmetic, for both the field-length and value-length paths. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising the full attack surface; asserts rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (validator accepts the value-length payload) and **passes after the fix**, confirmed by stashing the fix during the integration run. - Full `integration/corrupt-dump` suite: 76 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

…valkey-io#3921) ## Problem A crafted `RESTORE` payload can store a `NAN` score in a listpack-encoded sorted set. The integrity validation (`lpValidateIntegrityAndDups`) only checks the listpack *structure* and member uniqueness — it does not check score validity — so the payload is accepted on load. When the sorted set is later converted to a skiplist (e.g. when it grows past `zset-max-listpack-entries`, or via any operation that triggers conversion), `zslInsertNode()` asserts the score is not `NAN` (`t_zset.c:260`) and the server aborts. **Any client with `RESTORE` access can remotely crash the server.** The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no equivalent check. ## Reproduction ``` RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...." # loads OK, then: ZADD k 9 x # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT ``` ## Fix Add `zzlValidateScores()`, which scans the scores of a listpack zset after structural validation and rejects the payload if any score is `NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and large finite scores remain accepted (only `NAN` is rejected), matching normal `ZADD` semantics. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (server crashes on conversion) and **passes after the fix**, by stashing the fix during the run. - Confirmed valid zsets, including `inf`/`-inf`/large finite scores, still load and convert correctly. - Full `integration/corrupt-dump` suite: 74 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

When `valkey-cli --rdb` trims the EOF marker, a failed `ftruncate` only printed a warning but still reported success and exited 0, leaving a corrupt RDB file. This makes it exit non-zero on that failure, matching how the `write()` failure is already handled in the same function. (Thank you, Madelyn, for tag teaming this with me.) Signed-off-by: Grace <glucier22@gmail.com>

…ey-io#3959) The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`. The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See: ``` if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch || clusterSlotFailoverGranted(j)) { ... clusterDelSlot(j); clusterAddSlot(sender, j); ... } ``` R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]: ``` if (server.cluster->slots[j] == sender) { /* The slot is currently bound to the sender but the sender is no longer * claiming it. We don't want to unbind the slot yet as it can cause the cluster * to move to FAIL state and also throw client error. Keeping the slot bound to * the previous owner will cause a few client side redirects, but won't throw * any errors. We will keep track of the uncertainty in ownership to avoid * propagating misinformation about this slot's ownership using UPDATE * messages. */ bitmapSetBit(server.cluster->owner_not_claiming_slot, j); } ``` Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever. ``` if (server.cluster->slots[j] == NULL || ...) { new_state = CLUSTER_FAIL; ... } ``` Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify. This closes valkey-io#3891. The test was added in valkey-io#3674. Signed-off-by: Binbin <binloveplay1314@qq.com>

## Summary `addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)` when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`) regardless of the `use_map` parameter. The non-empty path (below it) already branches correctly on `use_map` — the empty early-return was simply missing the same logic. In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`) instead of an Array (`*0`) for any command without subcommands (e.g. PING). Strict RESP3 client libraries that dispatch on collection type will misinterpret the response. Not visible in RESP2 since both Set and Array use the `*` prefix there. ## Fix Apply the same `use_map` branch to the empty case: - `addReplyMapLen(c, 0)` when `use_map=1` - `addReplyArrayLen(c, 0)` otherwise ## Test Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that inspects the raw wire-level type byte for the subcommands field of `COMMAND INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set). Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com> Signed-off-by: rickrams <rickrams@amazon.com>

…ey-io#3811) `off_t` (64-bit), but were read into `int` (32-bit) locals in `genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`. This causes INFO replication to report negative `master_sync_total_bytes` during bio disk-based sync when RDB exceeds 2GB. Fix: change the local variable types from `int` to `off_t`. Signed-off-by: chx9 <lovelypiska@outlook.com>

Multi-command parsing in parseMultibulkBuffer (introduced valkey-io#2092) was disabled for replicated clients because the per-command replication offset relied on c->qb_pos as the right boundary of the just-applied command. Replication stream is actually a big pipeline, so if it can be supported, the processing speed of the replica can be improved. Decouple parsing position from application position by introducing a single per-client field that tracks the currently processed command's right boundary in querybuf: client.qb_applied: right boundary of the current command in querybuf. Set by the parsers (for c->argv, = c->qb_pos) and advanced incrementally by consumeCommandQueue() using each queued command's input_bytes (qb_applied += p->input_bytes). parsedCommand.input_bytes already records the number of querybuf bytes consumed to parse one command (multibulk header + each bulk-length line + arg bytes + per-arg CRLFs), see parseMultibulk for mode details. It is a relative quantity, independent of where querybuf has been trimmed since parsing, so it is exactly what's needed to advance qb_applied across multi-command parsing, so no extra per-command snapshot field is required. commandProcessed() now uses c->qb_applied instead of c->qb_pos to advance reploff by exactly the bytes of the just-applied command. beforeNextClient shifts only c->qb_applied (a single variable) when the replicated client's querybuf is trimmed; pending queue entries carry only relative input_bytes and therefore need no fix-up. qb_applied is populated unconditionally on the first-command path so that a client transitioning to replicated mid-command (e.g. SYNCSLOTS ESTABLISH installs slot_migration_job inside its own handler) still has a valid value when commandProcessed() runs. Signed-off-by: Binbin <binloveplay1314@qq.com>

…r fuzzy provenance matches (valkey-io#3933) - update verify-provenance to require near-duplicate evidence for fuzzy provenance matches - normalize master/primary and slave/replica terminology. - print the captured provenance script log in the check workflow. --------- Signed-off-by: Ping Xie <pingxie@outlook.com>

…-io#3982) ## Issue Description `log_crashes` in `tests/instances.tcl` — the multi-instance harness used by both `make test-sentinel` and `make test-cluster` — scans every instance's `err.txt` with `find_valgrind_errors` **unconditionally**: ```tcl set logs [glob */err.txt] foreach log $logs { set res [find_valgrind_errors $log true] if {$res != ""} { puts $res incr ::failed } } ``` The main test framework runs the identical scan only behind a valgrind guard — in `tests/support/server.tcl`, `check_valgrind_errors` is called from `kill_server` inside `if {$::valgrind}`. `find_valgrind_errors`' fallback (in `tests/support/util.tcl`) treats any stderr that lacks a valgrind leak-free summary as an error: ```tcl # Look for the absence of a leak free summary # (happens when the server isn't terminated properly). if {(![regexp -- {definitely lost: 0 bytes} $buf] && ![regexp -- {no leaks are possible} $buf])} { return $buf } ``` When the suite is **not** running under valgrind, that summary is never present, so **any non-empty `err.txt` is flagged as a failure**. ## Concrete trigger On hosts whose CPU does not expose the `constant_tsc` flag (common inside VMs / LXD), every `valkey-server` and `valkey-sentinel` prints to stderr at startup (`src/monotonic.c`): ``` monotonic: x86 linux, 'constant_tsc' flag not present ``` The sentinel suite then reports exactly **10 failures** (5 sentinel + 5 valkey instances) on every run — even though every individual test passes and no assertion fails. The cluster suite shares this harness and is affected identically. On hosts that expose `constant_tsc` the message is never printed, so the bug is invisible — which is why it only surfaces in certain CI/VM environments. ## Fix Wrap the valgrind `err.txt` scan in `if {$::valgrind}`, mirroring `server.tcl`. The sanitizer scan immediately below is left unguarded, matching `check_sanitizer_errors`, which always runs. No test coverage is lost. ```tcl if {$::valgrind} { set logs [glob */err.txt] foreach log $logs { set res [find_valgrind_errors $log true] if {$res != ""} { puts $res incr ::failed } } } ``` Signed-off-by: Smail Kourta <smail.kourta@canonical.com>

…on (valkey-io#3968) Adds REUSE 3.3 structure (REUSE.toml, LICENSES/) covering all source files and vendored dependencies. Replaces the non-standard dual-license COPYING file with a standard BSD-3-Clause text. Updates the description of custom patches to Lua and Jemalloc in deps/README.md. Benefits: - GitHub and OpenSSF Scorecard correctly detect the project license - reuse spdx generates a complete SPDX SBOM covering first-party code and all vendored deps - CI check prevents future contributors from introducing invalid SPDX identifiers - Per-file license clarity for downstream consumers (distro packagers, enterprises) --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…ey-io#3803) New hashtable API hashtableScanHasPassedKey. New guarantees around hashtable scan. --------- Signed-off-by: Rain Valentine <rainval@amazon.com>

Updated `kvstoreScan` to more clearly show when a kvstore cursor was used vs. a hashtable cursor. Signed-off-by: Jim Brunner <brunnerj@amazon.com>

Replace ~10k individual INCR round-trips with single SET commands using large values. And set shutdown-timeout 0 before shutting down the node. Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>

Extend zmalloc_purge to also trim the libc main arena Rename jemalloc_purge() to zmalloc_purge() and append a malloc_trim(3) call on glibc systems, so that free pages held in the libc main arena are returned to the OS. This benefits all callers: MEMORY PURGE, FLUSHALL, and FLUSHDB. Update MEMORY HELP and memory doctor diagnostics to mention libc internal allocations as a possible source of high RSS overhead. Signed-off-by: Vaibhav Gupta <wweebbssss@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

…functions (valkey-io#3357) ## Summary This PR removes the unused `void(callback)(dict *)` parameter from `functionReset()` and related functions. ## Changes The callback parameter in `functionReset()`, `functionsLibCtxFree()`, and `functionsLibCtxClear()` was never actually used - it was always passed as NULL or cast from an incompatible type. This cleanup: - Removes the `void(callback)(dict *)` parameter from `functionReset()` - Removes the callback parameter from `functionsLibCtxFree()` - Removes the callback parameter from `functionsLibCtxClear()` - Makes `functionsLibCtxReleaseCurrent()` static since it's only used internally - Removes the TODO comment in db.c about the callback incompatibility - Updates all call sites to use the simplified function signatures ## Files Changed - `src/functions.h`: Updated function declarations - `src/functions.c`: Removed callback parameters and updated implementations - `src/db.c`: Simplified `functionReset()` call, removed TODO comment - `src/lazyfree.c`: Updated `functionsLibCtxFree()` calls ## Rationale The `dictEmpty()` calls now explicitly pass NULL for the callback, which was effectively what was happening before since the parameter was unused. This simplifies the API and removes the need for the type cast workaround that was previously used in db.c. Signed-off-by: Justin Fung <justfung@amazon.com> Co-authored-by: Justin Fung <justfung@amazon.com>

#### Purpose This workflow was originally introduced in PR [valkey-io#3358](valkey-io#3358), where we detect the failures in our scheduled `daily` runs and create / update github issues. We want to do more things with AI with respect to tests failures. It could include potentially finding the root cause, any PR that broke the tests, some helpful dashboard to track daily tests, maybe some analysis or possible fix as well. To achieve that, we are moving this issue management out of this repository and into `valkey-ci-agent`. The Daily workflow in this repository still records per-job test failures, consolidates them into `all-test-failures.json`, and uploads the `all-test-failures` artifact. The workflow being removed here was only responsible for consuming that artifact and creating or updating GitHub issues. #### Changes Remove `.github/workflows/test-failure-detector.yml`. Issue creation and updates are now handled by the Test Failure Detector workflow in `valkey-ci-agent` through this PR [valkey-io#24](valkey-io/valkey-ci-agent#24). #### Notes This should be merged together with the corresponding `valkey-ci-agent` change so scheduled test-failure detection continues without a gap. Signed-off-by: Bonnie Chan <bonchan35@gmail.com>

Document the shared I/O threading infrastructure: SPMC inbox, MPSC outbox, per-worker SPSC inboxes, job lifecycle, drain semantics, live reconfiguration. --------- Signed-off-by: Harkrishn Patro <h_patro@apple.com>

…become SIGNED_MEMORY_CONFIG (valkey-io#2648) After valkey-io#3443, we have a new `SIGNED_MEMORY_CONFIG`, which enables the memory configuration to support storing the value -1. Make commandlog-request-larger-than and commandlog-reply-larger-than become SIGNED_MEMORY_CONFIG, so we can use it in a memory way. Signed-off-by: Binbin <binloveplay1314@qq.com>

255 is too short if valkey-server is being run from a long path, especially if many fields are included such as in the "Process title set as expected" test in `unit/other.tcl`, where the max length for the cmdline is 1024, so having the same number in both makes sense. Closes valkey-io#3832. Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>

…y-io#3980) The `test io-threads are runtime modifiable` test in `tests/unit/other.tcl` times out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test is introduced in valkey-io#3938. This PR reduces the loop to 10 iterations under Valgrind. **Failure links:** - https://github.com/valkey-io/valkey/actions/runs/27386948127 (Jun 12) - https://github.com/valkey-io/valkey/actions/runs/27315974006 (Jun 11) - https://github.com/valkey-io/valkey/actions/runs/27245311034 (Jun 10) --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

…alkey-io#3675) No bug exists — all NOT_KEY commands (CLUSTERSCAN, SSUBSCRIBE, SPUBLISH, SUNSUBSCRIBE) have NULL getkeys_proc, so doesCommandHaveKeys returns 0 and ACL correctly skips key checks. This PR makes the handling more explicit and defensive for future commands that may combine getkeys_proc with NOT_KEY key-specs. Changes: - Refactor doesCommandHaveKeys() from a one-line ternary into clearer if-else branches. Add an explicit check: if a command has getkeys_proc but all its key-specs are NOT_KEY, treat it as having no real keys. - Add a defensive NOT_KEY check in ACLSelectorCheckKey() to skip key-pattern ACL validation for NOT_KEY entries, consistent with the existing skip in getKeysUsingKeySpecs(). - Add tests: verify COMMAND GETKEYS/GETKEYSANDFLAGS report "no key arguments" for NOT_KEY commands; verify restricted ACL users can run CLUSTERSCAN without key-permission errors. Related: valkey-io#2934, valkey-io#3699 Signed-off-by: Binbin <binloveplay1314@qq.com>

While reviewing valkey-io#3675 I found a pre-existing `NOT_KEY` inconsistency in client tracking. `CLUSTERSCAN` is skipped by `COMMAND GETKEYS/ACL` through the key spec path, but tracking still uses `getKeysFromCommand()`. Updated to use `getKeysFromCommandWithSpecs` ``` raghav$ printf 'client tracking on\ninfo stats\nclusterscan 0-{06S}-0\ninfo stats\n' | /tmp/valkey-unstable-single/src/valkey-cli -p 7001 --raw | grep -E '^(OK|tracking_total_keys:|0$)' OK tracking_total_keys:2 0 tracking_total_keys:2 ```` This affects today `CLUSTERSCAN` because it is RO, `SPUBLISH/SSUBSCRIBE/SUNSUBSCRIBE` using `NOT_KEY` are not RO. Signed-off-by: nmvk <r@nmvk.com>

Add explicit validation for the FIELDS token before parsing numfields. Fixes valkey-io#4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

…io#4058) When a primary receives an MFSTART (manual failover start) message, it was silently ignored if the sender was unknown or not a replica of this primary. This made it impossible to diagnose why a manual failover was timing out on the replica side — the primary had no indication that it ever received the request. In particular, cluster gossip may not have propagated the new replica relationship to the primary yet, so the primary may legitimately not recognize the sender as its replica. Without logging, this scenario leaves no trace on the primary side. Signed-off-by: Binbin <binloveplay1314@qq.com>

Some functions are newly deprecated in OpenSSL 4.0. This commit works around those. More specifically: - X509_cmp_current_time() in isCertValid() is no longer necessary as X509_check_certificate_times() compares the notBefore and notAfter on its own. Had to add a version check since this function is new in OpenSSL 4.0. - Replace deprecated X509_NAME_get_text_by_NID. Not a perfect fix, because the new implementation still assumes that the name does not contain embedded null characters which may not be true, e.g., if the name is of type UniversalString or BMPString. - Also fix constness of X509_get_subject_name return value. The latter two fixes are taken from this Fedora patch: https://src.fedoraproject.org/rpms/valkey/c/1a9c8847172ef3fb116a1e2fdb3871692378adae?branch=rawhide Closes valkey-io#4012 Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>

valkey-io#4059) Two minor cleanups: - Remove unnecessary incrRefCount(shared.hdel) just like other shared object. - Fix new_argv allocation size since HDEL only need (num_fields + 2) size. Signed-off-by: Binbin <binloveplay1314@qq.com>

Use I/O threads for cluster bus reads, writes, and TLS accepts, mirroring the client I/O threading model. Splits packet framing from application, adds deferred link teardown with io_refs for safe concurrent I/O, dual send queues, and a framed packet queue on clusterLink. Adds observability counters (CLUSTER INFO). Signed-off-by: Harkrishn Patro <h_patro@apple.com>

nmvk and others added 21 commits March 9, 2026 07:41

github-actions Bot assigned hpatro Mar 13, 2026

Repository owner deleted a comment from chatgpt-codex-connector Bot Mar 13, 2026

hpatro mentioned this pull request Mar 13, 2026

Cluster Bus I/O Offload valkey-io/valkey#3361

Open

chatgpt-codex-connector Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread src/io_threads.c

Comment thread src/io_threads.c Outdated

Comment thread src/cluster_legacy.c Outdated

Comment thread src/server.c Outdated

sarthakaggarwal97 reviewed Mar 13, 2026

View reviewed changes

Comment thread src/server.c Outdated

sarthakaggarwal97 reviewed Mar 13, 2026

View reviewed changes

Comment thread src/server.c Outdated

dvkashapov and others added 29 commits June 10, 2026 14:45

Add Sarthak Aggarwal as a committer (valkey-io#3975)

4f11dd6

Sarthak Aggarwal (@sarthakaggarwal97) has been granted write permissions by maintainer consensus. This adds him to the Current Committers list in MAINTAINERS.md. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

hashtable: scan no-duplicate guarantee and add HasPassedKey API (valk…

5ae43a4

…ey-io#3803) New hashtable API hashtableScanHasPassedKey. New guarantees around hashtable scan. --------- Signed-off-by: Rain Valentine <rainval@amazon.com>

Improve maintainability for kvstoreScan (valkey-io#3588)

436dcae

Updated `kvstoreScan` to more clearly show when a kvstore cursor was used vs. a hashtable cursor. Signed-off-by: Jim Brunner <brunnerj@amazon.com>

Reduce execution time for replica migration tests (valkey-io#4006)

fa62dce

Replace ~10k individual INCR round-trips with single SET commands using large values. And set shutdown-timeout 0 before shutting down the node. Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>

Add design doc for I/O threads (valkey-io#4022)

ead3dc6

Document the shared I/O threading infrastructure: SPMC inbox, MPSC outbox, per-worker SPSC inboxes, job lifecycle, drain semantics, live reconfiguration. --------- Signed-off-by: Harkrishn Patro <h_patro@apple.com>

Fix HGETDEL FIELDS token validation (valkey-io#4049)

d38ff73

Add explicit validation for the FIELDS token before parsing numfields. Fixes valkey-io#4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

hpatro force-pushed the cluster-io-offload branch from b4fdc03 to 16d11d2 Compare June 29, 2026 15:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster IO Offload#4

Cluster IO Offload#4
hpatro wants to merge 225 commits into
io-threads-redesignfrom
cluster-io-offload

hpatro commented Mar 13, 2026 •

edited

Loading

Uh oh!

hpatro commented Mar 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sarthakaggarwal97 commented Mar 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

hpatro commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Cluster Bus I/O Offload

Overview

Architecture

Key Design Decisions

Data Flow: Read Path

Data Flow: Write Path

Data Flow: Accept Path (TLS)

clusterLink I/O State

Invariants

Link State

I/O Thread MAY

I/O Thread MUST NOT

Main Thread MUST NOT While a Job Is In Flight

Dispatch Contract

Deferred Teardown

Failure Detection

Error Handling

Follow-Up

Uh oh!

hpatro commented Mar 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sarthakaggarwal97 commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Walkthrough

File Groups

Release Notes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

hpatro commented Mar 13, 2026 •

edited

Loading

sarthakaggarwal97 commented Mar 13, 2026 •

edited

Loading