Skip to content

Cluster IO Offload#4

Open
hpatro wants to merge 225 commits into
io-threads-redesignfrom
cluster-io-offload
Open

Cluster IO Offload#4
hpatro wants to merge 225 commits into
io-threads-redesignfrom
cluster-io-offload

Conversation

@hpatro

@hpatro hpatro commented Mar 13, 2026

Copy link
Copy Markdown
Owner

Cluster Bus I/O Offload

Overview

Cluster bus read, write, and TLS accept work can run on the shared I/O thread
pool, while all cluster state mutation stays on the main thread.

Jobs flow through:

  • io_shared_inbox (SPMC): main thread -> I/O workers
  • io_shared_outbox (MPSC): I/O workers -> main thread

If the pool is inactive or enqueue fails, the caller falls back to synchronous
I/O on the main thread and increments stat_cluster_io_sync_fallbacks.

Architecture

+---------------------------+       +-------------------------------+       +---------------------------+
|       Main Thread         |       |        Shared Queues          |       |     I/O Thread Pool       |
|                           |       |                               |       |                           |
| clusterReadOffloadHandler |------>| io_shared_inbox (SPMC)        |------>| clusterReadJob            |
| clusterWriteHandler       |       |                               |       | clusterWriteJob           |
| clusterAcceptHandler      |       |                               |       | clusterAcceptJob          |
| clusterSendMessage        |       +-------------------------------+       +---------------------------+
|                           |
| processIOThreadsResponses |<------ io_shared_outbox (MPSC) <------ sendToMainThread(...)
| clusterHandle*Completion  |
| clusterProcessPacket      |
| freeClusterLink           |
+---------------------------+

Key Design Decisions

  1. Reuse the existing shared SPMC/MPSC queues; no cluster-specific threading primitives.
  2. I/O threads do transport work only: connRead, connWrite, connAccept, framing, and result publication.
  3. Main thread alone applies cluster packets and mutates cluster state.
  4. Write offload uses a single canonical send_msg_queue plus a snapshot boundary:
    io_last_send_block, io_head_offset, and io_nodes_sent.
  5. Read offload uses rcvbuf snapshot boundaries: io_rcvbuf_snapshot_len and io_rcvbuf_snapshot_packets.
  6. Read completion drains the queued rcvbuf snapshot fully; there is no bounded packet/time budget.
  7. Link teardown is deferred with io_refs and async_close.
  8. Accept offload is serialized per connection with CONN_FLAG_ACCEPT_OFFLOAD_PENDING.
  9. ConnOwnerKind lets generic connection/TLS code distinguish client-owned and cluster-owned connections safely.

Data Flow: Read Path

[MAIN dispatch] clusterReadOffloadHandler
    |
    | drain queued snapshot
    | trySendClusterReadToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_READ
    v
[IO worker] clusterReadJob
    | connRead() into rcvbuf
    | grow rcvbuf if needed
    | snapshot complete packet prefix
    | set io_rcvbuf_snapshot_len / packets
    | set io_result
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_READ
    v
[MAIN completion] clusterHandleReadCompletion
    | set io_read_state = IDLE
    | io_refs--
    | reconcile memory
    | drain queued rcvbuf snapshot fully
    | shrink/free as needed

If dispatch returns C_ERR, the main thread falls back to clusterReadHandler(conn).

Data Flow: Write Path

[MAIN dispatch] clusterSendMessage / clusterWriteHandler
    |
    | append to send_msg_queue
    | snapshot io_last_send_block / io_head_offset
    | trySendClusterWriteToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_WRITE
    v
[IO worker] clusterWriteJob
    | write from queue head
    | stop at io_last_send_block
    | set io_nodes_sent
    | set io_head_offset
    | set io_result
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_WRITE
    v
[MAIN completion] clusterHandleWriteCompletion
    | set io_write_state = IDLE
    | io_refs--
    | pop sent nodes
    | apply head_msg_send_offset
    | reinstall handler or free on error

Important note:

  • New messages appended while a write job is in flight stay on send_msg_queue,
    but the worker stops at io_last_send_block, so those new nodes are picked up
    by a later dispatch.

If dispatch returns C_ERR, clusterWriteHandler(conn) falls back to the synchronous connWrite() loop.

Data Flow: Accept Path (TLS)

[MAIN dispatch] clusterAcceptHandler
    |
    | create accepted conn
    | set owner kind
    | allow accept offload
    | trySendClusterAcceptToIOThreads()
    v
[QUEUE] io_shared_inbox
    | JOB_REQ_CLUSTER_ACCEPT
    v
[IO worker] clusterAcceptJob
    | connAccept(conn, NULL)
    v
[QUEUE] io_shared_outbox
    | JOB_RES_CLUSTER_ACCEPT
    v
[MAIN completion] clusterHandleAcceptCompletion
    | clear ACCEPT_OFFLOAD_PENDING
    | finalize conn state / refs
    | if ACCEPTING: return
    | else clusterConnAcceptHandler()
    | create clusterLink on success

Important note:

  • TLS retries can re-enter the generic accept offload path. Cluster-owned
    connections are routed back to trySendClusterAcceptToIOThreads, and the
    pending flag ensures only one accept job is in flight for that connection.

If dispatch returns C_ERR, the main thread falls back to connAccept(conn, clusterConnAcceptHandler).

clusterLink I/O State

IDLE --dispatch read-->  READ_PENDING  --read completion-->  IDLE
IDLE --dispatch write--> WRITE_PENDING --write completion--> IDLE

If freeClusterLink() sees io_refs > 0:
  detach link from node
  remove handlers
  set async_close = 1
  defer final free until the last completion drops io_refs to 0

Read and write jobs are mutually exclusive per link.

Invariants

Link State

  • io_refs >= 0 always.
  • io_refs == 0 implies both io_read_state and io_write_state are IDLE.
  • Read and write jobs are mutually exclusive.
  • send_msg_queue_mem tracks the canonical send_msg_queue.
  • Link buffer-limit accounting currently uses send_msg_queue_mem + rcvbuf_len.
  • stat_cluster_links_memory is updated on the main thread.

I/O Thread MAY

  • Call connRead() / connWrite() / connAccept() on the offloaded connection.
  • Grow rcvbuf and update rcvbuf_len / rcvbuf_alloc.
  • Scan rcvbuf and publish io_rcvbuf_snapshot_len / io_rcvbuf_snapshot_packets.
  • Read send_msg_queue up to the snapped io_last_send_block.
  • Write io_result, io_nodes_sent, io_head_offset, and last_io_read_time.
  • Publish completions with sendToMainThread().

I/O Thread MUST NOT

  • Touch clusterNode, clusterState, slot ownership, epoch/failover state, or any other cluster-global state.
  • Call clusterProcessPacket() or any packet-application logic.
  • Modify io_read_state, io_write_state, io_refs, or async_close.
  • Pop queue nodes or update send_msg_queue_mem.
  • Update server.stat_cluster_links_memory.
  • Free clusterLink, clusterNode, or connection objects.

Main Thread MUST NOT While a Job Is In Flight

  • For read: touch rcvbuf, rcvbuf_len, rcvbuf_alloc, io_rcvbuf_snapshot_len, or io_rcvbuf_snapshot_packets.
  • For write: pop already-visible queue nodes or rewrite the tracked head send offset.
  • Dispatch a second read/write job for the same link.

Dispatch Contract

All three dispatch helpers (trySendClusterReadToIOThreads,
trySendClusterWriteToIOThreads, trySendClusterAcceptToIOThreads) follow the
same pattern:

  1. connSetPostponeUpdateState(conn, 1) and connIncrRefs(conn) before enqueue.
  2. Roll back state and refs on enqueue failure.
  3. Finalize postponed connection state on completion.

Return values:

  • C_OK: work was offloaded, or an equivalent job is already pending.
  • C_ERR: pool inactive or enqueue failed, so caller may sync-fallback.

Deferred Teardown

If freeClusterLink(link) sees io_refs > 0:

  1. Detach the link from node fields.
  2. Remove read/write handlers so no new work is scheduled.
  3. Set async_close = 1.
  4. Return without closing/freeing immediately.

The actual connClose() and final free happen later, when the last completion
handler drops io_refs to 0.

Failure Detection

last_io_read_time is updated by the read worker on successful reads and is
included in the cluster node delay calculation.

data_delay = now - max(node->data_received,
                       node->link->last_io_read_time,
                       node->inbound_link->last_io_read_time)

The I/O thread stores last_io_read_time with release ordering; the main thread
loads it with acquire ordering.

Error Handling

Result / Condition Handling
CLUSTER_IO_BAD_HEADER / CLUSTER_IO_BAD_LENGTH Log warning, free link immediately
CLUSTER_IO_READ_ERROR / CLUSTER_IO_EOF Log debug, drain queued rcvbuf snapshot, then free link
CLUSTER_IO_WRITE_ERROR Log debug, free link
CONN_STATE_ACCEPTING after accept completion Leave connection open; TLS event flow will retry
Pool inactive / queue full Sync fallback + increment stat_cluster_io_sync_fallbacks

Follow-Up

  • Read completion currently drains the queued rcvbuf snapshot in one pass on the main thread and compacts the buffer after each packet.
  • Follow-up work: either reintroduce bounded completion with a correct continuation mechanism, or keep full drain semantics but switch to a front-offset model so we avoid repeated per-packet memmove() while a large burst is being applied.
  • Current behavior on CLUSTER_IO_BAD_HEADER / CLUSTER_IO_BAD_LENGTH is to close the link immediately, even if the same offloaded read also contained a valid packet prefix in the queued snapshot. This is accepted for now as an invalid-peer path, but if we ever need sync/offload parity here, read completion should drain the valid prefix before tearing the link down.
  • Create separate queue for cluster io jobs to avoid contention with client io jobs. This will allow us to prioritize different type of jobs and drain at different rate.

nmvk and others added 21 commits March 9, 2026 07:41
Implemented CLUSTERSCAN command for topology-aware scanning

Unlike `SCAN` which is local to a single node, `CLUSTERSCAN` provides a
mechanism that helps clients iterate across slot boundaries and handles
`MOVED` redirections.

**Key details**

* Global cluster iteration via `fingerprint-{hashtag}-cursor`
* Scan one slot at a time
* Start the CLUSTERSCAN with 0
* SLOT argument for parallel scanning of multiple slots
* Re-use scanGenericCommand for the response

**Cursor format:** `fingerprint-{hashtag}-localcursor`
 - Fingerprint is a hash of the node's DB seed that identifies the
   current memory layout. On mismatch, scan restarts from cursor 0
   rather than returning an error.
 - Fingerprint 0 indicates a cross slot cursor (e.g., initial cursor
   or slot transition) where validation is skipped.
 - Hashtag encodes the target slot
 - Local cursor tracks position within the slot

**Usage:**

```
CLUSTERSCAN <cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]
```

```
  CLUSTERSCAN 0                    # Start scanning from slot 0
  CLUSTERSCAN <cursor>             # Continue from cursor
  CLUSTERSCAN 0 SLOT 1000          # Start scanning specific slot
  CLUSTERSCAN <cursor> MATCH user:* COUNT 100
```

---------

Signed-off-by: nmvk <r@nmvk.com>
Signed-off-by: Raghav <r@nmvk.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Before this change, if you built `valkey-server` (e.g. `make
valkey-server`) the LUA module was not built. With this change, the LUA
module is now a direct a dependency of `valkey-server`
- (unless `BUILD_LUA=no` is passed)

Added some colors to the Lua module & Lua lib `Makefile`s to they blend
nicely in the build output.

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
…time_t (valkey-io#3252)

Addresses issue valkey-io#2350

As noted in the issue, much of the expire code uses raw long long for
timestamps, which provides no semantic meaning about the unit or purpose
of the value. Valkey already defines mstime_t (milliseconds) and
ustime_t (microseconds) typedefs — this PR replaces bare long long
declarations with the appropriate typedef wherever the value represents
an expiration timestamp or time duration.

This PR only fixes a small subset in the codebase, but it is an
incremental step toward fully replacing the bare long long references.

---------

Signed-off-by: curious-george-rk <r.ebu@gmail.com>
Co-authored-by: curious-george-rk <r.ebu@gmail.com>
Now we will be able to add a `run-cluster-benchmark` label to run a
benchmark with cluster-mode enabled valkey-server

It will use the config
https://github.com/valkey-io/valkey/blob/unstable/.github/benchmark_configs/benchmark-config-arm.json
modified for for cluster mode with a single clustermode enabled instance of
valkey.

It uses the same single instance for the benchmark as for run-benchmark.
 
If both labels are used, they are sequential in the same concurrency group `group:
ec2-al-2023-pr-benchmarking-arm64`.

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…ey-io#2746)

This PR introduces a new config `cluster-message-gossip-perc` which
allows an operator to modify the amount of gossip node information to be
sent per ping/pong/meet message. It can be modified dynamically (Related
to valkey-io#2291). The default value
is 10% i.e. 10% of peer node information would be gossiped along with
each ping/pong/meet packet. Users can tune this configuration, setting
the value higher allows faster information dissemination whereas setting
it lower would lead to direct PING messages if no information was
received about a node with the `server.cluster_node_timeout/2` period.

Note: the behavior for partially failed gossip nodes still remains
intact where all the `pfail` nodes are part of the message for faster
propagation of information and faster transition of `PFAIL` to `FAIL`.

---------

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
…o#3336)

This fixes a flaky dual-channel replication integration test:
https://github.com/valkey-io/valkey/actions/runs/22810251608/job/66165776198#step:8:7701

`INFO memory` field `used_memory_overhead` and `MEMORY STATS` field
`overhead.total` can change during dual-channel sync if replica's
pending replication buffer is still changing. This is probably more
visible in slower environments.

The test now collects `INFO` and `MEMORY STATS` in a single `MULTI/EXEC`
on both the primary and replica, so the compared values come from the
same snapshot.

Passing here:
https://github.com/sarthakaggarwal97/valkey/actions/runs/22864585326/job/66327772967

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…ation (valkey-io#3258)

Observation around `dictGetSomeKeys(N)` performance is that it is around
10x faster than randomly picking `N` elements over N*3 attempts.
Further, this avoids chances of not picking N elements due to duplicate
element selection with individual random pick. Overall, it should help
picking gossip node(s) for cluster bus message faster.

---------

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
I found that on 32-bit platforms, the gtest fails if I add a declaration
`int aof_rewrite_for_replication;` between
https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L1843
and
https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L2108
The crash is caused by
https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/unit/wrappers.h#L40
stripping the _Atomic qualifier in C++ builds. This makes struct
valkeyServer have different alignment/field offsets in C-compiled code
(libvalkey.a) vs C++-compiled tests (gtest). As a result, fields after
these atomics are accessed at different offsets, leading to corrupted
reads/writes and a segfault.

I added VALKEY_ATOMIC, and it works as follows:
- C builds (production): VALKEY_ATOMIC(uint64_t) -> _Atomic uint64_t
(unchanged behavior)
- C++ builds (tests via wrappers.h): VALKEY_ATOMIC(uint64_t) ->
alignas(8) uint64_t (preserves 8-byte alignment)
- All existing atomic_load_explicit / atomic_store_explicit calls in C
code continue to work because _Atomic is present in C mode
- C++ test code only does direct field access (never uses atomic
operations), so alignas(N) type is sufficient

---------

Signed-off-by: cjx-zar <jxchenczar@foxmail.com>
…-io#3121)

It's a combination of MSET and EXPIRE, which allows us to set an
expiration
time simultaneously with the MSET operation.

Syntax for the new MSETEX command:
```
/* MSETEX numkeys key value [key value ...] [NX | XX]
 *     [EX seconds | PX milliseconds |
 *      EXAT seconds-timestamp | PXAT milliseconds-timestamp | KEEPTTL] */
```

The MSETEX command supports a set of options (like SET command):
- EX seconds -- Set the specified expire time, in seconds (a positive
integer).
- PX milliseconds -- Set the specified expire time, in milliseconds (a
positive
  integer).
- EXAT timestamp-seconds -- Set the specified Unix time at which the
keys will
  expire, in seconds (a positive integer).
- PXAT timestamp-milliseconds -- Set the specified Unix time at which
the keys
  will expire, in milliseconds (a positive integer).
- KEEPTTL -- Retain the time to live associated with the keys.
- NX -- Only set the keys if all keys do not already exist.
- XX -- Only set the keys if all keys already exists.

Return Value (like MSETNX command):
- Integer reply: 0 if no key was set due to NX or XX options.
- Integer reply: 1 if all the keys were set.

Closes valkey-io#2592

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
…valkey-io#3156)

Implemented a way to propagate az through gossip and obtain az
information
through the CLUSTER SHARDS and CLUSTER SLOTS commands. It will only be
displayed if the node is configured with it.

Closes valkey-io#3110

---------

Signed-off-by: Su Ko <rhtn1128@gmail.com>
`zdiffAlgorithm2()` can break out early once the destination cardinality
reaches zero. In that path, a temporary SDS created by `zuiSdsFromValue()`
was left dirty and never released, because that cleanup normally happens on
the next iterator step.

This patch explicitly discards the dirty value before the early `break`.
```
==11724== 11 bytes in 1 blocks are definitely lost in loss record 36 of 1,442
==11724==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==11724==    by 0x2FA974: ztrymalloc_usable_internal (zmalloc.c:156)
==11724==    by 0x2FAAEA: zmalloc_usable (zmalloc.c:200)
==11724==    by 0x282D25: _sdsnewlen (sds.c:102)
==11724==    by 0x2830B2: sdsnewlen (sds.c:169)
==11724==    by 0x2DB537: zuiSdsFromValue (t_zset.c:2290)
==11724==    by 0x2DBE00: zdiffAlgorithm2 (t_zset.c:2502)
==11724==    by 0x2DC085: zdiff (t_zset.c:2568)
==11724==    by 0x2DD01F: zunionInterDiffGenericCommand (t_zset.c:2817)
==11724==    by 0x2DD574: zdiffCommand (t_zset.c:2898)
==11724==    by 0x29F5AE: call (server.c:3883)
==11724==    by 0x2A1598: processCommand (server.c:4569)
```

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In flushall we will call killRDBChild to kill the RDB child, but
we need to wait for checkChildrenDone and waitpid to kick in, so
that we can call resetChildState to reset the state. The previous
wait_for_condition 10 100 (1s) might not be enough, changed it to
wait_for_condition 50 100 (5s).
```
[err]: Test FLUSHALL aborts bgsave in tests/integration/rdb.tcl
bgsave not aborted
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
…tCommandVector (valkey-io#3279)

If the command is in the form of "SET key value EX/PX/EXAT ttl",
then we don't need to rewrite the entire command vector.

In addition to saving the rewriting of the command vector, we also
save one command table lookup in this case ince we don't need to
lookup SET command again.

We already have many relevant test coverages in expire.tcl. This
PR does not change anything around the propagate.

Somehow like 2544755.

Signed-off-by: Binbin <binloveplay1314@qq.com>
With current Makefile, `LDFLAGS` are not used for modules.

This results in security options not applied.

```
$ annocheck /usr/lib64/valkey/modules/rdma.so 
annocheck: Version 12.99.
Hardened: rdma.so: FAIL: bind-now test because not linked with -Wl,-z,now 
Hardened: Rerun annocheck with --verbose to see more information on the tests.
Hardened: rdma.so: Overall: FAIL.
```

With this patch

```
$ annocheck /usr/lib64/valkey/modules/rdma.so 
annocheck: Version 12.99.
Hardened: rdma.so: PASS.
```

Signed-off-by: Remi Collet <remi@remirepo.net>
There is now a port of fast_float in C. So instead of having an optional
fast_float dependency, we can just use ffc instead, unconditionally.

https://github.com/kolemannix/ffc.h

It is a high quality port. The performance should be the same or
improved.

Note : I am the maintainer and main author of fast_float.

---------

Signed-off-by: Daniel Lemire <daniel@lemire.me>
- Supports `VALKEYMODULE_CONFIG_UNSIGNED` and `UNSIGNED_CONFIG` for a
   wider range of configurations.
- Supported the API for `ValkeyModule_RegisterUnsignedNumericConfig`
- `string2ull` method supports the length parameter, making it more
   secure

---------

Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: skyfirelee <739609084@qq.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes the failing run:

```
[err]: Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl
Expected '0.000000' to be more than '0' (context: type source line 49 file /Users/runner/work/valkey/valkey/tests/unit/io-threads.tcl cmd {assert_morethan $used_active_time 0} proc ::test)
```

The change deflakes the test by generating enough I/O-thread work.

The test previously created 16 deferred clients but only sent 15 total
`INCR` commands. On fast runners, that burst could be short enough that
`used_active_time_io_thread_*` was still reported as `0.000000`, causing
the test to fail even though I/O threads had run.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3352)

`availability_zone` was added to the HELLO command in valkey-io#1487, and
it was missing the reply schema and the test was wrongly marked
with `logreqres:skip`.

`availability-zone` was added to CLUSTER SHARDS and CLUSTER SLOTS
commands in valkey-io#3156, and it was missing the reply schema.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
…alkey-io#1901)

When a replica finishes a **disk-based** full sync and both `appendonly
yes` and `aof-use-rdb-preamble yes` are enabled, this PR reuses the
received `dump.rdb` as the new AOF base file, instead of triggering
`BGREWRITEAOF`.

This avoids generating an almost identical base snapshot and removes
redundant CPU/IO work.

Before:

1. Replica receives RDB via disk-based full sync
2. Replica loads RDB into memory
3. Replica may delete synced RDB (`rdb-del-sync-files`)
4. `BGREWRITEAOF` runs and generates a new base snapshot
5. New snapshot becomes AOF base

After:

1. Replica receives RDB via disk-based full sync
2. Replica loads RDB into memory
3. Synced RDB is renamed to the new AOF base file
4. A new incremental AOF is created
5. No `BGREWRITEAOF` is needed

Tests:

- [x] Disk-based sync + preamble on: RDB reused as AOF base
- [x] New writes after sync: incr AOF works correctly
- [x] Replica restart: reused AOF loads correctly
- [x] Disk-based sync + preamble off: fallback to `BGREWRITEAOF`
- [x] Diskless sync + preamble on: fallback to `BGREWRITEAOF`
- [x] Diskless sync with stale local `dump.rdb`: stale file is not
       reused

---------

Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
…ations (valkey-io#3261)

The "New Master down consecutively" test was sometimes failing under
Valgrind by timing out. The new overrides match those used for the
cluster in the first part of the file - see valkey-io#2672

Under Valgrind's 10-20x slowdown, a single failover requiring ~15
seconds of server time can exceed the test's 100-second wall-clock wait.

Error text:
```
*** [err]: New Master down consecutively in tests/unit/cluster/slave-selection.tcl
No failover detected when master 12 fails
```

Daily run failure:
https://github.com/valkey-io/valkey/actions/runs/22421982161/job/64921545936#logs

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Rain Valentine <rainval@amazon.com>
…o#3263)

Carries on from where valkey-io#3161 left off. The test-sanitizer-address-large-memory
jobs were being OOM-killed on GitHub-hosted runners (15.6GB RAM) due to
ASAN's 2-3x memory overhead.

Changes:
- Skip 4GB quicklist compression test under ASAN (requires ~16-24GB with
dual buffers + ASAN overhead)
- Reduce integration test sizes from 5GB to 4.1GB (preserves >4GB 32-bit
boundary coverage)
- Reduce XADD iterations from 10 to 3
- Add memory monitoring to track minimum free memory during CI runs

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Repository owner deleted a comment from chatgpt-codex-connector Bot Mar 13, 2026
@hpatro

hpatro commented Mar 13, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Repository owner deleted a comment from chatgpt-codex-connector Bot Mar 13, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28d410dc2c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/io_threads.c
Comment thread src/io_threads.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/server.c Outdated
@sarthakaggarwal97

sarthakaggarwal97 commented Mar 13, 2026

Copy link
Copy Markdown

PR Summary

The PR introduces cluster bus I/O offloading to the shared I/O thread pool. The core changes are:

Architecture: Cluster bus read, write, and TLS accept operations can now be dispatched as jobs to I/O worker threads via shared SPMC (main→workers) and MPSC (workers→main) queues. The main thread retains exclusive ownership of cluster state mutation; only raw I/O work is offloaded. If the thread pool is inactive or enqueue fails, the system falls back to synchronous I/O on the main thread, tracked by stat_cluster_io_sync_fallbacks.

cluster_legacy.c (+570/-26): Major changes include new offload handler functions (clusterReadOffloadHandler, clusterWriteJob, clusterReadJob, clusterAcceptJob), completion callbacks that run on the main thread to process results, and a receive-buffer snapshot mechanism (io_rcvbuf_snapshot_len, io_rcvbuf_snapshot_packets) for safe concurrent reads. freeClusterLink now returns int (instead of void) to indicate whether the link was freed or deferred due to an in-flight I/O job. New stat tracking for queued inbound packets is added.

cluster_legacy.h (+62/-1): New enums clusterLinkIOState (IDLE/PENDING) and clusterIOResult (OK/BAD_HEADER/BAD_LENGTH/READ_ERROR/EOF/WRITE_ERROR) are added. The clusterLink struct gains atomic fields for I/O state tracking, receive-buffer snapshot metadata, and result codes. Atomic type aliases (atomic_size_t, atomic_mstime_t) are defined with C++ compatibility guards.

connection.h (+27/-3): A new ConnOwnerKind enum distinguishes whether conn->private_data points to a client or a clusterLink, enabling safety assertions. A new flag CONN_FLAG_ACCEPT_OFFLOAD_PENDING tracks in-flight accept offload jobs. The ioctl field is widened.

queues.c: New source file added to the build, providing queue implementations used by the offload infrastructure.

Build system: queues.c added to VALKEY_SERVER_SRCS in SourceFiles.cmake.

Design document: A comprehensive design-docs/cluster-io-offload.md documents the architecture, job lifecycle, synchronization model, fallback behavior, and statistics.

Walkthrough

This PR offloads cluster bus I/O (reads, writes, and TLS accepts) from the main thread to the shared I/O thread pool. Jobs flow through SPMC/MPSC queues (io_shared_inbox/io_shared_outbox) between the main thread and worker threads, while all cluster state mutation remains on the main thread. A synchronous fallback path is used when the pool is inactive or enqueue fails. A new queues.c source file and supporting data structures (atomic fields, IO state enums, result codes) are introduced to support the concurrent job dispatch.

File Groups

Files Summary
src/cluster_legacy.c Core implementation of cluster bus I/O offloading: new job functions for read/write/accept offload, completion callbacks on main thread, receive-buffer snapshot mechanism, deferred link freeing when I/O is in-flight. freeClusterLink signature changed from void to int return. New stat stat_cluster_io_sync_fallbacks and stat_cluster_queued_inbound_packets tracking.
src/cluster_legacy.h New enums clusterLinkIOState (IDLE/PENDING) and clusterIOResult (6 result codes). New atomic type aliases (atomic_size_t, atomic_mstime_t) with C++ compat. Extended clusterLink struct with I/O state, snapshot, and result fields.
src/connection.h New ConnOwnerKind enum (CLIENT vs CLUSTER_LINK) for safe private_data casting. New CONN_FLAG_ACCEPT_OFFLOAD_PENDING flag. Minor field width changes.
src/queues.c, cmake/Modules/SourceFiles.cmake New queue implementation source file added to the build system for SPMC/MPSC queues used by the offload infrastructure.
design-docs/cluster-io-offload.md Comprehensive design document covering architecture, job flow (inbox SPMC / outbox MPSC), synchronization model, fallback behavior, lifecycle, and observability stats.

Release Notes

  • New Feature: Cluster bus I/O (reads, writes, TLS accepts) can now be offloaded to the shared I/O thread pool, reducing main-thread load in cluster deployments. Falls back to synchronous I/O when the pool is unavailable.
  • New Feature: New server stats stat_cluster_io_sync_fallbacks and stat_cluster_queued_inbound_packets for monitoring cluster I/O offload behavior.
  • Documentation: Added design-docs/cluster-io-offload.md describing the offload architecture and synchronization model.

Comment thread src/server.c Outdated
Comment thread src/server.c Outdated
dvkashapov and others added 29 commits June 10, 2026 14:45
valkey-io#3964)

Database-level ACL valkey-io#2309 introduced `alldbs` rule that was explicit for
all users and because of that previous versions no longer had the
ability to parse ACL strings produced by later versions.
Omit `alldbs` in `ACLDescribeSelector()`, that is used in `ACL
SAVE/LOAD` and `CONFIG REWRITE` command paths so that downgrades would
be possible if new feature was not used (`db=` and `resetdbs` rules).
Keep `ACL GETUSER` command's output as is and return `alldbs` in
`databases` field because of command's field-value format.
Add test to check that `ACL LIST` omits implicit `alldbs` and add check
to existing `ACL SAVE` and `CONFIG REWRITE` tests.

Fixes valkey-io#3915

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Sarthak Aggarwal (@sarthakaggarwal97) has been granted write permissions
by maintainer consensus. This adds him to the Current Committers list in
MAINTAINERS.md.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…alkey-io#3920)

## Problem

A crafted zipmap entry can set the value length to a value near
`UINT32_MAX` so that the `l + e` sum (value length + one-byte free
space) wraps in `unsigned int` arithmetic. The wrapped sum advances the
validation cursor by a tiny amount, leaving `p` inside the buffer, so
the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly
returns success. The field-length path has the same shape — advancing
`p` by a ~4GB length wraps the pointer on 32-bit builds.

`zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c`
when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any
client with `RESTORE` access can submit a payload that passes
validation. On 32-bit platforms this leads to out-of-bounds access
during the subsequent zipmap→listpack conversion. On 64-bit the
downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length
exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept
a malformed payload in the first place — this is the function whose sole
job is to reject it.

## Fix

Bounds-check the attacker-controlled length against the bytes remaining
in the zipmap, in 64-bit space, **before** any pointer arithmetic, for
both the field-length and value-length paths.

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising
the full attack surface; asserts rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (validator accepts the
value-length payload) and **passes after the fix**, confirmed by
stashing the fix during the integration run.
- Full `integration/corrupt-dump` suite: 76 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was
generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…valkey-io#3921)

## Problem

A crafted `RESTORE` payload can store a `NAN` score in a
listpack-encoded sorted set. The integrity validation
(`lpValidateIntegrityAndDups`) only checks the listpack *structure* and
member uniqueness — it does not check score validity — so the payload is
accepted on load.

When the sorted set is later converted to a skiplist (e.g. when it grows
past `zset-max-listpack-entries`, or via any operation that triggers
conversion), `zslInsertNode()` asserts the score is not `NAN`
(`t_zset.c:260`) and the server aborts. **Any client with `RESTORE`
access can remotely crash the server.**

The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already
rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score
detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no
equivalent check.

## Reproduction

```
RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...."
# loads OK, then:
ZADD k 9 x      # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT
```

## Fix

Add `zzlValidateScores()`, which scans the scores of a listpack zset
after structural validation and rejects the payload if any score is
`NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and
large finite scores remain accepted (only `NAN` is rejected), matching
normal `ZADD` semantics.

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting
rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (server crashes on
conversion) and **passes after the fix**, by stashing the fix during the
run.
- Confirmed valid zsets, including `inf`/`-inf`/large finite scores,
still load and convert correctly.
- Full `integration/corrupt-dump` suite: 74 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was
generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
When `valkey-cli --rdb` trims the EOF marker, a failed `ftruncate` only
printed a warning but still reported success and exited 0, leaving a
corrupt RDB file. This makes it exit non-zero on that failure, matching
how the `write()` failure is already handled in the same function.

(Thank you, Madelyn, for tag teaming this with me.)

Signed-off-by: Grace <glucier22@gmail.com>
…ey-io#3959)

The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.

The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
    if (isSlotUnclaimed(j) ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
        clusterSlotFailoverGranted(j)) {
        ...
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
        ...
    }
```

R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
    if (server.cluster->slots[j] == sender) {
        /* The slot is currently bound to the sender but the sender is no longer
         * claiming it. We don't want to unbind the slot yet as it can cause the cluster
         * to move to FAIL state and also throw client error. Keeping the slot bound to
         * the previous owner will cause a few client side redirects, but won't throw
         * any errors. We will keep track of the uncertainty in ownership to avoid
         * propagating misinformation about this slot's ownership using UPDATE
         * messages. */
        bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
    }
```

Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.
```
    if (server.cluster->slots[j] == NULL || ...) {
        new_state = CLUSTER_FAIL;
        ...
    }
```

Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.

This closes valkey-io#3891. The test was added in valkey-io#3674.

Signed-off-by: Binbin <binloveplay1314@qq.com>
## Summary

`addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)`
when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`)
regardless of the `use_map` parameter. The non-empty path (below it) already
branches correctly on `use_map` — the empty early-return was simply missing
the same logic.

In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`)
instead of an Array (`*0`) for any command without subcommands (e.g. PING).
Strict RESP3 client libraries that dispatch on collection type will
misinterpret the response. Not visible in RESP2 since both Set and Array use
the `*` prefix there.

## Fix

Apply the same `use_map` branch to the empty case:
- `addReplyMapLen(c, 0)` when `use_map=1`
- `addReplyArrayLen(c, 0)` otherwise

## Test

Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that
inspects the raw wire-level type byte for the subcommands field of `COMMAND
INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set).

Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com>
Signed-off-by: rickrams <rickrams@amazon.com>
…ey-io#3811)

`off_t` (64-bit), but were read into `int` (32-bit) locals in
`genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`.

This causes INFO replication to report negative
`master_sync_total_bytes` during bio disk-based sync when RDB exceeds
2GB.

Fix: change the local variable types from `int` to `off_t`.

Signed-off-by: chx9 <lovelypiska@outlook.com>
Multi-command parsing in parseMultibulkBuffer (introduced valkey-io#2092) was
disabled for replicated clients because the per-command replication
offset relied on c->qb_pos as the right boundary of the just-applied
command. Replication stream is actually a big pipeline, so if it can
be supported, the processing speed of the replica can be improved.

Decouple parsing position from application position by introducing a
single per-client field that tracks the currently processed command's
right boundary in querybuf:

client.qb_applied: right boundary of the current command in
querybuf. Set by the parsers (for c->argv, = c->qb_pos) and
advanced incrementally by consumeCommandQueue() using each queued
command's input_bytes (qb_applied += p->input_bytes).
parsedCommand.input_bytes already records the number of querybuf bytes
consumed to parse one command (multibulk header + each bulk-length
line + arg bytes + per-arg CRLFs), see parseMultibulk for mode details.
It is a relative quantity, independent of where querybuf has been trimmed
since parsing, so it is exactly what's needed to advance qb_applied across
multi-command parsing, so no extra per-command snapshot field is required.

commandProcessed() now uses c->qb_applied instead of c->qb_pos to
advance reploff by exactly the bytes of the just-applied command.
beforeNextClient shifts only c->qb_applied (a single variable) when
the replicated client's querybuf is trimmed; pending queue entries
carry only relative input_bytes and therefore need no fix-up.

qb_applied is populated unconditionally on the first-command path so
that a client transitioning to replicated mid-command (e.g.
SYNCSLOTS ESTABLISH installs slot_migration_job inside its own handler)
still has a valid value when commandProcessed() runs.

Signed-off-by: Binbin <binloveplay1314@qq.com>
…r fuzzy provenance matches (valkey-io#3933)

- update verify-provenance to require near-duplicate evidence for fuzzy
provenance matches
 - normalize master/primary and slave/replica terminology.
 - print the captured provenance script log in the check workflow.

---------

Signed-off-by: Ping Xie <pingxie@outlook.com>
…-io#3982)

## Issue Description

`log_crashes` in `tests/instances.tcl` — the multi-instance harness used
by both `make test-sentinel` and `make test-cluster` — scans every
instance's `err.txt` with `find_valgrind_errors` **unconditionally**:

```tcl
set logs [glob */err.txt]
foreach log $logs {
    set res [find_valgrind_errors $log true]
    if {$res != ""} {
        puts $res
        incr ::failed
    }
}
```

The main test framework runs the identical scan only behind a valgrind
guard — in `tests/support/server.tcl`, `check_valgrind_errors` is called
from `kill_server` inside `if {$::valgrind}`.

`find_valgrind_errors`' fallback (in `tests/support/util.tcl`) treats
any stderr that lacks a valgrind leak-free summary as an error:

```tcl
# Look for the absence of a leak free summary
# (happens when the server isn't terminated properly).
if {(![regexp -- {definitely lost: 0 bytes} $buf] &&
     ![regexp -- {no leaks are possible} $buf])} {
    return $buf
}
```

When the suite is **not** running under valgrind, that summary is never
present, so **any non-empty `err.txt` is flagged as a failure**.

## Concrete trigger

On hosts whose CPU does not expose the `constant_tsc` flag (common
inside VMs / LXD), every `valkey-server` and `valkey-sentinel` prints to
stderr at startup (`src/monotonic.c`):

```
monotonic: x86 linux, 'constant_tsc' flag not present
```

The sentinel suite then reports exactly **10 failures** (5 sentinel + 5
valkey instances) on every run — even though every individual test
passes and no assertion fails. The cluster suite shares this harness and
is affected identically. On hosts that expose `constant_tsc` the message
is never printed, so the bug is invisible — which is why it only
surfaces in certain CI/VM environments.

## Fix

Wrap the valgrind `err.txt` scan in `if {$::valgrind}`, mirroring
`server.tcl`. The sanitizer scan immediately below is left unguarded,
matching `check_sanitizer_errors`, which always runs. No test coverage
is lost.

```tcl
if {$::valgrind} {
    set logs [glob */err.txt]
    foreach log $logs {
        set res [find_valgrind_errors $log true]
        if {$res != ""} {
            puts $res
            incr ::failed
        }
    }
}
```

Signed-off-by: Smail Kourta <smail.kourta@canonical.com>
…on (valkey-io#3968)

Adds REUSE 3.3 structure (REUSE.toml, LICENSES/) covering all source
files and vendored dependencies.

Replaces the non-standard dual-license COPYING file with a standard
BSD-3-Clause text.

Updates the description of custom patches to Lua and Jemalloc in
deps/README.md.

Benefits:

- GitHub and OpenSSF Scorecard correctly detect the project license
- reuse spdx generates a complete SPDX SBOM covering first-party code
  and all vendored deps
- CI check prevents future contributors from introducing invalid SPDX
  identifiers
- Per-file license clarity for downstream consumers (distro packagers,
  enterprises)

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…ey-io#3803)

New hashtable API hashtableScanHasPassedKey.  New guarantees around hashtable scan.

---------

Signed-off-by: Rain Valentine <rainval@amazon.com>
Updated `kvstoreScan` to more clearly show when a kvstore cursor was used vs. a hashtable cursor.

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Replace ~10k individual INCR round-trips with single SET commands
using large values. And set shutdown-timeout 0 before shutting down
the node.

Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Extend zmalloc_purge to also trim the libc main arena

Rename jemalloc_purge() to zmalloc_purge() and append a
malloc_trim(3) call on glibc systems, so that free pages held in
the libc main arena are returned to the OS. This benefits all
callers: MEMORY PURGE, FLUSHALL, and FLUSHDB.

Update MEMORY HELP and memory doctor diagnostics to mention libc
internal allocations as a possible source of high RSS overhead.

Signed-off-by: Vaibhav Gupta <wweebbssss@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
…functions (valkey-io#3357)

## Summary
This PR removes the unused `void(callback)(dict *)` parameter from
`functionReset()` and related functions.

## Changes
The callback parameter in `functionReset()`, `functionsLibCtxFree()`,
and `functionsLibCtxClear()` was never actually used - it was always
passed as NULL or cast from an incompatible type. This cleanup:

- Removes the `void(callback)(dict *)` parameter from `functionReset()`
- Removes the callback parameter from `functionsLibCtxFree()`
- Removes the callback parameter from `functionsLibCtxClear()`
- Makes `functionsLibCtxReleaseCurrent()` static since it's only used
internally
- Removes the TODO comment in db.c about the callback incompatibility
- Updates all call sites to use the simplified function signatures

## Files Changed
- `src/functions.h`: Updated function declarations
- `src/functions.c`: Removed callback parameters and updated
implementations
- `src/db.c`: Simplified `functionReset()` call, removed TODO comment
- `src/lazyfree.c`: Updated `functionsLibCtxFree()` calls

## Rationale
The `dictEmpty()` calls now explicitly pass NULL for the callback, which
was effectively what was happening before since the parameter was
unused. This simplifies the API and removes the need for the type cast
workaround that was previously used in db.c.

Signed-off-by: Justin Fung <justfung@amazon.com>
Co-authored-by: Justin Fung <justfung@amazon.com>
#### Purpose

This workflow was originally introduced in PR
[valkey-io#3358](valkey-io#3358), where we detect
the failures in our scheduled `daily` runs and create / update github
issues.

We want to do more things with AI with respect to tests failures. It
could include potentially finding the root cause, any PR that broke the
tests, some helpful dashboard to track daily tests, maybe some analysis
or possible fix as well.
To achieve that, we are moving this issue management out of this
repository and into `valkey-ci-agent`.

The Daily workflow in this repository still records per-job test
failures, consolidates them into `all-test-failures.json`, and uploads
the `all-test-failures` artifact. The workflow being removed here was
only responsible for consuming that artifact and creating or updating
GitHub issues.

#### Changes

Remove `.github/workflows/test-failure-detector.yml`.

Issue creation and updates are now handled by the Test Failure Detector
workflow in `valkey-ci-agent` through this PR
[valkey-io#24](valkey-io/valkey-ci-agent#24).

#### Notes

This should be merged together with the corresponding `valkey-ci-agent`
change so scheduled test-failure detection continues without a gap.

Signed-off-by: Bonnie Chan <bonchan35@gmail.com>
Document the shared I/O threading infrastructure: SPMC inbox, MPSC
outbox, per-worker SPSC inboxes, job lifecycle, drain semantics, live
reconfiguration.

---------

Signed-off-by: Harkrishn Patro <h_patro@apple.com>
…become SIGNED_MEMORY_CONFIG (valkey-io#2648)

After valkey-io#3443, we have a new `SIGNED_MEMORY_CONFIG`, which enables the memory
configuration to support storing the value -1.

Make commandlog-request-larger-than and commandlog-reply-larger-than become 
SIGNED_MEMORY_CONFIG, so we can use it in a memory way. 

Signed-off-by: Binbin <binloveplay1314@qq.com>
255 is too short if valkey-server is being run from a long path,
especially if many fields are included such as in the "Process title set
as expected" test in `unit/other.tcl`, where the max length for the
cmdline is 1024, so having the same number in both makes sense.

Closes valkey-io#3832.

Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
…y-io#3980)

The `test io-threads are runtime modifiable` test in
`tests/unit/other.tcl` times out on the dedicated Valgrind jobs of the
daily CI, failing the run. The failing test is introduced in valkey-io#3938.

This PR reduces the loop to 10 iterations under Valgrind.

**Failure links:**
- https://github.com/valkey-io/valkey/actions/runs/27386948127 (Jun 12)
- https://github.com/valkey-io/valkey/actions/runs/27315974006 (Jun 11)
- https://github.com/valkey-io/valkey/actions/runs/27245311034 (Jun 10)

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3675)

No bug exists — all NOT_KEY commands (CLUSTERSCAN, SSUBSCRIBE,
SPUBLISH, SUNSUBSCRIBE) have NULL getkeys_proc, so doesCommandHaveKeys
returns 0 and ACL correctly skips key checks. This PR makes the
handling more explicit and defensive for future commands that may
combine getkeys_proc with NOT_KEY key-specs.

Changes:

- Refactor doesCommandHaveKeys() from a one-line ternary into clearer
  if-else branches. Add an explicit check: if a command has getkeys_proc
  but all its key-specs are NOT_KEY, treat it as having no real keys.

- Add a defensive NOT_KEY check in ACLSelectorCheckKey() to skip
  key-pattern ACL validation for NOT_KEY entries, consistent with the
  existing skip in getKeysUsingKeySpecs().

- Add tests: verify COMMAND GETKEYS/GETKEYSANDFLAGS report "no key
  arguments" for NOT_KEY commands; verify restricted ACL users can
  run CLUSTERSCAN without key-permission errors.

Related: valkey-io#2934, valkey-io#3699

Signed-off-by: Binbin <binloveplay1314@qq.com>
While reviewing valkey-io#3675 I found a pre-existing `NOT_KEY` inconsistency in
client tracking.

`CLUSTERSCAN` is skipped by `COMMAND GETKEYS/ACL` through the key spec
path, but tracking still uses `getKeysFromCommand()`. Updated to use
`getKeysFromCommandWithSpecs`

```
raghav$ printf 'client tracking on\ninfo stats\nclusterscan 0-{06S}-0\ninfo stats\n'   | /tmp/valkey-unstable-single/src/valkey-cli -p 7001 --raw   | grep -E '^(OK|tracking_total_keys:|0$)'
OK
tracking_total_keys:2
0
tracking_total_keys:2
````

This affects today `CLUSTERSCAN` because it is RO,
`SPUBLISH/SSUBSCRIBE/SUNSUBSCRIBE` using `NOT_KEY` are not RO.

Signed-off-by: nmvk <r@nmvk.com>
Add explicit validation for the FIELDS token before parsing numfields.

Fixes valkey-io#4045.

Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
…io#4058)

When a primary receives an MFSTART (manual failover start) message,
it was silently ignored if the sender was unknown or not a replica of
this primary. This made it impossible to diagnose why a manual failover
was timing out on the replica side — the primary had no indication that
it ever received the request.

In particular, cluster gossip may not have propagated the new replica
relationship to the primary yet, so the primary may legitimately not
recognize the sender as its replica. Without logging, this scenario
leaves no trace on the primary side.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Some functions are newly deprecated in OpenSSL 4.0. This commit works
around those. More specifically:

- X509_cmp_current_time() in isCertValid() is no longer necessary as
X509_check_certificate_times() compares the notBefore and notAfter on
its own. Had to add a version check since this function is new in
OpenSSL 4.0.

- Replace deprecated X509_NAME_get_text_by_NID. Not a perfect fix,
because the new implementation still assumes that the name does not
contain embedded null characters which may not be true, e.g., if the
name is of type UniversalString or BMPString.

- Also fix constness of X509_get_subject_name return value.

The latter two fixes are taken from this Fedora patch:
https://src.fedoraproject.org/rpms/valkey/c/1a9c8847172ef3fb116a1e2fdb3871692378adae?branch=rawhide

Closes valkey-io#4012

Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
valkey-io#4059)

Two minor cleanups:
- Remove unnecessary incrRefCount(shared.hdel) just like other shared object.
- Fix new_argv allocation size since HDEL only need (num_fields + 2) size.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Use I/O threads for cluster bus reads, writes, and TLS accepts,
mirroring the client I/O threading model. Splits packet framing from
application, adds deferred link teardown with io_refs for safe
concurrent I/O, dual send queues, and a framed packet queue on
clusterLink. Adds observability counters (CLUSTER INFO).

Signed-off-by: Harkrishn Patro <h_patro@apple.com>
@hpatro hpatro force-pushed the cluster-io-offload branch from b4fdc03 to 16d11d2 Compare June 29, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.