Cluster IO Offload#4
Conversation
Implemented CLUSTERSCAN command for topology-aware scanning
Unlike `SCAN` which is local to a single node, `CLUSTERSCAN` provides a
mechanism that helps clients iterate across slot boundaries and handles
`MOVED` redirections.
**Key details**
* Global cluster iteration via `fingerprint-{hashtag}-cursor`
* Scan one slot at a time
* Start the CLUSTERSCAN with 0
* SLOT argument for parallel scanning of multiple slots
* Re-use scanGenericCommand for the response
**Cursor format:** `fingerprint-{hashtag}-localcursor`
- Fingerprint is a hash of the node's DB seed that identifies the
current memory layout. On mismatch, scan restarts from cursor 0
rather than returning an error.
- Fingerprint 0 indicates a cross slot cursor (e.g., initial cursor
or slot transition) where validation is skipped.
- Hashtag encodes the target slot
- Local cursor tracks position within the slot
**Usage:**
```
CLUSTERSCAN <cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]
```
```
CLUSTERSCAN 0 # Start scanning from slot 0
CLUSTERSCAN <cursor> # Continue from cursor
CLUSTERSCAN 0 SLOT 1000 # Start scanning specific slot
CLUSTERSCAN <cursor> MATCH user:* COUNT 100
```
---------
Signed-off-by: nmvk <r@nmvk.com>
Signed-off-by: Raghav <r@nmvk.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Before this change, if you built `valkey-server` (e.g. `make valkey-server`) the LUA module was not built. With this change, the LUA module is now a direct a dependency of `valkey-server` - (unless `BUILD_LUA=no` is passed) Added some colors to the Lua module & Lua lib `Makefile`s to they blend nicely in the build output. Signed-off-by: Eran Ifrah <eifrah@amazon.com>
…time_t (valkey-io#3252) Addresses issue valkey-io#2350 As noted in the issue, much of the expire code uses raw long long for timestamps, which provides no semantic meaning about the unit or purpose of the value. Valkey already defines mstime_t (milliseconds) and ustime_t (microseconds) typedefs — this PR replaces bare long long declarations with the appropriate typedef wherever the value represents an expiration timestamp or time duration. This PR only fixes a small subset in the codebase, but it is an incremental step toward fully replacing the bare long long references. --------- Signed-off-by: curious-george-rk <r.ebu@gmail.com> Co-authored-by: curious-george-rk <r.ebu@gmail.com>
Now we will be able to add a `run-cluster-benchmark` label to run a benchmark with cluster-mode enabled valkey-server It will use the config https://github.com/valkey-io/valkey/blob/unstable/.github/benchmark_configs/benchmark-config-arm.json modified for for cluster mode with a single clustermode enabled instance of valkey. It uses the same single instance for the benchmark as for run-benchmark. If both labels are used, they are sequential in the same concurrency group `group: ec2-al-2023-pr-benchmarking-arm64`. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…ey-io#2746) This PR introduces a new config `cluster-message-gossip-perc` which allows an operator to modify the amount of gossip node information to be sent per ping/pong/meet message. It can be modified dynamically (Related to valkey-io#2291). The default value is 10% i.e. 10% of peer node information would be gossiped along with each ping/pong/meet packet. Users can tune this configuration, setting the value higher allows faster information dissemination whereas setting it lower would lead to direct PING messages if no information was received about a node with the `server.cluster_node_timeout/2` period. Note: the behavior for partially failed gossip nodes still remains intact where all the `pfail` nodes are part of the message for faster propagation of information and faster transition of `PFAIL` to `FAIL`. --------- Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
…o#3336) This fixes a flaky dual-channel replication integration test: https://github.com/valkey-io/valkey/actions/runs/22810251608/job/66165776198#step:8:7701 `INFO memory` field `used_memory_overhead` and `MEMORY STATS` field `overhead.total` can change during dual-channel sync if replica's pending replication buffer is still changing. This is probably more visible in slower environments. The test now collects `INFO` and `MEMORY STATS` in a single `MULTI/EXEC` on both the primary and replica, so the compared values come from the same snapshot. Passing here: https://github.com/sarthakaggarwal97/valkey/actions/runs/22864585326/job/66327772967 Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…ation (valkey-io#3258) Observation around `dictGetSomeKeys(N)` performance is that it is around 10x faster than randomly picking `N` elements over N*3 attempts. Further, this avoids chances of not picking N elements due to duplicate element selection with individual random pick. Overall, it should help picking gossip node(s) for cluster bus message faster. --------- Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
I found that on 32-bit platforms, the gtest fails if I add a declaration `int aof_rewrite_for_replication;` between https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L1843 and https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/server.h#L2108 The crash is caused by https://github.com/valkey-io/valkey/blob/fc217fca2d7de246ab8494d9d8619b7348235f47/src/unit/wrappers.h#L40 stripping the _Atomic qualifier in C++ builds. This makes struct valkeyServer have different alignment/field offsets in C-compiled code (libvalkey.a) vs C++-compiled tests (gtest). As a result, fields after these atomics are accessed at different offsets, leading to corrupted reads/writes and a segfault. I added VALKEY_ATOMIC, and it works as follows: - C builds (production): VALKEY_ATOMIC(uint64_t) -> _Atomic uint64_t (unchanged behavior) - C++ builds (tests via wrappers.h): VALKEY_ATOMIC(uint64_t) -> alignas(8) uint64_t (preserves 8-byte alignment) - All existing atomic_load_explicit / atomic_store_explicit calls in C code continue to work because _Atomic is present in C mode - C++ test code only does direct field access (never uses atomic operations), so alignas(N) type is sufficient --------- Signed-off-by: cjx-zar <jxchenczar@foxmail.com>
…-io#3121) It's a combination of MSET and EXPIRE, which allows us to set an expiration time simultaneously with the MSET operation. Syntax for the new MSETEX command: ``` /* MSETEX numkeys key value [key value ...] [NX | XX] * [EX seconds | PX milliseconds | * EXAT seconds-timestamp | PXAT milliseconds-timestamp | KEEPTTL] */ ``` The MSETEX command supports a set of options (like SET command): - EX seconds -- Set the specified expire time, in seconds (a positive integer). - PX milliseconds -- Set the specified expire time, in milliseconds (a positive integer). - EXAT timestamp-seconds -- Set the specified Unix time at which the keys will expire, in seconds (a positive integer). - PXAT timestamp-milliseconds -- Set the specified Unix time at which the keys will expire, in milliseconds (a positive integer). - KEEPTTL -- Retain the time to live associated with the keys. - NX -- Only set the keys if all keys do not already exist. - XX -- Only set the keys if all keys already exists. Return Value (like MSETNX command): - Integer reply: 0 if no key was set due to NX or XX options. - Integer reply: 1 if all the keys were set. Closes valkey-io#2592 --------- Signed-off-by: Binbin <binloveplay1314@qq.com>
…valkey-io#3156) Implemented a way to propagate az through gossip and obtain az information through the CLUSTER SHARDS and CLUSTER SLOTS commands. It will only be displayed if the node is configured with it. Closes valkey-io#3110 --------- Signed-off-by: Su Ko <rhtn1128@gmail.com>
`zdiffAlgorithm2()` can break out early once the destination cardinality reaches zero. In that path, a temporary SDS created by `zuiSdsFromValue()` was left dirty and never released, because that cleanup normally happens on the next iterator step. This patch explicitly discards the dirty value before the early `break`. ``` ==11724== 11 bytes in 1 blocks are definitely lost in loss record 36 of 1,442 ==11724== at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==11724== by 0x2FA974: ztrymalloc_usable_internal (zmalloc.c:156) ==11724== by 0x2FAAEA: zmalloc_usable (zmalloc.c:200) ==11724== by 0x282D25: _sdsnewlen (sds.c:102) ==11724== by 0x2830B2: sdsnewlen (sds.c:169) ==11724== by 0x2DB537: zuiSdsFromValue (t_zset.c:2290) ==11724== by 0x2DBE00: zdiffAlgorithm2 (t_zset.c:2502) ==11724== by 0x2DC085: zdiff (t_zset.c:2568) ==11724== by 0x2DD01F: zunionInterDiffGenericCommand (t_zset.c:2817) ==11724== by 0x2DD574: zdiffCommand (t_zset.c:2898) ==11724== by 0x29F5AE: call (server.c:3883) ==11724== by 0x2A1598: processCommand (server.c:4569) ``` Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In flushall we will call killRDBChild to kill the RDB child, but we need to wait for checkChildrenDone and waitpid to kick in, so that we can call resetChildState to reset the state. The previous wait_for_condition 10 100 (1s) might not be enough, changed it to wait_for_condition 50 100 (5s). ``` [err]: Test FLUSHALL aborts bgsave in tests/integration/rdb.tcl bgsave not aborted ``` Signed-off-by: Binbin <binloveplay1314@qq.com>
…tCommandVector (valkey-io#3279) If the command is in the form of "SET key value EX/PX/EXAT ttl", then we don't need to rewrite the entire command vector. In addition to saving the rewriting of the command vector, we also save one command table lookup in this case ince we don't need to lookup SET command again. We already have many relevant test coverages in expire.tcl. This PR does not change anything around the propagate. Somehow like 2544755. Signed-off-by: Binbin <binloveplay1314@qq.com>
With current Makefile, `LDFLAGS` are not used for modules. This results in security options not applied. ``` $ annocheck /usr/lib64/valkey/modules/rdma.so annocheck: Version 12.99. Hardened: rdma.so: FAIL: bind-now test because not linked with -Wl,-z,now Hardened: Rerun annocheck with --verbose to see more information on the tests. Hardened: rdma.so: Overall: FAIL. ``` With this patch ``` $ annocheck /usr/lib64/valkey/modules/rdma.so annocheck: Version 12.99. Hardened: rdma.so: PASS. ``` Signed-off-by: Remi Collet <remi@remirepo.net>
There is now a port of fast_float in C. So instead of having an optional fast_float dependency, we can just use ffc instead, unconditionally. https://github.com/kolemannix/ffc.h It is a high quality port. The performance should be the same or improved. Note : I am the maintainer and main author of fast_float. --------- Signed-off-by: Daniel Lemire <daniel@lemire.me>
- Supports `VALKEYMODULE_CONFIG_UNSIGNED` and `UNSIGNED_CONFIG` for a wider range of configurations. - Supported the API for `ValkeyModule_RegisterUnsignedNumericConfig` - `string2ull` method supports the length parameter, making it more secure --------- Signed-off-by: artikell <739609084@qq.com> Signed-off-by: skyfirelee <739609084@qq.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes the failing run:
```
[err]: Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl
Expected '0.000000' to be more than '0' (context: type source line 49 file /Users/runner/work/valkey/valkey/tests/unit/io-threads.tcl cmd {assert_morethan $used_active_time 0} proc ::test)
```
The change deflakes the test by generating enough I/O-thread work.
The test previously created 16 deferred clients but only sent 15 total
`INCR` commands. On fast runners, that burst could be short enough that
`used_active_time_io_thread_*` was still reported as `0.000000`, causing
the test to fail even though I/O threads had run.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3352) `availability_zone` was added to the HELLO command in valkey-io#1487, and it was missing the reply schema and the test was wrongly marked with `logreqres:skip`. `availability-zone` was added to CLUSTER SHARDS and CLUSTER SLOTS commands in valkey-io#3156, and it was missing the reply schema. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>
…alkey-io#1901) When a replica finishes a **disk-based** full sync and both `appendonly yes` and `aof-use-rdb-preamble yes` are enabled, this PR reuses the received `dump.rdb` as the new AOF base file, instead of triggering `BGREWRITEAOF`. This avoids generating an almost identical base snapshot and removes redundant CPU/IO work. Before: 1. Replica receives RDB via disk-based full sync 2. Replica loads RDB into memory 3. Replica may delete synced RDB (`rdb-del-sync-files`) 4. `BGREWRITEAOF` runs and generates a new base snapshot 5. New snapshot becomes AOF base After: 1. Replica receives RDB via disk-based full sync 2. Replica loads RDB into memory 3. Synced RDB is renamed to the new AOF base file 4. A new incremental AOF is created 5. No `BGREWRITEAOF` is needed Tests: - [x] Disk-based sync + preamble on: RDB reused as AOF base - [x] New writes after sync: incr AOF works correctly - [x] Replica restart: reused AOF loads correctly - [x] Disk-based sync + preamble off: fallback to `BGREWRITEAOF` - [x] Diskless sync + preamble on: fallback to `BGREWRITEAOF` - [x] Diskless sync with stale local `dump.rdb`: stale file is not reused --------- Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
…ations (valkey-io#3261) The "New Master down consecutively" test was sometimes failing under Valgrind by timing out. The new overrides match those used for the cluster in the first part of the file - see valkey-io#2672 Under Valgrind's 10-20x slowdown, a single failover requiring ~15 seconds of server time can exceed the test's 100-second wall-clock wait. Error text: ``` *** [err]: New Master down consecutively in tests/unit/cluster/slave-selection.tcl No failover detected when master 12 fails ``` Daily run failure: https://github.com/valkey-io/valkey/actions/runs/22421982161/job/64921545936#logs --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Rain Valentine <rainval@amazon.com>
…o#3263) Carries on from where valkey-io#3161 left off. The test-sanitizer-address-large-memory jobs were being OOM-killed on GitHub-hosted runners (15.6GB RAM) due to ASAN's 2-3x memory overhead. Changes: - Skip 4GB quicklist compression test under ASAN (requires ~16-24GB with dual buffers + ASAN overhead) - Reduce integration test sizes from 5GB to 4.1GB (preserves >4GB 32-bit boundary coverage) - Reduce XADD iterations from 10 to 3 - Add memory monitoring to track minimum free memory during CI runs Signed-off-by: Rain Valentine <rsg000@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28d410dc2c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
PR SummaryThe PR introduces cluster bus I/O offloading to the shared I/O thread pool. The core changes are: Architecture: Cluster bus read, write, and TLS accept operations can now be dispatched as jobs to I/O worker threads via shared SPMC (main→workers) and MPSC (workers→main) queues. The main thread retains exclusive ownership of cluster state mutation; only raw I/O work is offloaded. If the thread pool is inactive or enqueue fails, the system falls back to synchronous I/O on the main thread, tracked by
Build system: Design document: A comprehensive WalkthroughThis PR offloads cluster bus I/O (reads, writes, and TLS accepts) from the main thread to the shared I/O thread pool. Jobs flow through SPMC/MPSC queues ( File Groups
Release Notes
|
valkey-io#3964) Database-level ACL valkey-io#2309 introduced `alldbs` rule that was explicit for all users and because of that previous versions no longer had the ability to parse ACL strings produced by later versions. Omit `alldbs` in `ACLDescribeSelector()`, that is used in `ACL SAVE/LOAD` and `CONFIG REWRITE` command paths so that downgrades would be possible if new feature was not used (`db=` and `resetdbs` rules). Keep `ACL GETUSER` command's output as is and return `alldbs` in `databases` field because of command's field-value format. Add test to check that `ACL LIST` omits implicit `alldbs` and add check to existing `ACL SAVE` and `CONFIG REWRITE` tests. Fixes valkey-io#3915 Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Sarthak Aggarwal (@sarthakaggarwal97) has been granted write permissions by maintainer consensus. This adds him to the Current Committers list in MAINTAINERS.md. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…alkey-io#3920) ## Problem A crafted zipmap entry can set the value length to a value near `UINT32_MAX` so that the `l + e` sum (value length + one-byte free space) wraps in `unsigned int` arithmetic. The wrapped sum advances the validation cursor by a tiny amount, leaving `p` inside the buffer, so the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly returns success. The field-length path has the same shape — advancing `p` by a ~4GB length wraps the pointer on 32-bit builds. `zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c` when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any client with `RESTORE` access can submit a payload that passes validation. On 32-bit platforms this leads to out-of-bounds access during the subsequent zipmap→listpack conversion. On 64-bit the downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept a malformed payload in the first place — this is the function whose sole job is to reject it. ## Fix Bounds-check the attacker-controlled length against the bytes remaining in the zipmap, in 64-bit space, **before** any pointer arithmetic, for both the field-length and value-length paths. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising the full attack surface; asserts rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (validator accepts the value-length payload) and **passes after the fix**, confirmed by stashing the fix during the integration run. - Full `integration/corrupt-dump` suite: 76 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…valkey-io#3921) ## Problem A crafted `RESTORE` payload can store a `NAN` score in a listpack-encoded sorted set. The integrity validation (`lpValidateIntegrityAndDups`) only checks the listpack *structure* and member uniqueness — it does not check score validity — so the payload is accepted on load. When the sorted set is later converted to a skiplist (e.g. when it grows past `zset-max-listpack-entries`, or via any operation that triggers conversion), `zslInsertNode()` asserts the score is not `NAN` (`t_zset.c:260`) and the server aborts. **Any client with `RESTORE` access can remotely crash the server.** The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no equivalent check. ## Reproduction ``` RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...." # loads OK, then: ZADD k 9 x # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT ``` ## Fix Add `zzlValidateScores()`, which scans the scores of a listpack zset after structural validation and rejects the payload if any score is `NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and large finite scores remain accepted (only `NAN` is rejected), matching normal `ZADD` semantics. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (server crashes on conversion) and **passes after the fix**, by stashing the fix during the run. - Confirmed valid zsets, including `inf`/`-inf`/large finite scores, still load and convert correctly. - Full `integration/corrupt-dump` suite: 74 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
When `valkey-cli --rdb` trims the EOF marker, a failed `ftruncate` only printed a warning but still reported success and exited 0, leaving a corrupt RDB file. This makes it exit non-zero on that failure, matching how the `write()` failure is already handled in the same function. (Thank you, Madelyn, for tag teaming this with me.) Signed-off-by: Grace <glucier22@gmail.com>
…ey-io#3959) The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`. The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See: ``` if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch || clusterSlotFailoverGranted(j)) { ... clusterDelSlot(j); clusterAddSlot(sender, j); ... } ``` R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]: ``` if (server.cluster->slots[j] == sender) { /* The slot is currently bound to the sender but the sender is no longer * claiming it. We don't want to unbind the slot yet as it can cause the cluster * to move to FAIL state and also throw client error. Keeping the slot bound to * the previous owner will cause a few client side redirects, but won't throw * any errors. We will keep track of the uncertainty in ownership to avoid * propagating misinformation about this slot's ownership using UPDATE * messages. */ bitmapSetBit(server.cluster->owner_not_claiming_slot, j); } ``` Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever. ``` if (server.cluster->slots[j] == NULL || ...) { new_state = CLUSTER_FAIL; ... } ``` Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify. This closes valkey-io#3891. The test was added in valkey-io#3674. Signed-off-by: Binbin <binloveplay1314@qq.com>
## Summary `addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)` when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`) regardless of the `use_map` parameter. The non-empty path (below it) already branches correctly on `use_map` — the empty early-return was simply missing the same logic. In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`) instead of an Array (`*0`) for any command without subcommands (e.g. PING). Strict RESP3 client libraries that dispatch on collection type will misinterpret the response. Not visible in RESP2 since both Set and Array use the `*` prefix there. ## Fix Apply the same `use_map` branch to the empty case: - `addReplyMapLen(c, 0)` when `use_map=1` - `addReplyArrayLen(c, 0)` otherwise ## Test Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that inspects the raw wire-level type byte for the subcommands field of `COMMAND INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set). Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com> Signed-off-by: rickrams <rickrams@amazon.com>
…ey-io#3811) `off_t` (64-bit), but were read into `int` (32-bit) locals in `genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`. This causes INFO replication to report negative `master_sync_total_bytes` during bio disk-based sync when RDB exceeds 2GB. Fix: change the local variable types from `int` to `off_t`. Signed-off-by: chx9 <lovelypiska@outlook.com>
Multi-command parsing in parseMultibulkBuffer (introduced valkey-io#2092) was disabled for replicated clients because the per-command replication offset relied on c->qb_pos as the right boundary of the just-applied command. Replication stream is actually a big pipeline, so if it can be supported, the processing speed of the replica can be improved. Decouple parsing position from application position by introducing a single per-client field that tracks the currently processed command's right boundary in querybuf: client.qb_applied: right boundary of the current command in querybuf. Set by the parsers (for c->argv, = c->qb_pos) and advanced incrementally by consumeCommandQueue() using each queued command's input_bytes (qb_applied += p->input_bytes). parsedCommand.input_bytes already records the number of querybuf bytes consumed to parse one command (multibulk header + each bulk-length line + arg bytes + per-arg CRLFs), see parseMultibulk for mode details. It is a relative quantity, independent of where querybuf has been trimmed since parsing, so it is exactly what's needed to advance qb_applied across multi-command parsing, so no extra per-command snapshot field is required. commandProcessed() now uses c->qb_applied instead of c->qb_pos to advance reploff by exactly the bytes of the just-applied command. beforeNextClient shifts only c->qb_applied (a single variable) when the replicated client's querybuf is trimmed; pending queue entries carry only relative input_bytes and therefore need no fix-up. qb_applied is populated unconditionally on the first-command path so that a client transitioning to replicated mid-command (e.g. SYNCSLOTS ESTABLISH installs slot_migration_job inside its own handler) still has a valid value when commandProcessed() runs. Signed-off-by: Binbin <binloveplay1314@qq.com>
…r fuzzy provenance matches (valkey-io#3933) - update verify-provenance to require near-duplicate evidence for fuzzy provenance matches - normalize master/primary and slave/replica terminology. - print the captured provenance script log in the check workflow. --------- Signed-off-by: Ping Xie <pingxie@outlook.com>
…-io#3982) ## Issue Description `log_crashes` in `tests/instances.tcl` — the multi-instance harness used by both `make test-sentinel` and `make test-cluster` — scans every instance's `err.txt` with `find_valgrind_errors` **unconditionally**: ```tcl set logs [glob */err.txt] foreach log $logs { set res [find_valgrind_errors $log true] if {$res != ""} { puts $res incr ::failed } } ``` The main test framework runs the identical scan only behind a valgrind guard — in `tests/support/server.tcl`, `check_valgrind_errors` is called from `kill_server` inside `if {$::valgrind}`. `find_valgrind_errors`' fallback (in `tests/support/util.tcl`) treats any stderr that lacks a valgrind leak-free summary as an error: ```tcl # Look for the absence of a leak free summary # (happens when the server isn't terminated properly). if {(![regexp -- {definitely lost: 0 bytes} $buf] && ![regexp -- {no leaks are possible} $buf])} { return $buf } ``` When the suite is **not** running under valgrind, that summary is never present, so **any non-empty `err.txt` is flagged as a failure**. ## Concrete trigger On hosts whose CPU does not expose the `constant_tsc` flag (common inside VMs / LXD), every `valkey-server` and `valkey-sentinel` prints to stderr at startup (`src/monotonic.c`): ``` monotonic: x86 linux, 'constant_tsc' flag not present ``` The sentinel suite then reports exactly **10 failures** (5 sentinel + 5 valkey instances) on every run — even though every individual test passes and no assertion fails. The cluster suite shares this harness and is affected identically. On hosts that expose `constant_tsc` the message is never printed, so the bug is invisible — which is why it only surfaces in certain CI/VM environments. ## Fix Wrap the valgrind `err.txt` scan in `if {$::valgrind}`, mirroring `server.tcl`. The sanitizer scan immediately below is left unguarded, matching `check_sanitizer_errors`, which always runs. No test coverage is lost. ```tcl if {$::valgrind} { set logs [glob */err.txt] foreach log $logs { set res [find_valgrind_errors $log true] if {$res != ""} { puts $res incr ::failed } } } ``` Signed-off-by: Smail Kourta <smail.kourta@canonical.com>
…on (valkey-io#3968) Adds REUSE 3.3 structure (REUSE.toml, LICENSES/) covering all source files and vendored dependencies. Replaces the non-standard dual-license COPYING file with a standard BSD-3-Clause text. Updates the description of custom patches to Lua and Jemalloc in deps/README.md. Benefits: - GitHub and OpenSSF Scorecard correctly detect the project license - reuse spdx generates a complete SPDX SBOM covering first-party code and all vendored deps - CI check prevents future contributors from introducing invalid SPDX identifiers - Per-file license clarity for downstream consumers (distro packagers, enterprises) --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…ey-io#3803) New hashtable API hashtableScanHasPassedKey. New guarantees around hashtable scan. --------- Signed-off-by: Rain Valentine <rainval@amazon.com>
Updated `kvstoreScan` to more clearly show when a kvstore cursor was used vs. a hashtable cursor. Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Replace ~10k individual INCR round-trips with single SET commands using large values. And set shutdown-timeout 0 before shutting down the node. Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Extend zmalloc_purge to also trim the libc main arena Rename jemalloc_purge() to zmalloc_purge() and append a malloc_trim(3) call on glibc systems, so that free pages held in the libc main arena are returned to the OS. This benefits all callers: MEMORY PURGE, FLUSHALL, and FLUSHDB. Update MEMORY HELP and memory doctor diagnostics to mention libc internal allocations as a possible source of high RSS overhead. Signed-off-by: Vaibhav Gupta <wweebbssss@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
…functions (valkey-io#3357) ## Summary This PR removes the unused `void(callback)(dict *)` parameter from `functionReset()` and related functions. ## Changes The callback parameter in `functionReset()`, `functionsLibCtxFree()`, and `functionsLibCtxClear()` was never actually used - it was always passed as NULL or cast from an incompatible type. This cleanup: - Removes the `void(callback)(dict *)` parameter from `functionReset()` - Removes the callback parameter from `functionsLibCtxFree()` - Removes the callback parameter from `functionsLibCtxClear()` - Makes `functionsLibCtxReleaseCurrent()` static since it's only used internally - Removes the TODO comment in db.c about the callback incompatibility - Updates all call sites to use the simplified function signatures ## Files Changed - `src/functions.h`: Updated function declarations - `src/functions.c`: Removed callback parameters and updated implementations - `src/db.c`: Simplified `functionReset()` call, removed TODO comment - `src/lazyfree.c`: Updated `functionsLibCtxFree()` calls ## Rationale The `dictEmpty()` calls now explicitly pass NULL for the callback, which was effectively what was happening before since the parameter was unused. This simplifies the API and removes the need for the type cast workaround that was previously used in db.c. Signed-off-by: Justin Fung <justfung@amazon.com> Co-authored-by: Justin Fung <justfung@amazon.com>
#### Purpose This workflow was originally introduced in PR [valkey-io#3358](valkey-io#3358), where we detect the failures in our scheduled `daily` runs and create / update github issues. We want to do more things with AI with respect to tests failures. It could include potentially finding the root cause, any PR that broke the tests, some helpful dashboard to track daily tests, maybe some analysis or possible fix as well. To achieve that, we are moving this issue management out of this repository and into `valkey-ci-agent`. The Daily workflow in this repository still records per-job test failures, consolidates them into `all-test-failures.json`, and uploads the `all-test-failures` artifact. The workflow being removed here was only responsible for consuming that artifact and creating or updating GitHub issues. #### Changes Remove `.github/workflows/test-failure-detector.yml`. Issue creation and updates are now handled by the Test Failure Detector workflow in `valkey-ci-agent` through this PR [valkey-io#24](valkey-io/valkey-ci-agent#24). #### Notes This should be merged together with the corresponding `valkey-ci-agent` change so scheduled test-failure detection continues without a gap. Signed-off-by: Bonnie Chan <bonchan35@gmail.com>
Document the shared I/O threading infrastructure: SPMC inbox, MPSC outbox, per-worker SPSC inboxes, job lifecycle, drain semantics, live reconfiguration. --------- Signed-off-by: Harkrishn Patro <h_patro@apple.com>
…become SIGNED_MEMORY_CONFIG (valkey-io#2648) After valkey-io#3443, we have a new `SIGNED_MEMORY_CONFIG`, which enables the memory configuration to support storing the value -1. Make commandlog-request-larger-than and commandlog-reply-larger-than become SIGNED_MEMORY_CONFIG, so we can use it in a memory way. Signed-off-by: Binbin <binloveplay1314@qq.com>
255 is too short if valkey-server is being run from a long path, especially if many fields are included such as in the "Process title set as expected" test in `unit/other.tcl`, where the max length for the cmdline is 1024, so having the same number in both makes sense. Closes valkey-io#3832. Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
…y-io#3980) The `test io-threads are runtime modifiable` test in `tests/unit/other.tcl` times out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test is introduced in valkey-io#3938. This PR reduces the loop to 10 iterations under Valgrind. **Failure links:** - https://github.com/valkey-io/valkey/actions/runs/27386948127 (Jun 12) - https://github.com/valkey-io/valkey/actions/runs/27315974006 (Jun 11) - https://github.com/valkey-io/valkey/actions/runs/27245311034 (Jun 10) --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3675) No bug exists — all NOT_KEY commands (CLUSTERSCAN, SSUBSCRIBE, SPUBLISH, SUNSUBSCRIBE) have NULL getkeys_proc, so doesCommandHaveKeys returns 0 and ACL correctly skips key checks. This PR makes the handling more explicit and defensive for future commands that may combine getkeys_proc with NOT_KEY key-specs. Changes: - Refactor doesCommandHaveKeys() from a one-line ternary into clearer if-else branches. Add an explicit check: if a command has getkeys_proc but all its key-specs are NOT_KEY, treat it as having no real keys. - Add a defensive NOT_KEY check in ACLSelectorCheckKey() to skip key-pattern ACL validation for NOT_KEY entries, consistent with the existing skip in getKeysUsingKeySpecs(). - Add tests: verify COMMAND GETKEYS/GETKEYSANDFLAGS report "no key arguments" for NOT_KEY commands; verify restricted ACL users can run CLUSTERSCAN without key-permission errors. Related: valkey-io#2934, valkey-io#3699 Signed-off-by: Binbin <binloveplay1314@qq.com>
While reviewing valkey-io#3675 I found a pre-existing `NOT_KEY` inconsistency in client tracking. `CLUSTERSCAN` is skipped by `COMMAND GETKEYS/ACL` through the key spec path, but tracking still uses `getKeysFromCommand()`. Updated to use `getKeysFromCommandWithSpecs` ``` raghav$ printf 'client tracking on\ninfo stats\nclusterscan 0-{06S}-0\ninfo stats\n' | /tmp/valkey-unstable-single/src/valkey-cli -p 7001 --raw | grep -E '^(OK|tracking_total_keys:|0$)' OK tracking_total_keys:2 0 tracking_total_keys:2 ```` This affects today `CLUSTERSCAN` because it is RO, `SPUBLISH/SSUBSCRIBE/SUNSUBSCRIBE` using `NOT_KEY` are not RO. Signed-off-by: nmvk <r@nmvk.com>
Add explicit validation for the FIELDS token before parsing numfields. Fixes valkey-io#4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
…io#4058) When a primary receives an MFSTART (manual failover start) message, it was silently ignored if the sender was unknown or not a replica of this primary. This made it impossible to diagnose why a manual failover was timing out on the replica side — the primary had no indication that it ever received the request. In particular, cluster gossip may not have propagated the new replica relationship to the primary yet, so the primary may legitimately not recognize the sender as its replica. Without logging, this scenario leaves no trace on the primary side. Signed-off-by: Binbin <binloveplay1314@qq.com>
Some functions are newly deprecated in OpenSSL 4.0. This commit works around those. More specifically: - X509_cmp_current_time() in isCertValid() is no longer necessary as X509_check_certificate_times() compares the notBefore and notAfter on its own. Had to add a version check since this function is new in OpenSSL 4.0. - Replace deprecated X509_NAME_get_text_by_NID. Not a perfect fix, because the new implementation still assumes that the name does not contain embedded null characters which may not be true, e.g., if the name is of type UniversalString or BMPString. - Also fix constness of X509_get_subject_name return value. The latter two fixes are taken from this Fedora patch: https://src.fedoraproject.org/rpms/valkey/c/1a9c8847172ef3fb116a1e2fdb3871692378adae?branch=rawhide Closes valkey-io#4012 Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
valkey-io#4059) Two minor cleanups: - Remove unnecessary incrRefCount(shared.hdel) just like other shared object. - Fix new_argv allocation size since HDEL only need (num_fields + 2) size. Signed-off-by: Binbin <binloveplay1314@qq.com>
Use I/O threads for cluster bus reads, writes, and TLS accepts, mirroring the client I/O threading model. Splits packet framing from application, adds deferred link teardown with io_refs for safe concurrent I/O, dual send queues, and a framed packet queue on clusterLink. Adds observability counters (CLUSTER INFO). Signed-off-by: Harkrishn Patro <h_patro@apple.com>
b4fdc03 to
16d11d2
Compare
Cluster Bus I/O Offload
Overview
Cluster bus read, write, and TLS accept work can run on the shared I/O thread
pool, while all cluster state mutation stays on the main thread.
Jobs flow through:
io_shared_inbox(SPMC): main thread -> I/O workersio_shared_outbox(MPSC): I/O workers -> main threadIf the pool is inactive or enqueue fails, the caller falls back to synchronous
I/O on the main thread and increments
stat_cluster_io_sync_fallbacks.Architecture
Key Design Decisions
connRead,connWrite,connAccept, framing, and result publication.send_msg_queueplus a snapshot boundary:io_last_send_block,io_head_offset, andio_nodes_sent.rcvbufsnapshot boundaries:io_rcvbuf_snapshot_lenandio_rcvbuf_snapshot_packets.rcvbufsnapshot fully; there is no bounded packet/time budget.io_refsandasync_close.CONN_FLAG_ACCEPT_OFFLOAD_PENDING.ConnOwnerKindlets generic connection/TLS code distinguish client-owned and cluster-owned connections safely.Data Flow: Read Path
If dispatch returns
C_ERR, the main thread falls back toclusterReadHandler(conn).Data Flow: Write Path
Important note:
send_msg_queue,but the worker stops at
io_last_send_block, so those new nodes are picked upby a later dispatch.
If dispatch returns
C_ERR,clusterWriteHandler(conn)falls back to the synchronousconnWrite()loop.Data Flow: Accept Path (TLS)
Important note:
connections are routed back to
trySendClusterAcceptToIOThreads, and thepending flag ensures only one accept job is in flight for that connection.
If dispatch returns
C_ERR, the main thread falls back toconnAccept(conn, clusterConnAcceptHandler).clusterLink I/O State
Read and write jobs are mutually exclusive per link.
Invariants
Link State
io_refs >= 0always.io_refs == 0implies bothio_read_stateandio_write_stateare IDLE.send_msg_queue_memtracks the canonicalsend_msg_queue.send_msg_queue_mem + rcvbuf_len.stat_cluster_links_memoryis updated on the main thread.I/O Thread MAY
connRead()/connWrite()/connAccept()on the offloaded connection.rcvbufand updatercvbuf_len/rcvbuf_alloc.rcvbufand publishio_rcvbuf_snapshot_len/io_rcvbuf_snapshot_packets.send_msg_queueup to the snappedio_last_send_block.io_result,io_nodes_sent,io_head_offset, andlast_io_read_time.sendToMainThread().I/O Thread MUST NOT
clusterNode,clusterState, slot ownership, epoch/failover state, or any other cluster-global state.clusterProcessPacket()or any packet-application logic.io_read_state,io_write_state,io_refs, orasync_close.send_msg_queue_mem.server.stat_cluster_links_memory.clusterLink,clusterNode, or connection objects.Main Thread MUST NOT While a Job Is In Flight
rcvbuf,rcvbuf_len,rcvbuf_alloc,io_rcvbuf_snapshot_len, orio_rcvbuf_snapshot_packets.Dispatch Contract
All three dispatch helpers (
trySendClusterReadToIOThreads,trySendClusterWriteToIOThreads,trySendClusterAcceptToIOThreads) follow thesame pattern:
connSetPostponeUpdateState(conn, 1)andconnIncrRefs(conn)before enqueue.Return values:
C_OK: work was offloaded, or an equivalent job is already pending.C_ERR: pool inactive or enqueue failed, so caller may sync-fallback.Deferred Teardown
If
freeClusterLink(link)seesio_refs > 0:async_close = 1.The actual
connClose()and final free happen later, when the last completionhandler drops
io_refsto0.Failure Detection
last_io_read_timeis updated by the read worker on successful reads and isincluded in the cluster node delay calculation.
The I/O thread stores
last_io_read_timewith release ordering; the main threadloads it with acquire ordering.
Error Handling
CLUSTER_IO_BAD_HEADER/CLUSTER_IO_BAD_LENGTHCLUSTER_IO_READ_ERROR/CLUSTER_IO_EOFrcvbufsnapshot, then free linkCLUSTER_IO_WRITE_ERRORCONN_STATE_ACCEPTINGafter accept completionstat_cluster_io_sync_fallbacksFollow-Up
rcvbufsnapshot in one pass on the main thread and compacts the buffer after each packet.memmove()while a large burst is being applied.CLUSTER_IO_BAD_HEADER/CLUSTER_IO_BAD_LENGTHis to close the link immediately, even if the same offloaded read also contained a valid packet prefix in the queued snapshot. This is accepted for now as an invalid-peer path, but if we ever need sync/offload parity here, read completion should drain the valid prefix before tearing the link down.