[Load Test] ARC runners CI - target 7.2#10
Closed
roshkhatri wants to merge 1961 commits into
Closed
Conversation
Signed-off-by: Ping Xie <pingxie@outlook.com>
Simple change to `hashtableDebug` to improve readability: * Improve indentation * Remove "(empty)" lines Signed-off-by: Jim Brunner <brunnerj@amazon.com>
### Overview This PR adds support for automatic background TLS reloading, closes valkey-io#2649 TLS validity checks and fail-fast behavior on invalid certificates are handled separately in valkey-io#2999. - New configuration - `tls-auto-reload-interval <seconds>` - `0` disabled (default, backward compatible) - `>0` check interval in seconds - TLS materials change detection in background - SHA-256 fingerprint checking for certificate files - `inode + mtime` checking for CA certificate directories and key files - Skips reload if materials haven't changed `tlsCheckMaterialsAndUpdateCache` - TLS contexts reload - CPU-intensive certificate parsing happens in dedicated BIO worker thread `BIO_TLS_RELOAD` - Main thread never blocks, atomically swaps SSL contexts - Two-phase reload: background preparation `tlsConfigureAsync` + main thread application `tlsApplyPendingReload` **Note**: Original TLS load and reload still remain in main thread using `tlsConfigureSync`, including: - Initial TLS load (server startup) - Runtime reload via CONFIG SET --------- Signed-off-by: Yang Zhao <zymy701@gmail.com>
…alkey-io#3082) Currently, the weekly runs do not progress if there is a failed workflow as github CI treats `fail-fast` to be true by default. With this change, we continue to test all the branches even after failure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3022) if the hrandfield(e.g. hrandfield myhash) command without other args does not find a valid field, it will return an uninitialized lval. ``` debug set-active-expire no hsetex myhash ex 1 fields 2 f1 v1 f2 v2 after 1s... hrandfield myhash [will return some uninitialized number] ``` related: valkey-io#3021 Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
The test case added in valkey-io#3042 fails on platforms without MPTCP support. This change automatically skips the MPTCP tests on platforms without MTPCP. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…table (valkey-io#3092) Thanks madolson for pointing out the issue with valkey-io#2907. Whenever we were using the label, it was picking up the head of the unstable instead of the PR. The reason was that we changed the target to `pull_request_target`. We had to do that since we wanted to remove the label after the run is completed. With this change, it correctly picks up the commit hash of the PR and checks it out. To test this, I merged the changes in my forked repository's [unstable](valkey-io/valkey@unstable...sarthakaggarwal97:valkey:unstable), created a dummy [PR](sarthakaggarwal97#61), and was able to verify that the commit of the PR is being checked out in the [action](https://github.com/sarthakaggarwal97/valkey/actions/runs/21229598167/job/61084986422?pr=61#step:3:81). Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…y-io#3088) Memory leak in the VM_GetCommandKeysWithFlags function when MAX_KEYS_BUFFER is reached. We will malloc the memory so we should always free the getKeysResult. ``` /* Resize if necessary */ if (numkeys > result->size) { if (result->keys != result->keysbuf) { /* We're not using a static buffer, just (re)alloc */ result->keys = zrealloc(result->keys, numkeys * sizeof(keyReference)); } else { /* We are using a static buffer, copy its contents */ result->keys = zmalloc(numkeys * sizeof(keyReference)); if (result->numkeys) memcpy(result->keys, result->keysbuf, result->numkeys * sizeof(keyReference)); } result->size = numkeys; } ``` Closes valkey-io#3087. Signed-off-by: arshidkv12 <arshidkv12@gmail.com> Signed-off-by: Arshid <arshidkv12@gmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
When using XREAD STREAMS <stream> + on an empty stream created with MKSTREAM, valkey returns an error instead of nil. This happens because is missing a check on the stream length. The fix adds the length check on the condition. Fixes valkey-io#2728 Signed-off-by: diego-ciciani01 <diego.ciciani@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Minor fix to use the right var. Signed-off-by: Vitah Lin <vitahlin@gmail.com>
Signed-off-by: Yang Zhao <zymy701@gmail.com>
We are already double running the tests with CMake, and we are building CMake with TLS, so just making it so we run the tests with TLS. This seems like an simple update so that we are always running the TLS tests. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
As the hashtable grows, rehash is triggered. Rehash consumes a
significant amount of CPU time, resulting in noticeable performance
degradation. Much of this CPU time is spent on random memory access
(when accessing hashtable entries). This CPU usage can be optimized
through concurrent memory access.
```
// The original rehash process:
For each bucket and its child buckets, all entries are processed as follows:
1) Access the entry, which involves random memory access.
2) Compute the hash value of the entry.
3) Based on the hash value, insert the entries into the new hashtable.
// The optimized rehash process:
The original steps are transformed into batch processing and phased execution, leveraging
the CPU's multi-issue capability to achieve concurrent memory access. The steps are as follows:
1) Access multiple entries serially within a for loop. Since each entry access is independent,
modern CPU architectures can perform concurrent memory access.
2) Compute the hash values for all entries.
3) Based on the hash values, insert the entries into the new hashtable.
```
---------
Signed-off-by: chzhoo <czawyx@163.com>
…bility and easier merging (valkey-io#3015) These lists are getting longer recently, and they're annoying to edit. I recently did a lot of rebasing and got tired of it. 😅 --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fix some outdated comments, improved some comments. Signed-off-by: Binbin <binloveplay1314@qq.com>
…ey-io#3060) There are currently several issues with the existing hash field expiration mechanism: 1. `HINCRBY` is propagated to the replica "as-is". It mean it relies on the fact that the state of the hash is the same on the primary and the replica. HFE did change this assumption as the field might be expired only when the replica will handle the propagated `hincrby`. the problem is that the replica does not "expire" fields by it's own. it needs to respect the request from the primary and always try to use the existing field. This can lead to either miss-alignment with the value on the primary and the replica AND even a disconnection since the replica might hold and "expired" field which is not in "integer" format... 2. HINCRBYFLOAT is currently ALWAYS propagating `hset` - this means that the expiration time of an entry will always be removed on the replica side (it needs to propagate HSETEX when expiration time needs to be maintained) 3. Currently all hash write commands which are mutating values might overwrite an expired field. In such cases the existing implementation will "silently" do so. The problem is that the user will not get any key-space-notificaiton explaining the reason for the behavior. For example, when `hincrby` is issued overwriting an expired field which was not yet "cleaned" by active-expiration it will reset the counter to '0' before incrementing it. this means that the user might ask: why is the value '1' and not bigger, "I did not see any notification that the old value expired"... 4. HSETEX with KEEPTTL suffers from a "somewhat" similar problem as #(1). the replica will receive the propagated command, but will not know if the primary "replaced" the entry which is expired now but might not have been expired when the primary applied it. There are 2 options for a solution: 1. we could propagate `hdel` for every entry we are "overwritting" (batch them if we can) 2. propagate the commands "by effect". For example - have `hincrby` always propagate either HSET or HSETEX. This will not solve the '#(4)' problem above though, for which we might HAVE to propagate `hdel` I tend to go with the second option. The reason is that it is expected to have less impact on replication stream and should include less processing time on the replicas and network traffic. Specifically for HSETEX with KEEPTTL we will have to propagate the `hdel` in case we overwritten an expired field, but that would help limit the impact of this propagation. --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Sourav Singh Rawat <aidenfrostbite@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
…alkey-io#3086) Currently after valkey-io#2652 in copy avoidance path we unconditionally track `c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than -1`. This PR provides ability to skip that accounting in copy avoidance path based on config value. If `commandlog-request-larger-than -1` then performance is same as before valkey-io#2652. Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it is not tracked in main thread based on decision made in main thread. Read Performance (write performance is the same): ``` With this change: Summary: throughput summary: 2191732.75 requests per second latency summary (msec): avg min p50 p95 p99 max 1.720 0.072 1.743 1.919 2.647 23.983 Unstable: Summary: throughput summary: 1658197.25 requests per second latency summary (msec): avg min p50 p95 p99 max 2.299 0.120 2.343 2.503 3.319 4.791 Config: commandlog-request-larger-than -1 ^^ without this performance is just like on unstable branch databases 1 save "" appendonly no rdbcompression no activedefrag no maxclients 1000 io-threads 9 protected-mode no hz 10 maxmemory 2gb ``` --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
I/O threads when active do busy polling for work. This puts the CPU time
near 100% per thread even when they aren't fully utilized. Tracking the
time an IO thread spends doing useful work vs waiting for work helps
understand the utilization of individual threads.
INFO fields:
used_active_time_io_thread_%d
... where %d is the IO thread ID number (between 1 and the number of I/O
threads).
To get a percentage over a period of time, compare the delta of this
metric against a delta of the server's uptime or the INFO field
`server_time_usec`.
---------
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
This caused some unaligned load of server values on 32bit compilations. Example: https://github.com/valkey-io/valkey/actions/runs/21352969942/job/61491923381?pr=3111#step:6:3064 Some insights here: valkey-io#3111 (comment) Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…alkey-io#3120) The test is currently flakey, because until we merge: valkey-io#3001 when the expiration time provided is in the past, and the field does not exist the HSETX will just silently ignore it, without incrementing the statistics. I prefer to focus on writing a dedicated test for: valkey-io#3001 and deflake this test now. Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
On some Intel systems it was observed that TSC `ticksPerMicrosecond` derived from `/proc/cpuinfo` was different from calibrated `ticksPerMicrosecond` or kernel derived one, so it was decided to always calibrate and look for `constant_tsc` flag. Resolves valkey-io#2597 --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
This PR adds the certificates validation at TLS load, rejects invalid (expired/not-yet-valid) certificates: Apply to all TLS config paths: - Server certificates `tls-cert-file` - Server-side client certificates `tls-client-cert-file` - CA certificate file `tls-ca-cert-file` - CA certificate directory `tls-ca-cert-dir` (now eagerly loaded to be consistent with file-based CAs) Apply to both scenarios: - Server startup (initial TLS load) - Runtime reload vis `CONFIG SET` ### Implementation - Added `isCertValid` function to check if an X509 certificate is within its validity period (not expired, not future-dated) - Added `areAllCaCertsValid` function to iterate through all loaded CA certificates and validate them - Added `loadCaCertDir` function to eagerly load all certificates from a directory into the X509_STORE - Modified `createSSLContext` to validate: - Server/client certificates immediately after loading - All CA certificates after loading from file/directory ### Test results #### 1. Server startup (initial TLS load) ``` tls-cert-file ./tests/tls/server-expired.crt 41522:M 31 Dec 2025 16:13:18.851 # Server TLS certificate is invalid. Aborting TLS configuration. 41522:M 31 Dec 2025 16:13:18.851 # Failed to configure TLS. Check logs for more info. tls-client-cert-file ./tests/tls/client-expired.crt 41557:M 31 Dec 2025 16:14:43.296 # Client TLS certificate is invalid. Aborting TLS configuration. 41557:M 31 Dec 2025 16:14:43.296 # Failed to configure TLS. Check logs for more info. tls-ca-cert-file ./tests/tls/ca-expired.crt tls-ca-cert-dir ./tests/tls/ca-expired 41567:M 31 Dec 2025 16:15:15.635 # One or more loaded CA certificates are invalid. Aborting TLS configuration. 41567:M 31 Dec 2025 16:15:15.635 # Failed to configure TLS. Check logs for more info. ``` #### 2. Runtime reload via CONFIG SET ``` 127.0.0.1:6379> config set tls-cert-file ./tests/tls/server-expired.crt (error) ERR CONFIG SET failed (possibly related to argument 'tls-cert-file') - Unable to update TLS configuration. Check server logs. 62975:M 02 Jan 2026 20:10:43.588 # Server TLS certificate is invalid. Aborting TLS configuration. 62975:M 02 Jan 2026 20:10:43.588 # Failed applying new configuration. Possibly related to new tls-cert-file setting. Restoring previous settings. 127.0.0.1:6379> config set tls-client-cert-file ./tests/tls/client-expired.crt (error) ERR CONFIG SET failed (possibly related to argument 'tls-client-cert-file') - Unable to update TLS configuration. Check server logs. 62975:M 02 Jan 2026 20:10:57.972 # Client TLS certificate is invalid. Aborting TLS configuration. 62975:M 02 Jan 2026 20:10:57.972 # Failed applying new configuration. Possibly related to new tls-client-cert-file setting. Restoring previous settings. 127.0.0.1:6379> config set tls-ca-cert-file ./tests/tls/ca-expired.crt 127.0.0.1:6379> config set tls-ca-cert-dir ./tests/tls/ca-expired (error) ERR CONFIG SET failed (possibly related to argument 'tls-ca-cert-file') - Unable to update TLS configuration. Check server logs. 62975:M 02 Jan 2026 20:10:50.175 # One or more loaded CA certificates are invalid. Aborting TLS configuration. 62975:M 02 Jan 2026 20:10:50.175 # Failed applying new configuration. Possibly related to new tls-ca-cert-file setting. Restoring previous settings. ``` --------- Signed-off-by: Yang Zhao <zymy701@gmail.com>
…mary after direct writes of keys with expiration (valkey-io#2953) The bug mentioned in this valkey-io#2952 has been fixed. --------- Signed-off-by: li-benson <1260437731@qq.com>
Addresses: valkey-io#2908 This decouples the low-level LRU/LFU code from the high-level `server.h` file. * LRU/LFU specific config values were removed from `server.h` and relocated to `lrulfu.h` * A new LRULFU API was created to update the time (and policy) rather than accessing globals from `server.` This doesn't address the root of the original test failure in that `server.h` may compile with different offsets (due to `off_t`) based on the include order. See: valkey-io#2908 (comment) Signed-off-by: Jim Brunner <brunnerj@amazon.com>
…alkey-io#2867) Embedding the skiplist header reduces memory jumps, thus optimizing sorted set query speed. ``` // Before typedef struct zskiplist { struct zskiplistNode *header, *tail; unsigned long length; int level; /* Height of the skiplist. */ } zskiplist; // After typedef struct zskiplist { unsigned long length; struct zskiplistNode *tail; /* Since the span at level 0 is always 1 or 0 , * this field is instead used for storing the height of the skiplist. */ struct zskiplistLevel { struct zskiplistNode *forward; unsigned long span; } level[ZSKIPLIST_MAXLEVEL]; } zskiplist; ``` --------- Signed-off-by: chzhoo <czawyx@163.com>
…uring startup (valkey-io#3067) Before ca6aead, even if setlocale fails, we won't do anything. ``` setlocale(LC_COLLATE,""); ``` And now we will exit the process if the setlocale fails, this resulted in process startup failures on some machines. (some kind of breaking change?) ``` if (setlocale(LC_COLLATE,server.locale_collate) == NULL) { serverLog(LL_WARNING, "Failed to configure LOCALE for invalid locale name."); exit(1); } ``` In this commit, if we fail to set the locale_collate through environment variables, we maintain backward compatibility and do not exit. In a case where we did exit before, we don't exit now, seems safe. It's not a breaking change and the behavior in this case was undocumented. Signed-off-by: Binbin <binloveplay1314@qq.com>
If valgrind is used then test for used_active_time_io_thread fails
because valgrind is slow, so skip it
```
[err]: Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl
Expected (7.941596 - 5.876588) < (1000/1000) (context: type eval line 53 cmd {assert {($used_active_time - $initial_active_times($i)) < ($sleep_time_ms/1000)}} proc ::test)
```
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
This change amends valkey-io#3086 in which `commandlog-request-larger-than` was mixed up with `commandlog-reply-larger-than` while checking for config values and in tests. --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…-io#3001) In the original implementation of Hash Field Expiration (valkey-io#2089), the HSETEX command was implemented to report keyspace notifications only for performed changes. This is mostly aligned with other Hash commands (for example, HDEL will also not report `hdel` event for items which does not exist) The HSETEX case is somewhat different and is more like the `HSET` case. During HSETEX, after the command validations pass, items are ALWAYS "added" to the object, even though they might not actually be added. This case is the same for when the hash object is empty or when all the provided fields do not exist in the object (as reported [here](valkey-io#2998)) This PR changes the way `HSETEX` will report keyspace notifications so that: 1. `hset` notification will ALWAYS be reported if all command validations pass. 2. `hexpire` will be reported in case the command include an expiration time (even past time) 3. `hxpired` will be reported in case the provided expiration time is in the past (or 0) 4. `hdel` will be reported in case the hash exists (or created as part of the command) and following the command execution it was left empty. 5. we will always return '1' as a return value of tHSETEX command which passed all validations. Before that we returned 1 only if we applied the change cross ALL the input fields, so in case some of them did not exist and a past time was set we would return 0. --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Jacob Murphy <jkmurphy@google.com>
This PR fixes various issues with Lua module built on macOS using CMake / Apple Clang. ## Summary of changes The Lua module build configuration has been updated to support macOS in addition to Linux. Two key changes were made to resolve build issues on Apple platforms. First, the valkeylua shared library now explicitly includes `sha1.c` and `rand.c` source files and links against the fpconv library, ensuring all required symbols are available at link time. Second, the --disable-new-dtags linker flag is now conditionally applied only on Unix systems excluding macOS, since this flag is not supported by the Apple linker. In addition, enable position independent code compilation for the fpconv static library by setting the `POSITION_INDEPENDENT_CODE` property. This allows the library to be linked into shared libraries and position-independent executables. Last, the Lua library filename definition was previously hardcoded to use the `.so` extension for all platforms. This change conditionally sets the library name based on the target platform: `.so` for Linux/Unix systems and `.dylib` for macOS (and other non-Linux Unix systems). This ensures that the correct dynamic library extension is used when loading the Lua module at runtime. ** Generated by CodeLite. ** Signed-off-by: Eran Ifrah <eifrah@amazon.com>
…io#3554) If not cleared, the job may no longer be valid by the time the client goes to cleanup. This dangling reference could cause a crash if you set slot-migration-log-max-len to 0 and are very unlucky. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Remove eval script cache entries that belong to a scripting engine when that engine is unregistered. This prevents the eval cache from retaining dangling engine pointers and keeps the tracked script memory in sync after engine shutdown. The scripting engine unregister path now invokes a new eval cleanup helper, which scans the cached scripts, drops matching entries from the LRU list and dictionary, and adjusts cache memory accounting accordingly. * scripting engine * eval cache Signed-off-by: Eran Ifrah <eifrah@amazon.com>
…egacy files (valkey-io#2297) (valkey-io#3382) Migrated the remaining cluster tests to tests/unit/cluster/ to use the same framework for all cluster tests. Cleaned up the obsolete cluster test framework files and updated the CI workflows to use the new unified test runner. Changes: Moved and mapped 6 test files: - 03-failover-loop.tcl → Merged into existing failover.tcl - 04-resharding.tcl → resharding.tcl - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl → replica-migration-slow.tcl - 07-replica-migration.tcl → Merged into existing replica-migration.tcl - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl Other changes: - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv) - Added process_is_alive check in cluster_util.tcl to fix an exception in failover tests caused by executing ps on dead processes - Heavy tests (resharding, replica-migration-slow) marked with slow tag and wrapped in run_solo to prevent resource contention in sanitizer environments - replica-migration-slow marked with valgrind:skip tag since it is very slow - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl, includes/, and helpers/ - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@") - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster tests are now included in ./runtest Closes valkey-io#2297. Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
) The argument `count /= 2` modifies `count` as a side effect, and the following `count /= 2` divides it again unnecessarily. Since `count` is not used after this point, fix it by using `count / 2` without the side effect and remove the redundant second assignment. Signed-off-by: djk1027 <djk9510271@gmail.com>
This change was introduced in valkey-io#3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>
I noticed this LTO compile warning in the eval code. Looks like it's
getting confused about an sds length, even though checked above. Just
added an assert to clarify.
The warning:
```c
LINK valkey-server
eval.c: In function ‘evalExtractShebangFlags’:
eval.c:263:27: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
*out_engine = zcalloc(engine_name_len + 1);
^
zmalloc.c:324:7: note: in a call to allocation function ‘valkey_calloc’ declared here
void *zcalloc(size_t size) {
^
```
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
…y-io#2823) ## Background Add structured datasets loading capability. Support CSV and TSV file formats. Use `__field:fieldname__` placeholders to replace the corresponding fields from the dataset file. Support natural content size of varying length. Allow mixed placeholder usage combining dataset fields with random generators. Enable automatic field discovery from CSV/TSV headers. Use `--maxdocs` to limit the dataset loading. Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using `valkeyFormatCommandArgv()`. This ensures: - Zero impact on existing functionality - Full support for variable-size content - Thread-safe atomic record iteration - Compatible with pipelining and threading modes __Usage examples__ ```sh # Strings - Simple key-value with dataset fields ./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__" # Sets - Unique collections from dataset ./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__" # CSV dataset with document limit ./valkey-benchmark --dataset wiki.csv --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__" # Mixed placeholders (dataset + random) ./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__ ``` __Full-Text Search Benchmarking__ ```sh # Search hit scenarios (existing terms) ./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__" # Search miss scenarios (non-existent terms) ./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__" # Query variations ./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__" ./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*" ``` __Benchmark Results__ Test environment: __Instance:__ AWS c7i.16xlarge, 64 vCPU Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory | Configuration | Throughput | CPU Usage | Wall Time | Memory Peak | |---------------|------------|-----------|-----------|-------------| | Single-threaded, P1 | 93,295 RPS | 99% | 71.4s | 5.8GB | | Multi-threaded (10), P1 | 93,332 RPS | 137% | 71.5s | 5.8GB | | Single-threaded, P10 | 274,499 RPS | 96% | 36.1s | 5.8GB | | Multi-threaded (4), P10 | 344,589 RPS | 161% | 32.4s | 5.8GB | --------- Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com> Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
Address this compile warning:
```c
CC util.o
util.c:638:1: warning: ‘no_sanitize’ attribute directive ignored [-Wattributes]
__attribute__((no_sanitize_address, no_sanitize("thread"), used)) static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
^~~~~~~~~~~~~
```
Addresses portability concerns around these attributes.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Free BYPOLYGON points before returning from invalid COUNT parsing paths in GEOSEARCH/GEOSEARCHSTORE. Closes valkey-io#3567 --------- Signed-off-by: Su Ko <rhtn1128@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
When read() returns 0 (EOF/connection closed) in syncRead(), errno is not set by POSIX, so it retains a stale value (typically 0). This causes callers using connGetLastError() to log strerror(0) which is the misleading string "Success". Set errno = ECONNRESET on EOF in syncRead(), matching the existing pattern used for the timeout case (errno = ETIMEDOUT). Also set conn->last_errno = errno in connSocketSyncWrite, connSocketSyncRead, and connSocketSyncReadLine wrappers, matching the pattern used by their async counterparts connSocketWrite and connSocketRead. After this fix, replica logs will show: "I/O error reading bulk count from PRIMARY: Connection reset by peer" instead of the misleading: "I/O error reading bulk count from PRIMARY: Success" --------- Signed-off-by: Abhishek Mathur <matshek@amazon.com> Signed-off-by: djk1027 <djk9510271@gmail.com> Co-authored-by: Abhishek Mathur <matshek@amazon.com> Co-authored-by: Daejun Kim <djk9510271@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
valkey-io#3586) clusterManagerNodePrimaryRandom() called srand(time(NULL)) on every invocation, then immediately rand() % primary_count. When called in a tight loop for uncovered slots, all calls within the same wall-clock second produce the identical seed, causing every uncovered slot to be assigned to the same primary node. Remove the srand() call since the PRNG is already seeded at startup (srand(time(NULL) ^ getpid()) at line 9838). This allows rand() to advance its state across calls, distributing uncovered slots randomly across available primaries. --------- Signed-off-by: Abhishek Mathur <matshek@amazon.com> Co-authored-by: Abhishek Mathur <matshek@amazon.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
…o#3598) In `getKeysUsingKeySpecs`, when extracting keys based on the `KSPEC_FK_KEYNUM `spec (like in the `EVAL` command), the server read the number of keys from the arguments and calculated the expected end index. However, it called `getKeysPrepareResult` to allocate memory for the result array before validating whether last was within the bounds of the actual arguments provided. If a client sent a command with a huge declared number of keys (e.g., `COMMAND GETKEYS EVAL "return 1" 2147483647 key1`), the server would allocate a massive amount of memory. Since `vm.overcommit_memory` is recommended, this allocation would NOT normally have triggered OOM (we never wrote to it so there is no physical memory allocated), but if you disable overcommit, this could trigger an OOM. You can reproduce it with: ``` $ prlimit --as=1073741824 src/valkey-server --save "" ... 384270:M 30 Apr 2026 04:27:24.456 * Ready to accept connections tcp ... <in valkey-cli> 127.0.0.1:6379> command getkeys eval "return 1" 2147483647 key1 ... <in server log> 384270:M 30 Apr 2026 04:29:26.950 # Out Of Memory allocating 17179869176 bytes! ``` ## Solution * Moved the bounds check `if (last >= argc || last < first || first >= argc)` to execute before the call to `getKeysPrepareResult`, preventing the large allocation on invalid input. * To further catch issues like this, protected against integer overflow during the calculation of last by using a long long temporary variable. If it exceeds INT_MAX or falls below INT_MIN, the spec is marked invalid immediately. Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…alkey-io#3592) Fixes valkey-io#1905 ## Summary The direct use of `je_calloc` in `src/allocator_defrag.c` causes compilation failures on systems (e.g., Arch Linux with GCC 14.2.1) where `calloc` is marked as deprecated and `-Werror=deprecated` is enabled. ## Changes Replace the two `je_calloc` calls in `allocatorDefragInit()` with `zcalloc_num`, which is the proper Valkey allocation wrapper that provides the same semantics (num × size with zero-fill) without directly invoking the deprecated `calloc` symbol. ## Testing - Build compiles cleanly - Integration tests pass (unit/memefficiency, defrag, unit/other — 51 passed, 0 failed) Signed-off-by: jaduffy <jaduffy@amazon.com>
…et node disappears (valkey-io#3596) ### Summary This PR fixes a NULL pointer dereference (SIGSEGV) in `connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a Valkey cluster node, causing a denial-of-service condition. ### Root Cause When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick, `proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks up the target node via `clusterLookupNode()`. `clusterLookupNode()` can legitimately return `NULL` — for example, if the target node is removed from the cluster (e.g. via `CLUSTER FORGET`) between the time the migration job is created and the time the cron fires. This is a realistic race condition in any cluster topology change scenario. The return value was **never checked**, so the subsequent call to `getNodeDefaultReplicationPort(n)` immediately dereferences the NULL pointer, crashing the process: ```c // Before fix — vulnerable clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN); int port = getNodeDefaultReplicationPort(n); // SIGSEGV if n == NULL serverLog(..., n->ip, port); // second dereference Signed-off-by: chenshi5012 <chenshi5012@163.com>
…io#3591) When XTRIM marks the last entry in a listpack node as deleted, lpNext() returns NULL after the lp-count field (EOF). The delta calculation (p - lp) on a NULL pointer is undefined behavior and produces a garbage pointer, corrupting the listpack. A subsequent XREAD hitting the corrupted node triggers the lpValidateNext assertion failure and crashes the server. Guard the delta calculation with a NULL check so the while(p) loop terminates naturally when the last entry is reached. Fixes valkey-io#3569 Signed-off-by: Saurabh Kher <saurabh@amazon.com> Co-authored-by: Saurabh Kher <saurabh@amazon.com>
…3601) The function lpEncodeBacklen() uses `<= 127` for the 1-byte case but `< 16383`, `< 2097151`, and `< 268435455` for the subsequent cases. This means the exact values 16383, 2097151, and 268435455 (i.e. 2^14-1, 2^21-1, 2^28-1) unnecessarily use one extra byte than needed: - `l < 16383` → `16383` (2^14-1) uses 3 bytes instead of 2 - `l < 2097151` → `2097151` (2^21-1) uses 4 bytes instead of 3 - `l < 268435455` → `268435455` (2^28-1) uses 5 bytes instead of 4 The decoding side (`lpDecodeBacklen`) is unaffected since it parses continuation bits continuously without discrete range checks. This is a correctness issue and has no impact on data integrity since encoding and decoding use the same function boundaries, but it wastes up to 1 byte per affected entry. Signed-off-by: fanpei91 <fanpei91@gmail.com>
Big endian support on Valkey is "best effort" and not guaranteed, but we haven't been doing any regular testing at all afaik. This PR adds a job to the daily workflow to run UTs on an emulated big endian platform. Integration tests failed excessively because of how slow emulation is. I fixed several problems with tests and improved UT coverage of key points where endian byte order matters - and fwiw I didn't find any bugs. I think the main coverage gap remaining after this is RDB serialization (maybe little endian <-> big endian round trips?) There are couple lines of endian-specific code for valkey-io#3166 and this change can test it. Signed-off-by: Rain Valentine <rsg000@gmail.com>
It's important to enabled ASAN on run-extra-tests label so we can catch some of the bugs in the PRs before they are merged into unstable. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Owner
Author
|
Closing - switching to direct daily.yml dispatch per-release-branch |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Load test: validating self-hosted ARC runners against release branch 7.2. This PR will be closed after testing.