Skip to content

[Load Test] ARC runners CI - target 7.2#10

Closed
roshkhatri wants to merge 1961 commits into
7.2from
test-arc-runners
Closed

[Load Test] ARC runners CI - target 7.2#10
roshkhatri wants to merge 1961 commits into
7.2from
test-arc-runners

Conversation

@roshkhatri

Copy link
Copy Markdown
Owner

Load test: validating self-hosted ARC runners against release branch 7.2. This PR will be closed after testing.

PingXie and others added 30 commits January 20, 2026 14:57
Signed-off-by: Ping Xie <pingxie@outlook.com>
Simple change to `hashtableDebug` to improve readability:
* Improve indentation
* Remove "(empty)" lines

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
### Overview
This PR adds support for automatic background TLS reloading, closes
valkey-io#2649
TLS validity checks and fail-fast behavior on invalid certificates are
handled separately in valkey-io#2999.
- New configuration
  - `tls-auto-reload-interval <seconds>`
  - `0` disabled (default, backward compatible)
  - `>0` check interval in seconds
- TLS materials change detection in background
  - SHA-256 fingerprint checking for certificate files
- `inode + mtime` checking for CA certificate directories and key files
- Skips reload if materials haven't changed
`tlsCheckMaterialsAndUpdateCache`
- TLS contexts reload
- CPU-intensive certificate parsing happens in dedicated BIO worker
thread `BIO_TLS_RELOAD`
  - Main thread never blocks, atomically swaps SSL contexts
- Two-phase reload: background preparation `tlsConfigureAsync` + main
thread application `tlsApplyPendingReload`


**Note**: Original TLS load and reload still remain in main thread using
`tlsConfigureSync`, including:
- Initial TLS load (server startup)
- Runtime reload via CONFIG SET


---------

Signed-off-by: Yang Zhao <zymy701@gmail.com>
…alkey-io#3082)

Currently, the weekly runs do not progress if there is a failed workflow
as github CI treats `fail-fast` to be true by default. With this change,
we continue to test all the branches even after failure.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…alkey-io#3022)

if the hrandfield(e.g. hrandfield myhash) command without other args
does not find a valid field, it will return an uninitialized lval.

```
debug set-active-expire no
hsetex myhash ex 1 fields 2 f1 v1 f2 v2
after 1s...
hrandfield myhash
[will return some uninitialized number]
```

related: valkey-io#3021

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
The test case added in valkey-io#3042 fails on platforms without MPTCP support.

This change automatically skips the MPTCP tests on platforms without
MTPCP.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…table (valkey-io#3092)

Thanks madolson for pointing out the issue with
valkey-io#2907.

Whenever we were using the label, it was picking up the head of the
unstable instead of the PR. The reason was that we changed the target to
`pull_request_target`. We had to do that since we wanted to remove the
label after the run is completed.

With this change, it correctly picks up the commit hash of the PR and
checks it out.

To test this, I merged the changes in my forked repository's
[unstable](valkey-io/valkey@unstable...sarthakaggarwal97:valkey:unstable),
created a dummy
[PR](sarthakaggarwal97#61), and was able
to verify that the commit of the PR is being checked out in the
[action](https://github.com/sarthakaggarwal97/valkey/actions/runs/21229598167/job/61084986422?pr=61#step:3:81).

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…y-io#3088)

Memory leak in the VM_GetCommandKeysWithFlags function when
MAX_KEYS_BUFFER is reached.
We will malloc the memory so we should always free the getKeysResult.
```
    /* Resize if necessary */
    if (numkeys > result->size) {
        if (result->keys != result->keysbuf) {
            /* We're not using a static buffer, just (re)alloc */
            result->keys = zrealloc(result->keys, numkeys * sizeof(keyReference));
        } else {
            /* We are using a static buffer, copy its contents */
            result->keys = zmalloc(numkeys * sizeof(keyReference));
            if (result->numkeys) memcpy(result->keys, result->keysbuf, result->numkeys * sizeof(keyReference));
        }
        result->size = numkeys;
    }
```

Closes valkey-io#3087.

Signed-off-by: arshidkv12 <arshidkv12@gmail.com>
Signed-off-by: Arshid <arshidkv12@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
When using XREAD STREAMS <stream> + on an empty stream created with
MKSTREAM, valkey returns an error instead of nil.

This happens because is missing a check on the stream length.

The fix adds the length check on the condition.

Fixes valkey-io#2728

Signed-off-by: diego-ciciani01 <diego.ciciani@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Minor fix to use the right var.

Signed-off-by: Vitah Lin <vitahlin@gmail.com>
Signed-off-by: Yang Zhao <zymy701@gmail.com>
We are already double running the tests with CMake, and we are building
CMake with TLS, so just making it so we run the tests with TLS. This
seems like an simple update so that we are always running the TLS tests.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
As the hashtable grows, rehash is triggered. Rehash consumes a
significant amount of CPU time, resulting in noticeable performance
degradation. Much of this CPU time is spent on random memory access
(when accessing hashtable entries). This CPU usage can be optimized
through concurrent memory access.

```
//  The original rehash process:
For each bucket and its child buckets, all entries are processed as follows:
    1) Access the entry, which involves random memory access.
    2) Compute the hash value of the entry.
    3) Based on the hash value, insert the entries into the new hashtable.

//  The optimized rehash process:
The original steps are transformed into batch processing and phased execution, leveraging
the CPU's multi-issue capability to achieve concurrent memory access. The steps are as follows:
    1) Access multiple entries serially within a for loop. Since each entry access is independent,
       modern CPU architectures can perform concurrent memory access.
    2) Compute the hash values for all entries.
    3) Based on the hash values, insert the entries into the new hashtable.
```

---------

Signed-off-by: chzhoo <czawyx@163.com>
…bility and easier merging (valkey-io#3015)

These lists are getting longer recently, and they're annoying to edit. I
recently did a lot of rebasing and got tired of it. 😅

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fix some outdated comments, improved some comments.

Signed-off-by: Binbin <binloveplay1314@qq.com>
…ey-io#3060)

There are currently several issues with the existing hash field
expiration mechanism:
1. `HINCRBY` is propagated to the replica "as-is". It mean it relies on
the fact that the state of the hash is the same on the primary and the
replica. HFE did change this assumption as the field might be expired
only when the replica will handle the propagated `hincrby`. the problem
is that the replica does not "expire" fields by it's own. it needs to
respect the request from the primary and always try to use the existing
field. This can lead to either miss-alignment with the value on the
primary and the replica AND even a disconnection since the replica might
hold and "expired" field which is not in "integer" format...
2. HINCRBYFLOAT is currently ALWAYS propagating `hset` - this means that
the expiration time of an entry will always be removed on the replica
side (it needs to propagate HSETEX when expiration time needs to be
maintained)
3. Currently all hash write commands which are mutating values might
overwrite an expired field. In such cases the existing implementation
will "silently" do so. The problem is that the user will not get any
key-space-notificaiton explaining the reason for the behavior. For
example, when `hincrby` is issued overwriting an expired field which was
not yet "cleaned" by active-expiration it will reset the counter to '0'
before incrementing it. this means that the user might ask: why is the
value '1' and not bigger, "I did not see any notification that the old
value expired"...
4. HSETEX with KEEPTTL suffers from a "somewhat" similar problem as
#(1). the replica will receive the propagated command, but will not know
if the primary "replaced" the entry which is expired now but might not
have been expired when the primary applied it.

There are 2 options for a solution:
1. we could propagate `hdel` for every entry we are "overwritting"
(batch them if we can)
2. propagate the commands "by effect". For example - have `hincrby`
always propagate either HSET or HSETEX. This will not solve the '#(4)'
problem above though, for which we might HAVE to propagate `hdel`

I tend to go with the second option. The reason is that it is expected
to have less impact on replication stream and should include less
processing time on the replicas and network traffic. Specifically for
HSETEX with KEEPTTL we will have to propagate the `hdel` in case we
overwritten an expired field, but that would help limit the impact of
this propagation.

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Sourav Singh Rawat <aidenfrostbite@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
…alkey-io#3086)

Currently after valkey-io#2652 in copy avoidance path we unconditionally track
`c->net_output_bytes_curr_cmd` even when `commandlog-request-larger-than
-1`. This PR provides ability to skip that accounting in copy avoidance
path based on config value. If `commandlog-request-larger-than -1` then
performance is same as before valkey-io#2652.

Also added tracking of `c->net_output_bytes_curr_cmd` in IO thread if it
is not tracked in main thread based on decision made in main thread.

Read Performance (write performance is the same):
```
With this change:
Summary:
  throughput summary: 2191732.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.720     0.072     1.743     1.919     2.647    23.983

Unstable: 
Summary:
  throughput summary: 1658197.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.299     0.120     2.343     2.503     3.319     4.791

Config:
commandlog-request-larger-than -1
^^ without this performance is just like on unstable branch
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
hz 10
maxmemory 2gb
```

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
I/O threads when active do busy polling for work. This puts the CPU time
near 100% per thread even when they aren't fully utilized. Tracking the
time an IO thread spends doing useful work vs waiting for work helps
understand the utilization of individual threads.

INFO fields:

    used_active_time_io_thread_%d

... where %d is the IO thread ID number (between 1 and the number of I/O
threads).

To get a percentage over a period of time, compare the delta of this
metric against a delta of the server's uptime or the INFO field
`server_time_usec`.

---------

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
This caused some unaligned load of server values on 32bit compilations.

Example:
https://github.com/valkey-io/valkey/actions/runs/21352969942/job/61491923381?pr=3111#step:6:3064

Some insights here:
valkey-io#3111 (comment)

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…alkey-io#3120)

The test is currently flakey, because until we merge:
valkey-io#3001
when the expiration time provided is in the past, and the field does not
exist the HSETX will just silently ignore it, without incrementing the
statistics.

I prefer to focus on writing a dedicated test for:
valkey-io#3001 and deflake this test now.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
On some Intel systems it was observed that TSC `ticksPerMicrosecond`
derived from `/proc/cpuinfo` was different from calibrated
`ticksPerMicrosecond` or kernel derived one, so it was decided to always
calibrate and look for `constant_tsc` flag.

Resolves valkey-io#2597

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
This PR adds the certificates validation at TLS load, rejects invalid
(expired/not-yet-valid) certificates:

Apply to all TLS config paths:
- Server certificates `tls-cert-file`
- Server-side client certificates `tls-client-cert-file`
- CA certificate file `tls-ca-cert-file` 
- CA certificate directory `tls-ca-cert-dir` (now eagerly loaded to be
consistent with file-based CAs)

Apply to both scenarios:
- Server startup (initial TLS load)
- Runtime reload vis `CONFIG SET`

### Implementation
- Added `isCertValid` function to check if an X509 certificate is within
its validity period (not expired, not future-dated)
- Added `areAllCaCertsValid` function to iterate through all loaded CA
certificates and validate them
- Added `loadCaCertDir` function to eagerly load all certificates from a
directory into the X509_STORE
- Modified `createSSLContext` to validate:
  - Server/client certificates immediately after loading
  - All CA certificates after loading from file/directory

### Test results

#### 1. Server startup (initial TLS load)
```
tls-cert-file ./tests/tls/server-expired.crt

41522:M 31 Dec 2025 16:13:18.851 # Server TLS certificate is invalid. Aborting TLS configuration.
41522:M 31 Dec 2025 16:13:18.851 # Failed to configure TLS. Check logs for more info.


tls-client-cert-file ./tests/tls/client-expired.crt

41557:M 31 Dec 2025 16:14:43.296 # Client TLS certificate is invalid. Aborting TLS configuration.
41557:M 31 Dec 2025 16:14:43.296 # Failed to configure TLS. Check logs for more info.


tls-ca-cert-file ./tests/tls/ca-expired.crt
tls-ca-cert-dir ./tests/tls/ca-expired

41567:M 31 Dec 2025 16:15:15.635 # One or more loaded CA certificates are invalid. Aborting TLS configuration.
41567:M 31 Dec 2025 16:15:15.635 # Failed to configure TLS. Check logs for more info.
```

#### 2. Runtime reload via CONFIG SET
```
127.0.0.1:6379> config set tls-cert-file ./tests/tls/server-expired.crt
(error) ERR CONFIG SET failed (possibly related to argument 'tls-cert-file') - Unable to update TLS configuration. Check server logs.

62975:M 02 Jan 2026 20:10:43.588 # Server TLS certificate is invalid. Aborting TLS configuration.
62975:M 02 Jan 2026 20:10:43.588 # Failed applying new configuration. Possibly related to new tls-cert-file setting. Restoring previous settings.


127.0.0.1:6379> config set tls-client-cert-file ./tests/tls/client-expired.crt
(error) ERR CONFIG SET failed (possibly related to argument 'tls-client-cert-file') - Unable to update TLS configuration. Check server logs.

62975:M 02 Jan 2026 20:10:57.972 # Client TLS certificate is invalid. Aborting TLS configuration.
62975:M 02 Jan 2026 20:10:57.972 # Failed applying new configuration. Possibly related to new tls-client-cert-file setting. Restoring previous settings.


127.0.0.1:6379> config set tls-ca-cert-file ./tests/tls/ca-expired.crt
127.0.0.1:6379> config set tls-ca-cert-dir ./tests/tls/ca-expired
(error) ERR CONFIG SET failed (possibly related to argument 'tls-ca-cert-file') - Unable to update TLS configuration. Check server logs.

62975:M 02 Jan 2026 20:10:50.175 # One or more loaded CA certificates are invalid. Aborting TLS configuration.
62975:M 02 Jan 2026 20:10:50.175 # Failed applying new configuration. Possibly related to new tls-ca-cert-file setting. Restoring previous settings.
```

---------

Signed-off-by: Yang Zhao <zymy701@gmail.com>
…mary after direct writes of keys with expiration (valkey-io#2953)

The bug mentioned in this
valkey-io#2952 has been fixed.

---------

Signed-off-by: li-benson <1260437731@qq.com>
Addresses: valkey-io#2908

This decouples the low-level LRU/LFU code from the high-level `server.h`
file.
* LRU/LFU specific config values were removed from `server.h` and
relocated to `lrulfu.h`
* A new LRULFU API was created to update the time (and policy) rather
than accessing globals from `server.`

This doesn't address the root of the original test failure in that
`server.h` may compile with different offsets (due to `off_t`) based on
the include order. See:
valkey-io#2908 (comment)

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
…alkey-io#2867)

Embedding the skiplist header reduces memory jumps, thus optimizing
sorted set query speed.

```
// Before
typedef struct zskiplist {
        struct zskiplistNode *header, *tail;
        unsigned long length;
        int level; /* Height of the skiplist. */
} zskiplist;


// After
typedef struct zskiplist {
        unsigned long length; 
        struct zskiplistNode *tail;
        /* Since the span at level 0 is always 1 or 0 , 
         * this field is instead used for storing the height of the skiplist. */
        struct zskiplistLevel {
                struct zskiplistNode *forward;
                unsigned long span;
        } level[ZSKIPLIST_MAXLEVEL];
} zskiplist;
```

---------

Signed-off-by: chzhoo <czawyx@163.com>
…uring startup (valkey-io#3067)

Before ca6aead, even if setlocale fails, we
won't do anything.
```
setlocale(LC_COLLATE,"");
```

And now we will exit the process if the setlocale fails, this resulted in process
startup failures on some machines. (some kind of breaking change?)
```
if (setlocale(LC_COLLATE,server.locale_collate) == NULL) {
    serverLog(LL_WARNING, "Failed to configure LOCALE for invalid locale name.");
    exit(1);
}
```

In this commit, if we fail to set the locale_collate through environment variables,
we maintain backward compatibility and do not exit.

In a case where we did exit before, we don't exit now, seems safe. It's not a breaking
change and the behavior in this case was undocumented.

Signed-off-by: Binbin <binloveplay1314@qq.com>
If valgrind is used then test for used_active_time_io_thread fails
because valgrind is slow, so skip it

```
[err]: Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl                                                                                                                     
Expected (7.941596 - 5.876588) < (1000/1000) (context: type eval line 53 cmd {assert {($used_active_time - $initial_active_times($i)) < ($sleep_time_ms/1000)}} proc ::test)
```

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
This change amends valkey-io#3086 in which `commandlog-request-larger-than` was
mixed up with `commandlog-reply-larger-than` while checking for config
values and in tests.

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…-io#3001)

In the original implementation of Hash Field Expiration
(valkey-io#2089), the HSETEX command was
implemented to report keyspace notifications only for performed changes.
This is mostly aligned with other Hash commands (for example, HDEL will
also not report `hdel` event for items which does not exist)
The HSETEX case is somewhat different and is more like the `HSET` case.
During HSETEX, after the command validations pass, items are ALWAYS
"added" to the object, even though they might not actually be added.
This case is the same for when the hash object is empty or when all the
provided fields do not exist in the object (as reported
[here](valkey-io#2998))

This PR changes the way `HSETEX` will report keyspace notifications so
that:
1. `hset` notification will ALWAYS be reported if all command
validations pass.
2. `hexpire` will be reported in case the command include an expiration
time (even past time)
3. `hxpired` will be reported in case the provided expiration time is in
the past (or 0)
4. `hdel` will be reported in case the hash exists (or created as part
of the command) and following the command execution it was left empty.
5. we will always return '1' as a return value of tHSETEX command which
passed all validations. Before that we returned 1 only if we applied the
change cross ALL the input fields, so in case some of them did not exist
and a past time was set we would return 0.

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
This PR fixes various issues with Lua module built on macOS using CMake
/ Apple Clang.

## Summary of changes

The Lua module build configuration has been updated to support macOS in
addition to Linux. Two key changes were made to resolve build issues on
Apple platforms.

First, the valkeylua shared library now explicitly includes `sha1.c` and
`rand.c` source files and links against the fpconv library, ensuring all
required symbols are available at link time.

Second, the --disable-new-dtags linker flag is now conditionally applied
only on Unix systems excluding macOS, since this flag is not supported
by the Apple linker.

In addition, enable position independent code compilation for the fpconv
static library by setting the `POSITION_INDEPENDENT_CODE` property. This
allows the library to be linked into shared libraries and
position-independent executables.

Last, the Lua library filename definition was previously hardcoded to
use the `.so` extension for all platforms. This change conditionally
sets the library name based on the target platform: `.so` for Linux/Unix
systems and `.dylib` for macOS (and other non-Linux Unix systems). This
ensures that the correct dynamic library extension is used when loading
the Lua module at runtime.

** Generated by CodeLite. **

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
murphyjacob4 and others added 27 commits April 27, 2026 11:05
…io#3554)

If not cleared, the job may no longer be valid by the time the client
goes to cleanup. This dangling reference could cause a crash if you set
slot-migration-log-max-len to 0 and are very unlucky.

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Remove eval script cache entries that belong to a scripting engine when
that engine is unregistered. This prevents the eval cache from retaining
dangling engine pointers and keeps the tracked script memory in sync
after engine shutdown.

The scripting engine unregister path now invokes a new eval cleanup
helper, which scans the cached scripts, drops matching entries from the
LRU list and dictionary, and adjusts cache memory accounting accordingly.

* scripting engine
* eval cache

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
…egacy files (valkey-io#2297) (valkey-io#3382)

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.

Changes:
  Moved and mapped 6 test files:
  - 03-failover-loop.tcl → Merged into existing failover.tcl
  - 04-resharding.tcl → resharding.tcl
  - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
  replica-migration-slow.tcl
  - 07-replica-migration.tcl → Merged into existing replica-migration.tcl
  - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:
  - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
  - Added process_is_alive check in cluster_util.tcl to fix an exception in
  failover tests caused by executing ps on dead processes
  - Heavy tests (resharding, replica-migration-slow) marked with slow tag and
  wrapped in run_solo to prevent resource contention in sanitizer environments
  - replica-migration-slow marked with valgrind:skip tag since it is very slow
  - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
  includes/, and helpers/
  - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
  - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
  tests are now included in ./runtest

Closes valkey-io#2297.

Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
)

The argument `count /= 2` modifies `count` as a side effect, and the
following `count /= 2` divides it again unnecessarily.
Since `count` is not used after this point, fix it by using `count / 2`
without the side effect and remove the redundant second assignment.

Signed-off-by: djk1027 <djk9510271@gmail.com>
This change was introduced in valkey-io#3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.

It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.

Signed-off-by: Binbin <binloveplay1314@qq.com>
I noticed this LTO compile warning in the eval code. Looks like it's
getting confused about an sds length, even though checked above. Just
added an assert to clarify.

The warning:
```c
    LINK valkey-server
eval.c: In function ‘evalExtractShebangFlags’:
eval.c:263:27: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
             *out_engine = zcalloc(engine_name_len + 1);
                           ^
zmalloc.c:324:7: note: in a call to allocation function ‘valkey_calloc’ declared here
 void *zcalloc(size_t size) {
       ^
```

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
…y-io#2823)

## Background

Add structured datasets loading capability. Support CSV and TSV file
formats. Use `__field:fieldname__` placeholders to replace the
corresponding fields from the dataset file. Support natural content size
of varying length. Allow mixed placeholder usage combining dataset
fields with random generators. Enable automatic field discovery from
CSV/TSV headers. Use `--maxdocs` to limit the dataset loading.

Rather than modifying the existing placeholder system, we detect field
placeholders and switch to a separate code path that builds commands
from scratch using `valkeyFormatCommandArgv()`. This ensures:

- Zero impact on existing functionality
- Full support for variable-size content
- Thread-safe atomic record iteration
- Compatible with pipelining and threading modes

__Usage examples__

```sh
# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"

# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"

# CSV dataset with document limit
./valkey-benchmark --dataset wiki.csv --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"

# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__
```

__Full-Text Search Benchmarking__

```sh
# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Search miss scenarios (non-existent terms)
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"
```

__Benchmark Results__


Test environment:
__Instance:__ AWS c7i.16xlarge, 64 vCPU

Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory

| Configuration | Throughput | CPU Usage | Wall Time | Memory Peak |
|---------------|------------|-----------|-----------|-------------|
| Single-threaded, P1 | 93,295 RPS | 99% | 71.4s | 5.8GB |
| Multi-threaded (10), P1 | 93,332 RPS | 137% | 71.5s | 5.8GB |
| Single-threaded, P10 | 274,499 RPS | 96% | 36.1s | 5.8GB |
| Multi-threaded (4), P10 | 344,589 RPS | 161% | 32.4s | 5.8GB |

---------

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
Address this compile warning:
```c
    CC util.o
util.c:638:1: warning: ‘no_sanitize’ attribute directive ignored [-Wattributes]
 __attribute__((no_sanitize_address, no_sanitize("thread"), used)) static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
 ^~~~~~~~~~~~~
```
Addresses portability concerns around these attributes.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Free BYPOLYGON points before returning from invalid COUNT parsing paths
in GEOSEARCH/GEOSEARCHSTORE.

Closes valkey-io#3567

---------

Signed-off-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
When read() returns 0 (EOF/connection closed) in syncRead(), errno is
not set by POSIX, so it retains a stale value (typically 0). This causes
callers using connGetLastError() to log strerror(0) which is the
misleading string "Success".

Set errno = ECONNRESET on EOF in syncRead(), matching the existing
pattern used for the timeout case (errno = ETIMEDOUT).

Also set conn->last_errno = errno in connSocketSyncWrite,
connSocketSyncRead, and connSocketSyncReadLine wrappers, matching the
pattern used by their async counterparts connSocketWrite and
connSocketRead.

After this fix, replica logs will show:
  "I/O error reading bulk count from PRIMARY: Connection reset by peer"
instead of the misleading:
  "I/O error reading bulk count from PRIMARY: Success"

---------

Signed-off-by: Abhishek Mathur <matshek@amazon.com>
Signed-off-by: djk1027 <djk9510271@gmail.com>
Co-authored-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Daejun Kim <djk9510271@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
valkey-io#3586)

clusterManagerNodePrimaryRandom() called srand(time(NULL)) on every
invocation, then immediately rand() % primary_count. When called in a
tight loop for uncovered slots, all calls within the same wall-clock
second produce the identical seed, causing every uncovered slot to be
assigned to the same primary node.

Remove the srand() call since the PRNG is already seeded at startup
(srand(time(NULL) ^ getpid()) at line 9838). This allows rand() to
advance its state across calls, distributing uncovered slots randomly
across available primaries.

---------

Signed-off-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
…o#3598)

In `getKeysUsingKeySpecs`, when extracting keys based on the
`KSPEC_FK_KEYNUM `spec (like in the `EVAL` command), the server read the
number of keys from the arguments and calculated the expected end index.

However, it called `getKeysPrepareResult` to allocate memory for the
result array before validating whether last was within the bounds of the
actual arguments provided.

If a client sent a command with a huge declared number of keys (e.g.,
`COMMAND GETKEYS EVAL "return 1" 2147483647 key1`), the server would
allocate a massive amount of memory. Since `vm.overcommit_memory` is
recommended, this allocation would NOT normally have triggered OOM (we
never wrote to it so there is no physical memory allocated), but if you
disable overcommit, this could trigger an OOM.

You can reproduce it with:

```
$ prlimit --as=1073741824 src/valkey-server --save ""
...
384270:M 30 Apr 2026 04:27:24.456 * Ready to accept connections tcp

...
<in valkey-cli>
127.0.0.1:6379> command getkeys eval "return 1" 2147483647 key1

...
<in server log>
384270:M 30 Apr 2026 04:29:26.950 # Out Of Memory allocating 17179869176 bytes!
```

## Solution

* Moved the bounds check `if (last >= argc || last < first || first >=
argc)` to execute before the call to `getKeysPrepareResult`, preventing
the large allocation on invalid input.
* To further catch issues like this, protected against integer overflow
during the calculation of last by using a long long temporary variable.
If it exceeds INT_MAX or falls below INT_MIN, the spec is marked invalid
immediately.

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
…alkey-io#3592)

Fixes valkey-io#1905

## Summary
The direct use of `je_calloc` in `src/allocator_defrag.c` causes
compilation failures on systems (e.g., Arch Linux with GCC 14.2.1) where
`calloc` is marked as deprecated and `-Werror=deprecated` is enabled.

## Changes
Replace the two `je_calloc` calls in `allocatorDefragInit()` with
`zcalloc_num`, which is the proper Valkey allocation wrapper that
provides the same semantics (num × size with zero-fill) without directly
invoking the deprecated `calloc` symbol.

## Testing
- Build compiles cleanly
- Integration tests pass (unit/memefficiency, defrag, unit/other — 51
passed, 0 failed)

Signed-off-by: jaduffy <jaduffy@amazon.com>
…et node disappears (valkey-io#3596)

### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()`
(`src/cluster_migrateslots.c`) that can crash a Valkey cluster node,
causing a
denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state
`SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the
target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the
target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the
time the migration job is created and the time the cron fires. This is a
realistic race condition in any cluster topology change scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer,
crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference

Signed-off-by: chenshi5012 <chenshi5012@163.com>
…io#3591)

When XTRIM marks the last entry in a listpack node as deleted, lpNext()
returns NULL after the lp-count field (EOF). The delta calculation (p -
lp) on a NULL pointer is undefined behavior and produces a garbage
pointer, corrupting the listpack. A subsequent XREAD hitting the
corrupted node triggers the lpValidateNext assertion failure and crashes
the server.

Guard the delta calculation with a NULL check so the while(p) loop
terminates naturally when the last entry is reached.

Fixes valkey-io#3569

Signed-off-by: Saurabh Kher <saurabh@amazon.com>
Co-authored-by: Saurabh Kher <saurabh@amazon.com>
…3601)

The function lpEncodeBacklen() uses `<= 127` for the 1-byte case but `<
16383`, `< 2097151`, and `< 268435455` for the subsequent cases. This
means the exact values 16383, 2097151, and 268435455 (i.e. 2^14-1,
2^21-1, 2^28-1) unnecessarily use one extra byte than needed:

- `l < 16383` → `16383` (2^14-1) uses 3 bytes instead of 2
- `l < 2097151` → `2097151` (2^21-1) uses 4 bytes instead of 3
- `l < 268435455` → `268435455` (2^28-1) uses 5 bytes instead of 4

The decoding side (`lpDecodeBacklen`) is unaffected since it parses
continuation bits continuously without discrete range checks.

This is a correctness issue and has no impact on data integrity since
encoding and decoding use the same function boundaries, but it wastes up
to 1 byte per affected entry.

Signed-off-by: fanpei91 <fanpei91@gmail.com>
Big endian support on Valkey is "best effort" and not guaranteed, but we
haven't been doing any regular testing at all afaik. This PR adds a job
to the daily workflow to run UTs on an emulated big endian platform.
Integration tests failed excessively because of how slow emulation is.

I fixed several problems with tests and improved UT coverage of key
points where endian byte order matters - and fwiw I didn't find any
bugs. I think the main coverage gap remaining after this is RDB
serialization (maybe little endian <-> big endian round trips?)

There are couple lines of endian-specific code for valkey-io#3166 and this change
can test it.

Signed-off-by: Rain Valentine <rsg000@gmail.com>
It's important to enabled ASAN on run-extra-tests label so we can
catch some of the bugs in the PRs before they are merged into unstable.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri

Copy link
Copy Markdown
Owner Author

Closing - switching to direct daily.yml dispatch per-release-branch

@roshkhatri roshkhatri closed this May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.