Skip to content

[backport] Backport sweep for 9.1#3774

Merged
ranshid merged 15 commits into
9.1from
agent/backport/sweep/9.1
Jun 11, 2026
Merged

[backport] Backport sweep for 9.1#3774
ranshid merged 15 commits into
9.1from
agent/backport/sweep/9.1

Conversation

@valkeyrie-ops

@valkeyrie-ops valkeyrie-ops Bot commented May 19, 2026

Copy link
Copy Markdown

Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

Applied

Source PR Title Detail
#3743 Fix buffered_reply assert in HFE commands with module keyspace notifications cherry-picked in a prior sweep
#3766 Fix flaky block_keyspace_notification test for HGETDEL notify race cherry-picked in a prior sweep
#3800 Fix heap-use-after-free in ACL LOAD when client free is deferred cherry-picked in a prior sweep
#3723 Fix double-finish and RESP reply violation in cluster slot migration cherry-picked in a prior sweep
#3872 Redacting customer information when hide_user_data_from_log is true in rdb.c, networking.c, debug.c and t_hash cherry-picked in a prior sweep
#3846 Fix use-after-free in VM_RegisterClusterMessageReceiver cherry-picked in a prior sweep
#3806 Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL cherry-picked in a prior sweep
#3847 Harden SENTINEL commands and config rewrite against control-character injection
#3801 Validate every DB clause in COPY against ACL db= permissions
#3851 Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token
#3848 Fix cluster AUX-field control-character and delimiter injection
#3544 Revert "IO-Threads redesign cleanup work (#3544)" cherry-picked in a prior sweep
#3888 Report exact dbid for COPY in ACL LOG when db= access is denied conflicts resolved by Claude Code
#3942 Fix shard_id format specifier in UPDATE message log
#3941 Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30

Generated by valkey-ci-agent using Claude Code.

@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2689ac00-661d-4bcf-aedd-00f97716e17c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 69.95074% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.79%. Comparing base (5789b3c) to head (279cd8d).

Files with missing lines Patch % Lines
src/sentinel.c 0.00% 32 Missing ⚠️
src/rdb.c 23.07% 10 Missing ⚠️
src/debug.c 60.00% 6 Missing ⚠️
src/networking.c 61.53% 5 Missing ⚠️
src/io_threads.c 83.33% 4 Missing ⚠️
src/t_hash.c 66.66% 3 Missing ⚠️
src/module.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              9.1    #3774      +/-   ##
==========================================
+ Coverage   76.55%   76.79%   +0.24%     
==========================================
  Files         163      163              
  Lines       81027    81098      +71     
==========================================
+ Hits        62030    62281     +251     
+ Misses      18997    18817     -180     
Files with missing lines Coverage Δ
src/acl.c 92.56% <100.00%> (+0.03%) ⬆️
src/cluster.c 92.66% <100.00%> (ø)
src/cluster_legacy.c 88.62% <100.00%> (+0.07%) ⬆️
src/cluster_migrateslots.c 92.14% <100.00%> (ø)
src/commands.def 100.00% <ø> (ø)
src/config.c 78.56% <100.00%> (+0.46%) ⬆️
src/db.c 94.81% <100.00%> (+0.03%) ⬆️
src/multi.c 97.23% <100.00%> (-0.68%) ⬇️
src/queues.c 99.32% <100.00%> (-0.03%) ⬇️
src/server.h 100.00% <ø> (ø)
... and 8 more

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@valkeyrie-ops valkeyrie-ops Bot force-pushed the agent/backport/sweep/9.1 branch from 2dfead7 to 43f3274 Compare May 20, 2026 10:09
@sarthakaggarwal97 sarthakaggarwal97 requested a review from ranshid May 20, 2026 18:44
sarthakaggarwal97 added a commit to valkey-io/valkey-ci-agent that referenced this pull request May 22, 2026
…te missing labels (#9)

* Filter project-board PRs by source repo in backport sweep

The backport sweep iterates every PullRequest item on the configured
project board and treats anything tagged 'To be backported' as a
candidate. Project boards aggregate PRs across an entire org, so PRs
from sibling repos (e.g. valkey-io.github.io blog posts on the
valkey-9.1 board) were flowing through to the cherry-pick path and
failing with:

    git fetch origin <merge-sha>
    fatal: couldn't find remote ref <merge-sha>
    exit 128

because the SHA only exists in the source PR's repo, not in the one
we're sweeping into.

Fix: project items now carry the source repo, and the discovery filters
candidates whose repository.nameWithOwner doesn't match the configured
target. The GraphQL query selects the new field; if a (cached) payload
lacks it the candidate is kept (we don't want to start refusing PRs
silently — the repo filter is a refinement, not a hard requirement).

Concretely, this is what was happening on PR
valkey-io/valkey#3774 — the per-candidate table
listed PR #553 'Add Valkey 9.1 release announcement blog post' with
'error  Command [git fetch origin f3ea299c...] returned non-zero exit
status 128.' That commit lives in valkey-io/valkey-io.github.io, not
in valkey-io/valkey.

6 new tests cover: cross-repo PR is dropped, matching-repo PR is kept,
unmerged/wrong-status PRs are dropped before the repo check, missing
repository field doesn't block (forward-compat), and a regression guard
that the GraphQL query selects nameWithOwner.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

* Auto-create missing labels on backport PR creation

Backport PRs opened by the agent on valkey-io/valkey were shipping
unlabeled because the configured 'backport' and 'ai-resolved-conflicts'
labels do not exist on the target repo. The label-apply call failed
silently with a warning and the run continued.

Now BackportPRCreator ensures each label exists on the repo before
applying it, creating any missing label with sensible color and
description defaults. A 422 from create_label (concurrent creation race)
is treated as success; other failures are logged at error level.

Also drop the unused 'llm-resolved-conflicts' default. Production config
in repos.yml uses 'ai-resolved-conflicts'; align all code defaults to
match so we have a single label for AI-resolved conflicts.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
enjoy-binbin and others added 15 commits June 10, 2026 10:25
…cations (#3743)

When a module subscribes to NOTIFY_HASH keyspace events and blocks the
client in the notification callback, hexpireGenericCommand,
hgetdelCommand,
and hpersistCommand would trigger debugServerAssert in
notifyKeyspaceEvent()
because addReplyArrayLen() sets buffered_reply=1 before the notification
fires. This debug assert was added in #1819.

Affected commands: HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETDEL,
HPERSIST.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
…3766)

The HFE-blocking-keyspace-notify test added in #3743 is racy and
intermittently fails with:

```
Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}'
(context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test)
```

## Why it races

When `HGETDEL` empties the hash, `hgetdelCommand` fires two
notifications back-to-back on the main thread:

1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)`
2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was
removed)

The `block_keyspace_notification` test module's callback always spawns a
background thread that sleeps 1s, appends to the event log, and
optionally unblocks the client. For the second notification,
`RM_BlockClient` returns `NULL` (the client is already blocked by the
first), so the second thread never unblocks anyone — it only appends
`"del"` to the log.

The first thread is the one that unblocks the client, so `HGETDEL`
returns as soon as that thread has logged `"hdel"`. The test client then
immediately reads `b_keyspace.events`, racing the second thread which
has not yet acquired the event-log mutex. On a loaded host or under
valgrind the second thread routinely loses the race, leaving only
`"hdel"` visible.

## Fix

`wait_for_condition` until both events have been logged before asserting
on their contents. This matches the pattern already used by the four
other event-log assertions in this file (e.g. the `Server-created
keyspace notification` and `Event that fires twice` blocks).

```tcl
wait_for_condition 50 100 {
    [llength [r b_keyspace.events]] >= 2
} else {
    fail "Did not see both hdel and del events: [r b_keyspace.events]"
}
assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]
```

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
### Bug

When `ACL LOAD` deletes a user, it calls `freeClientOrCloseLater()` on
clients authenticated as that user. If `freeClient()` defers the actual
free (because the client has `flag.protected` set or has pending IO),
the client remains in `server.clients` with `c->user` still pointing to
the old user object.

After `raxFreeWithCallback(old_users, ACLFreeUserVoid)` frees all old
user objects, `c->user` becomes a dangling pointer. Any subsequent code
that dereferences `c->user` on the deferred client triggers a
heap-use-after-free.

### Crash sequence

1. Client A runs `ACL LOAD`
2. Client B is authenticated as a user that was removed from the ACL
file
3. Client B has `flag.protected` set (e.g. has pending IO (IO threads)
4. `ACLLoadFromFile` → `freeClientOrCloseLater(B, 0)` → `freeClient(B)`
→ defers to `freeClientAsync` because B is protected
5. `raxFreeWithCallback(old_users, ...)` frees all old user objects
including B's user
6. B is still in `server.clients` with `c->user` pointing to freed
memory
7. `CLIENT LIST` or any path that calls `catClientInfoString`
dereferences `c->user->name` → **UAF**

### Fix

Set `c->user` to default user before calling `freeClientOrCloseLater()`
in `ACLLoadFromFile`. The existing NULL guard in `catClientInfoString`
(`client->user ? client->user->name : "(superuser)"`) handles this
safely.

### Test

Added `DEBUG PROTECT-CLIENT <client-id>` subcommand that calls
`protectClient()` on the target, forcing `freeClient` to defer to async.
This enables deterministic reproduction without timing races.

The test:
1. Creates a user, connects a client as that user
2. Protects the client via `DEBUG PROTECT-CLIENT`
3. Removes the user from the ACL file and runs `ACL LOAD`
4. Runs `CLIENT LIST` which dereferences `c->user->name`

Without the fix, this crashes under ASAN with:
```
ERROR: AddressSanitizer: heap-use-after-free in catClientInfoString
READ of size 8
freed by: raxRecursiveFree → raxFreeWithCallback → ACLLoadFromFile
```

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…3723)

1. Fix double-call to finishSlotMigrationJob in
clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the
function was called once with 'assigned to myself' message, then fell
through to call it again with 'no longer owned by source'. Added else
branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish():
addReply(c, shared.ok) was sent unconditionally before validating the
job name and state, so on error the client would receive both +OK and
-ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are
primary to replica connections and fake AOF clients. Both of those turn
off replies, so the reply doesn't actually get parsed and therefore the
violation has no effect. But good to clean it up

Signed-off-by: chx9 <lovelypiska@outlook.com>
…n rdb.c, networking.c, debug.c and t_hash (#3872)

Removing logging key names when hide_user_data_from_log is true in rdb.c
file as well as networking.c, t_hash.c and debug.c files

Made simple if else in order to distinguish for read ability

---------

Signed-off-by: zackcam <zackcam@amazon.com>
When unregistering a cluster message receiver (callback = NULL) for the
head node of the linked list, the code incorrectly updated
clusterReceivers[type]->next instead of clusterReceivers[type] itself.
This left the array entry pointing to freed memory, causing a
use-after-free on any subsequent list traversal.

---------

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
CLUSTER FLUSHSLOT internally calls delKeysInSlot() which iterates
over all databases and removes the slot's keys from every database,
just like FLUSHALL command will touch all the databases. And it was
missing the ALL_DBS command flag introduced in #2309.

Add test to cover this case, also do a extra cleanup, replace the
manual start_multiple_servers with a single start_cluster call.

Signed-off-by: Binbin <binloveplay1314@qq.com>
… injection (#3847)

Reject any SENTINEL SET argument containing control characters
(0x00-0x1F, 0x7F) to prevent config injection via newline-embedded
values.

---------

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
copyDbIdArgs assumed the DB token was always at argv[3], so
'COPY src dst REPLACE DB n' or 'COPY src dst DB a DB b' could
bypass db= restrictions. Mirror copyCommand's parser to walk
all argv tokens, collect every DB destination, and let the ACL
checker validate each one.

DB ACL was added in #2309.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Use actions/create-github-app-token to generate a scoped installation
token for valkey-release-automation instead of the broad AUTOMATION_PAT.

Part of valkey-io/valkey-release-automation#53

---------

Signed-off-by: Jules Lasarte <lasartej@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Co-authored-by: Jules Lasarte <lasartej@amazon.com>
Harden isValidAuxChar() to reject control characters (0x00-0x1F, 0x7F)
and format-significant delimiters (comma, equals, quotes, backslash,
space) that could corrupt `nodes.conf` parsing or enable injection of
crafted node entries on restart.

Additionally, add a validator for `cluster-announce-ip` which previously
had no input validation, allowing arbitrary bytes to be persisted into
the cluster config file.

---------

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
This reverts the
[commit](fdd9039)
that was merged as part of the PR #3544 due to a performance regression
observed [here](#3750)

Signed-off-by: akash kumar <akumdev@amazon.com>
Rework commandDbIdArgs so each helper returns the argv index of every dbid
argument it owns. ACLSelectorCheckCmd uses those positions to set keyidxptr
directly, replacing the cmd->proc-based offset table. The caller now reads
the dbid value via getLongLongFromObject(argv[positions[i]], ...). New
db=-checked commands only need to implement their own *DbIdArgs helper.

The fix is for COPY: the old chain fell through to 0, so the "object"
field in ACL LOG was always showed "copy", it now reports the first
denied dbid.

Based on #3801, also see it for more details, DB ACL was added in #2309.

Signed-off-by: Binbin <binloveplay1314@qq.com>
clusterNode.shard_id is a fixed-size char[CLUSTER_NAMELEN] buffer
that is not guaranteed to be NUL-terminated, so it must be printed
with %.40s.

This was introduced in #2510.

Signed-off-by: Binbin <binloveplay1314@qq.com>
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@valkeyrie-ops valkeyrie-ops Bot force-pushed the agent/backport/sweep/9.1 branch from bea3028 to 279cd8d Compare June 10, 2026 10:27
@ranshid

ranshid commented Jun 11, 2026

Copy link
Copy Markdown
Member
  1. The valgrind tests are failing on CLUSTER MYSHARDID reports same shard id after cluster restart which is something that is fixed by Deflake cluster-shards "same shard id after restart" test #3949 but also carry dependency on fix: add CLUSTER SAVECONFIG before pause in shard restart tests #3904. I suggest to decide on backporting them separetly

  2. the fedora tests are failing on what I think is a timing race in a recently-added test.
    The failing test is: Improve CLUSTERSCAN error handling test with broader coverage #3674 "Improve CLUSTERSCAN error handling test with broader coverage".
    The test expects: unassigned slot → Hash slot not served, even while the cluster is fail. That requires the command to evaluate slot ownership before global cluster state.
    From the server logs, this window is noisy: R0 (dac26c…, the slot-0 owner) was paused in Case 1, resumed in Case 2, and the peers were still flapping their view of it (Marking node R0 as failing (quorum reached) at 20.864 → ok at 21.003 → fail at 21.270). wait_for_cluster_state fail only guarantees the cluster is down — not why. So the clusterscan ran against a cluster whose FAIL state wasn't yet attributable to "slot 0 is unbound" from R0's authoritative view, and the generic The cluster is down won over Hash slot not served.

  3. the DCO are failing since the authors github account and signoff commit account is different and in this case the author is added to the PR after a squash merge of the PR. I feel safe to override the DCO in this case since the emails are correct.

@enjoy-binbin can you TAL at this and approve (or not) the backport as well?

@ranshid ranshid requested a review from enjoy-binbin June 11, 2026 07:49
@enjoy-binbin

Copy link
Copy Markdown
Member

#3959 fix one of the race, @ranshid Do you wanna take a look and see if it is a same case?

@ranshid

ranshid commented Jun 11, 2026

Copy link
Copy Markdown
Member

#3959

@enjoy-binbin makes sense! I agree this seems like should solve the second race issue. I think we should backport it, but I would prefer to complete this batch of backports and focus on stablizing the daily CI for 9.1 as a followup. WDYT?

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound good to me.

@ranshid ranshid merged commit 71545e0 into 9.1 Jun 11, 2026
97 of 100 checks passed
@valkeyrie-ops valkeyrie-ops Bot deleted the agent/backport/sweep/9.1 branch June 11, 2026 10:32
@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

@ranshid I think you squashed and merged the backport PR. It messes a bit with branch commit history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants