[backport] Backport sweep for 9.1 by valkeyrie-ops[bot] · Pull Request #3774 · valkey-io/valkey

valkeyrie-ops · 2026-05-19T10:35:15Z

Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

Applied

Source PR	Title	Detail
#3743	Fix buffered_reply assert in HFE commands with module keyspace notifications	cherry-picked in a prior sweep
#3766	Fix flaky block_keyspace_notification test for HGETDEL notify race	cherry-picked in a prior sweep
#3800	Fix heap-use-after-free in ACL LOAD when client free is deferred	cherry-picked in a prior sweep
#3723	Fix double-finish and RESP reply violation in cluster slot migration	cherry-picked in a prior sweep
#3872	Redacting customer information when hide_user_data_from_log is true in rdb.c, networking.c, debug.c and t_hash	cherry-picked in a prior sweep
#3846	Fix use-after-free in VM_RegisterClusterMessageReceiver	cherry-picked in a prior sweep
#3806	Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL	cherry-picked in a prior sweep
#3847	Harden SENTINEL commands and config rewrite against control-character injection
#3801	Validate every DB clause in COPY against ACL db= permissions
#3851	Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token
#3848	Fix cluster AUX-field control-character and delimiter injection
#3544	Revert "IO-Threads redesign cleanup work (#3544)"	cherry-picked in a prior sweep
#3888	Report exact dbid for COPY in ACL LOG when db= access is denied	conflicts resolved by Claude Code
#3942	Fix shard_id format specifier in UPDATE message log
#3941	Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30

Generated by valkey-ci-agent using Claude Code.

coderabbitai · 2026-05-19T10:35:21Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2689ac00-661d-4bcf-aedd-00f97716e17c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-19T11:37:44Z

Codecov Report

❌ Patch coverage is 69.95074% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.79%. Comparing base (5789b3c) to head (279cd8d).

Files with missing lines	Patch %	Lines
src/sentinel.c	0.00%	32 Missing ⚠️
src/rdb.c	23.07%	10 Missing ⚠️
src/debug.c	60.00%	6 Missing ⚠️
src/networking.c	61.53%	5 Missing ⚠️
src/io_threads.c	83.33%	4 Missing ⚠️
src/t_hash.c	66.66%	3 Missing ⚠️
src/module.c	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              9.1    #3774      +/-   ##
==========================================
+ Coverage   76.55%   76.79%   +0.24%     
==========================================
  Files         163      163              
  Lines       81027    81098      +71     
==========================================
+ Hits        62030    62281     +251     
+ Misses      18997    18817     -180

Files with missing lines	Coverage Δ
src/acl.c	`92.56% <100.00%> (+0.03%)`	⬆️
src/cluster.c	`92.66% <100.00%> (ø)`
src/cluster_legacy.c	`88.62% <100.00%> (+0.07%)`	⬆️
src/cluster_migrateslots.c	`92.14% <100.00%> (ø)`
src/commands.def	`100.00% <ø> (ø)`
src/config.c	`78.56% <100.00%> (+0.46%)`	⬆️
src/db.c	`94.81% <100.00%> (+0.03%)`	⬆️
src/multi.c	`97.23% <100.00%> (-0.68%)`	⬇️
src/queues.c	`99.32% <100.00%> (-0.03%)`	⬇️
src/server.h	`100.00% <ø> (ø)`
... and 8 more

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…te missing labels (#9) * Filter project-board PRs by source repo in backport sweep The backport sweep iterates every PullRequest item on the configured project board and treats anything tagged 'To be backported' as a candidate. Project boards aggregate PRs across an entire org, so PRs from sibling repos (e.g. valkey-io.github.io blog posts on the valkey-9.1 board) were flowing through to the cherry-pick path and failing with: git fetch origin <merge-sha> fatal: couldn't find remote ref <merge-sha> exit 128 because the SHA only exists in the source PR's repo, not in the one we're sweeping into. Fix: project items now carry the source repo, and the discovery filters candidates whose repository.nameWithOwner doesn't match the configured target. The GraphQL query selects the new field; if a (cached) payload lacks it the candidate is kept (we don't want to start refusing PRs silently — the repo filter is a refinement, not a hard requirement). Concretely, this is what was happening on PR valkey-io/valkey#3774 — the per-candidate table listed PR #553 'Add Valkey 9.1 release announcement blog post' with 'error Command [git fetch origin f3ea299c...] returned non-zero exit status 128.' That commit lives in valkey-io/valkey-io.github.io, not in valkey-io/valkey. 6 new tests cover: cross-repo PR is dropped, matching-repo PR is kept, unmerged/wrong-status PRs are dropped before the repo check, missing repository field doesn't block (forward-compat), and a regression guard that the GraphQL query selects nameWithOwner. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> * Auto-create missing labels on backport PR creation Backport PRs opened by the agent on valkey-io/valkey were shipping unlabeled because the configured 'backport' and 'ai-resolved-conflicts' labels do not exist on the target repo. The label-apply call failed silently with a warning and the run continued. Now BackportPRCreator ensures each label exists on the repo before applying it, creating any missing label with sensible color and description defaults. A 422 from create_label (concurrent creation race) is treated as success; other failures are logged at error level. Also drop the unused 'llm-resolved-conflicts' default. Production config in repos.yml uses 'ai-resolved-conflicts'; align all code defaults to match so we have a single label for AI-resolved conflicts. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

…cations (#3743) When a module subscribes to NOTIFY_HASH keyspace events and blocks the client in the notification callback, hexpireGenericCommand, hgetdelCommand, and hpersistCommand would trigger debugServerAssert in notifyKeyspaceEvent() because addReplyArrayLen() sets buffered_reply=1 before the notification fires. This debug assert was added in #1819. Affected commands: HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETDEL, HPERSIST. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>

…3766) The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with: ``` Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' (context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test) ``` ## Why it races When `HGETDEL` empties the hash, `hgetdelCommand` fires two notifications back-to-back on the main thread: 1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)` 2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was removed) The `block_keyspace_notification` test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, `RM_BlockClient` returns `NULL` (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends `"del"` to the log. The first thread is the one that unblocks the client, so `HGETDEL` returns as soon as that thread has logged `"hdel"`. The test client then immediately reads `b_keyspace.events`, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only `"hdel"` visible. ## Fix `wait_for_condition` until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the `Server-created keyspace notification` and `Event that fires twice` blocks). ```tcl wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]] ``` Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

### Bug When `ACL LOAD` deletes a user, it calls `freeClientOrCloseLater()` on clients authenticated as that user. If `freeClient()` defers the actual free (because the client has `flag.protected` set or has pending IO), the client remains in `server.clients` with `c->user` still pointing to the old user object. After `raxFreeWithCallback(old_users, ACLFreeUserVoid)` frees all old user objects, `c->user` becomes a dangling pointer. Any subsequent code that dereferences `c->user` on the deferred client triggers a heap-use-after-free. ### Crash sequence 1. Client A runs `ACL LOAD` 2. Client B is authenticated as a user that was removed from the ACL file 3. Client B has `flag.protected` set (e.g. has pending IO (IO threads) 4. `ACLLoadFromFile` → `freeClientOrCloseLater(B, 0)` → `freeClient(B)` → defers to `freeClientAsync` because B is protected 5. `raxFreeWithCallback(old_users, ...)` frees all old user objects including B's user 6. B is still in `server.clients` with `c->user` pointing to freed memory 7. `CLIENT LIST` or any path that calls `catClientInfoString` dereferences `c->user->name` → **UAF** ### Fix Set `c->user` to default user before calling `freeClientOrCloseLater()` in `ACLLoadFromFile`. The existing NULL guard in `catClientInfoString` (`client->user ? client->user->name : "(superuser)"`) handles this safely. ### Test Added `DEBUG PROTECT-CLIENT <client-id>` subcommand that calls `protectClient()` on the target, forcing `freeClient` to defer to async. This enables deterministic reproduction without timing races. The test: 1. Creates a user, connects a client as that user 2. Protects the client via `DEBUG PROTECT-CLIENT` 3. Removes the user from the ACL file and runs `ACL LOAD` 4. Runs `CLIENT LIST` which dereferences `c->user->name` Without the fix, this crashes under ASAN with: ``` ERROR: AddressSanitizer: heap-use-after-free in catClientInfoString READ of size 8 freed by: raxRecursiveFree → raxFreeWithCallback → ACLLoadFromFile ``` --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

…3723) 1. Fix double-call to finishSlotMigrationJob in clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the function was called once with 'assigned to myself' message, then fell through to call it again with 'no longer owned by source'. Added else branch to make the two error paths mutually exclusive. 2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish(): addReply(c, shared.ok) was sent unconditionally before validating the job name and state, so on error the client would receive both +OK and -ERR. Moved the OK reply after all validations pass. Note that it is an internal command and the only clients sending it are primary to replica connections and fake AOF clients. Both of those turn off replies, so the reply doesn't actually get parsed and therefore the violation has no effect. But good to clean it up Signed-off-by: chx9 <lovelypiska@outlook.com>

…n rdb.c, networking.c, debug.c and t_hash (#3872) Removing logging key names when hide_user_data_from_log is true in rdb.c file as well as networking.c, t_hash.c and debug.c files Made simple if else in order to distinguish for read ability --------- Signed-off-by: zackcam <zackcam@amazon.com>

When unregistering a cluster message receiver (callback = NULL) for the head node of the linked list, the code incorrectly updated clusterReceivers[type]->next instead of clusterReceivers[type] itself. This left the array entry pointing to freed memory, causing a use-after-free on any subsequent list traversal. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>

CLUSTER FLUSHSLOT internally calls delKeysInSlot() which iterates over all databases and removes the slot's keys from every database, just like FLUSHALL command will touch all the databases. And it was missing the ALL_DBS command flag introduced in #2309. Add test to cover this case, also do a extra cleanup, replace the manual start_multiple_servers with a single start_cluster call. Signed-off-by: Binbin <binloveplay1314@qq.com>

… injection (#3847) Reject any SENTINEL SET argument containing control characters (0x00-0x1F, 0x7F) to prevent config injection via newline-embedded values. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>

copyDbIdArgs assumed the DB token was always at argv[3], so 'COPY src dst REPLACE DB n' or 'COPY src dst DB a DB b' could bypass db= restrictions. Mirror copyCommand's parser to walk all argv tokens, collect every DB destination, and let the ACL checker validate each one. DB ACL was added in #2309. Signed-off-by: Binbin <binloveplay1314@qq.com>

Use actions/create-github-app-token to generate a scoped installation token for valkey-release-automation instead of the broad AUTOMATION_PAT. Part of valkey-io/valkey-release-automation#53 --------- Signed-off-by: Jules Lasarte <lasartej@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Co-authored-by: Jules Lasarte <lasartej@amazon.com>

Harden isValidAuxChar() to reject control characters (0x00-0x1F, 0x7F) and format-significant delimiters (comma, equals, quotes, backslash, space) that could corrupt `nodes.conf` parsing or enable injection of crafted node entries on restart. Additionally, add a validator for `cluster-announce-ip` which previously had no input validation, allowing arbitrary bytes to be persisted into the cluster config file. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>

This reverts the [commit](fdd9039) that was merged as part of the PR #3544 due to a performance regression observed [here](#3750) Signed-off-by: akash kumar <akumdev@amazon.com>

Rework commandDbIdArgs so each helper returns the argv index of every dbid argument it owns. ACLSelectorCheckCmd uses those positions to set keyidxptr directly, replacing the cmd->proc-based offset table. The caller now reads the dbid value via getLongLongFromObject(argv[positions[i]], ...). New db=-checked commands only need to implement their own *DbIdArgs helper. The fix is for COPY: the old chain fell through to 0, so the "object" field in ACL LOG was always showed "copy", it now reports the first denied dbid. Based on #3801, also see it for more details, DB ACL was added in #2309. Signed-off-by: Binbin <binloveplay1314@qq.com>

clusterNode.shard_id is a fixed-size char[CLUSTER_NAMELEN] buffer that is not guaranteed to be NUL-terminated, so it must be printed with %.40s. This was introduced in #2510. Signed-off-by: Binbin <binloveplay1314@qq.com>

…#3941) Since #2449 made the failover delay relative to cluster-node-timeout. Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>

ranshid · 2026-06-11T07:49:17Z

The valgrind tests are failing on CLUSTER MYSHARDID reports same shard id after cluster restart which is something that is fixed by Deflake cluster-shards "same shard id after restart" test #3949 but also carry dependency on fix: add CLUSTER SAVECONFIG before pause in shard restart tests #3904. I suggest to decide on backporting them separetly
the fedora tests are failing on what I think is a timing race in a recently-added test.
The failing test is: Improve CLUSTERSCAN error handling test with broader coverage #3674 "Improve CLUSTERSCAN error handling test with broader coverage".
The test expects: unassigned slot → Hash slot not served, even while the cluster is fail. That requires the command to evaluate slot ownership before global cluster state.
From the server logs, this window is noisy: R0 (dac26c…, the slot-0 owner) was paused in Case 1, resumed in Case 2, and the peers were still flapping their view of it (Marking node R0 as failing (quorum reached) at 20.864 → ok at 21.003 → fail at 21.270). wait_for_cluster_state fail only guarantees the cluster is down — not why. So the clusterscan ran against a cluster whose FAIL state wasn't yet attributable to "slot 0 is unbound" from R0's authoritative view, and the generic The cluster is down won over Hash slot not served.
the DCO are failing since the authors github account and signoff commit account is different and in this case the author is added to the PR after a squash merge of the PR. I feel safe to override the DCO in this case since the emails are correct.

@enjoy-binbin can you TAL at this and approve (or not) the backport as well?

enjoy-binbin · 2026-06-11T07:53:47Z

#3959 fix one of the race, @ranshid Do you wanna take a look and see if it is a same case?

ranshid · 2026-06-11T08:16:57Z

#3959

@enjoy-binbin makes sense! I agree this seems like should solve the second race issue. I think we should backport it, but I would prefer to complete this batch of backports and focus on stablizing the daily CI for 9.1 as a followup. WDYT?

enjoy-binbin

Sound good to me.

sarthakaggarwal97 · 2026-06-15T22:07:42Z

@ranshid I think you squashed and merged the backport PR. It messes a bit with branch commit history.

valkeyrie-ops Bot force-pushed the agent/backport/sweep/9.1 branch from 2dfead7 to 43f3274 Compare May 20, 2026 10:09

sarthakaggarwal97 requested a review from ranshid May 20, 2026 18:44

sarthakaggarwal97 mentioned this pull request May 20, 2026

Backport sweep: filter project-board PRs by source repo and auto-create missing labels valkey-io/valkey-ci-agent#9

Merged

enjoy-binbin and others added 15 commits June 10, 2026 10:25

Revert "IO-Threads redesign cleanup work (#3544)" (#3756)

17b9513

This reverts the [commit](fdd9039) that was merged as part of the PR #3544 due to a performance regression observed [here](#3750) Signed-off-by: akash kumar <akumdev@amazon.com>

Fix shard_id format specifier in UPDATE message log (#3942)

676c7d4

clusterNode.shard_id is a fixed-size char[CLUSTER_NAMELEN] buffer that is not guaranteed to be NUL-terminated, so it must be printed with %.40s. This was introduced in #2510. Signed-off-by: Binbin <binloveplay1314@qq.com>

valkeyrie-ops Bot force-pushed the agent/backport/sweep/9.1 branch from bea3028 to 279cd8d Compare June 10, 2026 10:27

ranshid approved these changes Jun 11, 2026

View reviewed changes

ranshid requested a review from enjoy-binbin June 11, 2026 07:49

enjoy-binbin approved these changes Jun 11, 2026

View reviewed changes

ranshid merged commit 71545e0 into 9.1 Jun 11, 2026
97 of 100 checks passed

valkeyrie-ops Bot deleted the agent/backport/sweep/9.1 branch June 11, 2026 10:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[backport] Backport sweep for 9.1#3774

[backport] Backport sweep for 9.1#3774
ranshid merged 15 commits into
9.1from
agent/backport/sweep/9.1

valkeyrie-ops Bot commented May 19, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Review skipped

Uh oh!

codecov Bot commented May 19, 2026 •

edited

Loading

Uh oh!

ranshid commented Jun 11, 2026

Uh oh!

enjoy-binbin commented Jun 11, 2026

Uh oh!

ranshid commented Jun 11, 2026

Uh oh!

enjoy-binbin left a comment

Uh oh!

Uh oh!

sarthakaggarwal97 commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Uh oh!

Conversation

valkeyrie-ops Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backport sweep for 9.1

Applied

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ranshid commented Jun 11, 2026

Uh oh!

enjoy-binbin commented Jun 11, 2026

Uh oh!

ranshid commented Jun 11, 2026

Uh oh!

enjoy-binbin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sarthakaggarwal97 commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

valkeyrie-ops Bot commented May 19, 2026 •

edited

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading