[backport] Backport sweep for 9.1#3774
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 9.1 #3774 +/- ##
==========================================
+ Coverage 76.55% 76.79% +0.24%
==========================================
Files 163 163
Lines 81027 81098 +71
==========================================
+ Hits 62030 62281 +251
+ Misses 18997 18817 -180
🚀 New features to boost your workflow:
|
2dfead7 to
43f3274
Compare
…te missing labels (#9) * Filter project-board PRs by source repo in backport sweep The backport sweep iterates every PullRequest item on the configured project board and treats anything tagged 'To be backported' as a candidate. Project boards aggregate PRs across an entire org, so PRs from sibling repos (e.g. valkey-io.github.io blog posts on the valkey-9.1 board) were flowing through to the cherry-pick path and failing with: git fetch origin <merge-sha> fatal: couldn't find remote ref <merge-sha> exit 128 because the SHA only exists in the source PR's repo, not in the one we're sweeping into. Fix: project items now carry the source repo, and the discovery filters candidates whose repository.nameWithOwner doesn't match the configured target. The GraphQL query selects the new field; if a (cached) payload lacks it the candidate is kept (we don't want to start refusing PRs silently — the repo filter is a refinement, not a hard requirement). Concretely, this is what was happening on PR valkey-io/valkey#3774 — the per-candidate table listed PR #553 'Add Valkey 9.1 release announcement blog post' with 'error Command [git fetch origin f3ea299c...] returned non-zero exit status 128.' That commit lives in valkey-io/valkey-io.github.io, not in valkey-io/valkey. 6 new tests cover: cross-repo PR is dropped, matching-repo PR is kept, unmerged/wrong-status PRs are dropped before the repo check, missing repository field doesn't block (forward-compat), and a regression guard that the GraphQL query selects nameWithOwner. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> * Auto-create missing labels on backport PR creation Backport PRs opened by the agent on valkey-io/valkey were shipping unlabeled because the configured 'backport' and 'ai-resolved-conflicts' labels do not exist on the target repo. The label-apply call failed silently with a warning and the run continued. Now BackportPRCreator ensures each label exists on the repo before applying it, creating any missing label with sensible color and description defaults. A 422 from create_label (concurrent creation race) is treated as success; other failures are logged at error level. Also drop the unused 'llm-resolved-conflicts' default. Production config in repos.yml uses 'ai-resolved-conflicts'; align all code defaults to match so we have a single label for AI-resolved conflicts. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…cations (#3743) When a module subscribes to NOTIFY_HASH keyspace events and blocks the client in the notification callback, hexpireGenericCommand, hgetdelCommand, and hpersistCommand would trigger debugServerAssert in notifyKeyspaceEvent() because addReplyArrayLen() sets buffered_reply=1 before the notification fires. This debug assert was added in #1819. Affected commands: HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETDEL, HPERSIST. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>
…3766) The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with: ``` Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' (context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test) ``` ## Why it races When `HGETDEL` empties the hash, `hgetdelCommand` fires two notifications back-to-back on the main thread: 1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)` 2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was removed) The `block_keyspace_notification` test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, `RM_BlockClient` returns `NULL` (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends `"del"` to the log. The first thread is the one that unblocks the client, so `HGETDEL` returns as soon as that thread has logged `"hdel"`. The test client then immediately reads `b_keyspace.events`, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only `"hdel"` visible. ## Fix `wait_for_condition` until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the `Server-created keyspace notification` and `Event that fires twice` blocks). ```tcl wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]] ``` Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
### Bug When `ACL LOAD` deletes a user, it calls `freeClientOrCloseLater()` on clients authenticated as that user. If `freeClient()` defers the actual free (because the client has `flag.protected` set or has pending IO), the client remains in `server.clients` with `c->user` still pointing to the old user object. After `raxFreeWithCallback(old_users, ACLFreeUserVoid)` frees all old user objects, `c->user` becomes a dangling pointer. Any subsequent code that dereferences `c->user` on the deferred client triggers a heap-use-after-free. ### Crash sequence 1. Client A runs `ACL LOAD` 2. Client B is authenticated as a user that was removed from the ACL file 3. Client B has `flag.protected` set (e.g. has pending IO (IO threads) 4. `ACLLoadFromFile` → `freeClientOrCloseLater(B, 0)` → `freeClient(B)` → defers to `freeClientAsync` because B is protected 5. `raxFreeWithCallback(old_users, ...)` frees all old user objects including B's user 6. B is still in `server.clients` with `c->user` pointing to freed memory 7. `CLIENT LIST` or any path that calls `catClientInfoString` dereferences `c->user->name` → **UAF** ### Fix Set `c->user` to default user before calling `freeClientOrCloseLater()` in `ACLLoadFromFile`. The existing NULL guard in `catClientInfoString` (`client->user ? client->user->name : "(superuser)"`) handles this safely. ### Test Added `DEBUG PROTECT-CLIENT <client-id>` subcommand that calls `protectClient()` on the target, forcing `freeClient` to defer to async. This enables deterministic reproduction without timing races. The test: 1. Creates a user, connects a client as that user 2. Protects the client via `DEBUG PROTECT-CLIENT` 3. Removes the user from the ACL file and runs `ACL LOAD` 4. Runs `CLIENT LIST` which dereferences `c->user->name` Without the fix, this crashes under ASAN with: ``` ERROR: AddressSanitizer: heap-use-after-free in catClientInfoString READ of size 8 freed by: raxRecursiveFree → raxFreeWithCallback → ACLLoadFromFile ``` --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…3723) 1. Fix double-call to finishSlotMigrationJob in clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the function was called once with 'assigned to myself' message, then fell through to call it again with 'no longer owned by source'. Added else branch to make the two error paths mutually exclusive. 2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish(): addReply(c, shared.ok) was sent unconditionally before validating the job name and state, so on error the client would receive both +OK and -ERR. Moved the OK reply after all validations pass. Note that it is an internal command and the only clients sending it are primary to replica connections and fake AOF clients. Both of those turn off replies, so the reply doesn't actually get parsed and therefore the violation has no effect. But good to clean it up Signed-off-by: chx9 <lovelypiska@outlook.com>
…n rdb.c, networking.c, debug.c and t_hash (#3872) Removing logging key names when hide_user_data_from_log is true in rdb.c file as well as networking.c, t_hash.c and debug.c files Made simple if else in order to distinguish for read ability --------- Signed-off-by: zackcam <zackcam@amazon.com>
When unregistering a cluster message receiver (callback = NULL) for the head node of the linked list, the code incorrectly updated clusterReceivers[type]->next instead of clusterReceivers[type] itself. This left the array entry pointing to freed memory, causing a use-after-free on any subsequent list traversal. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>
CLUSTER FLUSHSLOT internally calls delKeysInSlot() which iterates over all databases and removes the slot's keys from every database, just like FLUSHALL command will touch all the databases. And it was missing the ALL_DBS command flag introduced in #2309. Add test to cover this case, also do a extra cleanup, replace the manual start_multiple_servers with a single start_cluster call. Signed-off-by: Binbin <binloveplay1314@qq.com>
… injection (#3847) Reject any SENTINEL SET argument containing control characters (0x00-0x1F, 0x7F) to prevent config injection via newline-embedded values. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>
copyDbIdArgs assumed the DB token was always at argv[3], so 'COPY src dst REPLACE DB n' or 'COPY src dst DB a DB b' could bypass db= restrictions. Mirror copyCommand's parser to walk all argv tokens, collect every DB destination, and let the ACL checker validate each one. DB ACL was added in #2309. Signed-off-by: Binbin <binloveplay1314@qq.com>
Use actions/create-github-app-token to generate a scoped installation token for valkey-release-automation instead of the broad AUTOMATION_PAT. Part of valkey-io/valkey-release-automation#53 --------- Signed-off-by: Jules Lasarte <lasartej@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Co-authored-by: Jules Lasarte <lasartej@amazon.com>
Harden isValidAuxChar() to reject control characters (0x00-0x1F, 0x7F) and format-significant delimiters (comma, equals, quotes, backslash, space) that could corrupt `nodes.conf` parsing or enable injection of crafted node entries on restart. Additionally, add a validator for `cluster-announce-ip` which previously had no input validation, allowing arbitrary bytes to be persisted into the cluster config file. --------- Signed-off-by: Eran Ifrah <eifrah@amazon.com>
Rework commandDbIdArgs so each helper returns the argv index of every dbid argument it owns. ACLSelectorCheckCmd uses those positions to set keyidxptr directly, replacing the cmd->proc-based offset table. The caller now reads the dbid value via getLongLongFromObject(argv[positions[i]], ...). New db=-checked commands only need to implement their own *DbIdArgs helper. The fix is for COPY: the old chain fell through to 0, so the "object" field in ACL LOG was always showed "copy", it now reports the first denied dbid. Based on #3801, also see it for more details, DB ACL was added in #2309. Signed-off-by: Binbin <binloveplay1314@qq.com>
clusterNode.shard_id is a fixed-size char[CLUSTER_NAMELEN] buffer that is not guaranteed to be NUL-terminated, so it must be printed with %.40s. This was introduced in #2510. Signed-off-by: Binbin <binloveplay1314@qq.com>
…#3941) Since #2449 made the failover delay relative to cluster-node-timeout. Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>
bea3028 to
279cd8d
Compare
@enjoy-binbin can you TAL at this and approve (or not) the backport as well? |
|
@enjoy-binbin makes sense! I agree this seems like should solve the second race issue. I think we should backport it, but I would prefer to complete this batch of backports and focus on stablizing the daily CI for 9.1 as a followup. WDYT? |
|
@ranshid I think you squashed and merged the backport PR. It messes a bit with branch commit history. |
Backport sweep for 9.1
Automated cherry-picks from PRs marked "To be backported".
Applied
Generated by valkey-ci-agent using Claude Code.