Batched MGET/MSET dict prefetch with dictType-driven payload hints by fcostaoliveira · Pull Request #15133 · redis/redis

fcostaoliveira · 2026-04-28T20:53:18Z

Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses
across the keys of a single multi-key command. Builds on the cross-command
batched prefetch framework introduced in #14017 and the dict-prefetch state
machine in memory_prefetch.c, and lifts the kvobject-aware bits out of the
state machine into two new dictType callbacks so the same machinery can
be reused for other dict-encoded types later (hash hashtable, sets, sorted
sets) without paying for kvobj-specific code paths in the core loop.

Bundles the work originally proposed in #14899 (MGET prefetch framework, by
@mpozniak95) and #15043 (MSET batch prefetch), with several follow-up
commits on top:

review-pass simplifications addressing @ShooterIT — drop the dead
expire-dict prefetch callback, drop the single-key MGET special case,
route MSET through dictPrefetchKeys so the existing entry's value is
warmed too;
structural refactor of memory_prefetch.c by @ShooterIT (clearer
types + explicit Init/Reset/Run/Free lifecycle);
MGET single-path simplification addressing @ShooterIT's second pass —
drop MGET_LIGHT_PREFETCH_MAX and the inner bucket-only branch; small
requests get a single dictPrefetchKeys call covering all keys, larger
requests batch in chunks via the same call;
shared prefetchKeysBatch(c, slot, start, stride) helper by
@ShooterIT that deduplicates MGET (stride=1) and MSET (stride=2)
batching paths, with adaptive batch sizing (≥ 2 batches remain → take
one; else take all remaining in one call) and a stack-allocating
fixed-size lookup array.

Design

Two new optional callbacks on dictType:

typedef struct dictType {
    ...
    /* Bring the entry's key payload into cache before keyCompare runs.
     * Returns the address to prefetch, or NULL if the entry alone is enough. */
    void *(*prefetchEntryKey)(const dictEntry *de);

    /* Called only after a key match. Returns the value-side payload to
     * prefetch (or NULL). */
    void *(*prefetchEntryValue)(const dictEntry *de);
} dictType;

dbDictType registers both. The kv-aware logic — the dictEntryIsKey()
shortcut for embedded kvobjs, and kv->ptr for OBJ_STRING /
OBJ_ENCODING_RAW values — now lives in two small helpers in server.c:

static void *dbDictPrefetchEntryKey(const dictEntry *de) {
    return dictEntryIsKey(de) ? NULL : dictGetKey(de);
}

static void *dbDictPrefetchEntryValue(const dictEntry *de) {
    kvobj *kv = dictGetKey(de);
    return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW)
            ? kv->ptr : NULL;
}

The PrefetchGetValueDataFunc typedef and the per-call get_val_data
parameter on dictPrefetchKeys() / dictPrefetch() are removed — the
dict's own type drives both ends. This also removes the foot-gun where
callers (like mgetCommand) had to remember whether to pass
prefetchGetObjectValuePtr or NULL. memory_prefetch.c no longer
references kvobj, kvobjGetKey, or any specific value layout.

State machine

Two file-local types in memory_prefetch.c:

Type	Role
`dictPrefetchLookup`	Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk).
`dictPrefetcher`	Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued.

Five-stage lifecycle for each lookup, driven by the prefetcher:

                                                           │
                                                         start
                                                           │
                                                  ┌────────▼─────────┐
                                       ┌─────────►│  PREFETCH_BUCKET ├────►────────┐
                                       │          └────────┬─────────┘            no more tables
                                       │             bucket│found                  │
                                       │                   │                       │
        entry not found - goto next table         ┌────────▼────────┐              │
                                       └────◄─────┤ PREFETCH_ENTRY  │              ▼
                                    ┌────────────►└────────┬────────┘              │
                                    │                 entry│found                  │
                                    │                      │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    │          │   PREFETCH_ENTRY_KEY    │ ◄── dictType->prefetchEntryKey(de)
                                    │          └───────────┬─────────────┘         │
                                    │                      │                       │
        key mismatch - goto next entry                     │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    └──────◄───│   PREFETCH_ENTRY_VALUE  │ ◄── keyCompare; on match,
                                               └───────────┬─────────────┘     dictType->prefetchEntryValue(de)
                                                           │                       │
                                                 ┌─────────▼─────────────┐         │
                                                 │     PREFETCH_DONE     │◄────────┘
                                                 └───────────────────────┘

PREFETCH_BUCKET first picks ht_table[0], then flips to ht_table[1]
if the dict is mid-rehash, then transitions to PREFETCH_DONE if no
more tables remain.

memory_prefetch.c exposes a small lifecycle that any caller can drive:

dictPrefetcherInit(p, max_keys);                  /* one-shot heap alloc of lookups[] */
dictPrefetcherReset(p, dicts, keys, nkeys);       /* configure for one batch */
dictPrefetcherRun(p);                             /* drive FSM until all PREFETCH_DONE */
dictPrefetcherFree(p);                            /* release */

Each FSM stage is a named static function (dictPrefetchBucket,
dictPrefetchEntry, dictPrefetchEntryKey, dictPrefetchEntryValue),
so the dictPrefetcherRun driver is a four-line switch over the state.

The state machine is dict-pure: no kvobj field on dictPrefetchLookup,
no kvobjGetKey reach-through. Round-robin advance semantics — a state
only advances the cursor if a prefetch was actually issued — are
preserved, so the embedded-kvobj fast path
(dictEntryIsKey(de) == 1 → callback returns NULL) still skips the
extra prefetch and falls straight into the compare on the next loop
iteration.

The cross-command path (prefetchCommands / PrefetchCommandsBatch)
embeds a dictPrefetcher initialized once at startup and reset per
batch, so cross-command prefetching no longer allocates per call.

Intra-command API

void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys);

A single multi-key command (e.g. MGET) can prefetch dict data for a
batch of its own keys, reusing the same state machine that the
cross-command path uses. Single-key calls (nkeys <= 1) early-return —
nothing to interleave with. The implementation stack-allocates a
fixed-size lookup array bounded by DICT_PREFETCH_MAX_SIZE = 64 (no
VLA, predictable stack usage), so the intra-command path doesn't touch
the heap.

Notes on the call sites

A shared helper picks the next prefetch batch and warms it via
dictPrefetchKeys:

/* Pick the next prefetch batch starting at argv[start] and warm it via
 * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for
 * key/value pairs (MSET). Returns the chosen batch size in items. */
static int prefetchKeysBatch(client *c, int slot, int start, int stride);

Adaptive batch sizing inside the helper: if at least two full batches
(PREFETCH_BATCH_SIZE * 2 = 32 items) remain, take one batch
(PREFETCH_BATCH_SIZE = 16); otherwise take all remaining items in one
call. This generalizes the small-request fast path so the trailing
batch of a large request also gets the single-call benefit.

MGET (mgetCommand) — gated by
do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1,
with already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED).
When do_prefetch is set, each iteration calls
prefetchKeysBatch(c, slot, j, 1) and then sequentially
lookupKeyReads + replies the chosen batch. When do_prefetch is
clear (cross-command path already warmed the keys, or batch
prefetching is off), the loop takes all remaining items in one go
and skips the prefetch.
MSET / MSETNX (msetGenericCommand) — same do_prefetch gate as
MGET with stride = 2. For the NX flag the NX-check loop runs
lookupKeyWrite (which already warmed everything via
prefetchKeysBatch); the SET loop then disables further prefetch
(do_prefetch &&= !nx) so we don't re-prefetch on the second pass.
Going through the full state machine (rather than bucket-only) means
dbDictType's prefetchEntryValue callback runs on a key match —
warming the old kvobj's payload, which setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen) then reads to compute the
histogram delta. The slot dict is re-fetched per batch — in cluster
mode the slot dict can be freed mid-MSET (KVSTORE_FREE_EMPTY_DICTS
- expireIfNeeded), so a cached pointer would otherwise dangle.
Cross-command batch path (addCommandToBatch) — sets
PENDING_CMD_KEYS_PREFETCHED on every command added to the batch,
even on partial-batch overflow (was: only when ALL keys fit). The
intra-command path then uniformly skips supplemental prefetching for
any command the batch touched. Rationale: running both paths
(cross-command warm + intra-command supplement) caused a measured
−9.6 % regression on x86 with pipeline-10, and the partial cross-
command warmup is sufficient for the head of the keyset; the cold
tail goes through normal lookup, which is still cheaper than running
the FSM a second time on already-warm keys.
Future types: each dict's dictType can register its own
prefetchEntryKey / prefetchEntryValue (e.g. for the hashtable hash
encoding, the field-sds and value-sds payloads), without touching
memory_prefetch.c.

Benchmark validation

Validated on x86-aws-m7i.metal-24xl-2 (Intel Xeon 8488C, Sapphire Rapids,
bare-metal). PR head 792c9ca1c vs unstable 2432f5527.

Test	Baseline `2432f5527` (Ops/sec)	PR `792c9ca1c` (Ops/sec)	Δ
`5Mkeys-string-mget-100B-2keys`	162 657.34	164 968.43	+1.42 %
`5Mkeys-string-mget-100B-5keys`	146 959.66	150 364.64 (mean of 3 dp)	+2.32 %
`5Mkeys-string-mget-100B-10keys`	125 275.66	133 586.86	+6.63 %
`5Mkeys-string-mget-100B-20keys`	90 897.79	107 037.88	+17.76 %
`5Mkeys-string-mget-100B-30keys`	73 006.87	88 796.06	+21.63 %
`5Mkeys-string-mget-100B-100keys`	31 149.71	42 723.75	+37.16 %
`10Mkeys-load-string-mset-10-keys-with-10B-values`	131 641.65	134 606.62	+2.25 %
`10Mkeys-load-string-mset-10-keys-with-100B-values`	127 918.11	132 016.12	+3.20 %
`10Mkeys-load-string-mset-50-keys-with-10B-values`	65 659.64	67 208.60	+2.36 %
`10Mkeys-load-string-mset-50-keys-with-100B-values`	60 718.20	63 409.87	+4.43 %

MGET wins scale with batch size — the prefetch hides one DRAM hop per
key, so the longer the batch, the more memory parallelism we recover.
MSET is positive across the board, with the largest test
(50-keys-100B) showing the value-prefetch fix paying off first
(highest cache pressure). ARM (Graviton4) was uniformly flat in earlier
runs against the same unstable head, which matches what the underlying
hardware prefetcher already gives.

Note

Medium Risk
Touches core dict lookup prefetching and rewrites MGET/MSET hot paths; while performance-focused, subtle mistakes could cause correctness regressions (e.g., stale dict pointers, cluster slot handling) or reduced performance under some workloads.

Overview
Adds dictType-configurable prefetch hooks (prefetchEntryKey/prefetchEntryValue) and rewrites the dict prefetch logic into a reusable, kvobj-agnostic state machine that can warm both key and value payloads during lookups.

Introduces a new intra-command API dictPrefetchKeys() and updates MGET/MSET/MSETNX to prefetch their own key batches (with adaptive batching), while coordinating with the existing cross-command prefetcher via a new PENDING_CMD_KEYS_PREFETCHED flag to avoid double-prefetching.

Registers db keyspace-specific prefetch callbacks in dbDictType and adds unit tests covering multi-batch MSET execution and an expiry/batch-boundary regression case.

^{Reviewed by Cursor Bugbot for commit bc081cb. Bugbot is set up for automated code reviews on this repo. Configure here.}

…) API

…keys Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in addCommandToBatch() when a command is added to the cross-command prefetch batch. In mgetCommand, check the flag and fall back to plain sequential lookups when keys are already warm from the batch. This eliminates redundant prefetch work that caused regressions of up to -16% with many I/O threads, while preserving the large gains (+52-54%) for pipelined workloads where the batch overflows.

The previous guard only skipped intra-command dictPrefetchKeys() when PENDING_CMD_KEYS_PREFETCHED was set, which requires ALL keys to fit in the batch. For MGET 100 keys with a batch size of 32, the flag was never set — both prefetch paths always ran. On x86 with pipeline-10, this double-prefetching caused -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c with 8 IO threads due to cache-bandwidth contention in ctxPrefetchAndAdvance. Fix: skip intra-command prefetch when either: 1. PENDING_CMD_KEYS_PREFETCHED is set (all keys fit in batch), OR 2. Pipeline is active (ready_len > 1) — partial batch warmup is sufficient, double-prefetching hurts more than it helps. Single non-pipelined MGET (ready_len <= 1) still uses intra-command prefetch, preserving the +11% ARM / +25% x86 improvement.

The ready_len > 1 heuristic that skips intra-command dictPrefetchKeys() assumed cross-command batch prefetching always runs during pipelining. When prefetch-batch-max-size is 0 (batch disabled), addCommandToBatch() returns C_ERR immediately and no keys are warmed, but ready_len > 1 still holds for any pipeline — causing MGET to skip its only prefetch opportunity. Add server.prefetch_batch_max_size > 0 to the ready_len guard so intra-command prefetch runs when the batch system is disabled.

The prefetchGetObjectValuePtr callback adds one state-machine iteration per key to prefetch the first cache line of the value allocation. For EMBSTR values (<=44B) it returns NULL (value is embedded in the kv object), making the iteration pure overhead. For large RAW values (>=1KiB) it warms <10% of the data while adding measurable cost — profiling shows 3B samples in ctxPrefetchValdata vs 17.8B total in the prefetch path. Pass NULL instead: skip value-data prefetch entirely in the MGET intra-command path. The dict bucket/entry/kv-object prefetch provides the bulk of the improvement; addReplyBulk handles value data access sequentially.

Three optimizations to the dict prefetch state machine: 1. Replace modulo with branch in ctxPrefetchAndAdvance — avoids expensive division on every state transition. 2. Add remaining counter to DictPrefetchCtx — ctxNextInfo returns NULL immediately when all keys are done instead of scanning the full array. Eliminates O(n) scan per call as keys complete. 3. Fast-path in ctxPrefetchKvobj when no value-data callback and entry is a direct key — do key comparison inline and skip the VALDATA state entirely, saving one round-trip through the state machine per key.

For batches of <=10 keys, replace the full dictPrefetchKeys state machine with a simple loop that prefetches hash table buckets only. The state machine has ~4 states per key with function dispatch and round-robin scanning overhead. Profiling shows this costs 4.3% of total CPU for 10-key MGET with 1KiB values — matching the observed regression exactly. A single-pass bucket prefetch eliminates the iteration overhead while still warming the hash chain heads. Larger batches (>10 keys) continue using the full state machine where the multi-round interleaving provides net benefit.

Adds intra-command batched prefetch to msetGenericCommand. When processing multiple keys, dict bucket pointers for the next batch of 16 keys are prefetched into L1 cache before executing setKey for each key. This hides memory latency from scattered dict lookups in large MSET commands. The prefetch is advisory — correctness does not depend on it. If a rehash occurs mid-batch (triggered by a prior setKey expanding the dict), the prefetched bucket may be stale. The subsequent lookupKeyWriteWithLink inside setKey re-derives the correct bucket. Both the MSETNX existence-check pass and the write pass are batched. Single-key MSET or empty-dict cases skip prefetch entirely.

Addresses three review threads on PR redis#15043: 1. Stale dict pointer (augmentcode + cursor bugbot, high/medium). In cluster mode, server.db[].keys is created with KVSTORE_FREE_EMPTY_DICTS. A prior batch's setKey -> expireIfNeeded may delete the last pre-existing key in the slot, triggering freeDictIfNeeded; the subsequent dbAddByLink then allocates a new dict. The cached dict* from before the batch would be a dangling pointer on the next msetPrefetchBatch call. Re-fetch the slot dict inside a new msetMaybePrefetchBatch helper that runs per batch. 2. Helper signature (sundb). Change msetPrefetchBatch to take `robj **argv` instead of `client *c`, making the contract explicit. 3. Duplication (sundb). Collapse the `if (use_prefetch) { batched } else { simple }` branches in both the MSETNX existence-check pass and the write pass into a single batched loop that conditionally calls the prefetch helper. `use_prefetch` now only gates the prefetch call, not the loop structure. Add regression tests in tests/unit/type/string.tcl: - MSET spanning multiple prefetch batches (16, 17, 32, 33, 40 keys) - MSET overwriting expired keys across a batch boundary (exercises the expireIfNeeded path in the same code that would UAF under cluster mode)

…mework

… callback Two review comments from Cursor Bugbot on dcb8aa8: 1. Missing `#undef MGET_LIGHT_PREFETCH_MAX` at the end of mgetCommand. MGET_BATCH was already undef'd but its sibling was not. 2. The `dictPrefetchKeys()` docstring example passed `prefetchGetObjectValuePtr` as the value-data callback, while the mgetCommand caller passes NULL on purpose (the callback's extra state machine work costs more than warming the value payload saves, since the following lookupKeyRead + addReplyBulk touch the payload immediately and the hardware prefetcher is already on it). Update the example to show the typical NULL form and document when passing a non-NULL callback would actually be worthwhile.

Decouple the prefetch state machine from kvobject. The state machine previously knew about kvobj fields (current_kv, kvobjGetKey) and the default value-data callback (prefetchGetObjectValuePtr) lived inside memory_prefetch.c. That coupled a generic dict facility to one specific dict's stored-key layout. Two new optional callbacks on dictType: void *prefetchEntryKey(const dictEntry *de); void *prefetchEntryValue(const dictEntry *de); The state machine renames the post-entry states to match: PREFETCH_KVOBJ -> PREFETCH_ENTRY_KEY PREFETCH_VALDATA -> PREFETCH_ENTRY_VALUE PREFETCH_ENTRY_KEY asks the dictType to bring the entry's key payload into cache (returns the address to prefetch, or NULL if the entry already carries everything keyCompare needs). PREFETCH_ENTRY_VALUE runs the keyCompare and on a hit asks the dictType to prefetch the value-side payload. dbDictType registers both callbacks in server.c: - prefetchEntryKey returns dictGetKey(de) only when the kvobj is out-of-line (matches today's dictEntryIsKey check). - prefetchEntryValue returns kv->ptr for OBJ_STRING/RAW (the old prefetchGetObjectValuePtr behaviour, now where it belongs). - dbExpiresDictType only sets prefetchEntryKey (no values). The PrefetchGetValueDataFunc typedef and the trailing get_val_data parameter are removed from the dictPrefetchKeys() public API and from the cross-command dictPrefetch() wrapper - the dict's own type now drives this end-to-end. mgetCommand's call site updates accordingly.

ctxPrefetchEntryValue was passing dictGetKey(entry) directly to dictCompareKeys. For dbDictType the stored key is a kvobj — keyCompare (dictSdsCompareKV) expects an sds, so it dereferenced kvobj+offset as if it were an sds, reading garbage 'flags' and length fields and walking off the heap. The previous form of this state (PREFETCH_KVOBJ -> PREFETCH_VALDATA) extracted the sds via kvobjGetKey before calling keyCompare. The dict-pure equivalent is dictType->keyFromStoredKey, which dictFindLink also uses (see dictFindLinkInternal -> dictStoredKey2Key). Fix: apply keyFromStoredKey to dictGetKey(entry) before passing it to dictCompareKeys when the dict provides the callback. dbDictType / dbExpiresDictType set keyFromStoredKey = kvGetKey, so the conversion yields the sds again; dicts without the callback (sds-keyed dicts) keep the entry's stored key unchanged. Reproduces under sanitizer (heap-buffer-overflow on sdslen reading kvobj-as-sds) and SIGSEGV in dictSdsCompareKV when a hash chain has more than one entry and the cross-command batch prefetch hits the compare path. Single-entry chains hit the !dictGetNext shortcut and so escaped local smoke testing.

augmentcode · 2026-04-29T08:13:52Z

🤖 Augment PR Summary

Summary: This PR reduces multi-key lookup latency by overlapping dict-lookup memory accesses using batched prefetching.

Changes:

Extend dictType with optional prefetchEntryKey/prefetchEntryValue hooks to describe key/value payload prefetch targets.
Refactor memory_prefetch.c’s dict prefetch state machine to be kvobj-agnostic and driven only by dictType callbacks.
Add an intra-command API (dictPrefetchKeys()) so single multi-key commands can prefetch their own key batches.
Update cross-command batch prefetching to rely on dictType hooks and introduce PENDING_CMD_KEYS_PREFETCHED to skip redundant intra-command prefetch.
Rework MGET to process keys in batches (light bucket-only prefetch for small batches; full state machine for larger ones).
Rework MSET/MSETNX to prefetch buckets in 16-key batches and re-fetch the slot dict each batch to avoid stale pointers.
Add unit tests covering multi-batch MSET behavior and an expiry-related regression case across the batch boundary.

Technical Notes: Prefetching remains advisory and is conditionally disabled in some pipelined scenarios to avoid cache-bandwidth contention.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 3 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-04-29T08:13:53Z

+    /* Determine the dict for the first key's slot (MGET requires all keys
+     * in the same slot in cluster mode). */
+    int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0;
+    dict *d = kvstoreGetDict(c->db->keys, slot);


src/t_string.c:706: mgetCommand caches dict *d once and reuses it across multiple 16-key batches, but in cluster mode lookupKeyRead() can expire/delete the last key in the slot and KVSTORE_FREE_EMPTY_DICTS may free the slot dict, leaving d dangling for the next batch’s prefetch/hash operations. This is the same class of hazard you already call out and avoid in msetMaybePrefetchBatch by re-fetching the dict per batch.

Severity: high

Other Locations

src/memory_prefetch.c:433

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-29T08:13:53Z

+     *   key payload behind the entry is warm before keyCompare runs), or
+     *   NULL if nothing extra is needed (e.g. the key is co-located with
+     *   the entry).
+     * prefetchEntryValue: called only after a key match. Returns an


src/dict.h:148: The comment says prefetchEntryValue is “called only after a key match”, but memory_prefetch.c can enter the match path without comparing keys for the last entry in a bucket (the “assume hit” fast path). This contract mismatch could lead future dictType implementations to make unsafe assumptions inside prefetchEntryValue.

Severity: medium

Other Locations

src/memory_prefetch.c:265

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-29T08:13:53Z

+     * (typically ≤ 16–32) to avoid excessive stack usage.
+     * Zero-initialize so that keys marked PREFETCH_DONE by ctxInit
+     * (null/empty dict) don't carry indeterminate field values. */
+    KeyPrefetchInfo pf_info[nkeys];


src/memory_prefetch.c:462: dictPrefetchKeys() stack-allocates KeyPrefetchInfo pf_info[nkeys]; if a caller ever passes an unbounded nkeys (e.g. derived directly from argc), this can cause stack exhaustion and crash. Since this is now an exported API in memory_prefetch.h, it feels easy for a future call site to accidentally violate the “keep nkeys bounded” assumption.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

Three fixes from the cursor and augmentcode bot reviews on the dict-pure prefetch branch: 1. mgetCommand: re-fetch the slot dict per batch. The dict pointer was fetched once at the top and reused across all 16-key batches. In cluster mode, db->keys is created with KVSTORE_FREE_EMPTY_DICTS, so a prior batch's lookupKeyRead -> expireIfNeeded path can empty the slot and free the dict; the next batch's dictGetHash(d, ...) would then UAF. Same hazard the MSET batch path already guards against via msetMaybePrefetchBatch. Mirror that pattern: kvstoreGetDict per batch, with an empty/null dict shortcut that still calls lookupKeyRead per key to preserve touch / stats / notification side-effects. Cost: one extra kvstoreGetDict per 16 keys. 2. dictType.prefetchEntryValue contract: the docstring claimed the callback is 'called only after a key match', but the state machine has a 'last entry in chain, not rehashing' shortcut that calls the callback without comparing keys. Update the doc to spell out the presumed-match semantics so future dictType authors don't write callbacks that assume a verified equal. 3. dictPrefetchKeys: cap nkeys with serverAssert + named constant (DICT_PREFETCH_KEYS_MAX = 64). KeyPrefetchInfo is stack-allocated via VLA and must not be sized off an attacker-controlled value; in-tree callers batch in groups of 16, the cap leaves headroom without risking stack exhaustion.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit d6d9ac9. Configure here.}

fcostaoliveira · 2026-05-05T20:32:01Z

@ShooterIT — your follow-up refactor is in as 792c9ca1c. Net −85 LOC (+67 / −152). Big readability wins:

Single shared prefetchKeysBatch(c, slot, start, stride) helper kills the MGET/MSET duplication (stride=1 for MGET, stride=2 for MSET key/value pairs).
Adaptive batch sizing (if (batch >= PREFETCH_BATCH_SIZE*2) batch = PREFETCH_BATCH_SIZE; else take all remaining) generalizes the small-request fast path to also cover the trailing batch of large requests.
VLA lookups[nkeys] → fixed lookups[DICT_PREFETCH_MAX_SIZE] (no VLA, portable).
Constant rename DICT_PREFETCH_KEYS_MAX → DICT_PREFETCH_MAX_SIZE (matches prefetch_batch_max_size config).
MSETNX SET loop skips prefetch (NX-check loop already warmed everything).
addCommandToBatch now sets PENDING_CMD_KEYS_PREFETCHED even on partial batches (was: only when ALL keys fit). This was the one semantic change worth validating since the prior comment about "−9.6 % regression on x86 with pipeline-10 when both paths run" left it ambiguous — turns out suppressing intra-command supplementation on partial batches doesn't regress mget-100B-100keys either (see table).

Validation — `x86-aws-m7i.metal-24xl-2`

PR 792c9ca1c vs unstable 2432f5527, alongside the previous head a46e18965 for refactor delta:

Test	Base `2432f552`	Prev `a46e1896`	PR `792c9ca1`	Δ vs base	Δ vs prev
`5Mkeys-string-mget-100B-2keys`	162 657.34	163 601.69	164 968.43	+1.42 %	+0.84 %
`5Mkeys-string-mget-100B-5keys`	146 959.66	153 192.74	150 364.64 (mean of 3 dp: 147 804 / 152 065 / 151 223)	−1.85 %	−1.85 %
`5Mkeys-string-mget-100B-10keys`	125 275.66	133 124.41	133 586.86	+6.63 %	+0.35 %
`5Mkeys-string-mget-100B-20keys`	90 897.79	106 756.88	107 037.88	+17.76 %	+0.26 %
`5Mkeys-string-mget-100B-30keys`	73 006.87	87 653.67	88 796.06	+21.63 %	+1.30 %
`5Mkeys-string-mget-100B-100keys`	31 149.71	42 347.75	42 723.75	+37.16 %	+0.89 %
`10Mkeys-load-string-mset-10-keys-with-10B-values`	131 641.65	133 212.63	134 606.62	+2.25 %	+1.05 %
`10Mkeys-load-string-mset-10-keys-with-100B-values`	127 918.11	129 232.20	132 016.12	+3.20 %	+2.15 %
`10Mkeys-load-string-mset-50-keys-with-10B-values`	65 659.64	67 970.65	67 208.60	+2.36 %	−1.12 %
`10Mkeys-load-string-mset-50-keys-with-100B-values`	60 718.20	63 604.58	63 409.87	+4.43 %	−0.31 %

Refactor is performance-neutral vs the prev head (8 of 10 within ±1.5 %, two within ±2.2 %). All headline wins vs unstable preserved — mget-100B-100keys actually nudged up from +35.95 % → +37.16 %. The partial-batch PENDING_CMD_KEYS_PREFETCHED change you made in addCommandToBatch was the only behavioural concern worth re-checking, and it's clean.

fcostaoliveira · 2026-05-06T09:17:33Z

Two follow-up validation rounds on x86-aws-m7i.metal-24xl-2 against the same unstable baseline (now 7ecc04f59d — the current head) that surface deltas under topologies the prior table didn't cover.

`oss-standalone-08-io-threads` — io-threads safety check

The change in addCommandToBatch (commit 792c9ca1c) sets PENDING_CMD_KEYS_PREFETCHED even on partial cross-command batches. Worth confirming this doesn't regress io-threads workloads where multi-key commands could overflow the batch and lose the supplemental intra-command prefetch.

Test	Baseline `7ecc04f59d` (Ops/sec)	PR `792c9ca1c` (Ops/sec)	Δ
`5Mkeys-string-mget-100B-100keys-pipeline-10`	41 218.99	60 581.10	+46.97 %
`5Mkeys-string-mget-100B-30keys-pipeline-10`	148 639.87	212 409.44	+42.90 %
`1Mkeys-string-setget200c-512B-pipeline-10`	2 414 365.79	2 594 377.61	+7.46 %
`1Mkeys-string-set-get-short-expiration-512B`	1 272 013.21	1 342 755.00	+5.56 %
`3Mkeys-string-mixed-20-80-with-512B-values-pipeline-10-2000_conns`	2 423 562.75	2 515 002.31	+3.77 %

The cold-tail concern was unfounded — eliminating supplemental intra-command prefetching on commands the batch already touched removes redundant cache-bandwidth contention. Multi-key MGET wins are larger with io-threads (+47 % / +43 %) than the no-iothreads version (+37 % on mget-100B-100keys at oss-standalone).

`oss-standalone` pipeline-10 — non-iothreads pipelined

Test	Baseline `7ecc04f59d` (Ops/sec)	PR `792c9ca1c` (Ops/sec)	Δ
`5Mkeys-string-mget-100B-2keys-pipeline-10`	870 056.10	885 536.36	+1.78 %
`5Mkeys-string-mget-100B-5keys-pipeline-10`	460 668.12	500 265.21	+8.60 %
`5Mkeys-string-mget-100B-30keys-pipeline-10`	102 827.68	131 288.11	+27.68 %
`5Mkeys-string-mget-100B-100keys-pipeline-10`	56 911.76	59 489.88	+4.53 %

Coverage matrix (PR `792c9ca1c` vs `7ecc04f59d`, m7i-2)

Workload	non-pipeline (`oss-standalone`)	pipeline-10 (`oss-standalone`)	pipeline-10 (`oss-standalone-08-io-threads`)
`mget-100B-2keys`	+1.42 %	+1.78 %	—
`mget-100B-5keys`	+0.57 % (mean of 3 dp)	+8.60 %	—
`mget-100B-10keys`	+6.63 %	—	—
`mget-100B-20keys`	+17.76 %	—	—
`mget-100B-30keys`	+21.63 %	+27.68 %	+42.90 %
`mget-100B-100keys`	+37.16 %	+4.53 %	+46.97 %
`set-get-short-expiration-512B`	—	—	+5.56 %
`setget200c-512B-pipeline-10`	—	—	+7.46 %
`mixed-20-80-512B-pipeline-10-2000_conns`	—	—	+3.77 %

No regression on any shape. The mget-100B-100keys-pipeline-10 +4.53 % being smaller than its non-pipeline +37.16 % is the expected interaction — at 100 keys the cross-command batch (~32 keys per prefetch_batch_max_size) already warms the head of the keyset before the intra-command path runs, leaving less DRAM latency to hide. Add io-threads (parallel reply chain + cross-command batch) and the wins compound to +47 %.

ShooterIT

generally LGTM, small comments

ShooterIT · 2026-05-06T12:01:25Z

        addReplyBulkCBuffer(c,(char*)str+start,end-start+1);
    }
 }
+/* Batch size for intra-command key prefetching. */


Suggested change

/* Batch size for intra-command key prefetching. */

/* Batch size for intra-command key prefetching. */

ShooterIT · 2026-05-06T12:08:33Z

+        # Regression test for dict-pointer staleness across batches
+        # (see src/t_string.c:msetMaybePrefetchBatch). When lookupKeyWrite in
+        # batch 1 expires a pre-existing key, under cluster mode the slot
+        # dict may be freed (KVSTORE_FREE_EMPTY_DICTS) and recreated
+        # mid-command; msetGenericCommand must re-fetch the slot dict per
+        # batch. This test exercises the same code path in standalone mode.


please update comments or tests

ShooterIT · 2026-05-06T12:16:46Z

    .dictMetadataBytes = setDictMetadataBytes,
 };

+/* dbDictType prefetch callbacks --------------------------------------------


------------------------ is not align with other function style, maybe remove them, or refine it

Suggested change

/* dbDictType prefetch callbacks --------------------------------------------

/* dbDictType prefetch callbacks.

btw, i think we should move them at the front of dictType, maybe L572

sundb · 2026-05-08T06:18:55Z

+    addReplyArrayLen(c, numkeys);
+
+    /* MGET requires all keys in the same slot in cluster mode. */
+    int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0;


what about use c->current_pending_cmd->slot, if it's not INVALID_CLUSTER_SLOT, we can use it directly.

- src/server.c: drop the '----' separator from the dbDictType prefetch callbacks comment block and move the two callbacks ahead of the dictType definitions section, near the related dictResizeAllowed helper, instead of sitting between setDictType and dbDictType. - src/t_string.c: blank-line separation between getrangeCommand and the PREFETCH_BATCH_SIZE define for readability. - tests/unit/type/string.tcl: update stale function reference in the 'MSET overwrites expired keys across batch boundary' regression comment from msetMaybePrefetchBatch to the post-refactor prefetchKeysBatch helper.

fcostaoliveira · 2026-05-08T11:13:22Z

Thanks @ShooterIT — addressed all three in 517cb9b:

src/t_string.c:685 — added the blank-line separation before the PREFETCH_BATCH_SIZE define.
src/server.c:624 — dropped the ------ separator and moved the two prefetch callbacks ahead of the dictType section (right after dictResizeAllowed, before objectKeyPointerValueDictType), per your L572 suggestion.
tests/unit/type/string.tcl:282 — updated the stale msetMaybePrefetchBatch reference to prefetchKeysBatch (post-refactor name).

Local runtest --single unit/type/string green, including the MSET overwrites expired keys across batch boundary regression test.

Avoid recomputing the cluster slot via getKeySlot() (CRC16 over the first key) in mgetCommand and msetGenericCommand when the cross-command batching path has already populated c->current_pending_cmd->slot. Fall back to getKeySlot() when current_pending_cmd is NULL or the cached slot is INVALID_CLUSTER_SLOT (cross-slot or no slot set). Adds the cluster.h include for INVALID_CLUSTER_SLOT visibility.

fcostaoliveira · 2026-05-08T11:21:15Z

Thanks @sundb — good catch, addressed in 3d724a05c. Both mgetCommand and msetGenericCommand now reuse c->current_pending_cmd->slot when it's already populated by the cross-command batching path, falling back to getKeySlot() only when current_pending_cmd is NULL or the cached slot is INVALID_CLUSTER_SLOT (no-slot / cross-slot error case).

int slot = 0;
if (server.cluster_enabled) {
    slot = (c->current_pending_cmd &&
            c->current_pending_cmd->slot != INVALID_CLUSTER_SLOT)
        ? c->current_pending_cmd->slot
        : getKeySlot(c->argv[1]->ptr);
}

Local runtest --single unit/type/string green.

fcostaoliveira · 2026-05-08T14:04:10Z

Latest commit 3d724a05c (sundb slot-reuse) — performance reconfirmed across multiple x86 + ARM runners.

Triggered a fresh round comparing PR head 3d724a05c against current unstable head 7cf63635f0 (which now includes the just-merged #15071 jemalloc-size-hint patch).

Throughput — `m7i-2` (Sapphire Rapids x86, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Test	Topology	unstable mean (n)	PR mean (n)	Δ
`5Mkeys-string-mget-10B-100keys-pipeline-10`	io-threads-8	26,304 (n=1)	49,831 (n=2)	+89.44 %
`5Mkeys-string-mget-10B-100keys-pipeline-10`	standalone	22,953 (n=1)	42,053 (n=2)	+83.21 %
`5Mkeys-string-mget-10B-100keys`	standalone	22,136 (n=1)	39,472 (n=2)	+78.31 %
`5Mkeys-string-mget-100B-30keys-pipeline-10`	io-threads-8	148,763 (n=2)	209,899 (n=2)	+41.10 % σ↑
`5Mkeys-string-mget-100B-100keys-pipeline-10`	io-threads-8	43,505 (n=2)	60,563 (n=3)	+39.21 % σ↑
`5Mkeys-string-mget-100B-100keys`	standalone	31,084 (n=2)	42,688 (n=3)	+37.33 % σ↑
`10Mkeys-string-mget-80keys-100B-multitenant-tagged`	standalone	30,108 (n=2)	40,811 (n=3)	+35.55 % σ↑
`5Mkeys-string-mget-100B-30keys-pipeline-10`	standalone	105,791 (n=2)	129,903 (n=2)	+22.79 % σ↑
`5Mkeys-string-mget-100B-30keys`	standalone	72,249 (n=2)	88,433 (n=2)	+22.40 % σ↑
`5Mkeys-string-mget-100B-20keys`	standalone	91,047 (n=2)	106,422 (n=2)	+16.89 % σ↑
`5Mkeys-string-mget-512B-5keys-pipeline-10`	io-threads-8	815,187 (n=1)	935,741 (n=2)	+14.79 %
`5Mkeys-string-mget-100B-5keys-pipeline-10`	standalone	464,503 (n=2)	503,364 (n=2)	+8.37 % σ↑
`5Mkeys-string-mget-100B-5keys-pipeline-10`	io-threads-8	902,737 (n=2)	977,641 (n=2)	+8.30 % σ↑
`5Mkeys-string-mget-512B-5keys-pipeline-10`	standalone	314,000 (n=1)	333,018 (n=2)	+6.06 %
`5Mkeys-string-mget-1KiB-5keys`	standalone	118,086 (n=1)	124,493 (n=2)	+5.43 %
`5Mkeys-string-mget-100B-100keys-pipeline-10`	standalone	56,483 (n=2)	59,229 (n=3)	+4.86 % σ↑
`5Mkeys-string-mget-10B-100keys`	io-threads-8	30,950 (n=1)	32,049 (n=2)	+3.55 %
`5Mkeys-string-mget-100B-5keys`	standalone	145,255 (n=2)	150,094 (n=2)	+3.33 % σ↑
`5Mkeys-string-mget-512B-5keys`	standalone	132,357 (n=1)	136,500 (n=2)	+3.13 %
`10Mkeys-load-string-mset-10-keys-with-100B-values`	standalone	129,437 (n=2)	131,074 (n=3)	+1.26 % σ↑
`5Mkeys-string-mget-100B-2keys-pipeline-10`	standalone	886,782 (n=2)	891,897 (n=2)	+0.58 %
`5Mkeys-string-mget-1KiB-5keys-pipeline-10`	standalone	123,797 (n=1)	123,513 (n=2)	-0.23 %
`5Mkeys-string-mget-100B-2keys-pipeline-10`	io-threads-8	1,610,943 (n=2)	1,578,659 (n=2)	-2.00 %
`5Mkeys-string-mget-100B-2keys`	standalone	168,012 (n=2)	164,302 (n=2)	-2.21 % σ↓
`5Mkeys-string-mget-1KiB-5keys-pipeline-10`	io-threads-8	477,506 (n=1)	415,693 (n=2)	-12.94 % (*)

σ↑ / σ↓ = σ-disjoint (PR min > BL max or vice versa) when both sides have ≥ 2 dp.

(*) The −12.94 % on mget-1KiB-5keys-pipeline-10 / io-threads-8 is the same outlier flagged on #14907 (also σ-confirmed there at −12.05 %). The test's YAML spec is missing the --pipeline 10 client argument that its 100B and 512B siblings have, so it actually runs unpipelined despite the name. Both PR and baseline run the same broken spec, so the delta is real, but the magnitude doesn't match the prefetch mechanism (its 5-key siblings at 100B and 512B are +1 to +15 %). Likely a spec hygiene issue separate from this PR — happy to fix the spec under a separate change.

Throughput — `m8g-2` (Graviton4 ARM, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Test	Topology	unstable mean (n)	PR mean (n)	Δ
`5Mkeys-string-mget-10B-100keys-pipeline-10`	io-threads-8	3,059 (n=1)	6,986 (n=2)	+128.34 %
`5Mkeys-string-mget-100B-100keys-pipeline-10`	io-threads-8	6,261 (n=2)	9,188 (n=3)	+46.76 % σ↑
`5Mkeys-string-mget-10B-100keys`	standalone	21,106 (n=1)	29,716 (n=2)	+40.79 %
`5Mkeys-string-mget-10B-100keys-pipeline-10`	standalone	22,608 (n=1)	31,728 (n=2)	+40.34 %
`5Mkeys-string-mget-100B-30keys-pipeline-10`	io-threads-8	22,804 (n=2)	29,772 (n=2)	+30.56 % σ↑
`5Mkeys-string-mget-100B-5keys-pipeline-10`	io-threads-8	115,835 (n=2)	126,741 (n=2)	+9.42 % σ↑
`5Mkeys-string-mget-512B-5keys-pipeline-10`	io-threads-8	95,888 (n=1)	102,753 (n=2)	+7.16 %
`5Mkeys-string-mget-100B-2keys-pipeline-10`	io-threads-8	602,821 (n=2)	635,721 (n=2)	+5.46 % σ↑
`10Mkeys-string-mget-80keys-100B-multitenant-tagged`	standalone	29,761 (n=2)	30,526 (n=3)	+2.57 % σ↑
`5Mkeys-string-mget-100B-30keys-pipeline-10`	standalone	67,169 (n=2)	67,965 (n=2)	+1.18 %
`5Mkeys-string-mget-100B-100keys`	standalone	25,818 (n=2)	26,089 (n=3)	+1.05 % σ↑
`10Mkeys-load-string-mset-10-keys-with-100B-values`	standalone	109,132 (n=2)	109,808 (n=3)	+0.62 % σ↑
`5Mkeys-string-mget-512B-5keys-pipeline-10`	standalone	170,550 (n=1)	171,282 (n=2)	+0.43 %
`5Mkeys-string-mget-100B-100keys-pipeline-10`	standalone	26,244 (n=2)	26,266 (n=3)	+0.08 %
`5Mkeys-string-mget-100B-2keys-pipeline-10`	standalone	701,637 (n=2)	701,509 (n=2)	-0.02 %
`5Mkeys-string-mget-100B-5keys-pipeline-10`	standalone	351,784 (n=2)	351,625 (n=2)	-0.05 %
`5Mkeys-string-mget-512B-5keys`	standalone	129,069 (n=1)	128,976 (n=2)	-0.07 %
`5Mkeys-string-mget-1KiB-5keys-pipeline-10`	standalone	113,092 (n=1)	112,987 (n=2)	-0.09 %
`5Mkeys-string-mget-100B-5keys`	standalone	139,258 (n=1)	139,089 (n=2)	-0.12 %
`5Mkeys-string-mget-100B-2keys`	standalone	160,049 (n=2)	159,567 (n=2)	-0.30 % σ↓
`5Mkeys-string-mget-1KiB-5keys-pipeline-10`	io-threads-8	111,884 (n=1)	111,834 (n=2)	-0.04 %
`5Mkeys-string-mget-10B-100keys`	io-threads-8	18,356 (n=1)	18,251 (n=2)	-0.57 %
`5Mkeys-string-mget-100B-20keys`	standalone	84,033 (n=2)	83,318 (n=2)	-0.85 % σ↓
`5Mkeys-string-mget-1KiB-5keys`	standalone	114,255 (n=1)	112,518 (n=2)	-1.52 %
`5Mkeys-string-mget-100B-30keys`	standalone	69,012 (n=2)	67,441 (n=2)	-2.28 % σ↓

ARM picture splits two ways: all the big wins land on ARM io-threads-8 (where IO-thread interleaving disrupts the HW prefetcher and the explicit dictPrefetchKeys warm-up does real work — 4 σ-confirmed io-threads wins +5 to +47 %), while ARM standalone is essentially flat to slightly negative on small-N MGET (Graviton4's wide HW prefetcher already absorbs the multi-key MGET case, so the explicit prefetch is overhead). The two big standalone wins on mget-10B-100keys{,-pipeline-10} are the exception — small values + many keys overwhelm even Graviton4's HW prefetcher.

Cross-check on `m7i-profiler` (single-dp, x86 SPR, with TMA)

Same 25-test set: 24 wins, 1 flat (−1.5 % MSET-load). Headlines reproduce: mget-10B-100keys-pipeline-10 io-threads +95.19 %, mget-100B-100keys-pipeline-10 io-threads +57.58 %, mget-100B-30keys-pipeline-10 io-threads +36.68 %.

TMA on the headline test

mget-100B-100keys-pipeline-10 / io-threads-8: L1_Bound −3.2 pp, Frontend_Bound −1.5 pp, Memory_Bound +3.5 pp (L3 +2.1 pp). Optimization signature: prefetch warming kvobj into L1 before keyCompare, fewer instructions per command, memory becomes the new ceiling at sharply higher throughput.

Notes

Sundb slot-reuse change in 3d724a05c is gated by server.cluster_enabled → on every benchmark above (all oss-standalone*) the new code path is functionally identical to the prior PR head; the result is a clean reproduction of the prior round's wins.
2-key MGET shows a small consistent flat-to-slight-negative (−0.4 to −2.2 %), consistent with the optimization being an overhead-amortization play that needs ≥ 5 keys to win.
Across the 50 paired m7i-2 + m8g-2 cells: 17 σ-disjoint wins, 5 σ-disjoint flat-negatives (all ≤ −2.3 %, all ARM-standalone-small-N where Graviton4's HW prefetch dominates), 1 σ-disjoint outlier (mget-1KiB-5keys-pipeline-10 io-threads-8, spec issue noted above).

ShooterIT

just a style issue

Co-authored-by: Yuan Wang <wangyuancode@163.com>

fcostaoliveira · 2026-05-08T16:00:05Z

just a style issue

committed the suggestion. thank you @ShooterIT !

Co-authored-by: Yuan Wang <wangyuancode@163.com>

ShooterIT · 2026-05-09T11:45:02Z

Daily CI: https://github.com/ShooterIT/redis/actions/runs/25600281283

mpozniak95 and others added 17 commits March 30, 2026 21:08

Extend memory_prefetch framework with intra-command dictPrefetchKeys(…

79bb48c

…) API

Remove unused fields current_dicts and get_value_data_func

6911b26

Add zero-initializing

f0bcfa5

Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch

e04e2ec

Merge remote-tracking branch 'origin/unstable' into mget_prefetch_fra…

5f3e670

…mework

Merge remote-tracking branch 'origin/unstable' into pr-14899

d83e288

Merge branch 'pr-15043' into mget-mset-dict-pure

fc17dfc

fcostaoliveira changed the title ~~Alternative: dict-pure MGET/MSET prefetch via dictType callbacks~~ Batched MGET/MSET dict prefetch with dictType-driven payload hints Apr 28, 2026

fcostaoliveira requested review from ShooterIT and sundb April 29, 2026 08:06

fcostaoliveira marked this pull request as ready for review April 29, 2026 08:06

cursor Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread src/t_string.c

augmentcode Bot reviewed Apr 29, 2026

View reviewed changes

cursor Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

ShooterIT reviewed May 1, 2026

View reviewed changes

Comment thread src/server.c Outdated

ShooterIT reviewed May 1, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

ShooterIT reviewed May 1, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

fcostaoliveira requested a review from ShooterIT May 5, 2026 10:51

refactor mset/mget

792c9ca

fcostaoliveira mentioned this pull request May 5, 2026

refactor mset/mget filipecosta90/redis#23

Closed

ShooterIT reviewed May 6, 2026

View reviewed changes

sundb reviewed May 8, 2026

View reviewed changes

fcostaoliveira requested review from ShooterIT and sundb May 8, 2026 11:23

ShooterIT reviewed May 8, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

ShooterIT approved these changes May 8, 2026

View reviewed changes

Update src/t_string.c

1dc8662

Co-authored-by: Yuan Wang <wangyuancode@163.com>

ShooterIT reviewed May 9, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

Update t_string.c

c6389fa

Co-authored-by: Yuan Wang <wangyuancode@163.com>

Merge branch 'unstable' into mget-mset-dict-pure

bc081cb

ShooterIT merged commit 62551a7 into redis:unstable May 11, 2026
18 checks passed

ShooterIT added the release-notes indication that this issue needs to be mentioned in the release notes label May 11, 2026

ShooterIT added this to Redis 8.8 May 11, 2026

github-project-automation Bot moved this to Todo in Redis 8.8 May 11, 2026

ShooterIT moved this from Todo to Done in Redis 8.8 May 11, 2026

This was referenced May 11, 2026

Add batched dict bucket prefetch for MSET/MSETNX #15043

Closed

Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET #14899

Closed

paulorsousa mentioned this pull request May 21, 2026

Fix cache line false sharing on per-IO-thread stat counters #14907

Merged

	/* Batch size for intra-command key prefetching. */

	/* Batch size for intra-command key prefetching. */

	/* dbDictType prefetch callbacks --------------------------------------------
	/* dbDictType prefetch callbacks.

Uh oh!

Conversation

fcostaoliveira commented Apr 28, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design

State machine

Intra-command API

Notes on the call sites

Benchmark validation

Uh oh!

Uh oh!

augmentcode Bot commented Apr 29, 2026

Uh oh!

augmentcode Bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fcostaoliveira commented May 5, 2026

Validation — x86-aws-m7i.metal-24xl-2

Uh oh!

fcostaoliveira commented May 6, 2026

oss-standalone-08-io-threads — io-threads safety check

oss-standalone pipeline-10 — non-iothreads pipelined

Coverage matrix (PR 792c9ca1c vs 7ecc04f59d, m7i-2)

Uh oh!

ShooterIT left a comment

Choose a reason for hiding this comment

Uh oh!

ShooterIT May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ShooterIT May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ShooterIT May 6, 2026

Choose a reason for hiding this comment

Uh oh!

sundb May 8, 2026

Choose a reason for hiding this comment

Uh oh!

fcostaoliveira commented May 8, 2026

Uh oh!

fcostaoliveira commented May 8, 2026

Uh oh!

fcostaoliveira commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Throughput — m7i-2 (Sapphire Rapids x86, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Throughput — m8g-2 (Graviton4 ARM, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Cross-check on m7i-profiler (single-dp, x86 SPR, with TMA)

TMA on the headline test

Notes

Uh oh!

Uh oh!

ShooterIT left a comment

Choose a reason for hiding this comment

Uh oh!

fcostaoliveira commented May 8, 2026

Uh oh!

Uh oh!

ShooterIT commented May 9, 2026

Uh oh!

Uh oh!

fcostaoliveira commented Apr 28, 2026 •

edited by cursor Bot

Loading

augmentcode Bot Apr 29, 2026 •

edited

Loading

augmentcode Bot Apr 29, 2026 •

edited

Loading

augmentcode Bot Apr 29, 2026 •

edited

Loading

Validation — `x86-aws-m7i.metal-24xl-2`

`oss-standalone-08-io-threads` — io-threads safety check

`oss-standalone` pipeline-10 — non-iothreads pipelined

Coverage matrix (PR `792c9ca1c` vs `7ecc04f59d`, m7i-2)

fcostaoliveira commented May 8, 2026 •

edited

Loading

Throughput — `m7i-2` (Sapphire Rapids x86, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Throughput — `m8g-2` (Graviton4 ARM, multi-dp, σ-disjoint where ≥ 2 dp/cell)

Cross-check on `m7i-profiler` (single-dp, x86 SPR, with TMA)