Add batched dict bucket prefetch for MSET/MSETNX by fcostaoliveira · Pull Request #15043 · redis/redis

fcostaoliveira · 2026-04-14T16:56:17Z

Summary

Add intra-command batched prefetch to msetGenericCommand. When processing
multiple keys in a single MSET/MSETNX command, the dict bucket pointers for
the next batch of 16 keys are prefetched into L1 cache before executing
setKey for each key. This hides DRAM latency from scattered dict lookups
in large keyspaces where the dict exceeds L3 cache.

The optimization is analogous to the HGETALL batched prefetch (#14988) and
the MGET cross-command prefetch (#14899), but applied to the MSET write path.

When it helps:

Keyspace > L3 cache (10M+ keys on typical server hardware)
Multiple keys per MSET command (10+ keys per command)
Random key access patterns (not sequential)

When it doesn't help:

Small keyspaces (1M keys fit in L3, so the dict stays cache-warm)
Single-key SET (no prefetch window)
Sequential key patterns (hardware prefetcher handles these)

How it works

The MSET loop is restructured into batches of MSET_BATCH_SIZE (16) keys:

1. Prefetch phase: For each key in the batch, compute the dict hash and
   prefetch the bucket pointer from ht_table[0] (and ht_table[1] if
   rehashing). This is advisory; it merely suggests cache lines to load.
2. Execute phase: For each key in the batch, call setKey() with the
   cache already warm from step 1.

For MSETNX, both the existence-check pass and the write pass are batched.

The prefetch is advisory — correctness does not depend on it. If a rehash
occurs mid-batch (triggered by a prior setKey expanding the dict), the
prefetched bucket may be stale, but setKey → lookupKeyWriteWithLink →
dictFindLink re-derives the correct bucket.
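
The two phases can be illustrated outside of Redis with a minimal sketch. Everything here is an assumption made for illustration: the `toy_*` names, the hash function, and the chained table layout are hypothetical and are not the PR's code, and `__builtin_prefetch` is the GCC/Clang builtin rather than the Redis prefetch macro. The point is only that the lookup recomputes the bucket itself, so the prefetch can be stale or skipped without affecting the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 16 /* mirrors MSET_BATCH_SIZE from the description */

/* Hypothetical chained hash table; not the Redis dict. */
typedef struct toy_entry {
    uint64_t key;
    int value;
    struct toy_entry *next;
} toy_entry;

typedef struct {
    toy_entry **buckets;
    size_t mask; /* table size - 1, table size is a power of two */
} toy_table;

static uint64_t toy_hash(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    return k;
}

/* Phase 1: hint each key's bucket slot into cache. Purely advisory. */
static void toy_prefetch_batch(const toy_table *t, const uint64_t *keys, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&t->buckets[toy_hash(keys[i]) & t->mask], 0, 3);
}

/* Ordinary lookup: re-derives the bucket, independent of any prefetch. */
static toy_entry *toy_find(const toy_table *t, uint64_t key) {
    for (toy_entry *e = t->buckets[toy_hash(key) & t->mask]; e; e = e->next)
        if (e->key == key) return e;
    return NULL;
}

/* Phase 1 + phase 2 per batch: overwrite every key that exists.
 * Returns the number of keys that were found and updated. */
static size_t toy_update_all(toy_table *t, const uint64_t *keys, size_t nkeys, int value) {
    size_t updated = 0;
    assert((t->mask & (t->mask + 1)) == 0); /* size must be a power of two */
    for (size_t j = 0; j < nkeys; j += BATCH_SIZE) {
        size_t batch = nkeys - j < BATCH_SIZE ? nkeys - j : BATCH_SIZE;
        toy_prefetch_batch(t, keys + j, batch);
        for (size_t k = 0; k < batch; k++) {
            toy_entry *e = toy_find(t, keys[j + k]);
            if (e) { e->value = value; updated++; }
        }
    }
    return updated;
}
```

Removing the `toy_prefetch_batch` call changes timing only, never the return value, which is the sense in which the prefetch is advisory.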

Benchmark Results

Tested on x86-aws-m7i.metal-24xl-2 (Intel Xeon Platinum 8488C, 96 cores,
bare metal), oss-standalone topology, 10M keys (dict ~800MB, well beyond
L3 cache).

10 keys per MSET

Test	Baseline (unstable)	PR	Change
10M × 1000B (RAW)	97,435	101,644	+4.3%
10M × 100B (RAW)	127,689	132,416	+3.7%
10M × 10B (EMBSTR)	132,908	134,659	+1.3%

50 keys per MSET

Test	Baseline (unstable)	PR	Change
10M × 10B (EMBSTR)	65,714	69,115	+5.2%
10M × 100B (RAW)	61,021	63,665	+4.3%

1M keys (L3-resident, no benefit expected)

Test	Baseline (unstable)	PR	Change
1M × 10B	123,191	124,979	+1.5% (noise)

Pattern: In the 10-key runs, larger values = more memory scatter = more
cache misses = more prefetch benefit. More keys per MSET = wider prefetch
batch = bigger improvement. In the 10-key runs the EMBSTR encoding (10B)
shows less benefit, plausibly because key + value are embedded in the same
allocation (fewer pointer chases); with 50 keys per MSET the 10B case
gains the most (+5.2%).

Regression check

Zero regressions. Single-key SET/GET and pipelined workloads are unaffected
(the numkeys > 1 && dict non-empty guard skips prefetch for these cases).

Files Changed

src/t_string.c — msetGenericCommand(): batched prefetch loop +
msetPrefetchBatch() static helper

Note

Medium Risk
Touches the hot MSET/MSETNX write path and introduces direct dict bucket prefetching plus batched iteration, which could surface subtle correctness or memory-safety issues if dict/slot lifetime assumptions are wrong (mitigated by per-batch dict re-fetch and added tests).

Overview
Adds an intra-command, batched dict-bucket prefetch to msetGenericCommand, restructuring both the MSETNX existence-check pass and the MSET write pass into 16-key batches and prefetching the relevant hash table buckets (including rehash table when applicable) before doing lookups/sets.

Includes new unit tests that exercise MSET across the 16-key batch boundary and a regression test ensuring correctness when expired keys can cause per-slot dicts to be freed/recreated mid-command.

Reviewed by Cursor Bugbot for commit e04e2ec. Bugbot is set up for automated code reviews on this repo.

Adds intra-command batched prefetch to msetGenericCommand. When processing multiple keys, dict bucket pointers for the next batch of 16 keys are prefetched into L1 cache before executing setKey for each key. This hides memory latency from scattered dict lookups in large MSET commands. The prefetch is advisory — correctness does not depend on it. If a rehash occurs mid-batch (triggered by a prior setKey expanding the dict), the prefetched bucket may be stale. The subsequent lookupKeyWriteWithLink inside setKey re-derives the correct bucket. Both the MSETNX existence-check pass and the write pass are batched. Single-key MSET or empty-dict cases skip prefetch entirely.

augmentcode · 2026-04-14T17:02:46Z

🤖 Augment PR Summary

Summary: This PR adds an intra-command batched dict-bucket prefetch optimization to the MSET/MSETNX write path.

Changes:

Introduces a new static helper (msetPrefetchBatch) to prefetch per-key hash table bucket pointers
Restructures msetGenericCommand into batches of 16 keys to separate prefetch and execution phases
Applies the batching both to the MSETNX existence-check pass and the actual set pass
Prefetches both HT[0] and HT[1] buckets when the dict is rehashing
Uses a guard to skip prefetch for single-key commands and empty dicts

Technical Notes: The prefetch is advisory and intended to reduce DRAM latency from random dict lookups in large keyspaces.

augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2026-04-14T17:02:47Z

+     * entries to look up. Single-key MSET or empty dict (all inserts) skip
+     * prefetch since there's no lookup to warm. */
+    int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0;
+    dict *d = kvstoreGetDict(c->db->keys, slot);


dict *d is cached once and reused for all subsequent msetPrefetchBatch() calls, but lookupKeyWrite() / setKey() can delete expired keys and (in cluster mode with KVSTORE_FREE_EMPTY_DICTS) potentially free the slot dict when it becomes empty, making d a dangling pointer. Consider ensuring the prefetch path can’t retain a stale dict * across expiration-driven deletions within the same command.

Severity: high

Fixed in 1b8bcce. The slot dict is now re-fetched per batch via the new msetMaybePrefetchBatch helper, so a prior batch's setKey → expireIfNeeded → freeDictIfNeeded → dbAddByLink sequence (cluster mode, KVSTORE_FREE_EMPTY_DICTS) can't leave us with a dangling pointer. Cost: one extra kvstoreGetDict call per 16 keys.
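
A minimal sketch of why the per-batch re-fetch matters, under stated assumptions: `toy_store`, `store_get_dict`, and the other names below are hypothetical stand-ins for a keyed store that frees a slot's dict when its last key is removed; they are not the kvstore API. The writer looks the dict up again at the start of every batch instead of caching the pointer once.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_SLOTS 4
#define BATCH_SIZE 16

typedef struct { int live_keys; } toy_dict;

/* Hypothetical store that frees a slot's dict once it becomes empty. */
typedef struct { toy_dict *slots[NUM_SLOTS]; } toy_store;

static toy_dict *store_get_dict(toy_store *s, int slot) {
    return s->slots[slot];
}

/* Removing the last key frees the slot dict. */
static void store_delete_key(toy_store *s, int slot) {
    toy_dict *d = s->slots[slot];
    if (d && --d->live_keys == 0) {
        free(d);
        s->slots[slot] = NULL;
    }
}

/* Adding a key lazily (re)creates the slot dict, possibly at a new address. */
static void store_add_key(toy_store *s, int slot) {
    if (!s->slots[slot]) {
        s->slots[slot] = calloc(1, sizeof(toy_dict));
        assert(s->slots[slot] != NULL);
    }
    s->slots[slot]->live_keys++;
}

/* Write `nkeys` keys into `slot`. The first `expired` keys collide with
 * expired entries (delete, then insert); the rest are fresh inserts.
 * Returns the number of batches that saw a live dict to prefetch from. */
static int store_write_batched(toy_store *s, int slot, int nkeys, int expired) {
    int prefetched_batches = 0;
    for (int j = 0; j < nkeys; j += BATCH_SIZE) {
        int batch = nkeys - j < BATCH_SIZE ? nkeys - j : BATCH_SIZE;
        /* Re-fetch per batch: a pointer cached before the previous batch
         * may have been freed by an expiration-driven delete. */
        toy_dict *d = store_get_dict(s, slot);
        if (d && d->live_keys > 0) prefetched_batches++; /* prefetch would go here */
        for (int k = 0; k < batch; k++) {
            if (j + k < expired) store_delete_key(s, slot);
            store_add_key(s, slot);
        }
    }
    return prefetched_batches;
}
```

With one expired key and a 20-key write, the first batch frees and recreates the slot dict; the second batch picks up the new pointer because it calls `store_get_dict` again rather than reusing the old one.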

Added a regression test in tests/unit/type/string.tcl ("MSET overwrites expired keys across batch boundary") that exercises the expire-then-insert path across a 20-key MSET (two batches). ASAN + test-sanitizer-address and test-external-cluster are green on the follow-up.

fcostaoliveira · 2026-04-14T17:03:48Z

Polar Signals CPU Profile: unstable vs PR

Profiled on x86-aws-m7i.metal-24xl-profiler (Intel Xeon Platinum 8488C) with
parca-agent, test memtier_benchmark-10Mkeys-load-string-mset-50-keys-with-100B-values.

dict.c — the prefetch target

Function	Unstable (flat)	PR (flat)	Change
`dictFindLinkInternal`	5.1B (8.5%)	0.58B (1.0%)	−88%
`dictStoredKey2Key`	1.2B (2.0%)	1.1B (1.8%)	−11%
`dictGetKey`	0.68B (1.1%)	0.32B (0.5%)	−54%
`dictGetHash`	0.53B (0.9%)	1.53B (2.6%)	+189% (expected — prefetch adds hash computation)
`dictFindLink` (wrapper)	0.05B	0.16B	+220% (more calls from batch loop)

dictFindLinkInternal flat cost dropped 8.5× (from 8.5% to 1.0%) — the cache misses
that dominated dict lookups are now hidden by the prefetch.

The dictGetHash increase (+189%) is expected: the prefetch phase computes hashes
ahead of time. But the net CPU savings from cache-warm lookups far outweigh the
extra hash computation — total msetGenericCommand cumulative drops from 76.5% to 33.7%.

msetGenericCommand — total cost

Metric	Unstable	PR	Change
Cumulative	45.8B (76.5%)	20.2B (33.7%)	−56%

The MSET function spends 56% less total CPU with prefetch enabled, freeing cycles
for I/O processing (writeToClient, readQueryFromClient).

Throughput on profiler runner

Build	Ops/sec
Unstable	66,949
PR	68,850
Change	+2.8%

(Profiler runner has higher overhead from parca-agent sampling — the bare-metal
runners show +4.3% on this test.)
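
The "Change" columns above are plain relative differences, which can be re-derived with a one-line helper. This is a sketch for checking the arithmetic only; `pct_change` is a hypothetical name, not part of any benchmark tooling.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `pct_change(45.8, 20.2)` is about -55.9, which the table rounds to −56%, and `pct_change(66949, 68850)` is about +2.8.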

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 61e3e75.

fcostaoliveira · 2026-04-14T17:21:31Z

Intel TMA (Top-Down Microarchitecture Analysis) — Pipeline Slot Funnel

Collected via topdown-profiler (L4 depth) on x86-aws-m7i.metal-24xl-profiler
(Intel Xeon Platinum 8488C, Sapphire Rapids).

Test: memtier_benchmark-10Mkeys-load-string-mset-50-keys-with-10B-values

Pipeline Slot Funnel Comparison

TMA Metric	Unstable	PR (prefetch)	Δ
Retiring (useful work)	41.7%	40.2%	-1.5pp
Backend_Bound	36.0%	35.0%	-1.0pp
└ Memory_Bound	18.7%	17.9%	-0.8pp
└ L3_Bound	20.4%	19.0%	-1.4pp
└ L1_Bound	7.7%	7.9%	+0.2pp
└ DRAM_Bound	0.0%	0.1%	—
└ Core_Bound	8.9%	8.9%	—
└ Serializing_Operation	13.9%	13.1%	-0.8pp
Frontend_Bound	18.4%	19.3%	+0.9pp
└ Fetch_Bandwidth	10.8%	11.4%	+0.6pp
Bad_Speculation	3.9%	5.5%	+1.6pp

Interpretation

L3_Bound dropped 1.4pp (20.4% → 19.0%) — the prefetch is successfully hiding L3 cache misses from dict bucket lookups. This aligns with the Polar Signals data showing dictFindLinkInternal flat cost dropping 8.5× (from 8.5% to 1.0%).
Frontend_Bound increased 0.9pp — expected cost of the extra prefetch instructions (dictGetHash + redis_prefetch_read in the batch loop). This is the price paid for the cache warming.
Serializing_Operation dropped 0.8pp — fewer stalls from serialized memory access chains (the prefetch breaks the dependency chain between consecutive dict lookups).
Net: Backend_Bound -1.0pp, Memory_Bound -0.8pp — the microarchitectural data confirms the throughput improvement comes from reduced memory stalls, exactly as designed.

sundb · 2026-04-21T12:10:09Z

+        if (use_prefetch) {
+            for (j = 1; j < c->argc; j += 2 * MSET_BATCH_SIZE) {
+                int remaining = (c->argc - j + 1) / 2;
+                int batch = remaining < MSET_BATCH_SIZE ? remaining : MSET_BATCH_SIZE;
+                msetPrefetchBatch(d, c, j, batch);
+                for (int k = 0; k < batch; k++) {
+                    if (lookupKeyWrite(c->db, c->argv[j + k * 2]) != NULL) {
+                        addReply(c, shared.czero);
+                        return;
+                    }
+                }
+            }
+        } else {
+            for (j = 1; j < c->argc; j += 2) {
+                if (lookupKeyWrite(c->db, c->argv[j]) != NULL) {
+                    addReply(c, shared.czero);
+                    return;
+                }


Is there a way to make these two branches share the same code?
The same applies to the loop below.

Collapsed in 1b8bcce. Both the MSETNX existence-check pass and the write pass now use a single batched-loop structure; use_prefetch only gates the prefetch call (not the loop shape), and the else { simple loop } branches are gone. The prefetch itself is routed through a new msetMaybePrefetchBatch helper that re-fetches the slot dict per batch (also addressing the augmentcode/cursor bugbot concern about stale dict *).
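
The collapsed loop shape can be sketched as follows, assuming the batch arithmetic quoted in the review thread (`remaining = (argc - j + 1) / 2`) carried over unchanged. The `walk_stats` struct and `mset_walk_batches` are hypothetical names used only to make the batch boundaries observable; the real function calls the prefetch helper and lookupKeyWrite/setKey where the comments indicate.

```c
#include <assert.h>

#define MSET_BATCH_SIZE 16

typedef struct {
    int batches;        /* number of batches walked */
    int prefetch_calls; /* batches for which the prefetch hook ran */
    int keys_visited;   /* keys processed in the execute phase */
    int last_batch;     /* size of the final batch */
} walk_stats;

/* argv layout: argv[0] is the command, then key/value pairs, so a
 * command with n keys has argc == 1 + 2n. */
static walk_stats mset_walk_batches(int argc, int use_prefetch) {
    walk_stats st = {0, 0, 0, 0};
    assert(argc >= 3 && argc % 2 == 1);
    for (int j = 1; j < argc; j += 2 * MSET_BATCH_SIZE) {
        int remaining = (argc - j + 1) / 2;
        int batch = remaining < MSET_BATCH_SIZE ? remaining : MSET_BATCH_SIZE;
        /* The flag gates only the prefetch call, not the loop shape. */
        if (use_prefetch) st.prefetch_calls++;
        for (int k = 0; k < batch; k++) {
            int key_index = j + k * 2;    /* keys sit at odd argv indexes */
            assert(key_index < argc - 1); /* each key is followed by a value */
            st.keys_visited++;            /* lookup or set would go here */
        }
        st.batches++;
        st.last_batch = batch;
    }
    return st;
}
```

A 20-key command (`argc == 41`) walks two batches of 16 and 4 keys whether or not prefetch is enabled, which matches the 20-key regression test mentioned earlier in the thread.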

sundb · 2026-04-21T12:11:22Z

+ * occurs mid-batch (triggered by a prior setKey), the prefetched bucket may
+ * be stale, but the subsequent lookupKeyWrite/setKey will re-derive the
+ * correct bucket. */
+static void msetPrefetchBatch(dict *d, client *c, int start, int count) {


Passing argv would be more precise.

Suggested change

-static void msetPrefetchBatch(dict *d, client *c, int start, int count) {
+static void msetPrefetchBatch(dict *d, robj *argv, int start, int count) {

Applied in 1b8bcce — signature is now msetPrefetchBatch(dict *d, robj **argv, int start, int count).

Addresses three review threads on PR redis#15043:

1. Stale dict pointer (augmentcode + cursor bugbot, high/medium). In cluster mode, server.db[].keys is created with KVSTORE_FREE_EMPTY_DICTS. A prior batch's setKey -> expireIfNeeded may delete the last pre-existing key in the slot, triggering freeDictIfNeeded; the subsequent dbAddByLink then allocates a new dict. The cached dict* from before the batch would be a dangling pointer on the next msetPrefetchBatch call. Re-fetch the slot dict inside a new msetMaybePrefetchBatch helper that runs per batch.
2. Helper signature (sundb). Change msetPrefetchBatch to take `robj **argv` instead of `client *c`, making the contract explicit.
3. Duplication (sundb). Collapse the `if (use_prefetch) { batched } else { simple }` branches in both the MSETNX existence-check pass and the write pass into a single batched loop that conditionally calls the prefetch helper. `use_prefetch` now only gates the prefetch call, not the loop structure.

Add regression tests in tests/unit/type/string.tcl:

- MSET spanning multiple prefetch batches (16, 17, 32, 33, 40 keys)
- MSET overwriting expired keys across a batch boundary (exercises the expireIfNeeded path in the same code that would UAF under cluster mode)

fcostaoliveira · 2026-04-21T14:48:20Z

CE Performance Automation : step 1 of 2 (build) STARTING...

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 14:48:20.127389
You can check each build/benchmark progress in grafana:

git hash: e04e2ec
git branch: mset-batch-prefetch
commit date and time: 2026-04-21 15:18:19+01:00
commit summary: Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch
test filters:
- command priority lower limit: 0
- command priority upper limit: 100000
- test name regex: .*mset.*
- command group regex: .*

fcostaoliveira · 2026-04-21T14:49:20Z

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 17:50:22.714540 and took 59 seconds.
You can check each build/benchmark progress in grafana:

git hash: e04e2ec
git branch: mset-batch-prefetch
commit date and time: 2026-04-21 15:18:19+01:00
commit summary: Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch
test filters:
- command priority lower limit: 0
- command priority upper limit: 100000
- test name regex: .*mset.*
- command group regex: .*

You can check a comparison in detail via the grafana link

fcostaoliveira · 2026-04-21T23:17:55Z

Benchmark update — MSET on `x86-aws-m7i.metal-24xl-2`

Head-to-head on the latest commit (e04e2ecad9, merge of origin/unstable into the PR branch) vs redis/redis unstable (0fa78fd8fd). Topology: oss-standalone. 1 datapoint per side — re-running for 3-datapoint stable medians next.

Improvements (≥ 3%)

Test	Baseline (ops/sec)	PR (ops/sec)	Δ
`10Mkeys-load-string-mset-50-keys-with-10B-values`	65,166	70,529	+8.2%
`10Mkeys-load-string-mset-50-keys-with-100B-values`	61,376	65,584	+6.9%
`10Mkeys-load-string-mset-10-keys-with-100B-values`	128,794	133,960	+4.0%
`1Mkeys-load-string-mset-10-keys-with-100B-values`	113,461	117,142	+3.2%

Below the 3% threshold

Test	Baseline (ops/sec)	PR (ops/sec)	Δ
`10Mkeys-load-string-mset-10-keys-with-10B-values`	134,977	138,582	+2.7%
`1Mkeys-load-string-mset-10-keys-with-10B-values`	122,092	124,487	+2.0%
`10Mkeys-load-string-mset-10-keys-with-1000B-values`	102,009	103,484	+1.4%
`1Mkeys-load-hash-hmset-5-fields-with-1000B-values`	102,268	104,162	+1.9% (not MSET, not affected by the change)

Pattern: largest wins on the 10M-key variants with 50 keys per MSET, the widest prefetch window on a DRAM-bound dict and exactly the scenario the change is designed for. Among the 10-key tests, 100B values win more than 10B/1000B values, which is attributed here to 100B sitting near the RAW/EMBSTR boundary where the dict-bucket prefetch matters most (RAW requires a separate value allocation, unlike EMBSTR). The 50-key tests do not follow that ordering (10B +8.2% vs 100B +6.9%).

No regressions. No MSET test shows a negative delta.

Environment

Runner: x86-aws-m7i.metal-24xl-2 (AWS m7i.metal-24xl, Intel Xeon Platinum 8488C "Sapphire Rapids", 96 cores, bare-metal)
Benchmark harness: redis-benchmarks-specification v0.3.0
Build: gcc 15.2.0, Debian Bookworm, make -j
3-datapoint stable-median run coming next to tighten confidence.

@mpozniak95

…15133)

Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop.

Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch).

## Design

Two new optional callbacks on `dictType`:

```c
typedef struct dictType {
    ...
    /* Bring the entry's key payload into cache before keyCompare runs.
     * Returns the address to prefetch, or NULL if the entry alone is enough. */
    void *(*prefetchEntryKey)(const dictEntry *de);
    /* Called only after a key match. Returns the value-side payload to
     * prefetch (or NULL). */
    void *(*prefetchEntryValue)(const dictEntry *de);
} dictType;
```

`dbDictType` registers both. The kv-aware logic (the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values) now lives in two small helpers in `server.c`:

```c
static void *dbDictPrefetchEntryKey(const dictEntry *de) {
    return dictEntryIsKey(de) ? NULL : dictGetKey(de);
}

static void *dbDictPrefetchEntryValue(const dictEntry *de) {
    kvobj *kv = dictGetKey(de);
    return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL;
}
```

The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed: the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`.
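
The optional-callback pattern described above can be sketched in isolation. This is a toy model under stated assumptions: `toy_entry`, `toy_type`, and `toy_issue_prefetch` are hypothetical names, the entry layout is invented, and `__builtin_prefetch` is the GCC/Clang builtin. It only shows how a NULL callback or a NULL return both mean "nothing extra to prefetch".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry: the key is either embedded in the entry or
 * stored behind a pointer. */
typedef struct toy_entry {
    const char *key;
    const char *val;
    int key_embedded;
} toy_entry;

/* Hypothetical type descriptor with two optional prefetch hooks. */
typedef struct toy_type {
    void *(*prefetchEntryKey)(const toy_entry *e);
    void *(*prefetchEntryValue)(const toy_entry *e);
} toy_type;

/* Embedded keys arrive with the entry itself, so no extra prefetch. */
static void *toy_prefetch_key(const toy_entry *e) {
    return e->key_embedded ? NULL : (void *)e->key;
}

static void *toy_prefetch_value(const toy_entry *e) {
    return (void *)e->val;
}

/* Runs one hook. Returns 1 if a prefetch was issued, 0 if the hook is
 * absent or reported nothing to prefetch. */
static int toy_issue_prefetch(void *(*hook)(const toy_entry *), const toy_entry *e) {
    assert(e != NULL);
    if (!hook) return 0;
    void *addr = hook(e);
    if (!addr) return 0;
    __builtin_prefetch(addr, 0, 3);
    return 1;
}
```

The 0/1 return value is the kind of signal a driver could use to decide whether to yield to another in-flight lookup or continue immediately.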
`memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout.

## State machine

Two file-local types in `memory_prefetch.c`:

| Type | Role |
|---|---|
| `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). |
| `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. |

Five-stage lifecycle for each lookup, driven by the prefetcher:

```text
                                │
                              start
                                │
                       ┌────────▼─────────┐
            ┌─────────►│ PREFETCH_BUCKET  ├────►────────┐
            │          └────────┬─────────┘      no more tables
            │             bucket│found                  │
            │                   │                       │
entry not found - goto next table                       │
            │          ┌────────▼────────┐              │
            └────◄─────┤ PREFETCH_ENTRY  │              ▼
         ┌────────────►└────────┬────────┘              │
         │                 entry│found                  │
         │                      │                       │
         │          ┌───────────▼─────────────┐         │
         │          │   PREFETCH_ENTRY_KEY    │         │  ◄── dictType->prefetchEntryKey(de)
         │          └───────────┬─────────────┘         │
         │                      │                       │
key mismatch - goto next entry  │                       │
         │          ┌───────────▼─────────────┐         │
         └──────◄───┤  PREFETCH_ENTRY_VALUE   │         │  ◄── keyCompare; on match,
                    └───────────┬─────────────┘         │      dictType->prefetchEntryValue(de)
                                │                       │
                      ┌─────────▼─────────────┐         │
                      │     PREFETCH_DONE     │◄────────┘
                      └───────────────────────┘
```

`PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain.

`memory_prefetch.c` exposes a small lifecycle that any caller can drive:

```c
dictPrefetcherInit(p, max_keys);            /* one-shot heap alloc of lookups[] */
dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */
dictPrefetcherRun(p);                       /* drive FSM until all PREFETCH_DONE */
dictPrefetcherFree(p);                      /* release */
```

Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state.
The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics (a state only advances the cursor if a prefetch was actually issued) are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration.

The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call.

## Intra-command API

```c
void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys);
```

A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return: nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap.

## Notes on the call sites

A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`:

```c
/* Pick the next prefetch batch starting at argv[start] and warm it via
 * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for
 * key/value pairs (MSET). Returns the chosen batch size in items. */
static int prefetchKeysBatch(client *c, int slot, int start, int stride);
```

Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit.
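
The adaptive sizing rule can be written down as a small sketch. It assumes only the rule as stated in the text; `next_batch_size` and `count_batches` are hypothetical names, and the real helper also performs the prefetch itself.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_BATCH_SIZE 16

/* Take one full batch while at least two full batches remain;
 * otherwise take everything that is left in a single call. */
static int next_batch_size(int remaining) {
    assert(remaining >= 0);
    return remaining >= 2 * PREFETCH_BATCH_SIZE ? PREFETCH_BATCH_SIZE : remaining;
}

/* Count the batches a request of `items` keys is split into and
 * report the size of the final batch through `last` (may be NULL). */
static int count_batches(int items, int *last) {
    int batches = 0;
    while (items > 0) {
        int n = next_batch_size(items);
        if (last != NULL) *last = n;
        items -= n;
        batches++;
    }
    return batches;
}
```

Under this rule a 40-key request is split into 16 + 24 rather than 16 + 16 + 8, and the trailing batch never exceeds 31 items, which stays below the `DICT_PREFETCH_MAX_SIZE = 64` bound mentioned above.
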
- **MGET (`mgetCommand`)**: gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch.
- **MSET / MSETNX (`msetGenericCommand`)**: same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match, warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch: in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle.
- **Cross-command batch path (`addCommandToBatch`)**: sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6% regression on x86 with pipeline-10, and the partial cross-command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys.
- **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`.

## Benchmark validation

On x86, performance improvements are significant for larger batch sizes:

- 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44%
- 5Mkeys-string-mget-100B-100keys: +37.33%
- 5Mkeys-string-mget-100B-30keys: +22.40%

On ARM (Graviton4), the gains are even more pronounced:

- 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34%
- 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76%

Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions.

---------

Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>

ShooterIT · 2026-05-11T09:35:49Z

closing this since we merged #15133

fcostaoliveira requested a review from sundb April 14, 2026 16:57

augmentcode Bot reviewed Apr 14, 2026

cursor Bot reviewed Apr 14, 2026

fcostaoliveira requested a review from skaslev April 15, 2026 11:17

sundb reviewed Apr 21, 2026

fcostaoliveira added 2 commits April 21, 2026 13:52

Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch

e04e2ec

fcostaoliveira requested a review from sundb April 21, 2026 14:46

fcostaoliveira requested a review from ShooterIT April 22, 2026 12:57

fcostaoliveira mentioned this pull request Apr 28, 2026

Batched MGET/MSET dict prefetch with dictType-driven payload hints #15133

Merged

ShooterIT closed this May 11, 2026
