Add batched dict bucket prefetch for MSET/MSETNX#15043
Conversation
Adds intra-command batched prefetch to msetGenericCommand. When processing multiple keys, dict bucket pointers for the next batch of 16 keys are prefetched into L1 cache before executing setKey for each key. This hides memory latency from scattered dict lookups in large MSET commands. The prefetch is advisory — correctness does not depend on it. If a rehash occurs mid-batch (triggered by a prior setKey expanding the dict), the prefetched bucket may be stale. The subsequent lookupKeyWriteWithLink inside setKey re-derives the correct bucket. Both the MSETNX existence-check pass and the write pass are batched. Single-key MSET or empty-dict cases skip prefetch entirely.
🤖 Augment PR SummarySummary: This PR adds an intra-command batched dict-bucket prefetch optimization to the MSET/MSETNX write path. Changes:
Technical Notes: The prefetch is advisory and intended to reduce DRAM latency from random dict lookups in large keyspaces. 🤖 Was this summary useful? React with 👍 or 👎 |
| * entries to look up. Single-key MSET or empty dict (all inserts) skip | ||
| * prefetch since there's no lookup to warm. */ | ||
| int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0; | ||
| dict *d = kvstoreGetDict(c->db->keys, slot); |
There was a problem hiding this comment.
dict *d is cached once and reused for all subsequent msetPrefetchBatch() calls, but lookupKeyWrite() / setKey() can delete expired keys and (in cluster mode with KVSTORE_FREE_EMPTY_DICTS) potentially free the slot dict when it becomes empty, making d a dangling pointer. Consider ensuring the prefetch path can’t retain a stale dict * across expiration-driven deletions within the same command.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
Fixed in 1b8bcce. The slot dict is now re-fetched per batch via the new msetMaybePrefetchBatch helper, so a prior batch's setKey → expireIfNeeded → freeDictIfNeeded → dbAddByLink sequence (cluster mode, KVSTORE_FREE_EMPTY_DICTS) can't leave us with a dangling pointer. Cost: one extra kvstoreGetDict call per 16 keys.
Added a regression test in tests/unit/type/string.tcl ("MSET overwrites expired keys across batch boundary") that exercises the expire-then-insert path across a 20-key MSET (two batches). ASAN + test-sanitizer-address and test-external-cluster are green on the follow-up.
Polar Signals CPU Profile: unstable vs PRProfiled on dict.c — the prefetch target
The msetGenericCommand — total cost
The MSET function spends 56% less total CPU with prefetch enabled, freeing cycles Throughput on profiler runner
(Profiler runner has higher overhead from parca-agent sampling — the bare-metal |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 61e3e75. Configure here.
Intel TMA (Top-Down Microarchitecture Analysis) — Pipeline Slot FunnelCollected via topdown-profiler (L4 depth) on Test: Pipeline Slot Funnel Comparison
Interpretation
|
| if (use_prefetch) { | ||
| for (j = 1; j < c->argc; j += 2 * MSET_BATCH_SIZE) { | ||
| int remaining = (c->argc - j + 1) / 2; | ||
| int batch = remaining < MSET_BATCH_SIZE ? remaining : MSET_BATCH_SIZE; | ||
| msetPrefetchBatch(d, c, j, batch); | ||
| for (int k = 0; k < batch; k++) { | ||
| if (lookupKeyWrite(c->db, c->argv[j + k * 2]) != NULL) { | ||
| addReply(c, shared.czero); | ||
| return; | ||
| } | ||
| } | ||
| } | ||
| } else { | ||
| for (j = 1; j < c->argc; j += 2) { | ||
| if (lookupKeyWrite(c->db, c->argv[j]) != NULL) { | ||
| addReply(c, shared.czero); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Is there a way to make these two use the same code?
the same below.
There was a problem hiding this comment.
Collapsed in 1b8bcce. Both the MSETNX existence-check pass and the write pass now use a single batched-loop structure; use_prefetch only gates the prefetch call (not the loop shape), and the else { simple loop } branches are gone. The prefetch itself is routed through a new msetMaybePrefetchBatch helper that re-fetches the slot dict per batch (also addressing the augmentcode/cursor bugbot concern about stale dict *).
| * occurs mid-batch (triggered by a prior setKey), the prefetched bucket may | ||
| * be stale, but the subsequent lookupKeyWrite/setKey will re-derive the | ||
| * correct bucket. */ | ||
| static void msetPrefetchBatch(dict *d, client *c, int start, int count) { |
There was a problem hiding this comment.
Passing argv would be more precise.
| static void msetPrefetchBatch(dict *d, client *c, int start, int count) { | |
| static void msetPrefetchBatch(dict *d, robj *argv, int start, int count) { |
There was a problem hiding this comment.
Applied in 1b8bcce — signature is now msetPrefetchBatch(dict *d, robj **argv, int start, int count).
Addresses three review threads on PR redis#15043: 1. Stale dict pointer (augmentcode + cursor bugbot, high/medium). In cluster mode, server.db[].keys is created with KVSTORE_FREE_EMPTY_DICTS. A prior batch's setKey -> expireIfNeeded may delete the last pre-existing key in the slot, triggering freeDictIfNeeded; the subsequent dbAddByLink then allocates a new dict. The cached dict* from before the batch would be a dangling pointer on the next msetPrefetchBatch call. Re-fetch the slot dict inside a new msetMaybePrefetchBatch helper that runs per batch. 2. Helper signature (sundb). Change msetPrefetchBatch to take `robj **argv` instead of `client *c`, making the contract explicit. 3. Duplication (sundb). Collapse the `if (use_prefetch) { batched } else { simple }` branches in both the MSETNX existence-check pass and the write pass into a single batched loop that conditionally calls the prefetch helper. `use_prefetch` now only gates the prefetch call, not the loop structure. Add regression tests in tests/unit/type/string.tcl: - MSET spanning multiple prefetch batches (16, 17, 32, 33, 40 keys) - MSET overwriting expired keys across a batch boundary (exercises the expireIfNeeded path in the same code that would UAF under cluster mode)
CE Performance Automation : step 1 of 2 (build) STARTING...This comment was automatically generated given a benchmark was triggered.
|
CE Performance Automation : step 1 of 2 (build) DONE.This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link |
Benchmark update — MSET on
|
| Test | Baseline (ops/sec) | PR (ops/sec) | Δ |
|---|---|---|---|
10Mkeys-load-string-mset-50-keys-with-10B-values |
65,166 | 70,529 | +8.2% |
10Mkeys-load-string-mset-50-keys-with-100B-values |
61,376 | 65,584 | +6.9% |
10Mkeys-load-string-mset-10-keys-with-100B-values |
128,794 | 133,960 | +4.0% |
1Mkeys-load-string-mset-10-keys-with-100B-values |
113,461 | 117,142 | +3.2% |
Below the 3% threshold
| Test | Baseline (ops/sec) | PR (ops/sec) | Δ |
|---|---|---|---|
10Mkeys-load-string-mset-10-keys-with-10B-values |
134,977 | 138,582 | +2.7% |
1Mkeys-load-string-mset-10-keys-with-10B-values |
122,092 | 124,487 | +2.0% |
10Mkeys-load-string-mset-10-keys-with-1000B-values |
102,009 | 103,484 | +1.4% |
1Mkeys-load-hash-hmset-5-fields-with-1000B-values |
102,268 | 104,162 | +1.9% (not MSET, not affected by the change) |
Pattern: largest wins on the 10M-key variants with 50 keys per MSET — widest prefetch window on a DRAM-bound dict, exactly the scenario the change is designed for. 100B value tests win more than 10B/1000B tests because 100B straddles the RAW/EMBSTR boundary where the dict-bucket prefetch matters most (RAW requires a separate value allocation, unlike EMBSTR).
No regressions. No MSET test shows a negative delta.
Environment
- Runner:
x86-aws-m7i.metal-24xl-2(AWSm7i.metal-24xl, Intel Xeon Platinum 8488C "Sapphire Rapids", 96 cores, bare-metal) - Benchmark harness:
redis-benchmarks-specificationv0.3.0 - Build:
gcc 15.2.0, Debian Bookworm,make -j - 3-datapoint stable-median run coming next to tighten confidence.
…15133) Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. */ void *(*prefetchEntryKey)(const dictEntry *de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). */ void *(*prefetchEntryValue)(const dictEntry *de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void *dbDictPrefetchEntryKey(const dictEntry *de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void *dbDictPrefetchEntryValue(const dictEntry *de) { kvobj *kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: | Type | Role | |---|---| | `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). | | `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. | Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); /* one-shot heap alloc of lookups[] */ dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */ dictPrefetcherRun(p); /* drive FSM until all PREFETCH_DONE */ dictPrefetcherFree(p); /* release */ ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c /* Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. */ static int prefetchKeysBatch(client *c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - **MGET (`mgetCommand`)** — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - **Cross-command batch path (`addCommandToBatch`)** — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>
|
closing this since we merged #15133 |

Summary
Add intra-command batched prefetch to
msetGenericCommand. When processingmultiple keys in a single MSET/MSETNX command, the dict bucket pointers for
the next batch of 16 keys are prefetched into L1 cache before executing
setKeyfor each key. This hides DRAM latency from scattered dict lookupsin large keyspaces where the dict exceeds L3 cache.
The optimization is analogous to the HGETALL batched prefetch (#14988) and
the MGET cross-command prefetch (#14899), but applied to the MSET write path.
When it helps:
When it doesn't help:
How it works
The MSET loop is restructured into batches of
MSET_BATCH_SIZE(16) keys:prefetch the bucket pointer from
ht_table[0](andht_table[1]ifrehashing). This is advisory — it merely suggests cache lines to load.
setKey()with thecache already warm from step 1.
For MSETNX, both the existence-check pass and the write pass are batched.
The prefetch is advisory — correctness does not depend on it. If a rehash
occurs mid-batch (triggered by a prior
setKeyexpanding the dict), theprefetched bucket may be stale, but
setKey→lookupKeyWriteWithLink→dictFindLinkre-derives the correct bucket.Benchmark Results
Tested on
x86-aws-m7i.metal-24xl-2(Intel Xeon Platinum 8488C, 96 cores,bare metal),
oss-standalonetopology, 10M keys (dict ~800MB, well beyondL3 cache).
10 keys per MSET
50 keys per MSET
1M keys (L3-resident, no benefit expected)
Pattern: Larger values = more memory scatter = more cache misses = more
prefetch benefit. More keys per MSET = wider prefetch batch = bigger
improvement. The EMBSTR encoding (10B) shows less benefit because
key + value are embedded in the same allocation (fewer pointer chases).
Regression check
Zero regressions. Single-key SET/GET and pipelined workloads are unaffected
(the
numkeys > 1 && dict non-emptyguard skips prefetch for these cases).Files Changed
src/t_string.c—msetGenericCommand(): batched prefetch loop +msetPrefetchBatch()static helperNote
Medium Risk
Touches the hot
MSET/MSETNXwrite path and introduces direct dict bucket prefetching plus batched iteration, which could surface subtle correctness or memory-safety issues if dict/slot lifetime assumptions are wrong (mitigated by per-batch dict re-fetch and added tests).Overview
Adds an intra-command, batched dict-bucket prefetch to
msetGenericCommand, restructuring both theMSETNXexistence-check pass and theMSETwrite pass into 16-key batches and prefetching the relevant hash table buckets (including rehash table when applicable) before doing lookups/sets.Includes new unit tests that exercise
MSETacross the 16-key batch boundary and a regression test ensuring correctness when expired keys can cause per-slot dicts to be freed/recreated mid-command.Reviewed by Cursor Bugbot for commit e04e2ec. Bugbot is set up for automated code reviews on this repo. Configure here.