Skip to content

Add batched dict bucket prefetch for MSET/MSETNX#15043

Closed
fcostaoliveira wants to merge 3 commits into
redis:unstablefrom
filipecosta90:mset-batch-prefetch
Closed

Add batched dict bucket prefetch for MSET/MSETNX#15043
fcostaoliveira wants to merge 3 commits into
redis:unstablefrom
filipecosta90:mset-batch-prefetch

Conversation

@fcostaoliveira

@fcostaoliveira fcostaoliveira commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Add intra-command batched prefetch to msetGenericCommand. When processing
multiple keys in a single MSET/MSETNX command, the dict bucket pointers for
the next batch of 16 keys are prefetched into L1 cache before executing
setKey for each key. This hides DRAM latency from scattered dict lookups
in large keyspaces where the dict exceeds L3 cache.

The optimization is analogous to the HGETALL batched prefetch (#14988) and
the MGET cross-command prefetch (#14899), but applied to the MSET write path.

When it helps:

  • Keyspace > L3 cache (10M+ keys on typical server hardware)
  • Multiple keys per MSET command (10+ keys per command)
  • Random key access patterns (not sequential)

When it doesn't help:

  • Small keyspaces (1M keys fits in L3 — dict stays cache-warm)
  • Single-key SET (no prefetch window)
  • Sequential key patterns (hardware prefetcher handles these)

How it works

The MSET loop is restructured into batches of MSET_BATCH_SIZE (16) keys:

  1. Prefetch phase: For each key in the batch, compute the dict hash and
    prefetch the bucket pointer from ht_table[0] (and ht_table[1] if
    rehashing). This is advisory — it merely suggests cache lines to load.
  2. Execute phase: For each key in the batch, call setKey() with the
    cache already warm from step 1.

For MSETNX, both the existence-check pass and the write pass are batched.

The prefetch is advisory — correctness does not depend on it. If a rehash
occurs mid-batch (triggered by a prior setKey expanding the dict), the
prefetched bucket may be stale, but setKeylookupKeyWriteWithLink
dictFindLink re-derives the correct bucket.

Benchmark Results

Tested on x86-aws-m7i.metal-24xl-2 (Intel Xeon Platinum 8488C, 96 cores,
bare metal), oss-standalone topology, 10M keys (dict ~800MB, well beyond
L3 cache).

10 keys per MSET

Test Baseline (unstable) PR Change
10M × 1000B (RAW) 97,435 101,644 +4.3%
10M × 100B (RAW) 127,689 132,416 +3.7%
10M × 10B (EMBSTR) 132,908 134,659 +1.3%

50 keys per MSET

Test Baseline (unstable) PR Change
10M × 10B (EMBSTR) 65,714 69,115 +5.2%
10M × 100B (RAW) 61,021 63,665 +4.3%

1M keys (L3-resident, no benefit expected)

Test Baseline (unstable) PR Change
1M × 10B 123,191 124,979 +1.5% (noise)

Pattern: Larger values = more memory scatter = more cache misses = more
prefetch benefit. More keys per MSET = wider prefetch batch = bigger
improvement. The EMBSTR encoding (10B) shows less benefit because
key + value are embedded in the same allocation (fewer pointer chases).

Regression check

Zero regressions. Single-key SET/GET and pipelined workloads are unaffected
(the numkeys > 1 && dict non-empty guard skips prefetch for these cases).

Files Changed

  • src/t_string.cmsetGenericCommand(): batched prefetch loop +
    msetPrefetchBatch() static helper

Note

Medium Risk
Touches the hot MSET/MSETNX write path and introduces direct dict bucket prefetching plus batched iteration, which could surface subtle correctness or memory-safety issues if dict/slot lifetime assumptions are wrong (mitigated by per-batch dict re-fetch and added tests).

Overview
Adds an intra-command, batched dict-bucket prefetch to msetGenericCommand, restructuring both the MSETNX existence-check pass and the MSET write pass into 16-key batches and prefetching the relevant hash table buckets (including rehash table when applicable) before doing lookups/sets.

Includes new unit tests that exercise MSET across the 16-key batch boundary and a regression test ensuring correctness when expired keys can cause per-slot dicts to be freed/recreated mid-command.

Reviewed by Cursor Bugbot for commit e04e2ec. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds intra-command batched prefetch to msetGenericCommand. When
processing multiple keys, dict bucket pointers for the next batch of 16
keys are prefetched into L1 cache before executing setKey for each key.
This hides memory latency from scattered dict lookups in large MSET
commands.

The prefetch is advisory — correctness does not depend on it. If a
rehash occurs mid-batch (triggered by a prior setKey expanding the dict),
the prefetched bucket may be stale. The subsequent lookupKeyWriteWithLink
inside setKey re-derives the correct bucket.

Both the MSETNX existence-check pass and the write pass are batched.
Single-key MSET or empty-dict cases skip prefetch entirely.
@fcostaoliveira fcostaoliveira requested a review from sundb April 14, 2026 16:57
@augmentcode

augmentcode Bot commented Apr 14, 2026

Copy link
Copy Markdown
🤖 Augment PR Summary

Summary: This PR adds an intra-command batched dict-bucket prefetch optimization to the MSET/MSETNX write path.

Changes:

  • Introduces a new static helper (msetPrefetchBatch) to prefetch per-key hash table bucket pointers
  • Restructures msetGenericCommand into batches of 16 keys to separate prefetch and execution phases
  • Applies the batching both to the MSETNX existence-check pass and the actual set pass
  • Prefetches both HT[0] and HT[1] buckets when the dict is rehashing
  • Uses a guard to skip prefetch for single-key commands and empty dicts

Technical Notes: The prefetch is advisory and intended to reduce DRAM latency from random dict lookups in large keyspaces.

🤖 Was this summary useful? React with 👍 or 👎

@augmentcode augmentcode Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread src/t_string.c Outdated
* entries to look up. Single-key MSET or empty dict (all inserts) skip
* prefetch since there's no lookup to warm. */
int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0;
dict *d = kvstoreGetDict(c->db->keys, slot);

@augmentcode augmentcode Bot Apr 14, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict *d is cached once and reused for all subsequent msetPrefetchBatch() calls, but lookupKeyWrite() / setKey() can delete expired keys and (in cluster mode with KVSTORE_FREE_EMPTY_DICTS) potentially free the slot dict when it becomes empty, making d a dangling pointer. Consider ensuring the prefetch path can’t retain a stale dict * across expiration-driven deletions within the same command.

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b8bcce. The slot dict is now re-fetched per batch via the new msetMaybePrefetchBatch helper, so a prior batch's setKeyexpireIfNeededfreeDictIfNeededdbAddByLink sequence (cluster mode, KVSTORE_FREE_EMPTY_DICTS) can't leave us with a dangling pointer. Cost: one extra kvstoreGetDict call per 16 keys.

Added a regression test in tests/unit/type/string.tcl ("MSET overwrites expired keys across batch boundary") that exercises the expire-then-insert path across a 20-key MSET (two batches). ASAN + test-sanitizer-address and test-external-cluster are green on the follow-up.

@fcostaoliveira

Copy link
Copy Markdown
Collaborator Author

Polar Signals CPU Profile: unstable vs PR

Profiled on x86-aws-m7i.metal-24xl-profiler (Intel Xeon Platinum 8488C) with
parca-agent, test memtier_benchmark-10Mkeys-load-string-mset-50-keys-with-100B-values.

dict.c — the prefetch target

Function Unstable (flat) PR (flat) Change
dictFindLinkInternal 5.1B (8.5%) 0.58B (1.0%) −88%
dictStoredKey2Key 1.2B (2.0%) 1.1B (1.8%) −11%
dictGetKey 0.68B (1.1%) 0.32B (0.5%) −54%
dictGetHash 0.53B (0.9%) 1.53B (2.6%) +189% (expected — prefetch adds hash computation)
dictFindLink (wrapper) 0.05B 0.16B +220% (more calls from batch loop)

dictFindLinkInternal flat cost dropped 8.5× (from 8.5% to 1.0%) — the cache misses
that dominated dict lookups are now hidden by the prefetch.

The dictGetHash increase (+189%) is expected: the prefetch phase computes hashes
ahead of time. But the net CPU savings from cache-warm lookups far outweigh the
extra hash computation — total msetGenericCommand cumulative drops from 76.5% to 33.7%.

msetGenericCommand — total cost

Metric Unstable PR Change
Cumulative 45.8B (76.5%) 20.2B (33.7%) −56%

The MSET function spends 56% less total CPU with prefetch enabled, freeing cycles
for I/O processing (writeToClient, readQueryFromClient).

Throughput on profiler runner

Build Ops/sec
Unstable 66,949
PR 68,850
Change +2.8%

(Profiler runner has higher overhead from parca-agent sampling — the bare-metal
runners show +4.3% on this test.)

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 61e3e75. Configure here.

Comment thread src/t_string.c Outdated
@fcostaoliveira

Copy link
Copy Markdown
Collaborator Author

Intel TMA (Top-Down Microarchitecture Analysis) — Pipeline Slot Funnel

Collected via topdown-profiler (L4 depth) on x86-aws-m7i.metal-24xl-profiler
(Intel Xeon Platinum 8488C, Sapphire Rapids).

Test: memtier_benchmark-10Mkeys-load-string-mset-50-keys-with-10B-values

Pipeline Slot Funnel Comparison

TMA Metric Unstable PR (prefetch) Δ
Retiring (useful work) 41.7% 40.2% -1.5pp
Backend_Bound 36.0% 35.0% -1.0pp
└ Memory_Bound 18.7% 17.9% -0.8pp
   └ L3_Bound 20.4% 19.0% -1.4pp
   └ L1_Bound 7.7% 7.9% +0.2pp
   └ DRAM_Bound 0.0% 0.1%
└ Core_Bound 8.9% 8.9%
   └ Serializing_Operation 13.9% 13.1% -0.8pp
Frontend_Bound 18.4% 19.3% +0.9pp
└ Fetch_Bandwidth 10.8% 11.4% +0.6pp
Bad_Speculation 3.9% 5.5% +1.6pp

Interpretation

  • L3_Bound dropped 1.4pp (20.4% → 19.0%) — the prefetch is successfully hiding L3 cache misses from dict bucket lookups. This aligns with the Polar Signals data showing dictFindLinkInternal flat cost dropping 8.5× (from 8.5% to 1.0%).

  • Frontend_Bound increased 0.9pp — expected cost of the extra prefetch instructions (dictGetHash + redis_prefetch_read in the batch loop). This is the price paid for the cache warming.

  • Serializing_Operation dropped 0.8pp — fewer stalls from serialized memory access chains (the prefetch breaks the dependency chain between consecutive dict lookups).

  • Net: Backend_Bound -1.0pp, Memory_Bound -0.8pp — the microarchitectural data confirms the throughput improvement comes from reduced memory stalls, exactly as designed.

@fcostaoliveira fcostaoliveira requested a review from skaslev April 15, 2026 11:17
Comment thread src/t_string.c Outdated
Comment on lines +753 to +770
if (use_prefetch) {
for (j = 1; j < c->argc; j += 2 * MSET_BATCH_SIZE) {
int remaining = (c->argc - j + 1) / 2;
int batch = remaining < MSET_BATCH_SIZE ? remaining : MSET_BATCH_SIZE;
msetPrefetchBatch(d, c, j, batch);
for (int k = 0; k < batch; k++) {
if (lookupKeyWrite(c->db, c->argv[j + k * 2]) != NULL) {
addReply(c, shared.czero);
return;
}
}
}
} else {
for (j = 1; j < c->argc; j += 2) {
if (lookupKeyWrite(c->db, c->argv[j]) != NULL) {
addReply(c, shared.czero);
return;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to make these two use the same code?
the same below.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collapsed in 1b8bcce. Both the MSETNX existence-check pass and the write pass now use a single batched-loop structure; use_prefetch only gates the prefetch call (not the loop shape), and the else { simple loop } branches are gone. The prefetch itself is routed through a new msetMaybePrefetchBatch helper that re-fetches the slot dict per batch (also addressing the augmentcode/cursor bugbot concern about stale dict *).

Comment thread src/t_string.c Outdated
* occurs mid-batch (triggered by a prior setKey), the prefetched bucket may
* be stale, but the subsequent lookupKeyWrite/setKey will re-derive the
* correct bucket. */
static void msetPrefetchBatch(dict *d, client *c, int start, int count) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing argv would be more precise.

Suggested change
static void msetPrefetchBatch(dict *d, client *c, int start, int count) {
static void msetPrefetchBatch(dict *d, robj *argv, int start, int count) {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 1b8bcce — signature is now msetPrefetchBatch(dict *d, robj **argv, int start, int count).

Addresses three review threads on PR redis#15043:

1. Stale dict pointer (augmentcode + cursor bugbot, high/medium).
   In cluster mode, server.db[].keys is created with KVSTORE_FREE_EMPTY_DICTS.
   A prior batch's setKey -> expireIfNeeded may delete the last pre-existing
   key in the slot, triggering freeDictIfNeeded; the subsequent dbAddByLink
   then allocates a new dict. The cached dict* from before the batch would be
   a dangling pointer on the next msetPrefetchBatch call. Re-fetch the slot
   dict inside a new msetMaybePrefetchBatch helper that runs per batch.

2. Helper signature (sundb). Change msetPrefetchBatch to take `robj **argv`
   instead of `client *c`, making the contract explicit.

3. Duplication (sundb). Collapse the `if (use_prefetch) { batched } else
   { simple }` branches in both the MSETNX existence-check pass and the
   write pass into a single batched loop that conditionally calls the
   prefetch helper. `use_prefetch` now only gates the prefetch call, not
   the loop structure.

Add regression tests in tests/unit/type/string.tcl:
- MSET spanning multiple prefetch batches (16, 17, 32, 33, 40 keys)
- MSET overwriting expired keys across a batch boundary (exercises the
  expireIfNeeded path in the same code that would UAF under cluster mode)
@fcostaoliveira fcostaoliveira requested a review from sundb April 21, 2026 14:46
@fcostaoliveira

Copy link
Copy Markdown
Collaborator Author

CE Performance Automation : step 1 of 2 (build) STARTING...

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 14:48:20.127389
You can check each build/benchmark progress in grafana:

  • git hash: e04e2ec
  • git branch: mset-batch-prefetch
  • commit date and time: 2026-04-21 15:18:19+01:00
  • commit summary: Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 100000
    • test name regex: .mset.
    • command group regex: .*

@fcostaoliveira

fcostaoliveira commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 17:50:22.714540 and took 59 seconds.
You can check each build/benchmark progress in grafana:

  • git hash: e04e2ec
  • git branch: mset-batch-prefetch
  • commit date and time: 2026-04-21 15:18:19+01:00
  • commit summary: Merge remote-tracking branch 'origin/unstable' into mset-batch-prefetch
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 100000
    • test name regex: .mset.
    • command group regex: .*

You can check a comparison in detail via the grafana link

@fcostaoliveira

Copy link
Copy Markdown
Collaborator Author

Benchmark update — MSET on x86-aws-m7i.metal-24xl-2

Head-to-head on the latest commit (e04e2ecad9, merge of origin/unstable into the PR branch) vs redis/redis unstable (0fa78fd8fd). Topology: oss-standalone. 1 datapoint per side — re-running for 3-datapoint stable medians next.

Improvements (≥ 3%)

Test Baseline (ops/sec) PR (ops/sec) Δ
10Mkeys-load-string-mset-50-keys-with-10B-values 65,166 70,529 +8.2%
10Mkeys-load-string-mset-50-keys-with-100B-values 61,376 65,584 +6.9%
10Mkeys-load-string-mset-10-keys-with-100B-values 128,794 133,960 +4.0%
1Mkeys-load-string-mset-10-keys-with-100B-values 113,461 117,142 +3.2%

Below the 3% threshold

Test Baseline (ops/sec) PR (ops/sec) Δ
10Mkeys-load-string-mset-10-keys-with-10B-values 134,977 138,582 +2.7%
1Mkeys-load-string-mset-10-keys-with-10B-values 122,092 124,487 +2.0%
10Mkeys-load-string-mset-10-keys-with-1000B-values 102,009 103,484 +1.4%
1Mkeys-load-hash-hmset-5-fields-with-1000B-values 102,268 104,162 +1.9% (not MSET, not affected by the change)

Pattern: largest wins on the 10M-key variants with 50 keys per MSET — widest prefetch window on a DRAM-bound dict, exactly the scenario the change is designed for. 100B value tests win more than 10B/1000B tests because 100B straddles the RAW/EMBSTR boundary where the dict-bucket prefetch matters most (RAW requires a separate value allocation, unlike EMBSTR).

No regressions. No MSET test shows a negative delta.

Environment

  • Runner: x86-aws-m7i.metal-24xl-2 (AWS m7i.metal-24xl, Intel Xeon Platinum 8488C "Sapphire Rapids", 96 cores, bare-metal)
  • Benchmark harness: redis-benchmarks-specification v0.3.0
  • Build: gcc 15.2.0, Debian Bookworm, make -j
  • 3-datapoint stable-median run coming next to tighten confidence.

ShooterIT added a commit that referenced this pull request May 11, 2026
…15133)

Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses
across the keys of a single multi-key command. Builds on the cross-command
batched prefetch framework introduced in #14017 and the dict-prefetch state
machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the
state machine into two new `dictType` callbacks so the same machinery can
be reused for other dict-encoded types later (hash hashtable, sets, sorted
sets) without paying for `kvobj`-specific code paths in the core loop.

Bundles the work originally proposed in #14899 (MGET prefetch framework,
by @mpozniak95) and #15043 (MSET batch prefetch).

## Design

Two new optional callbacks on `dictType`:

```c
typedef struct dictType {
    ...
    /* Bring the entry's key payload into cache before keyCompare runs.
     * Returns the address to prefetch, or NULL if the entry alone is enough. */
    void *(*prefetchEntryKey)(const dictEntry *de);

    /* Called only after a key match. Returns the value-side payload to
     * prefetch (or NULL). */
    void *(*prefetchEntryValue)(const dictEntry *de);
} dictType;
```

`dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()`
shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` /
`OBJ_ENCODING_RAW` values — now lives in two small helpers in
`server.c`:

```c
static void *dbDictPrefetchEntryKey(const dictEntry *de) {
    return dictEntryIsKey(de) ? NULL : dictGetKey(de);
}

static void *dbDictPrefetchEntryValue(const dictEntry *de) {
    kvobj *kv = dictGetKey(de);
    return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW)
            ? kv->ptr : NULL;
}
```

The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data`
parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the
dict's own type drives both ends. This also removes the foot-gun where
callers (like `mgetCommand`) had to remember whether to pass
`prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer
references `kvobj`, `kvobjGetKey`, or any specific value layout.

## State machine

Two file-local types in `memory_prefetch.c`:

| Type | Role |
|---|---|
| `dictPrefetchLookup` | Per-key snapshot of an in-flight,
software-pipelined `dictFind` (mirrors the locals a synchronous
`dictFind` would carry across one bucket walk). |
| `dictPrefetcher` | Driver that advances a batch of
`dictPrefetchLookup`s through the FSM, yielding to the next in-flight
lookup each time a prefetch is issued. |

Five-stage lifecycle for each lookup, driven by the prefetcher:

```text
                                                           │
                                                         start
                                                           │
                                                  ┌────────▼─────────┐
                                       ┌─────────►│  PREFETCH_BUCKET ├────►────────┐
                                       │          └────────┬─────────┘            no more tables
                                       │             bucket│found                  │
                                       │                   │                       │
        entry not found - goto next table         ┌────────▼────────┐              │
                                       └────◄─────┤ PREFETCH_ENTRY  │              ▼
                                    ┌────────────►└────────┬────────┘              │
                                    │                 entry│found                  │
                                    │                      │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    │          │   PREFETCH_ENTRY_KEY    │ ◄── dictType->prefetchEntryKey(de)
                                    │          └───────────┬─────────────┘         │
                                    │                      │                       │
        key mismatch - goto next entry                     │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    └──────◄───│   PREFETCH_ENTRY_VALUE  │ ◄── keyCompare; on match,
                                               └───────────┬─────────────┘     dictType->prefetchEntryValue(de)
                                                           │                       │
                                                 ┌─────────▼─────────────┐         │
                                                 │     PREFETCH_DONE     │◄────────┘
                                                 └───────────────────────┘
```

`PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]`
if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no
more tables remain.

`memory_prefetch.c` exposes a small lifecycle that any caller can drive:

```c
dictPrefetcherInit(p, max_keys);                  /* one-shot heap alloc of lookups[] */
dictPrefetcherReset(p, dicts, keys, nkeys);       /* configure for one batch */
dictPrefetcherRun(p);                             /* drive FSM until all PREFETCH_DONE */
dictPrefetcherFree(p);                            /* release */
```

Each FSM stage is a named static function (`dictPrefetchBucket`,
`dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`),
so the `dictPrefetcherRun` driver is a four-line `switch` over the
state.

The state machine is dict-pure: no `kvobj` field on
`dictPrefetchLookup`,
no `kvobjGetKey` reach-through. Round-robin advance semantics — a state
only advances the cursor if a prefetch was actually issued — are
preserved, so the embedded-kvobj fast path
(`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the
extra prefetch and falls straight into the compare on the next loop
iteration.

The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`)
embeds a `dictPrefetcher` initialized once at startup and reset per
batch, so cross-command prefetching no longer allocates per call.

## Intra-command API

```c
void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys);
```

A single multi-key command (e.g. MGET) can prefetch dict data for a
batch of its own keys, reusing the same state machine that the
cross-command path uses. Single-key calls (`nkeys <= 1`) early-return —
nothing to interleave with. The implementation stack-allocates a
fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no
VLA, predictable stack usage), so the intra-command path doesn't touch
the heap.

## Notes on the call sites

A shared helper picks the next prefetch batch and warms it via
`dictPrefetchKeys`:

```c
/* Pick the next prefetch batch starting at argv[start] and warm it via
 * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for
 * key/value pairs (MSET). Returns the chosen batch size in items. */
static int prefetchKeysBatch(client *c, int slot, int start, int stride);
```

Adaptive batch sizing inside the helper: if at least two full batches
(`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch
(`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one
call. This generalizes the small-request fast path so the trailing
batch of a large request also gets the single-call benefit.

- **MGET (`mgetCommand`)** — gated by
`do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`,
with `already_prefetched = c->current_pending_cmd &&
(c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`.
  When `do_prefetch` is set, each iteration calls
  `prefetchKeysBatch(c, slot, j, 1)` and then sequentially
  `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is
  clear (cross-command path already warmed the keys, or batch
  prefetching is off), the loop takes all remaining items in one go
  and skips the prefetch.

- **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as
  MGET with `stride = 2`. For the NX flag the NX-check loop runs
  `lookupKeyWrite` (which already warmed everything via
  `prefetchKeysBatch`); the SET loop then disables further prefetch
  (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass.
  Going through the full state machine (rather than bucket-only) means
  `dbDictType`'s `prefetchEntryValue` callback runs on a key match —
  warming the old kvobj's payload, which `setKey -> dbReplaceValue ->
  updateKeysizesHist(oldlen, newlen)` then reads to compute the
  histogram delta. The slot dict is re-fetched per batch — in cluster
  mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS`
  + `expireIfNeeded`), so a cached pointer would otherwise dangle.

- **Cross-command batch path (`addCommandToBatch`)** — sets
  `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch,
  even on partial-batch overflow (was: only when ALL keys fit). The
  intra-command path then uniformly skips supplemental prefetching for
  any command the batch touched. Rationale: running both paths
  (cross-command warm + intra-command supplement) caused a measured
  −9.6 % regression on x86 with pipeline-10, and the partial cross-
  command warmup is sufficient for the head of the keyset; the cold
  tail goes through normal lookup, which is still cheaper than running
  the FSM a second time on already-warm keys.

- **Future types**: each dict's `dictType` can register its own
  `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash
  encoding, the field-sds and value-sds payloads), without touching
  `memory_prefetch.c`.

## Benchmark validation

On x86, performance improvements are significant for larger batch sizes:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44%
  - 5Mkeys-string-mget-100B-100keys: +37.33%
  - 5Mkeys-string-mget-100B-30keys: +22.40%
On ARM (Graviton4), the gains are even more pronounced:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34%
  - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76%
Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions.

---------

Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
@ShooterIT

Copy link
Copy Markdown
Member

closing this since we merged #15133

@ShooterIT ShooterIT closed this May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants