Batched MGET/MSET dict prefetch with dictType-driven payload hints#15133
Conversation
…keys Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in addCommandToBatch() when a command is added to the cross-command prefetch batch. In mgetCommand, check the flag and fall back to plain sequential lookups when keys are already warm from the batch. This eliminates redundant prefetch work that caused regressions of up to -16% with many I/O threads, while preserving the large gains (+52-54%) for pipelined workloads where the batch overflows.
The previous guard only skipped intra-command dictPrefetchKeys() when PENDING_CMD_KEYS_PREFETCHED was set, which requires ALL keys to fit in the batch. For MGET 100 keys with a batch size of 32, the flag was never set — both prefetch paths always ran. On x86 with pipeline-10, this double-prefetching caused -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c with 8 IO threads due to cache-bandwidth contention in ctxPrefetchAndAdvance. Fix: skip intra-command prefetch when either: 1. PENDING_CMD_KEYS_PREFETCHED is set (all keys fit in batch), OR 2. Pipeline is active (ready_len > 1) — partial batch warmup is sufficient, double-prefetching hurts more than it helps. Single non-pipelined MGET (ready_len <= 1) still uses intra-command prefetch, preserving the +11% ARM / +25% x86 improvement.
The ready_len > 1 heuristic that skips intra-command dictPrefetchKeys() assumed cross-command batch prefetching always runs during pipelining. When prefetch-batch-max-size is 0 (batch disabled), addCommandToBatch() returns C_ERR immediately and no keys are warmed, but ready_len > 1 still holds for any pipeline — causing MGET to skip its only prefetch opportunity. Add server.prefetch_batch_max_size > 0 to the ready_len guard so intra-command prefetch runs when the batch system is disabled.
The prefetchGetObjectValuePtr callback adds one state-machine iteration per key to prefetch the first cache line of the value allocation. For EMBSTR values (<=44B) it returns NULL (value is embedded in the kv object), making the iteration pure overhead. For large RAW values (>=1KiB) it warms <10% of the data while adding measurable cost — profiling shows 3B samples in ctxPrefetchValdata vs 17.8B total in the prefetch path. Pass NULL instead: skip value-data prefetch entirely in the MGET intra-command path. The dict bucket/entry/kv-object prefetch provides the bulk of the improvement; addReplyBulk handles value data access sequentially.
Three optimizations to the dict prefetch state machine: 1. Replace modulo with branch in ctxPrefetchAndAdvance — avoids expensive division on every state transition. 2. Add remaining counter to DictPrefetchCtx — ctxNextInfo returns NULL immediately when all keys are done instead of scanning the full array. Eliminates O(n) scan per call as keys complete. 3. Fast-path in ctxPrefetchKvobj when no value-data callback and entry is a direct key — do key comparison inline and skip the VALDATA state entirely, saving one round-trip through the state machine per key.
For batches of <=10 keys, replace the full dictPrefetchKeys state machine with a simple loop that prefetches hash table buckets only. The state machine has ~4 states per key with function dispatch and round-robin scanning overhead. Profiling shows this costs 4.3% of total CPU for 10-key MGET with 1KiB values — matching the observed regression exactly. A single-pass bucket prefetch eliminates the iteration overhead while still warming the hash chain heads. Larger batches (>10 keys) continue using the full state machine where the multi-round interleaving provides net benefit.
Adds intra-command batched prefetch to msetGenericCommand. When processing multiple keys, dict bucket pointers for the next batch of 16 keys are prefetched into L1 cache before executing setKey for each key. This hides memory latency from scattered dict lookups in large MSET commands. The prefetch is advisory — correctness does not depend on it. If a rehash occurs mid-batch (triggered by a prior setKey expanding the dict), the prefetched bucket may be stale. The subsequent lookupKeyWriteWithLink inside setKey re-derives the correct bucket. Both the MSETNX existence-check pass and the write pass are batched. Single-key MSET or empty-dict cases skip prefetch entirely.
Addresses three review threads on PR redis#15043: 1. Stale dict pointer (augmentcode + cursor bugbot, high/medium). In cluster mode, server.db[].keys is created with KVSTORE_FREE_EMPTY_DICTS. A prior batch's setKey -> expireIfNeeded may delete the last pre-existing key in the slot, triggering freeDictIfNeeded; the subsequent dbAddByLink then allocates a new dict. The cached dict* from before the batch would be a dangling pointer on the next msetPrefetchBatch call. Re-fetch the slot dict inside a new msetMaybePrefetchBatch helper that runs per batch. 2. Helper signature (sundb). Change msetPrefetchBatch to take `robj **argv` instead of `client *c`, making the contract explicit. 3. Duplication (sundb). Collapse the `if (use_prefetch) { batched } else { simple }` branches in both the MSETNX existence-check pass and the write pass into a single batched loop that conditionally calls the prefetch helper. `use_prefetch` now only gates the prefetch call, not the loop structure. Add regression tests in tests/unit/type/string.tcl: - MSET spanning multiple prefetch batches (16, 17, 32, 33, 40 keys) - MSET overwriting expired keys across a batch boundary (exercises the expireIfNeeded path in the same code that would UAF under cluster mode)
… callback Two review comments from Cursor Bugbot on dcb8aa8: 1. Missing `#undef MGET_LIGHT_PREFETCH_MAX` at the end of mgetCommand. MGET_BATCH was already undef'd but its sibling was not. 2. The `dictPrefetchKeys()` docstring example passed `prefetchGetObjectValuePtr` as the value-data callback, while the mgetCommand caller passes NULL on purpose (the callback's extra state machine work costs more than warming the value payload saves, since the following lookupKeyRead + addReplyBulk touch the payload immediately and the hardware prefetcher is already on it). Update the example to show the typical NULL form and document when passing a non-NULL callback would actually be worthwhile.
Decouple the prefetch state machine from kvobject. The state machine
previously knew about kvobj fields (current_kv, kvobjGetKey) and the
default value-data callback (prefetchGetObjectValuePtr) lived inside
memory_prefetch.c. That coupled a generic dict facility to one
specific dict's stored-key layout.
Two new optional callbacks on dictType:
void *prefetchEntryKey(const dictEntry *de);
void *prefetchEntryValue(const dictEntry *de);
The state machine renames the post-entry states to match:
PREFETCH_KVOBJ -> PREFETCH_ENTRY_KEY
PREFETCH_VALDATA -> PREFETCH_ENTRY_VALUE
PREFETCH_ENTRY_KEY asks the dictType to bring the entry's key payload
into cache (returns the address to prefetch, or NULL if the entry
already carries everything keyCompare needs). PREFETCH_ENTRY_VALUE
runs the keyCompare and on a hit asks the dictType to prefetch the
value-side payload.
dbDictType registers both callbacks in server.c:
- prefetchEntryKey returns dictGetKey(de) only when the kvobj is
out-of-line (matches today's dictEntryIsKey check).
- prefetchEntryValue returns kv->ptr for OBJ_STRING/RAW (the old
prefetchGetObjectValuePtr behaviour, now where it belongs).
- dbExpiresDictType only sets prefetchEntryKey (no values).
The PrefetchGetValueDataFunc typedef and the trailing get_val_data
parameter are removed from the dictPrefetchKeys() public API and from
the cross-command dictPrefetch() wrapper - the dict's own type now
drives this end-to-end. mgetCommand's call site updates accordingly.
ctxPrefetchEntryValue was passing dictGetKey(entry) directly to dictCompareKeys. For dbDictType the stored key is a kvobj — keyCompare (dictSdsCompareKV) expects an sds, so it dereferenced kvobj+offset as if it were an sds, reading garbage 'flags' and length fields and walking off the heap. The previous form of this state (PREFETCH_KVOBJ -> PREFETCH_VALDATA) extracted the sds via kvobjGetKey before calling keyCompare. The dict-pure equivalent is dictType->keyFromStoredKey, which dictFindLink also uses (see dictFindLinkInternal -> dictStoredKey2Key). Fix: apply keyFromStoredKey to dictGetKey(entry) before passing it to dictCompareKeys when the dict provides the callback. dbDictType / dbExpiresDictType set keyFromStoredKey = kvGetKey, so the conversion yields the sds again; dicts without the callback (sds-keyed dicts) keep the entry's stored key unchanged. Reproduces under sanitizer (heap-buffer-overflow on sdslen reading kvobj-as-sds) and SIGSEGV in dictSdsCompareKV when a hash chain has more than one entry and the cross-command batch prefetch hits the compare path. Single-entry chains hit the !dictGetNext shortcut and so escaped local smoke testing.
🤖 Augment PR SummarySummary: This PR reduces multi-key lookup latency by overlapping dict-lookup memory accesses using batched prefetching. Changes:
Technical Notes: Prefetching remains advisory and is conditionally disabled in some pipelined scenarios to avoid cache-bandwidth contention. 🤖 Was this summary useful? React with 👍 or 👎 |
| /* Determine the dict for the first key's slot (MGET requires all keys | ||
| * in the same slot in cluster mode). */ | ||
| int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0; | ||
| dict *d = kvstoreGetDict(c->db->keys, slot); |
There was a problem hiding this comment.
src/t_string.c:706: mgetCommand caches dict *d once and reuses it across multiple 16-key batches, but in cluster mode lookupKeyRead() can expire/delete the last key in the slot and KVSTORE_FREE_EMPTY_DICTS may free the slot dict, leaving d dangling for the next batch’s prefetch/hash operations. This is the same class of hazard you already call out and avoid in msetMaybePrefetchBatch by re-fetching the dict per batch.
Severity: high
Other Locations
src/memory_prefetch.c:433
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| * key payload behind the entry is warm before keyCompare runs), or | ||
| * NULL if nothing extra is needed (e.g. the key is co-located with | ||
| * the entry). | ||
| * prefetchEntryValue: called only after a key match. Returns an |
There was a problem hiding this comment.
src/dict.h:148: The comment says prefetchEntryValue is “called only after a key match”, but memory_prefetch.c can enter the match path without comparing keys for the last entry in a bucket (the “assume hit” fast path). This contract mismatch could lead future dictType implementations to make unsafe assumptions inside prefetchEntryValue.
Severity: medium
Other Locations
src/memory_prefetch.c:265
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| * (typically ≤ 16–32) to avoid excessive stack usage. | ||
| * Zero-initialize so that keys marked PREFETCH_DONE by ctxInit | ||
| * (null/empty dict) don't carry indeterminate field values. */ | ||
| KeyPrefetchInfo pf_info[nkeys]; |
There was a problem hiding this comment.
src/memory_prefetch.c:462: dictPrefetchKeys() stack-allocates KeyPrefetchInfo pf_info[nkeys]; if a caller ever passes an unbounded nkeys (e.g. derived directly from argc), this can cause stack exhaustion and crash. Since this is now an exported API in memory_prefetch.h, it feels easy for a future call site to accidentally violate the “keep nkeys bounded” assumption.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
Three fixes from the cursor and augmentcode bot reviews on the dict-pure prefetch branch: 1. mgetCommand: re-fetch the slot dict per batch. The dict pointer was fetched once at the top and reused across all 16-key batches. In cluster mode, db->keys is created with KVSTORE_FREE_EMPTY_DICTS, so a prior batch's lookupKeyRead -> expireIfNeeded path can empty the slot and free the dict; the next batch's dictGetHash(d, ...) would then UAF. Same hazard the MSET batch path already guards against via msetMaybePrefetchBatch. Mirror that pattern: kvstoreGetDict per batch, with an empty/null dict shortcut that still calls lookupKeyRead per key to preserve touch / stats / notification side-effects. Cost: one extra kvstoreGetDict per 16 keys. 2. dictType.prefetchEntryValue contract: the docstring claimed the callback is 'called only after a key match', but the state machine has a 'last entry in chain, not rehashing' shortcut that calls the callback without comparing keys. Update the doc to spell out the presumed-match semantics so future dictType authors don't write callbacks that assume a verified equal. 3. dictPrefetchKeys: cap nkeys with serverAssert + named constant (DICT_PREFETCH_KEYS_MAX = 64). KeyPrefetchInfo is stack-allocated via VLA and must not be sized off an attacker-controlled value; in-tree callers batch in groups of 16, the cap leaves headroom without risking stack exhaustion.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d6d9ac9. Configure here.
|
@ShooterIT — your follow-up refactor is in as
Validation —
|
| Test | Base 2432f552 |
Prev a46e1896 |
PR 792c9ca1 |
Δ vs base | Δ vs prev |
|---|---|---|---|---|---|
5Mkeys-string-mget-100B-2keys |
162 657.34 | 163 601.69 | 164 968.43 | +1.42 % | +0.84 % |
5Mkeys-string-mget-100B-5keys |
146 959.66 | 153 192.74 | 150 364.64 (mean of 3 dp: 147 804 / 152 065 / 151 223) | −1.85 % | −1.85 % |
5Mkeys-string-mget-100B-10keys |
125 275.66 | 133 124.41 | 133 586.86 | +6.63 % | +0.35 % |
5Mkeys-string-mget-100B-20keys |
90 897.79 | 106 756.88 | 107 037.88 | +17.76 % | +0.26 % |
5Mkeys-string-mget-100B-30keys |
73 006.87 | 87 653.67 | 88 796.06 | +21.63 % | +1.30 % |
5Mkeys-string-mget-100B-100keys |
31 149.71 | 42 347.75 | 42 723.75 | +37.16 % | +0.89 % |
10Mkeys-load-string-mset-10-keys-with-10B-values |
131 641.65 | 133 212.63 | 134 606.62 | +2.25 % | +1.05 % |
10Mkeys-load-string-mset-10-keys-with-100B-values |
127 918.11 | 129 232.20 | 132 016.12 | +3.20 % | +2.15 % |
10Mkeys-load-string-mset-50-keys-with-10B-values |
65 659.64 | 67 970.65 | 67 208.60 | +2.36 % | −1.12 % |
10Mkeys-load-string-mset-50-keys-with-100B-values |
60 718.20 | 63 604.58 | 63 409.87 | +4.43 % | −0.31 % |
Refactor is performance-neutral vs the prev head (8 of 10 within ±1.5 %, two within ±2.2 %). All headline wins vs unstable preserved — mget-100B-100keys actually nudged up from +35.95 % → +37.16 %. The partial-batch PENDING_CMD_KEYS_PREFETCHED change you made in addCommandToBatch was the only behavioural concern worth re-checking, and it's clean.
|
Two follow-up validation rounds on
|
| Test | Baseline 7ecc04f59d (Ops/sec) |
PR 792c9ca1c (Ops/sec) |
Δ |
|---|---|---|---|
5Mkeys-string-mget-100B-100keys-pipeline-10 |
41 218.99 | 60 581.10 | +46.97 % |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
148 639.87 | 212 409.44 | +42.90 % |
1Mkeys-string-setget200c-512B-pipeline-10 |
2 414 365.79 | 2 594 377.61 | +7.46 % |
1Mkeys-string-set-get-short-expiration-512B |
1 272 013.21 | 1 342 755.00 | +5.56 % |
3Mkeys-string-mixed-20-80-with-512B-values-pipeline-10-2000_conns |
2 423 562.75 | 2 515 002.31 | +3.77 % |
The cold-tail concern was unfounded — eliminating supplemental intra-command prefetching on commands the batch already touched removes redundant cache-bandwidth contention. Multi-key MGET wins are larger with io-threads (+47 % / +43 %) than the no-iothreads version (+37 % on mget-100B-100keys at oss-standalone).
oss-standalone pipeline-10 — non-iothreads pipelined
| Test | Baseline 7ecc04f59d (Ops/sec) |
PR 792c9ca1c (Ops/sec) |
Δ |
|---|---|---|---|
5Mkeys-string-mget-100B-2keys-pipeline-10 |
870 056.10 | 885 536.36 | +1.78 % |
5Mkeys-string-mget-100B-5keys-pipeline-10 |
460 668.12 | 500 265.21 | +8.60 % |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
102 827.68 | 131 288.11 | +27.68 % |
5Mkeys-string-mget-100B-100keys-pipeline-10 |
56 911.76 | 59 489.88 | +4.53 % |
Coverage matrix (PR 792c9ca1c vs 7ecc04f59d, m7i-2)
| Workload | non-pipeline (oss-standalone) |
pipeline-10 (oss-standalone) |
pipeline-10 (oss-standalone-08-io-threads) |
|---|---|---|---|
mget-100B-2keys |
+1.42 % | +1.78 % | — |
mget-100B-5keys |
+0.57 % (mean of 3 dp) | +8.60 % | — |
mget-100B-10keys |
+6.63 % | — | — |
mget-100B-20keys |
+17.76 % | — | — |
mget-100B-30keys |
+21.63 % | +27.68 % | +42.90 % |
mget-100B-100keys |
+37.16 % | +4.53 % | +46.97 % |
set-get-short-expiration-512B |
— | — | +5.56 % |
setget200c-512B-pipeline-10 |
— | — | +7.46 % |
mixed-20-80-512B-pipeline-10-2000_conns |
— | — | +3.77 % |
No regression on any shape. The mget-100B-100keys-pipeline-10 +4.53 % being smaller than its non-pipeline +37.16 % is the expected interaction — at 100 keys the cross-command batch (~32 keys per prefetch_batch_max_size) already warms the head of the keyset before the intra-command path runs, leaving less DRAM latency to hide. Add io-threads (parallel reply chain + cross-command batch) and the wins compound to +47 %.
ShooterIT
left a comment
There was a problem hiding this comment.
generally LGTM, small comments
| addReplyBulkCBuffer(c,(char*)str+start,end-start+1); | ||
| } | ||
| } | ||
| /* Batch size for intra-command key prefetching. */ |
There was a problem hiding this comment.
| /* Batch size for intra-command key prefetching. */ | |
| /* Batch size for intra-command key prefetching. */ |
| # Regression test for dict-pointer staleness across batches | ||
| # (see src/t_string.c:msetMaybePrefetchBatch). When lookupKeyWrite in | ||
| # batch 1 expires a pre-existing key, under cluster mode the slot | ||
| # dict may be freed (KVSTORE_FREE_EMPTY_DICTS) and recreated | ||
| # mid-command; msetGenericCommand must re-fetch the slot dict per | ||
| # batch. This test exercises the same code path in standalone mode. |
There was a problem hiding this comment.
please update comments or tests
| .dictMetadataBytes = setDictMetadataBytes, | ||
| }; | ||
|
|
||
| /* dbDictType prefetch callbacks -------------------------------------------- |
There was a problem hiding this comment.
------------------------ is not align with other function style, maybe remove them, or refine it
| /* dbDictType prefetch callbacks -------------------------------------------- | |
| /* dbDictType prefetch callbacks. |
btw, i think we should move them at the front of dictType, maybe L572
| addReplyArrayLen(c, numkeys); | ||
|
|
||
| /* MGET requires all keys in the same slot in cluster mode. */ | ||
| int slot = server.cluster_enabled ? getKeySlot(c->argv[1]->ptr) : 0; |
There was a problem hiding this comment.
what about use c->current_pending_cmd->slot, if it's not INVALID_CLUSTER_SLOT, we can use it directly.
- src/server.c: drop the '----' separator from the dbDictType prefetch callbacks comment block and move the two callbacks ahead of the dictType definitions section, near the related dictResizeAllowed helper, instead of sitting between setDictType and dbDictType. - src/t_string.c: blank-line separation between getrangeCommand and the PREFETCH_BATCH_SIZE define for readability. - tests/unit/type/string.tcl: update stale function reference in the 'MSET overwrites expired keys across batch boundary' regression comment from msetMaybePrefetchBatch to the post-refactor prefetchKeysBatch helper.
|
Thanks @ShooterIT — addressed all three in 517cb9b:
Local |
Avoid recomputing the cluster slot via getKeySlot() (CRC16 over the first key) in mgetCommand and msetGenericCommand when the cross-command batching path has already populated c->current_pending_cmd->slot. Fall back to getKeySlot() when current_pending_cmd is NULL or the cached slot is INVALID_CLUSTER_SLOT (cross-slot or no slot set). Adds the cluster.h include for INVALID_CLUSTER_SLOT visibility.
|
Thanks @sundb — good catch, addressed in int slot = 0;
if (server.cluster_enabled) {
slot = (c->current_pending_cmd &&
c->current_pending_cmd->slot != INVALID_CLUSTER_SLOT)
? c->current_pending_cmd->slot
: getKeySlot(c->argv[1]->ptr);
}Local |
|
Latest commit Triggered a fresh round comparing PR head Throughput —
|
| Test | Topology | unstable mean (n) | PR mean (n) | Δ |
|---|---|---|---|---|
5Mkeys-string-mget-10B-100keys-pipeline-10 |
io-threads-8 | 26,304 (n=1) | 49,831 (n=2) | +89.44 % |
5Mkeys-string-mget-10B-100keys-pipeline-10 |
standalone | 22,953 (n=1) | 42,053 (n=2) | +83.21 % |
5Mkeys-string-mget-10B-100keys |
standalone | 22,136 (n=1) | 39,472 (n=2) | +78.31 % |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
io-threads-8 | 148,763 (n=2) | 209,899 (n=2) | +41.10 % σ↑ |
5Mkeys-string-mget-100B-100keys-pipeline-10 |
io-threads-8 | 43,505 (n=2) | 60,563 (n=3) | +39.21 % σ↑ |
5Mkeys-string-mget-100B-100keys |
standalone | 31,084 (n=2) | 42,688 (n=3) | +37.33 % σ↑ |
10Mkeys-string-mget-80keys-100B-multitenant-tagged |
standalone | 30,108 (n=2) | 40,811 (n=3) | +35.55 % σ↑ |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
standalone | 105,791 (n=2) | 129,903 (n=2) | +22.79 % σ↑ |
5Mkeys-string-mget-100B-30keys |
standalone | 72,249 (n=2) | 88,433 (n=2) | +22.40 % σ↑ |
5Mkeys-string-mget-100B-20keys |
standalone | 91,047 (n=2) | 106,422 (n=2) | +16.89 % σ↑ |
5Mkeys-string-mget-512B-5keys-pipeline-10 |
io-threads-8 | 815,187 (n=1) | 935,741 (n=2) | +14.79 % |
5Mkeys-string-mget-100B-5keys-pipeline-10 |
standalone | 464,503 (n=2) | 503,364 (n=2) | +8.37 % σ↑ |
5Mkeys-string-mget-100B-5keys-pipeline-10 |
io-threads-8 | 902,737 (n=2) | 977,641 (n=2) | +8.30 % σ↑ |
5Mkeys-string-mget-512B-5keys-pipeline-10 |
standalone | 314,000 (n=1) | 333,018 (n=2) | +6.06 % |
5Mkeys-string-mget-1KiB-5keys |
standalone | 118,086 (n=1) | 124,493 (n=2) | +5.43 % |
5Mkeys-string-mget-100B-100keys-pipeline-10 |
standalone | 56,483 (n=2) | 59,229 (n=3) | +4.86 % σ↑ |
5Mkeys-string-mget-10B-100keys |
io-threads-8 | 30,950 (n=1) | 32,049 (n=2) | +3.55 % |
5Mkeys-string-mget-100B-5keys |
standalone | 145,255 (n=2) | 150,094 (n=2) | +3.33 % σ↑ |
5Mkeys-string-mget-512B-5keys |
standalone | 132,357 (n=1) | 136,500 (n=2) | +3.13 % |
10Mkeys-load-string-mset-10-keys-with-100B-values |
standalone | 129,437 (n=2) | 131,074 (n=3) | +1.26 % σ↑ |
5Mkeys-string-mget-100B-2keys-pipeline-10 |
standalone | 886,782 (n=2) | 891,897 (n=2) | +0.58 % |
5Mkeys-string-mget-1KiB-5keys-pipeline-10 |
standalone | 123,797 (n=1) | 123,513 (n=2) | -0.23 % |
5Mkeys-string-mget-100B-2keys-pipeline-10 |
io-threads-8 | 1,610,943 (n=2) | 1,578,659 (n=2) | -2.00 % |
5Mkeys-string-mget-100B-2keys |
standalone | 168,012 (n=2) | 164,302 (n=2) | -2.21 % σ↓ |
5Mkeys-string-mget-1KiB-5keys-pipeline-10 |
io-threads-8 | 477,506 (n=1) | 415,693 (n=2) | -12.94 % (*) |
σ↑ / σ↓ = σ-disjoint (PR min > BL max or vice versa) when both sides have ≥ 2 dp.
(*) The −12.94 % on mget-1KiB-5keys-pipeline-10 / io-threads-8 is the same outlier flagged on #14907 (also σ-confirmed there at −12.05 %). The test's YAML spec is missing the --pipeline 10 client argument that its 100B and 512B siblings have, so it actually runs unpipelined despite the name. Both PR and baseline run the same broken spec, so the delta is real, but the magnitude doesn't match the prefetch mechanism (its 5-key siblings at 100B and 512B are +1 to +15 %). Likely a spec hygiene issue separate from this PR — happy to fix the spec under a separate change.
Throughput — m8g-2 (Graviton4 ARM, multi-dp, σ-disjoint where ≥ 2 dp/cell)
| Test | Topology | unstable mean (n) | PR mean (n) | Δ |
|---|---|---|---|---|
5Mkeys-string-mget-10B-100keys-pipeline-10 |
io-threads-8 | 3,059 (n=1) | 6,986 (n=2) | +128.34 % |
5Mkeys-string-mget-100B-100keys-pipeline-10 |
io-threads-8 | 6,261 (n=2) | 9,188 (n=3) | +46.76 % σ↑ |
5Mkeys-string-mget-10B-100keys |
standalone | 21,106 (n=1) | 29,716 (n=2) | +40.79 % |
5Mkeys-string-mget-10B-100keys-pipeline-10 |
standalone | 22,608 (n=1) | 31,728 (n=2) | +40.34 % |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
io-threads-8 | 22,804 (n=2) | 29,772 (n=2) | +30.56 % σ↑ |
5Mkeys-string-mget-100B-5keys-pipeline-10 |
io-threads-8 | 115,835 (n=2) | 126,741 (n=2) | +9.42 % σ↑ |
5Mkeys-string-mget-512B-5keys-pipeline-10 |
io-threads-8 | 95,888 (n=1) | 102,753 (n=2) | +7.16 % |
5Mkeys-string-mget-100B-2keys-pipeline-10 |
io-threads-8 | 602,821 (n=2) | 635,721 (n=2) | +5.46 % σ↑ |
10Mkeys-string-mget-80keys-100B-multitenant-tagged |
standalone | 29,761 (n=2) | 30,526 (n=3) | +2.57 % σ↑ |
5Mkeys-string-mget-100B-30keys-pipeline-10 |
standalone | 67,169 (n=2) | 67,965 (n=2) | +1.18 % |
5Mkeys-string-mget-100B-100keys |
standalone | 25,818 (n=2) | 26,089 (n=3) | +1.05 % σ↑ |
10Mkeys-load-string-mset-10-keys-with-100B-values |
standalone | 109,132 (n=2) | 109,808 (n=3) | +0.62 % σ↑ |
5Mkeys-string-mget-512B-5keys-pipeline-10 |
standalone | 170,550 (n=1) | 171,282 (n=2) | +0.43 % |
5Mkeys-string-mget-100B-100keys-pipeline-10 |
standalone | 26,244 (n=2) | 26,266 (n=3) | +0.08 % |
5Mkeys-string-mget-100B-2keys-pipeline-10 |
standalone | 701,637 (n=2) | 701,509 (n=2) | -0.02 % |
5Mkeys-string-mget-100B-5keys-pipeline-10 |
standalone | 351,784 (n=2) | 351,625 (n=2) | -0.05 % |
5Mkeys-string-mget-512B-5keys |
standalone | 129,069 (n=1) | 128,976 (n=2) | -0.07 % |
5Mkeys-string-mget-1KiB-5keys-pipeline-10 |
standalone | 113,092 (n=1) | 112,987 (n=2) | -0.09 % |
5Mkeys-string-mget-100B-5keys |
standalone | 139,258 (n=1) | 139,089 (n=2) | -0.12 % |
5Mkeys-string-mget-100B-2keys |
standalone | 160,049 (n=2) | 159,567 (n=2) | -0.30 % σ↓ |
5Mkeys-string-mget-1KiB-5keys-pipeline-10 |
io-threads-8 | 111,884 (n=1) | 111,834 (n=2) | -0.04 % |
5Mkeys-string-mget-10B-100keys |
io-threads-8 | 18,356 (n=1) | 18,251 (n=2) | -0.57 % |
5Mkeys-string-mget-100B-20keys |
standalone | 84,033 (n=2) | 83,318 (n=2) | -0.85 % σ↓ |
5Mkeys-string-mget-1KiB-5keys |
standalone | 114,255 (n=1) | 112,518 (n=2) | -1.52 % |
5Mkeys-string-mget-100B-30keys |
standalone | 69,012 (n=2) | 67,441 (n=2) | -2.28 % σ↓ |
ARM picture splits two ways: all the big wins land on ARM io-threads-8 (where IO-thread interleaving disrupts the HW prefetcher and the explicit dictPrefetchKeys warm-up does real work — 4 σ-confirmed io-threads wins +5 to +47 %), while ARM standalone is essentially flat to slightly negative on small-N MGET (Graviton4's wide HW prefetcher already absorbs the multi-key MGET case, so the explicit prefetch is overhead). The two big standalone wins on mget-10B-100keys{,-pipeline-10} are the exception — small values + many keys overwhelm even Graviton4's HW prefetcher.
Cross-check on m7i-profiler (single-dp, x86 SPR, with TMA)
Same 25-test set: 24 wins, 1 flat (−1.5 % MSET-load). Headlines reproduce: mget-10B-100keys-pipeline-10 io-threads +95.19 %, mget-100B-100keys-pipeline-10 io-threads +57.58 %, mget-100B-30keys-pipeline-10 io-threads +36.68 %.
TMA on the headline test
mget-100B-100keys-pipeline-10 / io-threads-8: L1_Bound −3.2 pp, Frontend_Bound −1.5 pp, Memory_Bound +3.5 pp (L3 +2.1 pp). Optimization signature: prefetch warming kvobj into L1 before keyCompare, fewer instructions per command, memory becomes the new ceiling at sharply higher throughput.
Notes
- Sundb slot-reuse change in
3d724a05cis gated byserver.cluster_enabled→ on every benchmark above (alloss-standalone*) the new code path is functionally identical to the prior PR head; the result is a clean reproduction of the prior round's wins. - 2-key MGET shows a small consistent flat-to-slight-negative (−0.4 to −2.2 %), consistent with the optimization being an overhead-amortization play that needs ≥ 5 keys to win.
- Across the 50 paired m7i-2 + m8g-2 cells: 17 σ-disjoint wins, 5 σ-disjoint flat-negatives (all ≤ −2.3 %, all ARM-standalone-small-N where Graviton4's HW prefetch dominates), 1 σ-disjoint outlier (
mget-1KiB-5keys-pipeline-10io-threads-8, spec issue noted above).
Co-authored-by: Yuan Wang <wangyuancode@163.com>
committed the suggestion. thank you @ShooterIT ! |
Co-authored-by: Yuan Wang <wangyuancode@163.com>

Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses
across the keys of a single multi-key command. Builds on the cross-command
batched prefetch framework introduced in #14017 and the dict-prefetch state
machine in
memory_prefetch.c, and lifts the kvobject-aware bits out of thestate machine into two new
dictTypecallbacks so the same machinery canbe reused for other dict-encoded types later (hash hashtable, sets, sorted
sets) without paying for
kvobj-specific code paths in the core loop.Bundles the work originally proposed in #14899 (MGET prefetch framework, by
@mpozniak95) and #15043 (MSET batch prefetch), with several follow-up
commits on top:
expire-dict prefetch callback, drop the single-key MGET special case,
route MSET through
dictPrefetchKeysso the existing entry's value iswarmed too;
memory_prefetch.cby @ShooterIT (clearertypes + explicit
Init/Reset/Run/Freelifecycle);drop
MGET_LIGHT_PREFETCH_MAXand the inner bucket-only branch; smallrequests get a single
dictPrefetchKeyscall covering all keys, largerrequests batch in chunks via the same call;
prefetchKeysBatch(c, slot, start, stride)helper by@ShooterIT that deduplicates MGET (
stride=1) and MSET (stride=2)batching paths, with adaptive batch sizing (≥ 2 batches remain → take
one; else take all remaining in one call) and a stack-allocating
fixed-size lookup array.
Design
Two new optional callbacks on
dictType:dbDictTyperegisters both. The kv-aware logic — thedictEntryIsKey()shortcut for embedded kvobjs, and
kv->ptrforOBJ_STRING/OBJ_ENCODING_RAWvalues — now lives in two small helpers inserver.c:The
PrefetchGetValueDataFunctypedef and the per-callget_val_dataparameter on
dictPrefetchKeys()/dictPrefetch()are removed — thedict's own type drives both ends. This also removes the foot-gun where
callers (like
mgetCommand) had to remember whether to passprefetchGetObjectValuePtrorNULL.memory_prefetch.cno longerreferences
kvobj,kvobjGetKey, or any specific value layout.State machine
Two file-local types in
memory_prefetch.c:dictPrefetchLookupdictFind(mirrors the locals a synchronousdictFindwould carry across one bucket walk).dictPrefetcherdictPrefetchLookups through the FSM, yielding to the next in-flight lookup each time a prefetch is issued.Five-stage lifecycle for each lookup, driven by the prefetcher:
PREFETCH_BUCKETfirst picksht_table[0], then flips toht_table[1]if the dict is mid-rehash, then transitions to
PREFETCH_DONEif nomore tables remain.
memory_prefetch.cexposes a small lifecycle that any caller can drive:Each FSM stage is a named static function (
dictPrefetchBucket,dictPrefetchEntry,dictPrefetchEntryKey,dictPrefetchEntryValue),so the
dictPrefetcherRundriver is a four-lineswitchover the state.The state machine is dict-pure: no
kvobjfield ondictPrefetchLookup,no
kvobjGetKeyreach-through. Round-robin advance semantics — a stateonly advances the cursor if a prefetch was actually issued — are
preserved, so the embedded-kvobj fast path
(
dictEntryIsKey(de) == 1→ callback returns NULL) still skips theextra prefetch and falls straight into the compare on the next loop
iteration.
The cross-command path (
prefetchCommands/PrefetchCommandsBatch)embeds a
dictPrefetcherinitialized once at startup and reset perbatch, so cross-command prefetching no longer allocates per call.
Intra-command API
A single multi-key command (e.g. MGET) can prefetch dict data for a
batch of its own keys, reusing the same state machine that the
cross-command path uses. Single-key calls (
nkeys <= 1) early-return —nothing to interleave with. The implementation stack-allocates a
fixed-size lookup array bounded by
DICT_PREFETCH_MAX_SIZE = 64(noVLA, predictable stack usage), so the intra-command path doesn't touch
the heap.
Notes on the call sites
A shared helper picks the next prefetch batch and warms it via
dictPrefetchKeys:Adaptive batch sizing inside the helper: if at least two full batches
(
PREFETCH_BATCH_SIZE * 2 = 32items) remain, take one batch(
PREFETCH_BATCH_SIZE = 16); otherwise take all remaining items in onecall. This generalizes the small-request fast path so the trailing
batch of a large request also gets the single-call benefit.
MGET (
mgetCommand) — gated bydo_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1,with
already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED).When
do_prefetchis set, each iteration callsprefetchKeysBatch(c, slot, j, 1)and then sequentiallylookupKeyReads + replies the chosen batch. Whendo_prefetchisclear (cross-command path already warmed the keys, or batch
prefetching is off), the loop takes all remaining items in one go
and skips the prefetch.
MSET / MSETNX (
msetGenericCommand) — samedo_prefetchgate asMGET with
stride = 2. For the NX flag the NX-check loop runslookupKeyWrite(which already warmed everything viaprefetchKeysBatch); the SET loop then disables further prefetch(
do_prefetch &&= !nx) so we don't re-prefetch on the second pass.Going through the full state machine (rather than bucket-only) means
dbDictType'sprefetchEntryValuecallback runs on a key match —warming the old kvobj's payload, which
setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)then reads to compute thehistogram delta. The slot dict is re-fetched per batch — in cluster
mode the slot dict can be freed mid-MSET (
KVSTORE_FREE_EMPTY_DICTSexpireIfNeeded), so a cached pointer would otherwise dangle.Cross-command batch path (
addCommandToBatch) — setsPENDING_CMD_KEYS_PREFETCHEDon every command added to the batch,even on partial-batch overflow (was: only when ALL keys fit). The
intra-command path then uniformly skips supplemental prefetching for
any command the batch touched. Rationale: running both paths
(cross-command warm + intra-command supplement) caused a measured
−9.6 % regression on x86 with pipeline-10, and the partial cross-
command warmup is sufficient for the head of the keyset; the cold
tail goes through normal lookup, which is still cheaper than running
the FSM a second time on already-warm keys.
Future types: each dict's
dictTypecan register its ownprefetchEntryKey/prefetchEntryValue(e.g. for the hashtable hashencoding, the field-sds and value-sds payloads), without touching
memory_prefetch.c.Benchmark validation
Validated on
x86-aws-m7i.metal-24xl-2(Intel Xeon 8488C, Sapphire Rapids,bare-metal). PR head
792c9ca1cvsunstable2432f5527.2432f5527(Ops/sec)792c9ca1c(Ops/sec)5Mkeys-string-mget-100B-2keys5Mkeys-string-mget-100B-5keys5Mkeys-string-mget-100B-10keys5Mkeys-string-mget-100B-20keys5Mkeys-string-mget-100B-30keys5Mkeys-string-mget-100B-100keys10Mkeys-load-string-mset-10-keys-with-10B-values10Mkeys-load-string-mset-10-keys-with-100B-values10Mkeys-load-string-mset-50-keys-with-10B-values10Mkeys-load-string-mset-50-keys-with-100B-valuesMGET wins scale with batch size — the prefetch hides one DRAM hop per
key, so the longer the batch, the more memory parallelism we recover.
MSET is positive across the board, with the largest test
(
50-keys-100B) showing the value-prefetch fix paying off first(highest cache pressure). ARM (Graviton4) was uniformly flat in earlier
runs against the same
unstablehead, which matches what the underlyinghardware prefetcher already gives.
Note
Medium Risk
Touches core dict lookup prefetching and rewrites MGET/MSET hot paths; while performance-focused, subtle mistakes could cause correctness regressions (e.g., stale dict pointers, cluster slot handling) or reduced performance under some workloads.
Overview
Adds dictType-configurable prefetch hooks (
prefetchEntryKey/prefetchEntryValue) and rewrites the dict prefetch logic into a reusable, kvobj-agnostic state machine that can warm both key and value payloads during lookups.Introduces a new intra-command API
dictPrefetchKeys()and updatesMGET/MSET/MSETNXto prefetch their own key batches (with adaptive batching), while coordinating with the existing cross-command prefetcher via a newPENDING_CMD_KEYS_PREFETCHEDflag to avoid double-prefetching.Registers db keyspace-specific prefetch callbacks in
dbDictTypeand adds unit tests covering multi-batchMSETexecution and an expiry/batch-boundary regression case.Reviewed by Cursor Bugbot for commit bc081cb. Bugbot is set up for automated code reviews on this repo. Configure here.