Skip to content

Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET#14899

Closed
mpozniak95 wants to merge 11 commits into
redis:unstablefrom
mpozniak95:mget_prefetch_framework
Closed

Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET#14899
mpozniak95 wants to merge 11 commits into
redis:unstablefrom
mpozniak95:mget_prefetch_framework

Conversation

@mpozniak95

@mpozniak95 mpozniak95 commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

Motivation

The existing memory_prefetch.c framework amortizes dictionary lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command like MGET key1 ... key100 executes, each lookupKeyRead() triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For MGET 100 keys this means ~400 sequential cache misses with the CPU stalling on each one.

This PR refactors the prefetch state machine to be reusable and applies it inside mgetCommand as the first consumer.

Changes

1. New DictPrefetchCtx abstraction (memory_prefetch.c)

Extracted a lightweight context struct from the global PrefetchCommandsBatch. All state machine functions (prefetchBucket, prefetchEntry, prefetchKVObject, prefetchValueData) were refactored to operate on a DictPrefetchCtx * parameter instead of accessing the global batch directly. The existing cross-command I/O-thread path creates a DictPrefetchCtx from the global batch fields — behavior is fully backward-compatible.

2. New public API (memory_prefetch.h)

void dictPrefetchKeys(struct dict **dicts, void **keys, size_t nkeys,
                      PrefetchGetValueDataFunc get_val_data);
void *prefetchGetObjectValuePtr(const void *val);

dictPrefetchKeys() stack-allocates KeyPrefetchInfo[nkeys] and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.

3. Cross-command awareness (memory_prefetch.c + server.h)

Added PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. When addCommandToBatch() successfully adds all keys of a command to the cross-command batch, it sets this flag. mgetCommand checks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.

4. Optimized mgetCommand (t_string.c)

Processes keys in batches of 16 with three fast paths:

  • Single key: direct lookup, zero overhead
  • Empty dict: NULL-reply loop with lookupKeyRead() for side effects
  • Cross-command prefetched: simple loop when I/O threads already warmed the cache

Benchmark results

Tested on r8i.4xlarge ubuntu24.04 with redis-benchmarks-specification memtier suites. best of 3 iterations.

Single-threaded (no I/O threads)

Test DB size Value size Main (ops/s) Optimized (ops/s) Δ
MGET 100 keys 5M 100B 26,325 39,573 +50%
MGET 30 keys, P=10 5M 100B 84,181 122,095 +45%
MGET 10 keys 5M 100B 119,446 147,563 +24%
MGET 30 keys 5M 100B 74,634 91,602 +23%
MGET 20 keys 5M 100B 95,328 112,204 +18%
MGET 5 keys 5M 512B 134,749 155,004 +15%
MGET 5 keys 5M 1KiB 127,420 145,808 +14%
MGET 5 keys, P=10 5M 512B 280,572 314,293 +12%
MGET 5 keys 5M 100B 155,014 170,170 +10%
MGET 2 keys 5M 100B 174,118 189,762 +9%
MGET 5 keys, P=10 5M 1KiB 133,109 145,478 +9%
MGET 5 keys 1M 1KiB 119,035 126,642 +6%
MGET 2 keys, P=10 5M 100B 885,126 927,125 +5%
MGET 100 keys, P=10 5M 100B 49,941 51,870 +4%
MGET 5 keys, P=10 5M 100B 478,288 494,179 +3%

Single-threaded: +3% to +50% across all tests, zero regressions.

4 I/O threads

Test DB size Value size Main (ops/s) Optimized (ops/s) Δ
MGET 30 keys, P=10 5M 100B 102,871 173,263 +68%
MGET 100 keys, P=10 5M 100B 28,133 44,083 +57%
MGET 5 keys, P=10 5M 100B 720,426 843,673 +17%
MGET 10 keys 5M 100B 376,988 413,916 +10%
MGET 2 keys 5M 100B 474,214 506,097 +7%
MGET 2 keys, P=10 5M 100B 1,989,495 2,113,546 +6%
MGET 5 keys 5M 512B 453,154 467,093 +3%
MGET 30 keys 5M 100B 207,203 211,638 +2%
MGET 5 keys, P=10 5M 512B 522,417 530,606 +2%
MGET 5 keys 5M 100B 459,781 466,000 +1%
MGET 5 keys, P=10 5M 1KiB 298,161 298,208 ~0%
MGET 5 keys 5M 1KiB 298,122 298,262 ~0%
HMGET 5 fields, P=10 1M 100B 2,042,931 2,017,445 -1%
MGET 5 keys 1M 1KiB 135,522 131,678 -3%
MGET 20 keys 5M 100B 261,383 252,574 -3%
MGET 100 keys 5M 100B 77,969 69,382 -11%

4 I/O threads: up to +68% on pipelined workloads. Regressions on non-pipelined high-key-count tests (20, 100 keys) possibly due to partial overlap with cross-command prefetching.

16 I/O threads

Test DB size Value size Main (ops/s) Optimized (ops/s) Δ
MGET 30 keys, P=10 5M 100B 114,994 179,161 +56%
MGET 100 keys, P=10 5M 100B 31,648 48,894 +54%
MGET 5 keys, P=10 5M 512B 484,134 564,920 +17%
MGET 5 keys, P=10 5M 100B 533,706 612,046 +15%
MGET 5 keys 5M 100B 418,656 432,231 +3%
MGET 5 keys 5M 512B 406,153 416,785 +3%
MGET 10 keys 5M 100B 323,832 331,224 +2%
MGET 2 keys, P=10 5M 100B 1,491,821 1,525,486 +2%
MGET 2 keys 5M 100B 522,589 524,531 ~0%
HMGET 5 fields, P=10 1M 100B 1,715,306 1,722,662 ~0%
MGET 5 keys 5M 1KiB 298,251 298,136 ~0%
MGET 5 keys, P=10 5M 1KiB 298,381 298,229 ~0%
MGET 30 keys 5M 100B 163,822 157,660 -4%
MGET 5 keys 1M 1KiB 136,597 130,641 -4%
MGET 20 keys 5M 100B 209,638 198,278 -5%
MGET 100 keys 5M 100B 65,570 59,888 -9%

16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.


Note

Medium Risk
Touches the hot mgetCommand path and refactors the dict prefetch state machine; correctness should be unchanged but performance/behavioral interactions with pipelining and I/O-thread batch prefetching could regress or be hard to reason about.

Overview
Refactors the dict prefetch state machine in memory_prefetch.c to run against a new DictPrefetchCtx, allowing the same logic to be reused outside the global cross-command batch.

Adds a public intra-command API dictPrefetchKeys() (plus PrefetchGetValueDataFunc and prefetchGetObjectValuePtr in memory_prefetch.h) so multi-key commands can prefetch their own key lookups without heap allocation.

Updates cross-command batching to set a new PENDING_CMD_KEYS_PREFETCHED flag only when all keys of a pending command were added to the batch, and rewrites mgetCommand to batch keys (size 16), prefetch dict buckets/state-machine per batch, and skip intra-command prefetch when cross-command prefetching already warmed the cache.

Reviewed by Cursor Bugbot for commit 362fbdd. Bugbot is set up for automated code reviews on this repo. Configure here.

@CLAassistant

CLAassistant commented Mar 17, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@fcostaoliveira fcostaoliveira requested a review from sundb March 17, 2026 17:22
Comment thread src/memory_prefetch.c Outdated
@sundb sundb requested a review from ShooterIT March 18, 2026 00:33
Comment thread src/memory_prefetch.c
Comment thread src/t_string.c Outdated
@fcostaoliveira

Copy link
Copy Markdown
Collaborator

Preliminary performance analysis for PR #14899 — MGET Prefetch Framework

Note: This is preliminary data from targeted benchmarks. More comprehensive results across additional tests and platforms to follow.

Benchmarks triggered for mpozniak95/redis branch mget_prefetch_framework commit 70d4444ce5f4885945560bfe688fa3e2f3aa1fca.

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

Test Case Baseline redis/redis unstable (median obs. +- std.dev) Comparison (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-5Mkeys-string-mget-100B-100keys 43890 +- 1.4% (2 datapoints) 48747 +- 0.1% (2 datapoints) +11.1% IMPROVEMENT
memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10 56946 +- 1.3% (2 datapoints) 55794 +- 0.9% (2 datapoints) -2.0% No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys 197110 +- 0.3% (3 datapoints) 197841 +0.4% No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10 1010214 +- 0.2% (3 datapoints) 1019085 +0.9% No Change
memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10 706908 +- 0.6% (3 datapoints) 714618 +1.1% No Change

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

Test Case Baseline redis/redis unstable (median obs. +- std.dev) Comparison (median obs. +- std.dev) % change (higher-better) Note
memtier_benchmark-5Mkeys-string-mget-100B-100keys 32691 +- 1.1% (2 datapoints) 40872 +- 0.7% (2 datapoints) +25.0% IMPROVEMENT
memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10 60869 +- 4.8% (2 datapoints) 55051 +- 0.6% (2 datapoints) -9.6% REGRESSION
memtier_benchmark-5Mkeys-string-mget-100B-2keys 169522 +- 1.3% (3 datapoints) 172012 +1.5% No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10 893595 +- 0.3% (3 datapoints) 892429 -0.1% No Change
memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10 803727 +- 1.4% (3 datapoints) 799166 -0.6% No Change

IO-threads (oss-standalone-08-io-threads) — setget2000c regression guard

Platform Baseline unstable (median obs. +- std.dev) Comparison (median obs.) % change Note
arm-aws-m8g.metal-24xl 3182190 +- 0.4% (3 datapoints) 3173256 -0.3% No Change
x86-aws-m7i.metal-24xl 2865474 +- 1.0% (3 datapoints) 2705671 -5.6% REGRESSION

Summary

  • MGET 100 keys (no pipeline): +11.1% ARM, +25.0% x86 — confirmed with 2 datapoints, low variance
  • MGET 2 keys: No change on both platforms (expected — minimal prefetch benefit)
  • MGET 100 keys pipeline-10: No change on ARM, -9.6% regression on x86 — confirmed with 2 datapoints
  • IO-threads (setget2000c): No change on ARM, -5.6% regression on x86 with 8 IO threads
  • The x86 pipeline/IO-thread regressions suggest the intra-command prefetch may conflict with the cross-command batch prefetch path under high concurrency on x86

@fcostaoliveira

Copy link
Copy Markdown
Collaborator

Updated performance analysis — after commit 4ef667a (pipeline double-prefetch fix)

The previous analysis (commit 70d4444ce) showed -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c IO-threads on x86. Commit 4ef667ae6 adds a guard to skip intra-command prefetch when pipeline is active (ready_len > 1), since the cross-command batch path already partially warms the hash table.

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

Test Case Baseline unstable (median +- std.dev) Comparison (median) % change Note
mget-100B-100keys 43890 +- 1.4% (2 dp) 48562 +10.6% IMPROVEMENT
mget-100B-100keys-pipeline-10 56946 +- 1.3% (2 dp) 57581 +1.1% No Change
mget-100B-2keys 197110 +- 0.3% (3 dp) 198193 +0.5% No Change
mget-100B-2keys-pipeline-10 1010214 +- 0.2% (3 dp) 1016831 +0.7% No Change
setget2000c-1KiB-pipeline-10 706908 +- 0.6% (3 dp) 715233 +1.2% No Change

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

Test Case Baseline unstable (median +- std.dev) Comparison (median) % change Note
mget-100B-100keys-pipeline-10 60869 +- 4.8% (2 dp) 61464 +1.0% No Change
mget-100B-2keys-pipeline-10 893595 +- 0.3% (3 dp) 893119 -0.1% No Change
mget-100B-2keys 169522 +- 1.3% (3 dp) 164955 -2.7% No Change
mget-100B-100keys 32691 +- 1.1% (2 dp) 32137 -1.7% No Change
setget2000c-1KiB-pipeline-10 803727 +- 1.4% (3 dp) 801778 -0.2% No Change

Summary — before vs after fix

Metric Before (70d4444) After (4ef667a)
x86 mget-100keys-pipeline-10 -9.6% REGRESSION +1.0% No Change
x86 setget2000c standalone -0.6% -0.2%
ARM mget-100keys (no pipeline) +11.1% IMPROVEMENT +10.6% IMPROVEMENT
ARM mget-100keys-pipeline-10 -2.0% +1.1%

The pipeline regression on x86 is resolved. No regressions detected above the 3% threshold on either platform. IO-threads topology (oss-standalone-08-io-threads) was not triggered for this commit — pending re-run to confirm the -5.6% x86 IO-threads regression is also fixed.

@sundb @mpozniak95 — We're triggering a full benchmark run now covering both oss-standalone and oss-standalone-08-io-threads topologies, on both x86 and ARM. Will follow up with complete results.

Comment thread src/t_string.c Outdated
@fcostaoliveira fcostaoliveira added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 29, 2026
mpozniak95 and others added 6 commits March 30, 2026 21:08
…keys

Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in
addCommandToBatch() when a command is added to the cross-command
prefetch batch. In mgetCommand, check the flag and fall back to
plain sequential lookups when keys are already warm from the batch.

This eliminates redundant prefetch work that caused regressions
of up to -16% with many I/O threads, while preserving the large
gains (+52-54%) for pipelined workloads where the batch overflows.
The previous guard only skipped intra-command dictPrefetchKeys() when
PENDING_CMD_KEYS_PREFETCHED was set, which requires ALL keys to fit in
the batch. For MGET 100 keys with a batch size of 32, the flag was
never set — both prefetch paths always ran.

On x86 with pipeline-10, this double-prefetching caused -9.6% regression
on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c with 8 IO
threads due to cache-bandwidth contention in ctxPrefetchAndAdvance.

Fix: skip intra-command prefetch when either:
1. PENDING_CMD_KEYS_PREFETCHED is set (all keys fit in batch), OR
2. Pipeline is active (ready_len > 1) — partial batch warmup is
   sufficient, double-prefetching hurts more than it helps.

Single non-pipelined MGET (ready_len <= 1) still uses intra-command
prefetch, preserving the +11% ARM / +25% x86 improvement.
The ready_len > 1 heuristic that skips intra-command dictPrefetchKeys()
assumed cross-command batch prefetching always runs during pipelining.
When prefetch-batch-max-size is 0 (batch disabled), addCommandToBatch()
returns C_ERR immediately and no keys are warmed, but ready_len > 1
still holds for any pipeline — causing MGET to skip its only prefetch
opportunity.

Add server.prefetch_batch_max_size > 0 to the ready_len guard so
intra-command prefetch runs when the batch system is disabled.
@fcostaoliveira fcostaoliveira force-pushed the mget_prefetch_framework branch from 2b30246 to a470915 Compare March 30, 2026 20:10
@fcostaoliveira fcostaoliveira removed the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 30, 2026
@fcostaoliveira

Copy link
Copy Markdown
Collaborator

Updated performance analysis — rebased on latest unstable (commit a470915)

Previous analysis was against a stale baseline (44 commits behind unstable). Rebased the branch and re-ran benchmarks against current unstable HEAD.

Note: Still collecting more MGET data across additional pipeline variants and ARM. This is preliminary data from the first IO-threads MGET benchmarks ever run on this infrastructure.

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

Test Case Baseline unstable (median +- std.dev) Comparison (median) % change Note
mget-100B-100keys 32691 +- 1.1% (2 dp) 40491 +23.9% IMPROVEMENT
mget-100B-30keys 76546 +- 0.4% (2 dp) 87551 +14.4% IMPROVEMENT
mget-100B-20keys 96788 +- 2.0% (3 dp) 105752 +9.3% IMPROVEMENT
mget-100B-5keys 151999 +- 0.1% (3 dp) 157728 +3.8% IMPROVEMENT
mget-100B-10keys 131399 +- 0.6% (3 dp) 135433 +3.1% IMPROVEMENT
mget-100B-5keys-pipeline-10 465659 +- 0.3% (3 dp) 480291 +3.1% IMPROVEMENT
mget-1KiB-5keys 123074 125884 +2.3% No Change
mget-1KiB-5keys-pipeline-10 122349 124211 +1.5% No Change
mget-512B-5keys-pipeline-10 315258 318174 +0.9% No Change
mget-100B-30keys-pipeline-10 110900 111470 +0.5% No Change
setget2000c-pipeline-10 803727 805587 +0.2% No Change
setget2000c-pipeline-16 990524 989529 -0.1% No Change
mget-100B-2keys-pipeline-10 893595 891518 -0.2% No Change
mget-100B-2keys 169522 168710 -0.5% No Change
mget-100B-100keys-pipeline-10 61394 60846 -0.9% No Change
hmget-pipeline-10 958532 941637 -1.8% No Change
mget-1KiB (10 keys × 1KiB) 134096 125252 -6.6% REGRESSION

x86 (x86-aws-m7i.metal-24xl) — mget-100B-100keys-pipeline-10 across IO-threads topologies (NEW)

Topology Baseline unstable Comparison % change Note
oss-standalone 61394 60846 -0.9% No Change
oss-standalone-02-io-threads 38205 39458 +3.3% IMPROVEMENT
oss-standalone-04-io-threads 35474 37697 +6.3% IMPROVEMENT
oss-standalone-08-io-threads 48640 51585 +6.1% IMPROVEMENT

Summary

  • Non-pipelined MGET with 5-100 keys (100B values): +3% to +24% improvement on x86 — scales with key count
  • Pipelined MGET: flat to +3% — no regression (the original -9.6% regression is resolved)
  • IO-threads MGET (pipeline-10, 100 keys): +3% to +6% improvement across 2/4/8 IO-thread topologies
  • setget2000c (SET/GET regression guard): flat (0%) on both pipeline-10 and pipeline-16
  • One regression: mget-1KiB (10 keys × 1KiB values, no pipeline) shows -6.6% — the dictPrefetchKeys overhead outweighs benefit for large values with small key counts. This test uses 1KiB values where prefetch state machine overhead per key is proportionally higher

Still collecting ARM standalone and additional MGET pipeline IO-threads data. Will follow up with complete results.

@sundb @mpozniak95

@fcostaoliveira

Copy link
Copy Markdown
Collaborator

EMBSTR (10B values) benchmark results — new test coverage

Added new MGET benchmarks with 10-byte values (OBJ_ENCODING_EMBSTR, where value data is embedded in the kv object) to cover all string encoding types. These use --distinct-client-seed for realistic cache behavior across clients.

x86 (x86-aws-m7i.metal-24xl) — mget-10B-100keys

Topology Baseline unstable Comparison (PR) % change Note
oss-standalone 21974 34152 +55.4% IMPROVEMENT
oss-standalone-02-io-threads 31047 42977 +38.4% IMPROVEMENT
oss-standalone-04-io-threads 31234 43042 +37.8% IMPROVEMENT
oss-standalone-08-io-threads 30804 42630 +38.4% IMPROVEMENT

ARM (arm-aws-m8g.metal-24xl) — mget-10B-100keys

Topology Baseline unstable Comparison (PR) % change Note
oss-standalone 21490 45194 +110.3% IMPROVEMENT
oss-standalone-02-io-threads 32471 65174 +100.7% IMPROVEMENT
oss-standalone-04-io-threads 32522 65371 +101.0% IMPROVEMENT
oss-standalone-08-io-threads 32356 65369 +102.0% IMPROVEMENT

x86 — mget-10B-100keys-pipeline-10

Topology Baseline unstable Comparison (PR) % change Note
oss-standalone 22717 24182 +6.4% IMPROVEMENT

ARM — mget-10B-100keys-pipeline-10

Topology Baseline unstable Comparison (PR) % change Note
oss-standalone 22891 24347 +6.4% IMPROVEMENT

Known regression: memtier_benchmark-1Mkeys-string-mget-1KiB

This test (10 keys × 1KiB RAW values, no pipeline) shows -8% on x86 and -6% on ARM. The profiling data shows the dictPrefetchKeys value-data prefetch callback (prefetchGetObjectValuePtr) adds overhead for large RAW values — it prefetches only the first 64 bytes of a 1KiB separate allocation, while adding a full state-machine iteration per key. We're working on a fix (skip value-data prefetch in the intra-command path) and will follow up with updated results.

@sundb @mpozniak95

The prefetchGetObjectValuePtr callback adds one state-machine iteration
per key to prefetch the first cache line of the value allocation.
For EMBSTR values (<=44B) it returns NULL (value is embedded in the kv
object), making the iteration pure overhead.  For large RAW values
(>=1KiB) it warms <10% of the data while adding measurable cost —
profiling shows 3B samples in ctxPrefetchValdata vs 17.8B total in the
prefetch path.

Pass NULL instead: skip value-data prefetch entirely in the MGET
intra-command path.  The dict bucket/entry/kv-object prefetch provides
the bulk of the improvement; addReplyBulk handles value data access
sequentially.
Three optimizations to the dict prefetch state machine:

1. Replace modulo with branch in ctxPrefetchAndAdvance — avoids
   expensive division on every state transition.

2. Add remaining counter to DictPrefetchCtx — ctxNextInfo returns
   NULL immediately when all keys are done instead of scanning the
   full array. Eliminates O(n) scan per call as keys complete.

3. Fast-path in ctxPrefetchKvobj when no value-data callback and
   entry is a direct key — do key comparison inline and skip the
   VALDATA state entirely, saving one round-trip through the state
   machine per key.
For batches of <=10 keys, replace the full dictPrefetchKeys state
machine with a simple loop that prefetches hash table buckets only.

The state machine has ~4 states per key with function dispatch and
round-robin scanning overhead.  Profiling shows this costs 4.3% of
total CPU for 10-key MGET with 1KiB values — matching the observed
regression exactly.  A single-pass bucket prefetch eliminates the
iteration overhead while still warming the hash chain heads.

Larger batches (>10 keys) continue using the full state machine where
the multi-round interleaving provides net benefit.
@fcostaoliveira fcostaoliveira force-pushed the mget_prefetch_framework branch from 186f995 to dcb8aa8 Compare March 31, 2026 22:21
Comment thread src/t_string.c
Comment thread src/t_string.c
keys[k] = c->argv[j + k]->ptr;
dicts[k] = d;
}
dictPrefetchKeys(dicts, keys, n, NULL);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value-data callback NULL contradicts documented example usage

Medium Severity

dictPrefetchKeys is called with NULL for the get_val_data callback, so the VALDATA state never prefetches value data for RAW-encoded strings. The cross-command batch path passes prefetchGetObjectValuePtr, and the documented example in the same file (line 433) also shows passing prefetchGetObjectValuePtr. This means for batches of 11–16 keys using the full state machine, kv->ptr for RAW strings won't be in cache when addReplyBulk accesses it.

Additional Locations (1)
Fix in Cursor Fix in Web

@fcostaoliveira

Copy link
Copy Markdown
Collaborator

Final performance analysis — commit dcb8aa8 (rebased on latest unstable)

Full MGET benchmark suite across all string encodings (EMBSTR 10B, RAW 100B/512B/1KiB), key counts (2–100), pipeline depths (1, 10), and topologies (standalone + 02/04/08 IO-threads), on both x86 and ARM.

Summary

  • x86: up to +67.9% improvement (MGET 10B EMBSTR, 100 keys, standalone)
  • ARM: up to +112.7% improvement (MGET 10B EMBSTR, 100 keys, standalone)
  • Zero regressions across all topologies, encodings, and pipeline configurations on both platforms

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 21974 36894 +67.9% IMPROVEMENT
mget-100B-100keys 32022 39493 +23.3% IMPROVEMENT
mget-100B-30keys 77470 85734 +10.7% IMPROVEMENT
mget-100B-20keys 96470 103945 +7.7% IMPROVEMENT
mget-10B-100keys-pipeline-10 22717 24245 +6.7% IMPROVEMENT
mget-100B-2keys-pipeline-10 887290 935785 +5.5% IMPROVEMENT
mget-100B-5keys-pipeline-10 470951 489931 +4.0% IMPROVEMENT
mget-512B-5keys-pipeline-10 316216 326153 +3.1% IMPROVEMENT
mget-100B-100keys-pipeline-10 61319 63101 +2.9% No Change
mget-100B-30keys-pipeline-10 109449 112017 +2.3% No Change
mget-100B-5keys 151458 154636 +2.1% No Change
mget-100B-10keys 133057 135861 +2.1% No Change
mget-512B-5keys 137861 138955 +0.8% No Change
mget-1KiB-5keys-pipeline-10 123179 124145 +0.8% No Change
mget-1KiB-5keys 124863 122805 -1.6% No Change
mget-1KiB 137088 134841 -1.6% No Change
mget-100B-2keys 171538 168515 -1.8% No Change

x86 — oss-standalone-02-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 31047 48641 +56.7% IMPROVEMENT
mget-10B-100keys-pipeline-10 25398 27733 +9.2% IMPROVEMENT
mget-100B-30keys-pipeline-10 131002 138359 +5.6% IMPROVEMENT
mget-100B-100keys-pipeline-10 38582 39892 +3.4% IMPROVEMENT

x86 — oss-standalone-04-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 31234 48441 +55.1% IMPROVEMENT
mget-100B-5keys-pipeline-10 1132402 1310331 +15.7% IMPROVEMENT
mget-10B-100keys-pipeline-10 25876 28283 +9.3% IMPROVEMENT
mget-100B-2keys-pipeline-10 2283308 2467139 +8.1% IMPROVEMENT
mget-100B-30keys-pipeline-10 115508 123581 +7.0% IMPROVEMENT
mget-100B-100keys-pipeline-10 35151 37591 +6.9% IMPROVEMENT

x86 — oss-standalone-08-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-100B-2keys-pipeline-10 1564255 1812083 +15.8% IMPROVEMENT
mget-10B-100keys-pipeline-10 25826 28033 +8.5% IMPROVEMENT
mget-512B-5keys-pipeline-10 894374 955192 +6.8% IMPROVEMENT
mget-1KiB-5keys-pipeline-10 438268 466975 +6.6% IMPROVEMENT
mget-10B-100keys 30804 47677 +54.8% IMPROVEMENT
mget-100B-5keys-pipeline-10 987120 1038147 +5.2% IMPROVEMENT
mget-100B-100keys-pipeline-10 50261 52086 +3.6% IMPROVEMENT
mget-100B-30keys-pipeline-10 188336 194892 +3.5% IMPROVEMENT

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 21490 45710 +112.7% IMPROVEMENT
mget-100B-30keys 85239 93681 +9.9% IMPROVEMENT
mget-100B-20keys 107623 115082 +6.9% IMPROVEMENT
mget-100B-100keys 43605 46612 +6.9% IMPROVEMENT
mget-10B-100keys-pipeline-10 22891 24398 +6.6% IMPROVEMENT
mget-100B-10keys 145909 150486 +3.1% IMPROVEMENT
mget-100B-30keys-pipeline-10 104580 106673 +2.0% No Change
mget-100B-5keys-pipeline-10 545431 551314 +1.1% No Change
mget-512B-5keys 160471 162215 +1.1% No Change
mget-100B-5keys 175937 177940 +1.1% No Change
mget-1KiB-5keys-pipeline-10 146393 146798 +0.3% No Change
mget-100B-2keys 196102 196724 +0.3% No Change
mget-1KiB-5keys 147608 147504 -0.1% No Change
mget-1KiB 151912 151250 -0.4% No Change
mget-100B-100keys-pipeline-10 56851 56637 -0.4% No Change
mget-512B-5keys-pipeline-10 349163 343147 -1.7% No Change
mget-100B-2keys-pipeline-10 1003653 977958 -2.6% No Change

ARM — oss-standalone-02-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 32471 67537 +108.0% IMPROVEMENT
mget-10B-100keys-pipeline-10 26331 28862 +9.6% IMPROVEMENT
mget-100B-30keys-pipeline-10 127850 128544 +0.5% No Change
mget-100B-100keys-pipeline-10 36779 35526 -3.4% No Change

ARM — oss-standalone-04-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 32522 67666 +108.1% IMPROVEMENT
mget-10B-100keys-pipeline-10 26653 29277 +9.8% IMPROVEMENT
mget-100B-2keys-pipeline-10 3066805 3274012 +6.8% IMPROVEMENT
mget-100B-5keys-pipeline-10 1142510 1205380 +5.5% IMPROVEMENT
mget-100B-30keys-pipeline-10 92741 96967 +4.6% IMPROVEMENT
mget-100B-100keys-pipeline-10 29689 30550 +2.9% No Change

ARM — oss-standalone-08-io-threads

Test Case Baseline unstable Comparison (PR) % change Note
mget-10B-100keys 32356 67338 +108.1% IMPROVEMENT
mget-100B-100keys-pipeline-10 39596 51576 +30.3% IMPROVEMENT
mget-10B-100keys-pipeline-10 26425 28987 +9.7% IMPROVEMENT
mget-100B-30keys-pipeline-10 187125 197733 +5.7% IMPROVEMENT
mget-100B-2keys-pipeline-10 2702842 2815783 +4.2% IMPROVEMENT
mget-100B-5keys-pipeline-10 1253545 1299721 +3.7% IMPROVEMENT

Benchmark coverage

Tested across all string encodings that MGET handles:

  • EMBSTR (≤44B): 10B values — value embedded in kv object
  • RAW (>44B): 100B, 512B, 1KiB values — separate allocation

Key counts: 2, 5, 10, 20, 30, 100 keys per MGET.
Pipeline: 1 (no pipeline) and 10.
Topologies: standalone, 02/04/08-io-threads.
Platforms: x86 (Intel m7i.metal-24xl), ARM (Graviton m8g.metal-24xl).
All benchmarks use --distinct-client-seed for realistic cache behavior.

@sundb @mpozniak95

… callback

Two review comments from Cursor Bugbot on dcb8aa8:

1. Missing `#undef MGET_LIGHT_PREFETCH_MAX` at the end of mgetCommand.
   MGET_BATCH was already undef'd but its sibling was not.

2. The `dictPrefetchKeys()` docstring example passed
   `prefetchGetObjectValuePtr` as the value-data callback, while the
   mgetCommand caller passes NULL on purpose (the callback's extra state
   machine work costs more than warming the value payload saves, since the
   following lookupKeyRead + addReplyBulk touch the payload immediately and
   the hardware prefetcher is already on it). Update the example to show
   the typical NULL form and document when passing a non-NULL callback
   would actually be worthwhile.
@fcostaoliveira

Copy link
Copy Markdown
Collaborator

CE Performance Automation : step 1 of 2 (build) STARTING...

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 23:37:47.892481
You can check each build/benchmark progress in grafana:

  • git hash: 362fbdd
  • git branch: mget_prefetch_framework
  • commit date and time: 2026-04-22 00:32:11+01:00
  • commit summary: Address review comments: undef MGET_LIGHT_PREFETCH_MAX + clarify NULL callback
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 100000
    • test name regex: .mget.
    • command group regex: .*

@fcostaoliveira

fcostaoliveira commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2026-04-21 23:38:48.967283 and took 104.768925 seconds up until now.
Status: [####----------------------------------------------------------------------------] 5.56% completed.

In total will run 18 benchmarks.
- 17 pending.
- 1 completed:
- 1 successful.
- 0 failed.
You can check a the status in detail via the grafana link

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.

Comment thread src/memory_prefetch.c
.get_val_data = get_val_data,
};

server.stat_total_prefetch_batches++;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefetch stats now conflate IO-threaded and main-thread operations

Low Severity

dictPrefetchKeys increments server.stat_total_prefetch_batches and its state machine increments server.stat_total_prefetch_entries via ctxMarkDone, but these counters are reported to users as io_threaded_total_prefetch_batches and io_threaded_total_prefetch_entries. Intra-command prefetching from mgetCommand runs on the main thread without any IO threads involved, inflating these IO-thread-specific metrics and making them unreliable for monitoring IO thread performance.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.

ShooterIT added a commit that referenced this pull request May 11, 2026
…15133)

Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses
across the keys of a single multi-key command. Builds on the cross-command
batched prefetch framework introduced in #14017 and the dict-prefetch state
machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the
state machine into two new `dictType` callbacks so the same machinery can
be reused for other dict-encoded types later (hash hashtable, sets, sorted
sets) without paying for `kvobj`-specific code paths in the core loop.

Bundles the work originally proposed in #14899 (MGET prefetch framework,
by @mpozniak95) and #15043 (MSET batch prefetch).

## Design

Two new optional callbacks on `dictType`:

```c
typedef struct dictType {
    ...
    /* Bring the entry's key payload into cache before keyCompare runs.
     * Returns the address to prefetch, or NULL if the entry alone is enough. */
    void *(*prefetchEntryKey)(const dictEntry *de);

    /* Called only after a key match. Returns the value-side payload to
     * prefetch (or NULL). */
    void *(*prefetchEntryValue)(const dictEntry *de);
} dictType;
```

`dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()`
shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` /
`OBJ_ENCODING_RAW` values — now lives in two small helpers in
`server.c`:

```c
static void *dbDictPrefetchEntryKey(const dictEntry *de) {
    return dictEntryIsKey(de) ? NULL : dictGetKey(de);
}

static void *dbDictPrefetchEntryValue(const dictEntry *de) {
    kvobj *kv = dictGetKey(de);
    return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW)
            ? kv->ptr : NULL;
}
```

The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data`
parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the
dict's own type drives both ends. This also removes the foot-gun where
callers (like `mgetCommand`) had to remember whether to pass
`prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer
references `kvobj`, `kvobjGetKey`, or any specific value layout.

## State machine

Two file-local types in `memory_prefetch.c`:

| Type | Role |
|---|---|
| `dictPrefetchLookup` | Per-key snapshot of an in-flight,
software-pipelined `dictFind` (mirrors the locals a synchronous
`dictFind` would carry across one bucket walk). |
| `dictPrefetcher` | Driver that advances a batch of
`dictPrefetchLookup`s through the FSM, yielding to the next in-flight
lookup each time a prefetch is issued. |

Five-stage lifecycle for each lookup, driven by the prefetcher:

```text
                                                           │
                                                         start
                                                           │
                                                  ┌────────▼─────────┐
                                       ┌─────────►│  PREFETCH_BUCKET ├────►────────┐
                                       │          └────────┬─────────┘            no more tables
                                       │             bucket│found                  │
                                       │                   │                       │
        entry not found - goto next table         ┌────────▼────────┐              │
                                       └────◄─────┤ PREFETCH_ENTRY  │              ▼
                                    ┌────────────►└────────┬────────┘              │
                                    │                 entry│found                  │
                                    │                      │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    │          │   PREFETCH_ENTRY_KEY    │ ◄── dictType->prefetchEntryKey(de)
                                    │          └───────────┬─────────────┘         │
                                    │                      │                       │
        key mismatch - goto next entry                     │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    └──────◄───│   PREFETCH_ENTRY_VALUE  │ ◄── keyCompare; on match,
                                               └───────────┬─────────────┘     dictType->prefetchEntryValue(de)
                                                           │                       │
                                                 ┌─────────▼─────────────┐         │
                                                 │     PREFETCH_DONE     │◄────────┘
                                                 └───────────────────────┘
```

`PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]`
if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no
more tables remain.

`memory_prefetch.c` exposes a small lifecycle that any caller can drive:

```c
dictPrefetcherInit(p, max_keys);                  /* one-shot heap alloc of lookups[] */
dictPrefetcherReset(p, dicts, keys, nkeys);       /* configure for one batch */
dictPrefetcherRun(p);                             /* drive FSM until all PREFETCH_DONE */
dictPrefetcherFree(p);                            /* release */
```

Each FSM stage is a named static function (`dictPrefetchBucket`,
`dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`),
so the `dictPrefetcherRun` driver is a four-line `switch` over the
state.

The state machine is dict-pure: no `kvobj` field on
`dictPrefetchLookup`,
no `kvobjGetKey` reach-through. Round-robin advance semantics — a state
only advances the cursor if a prefetch was actually issued — are
preserved, so the embedded-kvobj fast path
(`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the
extra prefetch and falls straight into the compare on the next loop
iteration.

The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`)
embeds a `dictPrefetcher` initialized once at startup and reset per
batch, so cross-command prefetching no longer allocates per call.

## Intra-command API

```c
void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys);
```

A single multi-key command (e.g. MGET) can prefetch dict data for a
batch of its own keys, reusing the same state machine that the
cross-command path uses. Single-key calls (`nkeys <= 1`) early-return —
nothing to interleave with. The implementation stack-allocates a
fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no
VLA, predictable stack usage), so the intra-command path doesn't touch
the heap.

## Notes on the call sites

A shared helper picks the next prefetch batch and warms it via
`dictPrefetchKeys`:

```c
/* Pick the next prefetch batch starting at argv[start] and warm it via
 * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for
 * key/value pairs (MSET). Returns the chosen batch size in items. */
static int prefetchKeysBatch(client *c, int slot, int start, int stride);
```

Adaptive batch sizing inside the helper: if at least two full batches
(`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch
(`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one
call. This generalizes the small-request fast path so the trailing
batch of a large request also gets the single-call benefit.

- **MGET (`mgetCommand`)** — gated by
`do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`,
with `already_prefetched = c->current_pending_cmd &&
(c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`.
  When `do_prefetch` is set, each iteration calls
  `prefetchKeysBatch(c, slot, j, 1)` and then sequentially
  `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is
  clear (cross-command path already warmed the keys, or batch
  prefetching is off), the loop takes all remaining items in one go
  and skips the prefetch.

- **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as
  MGET with `stride = 2`. For the NX flag the NX-check loop runs
  `lookupKeyWrite` (which already warmed everything via
  `prefetchKeysBatch`); the SET loop then disables further prefetch
  (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass.
  Going through the full state machine (rather than bucket-only) means
  `dbDictType`'s `prefetchEntryValue` callback runs on a key match —
  warming the old kvobj's payload, which `setKey -> dbReplaceValue ->
  updateKeysizesHist(oldlen, newlen)` then reads to compute the
  histogram delta. The slot dict is re-fetched per batch — in cluster
  mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS`
  + `expireIfNeeded`), so a cached pointer would otherwise dangle.

- **Cross-command batch path (`addCommandToBatch`)** — sets
  `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch,
  even on partial-batch overflow (was: only when ALL keys fit). The
  intra-command path then uniformly skips supplemental prefetching for
  any command the batch touched. Rationale: running both paths
  (cross-command warm + intra-command supplement) caused a measured
  −9.6 % regression on x86 with pipeline-10, and the partial cross-
  command warmup is sufficient for the head of the keyset; the cold
  tail goes through normal lookup, which is still cheaper than running
  the FSM a second time on already-warm keys.

- **Future types**: each dict's `dictType` can register its own
  `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash
  encoding, the field-sds and value-sds payloads), without touching
  `memory_prefetch.c`.

## Benchmark validation

On x86, performance improvements are significant for larger batch sizes:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44%
  - 5Mkeys-string-mget-100B-100keys: +37.33%
  - 5Mkeys-string-mget-100B-30keys: +22.40%
On ARM (Graviton4), the gains are even more pronounced:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34%
  - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76%
Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions.

---------

Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
@ShooterIT

Copy link
Copy Markdown
Member

closing this since we merged #15133

@ShooterIT ShooterIT closed this May 11, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in Redis 8.8 May 11, 2026
@sundb sundb removed this from Redis 8.8 May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants