Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET#14899
Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET#14899mpozniak95 wants to merge 11 commits into
Conversation
Preliminary performance analysis for PR #14899 — MGET Prefetch Framework
Benchmarks triggered for ARM (
|
| Test Case | Baseline redis/redis unstable (median obs. +- std.dev) | Comparison (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---|---|---|
| memtier_benchmark-5Mkeys-string-mget-100B-100keys | 43890 +- 1.4% (2 datapoints) | 48747 +- 0.1% (2 datapoints) | +11.1% | IMPROVEMENT |
| memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10 | 56946 +- 1.3% (2 datapoints) | 55794 +- 0.9% (2 datapoints) | -2.0% | No Change |
| memtier_benchmark-5Mkeys-string-mget-100B-2keys | 197110 +- 0.3% (3 datapoints) | 197841 | +0.4% | No Change |
| memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10 | 1010214 +- 0.2% (3 datapoints) | 1019085 | +0.9% | No Change |
| memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10 | 706908 +- 0.6% (3 datapoints) | 714618 | +1.1% | No Change |
x86 (x86-aws-m7i.metal-24xl) — oss-standalone
| Test Case | Baseline redis/redis unstable (median obs. +- std.dev) | Comparison (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---|---|---|
| memtier_benchmark-5Mkeys-string-mget-100B-100keys | 32691 +- 1.1% (2 datapoints) | 40872 +- 0.7% (2 datapoints) | +25.0% | IMPROVEMENT |
| memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10 | 60869 +- 4.8% (2 datapoints) | 55051 +- 0.6% (2 datapoints) | -9.6% | REGRESSION |
| memtier_benchmark-5Mkeys-string-mget-100B-2keys | 169522 +- 1.3% (3 datapoints) | 172012 | +1.5% | No Change |
| memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10 | 893595 +- 0.3% (3 datapoints) | 892429 | -0.1% | No Change |
| memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10 | 803727 +- 1.4% (3 datapoints) | 799166 | -0.6% | No Change |
IO-threads (oss-standalone-08-io-threads) — setget2000c regression guard
| Platform | Baseline unstable (median obs. +- std.dev) | Comparison (median obs.) | % change | Note |
|---|---|---|---|---|
| arm-aws-m8g.metal-24xl | 3182190 +- 0.4% (3 datapoints) | 3173256 | -0.3% | No Change |
| x86-aws-m7i.metal-24xl | 2865474 +- 1.0% (3 datapoints) | 2705671 | -5.6% | REGRESSION |
Summary
- MGET 100 keys (no pipeline): +11.1% ARM, +25.0% x86 — confirmed with 2 datapoints, low variance
- MGET 2 keys: No change on both platforms (expected — minimal prefetch benefit)
- MGET 100 keys pipeline-10: No change on ARM, -9.6% regression on x86 — confirmed with 2 datapoints
- IO-threads (setget2000c): No change on ARM, -5.6% regression on x86 with 8 IO threads
- The x86 pipeline/IO-thread regressions suggest the intra-command prefetch may conflict with the cross-command batch prefetch path under high concurrency on x86
Updated performance analysis — after commit 4ef667a (pipeline double-prefetch fix)
ARM (
|
| Test Case | Baseline unstable (median +- std.dev) | Comparison (median) | % change | Note |
|---|---|---|---|---|
| mget-100B-100keys | 43890 +- 1.4% (2 dp) | 48562 | +10.6% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 56946 +- 1.3% (2 dp) | 57581 | +1.1% | No Change |
| mget-100B-2keys | 197110 +- 0.3% (3 dp) | 198193 | +0.5% | No Change |
| mget-100B-2keys-pipeline-10 | 1010214 +- 0.2% (3 dp) | 1016831 | +0.7% | No Change |
| setget2000c-1KiB-pipeline-10 | 706908 +- 0.6% (3 dp) | 715233 | +1.2% | No Change |
x86 (x86-aws-m7i.metal-24xl) — oss-standalone
| Test Case | Baseline unstable (median +- std.dev) | Comparison (median) | % change | Note |
|---|---|---|---|---|
| mget-100B-100keys-pipeline-10 | 60869 +- 4.8% (2 dp) | 61464 | +1.0% | No Change |
| mget-100B-2keys-pipeline-10 | 893595 +- 0.3% (3 dp) | 893119 | -0.1% | No Change |
| mget-100B-2keys | 169522 +- 1.3% (3 dp) | 164955 | -2.7% | No Change |
| mget-100B-100keys | 32691 +- 1.1% (2 dp) | 32137 | -1.7% | No Change |
| setget2000c-1KiB-pipeline-10 | 803727 +- 1.4% (3 dp) | 801778 | -0.2% | No Change |
Summary — before vs after fix
| Metric | Before (70d4444) | After (4ef667a) |
|---|---|---|
| x86 mget-100keys-pipeline-10 | -9.6% REGRESSION | +1.0% No Change |
| x86 setget2000c standalone | -0.6% | -0.2% |
| ARM mget-100keys (no pipeline) | +11.1% IMPROVEMENT | +10.6% IMPROVEMENT |
| ARM mget-100keys-pipeline-10 | -2.0% | +1.1% |
The pipeline regression on x86 is resolved. No regressions detected above the 3% threshold on either platform. IO-threads topology (oss-standalone-08-io-threads) was not triggered for this commit — pending re-run to confirm the -5.6% x86 IO-threads regression is also fixed.
@sundb @mpozniak95 — We're triggering a full benchmark run now covering both oss-standalone and oss-standalone-08-io-threads topologies, on both x86 and ARM. Will follow up with complete results.
…keys Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in addCommandToBatch() when a command is added to the cross-command prefetch batch. In mgetCommand, check the flag and fall back to plain sequential lookups when keys are already warm from the batch. This eliminates redundant prefetch work that caused regressions of up to -16% with many I/O threads, while preserving the large gains (+52-54%) for pipelined workloads where the batch overflows.
The previous guard only skipped intra-command dictPrefetchKeys() when PENDING_CMD_KEYS_PREFETCHED was set, which requires ALL keys to fit in the batch. For MGET 100 keys with a batch size of 32, the flag was never set — both prefetch paths always ran. On x86 with pipeline-10, this double-prefetching caused -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c with 8 IO threads due to cache-bandwidth contention in ctxPrefetchAndAdvance. Fix: skip intra-command prefetch when either: 1. PENDING_CMD_KEYS_PREFETCHED is set (all keys fit in batch), OR 2. Pipeline is active (ready_len > 1) — partial batch warmup is sufficient, double-prefetching hurts more than it helps. Single non-pipelined MGET (ready_len <= 1) still uses intra-command prefetch, preserving the +11% ARM / +25% x86 improvement.
The ready_len > 1 heuristic that skips intra-command dictPrefetchKeys() assumed cross-command batch prefetching always runs during pipelining. When prefetch-batch-max-size is 0 (batch disabled), addCommandToBatch() returns C_ERR immediately and no keys are warmed, but ready_len > 1 still holds for any pipeline — causing MGET to skip its only prefetch opportunity. Add server.prefetch_batch_max_size > 0 to the ready_len guard so intra-command prefetch runs when the batch system is disabled.
2b30246 to
a470915
Compare
Updated performance analysis — rebased on latest unstable (commit a470915)Previous analysis was against a stale baseline (44 commits behind unstable). Rebased the branch and re-ran benchmarks against current unstable HEAD.
x86 (
|
| Test Case | Baseline unstable (median +- std.dev) | Comparison (median) | % change | Note |
|---|---|---|---|---|
| mget-100B-100keys | 32691 +- 1.1% (2 dp) | 40491 | +23.9% | IMPROVEMENT |
| mget-100B-30keys | 76546 +- 0.4% (2 dp) | 87551 | +14.4% | IMPROVEMENT |
| mget-100B-20keys | 96788 +- 2.0% (3 dp) | 105752 | +9.3% | IMPROVEMENT |
| mget-100B-5keys | 151999 +- 0.1% (3 dp) | 157728 | +3.8% | IMPROVEMENT |
| mget-100B-10keys | 131399 +- 0.6% (3 dp) | 135433 | +3.1% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 465659 +- 0.3% (3 dp) | 480291 | +3.1% | IMPROVEMENT |
| mget-1KiB-5keys | 123074 | 125884 | +2.3% | No Change |
| mget-1KiB-5keys-pipeline-10 | 122349 | 124211 | +1.5% | No Change |
| mget-512B-5keys-pipeline-10 | 315258 | 318174 | +0.9% | No Change |
| mget-100B-30keys-pipeline-10 | 110900 | 111470 | +0.5% | No Change |
| setget2000c-pipeline-10 | 803727 | 805587 | +0.2% | No Change |
| setget2000c-pipeline-16 | 990524 | 989529 | -0.1% | No Change |
| mget-100B-2keys-pipeline-10 | 893595 | 891518 | -0.2% | No Change |
| mget-100B-2keys | 169522 | 168710 | -0.5% | No Change |
| mget-100B-100keys-pipeline-10 | 61394 | 60846 | -0.9% | No Change |
| hmget-pipeline-10 | 958532 | 941637 | -1.8% | No Change |
| mget-1KiB (10 keys × 1KiB) | 134096 | 125252 | -6.6% | REGRESSION |
x86 (x86-aws-m7i.metal-24xl) — mget-100B-100keys-pipeline-10 across IO-threads topologies (NEW)
| Topology | Baseline unstable | Comparison | % change | Note |
|---|---|---|---|---|
| oss-standalone | 61394 | 60846 | -0.9% | No Change |
| oss-standalone-02-io-threads | 38205 | 39458 | +3.3% | IMPROVEMENT |
| oss-standalone-04-io-threads | 35474 | 37697 | +6.3% | IMPROVEMENT |
| oss-standalone-08-io-threads | 48640 | 51585 | +6.1% | IMPROVEMENT |
Summary
- Non-pipelined MGET with 5-100 keys (100B values): +3% to +24% improvement on x86 — scales with key count
- Pipelined MGET: flat to +3% — no regression (the original -9.6% regression is resolved)
- IO-threads MGET (pipeline-10, 100 keys): +3% to +6% improvement across 2/4/8 IO-thread topologies
- setget2000c (SET/GET regression guard): flat (0%) on both pipeline-10 and pipeline-16
- One regression:
mget-1KiB(10 keys × 1KiB values, no pipeline) shows -6.6% — the dictPrefetchKeys overhead outweighs benefit for large values with small key counts. This test uses 1KiB values where prefetch state machine overhead per key is proportionally higher
Still collecting ARM standalone and additional MGET pipeline IO-threads data. Will follow up with complete results.
EMBSTR (10B values) benchmark results — new test coverageAdded new MGET benchmarks with 10-byte values (OBJ_ENCODING_EMBSTR, where value data is embedded in the kv object) to cover all string encoding types. These use x86 (
|
| Topology | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| oss-standalone | 21974 | 34152 | +55.4% | IMPROVEMENT |
| oss-standalone-02-io-threads | 31047 | 42977 | +38.4% | IMPROVEMENT |
| oss-standalone-04-io-threads | 31234 | 43042 | +37.8% | IMPROVEMENT |
| oss-standalone-08-io-threads | 30804 | 42630 | +38.4% | IMPROVEMENT |
ARM (arm-aws-m8g.metal-24xl) — mget-10B-100keys
| Topology | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| oss-standalone | 21490 | 45194 | +110.3% | IMPROVEMENT |
| oss-standalone-02-io-threads | 32471 | 65174 | +100.7% | IMPROVEMENT |
| oss-standalone-04-io-threads | 32522 | 65371 | +101.0% | IMPROVEMENT |
| oss-standalone-08-io-threads | 32356 | 65369 | +102.0% | IMPROVEMENT |
x86 — mget-10B-100keys-pipeline-10
| Topology | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| oss-standalone | 22717 | 24182 | +6.4% | IMPROVEMENT |
ARM — mget-10B-100keys-pipeline-10
| Topology | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| oss-standalone | 22891 | 24347 | +6.4% | IMPROVEMENT |
Known regression: memtier_benchmark-1Mkeys-string-mget-1KiB
This test (10 keys × 1KiB RAW values, no pipeline) shows -8% on x86 and -6% on ARM. The profiling data shows the dictPrefetchKeys value-data prefetch callback (prefetchGetObjectValuePtr) adds overhead for large RAW values — it prefetches only the first 64 bytes of a 1KiB separate allocation, while adding a full state-machine iteration per key. We're working on a fix (skip value-data prefetch in the intra-command path) and will follow up with updated results.
The prefetchGetObjectValuePtr callback adds one state-machine iteration per key to prefetch the first cache line of the value allocation. For EMBSTR values (<=44B) it returns NULL (value is embedded in the kv object), making the iteration pure overhead. For large RAW values (>=1KiB) it warms <10% of the data while adding measurable cost — profiling shows 3B samples in ctxPrefetchValdata vs 17.8B total in the prefetch path. Pass NULL instead: skip value-data prefetch entirely in the MGET intra-command path. The dict bucket/entry/kv-object prefetch provides the bulk of the improvement; addReplyBulk handles value data access sequentially.
Three optimizations to the dict prefetch state machine: 1. Replace modulo with branch in ctxPrefetchAndAdvance — avoids expensive division on every state transition. 2. Add remaining counter to DictPrefetchCtx — ctxNextInfo returns NULL immediately when all keys are done instead of scanning the full array. Eliminates O(n) scan per call as keys complete. 3. Fast-path in ctxPrefetchKvobj when no value-data callback and entry is a direct key — do key comparison inline and skip the VALDATA state entirely, saving one round-trip through the state machine per key.
For batches of <=10 keys, replace the full dictPrefetchKeys state machine with a simple loop that prefetches hash table buckets only. The state machine has ~4 states per key with function dispatch and round-robin scanning overhead. Profiling shows this costs 4.3% of total CPU for 10-key MGET with 1KiB values — matching the observed regression exactly. A single-pass bucket prefetch eliminates the iteration overhead while still warming the hash chain heads. Larger batches (>10 keys) continue using the full state machine where the multi-round interleaving provides net benefit.
186f995 to
dcb8aa8
Compare
| keys[k] = c->argv[j + k]->ptr; | ||
| dicts[k] = d; | ||
| } | ||
| dictPrefetchKeys(dicts, keys, n, NULL); |
There was a problem hiding this comment.
Value-data callback NULL contradicts documented example usage
Medium Severity
dictPrefetchKeys is called with NULL for the get_val_data callback, so the VALDATA state never prefetches value data for RAW-encoded strings. The cross-command batch path passes prefetchGetObjectValuePtr, and the documented example in the same file (line 433) also shows passing prefetchGetObjectValuePtr. This means for batches of 11–16 keys using the full state machine, kv->ptr for RAW strings won't be in cache when addReplyBulk accesses it.
Additional Locations (1)
Final performance analysis — commit dcb8aa8 (rebased on latest unstable)Full MGET benchmark suite across all string encodings (EMBSTR 10B, RAW 100B/512B/1KiB), key counts (2–100), pipeline depths (1, 10), and topologies (standalone + 02/04/08 IO-threads), on both x86 and ARM. Summary
x86 (
|
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 21974 | 36894 | +67.9% | IMPROVEMENT |
| mget-100B-100keys | 32022 | 39493 | +23.3% | IMPROVEMENT |
| mget-100B-30keys | 77470 | 85734 | +10.7% | IMPROVEMENT |
| mget-100B-20keys | 96470 | 103945 | +7.7% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 22717 | 24245 | +6.7% | IMPROVEMENT |
| mget-100B-2keys-pipeline-10 | 887290 | 935785 | +5.5% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 470951 | 489931 | +4.0% | IMPROVEMENT |
| mget-512B-5keys-pipeline-10 | 316216 | 326153 | +3.1% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 61319 | 63101 | +2.9% | No Change |
| mget-100B-30keys-pipeline-10 | 109449 | 112017 | +2.3% | No Change |
| mget-100B-5keys | 151458 | 154636 | +2.1% | No Change |
| mget-100B-10keys | 133057 | 135861 | +2.1% | No Change |
| mget-512B-5keys | 137861 | 138955 | +0.8% | No Change |
| mget-1KiB-5keys-pipeline-10 | 123179 | 124145 | +0.8% | No Change |
| mget-1KiB-5keys | 124863 | 122805 | -1.6% | No Change |
| mget-1KiB | 137088 | 134841 | -1.6% | No Change |
| mget-100B-2keys | 171538 | 168515 | -1.8% | No Change |
x86 — oss-standalone-02-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 31047 | 48641 | +56.7% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 25398 | 27733 | +9.2% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 131002 | 138359 | +5.6% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 38582 | 39892 | +3.4% | IMPROVEMENT |
x86 — oss-standalone-04-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 31234 | 48441 | +55.1% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 1132402 | 1310331 | +15.7% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 25876 | 28283 | +9.3% | IMPROVEMENT |
| mget-100B-2keys-pipeline-10 | 2283308 | 2467139 | +8.1% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 115508 | 123581 | +7.0% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 35151 | 37591 | +6.9% | IMPROVEMENT |
x86 — oss-standalone-08-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-100B-2keys-pipeline-10 | 1564255 | 1812083 | +15.8% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 25826 | 28033 | +8.5% | IMPROVEMENT |
| mget-512B-5keys-pipeline-10 | 894374 | 955192 | +6.8% | IMPROVEMENT |
| mget-1KiB-5keys-pipeline-10 | 438268 | 466975 | +6.6% | IMPROVEMENT |
| mget-10B-100keys | 30804 | 47677 | +54.8% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 987120 | 1038147 | +5.2% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 50261 | 52086 | +3.6% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 188336 | 194892 | +3.5% | IMPROVEMENT |
ARM (arm-aws-m8g.metal-24xl) — oss-standalone
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 21490 | 45710 | +112.7% | IMPROVEMENT |
| mget-100B-30keys | 85239 | 93681 | +9.9% | IMPROVEMENT |
| mget-100B-20keys | 107623 | 115082 | +6.9% | IMPROVEMENT |
| mget-100B-100keys | 43605 | 46612 | +6.9% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 22891 | 24398 | +6.6% | IMPROVEMENT |
| mget-100B-10keys | 145909 | 150486 | +3.1% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 104580 | 106673 | +2.0% | No Change |
| mget-100B-5keys-pipeline-10 | 545431 | 551314 | +1.1% | No Change |
| mget-512B-5keys | 160471 | 162215 | +1.1% | No Change |
| mget-100B-5keys | 175937 | 177940 | +1.1% | No Change |
| mget-1KiB-5keys-pipeline-10 | 146393 | 146798 | +0.3% | No Change |
| mget-100B-2keys | 196102 | 196724 | +0.3% | No Change |
| mget-1KiB-5keys | 147608 | 147504 | -0.1% | No Change |
| mget-1KiB | 151912 | 151250 | -0.4% | No Change |
| mget-100B-100keys-pipeline-10 | 56851 | 56637 | -0.4% | No Change |
| mget-512B-5keys-pipeline-10 | 349163 | 343147 | -1.7% | No Change |
| mget-100B-2keys-pipeline-10 | 1003653 | 977958 | -2.6% | No Change |
ARM — oss-standalone-02-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 32471 | 67537 | +108.0% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 26331 | 28862 | +9.6% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 127850 | 128544 | +0.5% | No Change |
| mget-100B-100keys-pipeline-10 | 36779 | 35526 | -3.4% | No Change |
ARM — oss-standalone-04-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 32522 | 67666 | +108.1% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 26653 | 29277 | +9.8% | IMPROVEMENT |
| mget-100B-2keys-pipeline-10 | 3066805 | 3274012 | +6.8% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 1142510 | 1205380 | +5.5% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 92741 | 96967 | +4.6% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 29689 | 30550 | +2.9% | No Change |
ARM — oss-standalone-08-io-threads
| Test Case | Baseline unstable | Comparison (PR) | % change | Note |
|---|---|---|---|---|
| mget-10B-100keys | 32356 | 67338 | +108.1% | IMPROVEMENT |
| mget-100B-100keys-pipeline-10 | 39596 | 51576 | +30.3% | IMPROVEMENT |
| mget-10B-100keys-pipeline-10 | 26425 | 28987 | +9.7% | IMPROVEMENT |
| mget-100B-30keys-pipeline-10 | 187125 | 197733 | +5.7% | IMPROVEMENT |
| mget-100B-2keys-pipeline-10 | 2702842 | 2815783 | +4.2% | IMPROVEMENT |
| mget-100B-5keys-pipeline-10 | 1253545 | 1299721 | +3.7% | IMPROVEMENT |
Benchmark coverage
Tested across all string encodings that MGET handles:
- EMBSTR (≤44B): 10B values — value embedded in kv object
- RAW (>44B): 100B, 512B, 1KiB values — separate allocation
Key counts: 2, 5, 10, 20, 30, 100 keys per MGET.
Pipeline: 1 (no pipeline) and 10.
Topologies: standalone, 02/04/08-io-threads.
Platforms: x86 (Intel m7i.metal-24xl), ARM (Graviton m8g.metal-24xl).
All benchmarks use --distinct-client-seed for realistic cache behavior.
… callback Two review comments from Cursor Bugbot on dcb8aa8: 1. Missing `#undef MGET_LIGHT_PREFETCH_MAX` at the end of mgetCommand. MGET_BATCH was already undef'd but its sibling was not. 2. The `dictPrefetchKeys()` docstring example passed `prefetchGetObjectValuePtr` as the value-data callback, while the mgetCommand caller passes NULL on purpose (the callback's extra state machine work costs more than warming the value payload saves, since the following lookupKeyRead + addReplyBulk touch the payload immediately and the hardware prefetcher is already on it). Update the example to show the typical NULL form and document when passing a non-NULL callback would actually be worthwhile.
CE Performance Automation : step 1 of 2 (build) STARTING...This comment was automatically generated given a benchmark was triggered.
|
CE Performance Automation : step 2 of 2 (benchmark) RUNNING...This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2026-04-21 23:38:48.967283 and took 104.768925 seconds up until now. In total will run 18 benchmarks. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.
| .get_val_data = get_val_data, | ||
| }; | ||
|
|
||
| server.stat_total_prefetch_batches++; |
There was a problem hiding this comment.
Prefetch stats now conflate IO-threaded and main-thread operations
Low Severity
dictPrefetchKeys increments server.stat_total_prefetch_batches and its state machine increments server.stat_total_prefetch_entries via ctxMarkDone, but these counters are reported to users as io_threaded_total_prefetch_batches and io_threaded_total_prefetch_entries. Intra-command prefetching from mgetCommand runs on the main thread without any IO threads involved, inflating these IO-thread-specific metrics and making them unreliable for monitoring IO thread performance.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.
…15133) Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. */ void *(*prefetchEntryKey)(const dictEntry *de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). */ void *(*prefetchEntryValue)(const dictEntry *de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void *dbDictPrefetchEntryKey(const dictEntry *de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void *dbDictPrefetchEntryValue(const dictEntry *de) { kvobj *kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: | Type | Role | |---|---| | `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). | | `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. | Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); /* one-shot heap alloc of lookups[] */ dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */ dictPrefetcherRun(p); /* drive FSM until all PREFETCH_DONE */ dictPrefetcherFree(p); /* release */ ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c /* Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. */ static int prefetchKeysBatch(client *c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - **MGET (`mgetCommand`)** — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - **Cross-command batch path (`addCommandToBatch`)** — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>
|
closing this since we merged #15133 |


Motivation
The existing
memory_prefetch.cframework amortizes dictionary lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command likeMGET key1 ... key100executes, eachlookupKeyRead()triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For MGET 100 keys this means ~400 sequential cache misses with the CPU stalling on each one.This PR refactors the prefetch state machine to be reusable and applies it inside
mgetCommandas the first consumer.Changes
1. New
DictPrefetchCtxabstraction (memory_prefetch.c)Extracted a lightweight context struct from the global
PrefetchCommandsBatch. All state machine functions (prefetchBucket,prefetchEntry,prefetchKVObject,prefetchValueData) were refactored to operate on aDictPrefetchCtx *parameter instead of accessing the globalbatchdirectly. The existing cross-command I/O-thread path creates aDictPrefetchCtxfrom the globalbatchfields — behavior is fully backward-compatible.2. New public API (
memory_prefetch.h)dictPrefetchKeys()stack-allocatesKeyPrefetchInfo[nkeys]and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.3. Cross-command awareness (
memory_prefetch.c+server.h)Added
PENDING_CMD_KEYS_PREFETCHEDflag topendingCommand. WhenaddCommandToBatch()successfully adds all keys of a command to the cross-command batch, it sets this flag.mgetCommandchecks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.4. Optimized
mgetCommand(t_string.c)Processes keys in batches of 16 with three fast paths:
lookupKeyRead()for side effectsBenchmark results
Tested on r8i.4xlarge ubuntu24.04 with
redis-benchmarks-specificationmemtier suites. best of 3 iterations.Single-threaded (no I/O threads)
Single-threaded: +3% to +50% across all tests, zero regressions.
4 I/O threads
4 I/O threads: up to +68% on pipelined workloads. Regressions on non-pipelined high-key-count tests (20, 100 keys) possibly due to partial overlap with cross-command prefetching.
16 I/O threads
16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.
Note
Medium Risk
Touches the hot
mgetCommandpath and refactors the dict prefetch state machine; correctness should be unchanged but performance/behavioral interactions with pipelining and I/O-thread batch prefetching could regress or be hard to reason about.Overview
Refactors the dict prefetch state machine in
memory_prefetch.cto run against a newDictPrefetchCtx, allowing the same logic to be reused outside the global cross-command batch.Adds a public intra-command API
dictPrefetchKeys()(plusPrefetchGetValueDataFuncandprefetchGetObjectValuePtrinmemory_prefetch.h) so multi-key commands can prefetch their own key lookups without heap allocation.Updates cross-command batching to set a new
PENDING_CMD_KEYS_PREFETCHEDflag only when all keys of a pending command were added to the batch, and rewritesmgetCommandto batch keys (size 16), prefetch dict buckets/state-machine per batch, and skip intra-command prefetch when cross-command prefetching already warmed the cache.Reviewed by Cursor Bugbot for commit 362fbdd. Bugbot is set up for automated code reviews on this repo. Configure here.