Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET by mpozniak95 · Pull Request #14899 · redis/redis

mpozniak95 · 2026-03-17T17:16:27Z

Motivation

The existing memory_prefetch.c framework amortizes dictionary lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command like MGET key1 ... key100 executes, each lookupKeyRead() triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For MGET 100 keys this means ~400 sequential cache misses with the CPU stalling on each one.

This PR refactors the prefetch state machine to be reusable and applies it inside mgetCommand as the first consumer.

Changes

1. New `DictPrefetchCtx` abstraction (`memory_prefetch.c`)

Extracted a lightweight context struct from the global PrefetchCommandsBatch. All state machine functions (prefetchBucket, prefetchEntry, prefetchKVObject, prefetchValueData) were refactored to operate on a DictPrefetchCtx * parameter instead of accessing the global batch directly. The existing cross-command I/O-thread path creates a DictPrefetchCtx from the global batch fields — behavior is fully backward-compatible.

2. New public API (`memory_prefetch.h`)

void dictPrefetchKeys(struct dict **dicts, void **keys, size_t nkeys,
                      PrefetchGetValueDataFunc get_val_data);
void *prefetchGetObjectValuePtr(const void *val);

dictPrefetchKeys() stack-allocates KeyPrefetchInfo[nkeys] and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.

3. Cross-command awareness (`memory_prefetch.c` + `server.h`)

Added PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. When addCommandToBatch() successfully adds all keys of a command to the cross-command batch, it sets this flag. mgetCommand checks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.

4. Optimized `mgetCommand` (`t_string.c`)

Processes keys in batches of 16 with three fast paths:

Single key: direct lookup, zero overhead
Empty dict: NULL-reply loop with lookupKeyRead() for side effects
Cross-command prefetched: simple loop when I/O threads already warmed the cache

Benchmark results

Tested on r8i.4xlarge ubuntu24.04 with redis-benchmarks-specification memtier suites. best of 3 iterations.

Single-threaded (no I/O threads)

Test	DB size	Value size	Main (ops/s)	Optimized (ops/s)	Δ
MGET 100 keys	5M	100B	26,325	39,573	+50%
MGET 30 keys, P=10	5M	100B	84,181	122,095	+45%
MGET 10 keys	5M	100B	119,446	147,563	+24%
MGET 30 keys	5M	100B	74,634	91,602	+23%
MGET 20 keys	5M	100B	95,328	112,204	+18%
MGET 5 keys	5M	512B	134,749	155,004	+15%
MGET 5 keys	5M	1KiB	127,420	145,808	+14%
MGET 5 keys, P=10	5M	512B	280,572	314,293	+12%
MGET 5 keys	5M	100B	155,014	170,170	+10%
MGET 2 keys	5M	100B	174,118	189,762	+9%
MGET 5 keys, P=10	5M	1KiB	133,109	145,478	+9%
MGET 5 keys	1M	1KiB	119,035	126,642	+6%
MGET 2 keys, P=10	5M	100B	885,126	927,125	+5%
MGET 100 keys, P=10	5M	100B	49,941	51,870	+4%
MGET 5 keys, P=10	5M	100B	478,288	494,179	+3%

Single-threaded: +3% to +50% across all tests, zero regressions.

4 I/O threads

Test	DB size	Value size	Main (ops/s)	Optimized (ops/s)	Δ
MGET 30 keys, P=10	5M	100B	102,871	173,263	+68%
MGET 100 keys, P=10	5M	100B	28,133	44,083	+57%
MGET 5 keys, P=10	5M	100B	720,426	843,673	+17%
MGET 10 keys	5M	100B	376,988	413,916	+10%
MGET 2 keys	5M	100B	474,214	506,097	+7%
MGET 2 keys, P=10	5M	100B	1,989,495	2,113,546	+6%
MGET 5 keys	5M	512B	453,154	467,093	+3%
MGET 30 keys	5M	100B	207,203	211,638	+2%
MGET 5 keys, P=10	5M	512B	522,417	530,606	+2%
MGET 5 keys	5M	100B	459,781	466,000	+1%
MGET 5 keys, P=10	5M	1KiB	298,161	298,208	~0%
MGET 5 keys	5M	1KiB	298,122	298,262	~0%
HMGET 5 fields, P=10	1M	100B	2,042,931	2,017,445	-1%
MGET 5 keys	1M	1KiB	135,522	131,678	-3%
MGET 20 keys	5M	100B	261,383	252,574	-3%
MGET 100 keys	5M	100B	77,969	69,382	-11%

4 I/O threads: up to +68% on pipelined workloads. Regressions on non-pipelined high-key-count tests (20, 100 keys) possibly due to partial overlap with cross-command prefetching.

16 I/O threads

Test	DB size	Value size	Main (ops/s)	Optimized (ops/s)	Δ
MGET 30 keys, P=10	5M	100B	114,994	179,161	+56%
MGET 100 keys, P=10	5M	100B	31,648	48,894	+54%
MGET 5 keys, P=10	5M	512B	484,134	564,920	+17%
MGET 5 keys, P=10	5M	100B	533,706	612,046	+15%
MGET 5 keys	5M	100B	418,656	432,231	+3%
MGET 5 keys	5M	512B	406,153	416,785	+3%
MGET 10 keys	5M	100B	323,832	331,224	+2%
MGET 2 keys, P=10	5M	100B	1,491,821	1,525,486	+2%
MGET 2 keys	5M	100B	522,589	524,531	~0%
HMGET 5 fields, P=10	1M	100B	1,715,306	1,722,662	~0%
MGET 5 keys	5M	1KiB	298,251	298,136	~0%
MGET 5 keys, P=10	5M	1KiB	298,381	298,229	~0%
MGET 30 keys	5M	100B	163,822	157,660	-4%
MGET 5 keys	1M	1KiB	136,597	130,641	-4%
MGET 20 keys	5M	100B	209,638	198,278	-5%
MGET 100 keys	5M	100B	65,570	59,888	-9%

16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.

Note

Medium Risk
Touches the hot mgetCommand path and refactors the dict prefetch state machine; correctness should be unchanged but performance/behavioral interactions with pipelining and I/O-thread batch prefetching could regress or be hard to reason about.

Overview
Refactors the dict prefetch state machine in memory_prefetch.c to run against a new DictPrefetchCtx, allowing the same logic to be reused outside the global cross-command batch.

Adds a public intra-command API dictPrefetchKeys() (plus PrefetchGetValueDataFunc and prefetchGetObjectValuePtr in memory_prefetch.h) so multi-key commands can prefetch their own key lookups without heap allocation.

Updates cross-command batching to set a new PENDING_CMD_KEYS_PREFETCHED flag only when all keys of a pending command were added to the batch, and rewrites mgetCommand to batch keys (size 16), prefetch dict buckets/state-machine per batch, and skip intra-command prefetch when cross-command prefetching already warmed the cache.

^{Reviewed by Cursor Bugbot for commit 362fbdd. Bugbot is set up for automated code reviews on this repo. Configure here.}

CLAassistant · 2026-03-17T17:16:43Z

All committers have signed the CLA.

fcostaoliveira · 2026-03-29T11:17:53Z

Preliminary performance analysis for PR #14899 — MGET Prefetch Framework

Note: This is preliminary data from targeted benchmarks. More comprehensive results across additional tests and platforms to follow.

Benchmarks triggered for mpozniak95/redis branch mget_prefetch_framework commit 70d4444ce5f4885945560bfe688fa3e2f3aa1fca.

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

Test Case	Baseline redis/redis unstable (median obs. +- std.dev)	Comparison (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-5Mkeys-string-mget-100B-100keys	43890 +- 1.4% (2 datapoints)	48747 +- 0.1% (2 datapoints)	+11.1%	IMPROVEMENT
memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10	56946 +- 1.3% (2 datapoints)	55794 +- 0.9% (2 datapoints)	-2.0%	No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys	197110 +- 0.3% (3 datapoints)	197841	+0.4%	No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10	1010214 +- 0.2% (3 datapoints)	1019085	+0.9%	No Change
memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10	706908 +- 0.6% (3 datapoints)	714618	+1.1%	No Change

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

Test Case	Baseline redis/redis unstable (median obs. +- std.dev)	Comparison (median obs. +- std.dev)	% change (higher-better)	Note
memtier_benchmark-5Mkeys-string-mget-100B-100keys	32691 +- 1.1% (2 datapoints)	40872 +- 0.7% (2 datapoints)	+25.0%	IMPROVEMENT
memtier_benchmark-5Mkeys-string-mget-100B-100keys-pipeline-10	60869 +- 4.8% (2 datapoints)	55051 +- 0.6% (2 datapoints)	-9.6%	REGRESSION
memtier_benchmark-5Mkeys-string-mget-100B-2keys	169522 +- 1.3% (3 datapoints)	172012	+1.5%	No Change
memtier_benchmark-5Mkeys-string-mget-100B-2keys-pipeline-10	893595 +- 0.3% (3 datapoints)	892429	-0.1%	No Change
memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10	803727 +- 1.4% (3 datapoints)	799166	-0.6%	No Change

IO-threads (`oss-standalone-08-io-threads`) — setget2000c regression guard

Platform	Baseline unstable (median obs. +- std.dev)	Comparison (median obs.)	% change	Note
arm-aws-m8g.metal-24xl	3182190 +- 0.4% (3 datapoints)	3173256	-0.3%	No Change
x86-aws-m7i.metal-24xl	2865474 +- 1.0% (3 datapoints)	2705671	-5.6%	REGRESSION

Summary

MGET 100 keys (no pipeline): +11.1% ARM, +25.0% x86 — confirmed with 2 datapoints, low variance
MGET 2 keys: No change on both platforms (expected — minimal prefetch benefit)
MGET 100 keys pipeline-10: No change on ARM, -9.6% regression on x86 — confirmed with 2 datapoints
IO-threads (setget2000c): No change on ARM, -5.6% regression on x86 with 8 IO threads
The x86 pipeline/IO-thread regressions suggest the intra-command prefetch may conflict with the cross-command batch prefetch path under high concurrency on x86

fcostaoliveira · 2026-03-29T22:47:21Z

Updated performance analysis — after commit `4ef667a` (pipeline double-prefetch fix)

The previous analysis (commit 70d4444ce) showed -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c IO-threads on x86. Commit 4ef667ae6 adds a guard to skip intra-command prefetch when pipeline is active (ready_len > 1), since the cross-command batch path already partially warms the hash table.

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

Test Case	Baseline unstable (median +- std.dev)	Comparison (median)	% change	Note
mget-100B-100keys	43890 +- 1.4% (2 dp)	48562	+10.6%	IMPROVEMENT
mget-100B-100keys-pipeline-10	56946 +- 1.3% (2 dp)	57581	+1.1%	No Change
mget-100B-2keys	197110 +- 0.3% (3 dp)	198193	+0.5%	No Change
mget-100B-2keys-pipeline-10	1010214 +- 0.2% (3 dp)	1016831	+0.7%	No Change
setget2000c-1KiB-pipeline-10	706908 +- 0.6% (3 dp)	715233	+1.2%	No Change

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

Test Case	Baseline unstable (median +- std.dev)	Comparison (median)	% change	Note
mget-100B-100keys-pipeline-10	60869 +- 4.8% (2 dp)	61464	+1.0%	No Change
mget-100B-2keys-pipeline-10	893595 +- 0.3% (3 dp)	893119	-0.1%	No Change
mget-100B-2keys	169522 +- 1.3% (3 dp)	164955	-2.7%	No Change
mget-100B-100keys	32691 +- 1.1% (2 dp)	32137	-1.7%	No Change
setget2000c-1KiB-pipeline-10	803727 +- 1.4% (3 dp)	801778	-0.2%	No Change

Summary — before vs after fix

Metric	Before (`70d4444`)	After (`4ef667a`)
x86 mget-100keys-pipeline-10	-9.6% REGRESSION	+1.0% No Change
x86 setget2000c standalone	-0.6%	-0.2%
ARM mget-100keys (no pipeline)	+11.1% IMPROVEMENT	+10.6% IMPROVEMENT
ARM mget-100keys-pipeline-10	-2.0%	+1.1%

The pipeline regression on x86 is resolved. No regressions detected above the 3% threshold on either platform. IO-threads topology (oss-standalone-08-io-threads) was not triggered for this commit — pending re-run to confirm the -5.6% x86 IO-threads regression is also fixed.

@sundb @mpozniak95 — We're triggering a full benchmark run now covering both oss-standalone and oss-standalone-08-io-threads topologies, on both x86 and ARM. Will follow up with complete results.

…) API

…keys Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in addCommandToBatch() when a command is added to the cross-command prefetch batch. In mgetCommand, check the flag and fall back to plain sequential lookups when keys are already warm from the batch. This eliminates redundant prefetch work that caused regressions of up to -16% with many I/O threads, while preserving the large gains (+52-54%) for pipelined workloads where the batch overflows.

The previous guard only skipped intra-command dictPrefetchKeys() when PENDING_CMD_KEYS_PREFETCHED was set, which requires ALL keys to fit in the batch. For MGET 100 keys with a batch size of 32, the flag was never set — both prefetch paths always ran. On x86 with pipeline-10, this double-prefetching caused -9.6% regression on mget-100B-100keys-pipeline-10 and -5.6% on setget2000c with 8 IO threads due to cache-bandwidth contention in ctxPrefetchAndAdvance. Fix: skip intra-command prefetch when either: 1. PENDING_CMD_KEYS_PREFETCHED is set (all keys fit in batch), OR 2. Pipeline is active (ready_len > 1) — partial batch warmup is sufficient, double-prefetching hurts more than it helps. Single non-pipelined MGET (ready_len <= 1) still uses intra-command prefetch, preserving the +11% ARM / +25% x86 improvement.

The ready_len > 1 heuristic that skips intra-command dictPrefetchKeys() assumed cross-command batch prefetching always runs during pipelining. When prefetch-batch-max-size is 0 (batch disabled), addCommandToBatch() returns C_ERR immediately and no keys are warmed, but ready_len > 1 still holds for any pipeline — causing MGET to skip its only prefetch opportunity. Add server.prefetch_batch_max_size > 0 to the ready_len guard so intra-command prefetch runs when the batch system is disabled.

fcostaoliveira · 2026-03-31T10:56:09Z

Updated performance analysis — rebased on latest unstable (commit `a470915`)

Previous analysis was against a stale baseline (44 commits behind unstable). Rebased the branch and re-ran benchmarks against current unstable HEAD.

Note: Still collecting more MGET data across additional pipeline variants and ARM. This is preliminary data from the first IO-threads MGET benchmarks ever run on this infrastructure.

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

Test Case	Baseline unstable (median +- std.dev)	Comparison (median)	% change	Note
mget-100B-100keys	32691 +- 1.1% (2 dp)	40491	+23.9%	IMPROVEMENT
mget-100B-30keys	76546 +- 0.4% (2 dp)	87551	+14.4%	IMPROVEMENT
mget-100B-20keys	96788 +- 2.0% (3 dp)	105752	+9.3%	IMPROVEMENT
mget-100B-5keys	151999 +- 0.1% (3 dp)	157728	+3.8%	IMPROVEMENT
mget-100B-10keys	131399 +- 0.6% (3 dp)	135433	+3.1%	IMPROVEMENT
mget-100B-5keys-pipeline-10	465659 +- 0.3% (3 dp)	480291	+3.1%	IMPROVEMENT
mget-1KiB-5keys	123074	125884	+2.3%	No Change
mget-1KiB-5keys-pipeline-10	122349	124211	+1.5%	No Change
mget-512B-5keys-pipeline-10	315258	318174	+0.9%	No Change
mget-100B-30keys-pipeline-10	110900	111470	+0.5%	No Change
setget2000c-pipeline-10	803727	805587	+0.2%	No Change
setget2000c-pipeline-16	990524	989529	-0.1%	No Change
mget-100B-2keys-pipeline-10	893595	891518	-0.2%	No Change
mget-100B-2keys	169522	168710	-0.5%	No Change
mget-100B-100keys-pipeline-10	61394	60846	-0.9%	No Change
hmget-pipeline-10	958532	941637	-1.8%	No Change
mget-1KiB (10 keys × 1KiB)	134096	125252	-6.6%	REGRESSION

x86 (`x86-aws-m7i.metal-24xl`) — mget-100B-100keys-pipeline-10 across IO-threads topologies (NEW)

Topology	Baseline unstable	Comparison	% change	Note
oss-standalone	61394	60846	-0.9%	No Change
oss-standalone-02-io-threads	38205	39458	+3.3%	IMPROVEMENT
oss-standalone-04-io-threads	35474	37697	+6.3%	IMPROVEMENT
oss-standalone-08-io-threads	48640	51585	+6.1%	IMPROVEMENT

Summary

Non-pipelined MGET with 5-100 keys (100B values): +3% to +24% improvement on x86 — scales with key count
Pipelined MGET: flat to +3% — no regression (the original -9.6% regression is resolved)
IO-threads MGET (pipeline-10, 100 keys): +3% to +6% improvement across 2/4/8 IO-thread topologies
setget2000c (SET/GET regression guard): flat (0%) on both pipeline-10 and pipeline-16
One regression: mget-1KiB (10 keys × 1KiB values, no pipeline) shows -6.6% — the dictPrefetchKeys overhead outweighs benefit for large values with small key counts. This test uses 1KiB values where prefetch state machine overhead per key is proportionally higher

Still collecting ARM standalone and additional MGET pipeline IO-threads data. Will follow up with complete results.

@sundb @mpozniak95

fcostaoliveira · 2026-03-31T16:34:16Z

EMBSTR (10B values) benchmark results — new test coverage

Added new MGET benchmarks with 10-byte values (OBJ_ENCODING_EMBSTR, where value data is embedded in the kv object) to cover all string encoding types. These use --distinct-client-seed for realistic cache behavior across clients.

x86 (`x86-aws-m7i.metal-24xl`) — mget-10B-100keys

Topology	Baseline unstable	Comparison (PR)	% change	Note
oss-standalone	21974	34152	+55.4%	IMPROVEMENT
oss-standalone-02-io-threads	31047	42977	+38.4%	IMPROVEMENT
oss-standalone-04-io-threads	31234	43042	+37.8%	IMPROVEMENT
oss-standalone-08-io-threads	30804	42630	+38.4%	IMPROVEMENT

ARM (`arm-aws-m8g.metal-24xl`) — mget-10B-100keys

Topology	Baseline unstable	Comparison (PR)	% change	Note
oss-standalone	21490	45194	+110.3%	IMPROVEMENT
oss-standalone-02-io-threads	32471	65174	+100.7%	IMPROVEMENT
oss-standalone-04-io-threads	32522	65371	+101.0%	IMPROVEMENT
oss-standalone-08-io-threads	32356	65369	+102.0%	IMPROVEMENT

x86 — mget-10B-100keys-pipeline-10

Topology	Baseline unstable	Comparison (PR)	% change	Note
oss-standalone	22717	24182	+6.4%	IMPROVEMENT

ARM — mget-10B-100keys-pipeline-10

Topology	Baseline unstable	Comparison (PR)	% change	Note
oss-standalone	22891	24347	+6.4%	IMPROVEMENT

Known regression: `memtier_benchmark-1Mkeys-string-mget-1KiB`

This test (10 keys × 1KiB RAW values, no pipeline) shows -8% on x86 and -6% on ARM. The profiling data shows the dictPrefetchKeys value-data prefetch callback (prefetchGetObjectValuePtr) adds overhead for large RAW values — it prefetches only the first 64 bytes of a 1KiB separate allocation, while adding a full state-machine iteration per key. We're working on a fix (skip value-data prefetch in the intra-command path) and will follow up with updated results.

@sundb @mpozniak95

The prefetchGetObjectValuePtr callback adds one state-machine iteration per key to prefetch the first cache line of the value allocation. For EMBSTR values (<=44B) it returns NULL (value is embedded in the kv object), making the iteration pure overhead. For large RAW values (>=1KiB) it warms <10% of the data while adding measurable cost — profiling shows 3B samples in ctxPrefetchValdata vs 17.8B total in the prefetch path. Pass NULL instead: skip value-data prefetch entirely in the MGET intra-command path. The dict bucket/entry/kv-object prefetch provides the bulk of the improvement; addReplyBulk handles value data access sequentially.

Three optimizations to the dict prefetch state machine: 1. Replace modulo with branch in ctxPrefetchAndAdvance — avoids expensive division on every state transition. 2. Add remaining counter to DictPrefetchCtx — ctxNextInfo returns NULL immediately when all keys are done instead of scanning the full array. Eliminates O(n) scan per call as keys complete. 3. Fast-path in ctxPrefetchKvobj when no value-data callback and entry is a direct key — do key comparison inline and skip the VALDATA state entirely, saving one round-trip through the state machine per key.

For batches of <=10 keys, replace the full dictPrefetchKeys state machine with a simple loop that prefetches hash table buckets only. The state machine has ~4 states per key with function dispatch and round-robin scanning overhead. Profiling shows this costs 4.3% of total CPU for 10-key MGET with 1KiB values — matching the observed regression exactly. A single-pass bucket prefetch eliminates the iteration overhead while still warming the hash chain heads. Larger batches (>10 keys) continue using the full state machine where the multi-round interleaving provides net benefit.

cursor · 2026-03-31T22:28:57Z

+                keys[k]  = c->argv[j + k]->ptr;
+                dicts[k] = d;
+            }
+            dictPrefetchKeys(dicts, keys, n, NULL);


Value-data callback NULL contradicts documented example usage

Medium Severity

dictPrefetchKeys is called with NULL for the get_val_data callback, so the VALDATA state never prefetches value data for RAW-encoded strings. The cross-command batch path passes prefetchGetObjectValuePtr, and the documented example in the same file (line 433) also shows passing prefetchGetObjectValuePtr. This means for batches of 11–16 keys using the full state machine, kv->ptr for RAW strings won't be in cache when addReplyBulk accesses it.

Additional Locations (1)

src/memory_prefetch.c#L432-L433

fcostaoliveira · 2026-04-01T10:36:44Z

Final performance analysis — commit `dcb8aa8` (rebased on latest unstable)

Full MGET benchmark suite across all string encodings (EMBSTR 10B, RAW 100B/512B/1KiB), key counts (2–100), pipeline depths (1, 10), and topologies (standalone + 02/04/08 IO-threads), on both x86 and ARM.

Summary

x86: up to +67.9% improvement (MGET 10B EMBSTR, 100 keys, standalone)
ARM: up to +112.7% improvement (MGET 10B EMBSTR, 100 keys, standalone)
Zero regressions across all topologies, encodings, and pipeline configurations on both platforms

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	21974	36894	+67.9%	IMPROVEMENT
mget-100B-100keys	32022	39493	+23.3%	IMPROVEMENT
mget-100B-30keys	77470	85734	+10.7%	IMPROVEMENT
mget-100B-20keys	96470	103945	+7.7%	IMPROVEMENT
mget-10B-100keys-pipeline-10	22717	24245	+6.7%	IMPROVEMENT
mget-100B-2keys-pipeline-10	887290	935785	+5.5%	IMPROVEMENT
mget-100B-5keys-pipeline-10	470951	489931	+4.0%	IMPROVEMENT
mget-512B-5keys-pipeline-10	316216	326153	+3.1%	IMPROVEMENT
mget-100B-100keys-pipeline-10	61319	63101	+2.9%	No Change
mget-100B-30keys-pipeline-10	109449	112017	+2.3%	No Change
mget-100B-5keys	151458	154636	+2.1%	No Change
mget-100B-10keys	133057	135861	+2.1%	No Change
mget-512B-5keys	137861	138955	+0.8%	No Change
mget-1KiB-5keys-pipeline-10	123179	124145	+0.8%	No Change
mget-1KiB-5keys	124863	122805	-1.6%	No Change
mget-1KiB	137088	134841	-1.6%	No Change
mget-100B-2keys	171538	168515	-1.8%	No Change

x86 — oss-standalone-02-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	31047	48641	+56.7%	IMPROVEMENT
mget-10B-100keys-pipeline-10	25398	27733	+9.2%	IMPROVEMENT
mget-100B-30keys-pipeline-10	131002	138359	+5.6%	IMPROVEMENT
mget-100B-100keys-pipeline-10	38582	39892	+3.4%	IMPROVEMENT

x86 — oss-standalone-04-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	31234	48441	+55.1%	IMPROVEMENT
mget-100B-5keys-pipeline-10	1132402	1310331	+15.7%	IMPROVEMENT
mget-10B-100keys-pipeline-10	25876	28283	+9.3%	IMPROVEMENT
mget-100B-2keys-pipeline-10	2283308	2467139	+8.1%	IMPROVEMENT
mget-100B-30keys-pipeline-10	115508	123581	+7.0%	IMPROVEMENT
mget-100B-100keys-pipeline-10	35151	37591	+6.9%	IMPROVEMENT

x86 — oss-standalone-08-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-100B-2keys-pipeline-10	1564255	1812083	+15.8%	IMPROVEMENT
mget-10B-100keys-pipeline-10	25826	28033	+8.5%	IMPROVEMENT
mget-512B-5keys-pipeline-10	894374	955192	+6.8%	IMPROVEMENT
mget-1KiB-5keys-pipeline-10	438268	466975	+6.6%	IMPROVEMENT
mget-10B-100keys	30804	47677	+54.8%	IMPROVEMENT
mget-100B-5keys-pipeline-10	987120	1038147	+5.2%	IMPROVEMENT
mget-100B-100keys-pipeline-10	50261	52086	+3.6%	IMPROVEMENT
mget-100B-30keys-pipeline-10	188336	194892	+3.5%	IMPROVEMENT

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	21490	45710	+112.7%	IMPROVEMENT
mget-100B-30keys	85239	93681	+9.9%	IMPROVEMENT
mget-100B-20keys	107623	115082	+6.9%	IMPROVEMENT
mget-100B-100keys	43605	46612	+6.9%	IMPROVEMENT
mget-10B-100keys-pipeline-10	22891	24398	+6.6%	IMPROVEMENT
mget-100B-10keys	145909	150486	+3.1%	IMPROVEMENT
mget-100B-30keys-pipeline-10	104580	106673	+2.0%	No Change
mget-100B-5keys-pipeline-10	545431	551314	+1.1%	No Change
mget-512B-5keys	160471	162215	+1.1%	No Change
mget-100B-5keys	175937	177940	+1.1%	No Change
mget-1KiB-5keys-pipeline-10	146393	146798	+0.3%	No Change
mget-100B-2keys	196102	196724	+0.3%	No Change
mget-1KiB-5keys	147608	147504	-0.1%	No Change
mget-1KiB	151912	151250	-0.4%	No Change
mget-100B-100keys-pipeline-10	56851	56637	-0.4%	No Change
mget-512B-5keys-pipeline-10	349163	343147	-1.7%	No Change
mget-100B-2keys-pipeline-10	1003653	977958	-2.6%	No Change

ARM — oss-standalone-02-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	32471	67537	+108.0%	IMPROVEMENT
mget-10B-100keys-pipeline-10	26331	28862	+9.6%	IMPROVEMENT
mget-100B-30keys-pipeline-10	127850	128544	+0.5%	No Change
mget-100B-100keys-pipeline-10	36779	35526	-3.4%	No Change

ARM — oss-standalone-04-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	32522	67666	+108.1%	IMPROVEMENT
mget-10B-100keys-pipeline-10	26653	29277	+9.8%	IMPROVEMENT
mget-100B-2keys-pipeline-10	3066805	3274012	+6.8%	IMPROVEMENT
mget-100B-5keys-pipeline-10	1142510	1205380	+5.5%	IMPROVEMENT
mget-100B-30keys-pipeline-10	92741	96967	+4.6%	IMPROVEMENT
mget-100B-100keys-pipeline-10	29689	30550	+2.9%	No Change

ARM — oss-standalone-08-io-threads

Test Case	Baseline unstable	Comparison (PR)	% change	Note
mget-10B-100keys	32356	67338	+108.1%	IMPROVEMENT
mget-100B-100keys-pipeline-10	39596	51576	+30.3%	IMPROVEMENT
mget-10B-100keys-pipeline-10	26425	28987	+9.7%	IMPROVEMENT
mget-100B-30keys-pipeline-10	187125	197733	+5.7%	IMPROVEMENT
mget-100B-2keys-pipeline-10	2702842	2815783	+4.2%	IMPROVEMENT
mget-100B-5keys-pipeline-10	1253545	1299721	+3.7%	IMPROVEMENT

Benchmark coverage

Tested across all string encodings that MGET handles:

EMBSTR (≤44B): 10B values — value embedded in kv object
RAW (>44B): 100B, 512B, 1KiB values — separate allocation

Key counts: 2, 5, 10, 20, 30, 100 keys per MGET.
Pipeline: 1 (no pipeline) and 10.
Topologies: standalone, 02/04/08-io-threads.
Platforms: x86 (Intel m7i.metal-24xl), ARM (Graviton m8g.metal-24xl).
All benchmarks use --distinct-client-seed for realistic cache behavior.

@sundb @mpozniak95

…mework

… callback Two review comments from Cursor Bugbot on dcb8aa8: 1. Missing `#undef MGET_LIGHT_PREFETCH_MAX` at the end of mgetCommand. MGET_BATCH was already undef'd but its sibling was not. 2. The `dictPrefetchKeys()` docstring example passed `prefetchGetObjectValuePtr` as the value-data callback, while the mgetCommand caller passes NULL on purpose (the callback's extra state machine work costs more than warming the value payload saves, since the following lookupKeyRead + addReplyBulk touch the payload immediately and the hardware prefetcher is already on it). Update the example to show the typical NULL form and document when passing a non-NULL callback would actually be worthwhile.

fcostaoliveira · 2026-04-21T23:37:48Z

CE Performance Automation : step 1 of 2 (build) STARTING...

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-21 23:37:47.892481
You can check each build/benchmark progress in grafana:

git hash: 362fbdd
git branch: mget_prefetch_framework
commit date and time: 2026-04-22 00:32:11+01:00
commit summary: Address review comments: undef MGET_LIGHT_PREFETCH_MAX + clarify NULL callback
test filters:
- command priority lower limit: 0
- command priority upper limit: 100000
- test name regex: .mget.
- command group regex: .*

fcostaoliveira · 2026-04-21T23:38:48Z

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2026-04-21 23:38:48.967283 and took 104.768925 seconds up until now.
Status: [####----------------------------------------------------------------------------] 5.56% completed.

In total will run 18 benchmarks.
- 17 pending.
- 1 completed:
- 1 successful.
- 0 failed.
You can check a the status in detail via the grafana link

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.}

cursor · 2026-04-21T23:40:36Z

+        .get_val_data = get_val_data,
+    };
+
+    server.stat_total_prefetch_batches++;


Prefetch stats now conflate IO-threaded and main-thread operations

Low Severity

dictPrefetchKeys increments server.stat_total_prefetch_batches and its state machine increments server.stat_total_prefetch_entries via ctxMarkDone, but these counters are reported to users as io_threaded_total_prefetch_batches and io_threaded_total_prefetch_entries. Intra-command prefetching from mgetCommand runs on the main thread without any IO threads involved, inflating these IO-thread-specific metrics and making them unreliable for monitoring IO thread performance.

Additional Locations (1)

src/memory_prefetch.c#L153-L154

^{Reviewed by Cursor Bugbot for commit 362fbdd. Configure here.}

@mpozniak95

…15133) Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. */ void *(*prefetchEntryKey)(const dictEntry *de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). */ void *(*prefetchEntryValue)(const dictEntry *de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void *dbDictPrefetchEntryKey(const dictEntry *de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void *dbDictPrefetchEntryValue(const dictEntry *de) { kvobj *kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: | Type | Role | |---|---| | `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). | | `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. | Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); /* one-shot heap alloc of lookups[] */ dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */ dictPrefetcherRun(p); /* drive FSM until all PREFETCH_DONE */ dictPrefetcherFree(p); /* release */ ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c /* Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. */ static int prefetchKeysBatch(client *c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - **MGET (`mgetCommand`)** — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - **Cross-command batch path (`addCommandToBatch`)** — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>

ShooterIT · 2026-05-11T09:36:42Z

closing this since we merged #15133

fcostaoliveira requested a review from sundb March 17, 2026 17:22

cursor Bot reviewed Mar 17, 2026

View reviewed changes

Comment thread src/memory_prefetch.c Outdated

sundb requested a review from ShooterIT March 18, 2026 00:33

cursor Bot reviewed Mar 19, 2026

View reviewed changes

Comment thread src/memory_prefetch.c

Comment thread src/t_string.c Outdated

sundb added this to Redis 8.8 Mar 25, 2026

github-project-automation Bot moved this to Todo in Redis 8.8 Mar 25, 2026

cursor Bot reviewed Mar 29, 2026

View reviewed changes

Comment thread src/t_string.c Outdated

fcostaoliveira added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 29, 2026

mpozniak95 and others added 6 commits March 30, 2026 21:08

Extend memory_prefetch framework with intra-command dictPrefetchKeys(…

79bb48c

…) API

Remove unused fields current_dicts and get_value_data_func

6911b26

Add zero-initializing

f0bcfa5

fcostaoliveira force-pushed the mget_prefetch_framework branch from 2b30246 to a470915 Compare March 30, 2026 20:10

fcostaoliveira removed the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 30, 2026

fcostaoliveira added 3 commits March 31, 2026 17:34

fcostaoliveira force-pushed the mget_prefetch_framework branch from 186f995 to dcb8aa8 Compare March 31, 2026 22:21

cursor Bot reviewed Mar 31, 2026

View reviewed changes

fcostaoliveira mentioned this pull request Apr 14, 2026

Add batched dict bucket prefetch for MSET/MSETNX #15043

Closed

fcostaoliveira added 2 commits April 22, 2026 00:29

Merge remote-tracking branch 'origin/unstable' into mget_prefetch_fra…

5f3e670

…mework

cursor Bot reviewed Apr 21, 2026

View reviewed changes

fcostaoliveira mentioned this pull request Apr 28, 2026

Batched MGET/MSET dict prefetch with dictType-driven payload hints #15133

Merged

ShooterIT closed this May 11, 2026

github-project-automation Bot moved this from Todo to Done in Redis 8.8 May 11, 2026

sundb removed this from Redis 8.8 May 22, 2026

Uh oh!

Conversation

mpozniak95 commented Mar 17, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

1. New DictPrefetchCtx abstraction (memory_prefetch.c)

2. New public API (memory_prefetch.h)

3. Cross-command awareness (memory_prefetch.c + server.h)

4. Optimized mgetCommand (t_string.c)

Benchmark results

Single-threaded (no I/O threads)

4 I/O threads

16 I/O threads

Uh oh!

CLAassistant commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fcostaoliveira commented Mar 29, 2026

Preliminary performance analysis for PR #14899 — MGET Prefetch Framework

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

IO-threads (oss-standalone-08-io-threads) — setget2000c regression guard

Summary

Uh oh!

fcostaoliveira commented Mar 29, 2026

Updated performance analysis — after commit 4ef667a (pipeline double-prefetch fix)

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

Summary — before vs after fix

Uh oh!

Uh oh!

fcostaoliveira commented Mar 31, 2026

Updated performance analysis — rebased on latest unstable (commit a470915)

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

x86 (x86-aws-m7i.metal-24xl) — mget-100B-100keys-pipeline-10 across IO-threads topologies (NEW)

Summary

Uh oh!

fcostaoliveira commented Mar 31, 2026

EMBSTR (10B values) benchmark results — new test coverage

x86 (x86-aws-m7i.metal-24xl) — mget-10B-100keys

ARM (arm-aws-m8g.metal-24xl) — mget-10B-100keys

x86 — mget-10B-100keys-pipeline-10

ARM — mget-10B-100keys-pipeline-10

Known regression: memtier_benchmark-1Mkeys-string-mget-1KiB

Uh oh!

Uh oh!

cursor Bot Mar 31, 2026

Choose a reason for hiding this comment

Value-data callback NULL contradicts documented example usage

Uh oh!

fcostaoliveira commented Apr 1, 2026

Final performance analysis — commit dcb8aa8 (rebased on latest unstable)

Summary

x86 (x86-aws-m7i.metal-24xl) — oss-standalone

x86 — oss-standalone-02-io-threads

x86 — oss-standalone-04-io-threads

x86 — oss-standalone-08-io-threads

ARM (arm-aws-m8g.metal-24xl) — oss-standalone

ARM — oss-standalone-02-io-threads

ARM — oss-standalone-04-io-threads

ARM — oss-standalone-08-io-threads

Benchmark coverage

Uh oh!

fcostaoliveira commented Apr 21, 2026

CE Performance Automation : step 1 of 2 (build) STARTING...

Uh oh!

fcostaoliveira commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 21, 2026

Choose a reason for hiding this comment

Prefetch stats now conflate IO-threaded and main-thread operations

mpozniak95 commented Mar 17, 2026 •

edited by cursor Bot

Loading

1. New `DictPrefetchCtx` abstraction (`memory_prefetch.c`)

2. New public API (`memory_prefetch.h`)

3. Cross-command awareness (`memory_prefetch.c` + `server.h`)

4. Optimized `mgetCommand` (`t_string.c`)

CLAassistant commented Mar 17, 2026 •

edited

Loading

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

IO-threads (`oss-standalone-08-io-threads`) — setget2000c regression guard

Updated performance analysis — after commit `4ef667a` (pipeline double-prefetch fix)

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

Updated performance analysis — rebased on latest unstable (commit `a470915`)

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

x86 (`x86-aws-m7i.metal-24xl`) — mget-100B-100keys-pipeline-10 across IO-threads topologies (NEW)

x86 (`x86-aws-m7i.metal-24xl`) — mget-10B-100keys

ARM (`arm-aws-m8g.metal-24xl`) — mget-10B-100keys

Known regression: `memtier_benchmark-1Mkeys-string-mget-1KiB`

Value-data callback `NULL` contradicts documented example usage

Final performance analysis — commit `dcb8aa8` (rebased on latest unstable)

x86 (`x86-aws-m7i.metal-24xl`) — oss-standalone

ARM (`arm-aws-m8g.metal-24xl`) — oss-standalone

fcostaoliveira commented Apr 21, 2026 •

edited

Loading