Improve multithreaded performance with memory prefetching#14017
Conversation
🎉 Snyk checks have passed. No issues have been found so far.✅ security/snyk check is complete. No issues have been found. (View Details) ✅ license/snyk check is complete. No issues have been found. (View Details) |
CE Performance Automation : step 1 of 2 (build) DONE.This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link |
CE Performance Automation : step 2 of 2 (benchmark) FINISHED.This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2025-06-20 22:24:03.546479 and took 11299.83093 seconds to finish. In total will run 176 benchmarks. |
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. Using platform named: intel64-ubuntu22.04-redis-icx1 to do the comparison. In summary:
You can check a comparison in detail via the grafana link Comparison between unstable and ShooterIT:memory-prefetch.Time Period from 5 months ago. (environment used: oss-standalone) By GROUP change csv:command_group,min_change,max_change By COMMAND change csv:command,min_change,max_change
Unstable test regexp names: memtier_benchmark-1Mkeys-load-zset-listpack-with-100-elements-double-score Improvements Table
Improvements test regexp names: memtier_benchmark-1key-zset-1M-elements-zremrangebyscore-pipeline-10 Full Results table:WARNING: There were 150 benchmarks with NO datapoints for both baseline and comparison. NO datapoints for both baseline and comparison:NO DATAPOINTS test regexp names: latency-rate-limited-10000_qps-memtier_benchmark-100Kkeys-hash-hgetall-50-fields-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-100Kkeys-load-hash-50-fields-with-1000B-values|latency-rate-limited-10000_qps-memtier_benchmark-100Kkeys-load-hash-50-fields-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-100Kkeys-load-hash-50-fields-with-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-100B-expire-use-case|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-10B-expire-use-case|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-1KiB-expire-use-case|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-4KiB-expire-use-case|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-bitmap-getbit-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-exists-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-expire-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-expireat-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-pexpire-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-scan-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-touch-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-generic-ttl-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-hexists|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-hgetall-50-fields-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-hincrby|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-hmget-5-fields-with-100B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-hash-transactions-multi-exec-pipeline-20|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-list-lpop-rpop-with-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-list-lpop-rpop-with-1KiB-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-list-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-list-with-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-list-with-1KiB-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-set-intset-with-100-elements|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-set-intset-with-100-elements-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-100B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-10B-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-1KiB-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-string-with-20KiB-values|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-zset-with-10-elements-double-score|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-load-zset-with-10-elements-int-score|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-append-1-100B|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-append-1-100B-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-decr|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-100B|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-100B-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-10B|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-10B-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-1KiB|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-1KiB-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-get-20KiB|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-incrby|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-incrby-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-incrbyfloat|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-incrbyfloat-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-mget-1KiB|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-setex-100B-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-setrange-100B|latency-rate-limited-10000_qps-memtier_benchmark-1Mkeys-string-setrange-100B-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-2-elements-geopos|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-2-elements-geosearch-fromlonlat-withcoord|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geodist|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geodist-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geohash|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geohash-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geopos|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geopos-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-bybox|latency-rate-limited-10000_qps-memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-hash-hscan-50-fields-10B-values|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-10-elements-lrange-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-10-elements-lrange-all-elements-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-100-elements-lrange-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-10K-elements-lindex-integer|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-10K-elements-lindex-string|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-1K-elements-lrange-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-list-1K-elements-lrange-all-elements-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-pfadd-4KB-values-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-10-elements-smembers|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-10-elements-smembers-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-10-elements-smismember|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-100-elements-sismember-is-a-member|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-100-elements-sismember-not-a-member|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-100-elements-smembers|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-100-elements-smismember|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-100-elements-sscan|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-10M-elements-sismember-50pct-chance|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-1K-elements-smembers|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-1M-elements-sismember-50pct-chance|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-200K-elements-sadd-constant|latency-rate-limited-10000_qps-memtier_benchmark-1key-set-2M-elements-sadd-increasing|latency-rate-limited-10000_qps-memtier_benchmark-1key-zincrby-1M-elements-pipeline-1|latency-rate-limited-10000_qps-memtier_benchmark-1key-zrank-1M-elements-pipeline-1|latency-rate-limited-10000_qps-memtier_benchmark-1key-zrem-5M-elements-pipeline-1|latency-rate-limited-10000_qps-memtier_benchmark-1key-zrevrangebyscore-256K-elements-pipeline-1|latency-rate-limited-10000_qps-memtier_benchmark-1key-zrevrank-1M-elements-pipeline-1|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-10-elements-zrange-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-10-elements-zrange-all-elements-long-scores|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-100-elements-zrange-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements-long-scores|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-100-elements-zscan|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-1M-elements-zcard-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements|latency-rate-limited-10000_qps-memtier_benchmark-1key-zset-1M-elements-zscore-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-2keys-lua-eval-hset-expire|latency-rate-limited-10000_qps-memtier_benchmark-2keys-lua-evalsha-hset-expire|latency-rate-limited-10000_qps-memtier_benchmark-2keys-set-10-100-elements-sdiff|latency-rate-limited-10000_qps-memtier_benchmark-2keys-set-10-100-elements-sinter|latency-rate-limited-10000_qps-memtier_benchmark-2keys-set-10-100-elements-sunion|latency-rate-limited-10000_qps-memtier_benchmark-2keys-stream-5-entries-xread-all-entries|latency-rate-limited-10000_qps-memtier_benchmark-2keys-stream-5-entries-xread-all-entries-pipeline-10|latency-rate-limited-10000_qps-memtier_benchmark-3Mkeys-load-string-with-512B-values|latency-rate-limited-10000_qps-memtier_benchmark-connection-hello|latency-rate-limited-1000_qps-memtier_benchmark-10Kkeys-load-hash-50-fields-with-10000B-values|latency-rate-limited-1000_qps-memtier_benchmark-1Mkeys-load-zset-listpack-with-100-elements-double-score|latency-rate-limited-1000_qps-memtier_benchmark-1key-100M-bits-bitmap-bitcount|latency-rate-limited-1000_qps-memtier_benchmark-1key-list-10K-elements-linsert-lrem-integer|latency-rate-limited-1000_qps-memtier_benchmark-1key-list-10K-elements-linsert-lrem-string|latency-rate-limited-1000_qps-memtier_benchmark-1key-list-10K-elements-lpos-integer|latency-rate-limited-1000_qps-memtier_benchmark-1key-list-10K-elements-lpos-string|latency-rate-limited-1000_qps-memtier_benchmark-1key-list-2K-elements-quicklist-lrange-all-elements-longs|latency-rate-limited-1000_qps-memtier_benchmark-1key-zset-1K-elements-zrange-all-elements|latency-rate-limited-1000_qps-memtier_benchmark-2keys-zset-300-elements-skiplist-encoded-zunion|latency-rate-limited-1000_qps-memtier_benchmark-2keys-zset-300-elements-skiplist-encoded-zunionstore|latency-rate-limited-100_qps-memtier_benchmark-1key-1Billion-bits-bitmap-bitcount|memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values|memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10|memtier_benchmark-1Mkeys-lhash-hexists|memtier_benchmark-1Mkeys-lhash-hincbry|memtier_benchmark-1Mkeys-load-string-with-200KiB-values|memtier_benchmark-1Mkeys-load-string-with-2MB-values|memtier_benchmark-1Mkeys-string-get-200KiB|memtier_benchmark-1Mkeys-string-get-20KiB|memtier_benchmark-1Mkeys-string-get-2MB|redis-benchmark-full-suite-1Mkeys-100B|redis-benchmark-full-suite-1Mkeys-100B-pipeline-10|redis-benchmark-full-suite-1Mkeys-1KiB|redis-benchmark-full-suite-1Mkeys-1KiB-pipeline-10|vector_db_benchmark_test |
|
@ShooterIT , do we understand the degradation in "Automated performance analysis summary". Is it real issue? |
|
Hi @moticless it is not real issue, the benchmark doesn't test based on multi-threaded. For single thread, we don't apply this feature now. |
|
To make the optimizations effective, we must merge both #13968 and #14017 at the same time. Here is a record of test results. The machine is
|
oranagra
left a comment
There was a problem hiding this comment.
i didn't really review the code, just took a quick glance
| getKeysResult result = GETKEYS_RESULT_INIT; | ||
| int numkeys = getKeysFromCommand(cmd, argv, argc, &result); |
There was a problem hiding this comment.
if we do that, maybe we can cache the result so it can serve others usages (ACL, Cluster, ROF).
besides, maybe instead of just computing the slot of the first key, it can already check for cross slot, and then the main thread won't have to.
we did that in lookahead.
There was a problem hiding this comment.
yes. we can let IO threads do this and cache the result. This PR is based on valkey-io/valkey#861, valkey hasn’t optimized for this issue, so I kept the original behavior, and i don't want to make it bigger, maybe a new PR is better, as you said, we can cache the key result, even calculate if all keys are in a single slot in the IO thread.
besides, I have tried to cache key result but there is no significant performance improvement, so i think it is not urgent.
There was a problem hiding this comment.
no significant performance improvement.. did you check with cluster mode or ACL?
seems odd that we invest if offloading argument parsing, and command lookup, but claim that offloading key name extraction and slot number / cross slot isn't significant.
in any case, i don't mind taking it in a separate PR, arguing this one is about memory prefetching only. the reason i commented was because this change does add a call to getKeysFromCommand, so essentially adding some work that's now done twice.
anyway, if / when we'll combine the lookahead project this this, we'll get it cached anyway.
There was a problem hiding this comment.
I tested without cluster mode or ACL, so i only offload get keys result, here is a flame graph i have save in the previous test with 8 IO threads 8write. It shows getting keys result costs 0.1% CPU of the main thread.

If in cluster mode, calculating slot number may cost more CPU.
There was a problem hiding this comment.
anyway, if / when we'll combine the lookahead project this this, we'll get it cached anyway.
since the lookahead project did this, we can have this feature (cache keys result) after merging it?
There was a problem hiding this comment.
not sure i understand, 0.1% of the main thread CPU in which scenario? if there's no ACL or cluster, it's not used by the main thread.
in any case, we agree it should be cached, and we agree it can be done later. the only question is if we have some small regression because we can now call it twice (e.g. in cluster mode). i guess we don't care as long as the impact of the PR is still positive and we have a plan to improve later.
There was a problem hiding this comment.
the scenario is under standalone mode with 8 IO threads, 100% SET command stress test.
yes, do agree
There was a problem hiding this comment.
I see that main thread calls getKeysFromCommand() inside addCommandToBatch(). Maybe it can be avoided as well once we cache the results.
There was a problem hiding this comment.
Yes. Keys result can be used
- memory prefetch
- slot calculating for cluster/cluster-compatibility
- ACL check
And as oran said, the IO thread also should check if all keys are in the same slot, instead of only the first key, so the main thread can just report cross-slot error (MULTI-EXEC requires special treatment), so i want to put these works in a separate PR (Valkey also intends to do this, but hasn’t implemented it yet). lookahead project already did this, maybe we just reuse this logic when merging. Besides, as i showed above, the regression of calling getKeysFromCommand is not notable.
We need to check the command arity in IO threads, if it is not correct, we should reset it, as we may do memory prefetching according to the `iolookedcmd`. Accessing `argv` using the key positions returned by `getKeysFromCommand` is unsafe and must be avoided for invalid commands. This bug starts to have an impact after #14017
…15133) Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. */ void *(*prefetchEntryKey)(const dictEntry *de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). */ void *(*prefetchEntryValue)(const dictEntry *de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void *dbDictPrefetchEntryKey(const dictEntry *de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void *dbDictPrefetchEntryValue(const dictEntry *de) { kvobj *kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: | Type | Role | |---|---| | `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). | | `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. | Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); /* one-shot heap alloc of lookups[] */ dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */ dictPrefetcherRun(p); /* drive FSM until all PREFETCH_DONE */ dictPrefetcherFree(p); /* release */ ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c /* Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. */ static int prefetchKeysBatch(client *c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - **MGET (`mgetCommand`)** — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - **Cross-command batch path (`addCommandToBatch`)** — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>
This PR is based on: valkey-io/valkey#861
Implementation of Redis
When the main thread processes clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution.
Memory prefetching for main hash table
As shown in the picture, after #13806 , we unify key value and the dict uses no_value optimization, so the memory prefetching has 4 steps:
we also need to handle the case that the dict entry is the pointer of kv object, just skip step 3.
MAA can improves single-threaded memory access efficiency by interleaving the execution of multiple independent operations, allowing memory-level parallelism and better CPU utilization. Its key point is batch-wise interleaved execution. Split a batch of independent operations (such as multiple key lookups) into multiple state machines, and interleave their progress within a single thread to hide the memory access latency of individual requests.
The difference between serial execution and interleaved execution:
naive serial execution
interleaved execution
New configuration
This PR involves a new configuration
prefetch-batch-max-size, but we think it is a low level optimization, so we hide this config:When multiple commands are parsed by the I/O threads and ready for execution, we take advantage of knowing the next set of commands and prefetch their required dictionary entries in a batch. This reduces memory access costs. The optimal batch size depends on the specific workflow of the user. The default batch size is 16, which can be modified using the 'prefetch-batch-max-size' config.
When the config is set to 0, prefetching is disabled.
Co-authored-by: Uri Yagelnik uriy@amazon.com