Add batched prefetch for HGETALL on hashtable encoding#14988
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 9d820b7. Configure here.
🤖 Augment PR SummarySummary: Adds a batched software-prefetch fast path for Changes:
Technical Notes: The new path manually skips expired fields and uses 🤖 Was this summary useful? React with 👍 or 👎 |
ShooterIT
left a comment
There was a problem hiding this comment.
i was thinking we should make a more generic dict prefetching API or structure.
|
Agreed @ShooterIT — a generic dict batched-prefetch iterator would be useful for SMEMBERS, KEYS, and other dict-scanning commands. Happy to follow up with that once this pattern is validated. For now keeping it scoped to HGETALL where the profiling data shows the biggest win (dictNext at 10.1% flat CPU). |
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. Using platform named: arm-aws-m8g.metal-24xl for both baseline and comparison. Using triggering environment: ci for both baseline and comparison. In summary:
You can check a comparison in detail via the grafana link Comparison between unstable and 718ae4c.Time Period from 2 years ago. (environment used: oss-standalone) By GROUP change csv:command_group,min_change,q1_change,median_change,q3_change,max_change By COMMAND change csv:command,min_change,q1_change,median_change,q3_change,max_change
Regressions test regexp names: memtier_benchmark-1Mkeys-generic-scan-pipeline-10|memtier_benchmark-1Mkeys-hash-hkeys-5-fields-with-100B-values-with-expiration-pipeline-10 Improvements Table
Improvements test regexp names: memtier_benchmark-10Kkeys-hash-hgetall-200-fields-100B-values|memtier_benchmark-10Kkeys-hash-hgetall-500-fields-100B-values|memtier_benchmark-1key-hash-1K-fields-hgetall|memtier_benchmark-10Kkeys-hash-hgetall-100-fields-100B-values|memtier_benchmark-1Mkeys-generic-scan-type-pipeline-10|memtier_benchmark-1key-hash-1K-fields-hgetall-pipeline-10 Full Results table: |
skaslev
left a comment
There was a problem hiding this comment.
Looks good to me overall.
Added one question inline.
fcostaoliveira
left a comment
There was a problem hiding this comment.
Good point. Yes, we could take the fast path unconditionally — HKEYS and HVALS would also benefit from the prefetched dict entry lookup, even though they only emit keys or values. The Phase 2 value prefetch can be gated on flags & OBJ_HASH_VALUE as you suggest. I'll update the code to remove the (flags & OBJ_HASH_KEY) && (flags & OBJ_HASH_VALUE) gate and skip the value prefetch when only keys are requested.
For HGETALL on OBJ_ENCODING_HT when prefetch-batch-max-size > 0, use a three-phase batched iteration instead of the generic per-entry iterator: Phase 1: Collect up to 16 dict entries via dictNext, prefetch each Entry struct (field SDS is embedded in Entry, so one prefetch warms both the Entry metadata and the field string data). Phase 2: For each collected entry, call entryGetValue to resolve the value SDS pointer and prefetch the value data. For entries with pointer-based values (Type 3, large values), this hides the latency of the separate heap allocation dereference. Phase 3: Emit replies — all field and value data is cache-warm. addReplyBulkCBuffer calls operate on warm cache lines. Profiling shows dictNext at 10.1% flat CPU for 1K-field HGETALL (scattered heap allocations cause cache misses on every entry). The batched approach overlaps prefetch latency with reply generation, which consumes ~25% CPU and provides a natural prefetch window. Respects prefetch-batch-max-size: when 0, falls back to the generic hashTypeIterator path. Skips expired fields like the generic path.
- Check server.allow_access_expired before skipping expired fields, matching hashTypeNext() behavior. Without this, count diverges from hashTypeLength() when expired access is allowed, triggering the serverAssert(count == length). - Bound batch size by server.prefetch_batch_max_size (capped at 16) instead of ignoring the configured limit. - Remove unused batch_de array (dead code).
- Replace server.prefetch_batch_max_size with a constant HGETALL_BATCH (16). The config is meant for cross-command key prefetch, not intra-command iteration. (sundb, ShooterIT) - Remove the config gate — fast path activates on encoding == HT with both KEY and VALUE flags. - Restructure with goto done to share cleanup between fast path and fallback, reducing duplication. (sundb)
… keys requested Per review: the batched prefetch fast path was gated on HGETALL (keys+values) only. HKEYS and HVALS also benefit from the Entry struct prefetch. Remove the flags gate and conditionally: - Skip Phase 2 value prefetch when only keys requested - Emit key/value in Phase 3 based on the flags
053973c to
50ea65c
Compare
|
Addressed in 50ea65c — the batched prefetch fast path now runs for HKEYS and HVALS too (not just HGETALL). Phase 2 value prefetch is skipped when Also rebased onto latest unstable (includes #15003 propagation fix, #14661 C fast_float, #14979 CoW dict dismiss, #14905 GCRA type). |
i think this pattern is proven, a generic function may be better |
- drop indent on `#define HGETALL_BATCH 16` - collapse phase-1 batching into a single `while (batch_count < HGETALL_BATCH)` loop; tighter and easier to read. No behavior change — same skip-expired semantics, same prefetch ordering.
- Move the per-entry expire check from Phase 1 into Phase 2. Phase 1 is now a pure pointer-fetch + `redis_prefetch_read`, so the prefetch is no longer redundant against an inline `entryGetExpiry` read on the same cache line. - Cache the value SDS pointers extracted in Phase 2 into `batch_val[]` so Phase 3 doesn't call `entryGetValue` a second time. - Compact the batch in place so Phase 3 only walks non-expired entries.
|
Pushed b475e1c with both follow-ups, @sundb:
Verified Will re-trigger the x86 + ARM benchmark suite to confirm the prefetch improvement holds with the new ordering — posting numbers when they land. |
|
Re-benchmarked x86 (
|
| Test | unstable | PR | Δ |
|---|---|---|---|
memtier_benchmark-10Kkeys-hash-hgetall-500-fields-100B-values |
14,750 | 21,033 | +42.6% |
memtier_benchmark-10Kkeys-hash-hgetall-200-fields-100B-values |
35,081 | 45,255 | +29.0% |
memtier_benchmark-10Kkeys-hash-hgetall-100-fields-100B-values |
63,053 | 73,992 | +17.3% |
memtier_benchmark-1key-hash-1K-fields-hgetall-pipeline-10 |
16,003 | 18,398 | +15.0% |
memtier_benchmark-1key-hash-1K-fields-hgetall |
16,357 | 18,762 | +14.7% |
memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values |
171,114 | 189,264 | +10.6% |
memtier_benchmark-1Mkeys-hash-hkeys-5-fields-with-100B-values-with-expiration-pipeline-10 |
806,716 | 858,329 | +6.4% |
memtier_benchmark-playbook-session-storage-1k-sessions |
156,094 | 163,824 | +5.0% |
memtier_benchmark-session-caching-hash-100k-sessions |
152,834 | 160,415 | +5.0% |
memtier_benchmark-1Mkeys-hash-hgetall-50-fields-10B-values |
167,362 | 174,691 | +4.4% |
memtier_benchmark-100Kkeys-hash-hgetall-50-fields-100B-values |
158,793 | 165,259 | +4.1% |
TMA funnel (Intel m7i, 10Kkeys-hgetall-200f-100B)
The Phase-1/Phase-2 reorder is producing the expected microarchitectural shift — Phase 1 is now a pure pointer-fetch + redis_prefetch_read(e), so the prefetches are no longer redundant against an inline expiry-check.
| Pipeline slots | unstable | PR | Δ |
|---|---|---|---|
| Retiring (useful work) | 29.8% | 34.7% | +4.9% |
| Backend_Bound | 52.3% | 42.4% | −9.9% |
| Memory_Bound | 45.7% | 36.4% | −9.3% |
| DRAM_Bound | 22.2% | 9.8% | −12.4% |
| └ MEM_Latency | 38.0% | 31.3% | −6.7% |
| Bad_Speculation | 4.4% | 2.1% | −2.3% |
DRAM_Bound dropped by more than half — prefetches land the Entry structs and value SDS into cache before Phase 3 reads them.
ARM
Triggered on a secondary ARM runner where memtier_benchmark itself appears CPU-bound (HGETALL absolute throughput is roughly half of the primary runner's), so the server-side prefetch win is masked. Re-triggering on the primary ARM runner; will follow up with those numbers.
| int skip_expired = !server.allow_access_expired; | ||
| dict *d = o->ptr; | ||
| dictIterator di; | ||
| dictInitSafeIterator(&di, d); |
There was a problem hiding this comment.
Is dictInitSafeIterator used to prevent cache invalidation caused by rehashing?
|
ARM follow-up — re-ran on the primary ARM (
|
| Test | unstable | PR | Δ |
|---|---|---|---|
memtier_benchmark-10Kkeys-hash-hgetall-500-fields-100B-values |
15,799 | 23,553 | +49.1% |
memtier_benchmark-10Kkeys-hash-hgetall-200-fields-100B-values |
43,965 | 54,192 | +23.3% |
memtier_benchmark-1key-hash-1K-fields-hgetall-pipeline-10 |
15,039 | 17,864 | +18.8% |
memtier_benchmark-10Kkeys-hash-hgetall-100-fields-100B-values |
76,976 | 88,790 | +15.3% |
memtier_benchmark-1key-hash-1K-fields-hgetall |
16,262 | 18,733 | +15.2% |
memtier_benchmark-1Mkeys-hash-hkeys-5-fields-with-100B-values-with-expiration-pipeline-10 |
1,037,583 | 1,097,135 | +5.7% |
Median improvement +17.1%, max +49.1%. Cross-arch story holds — same magnitude as x86, same direction (the prefetch is hiding the same DRAM-latency stall on Graviton). Confidence in b475e1cca is high.

Summary
Add a batched prefetch fast path for
HGETALLon hashtable-encoded hashes. When iterating large hash tables, pointer chasing through scattered heap allocations (dictEntry→Entry→ value SDS) causes cache misses that dominate CPU time (~10% flat indictNext).The new path collects dict entries in batches of configured batch size, issues software prefetches for the
Entrystructs and their value SDS data, then emits replies while the data is cache-warm. This hides memory latency by overlapping prefetch with reply generation.Conditions for the fast path:
prefetch-batch-max-size> 0Falls back to the existing generic iterator path otherwise.
Benchmark Results
Tested on
x86-aws-m7i.metal-24xlandarm-aws-m8g.metal-24xl,oss-standalonetopology.x86 — 5 improvements, 0 regressions
ARM — 5 improvements, 0 regressions
Improvement scales with field count — more fields means more cache misses hidden by prefetch. Zero regressions on listpack-encoded or small hashtable hashes.
Note
Medium Risk
Touches the hot reply/iteration path for
HGETALLand adds new control flow (goto done) plus expiry filtering, so subtle count/order or iterator correctness issues could surface under edge cases despite being scoped to HT encoding.Overview
Adds a batched prefetch fast path to
genericHgetallCommandfor hashtable-encoded hashes (OBJ_ENCODING_HT), issuingredis_prefetch_readonEntrystructs and their value SDS in small batches before writing replies.The new path compacts out expired fields (when
allow_access_expiredis off), preserves existing reply shape/count assertions, and falls back to the existinghashTypeIteratorloop for non-HT encodings.Reviewed by Cursor Bugbot for commit b475e1c. Bugbot is set up for automated code reviews on this repo. Configure here.