Improve multithreaded performance with memory prefetching by uriyage · Pull Request #861 · valkey-io/valkey

uriyage · 2024-08-01T13:24:43Z

This PR is the last of three pull requests intended to achieve the goal of processing one million requests per second.

1st PR: #758
2nd PR:#763

With these 3 PRs, we can now handle up to 1.2 million requests per second, up from 200K without IO-threads and ~380K with the current IO threads implementation.

This PR utilizes the IO threads to execute commands in batches, allowing us to prefetch the dictionary data in advance.

After making the IO threads asynchronous and offloading more work to them in the first 2 PRs, the lookupKey function becomes a main bottle-neck and it takes about 50% of the main-thread time (Tested with SET command). This is because the Valkey dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to either a dictEntry structure, pointer to key, or a value object requires, with high probability, an expensive external memory access.

Memory Access Amortization

(Designed and implemented by dan touitou)

Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately.

Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency.

We applied this method in the development of dictPrefetch, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, dictPrefetch is an interleaved execution of dictFind for all the keys.

Implementation details

When the main thread iterates over the clients-pending-io-read, for clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution. Only then will the commands be executed.

codecov · 2024-08-01T13:38:52Z

Codecov Report

Attention: Patch coverage is 4.46927% with 171 lines in your changes missing coverage. Please review.

Project coverage is 70.27%. Comparing base (1d18842) to head (ebbbed1).
Report is 52 commits behind head on unstable.

Files	Patch %	Lines
src/memory_prefetch.c	2.45%	159 Missing ⚠️
src/networking.c	15.38%	11 Missing ⚠️
src/io_threads.c	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #861      +/-   ##
============================================
- Coverage     70.35%   70.27%   -0.09%     
============================================
  Files           112      113       +1     
  Lines         61467    61711     +244     
============================================
+ Hits          43245    43365     +120     
- Misses        18222    18346     +124

Files	Coverage Δ
src/config.c	`78.69% <ø> (ø)`
src/dict.c	`97.54% <100.00%> (+0.25%)`	⬆️
src/kvstore.c	`96.17% <100.00%> (ø)`
src/server.c	`88.57% <ø> (+<0.01%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/io_threads.c	`7.87% <0.00%> (-0.04%)`	⬇️
src/networking.c	`88.42% <15.38%> (-0.30%)`	⬇️
src/memory_prefetch.c	`2.45% <2.45%> (ø)`

... and 25 files with indirect coverage changes

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

madolson · 2024-08-06T03:10:52Z

@lipzhu Would be great if you could take a look as well.

madolson

Generally LGTM. The dict prefetching code is a bit convoluted, mostly because of the global, so I think we can make that a bit cleaner. The other main point is it would be good to understand how we tuned the one magic number.

zhulipeng · 2024-08-06T14:23:39Z

Profiling the cache hit/miss W/o prefetch, the performance boost is impressive. @uriyage Great work.

BTW, the performance improvement is based on the situation that memory latency have been the bottle-neck of main thread, do you have evaluation if memory latency is not the bottle-neck, will the additional prefetch instruction caused regression?

 # w/ prefetch
 Performance counter stats for process id '1841714':

   128,348,636,929      mem_load_retired.l1_hit                                       (66.64%)
     1,917,290,678      mem_load_retired.l1_miss                                      (66.65%)
       736,226,936      mem_load_retired.l2_hit                                       (66.67%)
     1,180,484,127      mem_load_retired.l2_miss                                      (66.71%)
       419,957,504      mem_load_retired.l3_hit                                       (66.69%)
         8,607,452      mem_load_retired.l3_miss                                      (66.65%)

      10.001272139 seconds time elapsed

 # w/o prefetch
 Performance counter stats for process id '1842658':

   118,738,848,689      mem_load_retired.l1_hit                                       (66.64%)
     1,641,809,909      mem_load_retired.l1_miss                                      (66.64%)
       592,192,759      mem_load_retired.l2_hit                                       (66.67%)
     1,049,283,088      mem_load_retired.l2_miss                                      (66.71%)
       373,710,273      mem_load_retired.l3_hit                                       (66.69%)
        31,032,840      mem_load_retired.l3_miss                                      (66.65%)

      10.001325784 seconds time elapsed

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

uriyage · 2024-08-07T07:18:40Z

BTW, the performance improvement is based on the situation that memory latency have been the bottle-neck of main thread, do you have evaluation if memory latency is not the bottle-neck, will the additional prefetch instruction caused regression?

@lipzhu I tested the same code with one key in the database, meaning we accessed the same key repeatedly. In this case, memory latency is not the bottleneck, and as expected prefetching didn't help. However, no regression was observed.

madolson

LGTM outside of some existing comments and the clang format thing. Did we decide to make the batch size configurable and/or self-tuning? Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

Also, please open the doc PR as we discussed, then we can add it to the Valkey 8 board so it doesn't get dropped and this PR is unblocked to get merged.

zhulipeng · 2024-08-14T05:51:18Z

Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

+1. To make code structure cleaner maybe we can extract the prefetch related code to a new file and apply the prefetch optimization like a rule.

uriyage · 2024-08-14T17:20:23Z

LGTM outside of some existing comments and the clang format thing. Did we decide to make the batch size configurable and/or self-tuning? Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

Also, please open the doc PR as we discussed, then we can add it to the Valkey 8 board so it doesn't get dropped and this PR is unblocked to get merged.

@madolson @lipzhu I agree, I will add a configuration for it. However, it will require some changes since the batch is currently static. I also agree that we can refactor the code into a new file.
I believe I'll be able to publish the code by tomorrow. @lipzhu I'm not sure I understand what you mean by 'apply like a rule'.

zhulipeng · 2024-08-15T01:52:54Z

@lipzhu I'm not sure I understand what you mean by 'apply like a rule'.

Sorry for the miss understanding, I mean the Rule-Based Optimizer in DBMS, inspired by SparkSQL:spark.sql.adaptive.optimizer.excludedRules. Which give user a option to apply/exclude the prefetch rule based on their real environment.

Ref:
https://spark.apache.org/docs/latest/sql-performance-tuning.html

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

madolson

Just a bunch of minor wording things, everything looked good now.

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

madolson

LGTM, kicked off the CI again with all tests.

This PR utilizes the IO threads to execute commands in batches, allowing us to prefetch the dictionary data in advance. After making the IO threads asynchronous and offloading more work to them in the first 2 PRs, the `lookupKey` function becomes a main bottle-neck and it takes about 50% of the main-thread time (Tested with SET command). This is because the Valkey dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to either a dictEntry structure, pointer to key, or a value object requires, with high probability, an expensive external memory access. ### Memory Access Amortization Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately. Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency. We applied this method in the development of `dictPrefetch`, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, `dictPrefetch` is an interleaved execution of dictFind for all the keys. **Implementation details** When the main thread iterates over the `clients-pending-io-read`, for clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution. Only then will the commands be executed. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com>

This PR is based on: valkey-io/valkey#861 > ### Memory Access Amortization > (Designed and implemented by [dan touitou](https://github.com/touitou-dan)) > > Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately. > > Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency. > > We applied this method in the development of dictPrefetch, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, dictPrefetch is an interleaved execution of dictFind for all the keys. ### Implementation of Redis When the main thread processes clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution. #### Memory prefetching for main hash table As shown in the picture, after #13806 , we unify key value and the dict uses no_value optimization, so the memory prefetching has 4 steps: 1. prefetch the bucket of the hash table 2. prefetch the entry associated with the given key's hash 3. prefetch the kv object of the entry 4. prefetch the value data of the kv object we also need to handle the case that the dict entry is the pointer of kv object, just skip step 3. MAA can improves single-threaded memory access efficiency by interleaving the execution of multiple independent operations, allowing memory-level parallelism and better CPU utilization. Its key point is batch-wise interleaved execution. Split a batch of independent operations (such as multiple key lookups) into multiple state machines, and interleave their progress within a single thread to hide the memory access latency of individual requests. The difference between serial execution and interleaved execution: **naive serial execution** ``` key1: step1 → wait → step2 → wait → done key2: step1 → wait → step2 → wait → done ``` **interleaved execution** ``` key1: step1 → step2 → done key2: step1 → step2 → done key3: step1 → step2 → done ↑ While waiting for key1’s memory, progress key2/key3 ``` #### New configuration This PR involves a new configuration `prefetch-batch-max-size`, but we think it is a low level optimization, so we hide this config: When multiple commands are parsed by the I/O threads and ready for execution, we take advantage of knowing the next set of commands and prefetch their required dictionary entries in a batch. This reduces memory access costs. The optimal batch size depends on the specific workflow of the user. The default batch size is 16, which can be modified using the 'prefetch-batch-max-size' config. When the config is set to 0, prefetching is disabled. --------- Co-authored-by: Uri Yagelnik <uriy@amazon.com> Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>

Memory Access Amortization

f821c86

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

uriyage force-pushed the commands-prefetching branch from ddb9ab2 to f821c86 Compare August 1, 2024 13:50

hpatro reviewed Aug 1, 2024

View reviewed changes

Comment thread src/dict.h Outdated

Comment thread src/dict.c Outdated

Comment thread src/dict.c Outdated

Comment thread src/dict.c Outdated

Comment thread src/dict.c Outdated

Comment thread src/dict.c Outdated

Comment thread src/networking.c Outdated

Comment thread src/networking.c

secwall reviewed Aug 3, 2024

View reviewed changes

Comment thread src/server.h

Adress PR comments

da8106b

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

madolson reviewed Aug 6, 2024

View reviewed changes

Comment thread utils/generate-fmtargs.py Outdated

madolson reviewed Aug 6, 2024

View reviewed changes

Comment thread src/server.c

madolson reviewed Aug 6, 2024

View reviewed changes

Comment thread src/server.c Outdated

madolson reviewed Aug 6, 2024

View reviewed changes

madolson added needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes labels Aug 6, 2024

madolson requested a review from zuiderkwast August 6, 2024 04:45

zhulipeng reviewed Aug 6, 2024

View reviewed changes

Comment thread src/dict.c Outdated

Address more PR comments

9a41120

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

uriyage force-pushed the commands-prefetching branch from 2fbf659 to 9a41120 Compare August 6, 2024 23:00

madolson reviewed Aug 14, 2024

View reviewed changes

Dynamic config for batch size.

d04a755

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

uriyage force-pushed the commands-prefetching branch from 24ae4ad to d04a755 Compare August 18, 2024 04:49

zuiderkwast removed their request for review August 19, 2024 14:16

madolson reviewed Aug 19, 2024

View reviewed changes

Comment thread src/maa.c

madolson reviewed Aug 19, 2024

View reviewed changes

Comment thread src/maa.c Outdated

madolson reviewed Aug 19, 2024

View reviewed changes

Comment thread valkey.conf Outdated

madolson reviewed Aug 19, 2024

View reviewed changes

Comment thread valkey.conf Outdated

Adress PR comments

d300b1b

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

madolson removed the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Aug 23, 2024

madolson reviewed Aug 23, 2024

View reviewed changes

PR comments

c319572

Signed-off-by: Uri Yagelnik <uriy@amazon.com>

uriyage force-pushed the commands-prefetching branch from 3dcf709 to c319572 Compare August 25, 2024 07:18

madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 26, 2024

madolson reviewed Aug 26, 2024

View reviewed changes

Comment thread src/networking.c Outdated

Update src/networking.c

ebbbed1

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

madolson approved these changes Aug 26, 2024

View reviewed changes

madolson changed the title ~~Memory Access Amortization~~ Improve multithreaded performance with memory prefetching Aug 27, 2024

madolson merged commit 04d76d8 into valkey-io:unstable Aug 27, 2024

arukiidou mentioned this pull request Sep 11, 2024

[NEW] Reaching 1 million requests per second on a single Valkey instance #22

Closed

lilpoozie2005 approved these changes Sep 11, 2024

View reviewed changes

alexander-shabanov mentioned this pull request Nov 25, 2024

[NEW] Reply Offload #1353

Closed

ShooterIT mentioned this pull request May 7, 2025

Improve multithreaded performance with memory prefetching redis/redis#14017

Merged

chzhoo mentioned this pull request Jul 9, 2025

Accelerate RDB loading by prefetching hashtable buckets #2311

Closed

Uh oh!

Conversation

uriyage commented Aug 1, 2024

Memory Access Amortization

Uh oh!

codecov Bot commented Aug 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

madolson commented Aug 6, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

madolson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhulipeng commented Aug 6, 2024

Uh oh!

uriyage commented Aug 7, 2024

Uh oh!

madolson left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhulipeng commented Aug 14, 2024

Uh oh!

uriyage commented Aug 14, 2024

Uh oh!

zhulipeng commented Aug 15, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

madolson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

madolson left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov Bot commented Aug 1, 2024 •

edited

Loading

madolson left a comment •

edited

Loading