Improve performance during rehashing#3073
Conversation
Signed-off-by: chzhoo <czawyx@163.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3073 +/- ##
============================================
+ Coverage 74.31% 74.35% +0.03%
============================================
Files 129 129
Lines 71010 71037 +27
============================================
+ Hits 52773 52821 +48
+ Misses 18237 18216 -21
🚀 New features to boost your workflow:
|
Signed-off-by: chzhoo <czawyx@163.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
This is very clever!
With this approach, the CPU (or the compiler?) can be smart enough to do concurrent memory access and we don't need to add any explicit __builtin_prefetch() (valkey_prefetch) at all?
Signed-off-by: chzhoo <czawyx@163.com>
Yes, I have benchmarked the performance of manually calling |
zuiderkwast
left a comment
There was a problem hiding this comment.
Looks good to me. Thank you!
I will merge it but first I will wait some days to let others have a chance to review too.
|
Benchmark ran on this commit: Benchmark Comparison: unstable vs 8a8a090 (averaged) - rps metricsRun Summary:
Statistical Notes:
Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges. Configuration:
Configuration:
|
|
I don't think this benchmark is representative as rehashing would mainly occur in warmup duration of write load and would go unnoticed in result. |
Do we want a label for benchmarks without warmups? |
@roshkhatri It would help for this PR... but I feel that every optimization has some special scenario that we want to target so every optimization requires some specific benchmark to visualize the improvement. I guess whatever we add to the automated benchmarking, it might not be enough for the next optimization PR. If we could chat with the github actions bot and give it some very specific instructions, like give it a specific valkey-benchmark command line to run, then it might covering more special scenarios. Maybe. I think it's fine to run special benchmarks manually though. @chzhoo When you run the benchmark, do you see a difference in the latencies in the various percentiles? I would guess we have fewer requests with a high latency. |
I re-ran the benchmarks. The new version shows latency improvements at the p95 and p99 percentiles, while p50 latency remains unchanged. The commands and detailed results are below. |
Yes I think I can create wf that you can interact with, mention the PR and give the configs that we want |
|
For some bots, you can post a github comment with some commands it will act on. For example dependabot accepts commands in this way: https://docs.github.com/en/code-security/reference/supply-chain-security/dependabot-pull-request-comment-commands |
As the hashtable grows, rehash is triggered. Rehash consumes a
significant amount of CPU time, resulting in noticeable performance
degradation. Much of this CPU time is spent on random memory access
(when accessing hashtable entries). This CPU usage can be optimized
through concurrent memory access.
```
// The original rehash process:
For each bucket and its child buckets, all entries are processed as follows:
1) Access the entry, which involves random memory access.
2) Compute the hash value of the entry.
3) Based on the hash value, insert the entries into the new hashtable.
// The optimized rehash process:
The original steps are transformed into batch processing and phased execution, leveraging
the CPU's multi-issue capability to achieve concurrent memory access. The steps are as follows:
1) Access multiple entries serially within a for loop. Since each entry access is independent,
modern CPU architectures can perform concurrent memory access.
2) Compute the hash values for all entries.
3) Based on the hash values, insert the entries into the new hashtable.
```
---------
Signed-off-by: chzhoo <czawyx@163.com>
As the hashtable grows, rehash is triggered. Rehash consumes a
significant amount of CPU time, resulting in noticeable performance
degradation. Much of this CPU time is spent on random memory access
(when accessing hashtable entries). This CPU usage can be optimized
through concurrent memory access.
```
// The original rehash process:
For each bucket and its child buckets, all entries are processed as follows:
1) Access the entry, which involves random memory access.
2) Compute the hash value of the entry.
3) Based on the hash value, insert the entries into the new hashtable.
// The optimized rehash process:
The original steps are transformed into batch processing and phased execution, leveraging
the CPU's multi-issue capability to achieve concurrent memory access. The steps are as follows:
1) Access multiple entries serially within a for loop. Since each entry access is independent,
modern CPU architectures can perform concurrent memory access.
2) Compute the hash values for all entries.
3) Based on the hash values, insert the entries into the new hashtable.
```
---------
Signed-off-by: chzhoo <czawyx@163.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Description
As the hashtable grows, rehash is triggered. Rehash consumes a significant amount of CPU time, resulting in noticeable performance degradation. Much of this CPU time is spent on random memory access (when accessing hashtable entries). This CPU usage can be optimized through concurrent memory access.
Benchmark
Benchmark step
Write a batch of data into the server and record the QPS with the following commands.
Benchmark result
The benchmark results are shown below, with the Y-axis representing the recorded QPS and the X-axis representing time (in seconds). The green line indicates the optimized performance, while the blue line represents the original performance.
Benchmark env
CPU: Intel(R) Xeon(R) 6983P-C * 8
OS: Ubuntu Server 24.04 LTS 64bit
Memory: 32GB
VM: Tencent Cloud | S9e.2XLARGE32
Server start command: valkey-server --save '' --appendonly no