
@enjoy-binbin (Contributor) commented Nov 21, 2023

dictExpandAllowed (for the main db dict and the expire dict) involves a
few function calls and memory accesses. By performing it only after the
threshold checks, we can get some performance improvement.

A simple benchmark test: with 11032768 fixed keys in the database,
start a redis-server with `--maxmemory big_number --save ""`,
start a redis-benchmark with `-P 100 -r 1000000000 -t set -n 5000000`,
and collect the `throughput summary: n requests per second` result.

After five rounds of testing:

```
unstable     this PR
848032.56    897988.56
854408.69    913408.88
858663.94    914076.81
871839.56    916758.31
882612.56    920640.75
```

We can see about a 5% performance improvement in the general case.
But note that we will still hit the degradation once we cross the thresholds.

regression introduced in #11692

@enjoy-binbin (Contributor, Author)

It doesn't seem related. I retriggered CI and will look at the failing test case later.

https://github.com/redis/redis/actions/runs/6944684261/job/18892360058?pr=12789#step:4:4594

```
*** [err]: replica buffer don't induce eviction in tests/unit/maxmemory.tcl
Expected -50272 < 50000 && -50272 > -50000 (context: type eval line 80 cmd {assert {$delta < $delta_max && $delta > -$delta_max}} proc ::test)
Cleanup: may take some time... OK
```

@oranagra (Member)

Can you try temporarily reverting your change and instead caching some metrics inside overMaxmemoryAfterAlloc?
I want to check what's expensive here, and see whether we can also optimize the case where the rehash thresholds are crossed but expansion is prevented because of overMaxmemoryAfterAlloc.

@enjoy-binbin (Contributor, Author) commented Nov 22, 2023

OK, I ran a few tests; here is the latest result.

```
unstable:   this PR:    revert PR      keep and    unstable,          unstable with
                        and do cache:  do cache:   e = 23 default     #11007 change:
                                                   in _dictNextExp:
858958.94   872295.88   864453.69      881523.31   889521.44          872905.00
863259.69   902364.19   870019.12      887941.75   889996.44          881106.75
873057.44   909421.56   870170.56      894774.56   896861.00          898311.19
874890.62   905633.06   872905.00      909587.00   901225.69          897666.06
887311.44   911743.25   881057.25      909421.56   905305.12          900576.38

average:
871495.62   900291.59   871721.12      896649.64   896581.94          890113.08
```

the "do cache" part diff:

```diff
 int overMaxmemoryAfterAlloc(size_t moremem) {
     if (!server.maxmemory) return 0; /* No limit. */
 
+    static time_t last_check = 0;
+    static size_t last_mem_used = 0;
+    static size_t last_overhead = 0;
+    int can_refresh = 0;
+
+    if ((server.unixtime - last_check) >= 1) {
+        last_check = server.unixtime;
+        can_refresh = 1;
+    }
+
     /* Check quickly. */
-    size_t mem_used = zmalloc_used_memory();
+    if (can_refresh) last_mem_used = zmalloc_used_memory();
+    size_t mem_used = last_mem_used;
     if (mem_used + moremem <= server.maxmemory) return 0;
 
-    size_t overhead = freeMemoryGetNotCountedMemory();
+    if (can_refresh) last_overhead = freeMemoryGetNotCountedMemory();
+    size_t overhead = last_overhead;
     mem_used = (mem_used > overhead) ? mem_used - overhead : 0;
     return mem_used + moremem > server.maxmemory;
 }
```

the "unstable branch, change e to default 23 in _dictNextExp" part diff:

```diff
 static signed char _dictNextExp(unsigned long size)
 {
     unsigned char e = DICT_HT_INITIAL_EXP;
+    e = 23;
 
     if (size >= LONG_MAX) return (8*sizeof(long)-1);
     while(1) {
```

So it seems that every time a key is added, _dictNextExp has to recompute that exponent from scratch for a large dictionary, which is expensive in this case.

@oranagra (Member)

Thanks.
Can you add a row with the average, or its ratio vs unstable, to the table?
Do you mean that the main part of the slowness is the inefficiency of _dictNextExp (which can be optimized with CLZ, and has been dismissed so many times, see #2382)?

@enjoy-binbin (Contributor, Author) commented Nov 22, 2023

I added an average row to the table. Yeah, the data indicates that it is _dictNextExp. Are you interested in providing a new CLZ implementation so that I can test it?

In the past we didn't have scenarios where we needed to calculate it frequently. Now that we do, it seems it's time to pick it up.

@oranagra (Member)

Have a look at the linked PRs of the one I referred to; one of them was recent enough and used the CLZ builtin instead of the more portable bit trick (which is slightly slower).
Let's first see whether it's the inefficient _dictNextExp loop, or maybe some other side effect of starting with a bigger HT.

@enjoy-binbin (Contributor, Author)

OK, unstable with the #11007 change:

```
872905.00
881106.75
898311.19
897666.06
900576.38

average:
890113.08
```

@oranagra (Member)

So it looks like we should (or could) do both, i.e. reorder the conditions and use CLZ.
IIUC, the caching didn't do much (871xxx vs 871xxx).
Is that right?

@enjoy-binbin (Contributor, Author)

> so it looks like we should / could do both, i.e. re-order the conditions, and use CLZ.
> IIUC, the caching didn't do much (871xxx vs 871xxx).
> is that right?

Yes, that is right.

@oranagra oranagra merged commit 4634769 into redis:unstable Nov 22, 2023
@oranagra (Member)

OK, I merged this one.
If you have time you can do the CLZ in another PR.
IMHO we should use the CLZ builtin when available, and the bit trick otherwise.

@enjoy-binbin (Contributor, Author)

> if you have time you can do the CLZ in another PR.
> IMHO we should use CLZ builtin when available, and the bit trick otherwise.

OK, I can find time to sort out the past PRs.

@enjoy-binbin enjoy-binbin deleted the optimize_dict_expand_check branch November 22, 2023 09:39
@enjoy-binbin (Contributor, Author) commented Nov 27, 2023

enjoy-binbin@300d98a

Regarding the CLZ part, I have been running this code on my machine for a few days, and so far everything is OK (it has not finished yet, though; LONG_MAX is too big, and the loop becomes slower and slower). WDYT about this patch? @oranagra do you like it?

@oranagra (Member)

There's no need for that test.
CLZ (count leading zeros) can easily give you the next power of two.
For catching off-by-one bugs and the like, it's enough to test it on 126, 127, 128, 129 (and the same around 2^61); there is no need for that exhaustive test loop.

Additionally, I think we should detect systems that don't support the builtin, and use the 5-shift bit trick from #2382 when it is missing.

@enjoy-binbin (Contributor, Author)

Yeah, the loop test is just something I ran locally; I wanted to make sure that the output for each number is the same as before. (I have tested other CLZ branches before, and some parts differed.)

In fact, we are already using __builtin_clzl, so I dropped that part:

```c
static inline clientMemUsageBucket *getMemUsageBucket(size_t mem) {
    int size_in_bits = 8*(int)sizeof(mem);
    int clz = mem > 0 ? __builtin_clzl(mem) : size_in_bits;
    int bucket_idx = size_in_bits - clz;
    if (bucket_idx > CLIENT_MEM_USAGE_BUCKET_MAX_LOG)
        bucket_idx = CLIENT_MEM_USAGE_BUCKET_MAX_LOG;
    else if (bucket_idx < CLIENT_MEM_USAGE_BUCKET_MIN_LOG)
        bucket_idx = CLIENT_MEM_USAGE_BUCKET_MIN_LOG;
    bucket_idx -= CLIENT_MEM_USAGE_BUCKET_MIN_LOG;
    return &server.client_mem_usage_buckets[bucket_idx];
}
```

@oranagra (Member)

Ohh, I didn't recall that we're already using it.
So I suppose all the platforms we support have it (even if the CPU doesn't have a dedicated instruction), and the change is rather straightforward.

@enjoy-binbin (Contributor, Author) commented Nov 27, 2023

OK, tomorrow I will measure the performance comparison after reaching the threshold, based on that commit, and then prepare a PR for it.

#12815

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 28, 2023
In the past, we did not call _dictNextExp frequently; it was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in redis#7954 (6.2). For
the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization that avoids
exceeding maxmemory due to the dict expansion.

Then in redis#11692, we moved the dictTypeExpandAllowed check before
the threshold check, which caused a bit of performance degradation:
every time a key is added to the dict, dictTypeExpandAllowed is called.

The main reason for the degradation is that in a large dict we end up
calling _dictNextExp on every key insertion, before the threshold check
even decides whether the dict needs to be expanded. The order of these
checks can clearly be optimized.

So in redis#12789 we moved the dictTypeExpandAllowed check back to
after the threshold check. This way, before the dict is actually
expanded (that is, before the threshold is reached), we do no extra
work compared to before, and in particular do not call _dictNextExp
frequently.

But note we will still hit the degradation once we cross the
thresholds: when the threshold is reached, because of redis#7954 we may
delay the dict expansion due to maxmemory limitations, and during that
period we call _dictNextExp every time we add a key.

This PR uses CLZ (count leading zeros) in _dictNextExp to compute the
next power of two directly. Note that we already introduced the use of
__builtin_clzl in redis#8687 (7.0), so presumably all the platforms we
support have it (even if the CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8 bytes). So we set maxmemory to 6871947673 (6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
is performed, so we call _dictNextExp frequently and can compare the
performance before and after the optimization. DEBUG HTSTATS 0 was used
to verify that the dict expansion is indeed delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.
oranagra pushed a commit that referenced this pull request Nov 29, 2023
oranagra pushed a commit that referenced this pull request Jan 9, 2024
…ck (#12789)

(cherry picked from commit 4634769)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
…ck (#12789)

(cherry picked from commit 4634769)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
@YaacovHazan YaacovHazan moved this from To Do to In progress in Redis 6.2 Backport May 26, 2025
@YaacovHazan YaacovHazan moved this from In progress to To Do in Redis 6.2 Backport May 26, 2025
@YaacovHazan YaacovHazan moved this from To Do to Pending in Redis 6.2 Backport May 26, 2025