
@enjoy-binbin (Contributor) commented Nov 21, 2023

dictExpandAllowed (for the main db dict and the expire dict) involves a
few function calls and memory accesses. By performing it only after the
threshold checks, we can get some performance improvement.

A simple benchmark test: with 11032768 fixed keys in the database,
start a redis-server with `--maxmemory big_number --save ""`,
start a redis-benchmark with `-P 100 -r 1000000000 -t set -n 5000000`,
and collect the `throughput summary: n requests per second` result.

After five rounds of testing:

```
unstable     this PR
848032.56    897988.56
854408.69    913408.88
858663.94    914076.81
871839.56    916758.31
882612.56    920640.75
```

We can see about a 5% performance improvement in the general case.
But note that we will still hit the degradation once we cross the thresholds.

regression introduced in #11692

@enjoy-binbin (Contributor, Author)

It doesn't seem related. I retriggered CI and will look at the failing test case later.

https://github.com/redis/redis/actions/runs/6944684261/job/18892360058?pr=12789#step:4:4594

```
*** [err]: replica buffer don't induce eviction in tests/unit/maxmemory.tcl
Expected -50272 < 50000 && -50272 > -50000 (context: type eval line 80 cmd {assert {$delta < $delta_max && $delta > -$delta_max}} proc ::test)
Cleanup: may take some time... OK
```

@oranagra (Member)

Can you try temporarily reverting your change and instead caching some metrics inside overMaxmemoryAfterAlloc?
I want to check what's expensive here, and see whether we can also optimize the case where the rehash thresholds are crossed but expansion is prevented because of overMaxmemoryAfterAlloc.

@enjoy-binbin (Contributor, Author) commented Nov 22, 2023

OK, I ran a few tests; here is the latest result.

```
unstable:   this PR:    revert PR      keep and    unstable,          unstable with
                        and do cache:  do cache:   e = 23 default     #11007 change:
                                                   in _dictNextExp:
858958.94   872295.88   864453.69      881523.31   889521.44          872905.00
863259.69   902364.19   870019.12      887941.75   889996.44          881106.75
873057.44   909421.56   870170.56      894774.56   896861.00          898311.19
874890.62   905633.06   872905.00      909587.00   901225.69          897666.06
887311.44   911743.25   881057.25      909421.56   905305.12          900576.38

average:
871495.62   900291.59   871721.12      896649.64   896581.94          890113.08
```

the "do cache" part diff:

```diff
 int overMaxmemoryAfterAlloc(size_t moremem) {
     if (!server.maxmemory) return 0; /* No limit. */
 
+    static time_t last_check = 0;
+    static size_t last_mem_used = 0;
+    static size_t last_overhead = 0;
+    int can_refresh = 0;
+
+    if ((server.unixtime - last_check) >= 1) {
+        last_check = server.unixtime;
+        can_refresh = 1;
+    }
+
     /* Check quickly. */
-    size_t mem_used = zmalloc_used_memory();
+    if (can_refresh) last_mem_used = zmalloc_used_memory();
+    size_t mem_used = last_mem_used;
     if (mem_used + moremem <= server.maxmemory) return 0;
 
-    size_t overhead = freeMemoryGetNotCountedMemory();
+    if (can_refresh) last_overhead = freeMemoryGetNotCountedMemory();
+    size_t overhead = last_overhead;
     mem_used = (mem_used > overhead) ? mem_used - overhead : 0;
     return mem_used + moremem > server.maxmemory;
 }
```

the "unstable branch, change e to default 23 in _dictNextExp" part diff:

```diff
 static signed char _dictNextExp(unsigned long size)
 {
     unsigned char e = DICT_HT_INITIAL_EXP;
+    e = 23;
 
     if (size >= LONG_MAX) return (8*sizeof(long)-1);
     while(1) {
```

So it seems that every time a key is added, _dictNextExp has to recompute that exponent from scratch for a large dictionary, which is expensive in this case.

@oranagra (Member)

Thanks.
Can you add a row with the average, or its ratio vs unstable, to the table?
Do you mean that the main part of the slowness is the inefficiency of _dictNextExp (which can be optimized with CLZ, and has been dismissed so many times, see #2382)?

@enjoy-binbin (Contributor, Author) commented Nov 22, 2023

I added an average row to the table. Yeah, the data indicates that it is _dictNextExp. Are you interested in providing a new CLZ implementation so that I can test it?

In the past we didn't have scenarios where we needed to calculate it frequently. Now that we do, it seems it's time to pick it up.

@oranagra (Member)

Have a look at the linked PRs of the one I referred to; one of them was recent enough and used the CLZ builtin instead of the more portable bit trick (which is slightly slower).
Let's first see whether it's the inefficient _dictNextExp loop, or maybe some other side effect of starting with a bigger HT.

@enjoy-binbin (Contributor, Author)

OK, unstable with the #11007 change:

```
872905.00
881106.75
898311.19
897666.06
900576.38

average:
890113.08
```

@oranagra (Member)

So it looks like we should (or could) do both, i.e. reorder the conditions and use CLZ.
IIUC, the caching didn't do much (871xxx vs 871xxx).
Is that right?

@enjoy-binbin (Contributor, Author)

> so it looks like we should / could do both, i.e. re-order the conditions, and use CLZ.
> IIUC, the caching didn't do much (871xxx vs 871xxx).
> is that right?

Yes, that is right.

@oranagra oranagra merged commit 4634769 into redis:unstable Nov 22, 2023
@oranagra (Member)

OK, I merged this one.
If you have time you can do the CLZ in another PR.
IMHO we should use the CLZ builtin when available, and the bit trick otherwise.

@enjoy-binbin (Contributor, Author)

> if you have time you can do the CLZ in another PR.
> IMHO we should use CLZ builtin when available, and the bit trick otherwise.

OK, I can find time to sort out the past PRs.

@enjoy-binbin enjoy-binbin deleted the optimize_dict_expand_check branch November 22, 2023 09:39
@enjoy-binbin (Contributor, Author) commented Nov 27, 2023

enjoy-binbin@300d98a

Regarding the CLZ part, I have been running this code on my machine for a few days, and so far everything is OK (it has not finished yet, though; LONG_MAX is too big, and the loop becomes slower and slower). WDYT about this patch? @oranagra do you like it?

@oranagra (Member)

There's no need for that test.
CLZ (count leading zeros) can easily give you the next power of two.
For catching off-by-one bugs and the like, it's enough to test it on 126, 127, 128, 129 (and the same around 2^61); there is no need for that exhaustive test loop.

Additionally, I think we should detect systems that don't support the builtin, and use the 5-shift bit trick from #2382 when it is missing.

@enjoy-binbin (Contributor, Author)

Yeah, the loop test is just something I ran locally; I wanted to make sure that the output for each number is the same as before. (I have tested other CLZ branches before, and some parts differed.)

In fact, we are already using __builtin_clzl, so I dropped that part:

```c
static inline clientMemUsageBucket *getMemUsageBucket(size_t mem) {
    int size_in_bits = 8*(int)sizeof(mem);
    int clz = mem > 0 ? __builtin_clzl(mem) : size_in_bits;
    int bucket_idx = size_in_bits - clz;
    if (bucket_idx > CLIENT_MEM_USAGE_BUCKET_MAX_LOG)
        bucket_idx = CLIENT_MEM_USAGE_BUCKET_MAX_LOG;
    else if (bucket_idx < CLIENT_MEM_USAGE_BUCKET_MIN_LOG)
        bucket_idx = CLIENT_MEM_USAGE_BUCKET_MIN_LOG;
    bucket_idx -= CLIENT_MEM_USAGE_BUCKET_MIN_LOG;
    return &server.client_mem_usage_buckets[bucket_idx];
}
```

@oranagra (Member)

Ohh, I didn't recall that we're already using it.
So I suppose all the platforms we support have it (even if the CPU doesn't have a dedicated instruction), and the change is rather straightforward.

@enjoy-binbin (Contributor, Author) commented Nov 27, 2023

OK, tomorrow I will measure the performance comparison after reaching the threshold, based on that commit, and then prepare a PR for it.

#12815

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 28, 2023
In the past, we did not call _dictNextExp frequently; it was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in redis#7954 (6.2). For
the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization that avoids
exceeding maxmemory due to the dict expansion.

Then in redis#11692, we moved the dictTypeExpandAllowed check before
the threshold check, which caused a bit of performance degradation:
every time a key is added to the dict, dictTypeExpandAllowed is called.

The main reason for the degradation is that in a large dict we end up
calling _dictNextExp on every key insertion, before the threshold check
even decides whether the dict needs to be expanded. The order of these
checks can clearly be optimized.

So in redis#12789 we moved the dictTypeExpandAllowed check back to
after the threshold check. This way, before the dict is actually
expanded (that is, before the threshold is reached), we do no extra
work compared to before, and in particular do not call _dictNextExp
frequently.

But note we will still hit the degradation once we cross the
thresholds: when the threshold is reached, because of redis#7954 we may
delay the dict expansion due to maxmemory limitations, and during that
period we call _dictNextExp every time we add a key.

This PR uses CLZ (count leading zeros) in _dictNextExp to compute the
next power of two directly. Note that we already introduced the use of
__builtin_clzl in redis#8687 (7.0), so presumably all the platforms we
support have it (even if the CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8 bytes). So we set maxmemory to 6871947673 (6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
is performed, so we call _dictNextExp frequently and can compare the
performance before and after the optimization. DEBUG HTSTATS 0 was used
to verify that the dict expansion is indeed delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.
oranagra pushed a commit that referenced this pull request Nov 29, 2023
oranagra pushed a commit that referenced this pull request Jan 9, 2024
…ck (#12789)

(cherry picked from commit 4634769)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
…ck (#12789)

(cherry picked from commit 4634769)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
(cherry picked from commit 22cc9b5)
@YaacovHazan YaacovHazan moved this from To Do to In progress in Redis 6.2 Backport May 26, 2025
@YaacovHazan YaacovHazan moved this from In progress to To Do in Redis 6.2 Backport May 26, 2025
@YaacovHazan YaacovHazan moved this from To Do to Pending in Redis 6.2 Backport May 26, 2025