
Conversation

@ShooterIT (Member) commented Oct 24, 2020

As we know, redis may reject users' requests or evict some keys if used memory exceeds maxmemory. Dictionary expansion can make things worse: a big dictionary, such as the main db dict or the expires dict, may suddenly eat huge memory to allocate its new hash table and end up far beyond maxmemory after expanding. There are related issues: #4213 #4583

In more detail: when redis expands a dict, it allocates a new ht[1] that is generally double the size of ht[0], so ht[1] will be very big if ht[0] is already big. For the db dict, with more than 64 million keys, ht[1] costs 1GB when the dict expands (128M buckets * 8 bytes per pointer).

If the sum of used memory and the new hash table the dict needs exceeds maxmemory, we shouldn't allow the dict to expand. If keys eviction is enabled, we still couldn't add many more keys after eviction and rehashing; worse, redis ends up keeping fewer keys, because part of its memory holds the new hash table instead of users' data. And if keys eviction is disabled, users can't write data to redis at all.

What this commit changed?

  • Add a new member function expandAllowed to the dict type; it gives the caller a way to allow the expand or not. We expose two parameters to this function: the extra memory needed for the expand and the dict's current load factor, so users can implement a function that decides based on them.
  • For the main db dict and expires dict types, which may be very big and cost huge memory to expand, we implement a judgement function: we provisionally stop the dict from expanding if used memory would exceed maxmemory after the expand, but to guarantee redis' performance we still allow the expand once the dict's load factor exceeds the safe load factor. A sketch of this function follows this list.
  • Add test cases to verify that we don't allow the main db to expand when the remaining memory is not enough, so that keys eviction is avoided.
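
A minimal C sketch of such a judgement function, close in spirit to what this PR describes (the helper `overMaxmemoryAfterAlloc`, the globals, and the 1.618 constant are illustrative assumptions drawn from the discussion below, not necessarily the exact merged code):

```c
#include <stddef.h>

/* Safe load factor above which we expand anyway, even if the allocation
 * would push us over maxmemory (the threshold value discussed below). */
#define HASHTABLE_MAX_LOAD_FACTOR 1.618

/* Stand-ins for redis' real memory accounting (assumed for the sketch). */
static size_t used_memory, server_maxmemory;

/* Would allocating `moremem` extra bytes push us over the limit? */
static int overMaxmemoryAfterAlloc(size_t moremem) {
    return server_maxmemory && used_memory + moremem > server_maxmemory;
}

/* expandAllowed callback for the main db / expires dict types:
 * `moremem` is the size of the new hash table about to be allocated,
 * `usedRatio` is the dict's current load factor (used / size). */
static int dictExpandAllowed(size_t moremem, double usedRatio) {
    if (usedRatio <= HASHTABLE_MAX_LOAD_FACTOR)
        return !overMaxmemoryAfterAlloc(moremem); /* defer while memory is tight */
    else
        return 1; /* too saturated: expand even at the cost of eviction */
}
```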

Other changes:

  • For the new hash table size on expand: before this commit, the dict's used count was doubled and then passed to _dictNextPower. Since we actually aim to keep the dict load factor between 0.5 and 1.0, we replace the *2 with +1: because the expand check is used >= size, the old result is usually the same as _dictNextPower(used+1). The only case where it differs is when dict_can_resize is false during fork, where _dictNextPower(used*2) makes the dict jump to *4 (e.g. _dictNextPower(1025*2) returns 4096). A runnable illustration follows this list.
  • Fix the rehash test cases, since the algorithm for the new hash table size on expand changed.
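
For reference, a runnable sketch of why the +1 matters (the loop version of `_dictNextPower` here is an assumption of the classic implementation, simplified from dict.c):

```c
#include <stdio.h>

#define DICT_HT_INITIAL_SIZE 4

/* Round `size` up to the next power of two, starting from the
 * initial table size (classic loop implementation). */
static unsigned long _dictNextPower(unsigned long size) {
    unsigned long i = DICT_HT_INITIAL_SIZE;
    while (i < size) i *= 2;
    return i;
}

int main(void) {
    /* Expansion triggers once used >= size, so used*2 and used+1 usually
     * round to the same power of two. They differ only when expansion was
     * deferred (dict_can_resize is false during fork) and `used` grew
     * past `size` before the dict could rehash. */
    unsigned long used = 1025;
    printf("_dictNextPower(used*2) = %lu\n", _dictNextPower(used * 2)); /* 4096 */
    printf("_dictNextPower(used+1) = %lu\n", _dictNextPower(used + 1)); /* 2048 */
    return 0;
}
```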

@oranagra (Member) left a comment

@ShooterIT thank you for this PR and the idea it brings.
I think we need to work on it a bit more to make sure it doesn't have any undesired side effects.
We can easily think of cases where the rehash causes damage; I would rather limit the implications of this change to such cases, to avoid surprises.

@oranagra oranagra added this to the Next minor backlog milestone Oct 28, 2020
@madolson (Contributor) left a comment

Going to repeat one of my specific points: I think we need a way to make sure we don't end up with a poorly balanced DB. We can prevent rehashing but keep adding items; at some point we need to force a rehash or give visibility into it. A real use case I can see: someone keeps bumping up maxmemory on their redis database as they need it, using a policy without eviction, and the dictionary never rehashes.

@oranagra (Member) commented Nov 11, 2020

I discussed this with @yossigo @guybe7 @MeirShpilraien @inbaryuval @yaacovhazan-Redislabs; here are the conclusions:

  • It's hard to predict the impact of such a mechanism when enabled on all dicts; we feel we should focus on just solving the major known problem of the main dict, where it is easier to realize all the implications of this change. For that, a new field can be added to the dictType struct, so it doesn't add a cost per instance of the dictionary.
  • This is essentially a balance between memory and execution performance. Currently redis aims to keep a dict saturation ratio between 0.5 and 1.0 (as soon as it reaches 1.0, rehashing brings it back to 0.5), but these two constants can in theory be set to different ones (0.8 and 1.2 or anything else).
  • The code that does both used*2 and later _dictNextPower seems like a bug: since the check is used >= size, the outcome is usually the same as _dictNextPower(used+1) (no need for the *2). The only case where it differs is when dict_can_resize is false during fork, so that the later _dictNextPower(used*2) causes the dict to jump to *4 (i.e. _dictNextPower(1025*2) gives you 4096).

proposals:

  1. replace the *2 with +1 (will only affect wakeup after fork)
  2. when there is enough memory, keep the logic as it is (saturation between 0.5 and 1.0), but when memory is in shortage (the rehash is going to cause eviction), relax the saturation threshold, but only up to a limit; say up to a saturation of 1.61803 (after that we will prefer to rehash, despite the eviction it will cause).
  3. we can probably revise the rehashing code to be a little less memory hungry during rehashing (I'll open another issue to describe the design).

@ShooterIT (Member, Author)

Thanks @oranagra @madolson
After more thought, I agree with you: we can only implement this strategy on the main DB dict, and not even on the expires dict, because we don't know the user's eviction policy, while we definitely add and evict keys on the main DB dict. Just as Madelyn said, we may delete a key from the main db but add members to a hash object.

We already know the new hash table size may be *4 or *8 of the old size when we add many keys while there is a forked child. I'm happy to replace the *2 with +1, and we needn't try to check again if memory is not enough.

For the hard saturation limit of 1.61803 (golden ratio?): saturation will be about 0.8 after the rehash once we reach 1.61803, so we can add some new keys without performance loss after eviction, right?

@ShooterIT (Member, Author)

But if we set a hard hash saturation limit, can we apply this strategy to all dicts when memory is low?

If we add a new field to the dictType struct, we may need to modify all dictType variables. Currently my implementation is more like setting dict_can_resize; does it break encapsulation? @madolson

@oranagra (Member)

I'd rather modify 20 structs in many places in the redis code base than add a new member per dict instance (considering its memory consumption).

But maybe, considering that the current plan (if we take what I posted) is just a matter of changing the threshold (and not a yes/no boolean decision), we can apply that to all dictionaries.

P.S. if we do that, I suppose the selection between the thresholds should not be `used_memory + new_malloc > maxmemory`, but rather a check that we're close to the limit (`used_memory + new_malloc` > 95% of maxmemory or alike); otherwise, considering that the last eviction could have evicted a large object and left a big chunk of unused memory, small dicts will not use the right threshold.
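
A tiny sketch of that proximity check (the 95% figure is from the comment above; the function name and exact form are illustrative assumptions):

```c
#include <stddef.h>

/* Use the relaxed saturation threshold only when the allocation would
 * bring us genuinely close to the configured limit, not merely above a
 * level that a recent large eviction has already freed up. */
static int nearMaxmemoryAfterAlloc(size_t used_memory, size_t new_malloc,
                                   size_t maxmemory) {
    if (maxmemory == 0) return 0; /* no limit configured */
    return used_memory + new_malloc > maxmemory / 100 * 95;
}
```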

@madolson (Contributor)

I would also still put it on the 20 structs; I don't think it's that bad, and it gives us more flexibility.

"but maybe considering that the current plan (if we take what i posted) is just a matter of changing the threshold (and not a yes/no boolean decision), we can apply that on all dictionaries." I agree we could.

@ShooterIT (Member, Author)

Yes, adding a member function to dictType gives us more flexibility. I'm fine with that; it just requires modifying many structs.

I hope we can apply this mechanism to more dicts. For correctness, whatever the eviction policy, it is fine to apply this mechanism to the main db, but we need to pay attention to the expires dict. I experienced one case where redis immediately ate an extra 1GB of memory when the user's key count passed 32M. The main db rehash only accounts for 512MB (64M * 8 bytes); since all keys had an expire time set, I concluded the other 512MB was eaten by the expires dict rehash. The arithmetic is sketched below.
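
The arithmetic behind that case, as a runnable check (the 8-byte bucket pointer per slot is the assumption here):

```c
#include <stdio.h>

int main(void) {
    /* Just past 32M user keys, each dict expands to the next power of two. */
    unsigned long buckets  = 1UL << 26;   /* 64M slots */
    unsigned long ht1Bytes = buckets * 8; /* 8-byte pointer per slot */
    printf("main db ht[1]:  %lu MB\n", ht1Bytes >> 20);       /* 512 MB */
    printf("expires ht[1]:  %lu MB\n", ht1Bytes >> 20);       /* 512 MB (all keys had TTLs) */
    printf("total at once:  %lu MB\n", (2 * ht1Bytes) >> 20); /* 1024 MB ~= 1 GB */
    return 0;
}
```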

@ShooterIT (Member, Author)

hi @oranagra @madolson

I did several things in the current commit.

  • Add a member function to the dict type to check whether the dict can expand, exposing two parameters: the extra memory the rehash needs and the dict's current load factor, so users can implement a function that decides based on them.
  • For the main db dict and expires dict types, I implement an expand-check function.
  • For the new hash table size after expanding, I use the next power of two above ht[0].used.

There are some things I am not sure about and need your help with:

  • the name of the new member function (so for now I haven't added comments for every dictType; I will add them once we confirm it).
  • I haven't added expand-check implementations for the other dicts.

Please review

@oranagra (Member)

Thanks @ShooterIT.
This approach looks good to me (good to be merged). Posted a few minor comments.

@ShooterIT ShooterIT changed the title Can't rehash if used memory exceeds maxmemory after rehashing Limit the main db and expires dictionaries to expand Nov 25, 2020
oranagra previously approved these changes Nov 25, 2020
@oranagra (Member)

@redis/core-team I think this one is ready to be merged.
@ShooterIT please update the top comment of the PR for an easy review, and also to be used as a commit comment when squash-merging.

@oranagra oranagra added the state:to-be-merged and release-notes labels Nov 25, 2020
@itamarhaber (Member) left a comment

LGTM but didn't review

@ShooterIT (Member, Author)

Hi @oranagra, done. Please have a look.

yossigo previously approved these changes Dec 1, 2020
@itamarhaber (Member)

FYI @filipecosta90 because performance, duh.

itamarhaber previously approved these changes Dec 3, 2020
@JimB123 (Contributor) commented Dec 3, 2020

Overall, I think this is a good change, and it complements the recent change to support incremental eviction.

I like the rename of the dictType for the expires table. The current name is very general and the new usage is specific.

I concur with @ShooterIT regarding his analysis of performance. freeMemoryGetNotCountedMemory is already called once on all writes when memory is high, so it can't be a significant performance problem, or it would already be impacting current performance. In this case, it would be called a second time for the subset of writes that are "adds". I've looked at the code path; it doesn't seem significant now, and it can't become significant in the future without impacting general write performance when memory is high. I don't think it would be worth the complexity to cache this value.

@oranagra (Member) commented Dec 3, 2020

@JimB123 which dictType for eviction table?

@JimB123 (Contributor) commented Dec 3, 2020

@JimB123 which dictType for eviction table?

Sorry - expires, not evict. keyptrDictType -> dbExpiresDictType

madolson previously approved these changes Dec 4, 2020
@madolson (Contributor) left a comment

I'm good with the change

@oranagra (Member) commented Dec 4, 2020

@JimB123 we did rename it (IIRC I asked for it for the exact same reason as you).

@oranagra oranagra dismissed stale reviews from madolson, itamarhaber, yossigo, and themself via 504932e December 6, 2020 09:28
@oranagra oranagra merged commit 75f9dec into redis:unstable Dec 6, 2020
@oranagra oranagra mentioned this pull request Jan 13, 2021
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Mar 2, 2021
@ShooterIT ShooterIT deleted the rehash branch March 11, 2021 07:33
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 28, 2023
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in redis#7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in redis#11692, we moved the dictTypeExpandAllowed check before the
threshold check, which caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called.

The main reason for degradation is that in a large dict, we need to
call _dictNextExp frequently, that is, every time we add a key, we
need to call _dictNextExp once. Then the threshold is checked to see
if the dict needs to be expanded. We can see that the order of checks
here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in redis#12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we will not do anything extra
compared to before, that is, we will not call _dictNextExp frequently.

But note we'll still hit the degradation when we exceed the thresholds.
When the threshold is reached, because of redis#7954, we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two. CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we already introduced the use of __builtin_clzl in redis#8687,
which is 7.0, so I suppose all the platforms we support have it (even
when the CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, the allow check
is performed, which means we call _dictNextExp frequently, letting us
compare behavior before and after the optimization. We use DEBUG HTSTATS 0
to check and make sure that our dict expansion is indeed delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see there is about 4-5% performance improvement in this case.
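
A sketch of a CLZ-based next-power-of-two helper in the spirit of that change (the constants and exact signature are assumptions; like the merged code, it returns an exponent rather than a size):

```c
#include <limits.h>

#define DICT_HT_INITIAL_EXP 2 /* initial table size 4 == 1 << 2 */

/* Return the exponent e such that (1UL << e) is the smallest power of
 * two >= size, using count-leading-zeros instead of a doubling loop. */
static signed char _dictNextExp(unsigned long size) {
    if (size <= (1UL << DICT_HT_INITIAL_EXP)) return DICT_HT_INITIAL_EXP;
    if (size >= LONG_MAX) return 8 * sizeof(long) - 1;
    /* For size > 1: word width minus __builtin_clzl(size - 1) equals
     * ceil(log2(size)), i.e. the exponent of the next power of two. */
    return 8 * sizeof(long) - __builtin_clzl(size - 1);
}
```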
oranagra pushed a commit that referenced this pull request Nov 29, 2023
oranagra pushed a commit that referenced this pull request Jan 9, 2024
oranagra pushed a commit that referenced this pull request Jan 9, 2024

Labels

release-notes: indication that this issue needs to be mentioned in the release notes
state:to-be-merged: the PR should be merged soon, even if not yet ready; this is used so that it won't be forgotten
