
Conversation


@oranagra oranagra commented Jan 5, 2023

Turns out that a fork child calling getExpire while persisting keys (and possibly also as a result of some module fork tasks) could cause dictFind to do incremental rehashing in the child process, which is both a waste of time and a cause of copy-on-write (COW) harm.

Discovered in #11679
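
For context, here is a minimal, self-contained sketch of the idea (the names `dictSetResizePolicy`-style enum, `dictRehashStepIfAllowed`, and the struct fields are illustrative, not the exact dict.c patch): the fork child forbids rehashing before it walks the keyspace, so read-only lookups such as dictFind no longer advance the incremental rehash and no longer dirty copy-on-write pages.

```c
#include <stddef.h>

/* Illustrative resize policy: the child sets it to FORBID right after fork(). */
typedef enum { RESIZE_ENABLE, RESIZE_AVOID, RESIZE_FORBID } resizePolicy;
static resizePolicy dict_resize_policy = RESIZE_ENABLE;

typedef struct dict {
    long rehashidx;   /* -1 when no rehash is in progress */
    int pauserehash;  /* >0 means rehashing is paused */
} dict;

/* Stand-in for the real incremental rehash step. */
static void dictRehash(dict *d, int n) { (void)n; d->rehashidx = -1; }

/* Called from read paths such as dictFind(): skip the incremental step
 * entirely when rehashing is forbidden (e.g. we are a fork child). */
static void dictRehashStepIfAllowed(dict *d) {
    if (dict_resize_policy == RESIZE_FORBID) return;
    if (d->rehashidx != -1 && d->pauserehash == 0) dictRehash(d, 1);
}

/* In the child process, before iterating the keyspace: */
void childStartSketch(void) { dict_resize_policy = RESIZE_FORBID; }
```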

@oranagra oranagra requested a review from madolson January 5, 2023 10:00
@oranagra oranagra merged commit 2bec254 into redis:unstable Jan 10, 2023
@oranagra oranagra deleted the no_rehash_in_child branch January 10, 2023 06:40
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jan 11, 2023
@oranagra oranagra mentioned this pull request Jan 16, 2023
oranagra added a commit that referenced this pull request Jan 16, 2023
Turns out that a fork child calling getExpire while persisting keys (and
possibly also as a result of some module fork tasks) could cause dictFind
to do incremental rehashing in the child process, which is both a waste
of time and a cause of COW harm.

(cherry picked from commit 2bec254)
This was referenced Feb 27, 2023
oranagra added a commit that referenced this pull request Feb 28, 2023
Turns out that a fork child calling getExpire while persisting keys (and
possibly also as a result of some module fork tasks) could cause dictFind
to do incremental rehashing in the child process, which is both a waste
of time and a cause of COW harm.

(cherry picked from commit 2bec254)
(cherry picked from commit 3e82bdf)
oranagra added a commit that referenced this pull request Feb 28, 2023
Turns out that a fork child calling getExpire while persisting keys (and
possibly also as a result of some module fork tasks) could cause dictFind
to do incremental rehashing in the child process, which is both a waste
of time and a cause of COW harm.

(cherry picked from commit 2bec254)
(cherry picked from commit 3e82bdf)
(cherry picked from commit 4803334cf6cb1eccdd33674a72a215ed6cd10069)
oranagra added a commit that referenced this pull request Jun 16, 2023
…ng fork (#12276)

## Issue:
When a dict has a long chain, or the length of the chain is longer than
the number of samples, we will never be able to sample the elements
at the end of the chain using dictGetSomeKeys().
This could mean that SRANDMEMBER can hang in an endless loop.
The most severe case is the pathological one in which someone uses SCAN+DEL
or SSCAN+SREM, creating an unevenly distributed dict.
This was amplified by the recent change in #11692, which prevented
down-sizing rehashing while there is a fork.

## Solution
1. Before, we would stop sampling when we reached the maximum number
  of samples, even if there was more data after the current chain.
  Now, when we reach the maximum, we use the reservoir sampling
  algorithm to fairly sample the end of the chain that could otherwise
  never be sampled (see the sketch after this list).
2. Fix the rehashing code so that, just as it allows rehashing for up-sizing
  during a fork when the ratio is extreme, it allows it for down-sizing as well.
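
A hedged sketch of the reservoir-sampling part of point 1, with hypothetical types and names (the real logic lives in dictGetSomeKeys(), mentioned above): once the reservoir of `maxsamples` slots is full, each further node in the chain replaces a random slot with probability maxsamples/seen, so even the tail of an arbitrarily long chain has a fair chance of being returned.

```c
#include <stdlib.h>

typedef struct node { void *key; struct node *next; } node;

/* Fill `samples` (capacity `maxsamples`) from one bucket chain.
 * Returns the number of slots actually filled. */
static unsigned long sampleChain(node *head, void **samples, unsigned long maxsamples) {
    unsigned long stored = 0, seen = 0;
    for (node *n = head; n != NULL; n = n->next) {
        seen++;
        if (stored < maxsamples) {
            samples[stored++] = n->key;                 /* fill the reservoir first */
        } else {
            unsigned long j = (unsigned long)rand() % seen; /* pseudo-random index in 0..seen-1 */
            if (j < maxsamples) samples[j] = n->key;    /* keep with probability maxsamples/seen */
        }
    }
    return stored;
}
```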

Issue was introduced (or became more severe) by #11692

Co-authored-by: Oran Agra <oran@redislabs.com>
oranagra pushed a commit that referenced this pull request Jul 10, 2023
This was introduced by the recent change in #11692, which prevented
down-sizing rehashing while there is a fork.

## Solution
1. Fix the rehashing code so that, just as it allows rehashing for up-sizing
  during a fork when the ratio is extreme, it allows it for down-sizing as well.

Co-authored-by: Oran Agra <oran@redislabs.com>

This is a partial cherry pick of:
(cherry picked from commit b00a235)
(cherry picked from commit d4c37320382edb342292a3e30250d46896a12016)
oranagra pushed a commit that referenced this pull request Jul 10, 2023
This was introduced by the recent change in #11692, which prevented
down-sizing rehashing while there is a fork.

## Solution
1. Fix the rehashing code so that, just as it allows rehashing for up-sizing
  during a fork when the ratio is extreme, it allows it for down-sizing as well.

Co-authored-by: Oran Agra <oran@redislabs.com>

This is a partial cherry pick of:
(cherry picked from commit b00a235)
(cherry picked from commit d4c37320382edb342292a3e30250d46896a12016)
oranagra added a commit that referenced this pull request Jul 10, 2023
…ng fork (#12276)

## Issue:
When a dict has a long chain, or the length of the chain is longer than
the number of samples, we will never be able to sample the elements
at the end of the chain using dictGetSomeKeys().
This could mean that SRANDMEMBER can hang in an endless loop.
The most severe case is the pathological one in which someone uses SCAN+DEL
or SSCAN+SREM, creating an unevenly distributed dict.
This was amplified by the recent change in #11692, which prevented
down-sizing rehashing while there is a fork.

## Solution
1. Before, we would stop sampling when we reached the maximum number
  of samples, even if there was more data after the current chain.
  Now, when we reach the maximum, we use the reservoir sampling
  algorithm to fairly sample the end of the chain that could otherwise
  never be sampled.
2. Fix the rehashing code so that, just as it allows rehashing for up-sizing
  during a fork when the ratio is extreme, it allows it for down-sizing as well.

Issue was introduced (or became more severe) by #11692

Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit b00a235)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Turns out that a fork child calling getExpire while persisting keys (and
possibly also as a result of some module fork tasks) could cause dictFind
to do incremental rehashing in the child process, which is both a waste
of time and a cause of COW harm.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 28, 2023
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in redis#7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in redis#11692, we moved the dictTypeExpandAllowed check before the
threshold check. This caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called to check.

The main reason for the degradation is that in a large dict we need to
call _dictNextExp frequently; that is, every time we add a key, we
need to call _dictNextExp once, and only then is the threshold checked
to see if the dict needs to be expanded. We can see that the order of
checks here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in redis#12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we do nothing extra compared to
before; in particular, we do not call _dictNextExp frequently.
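
A self-contained sketch of that re-ordering with purely illustrative names (not the real dict.c API): the cheap load-factor threshold is evaluated first, and only when it is reached do we compute the prospective new size (the _dictNextExp-like step) and consult the expand-allowed callback (e.g. a maxmemory check).

```c
#include <stddef.h>

typedef struct {
    size_t used;                             /* stored entries */
    size_t size;                             /* current bucket count */
    int (*expand_allowed)(size_t new_bytes); /* e.g. a maxmemory check, may be NULL */
} toy_dict;

/* Next power of two >= n (naive version; see the CLZ variant below). */
static size_t next_size(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

static int should_expand(toy_dict *d) {
    if (d->used < d->size) return 0;           /* cheap threshold check first */
    size_t new_size = next_size(d->used + 1);  /* costlier step, now rarely reached */
    if (d->expand_allowed && !d->expand_allowed(new_size * sizeof(void *)))
        return 0;                              /* expansion delayed by the callback */
    return 1;
}
```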

But note we'll still hit the degradation once we are over the threshold.
When the threshold is reached, because of redis#7954 we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two; CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we already introduced the use of __builtin_clzl in redis#8687,
which is 7.0, so I suppose all the platforms we use have it (even if the
CPU doesn't have a dedicated instruction).
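
Sketched below is the CLZ trick itself (illustrative, not the exact _dictNextExp body; it assumes a 64-bit unsigned long, as with the GCC/Clang builtin mentioned above): the exponent of the smallest power of two >= size is 64 minus the number of leading zero bits of size - 1.

```c
#include <assert.h>

/* Exponent e such that 2^e is the smallest power of two >= size. */
static int next_exp_clz(unsigned long size) {
    if (size <= 1) return 0;              /* 2^0 == 1 */
    return 64 - __builtin_clzl(size - 1); /* assumes 64-bit unsigned long */
}

int main(void) {
    assert(next_exp_clz(1) == 0);     /* 1    -> 2^0  */
    assert(next_exp_clz(4) == 2);     /* 4    -> 2^2  */
    assert(next_exp_clz(5) == 3);     /* 5    -> 2^3  */
    assert(next_exp_clz(1024) == 10); /* 1024 -> 2^10 */
    return 0;
}
```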

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
will be performed; that is, _dictNextExp is called frequently, which lets
us compare the behavior before and after the optimization. DEBUG HTSTATS 0
is used to check and make sure that our dict expansion is delayed.

Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`,
after ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.
oranagra pushed a commit that referenced this pull request Nov 29, 2023
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in #7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in #11692, we moved the dictTypeExpandAllowed check before the
threshold check. This caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called to
check.

The main reason for the degradation is that in a large dict we need to
call _dictNextExp frequently; that is, every time we add a key, we
need to call _dictNextExp once, and only then is the threshold checked
to see if the dict needs to be expanded. We can see that the order of
checks here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in #12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we do nothing extra compared to
before; in particular, we do not call _dictNextExp frequently.

But note we'll still hit the degradation once we are over the threshold.
When the threshold is reached, because of #7954 we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two; CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we already introduced the use of __builtin_clzl in #8687,
which is 7.0, so I suppose all the platforms we use have it (even if the
CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
will be performed; that is, _dictNextExp is called frequently, which
lets us compare the behavior before and after the optimization. DEBUG
HTSTATS 0 is used to check and make sure that our dict expansion is
delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.
oranagra pushed a commit that referenced this pull request Jan 9, 2024
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in #7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in #11692, we moved the dictTypeExpandAllowed check before the
threshold check. This caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called to
check.

The main reason for the degradation is that in a large dict we need to
call _dictNextExp frequently; that is, every time we add a key, we
need to call _dictNextExp once, and only then is the threshold checked
to see if the dict needs to be expanded. We can see that the order of
checks here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in #12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we do nothing extra compared to
before; in particular, we do not call _dictNextExp frequently.

But note we'll still hit the degradation once we are over the threshold.
When the threshold is reached, because of #7954 we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two; CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we already introduced the use of __builtin_clzl in #8687,
which is 7.0, so I suppose all the platforms we use have it (even if the
CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
will be performed; that is, _dictNextExp is called frequently, which
lets us compare the behavior before and after the optimization. DEBUG
HTSTATS 0 is used to check and make sure that our dict expansion is
delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.

(cherry picked from commit 22cc9b5)
oranagra pushed a commit that referenced this pull request Jan 9, 2024
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in #7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in #11692, we moved the dictTypeExpandAllowed check before the
threshold check. This caused a bit of performance degradation: every
time a key is added to the dict, dictTypeExpandAllowed is called to
check.

The main reason for the degradation is that in a large dict we need to
call _dictNextExp frequently; that is, every time we add a key, we
need to call _dictNextExp once, and only then is the threshold checked
to see if the dict needs to be expanded. We can see that the order of
checks here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in #12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we do nothing extra compared to
before; in particular, we do not call _dictNextExp frequently.

But note we'll still hit the degradation once we are over the threshold.
When the threshold is reached, because of #7954 we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR uses CLZ in _dictNextExp to get the next power of two; CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we already introduced the use of __builtin_clzl in #8687,
which is 7.0, so I suppose all the platforms we use have it (even if the
CPU doesn't have a dedicated instruction).

We build 67108864 (2**26) keys through DEBUG POPULATE, which uses
approximately 5.49G of memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G of
memory (2**27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which is less than 5.49G + 1G, so the dict rehash is delayed while
adding the keys.

After that, each time an element is added to the dict, an allow check
will be performed; that is, _dictNextExp is called frequently, which
lets us compare the behavior before and after the optimization. DEBUG
HTSTATS 0 is used to check and make sure that our dict expansion is
delayed.

Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see about a 4-5% performance improvement in this case.

(cherry picked from commit 22cc9b5)
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jan 15, 2024
The new shrink was added in redis#12850.
Also updated outdated comments, see redis#11692.
oranagra pushed a commit that referenced this pull request Jan 15, 2024
The new shrink was added in #12850.
Also updated outdated comments, see #11692.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
The new shrink was added in redis#12850.
Also updated outdated comments, see redis#11692.
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
The new shrink was added in redis#12850.
Also updated outdated comments, see redis#11692.