-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix resize hash tables stuck on the last non-empty slot #12802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
zuiderkwast
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it's odd that this resize_cursor is -1 when the iteration is over. It's better to align the db iterator with SCAN cursors and return 0 when the iteration is done. When starting again, it starts from the beginning.
I commented on it in #11695 (comment) but it was only a bit vague in the end of another comment.
If we do this change, more weird things like #12754 (comment) can be fixed. It's more intuitive to call dbIteratorInitFromSlot with the slot you want instead of slot - 1.
Even dbGetNextNonEmptySlot can return 0 instead of -1 when there are not more slots. We can allow the current slot to be 0 but the next slot can never be 0. WDYT?
oranagra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly LGTM, i agree the use of a dbGetDictFromIterator to exploit it's internals without actually using it for iteration was odd.
i'd like @hpatro to comment as well.
p.s. can / should we add a test for that case?
I think Not sure if I fully understand your point. Would you like to open another PR to attempt implementing it and clarify your intentions? |
@soloestoy Maybe I can do that later. I will not insist in changing this iterator right now. Please go ahead and merge this PR. I can just try to clarify what I mean:
We could allow the current slot to be 0 but the next slot cannot be 0. If next slot is 0 it means we reached the end. (That's somewhat how we treat SCAN's return value 0 vs SCAN input cursor 0.) It should be clear how the iterator should be used; if We have functions for current slot: dbGetDictFromIterator and dbIteratorGetCurrentSlot. We have dbIteratorNextDict to skip to the next non-empty slot. To initialize, we have dbIteratorInit which finds the first non-empty slot and dbIteratorInitFromSlot which doesn't check that the given slot is non-empty. It's a bit asymmetric. |
|
@soloestoy Curious what helped with the discovery of the issue and if you could add some test cases for the edge case it would be helpful. |
@hpatro I found this issue while reviewing the code. Adding test cases will indeed help us identify problems earlier. I have added the test case. Please re-approve @oranagra . |
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 12
number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```
When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.
The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.
The test is introduced in redis#12802.
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 12
number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```
When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.
The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.
The test is introduced in #12802.
Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.
Before #12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in #12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.
tests consistently fail on timeout (sleep that's too short). it now takes more time because in #12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before #12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 12
number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```
When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.
The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.
The test is introduced in redis#12802.
Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.
…redis#13009) tests consistently fail on timeout (sleep that's too short). it now takes more time because in redis#12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before redis#12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 12
number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```
When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.
The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.
The test is introduced in redis#12802.
Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.
…redis#13009) tests consistently fail on timeout (sleep that's too short). it now takes more time because in redis#12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before redis#12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```
Introduced in #11695 .
The
tryResizeHashTablesfunction gets stuck on the last non-empty slot while iterating through dictionaries. It does not restart from the beginning. The reason for this issue is a problem with the usage ofdbIteratorNextDict:When iterating to the last non-empty slot,
next_slotis set to -1, causing it to loop indefinitely on that slot. We need to modify the code to ensure that after iterating to the last non-empty slot, it returns to the first non-empty slot.BTW, function
tryResizeHashTablesis actually iterating over slots that have keys. However, in its implementation, it leverages thedbIterator(which is a key iterator) to obtain slot and dictionary information. While this approach works fine, but it is not very intuitive. This PR also improves readability by changing the iteration to directly iterate over slots, thereby enhancing clarity.