Fix resize hash tables stuck on the last non-empty slot #12802

soloestoy · 2023-11-23T07:49:14Z

Introduced in #11695 .

The tryResizeHashTables function gets stuck on the last non-empty slot while iterating through dictionaries. It does not restart from the beginning. The reason for this issue is a problem with the usage of dbIteratorNextDict:

/* Returns next dictionary from the iterator, or NULL if iteration is complete. */
dict *dbIteratorNextDict(dbIterator *dbit) {
    if (dbit->next_slot == -1) return NULL;
    dbit->slot = dbit->next_slot;
    dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType);
    return dbGetDictFromIterator(dbit);
}

When iterating to the last non-empty slot, next_slot is set to -1, causing it to loop indefinitely on that slot. We need to modify the code to ensure that after iterating to the last non-empty slot, it returns to the first non-empty slot.

BTW, function tryResizeHashTables is actually iterating over slots that have keys. However, in its implementation, it leverages the dbIterator (which is a key iterator) to obtain slot and dictionary information. While this approach works fine, but it is not very intuitive. This PR also improves readability by changing the iteration to directly iterate over slots, thereby enhancing clarity.

zuiderkwast

I thought it's odd that this resize_cursor is -1 when the iteration is over. It's better to align the db iterator with SCAN cursors and return 0 when the iteration is done. When starting again, it starts from the beginning.

I commented on it in #11695 (comment) but it was only a bit vague in the end of another comment.

If we do this change, more weird things like #12754 (comment) can be fixed. It's more intuitive to call dbIteratorInitFromSlot with the slot you want instead of slot - 1.

Even dbGetNextNonEmptySlot can return 0 instead of -1 when there are not more slots. We can allow the current slot to be 0 but the next slot can never be 0. WDYT?

oranagra

mostly LGTM, i agree the use of a dbGetDictFromIterator to exploit it's internals without actually using it for iteration was odd.
i'd like @hpatro to comment as well.

p.s. can / should we add a test for that case?

src/server.c

soloestoy · 2023-11-27T11:16:29Z

@zuiderkwast

I thought it's odd that this resize_cursor is -1 when the iteration is over. It's better to align the db iterator with SCAN cursors and return 0 when the iteration is done. When starting again, it starts from the beginning.

I commented on it in #11695 (comment) but it was only a bit vague in the end of another comment.

If we do this change, more weird things like #12754 (comment) can be fixed. It's more intuitive to call dbIteratorInitFromSlot with the slot you want instead of slot - 1.

Even dbGetNextNonEmptySlot can return 0 instead of -1 when there are not more slots. We can allow the current slot to be 0 but the next slot can never be 0. WDYT?

I think dbGetNextNonEmptySlot and scan are different. For scan, we proactively start the iteration with 0 as the beginning cursor, which is guaranteed to be a valid cursor. However, for dbGetNextNonEmptySlot, we initially don't have a specific slot and need to obtain a valid slot through dbIteratorNextDict. If returning 0 indicates the end, how do we differentiate if the first non-empty slot is also 0?

Not sure if I fully understand your point. Would you like to open another PR to attempt implementing it and clarify your intentions?

zuiderkwast · 2023-11-27T21:19:03Z

Not sure if I fully understand your point. Would you like to open another PR to attempt implementing it and clarify your intentions?

@soloestoy Maybe I can do that later. I will not insist in changing this iterator right now. Please go ahead and merge this PR.

I can just try to clarify what I mean:

However, for dbGetNextNonEmptySlot, we initially don't have a specific slot and need to obtain a valid slot through dbIteratorNextDict. If returning 0 indicates the end, how do we differentiate if the first non-empty slot is also 0?

We could allow the current slot to be 0 but the next slot cannot be 0. If next slot is 0 it means we reached the end. (That's somewhat how we treat SCAN's return value 0 vs SCAN input cursor 0.)

It should be clear how the iterator should be used; if next should be called before the first iteration, like in while (next()) or only in the end of each iteration like in a for loop (for (init(); isValid(); next())). It doesn't seem very clear right now.

We have functions for current slot: dbGetDictFromIterator and dbIteratorGetCurrentSlot. We have dbIteratorNextDict to skip to the next non-empty slot. To initialize, we have dbIteratorInit which finds the first non-empty slot and dbIteratorInitFromSlot which doesn't check that the given slot is non-empty. It's a bit asymmetric.

src/server.c

hpatro · 2023-11-28T00:34:22Z

@soloestoy Curious what helped with the discovery of the issue and if you could add some test cases for the edge case it would be helpful.

soloestoy · 2023-11-28T09:43:03Z

Curious what helped with the discovery of the issue and if you could add some test cases for the edge case it would be helpful.

@hpatro I found this issue while reviewing the code. Adding test cases will indeed help us identify problems earlier. I have added the test case. Please re-approve @oranagra .

The test have a race: ``` *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 12 number of elements: 2 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) ``` When `r del "{alice}$j"` is executed in the loop, when the key is deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL, if serverCron happens to trigger slot dict resize, then the test will fail. Because there is not way to meet HASHTABLE_MIN_FILL in the subsequent dels. The solution is to avoid triggering the resize in advance. We can use multi to delete them at once, or we can disable the resize. Since we disabled resize in the previous test, the fix also uses the method of disabling resize. The test is introduced in redis#12802.

The test have a race: ``` *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 12 number of elements: 2 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) ``` When `r del "{alice}$j"` is executed in the loop, when the key is deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL, if serverCron happens to trigger slot dict resize, then the test will fail. Because there is not way to meet HASHTABLE_MIN_FILL in the subsequent dels. The solution is to avoid triggering the resize in advance. We can use multi to delete them at once, or we can disable the resize. Since we disabled resize in the previous test, the fix also uses the method of disabling resize. The test is introduced in #12802.

Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.

Before #12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in #12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.

tests consistently fail on timeout (sleep that's too short). it now takes more time because in #12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before #12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```

The test have a race: ``` *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 12 number of elements: 2 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) ``` When `r del "{alice}$j"` is executed in the loop, when the key is deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL, if serverCron happens to trigger slot dict resize, then the test will fail. Because there is not way to meet HASHTABLE_MIN_FILL in the subsequent dels. The solution is to avoid triggering the resize in advance. We can use multi to delete them at once, or we can disable the resize. Since we disabled resize in the previous test, the fix also uses the method of disabling resize. The test is introduced in redis#12802.

Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.

…redis#13009) tests consistently fail on timeout (sleep that's too short). it now takes more time because in redis#12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before redis#12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```

The test have a race: ``` *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 12 number of elements: 2 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) ``` When `r del "{alice}$j"` is executed in the loop, when the key is deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL, if serverCron happens to trigger slot dict resize, then the test will fail. Because there is not way to meet HASHTABLE_MIN_FILL in the subsequent dels. The solution is to avoid triggering the resize in advance. We can use multi to delete them at once, or we can disable the resize. Since we disabled resize in the previous test, the fix also uses the method of disabling resize. The test is introduced in redis#12802.

Before redis#12850, we will only try to shrink the dict in serverCron, which we can control by using a child process, but now every time we delete a key, the shrink check will be called. In these test (added in redis#12802), we meant to disable the resizing, but druing the delete, the dict will meet the force shrink, like 2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and will cause the test to fail. In this commit, we try to keep the load factor at 3 / 128 = 0.023, that is, do not meet the force shrink.

…redis#13009) tests consistently fail on timeout (sleep that's too short). it now takes more time because in redis#12819 we iterate on all dicts, not just non-empty ones. it passed the PR's CI because it skips the `slow` tag, which might have been misplaced, but now it is probably required. with the fix, the tests take quite a lot of time: ``` [ok]: Redis can trigger resizing (1860 ms) [ok]: Redis can rewind and trigger smaller slot resizing (744 ms) ``` before redis#12819: ``` [ok]: Redis can trigger resizing (309 ms) [ok]: Redis can rewind and trigger smaller slot resizing (295 ms) ``` failure: https://github.com/redis/redis/actions/runs/7704158180/job/20995931735 ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. *** [err]: Redis can trigger resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 128 number of elements: 5 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) *** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl Expected '[Dictionary HT] Hash table 0 stats (main hash table): table size: 256 number of elements: 10 [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) ```

Fix resize hash tables stuck on the last non-empty slot

58e2bf2

soloestoy requested a review from oranagra November 23, 2023 07:49

zuiderkwast reviewed Nov 23, 2023

View reviewed changes

oranagra reviewed Nov 23, 2023

View reviewed changes

src/server.c Outdated Show resolved Hide resolved

src/server.c Show resolved Hide resolved

use continue

fff9a92

oranagra approved these changes Nov 27, 2023

View reviewed changes

hpatro reviewed Nov 28, 2023

View reviewed changes

src/server.c Show resolved Hide resolved

src/server.c Outdated Show resolved Hide resolved

soloestoy added 2 commits November 28, 2023 10:44

remove unused dbIteratorInitFromSlot

f44f118

add test

b2843e4

oranagra approved these changes Nov 28, 2023

View reviewed changes

soloestoy merged commit a1c5171 into redis:unstable Nov 28, 2023

enjoy-binbin mentioned this pull request Jan 12, 2024

Fix race in slot dict resize test #12942

Merged

enjoy-binbin mentioned this pull request Jan 18, 2024

Fix unexpected resize causing test failure #12960

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix resize hash tables stuck on the last non-empty slot #12802

Fix resize hash tables stuck on the last non-empty slot #12802

Uh oh!

soloestoy commented Nov 23, 2023

Uh oh!

zuiderkwast left a comment

Uh oh!

oranagra left a comment

Uh oh!

Uh oh!

Uh oh!

soloestoy commented Nov 27, 2023

Uh oh!

zuiderkwast commented Nov 27, 2023

Uh oh!

Uh oh!

Uh oh!

hpatro commented Nov 28, 2023

Uh oh!

soloestoy commented Nov 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix resize hash tables stuck on the last non-empty slot #12802

Fix resize hash tables stuck on the last non-empty slot #12802

Uh oh!

Conversation

soloestoy commented Nov 23, 2023

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

soloestoy commented Nov 27, 2023

Uh oh!

zuiderkwast commented Nov 27, 2023

Uh oh!

Uh oh!

Uh oh!

hpatro commented Nov 28, 2023

Uh oh!

soloestoy commented Nov 28, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants