Skip to content

Conversation

@soloestoy
Copy link
Contributor

Introduced in #11695 .

The tryResizeHashTables function gets stuck on the last non-empty slot while iterating through dictionaries. It does not restart from the beginning. The reason for this issue is a problem with the usage of dbIteratorNextDict:

/* Returns next dictionary from the iterator, or NULL if iteration is complete. */
dict *dbIteratorNextDict(dbIterator *dbit) {
    if (dbit->next_slot == -1) return NULL;
    dbit->slot = dbit->next_slot;
    dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType);
    return dbGetDictFromIterator(dbit);
}

When iterating to the last non-empty slot, next_slot is set to -1, causing it to loop indefinitely on that slot. We need to modify the code to ensure that after iterating to the last non-empty slot, it returns to the first non-empty slot.

BTW, function tryResizeHashTables is actually iterating over slots that have keys. However, in its implementation, it leverages the dbIterator (which is a key iterator) to obtain slot and dictionary information. While this approach works fine, but it is not very intuitive. This PR also improves readability by changing the iteration to directly iterate over slots, thereby enhancing clarity.

@soloestoy soloestoy requested a review from oranagra November 23, 2023 07:49
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it's odd that this resize_cursor is -1 when the iteration is over. It's better to align the db iterator with SCAN cursors and return 0 when the iteration is done. When starting again, it starts from the beginning.

I commented on it in #11695 (comment) but it was only a bit vague in the end of another comment.

If we do this change, more weird things like #12754 (comment) can be fixed. It's more intuitive to call dbIteratorInitFromSlot with the slot you want instead of slot - 1.

Even dbGetNextNonEmptySlot can return 0 instead of -1 when there are not more slots. We can allow the current slot to be 0 but the next slot can never be 0. WDYT?

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM, i agree the use of a dbGetDictFromIterator to exploit it's internals without actually using it for iteration was odd.
i'd like @hpatro to comment as well.

p.s. can / should we add a test for that case?

@soloestoy
Copy link
Contributor Author

@zuiderkwast

I thought it's odd that this resize_cursor is -1 when the iteration is over. It's better to align the db iterator with SCAN cursors and return 0 when the iteration is done. When starting again, it starts from the beginning.

I commented on it in #11695 (comment) but it was only a bit vague in the end of another comment.

If we do this change, more weird things like #12754 (comment) can be fixed. It's more intuitive to call dbIteratorInitFromSlot with the slot you want instead of slot - 1.

Even dbGetNextNonEmptySlot can return 0 instead of -1 when there are not more slots. We can allow the current slot to be 0 but the next slot can never be 0. WDYT?

I think dbGetNextNonEmptySlot and scan are different. For scan, we proactively start the iteration with 0 as the beginning cursor, which is guaranteed to be a valid cursor. However, for dbGetNextNonEmptySlot, we initially don't have a specific slot and need to obtain a valid slot through dbIteratorNextDict. If returning 0 indicates the end, how do we differentiate if the first non-empty slot is also 0?

Not sure if I fully understand your point. Would you like to open another PR to attempt implementing it and clarify your intentions?

@zuiderkwast
Copy link
Contributor

Not sure if I fully understand your point. Would you like to open another PR to attempt implementing it and clarify your intentions?

@soloestoy Maybe I can do that later. I will not insist in changing this iterator right now. Please go ahead and merge this PR.

I can just try to clarify what I mean:

However, for dbGetNextNonEmptySlot, we initially don't have a specific slot and need to obtain a valid slot through dbIteratorNextDict. If returning 0 indicates the end, how do we differentiate if the first non-empty slot is also 0?

We could allow the current slot to be 0 but the next slot cannot be 0. If next slot is 0 it means we reached the end. (That's somewhat how we treat SCAN's return value 0 vs SCAN input cursor 0.)

It should be clear how the iterator should be used; if next should be called before the first iteration, like in while (next()) or only in the end of each iteration like in a for loop (for (init(); isValid(); next())). It doesn't seem very clear right now.

We have functions for current slot: dbGetDictFromIterator and dbIteratorGetCurrentSlot. We have dbIteratorNextDict to skip to the next non-empty slot. To initialize, we have dbIteratorInit which finds the first non-empty slot and dbIteratorInitFromSlot which doesn't check that the given slot is non-empty. It's a bit asymmetric.

@hpatro
Copy link
Contributor

hpatro commented Nov 28, 2023

@soloestoy Curious what helped with the discovery of the issue and if you could add some test cases for the edge case it would be helpful.

@soloestoy
Copy link
Contributor Author

Curious what helped with the discovery of the issue and if you could add some test cases for the edge case it would be helpful.

@hpatro I found this issue while reviewing the code. Adding test cases will indeed help us identify problems earlier. I have added the test case. Please re-approve @oranagra .

@soloestoy soloestoy merged commit a1c5171 into redis:unstable Nov 28, 2023
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jan 12, 2024
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 12
 number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```

When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.

The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.

The test is introduced in redis#12802.
oranagra pushed a commit that referenced this pull request Jan 17, 2024
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 12
 number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```

When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.

The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.

The test is introduced in #12802.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jan 18, 2024
Before redis#12850, we will only try to shrink the dict in serverCron,
which we can control by using a child process, but now every time
we delete a key, the shrink check will be called.

In these test (added in redis#12802), we meant to disable the resizing,
but druing the delete, the dict will meet the force shrink, like
2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and
will cause the test to fail.

In this commit, we try to keep the load factor at 3 / 128 = 0.023,
that is, do not meet the force shrink.
oranagra pushed a commit that referenced this pull request Jan 18, 2024
Before #12850, we will only try to shrink the dict in serverCron,
which we can control by using a child process, but now every time
we delete a key, the shrink check will be called.

In these test (added in #12802), we meant to disable the resizing,
but druing the delete, the dict will meet the force shrink, like
2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and
will cause the test to fail.

In this commit, we try to keep the load factor at 3 / 128 = 0.023,
that is, do not meet the force shrink.
oranagra added a commit that referenced this pull request Jan 30, 2024
tests consistently fail on timeout (sleep that's too short).
it now takes more time because in #12819 we iterate on all dicts, not
just non-empty ones.
it passed the PR's CI because it skips the `slow` tag, which might have
been misplaced, but now it is probably required.
with the fix, the tests take quite a lot of time:
```
[ok]: Redis can trigger resizing (1860 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (744 ms)
```
before #12819:
```
[ok]: Redis can trigger resizing (309 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (295 ms)
```

failure:
https://github.com/redis/redis/actions/runs/7704158180/job/20995931735
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
*** [err]: Redis can trigger resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 128
 number of elements: 5
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) 
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 256
 number of elements: 10
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) 
```
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 12
 number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```

When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.

The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.

The test is introduced in redis#12802.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
Before redis#12850, we will only try to shrink the dict in serverCron,
which we can control by using a child process, but now every time
we delete a key, the shrink check will be called.

In these test (added in redis#12802), we meant to disable the resizing,
but druing the delete, the dict will meet the force shrink, like
2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and
will cause the test to fail.

In this commit, we try to keep the load factor at 3 / 128 = 0.023,
that is, do not meet the force shrink.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…redis#13009)

tests consistently fail on timeout (sleep that's too short).
it now takes more time because in redis#12819 we iterate on all dicts, not
just non-empty ones.
it passed the PR's CI because it skips the `slow` tag, which might have
been misplaced, but now it is probably required.
with the fix, the tests take quite a lot of time:
```
[ok]: Redis can trigger resizing (1860 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (744 ms)
```
before redis#12819:
```
[ok]: Redis can trigger resizing (309 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (295 ms)
```

failure:
https://github.com/redis/redis/actions/runs/7704158180/job/20995931735
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
*** [err]: Redis can trigger resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 128
 number of elements: 5
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) 
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 256
 number of elements: 10
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) 
```
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 12
 number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```

When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.

The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.

The test is introduced in redis#12802.
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
Before redis#12850, we will only try to shrink the dict in serverCron,
which we can control by using a child process, but now every time
we delete a key, the shrink check will be called.

In these test (added in redis#12802), we meant to disable the resizing,
but druing the delete, the dict will meet the force shrink, like
2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and
will cause the test to fail.

In this commit, we try to keep the load factor at 3 / 128 = 0.023,
that is, do not meet the force shrink.
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…redis#13009)

tests consistently fail on timeout (sleep that's too short).
it now takes more time because in redis#12819 we iterate on all dicts, not
just non-empty ones.
it passed the PR's CI because it skips the `slow` tag, which might have
been misplaced, but now it is probably required.
with the fix, the tests take quite a lot of time:
```
[ok]: Redis can trigger resizing (1860 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (744 ms)
```
before redis#12819:
```
[ok]: Redis can trigger resizing (309 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (295 ms)
```

failure:
https://github.com/redis/redis/actions/runs/7704158180/job/20995931735
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
*** [err]: Redis can trigger resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 128
 number of elements: 5
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test) 
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 256
 number of elements: 10
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test) 
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants