Use Reservoir Sampling for random sampling of dict, and fix hang during fork #12276

sundb · 2023-06-07T06:04:40Z

Issue

When a dict has a long chain or the length of the chain is longer than the number of samples, we will never be able to sample the elements at the end of the chain using dictGetSomeKeys().
This means that there's an unfair random sampling affecting: RANDOMKEY, SRANDMEMBER, SPOP, ZRANDMEMBER, HRANDFIELD, and the eviction mechanism.

Severe issue

This could mean that SRANDMEMBER, ZRANDMEMBER and HRANDFIELD (when used with <count>) can be hang in and endless loop.
The most severe case, is the pathological case of when someone uses SCAN+DEL or SSCAN+SREM creating an unevenly distributed dict.

Recent regression

The above was amplified by the recent change in #11692 which prevented a down-sizing rehashing while there is a fork.
So in a pathological case of SSCAN+SREM causing an uneven hash table, followed by down-sizing of the hash table, if the rehashing would be suspended during a fork, and then calling SRANDMEMBER with <count> can lead to a hang.

Solution

Before, we will stop sampling when we reach the maximum number of samples, even if there is more data after the current chain.
Now when we reach the maximum we use the Reservoir Sampling algorithm to fairly sample the end of the chain that cannot be sampled
Fix the rehashing code, so that the same as it allows rehashing for up-sizing during fork when the ratio is extreme, it will allow it for down-sizing as well.

Fix #12200

oranagra

neat! LGTM.
calling @inbaryuval to comment about the algorithm fairness.

@sundb don't we also wanna fix the rehash condition that we discussed here #12200 (comment)
i'd imagine that needs to be backported to anywhere were the issue was taken to, but considering the fix here is rather safe, maybe we can backport that as well.

oranagra · 2023-06-12T07:41:22Z

btw, maybe it would be nice to create a test (one that takes the dict to the extreme using SCAN+DEL) and checks the random fairness.

sundb · 2023-06-12T09:36:57Z

btw, maybe it would be nice to create a test (one that takes the dict to the extreme using SCAN+DEL) and checks the random fairness.

Because adding this test requires a fix for #12200 (comment) first, but I think this fix for #issuecomment-1563003032 should require a separate PR for backport, anyway I'll add the test in this PR first.

sundb · 2023-06-14T11:18:58Z

@oranagra Added the test, please wait for me to run it for several hours.
This algorithm seems fair, and this test will easily fail when I change unsigned long r = randomULong() % (stored + 1); to unsigned long r = randomULong() % (stored + 10);.

oranagra

didn't review the test yet

oranagra

i've added more details to the top comment, feel free to fix / extend.

tests/unit/type/set.tcl

Co-authored-by: Oran Agra <oran@redislabs.com>

oranagra · 2023-06-16T12:47:46Z

@sundb this is ready right? i can remove the WIP and merge it...
please have a look at the edited top comment and see if anything requires a change.
thanks a lot!!!

sundb · 2023-06-16T14:29:34Z

the top comment is ok and it's ready.
I can't be in front of the computer now, please help to remove WIP, thx.

LiorKogan · 2023-06-19T15:42:02Z

@sundb the following may produce a biased non-uniformaly distributed value :
unsigned long r = randomULong() % (stored + 1);

(See here; the problem is less severe for 64-bits, but still exists, especially when stored is large).

One possible solution based on randomULong: here.

LiorKogan · 2023-06-19T15:45:14Z

For the reference, there are more efficient sampling algorithms than reservoir sampling - see here. I'm not sure if it worth the trouble here.

sundb · 2023-06-20T02:55:05Z

@LiorKogan Thank you for figue out.
Admittedly, randomULong() is not as evenly distributed as std::uniform_int_distribution, but it has been used extensively in other sampling and has proven to be adequate.
I chose the current algorithm because in most scenarios, the length of our dict bucket chain is not too large (the number of samples is always larger than the chain length), and I benchmarked that the sampling performance will be the same as the code before this PR, I do not want to use overly complex algorithms to reduce the sampling performance, which may be more random and evenly distributed.

This was introduced by the recent change in #11692 which prevented a down-sizing rehashing while there is a fork. ## Solution 1. Fix the rehashing code, so that the same as it allows rehashing for up-sizing during fork when the ratio is extreme, it will allow it for down-sizing as well. Co-authored-by: Oran Agra <oran@redislabs.com> This is a partial cherry pick of: (cherry picked from commit b00a235) (cherry picked from commit d4c37320382edb342292a3e30250d46896a12016)

…ng fork (#12276) ## Issue: When a dict has a long chain or the length of the chain is longer than the number of samples, we will never be able to sample the elements at the end of the chain using dictGetSomeKeys(). This could mean that SRANDMEMBER can be hang in and endless loop. The most severe case, is the pathological case of when someone uses SCAN+DEL or SSCAN+SREM creating an unevenly distributed dict. This was amplified by the recent change in #11692 which prevented a down-sizing rehashing while there is a fork. ## Solution 1. Before, we will stop sampling when we reach the maximum number of samples, even if there is more data after the current chain. Now when we reach the maximum we use the Reservoir Sampling algorithm to fairly sample the end of the chain that cannot be sampled 2. Fix the rehashing code, so that the same as it allows rehashing for up-sizing during fork when the ratio is extreme, it will allow it for down-sizing as well. Issue was introduced (or became more severe) by #11692 Co-authored-by: Oran Agra <oran@redislabs.com> (cherry picked from commit b00a235)

Use Reservoir Sampling algorithm for optimized random sampling of dict

2db33ef

oranagra approved these changes Jun 12, 2023

View reviewed changes

sundb added 2 commits June 14, 2023 19:11

Fix shrinkage rehashing and add test

978dac2

Avoid dead loops due to too few members

539247c

oranagra approved these changes Jun 14, 2023

View reviewed changes

oranagra reviewed Jun 14, 2023

View reviewed changes

sundb and others added 7 commits June 16, 2023 10:15

Update tests/unit/type/set.tcl

1a58b0f

Co-authored-by: Oran Agra <oran@redislabs.com>

Update tests/unit/type/set.tcl

e0862e9

Co-authored-by: Oran Agra <oran@redislabs.com>

Update tests/unit/type/set.tcl

7c26536

Co-authored-by: Oran Agra <oran@redislabs.com>

Update tests/unit/type/set.tcl

fd75632

Co-authored-by: Oran Agra <oran@redislabs.com>

Fix test failure due to reuse of previous myset

097b3d1

Update tests/unit/type/set.tcl

38e5558

Co-authored-by: Oran Agra <oran@redislabs.com>

Add assert for expected numbers of slots and length of bucket chain

5ab49e5

oranagra changed the title ~~WIP - Use Reservoir Sampling algorithm for optimized random sampling of dict~~ Use Reservoir Sampling for random sampling of dict, and fix hang during fork Jun 16, 2023

oranagra merged commit b00a235 into redis:unstable Jun 16, 2023

oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jun 16, 2023

sundb deleted the sample.long.chain branch June 19, 2023 02:24

This was referenced Jul 10, 2023

Release 6.0.20 #12399

Merged

Release 6.2.13 #12400

Merged

Release 7.0.12 #12401

Merged

Relase 7.2-RC3 #12402

Merged

stondino00 mentioned this pull request Jul 10, 2023

Upgrade Redis to 6.2.13 nextcloud-snap/nextcloud-snap#2448

Closed

oranagra mentioned this pull request Jan 17, 2024

Change the threshold of dict expand, shrink and rehash #12948

Merged

Use Reservoir Sampling for random sampling of dict, and fix hang during fork #12276

Use Reservoir Sampling for random sampling of dict, and fix hang during fork #12276

Uh oh!

Conversation

sundb commented Jun 7, 2023 • edited by oranagra Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Severe issue

Recent regression

Solution

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

oranagra commented Jun 12, 2023

Uh oh!

sundb commented Jun 12, 2023

Uh oh!

sundb commented Jun 14, 2023

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

oranagra commented Jun 16, 2023

Uh oh!

sundb commented Jun 16, 2023

Uh oh!

LiorKogan commented Jun 19, 2023

Uh oh!

LiorKogan commented Jun 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sundb commented Jun 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sundb commented Jun 7, 2023 •

edited by oranagra

Loading

LiorKogan commented Jun 19, 2023 •

edited

Loading

sundb commented Jun 20, 2023 •

edited

Loading