Skip to content

Conversation

@sundb
Copy link
Collaborator

@sundb sundb commented Jun 7, 2023

Issue

When a dict has a long chain or the length of the chain is longer than the number of samples, we will never be able to sample the elements at the end of the chain using dictGetSomeKeys().
This means that there's an unfair random sampling affecting: RANDOMKEY, SRANDMEMBER, SPOP, ZRANDMEMBER, HRANDFIELD, and the eviction mechanism.

Severe issue

This could mean that SRANDMEMBER, ZRANDMEMBER and HRANDFIELD (when used with <count>) can be hang in and endless loop.
The most severe case, is the pathological case of when someone uses SCAN+DEL or SSCAN+SREM creating an unevenly distributed dict.

Recent regression

The above was amplified by the recent change in #11692 which prevented a down-sizing rehashing while there is a fork.
So in a pathological case of SSCAN+SREM causing an uneven hash table, followed by down-sizing of the hash table, if the rehashing would be suspended during a fork, and then calling SRANDMEMBER with <count> can lead to a hang.

Solution

  1. Before, we will stop sampling when we reach the maximum number of samples, even if there is more data after the current chain.
    Now when we reach the maximum we use the Reservoir Sampling algorithm to fairly sample the end of the chain that cannot be sampled
  2. Fix the rehashing code, so that the same as it allows rehashing for up-sizing during fork when the ratio is extreme, it will allow it for down-sizing as well.

Fix #12200

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat! LGTM.
calling @inbaryuval to comment about the algorithm fairness.

@sundb don't we also wanna fix the rehash condition that we discussed here #12200 (comment)
i'd imagine that needs to be backported to anywhere were the issue was taken to, but considering the fix here is rather safe, maybe we can backport that as well.

@oranagra
Copy link
Member

btw, maybe it would be nice to create a test (one that takes the dict to the extreme using SCAN+DEL) and checks the random fairness.

@sundb
Copy link
Collaborator Author

sundb commented Jun 12, 2023

btw, maybe it would be nice to create a test (one that takes the dict to the extreme using SCAN+DEL) and checks the random fairness.

Because adding this test requires a fix for #12200 (comment) first, but I think this fix for #issuecomment-1563003032 should require a separate PR for backport, anyway I'll add the test in this PR first.

@sundb
Copy link
Collaborator Author

sundb commented Jun 14, 2023

@oranagra Added the test, please wait for me to run it for several hours.
This algorithm seems fair, and this test will easily fail when I change unsigned long r = randomULong() % (stored + 1); to unsigned long r = randomULong() % (stored + 10);.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't review the test yet

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've added more details to the top comment, feel free to fix / extend.

sundb and others added 7 commits June 16, 2023 10:15
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra
Copy link
Member

@sundb this is ready right? i can remove the WIP and merge it...
please have a look at the edited top comment and see if anything requires a change.
thanks a lot!!!

@sundb
Copy link
Collaborator Author

sundb commented Jun 16, 2023

the top comment is ok and it's ready.
I can't be in front of the computer now, please help to remove WIP, thx.

@oranagra oranagra changed the title WIP - Use Reservoir Sampling algorithm for optimized random sampling of dict Use Reservoir Sampling for random sampling of dict, and fix hang during fork Jun 16, 2023
@oranagra oranagra merged commit b00a235 into redis:unstable Jun 16, 2023
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jun 16, 2023
@sundb sundb deleted the sample.long.chain branch June 19, 2023 02:24
@LiorKogan
Copy link
Member

@sundb the following may produce a biased non-uniformaly distributed value :
unsigned long r = randomULong() % (stored + 1);

(See here; the problem is less severe for 64-bits, but still exists, especially when stored is large).

One possible solution based on randomULong: here.

@LiorKogan
Copy link
Member

LiorKogan commented Jun 19, 2023

For the reference, there are more efficient sampling algorithms than reservoir sampling - see here. I'm not sure if it worth the trouble here.

@sundb
Copy link
Collaborator Author

sundb commented Jun 20, 2023

@LiorKogan Thank you for figue out.
Admittedly, randomULong() is not as evenly distributed as std::uniform_int_distribution, but it has been used extensively in other sampling and has proven to be adequate.
I chose the current algorithm because in most scenarios, the length of our dict bucket chain is not too large (the number of samples is always larger than the chain length), and I benchmarked that the sampling performance will be the same as the code before this PR, I do not want to use overly complex algorithms to reduce the sampling performance, which may be more random and evenly distributed.

oranagra pushed a commit that referenced this pull request Jul 10, 2023
This was introduced by the recent change in #11692 which prevented a
down-sizing rehashing while there is a fork.

## Solution
1. Fix the rehashing code, so that the same as it allows rehashing for up-sizing
  during fork when the ratio is extreme, it will allow it for down-sizing as well.

Co-authored-by: Oran Agra <oran@redislabs.com>

This is a partial cherry pick of:
(cherry picked from commit b00a235)
(cherry picked from commit d4c37320382edb342292a3e30250d46896a12016)
oranagra pushed a commit that referenced this pull request Jul 10, 2023
This was introduced by the recent change in #11692 which prevented a
down-sizing rehashing while there is a fork.

## Solution
1. Fix the rehashing code, so that the same as it allows rehashing for up-sizing
  during fork when the ratio is extreme, it will allow it for down-sizing as well.

Co-authored-by: Oran Agra <oran@redislabs.com>

This is a partial cherry pick of:
(cherry picked from commit b00a235)
(cherry picked from commit d4c37320382edb342292a3e30250d46896a12016)
oranagra added a commit that referenced this pull request Jul 10, 2023
…ng fork (#12276)

## Issue:
When a dict has a long chain or the length of the chain is longer than
the number of samples, we will never be able to sample the elements
at the end of the chain using dictGetSomeKeys().
This could mean that SRANDMEMBER can be hang in and endless loop.
The most severe case, is the pathological case of when someone uses SCAN+DEL
or SSCAN+SREM creating an unevenly distributed dict.
This was amplified by the recent change in #11692 which prevented a
down-sizing rehashing while there is a fork.

## Solution
1. Before, we will stop sampling when we reach the maximum number
  of samples, even if there is more data after the current chain.
  Now when we reach the maximum we use the Reservoir Sampling
  algorithm to fairly sample the end of the chain that cannot be sampled
2. Fix the rehashing code, so that the same as it allows rehashing for up-sizing
  during fork when the ratio is extreme, it will allow it for down-sizing as well.

Issue was introduced (or became more severe) by #11692

Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit b00a235)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG] Hang in SRANDMEMBER call during CoW fork saving

3 participants