
add weighted reservoir sampling (AE-S)#44

Merged
tbg merged 1 commit into cockroachdb:master from tbg:split
Sep 10, 2014

Conversation

@tbg
Member

@tbg tbg commented Sep 10, 2014

Implemented algorithm A-Res from http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf: a reservoir sampling algorithm in which each element's chance of appearing in the final sample is proportional to the (positive) weight assigned to it.
If the need arises, we will fairly easily be able to implement A-ExpJ on top of this (though I don't think we'll ever need it) to save some random numbers.
The careful reviewer will observe that I really wanted a priority queue rather than an abstract heap (to get "peek"), but in this version the exposition is clearer and the performance hit is no more than a slightly worse constant depending on the reservoir size.
A-Res will be used to generate the split key when one is needed, taking the size of key-value pairs as their weight. Other metrics are possible, allowing for a range of interesting split key criteria.

@tbg tbg assigned andybons and spencerkimball and unassigned andybons Sep 10, 2014
@andybons
Contributor

lgtm 👍

@andybons
Contributor

I’m convinced Toby just likes implementing stuff from papers he finds on the internet ;)

Member

I'm always a little confused by this. If a type which is actually a slice, such as WeightedValueHeap, is used, does it even make sense to refer to it via a pointer? Or is it better to just do func (h WeightedValueHeap)...? Is there some standardized way of treating these types?
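The usual Go convention: methods that only read a slice-backed type can take a value receiver (the slice header is copied but still points at the same backing array), while methods that grow or shrink it must take a pointer receiver, since append can reallocate and always updates the header's length. A minimal sketch (the method set here is illustrative, not the PR's exact code):

```go
package main

import "fmt"

// WeightedValueHeap is a slice-backed type, as in the PR under review.
type WeightedValueHeap []float64

// Len only reads the slice, so a value receiver suffices: the copied
// header still references the same backing array.
func (h WeightedValueHeap) Len() int { return len(h) }

// Push must use a pointer receiver: append may reallocate the backing
// array and always changes the header's length, and the caller's
// variable needs to observe both.
func (h *WeightedValueHeap) Push(x float64) { *h = append(*h, x) }

func main() {
	var h WeightedValueHeap
	h.Push(1.5) // Go takes &h automatically for addressable values
	h.Push(2.5)
	fmt.Println(h.Len()) // prints 2
}
```

This is also why container/heap's Push and Pop are conventionally defined on the pointer type while Len, Less, and Swap are defined on the value type.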

@toddlipcon

Curious why this is useful for split key determination. Doesn't this imply that, if all of the key-values are the same length, then a split becomes uniformly random? It seems to me that, in that case, you'd want the split to be preferentially near the middle of the data (if not exactly near the middle).

util/sampling.go Outdated
Member

Instead of this method, why not just have the plain old NewWeightedReservoirSample allow minHeap to be nil, in which case it creates a WeightedValueHeap as the default...

Member Author

good point, doing that

@spencerkimball
Member

LGTM

@spencerkimball
Member

@toddlipcon: the idea is to end up with a weighted sampling of all the keys in a range by the total size in bytes of the key + value. Let's say you kept a sample of 100 key/values. When finished, you'd pull all 100 keys out of the sample, sort them, and take the 50th as the split key.
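The procedure Spencer describes — pull the sampled keys out, sort them, take the middle one — can be sketched in a few lines of Go (`chooseSplitKey` is a hypothetical helper name, not code from this PR):

```go
package main

import (
	"fmt"
	"sort"
)

// chooseSplitKey picks the median of a weighted key sample: sort the
// sampled keys and return the middle one, per the procedure above.
func chooseSplitKey(sampledKeys []string) string {
	// Copy before sorting so the caller's sample is left untouched.
	sorted := append([]string(nil), sampledKeys...)
	sort.Strings(sorted)
	return sorted[len(sorted)/2]
}

func main() {
	sample := []string{"carrot", "apple", "banana", "egg", "date"}
	fmt.Println(chooseSplitKey(sample)) // prints "carrot"
}
```

Because the sample is weighted by key+value byte size, the median of the sample approximates the key that splits the range's bytes, not its key count, in half.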

@toddlipcon

That still depends on a linear scan over all of the keys in order to construct the reservoir sample. If you're going to scan the whole range, couldn't you do this exactly with the simple algorithm (pseudocode):

total_size = sum the byte size of all files in the range
scanner = new InternalScanner()
scanned_size = 0
while scanner.stats.bytes_scanned() < total_size / 2:
    scanner.next()

or something of that sort?

Or, assuming you have a tree structure key index in your SSTables, you could probably do this with a logarithmic number of seeks instead of a linear scan.
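Todd's scan-to-midpoint pseudocode, assuming the total byte size is known up front, amounts to something like the following Go sketch (the `kv` type and `findSplitByScan` name are illustrative, not CockroachDB code):

```go
package main

import "fmt"

// kv is a hypothetical key/value pair; its "size" is the byte length
// of the key plus the value.
type kv struct{ key, value string }

// findSplitByScan returns the first key at which the cumulative byte
// size of scanned pairs reaches half the total, mirroring the
// pseudocode above.
func findSplitByScan(pairs []kv) string {
	var total int
	for _, p := range pairs {
		total += len(p.key) + len(p.value)
	}
	scanned := 0
	for _, p := range pairs {
		scanned += len(p.key) + len(p.value)
		// Integer-safe form of: scanned >= total / 2.
		if 2*scanned >= total {
			return p.key
		}
	}
	return "" // empty range
}

func main() {
	pairs := []kv{{"a", "xx"}, {"b", "xx"}, {"c", "xx"}}
	fmt.Println(findSplitByScan(pairs)) // prints "b"
}
```

The trade-off discussed below is that this needs an accurate total size before the scan starts, which the sampling approach does not.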

@spencerkimball
Member

Yep, that would work; it certainly has the benefit of simplicity and would take half the time on the scan. We don't have a particularly accurate signal for total bytes in the range, and we don't have access (or at least don't want to try to ghetto-rig access) to the low-level RocksDB SST files. Still, the split key does not have to be anywhere near an exact choice, so we could use RocksDB's estimate for the size of the range.

Tobias, up to you how you want to do it.


@tbg
Member Author

tbg commented Sep 10, 2014

Leaving it as is for now, to be reconsidered as stats evolve.

address feedback from @spencerkimball

final commit before merge
tbg added a commit that referenced this pull request Sep 10, 2014
add weighted reservoir sampling (AE-S)
@tbg tbg merged commit 87044be into cockroachdb:master Sep 10, 2014
@tbg tbg deleted the split branch September 16, 2014 08:56
soniabhishek pushed a commit to soniabhishek/cockroach that referenced this pull request Feb 15, 2017
ebembi-crdb added a commit to ebembi-crdb/generated-diagrams that referenced this pull request Feb 19, 2026
Source commit: 4c61bb667fdd42452a5040622cc3512ec4ba7d21
Source PR: cockroachdb/cockroach#44