add weighted reservoir sampling (AE-S)#44
Conversation
|
lgtm 👍 |
|
I’m convinced Toby just likes implementing stuff from papers he finds on the internet ;) |
There was a problem hiding this comment.
I'm always a little confused by this. If a type which is actually a slice, such as WeightedValueHeap, is used, does it even makes sense to refer to it using pointer? Or better to just do func (h WeightedValueHeap)...? Is there some standardized way of treating these types?
|
Curious why this is useful for split key determination. Doesn't this imply that, if all of the key-values are the same length, then a split becomes uniformly random? It seems to me that, in that case, you'd want the split to be preferentially near the middle of the data (if not exactly near the middle). |
util/sampling.go
Outdated
There was a problem hiding this comment.
Instead of this method, why not make just the plain old NewWeightedReservoirSample allow minHeap to be specified as nil in which case it creates a WeightedValueHeap as the default...
|
LGTM |
|
@toddlipcon: the idea is to end up with a weighted sampling of all the keys in a range by the total size in bytes of the key + value. Let's say you kept a sample of 100 key/values. When finished, you'd pull all 100 keys out of the sample, sort them, and take the 50th as the split key. |
|
that still depends on a linear scan over all of the keys in order to construct the reservoir sample. If you're going to scan the whole range, couldn't you do this exactly with the simple algorithm (pseudocode): total_size = sum the byte size of all files in the range or something of that sort? Or, assuming you have a tree structure key index in your SSTables, you could probably do this with a logarithmic number of seeks instead of a linear scan. |
|
Yep, that would work and certainly has the benefit of simplicity and would Tobias, up to you how you want to do it. On Wed, Sep 10, 2014 at 11:48 AM, Todd Lipcon notifications@github.com
|
|
Leaving it as is for now, to be reconsidered as stats evolve. |
address feedback from @spencerkimball final commit before merge
add weighted reservoir sampling (AE-S)
minor refactoring
Source commit: 4c61bb667fdd42452a5040622cc3512ec4ba7d21 Source PR: cockroachdb/cockroach#44
Implemented algorithm A-Res from http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, offering a reservoir sampling algorithm with the chance of each element being in the final sample being proportional to the (positive) weight assigned to each element.
If need arises we will fairly easily be able to implement A-ExpJ using this (but I don't think we'll ever need it) to save some random numbers.
The careful reviewer will observe that really I would have wanted a priority queue instead of an abstract heap (to get "peek") but in this version the exposition is clearer and the performance hit isn't more than a slightly worse constant depending on the reservoir size.
A-Res will be used to generate the split key when it is needed, taking the size of key-value pairs as their weight. Other metrics are possible, allowing for a range of interesting split key criteria.