Conversation
There was a problem hiding this comment.
https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ to avoid division on runtime value.
There was a problem hiding this comment.
Ok, time to do some micro benchmarks :)
There was a problem hiding this comment.
What about just using pcg64?
There was a problem hiding this comment.
It's borrowed from https://github.com/yandex/ClickHouse/blob/master/dbms/src/AggregateFunctions/ReservoirSampler.h#L205 and I noticed a fast suffix... I thought this was an optimization you did in be79166
There was a problem hiding this comment.
be79166 was just a modernization of code base. Without much focus on performance on this particular function.
There was a problem hiding this comment.
Separate rng for every state looks too much. We can use thread_local_rng.
There was a problem hiding this comment.
But do we need deterministic result?
There was a problem hiding this comment.
Perhaps we can have a very unrandom sampler and a truely uniform sampler.
But do we need deterministic result?
I don't know and I'm always curious about the MergeTree's sample clause and it's deterministic characteristic.
There was a problem hiding this comment.
thread_local_rng will be as random as separate rngs.
I'm always curious about the MergeTree's sample clause and it's deterministic characteristic.
Yes, but for groupArraySample it's not enough to just use deterministic rng, you should also run in single thread to get deterministic result. But it's not practical.
There was a problem hiding this comment.
I also have a thought about "groupArrayLast" :)
There was a problem hiding this comment.
Could you elaborate? what elements should be victimized in such case
There was a problem hiding this comment.
groupArrayLast is intended to select "last" elements. It can be implemented as a cycle buffer.
There was a problem hiding this comment.
I found it useful for one of my use cases, so here it is - #44521
There was a problem hiding this comment.
Could you please also add performance test?
It will be very interesting to test various sampling methods.
For example, I still prefer something more simple and non-uniform like:
fill array, then replace every element with 1/2 probability, then replace every element with 1/3 probability, etc...
There was a problem hiding this comment.
Yeah, I'll try adding more samplers and make perf comp tests
There was a problem hiding this comment.
We can do it on create.
There was a problem hiding this comment.
ah, missed that interface :)
There was a problem hiding this comment.
Looks very painful. Maybe rng is POD and we can just writePODBinary?
There was a problem hiding this comment.
unfortunately it's not POD
There was a problem hiding this comment.
Technically it's not POD but it should behave this way.
There was a problem hiding this comment.
oops, testing code
120c261 to
a61a62e
Compare
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):
support reservior sampling for groupArray -- groupArraySample.
Currently the implementation is not uniform, and all groups have the RNG seeded in the same way. More variants can be added in the future.