Skip to content

Conversation

@cyb70289
Copy link
Contributor

@cyb70289 cyb70289 commented Feb 7, 2021

Implement approximate quantile kernel using t-digest util.

@github-actions
Copy link

github-actions bot commented Feb 7, 2021

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious whether this is best exposed as an interpolation kind for the "quantile" function, or a separate function altogether. Are there precedents in other libraries or database engines?

It seems R uses a separate function:
https://www.rdocumentation.org/packages/tdigest/versions/0.3.0/topics/tdigest

cc @nealrichardson @michalursa for opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matlab quantile has a method parameter, which can be exact (linear interpolation) or approximate (t-digest).
https://www.mathworks.com/help/stats/quantile.html#mw_06fb6551-ec1a-48f6-a690-f790dfaa9849

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW that tdigest R function is not part of base R, it is a contributed package.

R's quantile function also supports multiple methods, 9 in fact, via the type parameter: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

None of these are approximate in the same way as this, but it's a further argument that it could be a function parameter rather than a separate function. TBH I don't know how much it matters, as a compute API consumer I can make either work. It's marginally easier if they're separate functions rather than having to mess with FunctionOptions to select them--are maybe there are downsides to that way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the t-digest method also has its own parameters (compression and delta). If we ever want to expose them, it would probably be less confusing as a separate function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like separate function makes the most sense unless there's a cost to having an additional function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the t-digest method also has its own parameters (compression and delta). If we ever want to expose them, it would probably be less confusing as a separate function.

Good point. I will separate it into another function.

Copy link
Contributor Author

@cyb70289 cyb70289 Feb 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added separated TDigest kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I've separated TDigest kernel, please review, thanks.

@cyb70289
Copy link
Contributor Author

appveyor complains protobuf namespace collision, should be fixed by pr #9542
mingw32 arrow-python-test timeout, still checking if related to this patch

@cyb70289 cyb70289 changed the title ARROW-11541: [C++][Compute] Implement approximate quantile kernel ARROW-11541: [C++][Compute] Implement tdigest kernel Feb 22, 2021
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Some comments below.

Also, it seems there's a performance problem here where TDigest is much slower than Quantile:

QuantileKernelDouble/1048576/10000       1104 us         1104 us          628 bytes_per_second=905.728M/s null_percent=0.01 size=1048.58k
QuantileKernelDouble/1048576/100         1242 us         1242 us          562 bytes_per_second=805.287M/s null_percent=1 size=1048.58k
QuantileKernelDouble/1048576/10           898 us          897 us          780 bytes_per_second=1114.26M/s null_percent=10 size=1048.58k
QuantileKernelDouble/1048576/2            864 us          863 us          810 bytes_per_second=1.13112G/s null_percent=50 size=1048.58k
QuantileKernelDouble/1048576/1          0.920 us        0.920 us       767094 bytes_per_second=1062.03G/s null_percent=100 size=1048.58k
QuantileKernelDouble/1048576/0           1222 us         1221 us          570 bytes_per_second=818.707M/s null_percent=0 size=1048.58k

TDigestKernel/1048576/10000              5933 us         5932 us          118 bytes_per_second=168.565M/s null_percent=0.01 size=1048.58k
TDigestKernel/1048576/100                5899 us         5898 us          119 bytes_per_second=169.563M/s null_percent=1 size=1048.58k
TDigestKernel/1048576/10                 5518 us         5517 us          127 bytes_per_second=181.242M/s null_percent=10 size=1048.58k
TDigestKernel/1048576/2                  3341 us         3341 us          210 bytes_per_second=299.355M/s null_percent=50 size=1048.58k
TDigestKernel/1048576/1                  1.58 us         1.58 us       440133 bytes_per_second=617.774G/s null_percent=100 size=1048.58k
TDigestKernel/1048576/0                  5942 us         5941 us          117 bytes_per_second=168.328M/s null_percent=0 size=1048.58k

(Ubuntu 20.04, clang 10)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it be better to partition at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No solid benchmark.
The idea is to drop the isnan check in Add function. When input buffer is full, partition NaNs to the buffer end and ignore them in sorting and merging. Guess the performance depends heavily on inputs (maybe worse for common cases where NaN is rare and isnan branch is always predicted correctly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this comment. Will do benchmark first.

@pitrou
Copy link
Member

pitrou commented Feb 22, 2021

For the record, 93% of the consumed CPU time seems to be in TDigest::MergeInput, of which:

  • 23% in TDigestMerger::Add
  • 68% in std::sort

@pitrou
Copy link
Member

pitrou commented Feb 22, 2021

Here are the new benchmark results. It seems TDigest is only really useful for centiles (!):

QuantileKernelMedianNarrow<Int32Type>/1048576/0          611 us          610 us         1170 bytes_per_second=1.59974G/s items_per_second=429.428M/s null_percent=0 size=1048.58k
QuantileKernelMedianWide<Int32Type>/1048576/0           1602 us         1595 us          438 bytes_per_second=627.15M/s items_per_second=164.404M/s null_percent=0 size=1048.58k
QuantileKernelMedianNarrow<Int64Type>/1048576/0          332 us          331 us         2111 bytes_per_second=2.95009G/s items_per_second=395.955M/s null_percent=0 size=1048.58k
QuantileKernelMedianWide<Int64Type>/1048576/0            553 us          550 us         1272 bytes_per_second=1.77461G/s items_per_second=238.185M/s null_percent=0 size=1048.58k
QuantileKernelMedianWide<DoubleType>/1048576/0          1148 us         1145 us          609 bytes_per_second=873.302M/s items_per_second=114.465M/s null_percent=0 size=1048.58k

QuantileKernelDecilesNarrow<Int32Type>/1048576/0         620 us          617 us         1158 bytes_per_second=1.5826G/s items_per_second=424.825M/s null_percent=0 size=1048.58k
QuantileKernelDecilesWide<Int32Type>/1048576/0          4685 us         4666 us          151 bytes_per_second=214.324M/s items_per_second=56.1838M/s null_percent=0 size=1048.58k
QuantileKernelDecilesWide<DoubleType>/1048576/0         2841 us         2829 us          247 bytes_per_second=353.477M/s items_per_second=46.3309M/s null_percent=0 size=1048.58k

QuantileKernelCentilesNarrow<Int32Type>/1048576/0        621 us          619 us         1137 bytes_per_second=1.57778G/s items_per_second=423.531M/s null_percent=0 size=1048.58k
QuantileKernelCentilesWide<Int32Type>/1048576/0        14151 us        14120 us           50 bytes_per_second=70.8198M/s items_per_second=18.565M/s null_percent=0 size=1048.58k
QuantileKernelCentilesWide<DoubleType>/1048576/0        7354 us         7328 us           97 bytes_per_second=136.46M/s items_per_second=17.8861M/s null_percent=0 size=1048.58k

TDigestKernelDoubleMedian/1048576/0                     5655 us         5628 us          125 bytes_per_second=177.681M/s items_per_second=23.289M/s null_percent=0 size=1048.58k
TDigestKernelDoubleDeciles/1048576/0                    5617 us         5598 us          125 bytes_per_second=178.645M/s items_per_second=23.4154M/s null_percent=0 size=1048.58k
TDigestKernelDoubleCentiles/1048576/0                   5802 us         5784 us          125 bytes_per_second=172.883M/s items_per_second=22.6601M/s null_percent=0 size=1048.58k

@cyb70289
Copy link
Contributor Author

Also, it seems there's a performance problem here where TDigest is much slower than Quantile:

Thanks for careful review. As you found, the bottleneck of TDigest is sorting. I once talked with the author about possibility of improving TDigest performance by removing sorting: tdunning/t-digest#153. There are some ideas to mitigate the penalty of sorting. I also plan to evaluate faster sorting algorithm like radix sort for floating numbers (folly has such an implementation).

Quantile kernel is expensive in calculating quantiles (call std::nth_element). So centils() performances the worst, simply because it calculates more quantiles.
TDigest kernel is trivial in calculating quantiles, so its performance is almost constant regardless number of quantiles to calculate. But TDigest is expensive in adding data points (TDigestMerger::Add), when buffer is full, it sorts (call std::sort) the input buffer which is much slower than std::nth_element.

I think Quantile and TDigest are for different purpose. You will use TDigest if you have large volume of inputs (or streaming data), and copying all of them leads to huge memory footprint. Another benefit of TDigest is that chunks can be processed in parallel, then merge into one tdigest, though current aggregate kernel framework doesn't support it.

@cyb70289
Copy link
Contributor Author

Other thoughts about O(1) space quantile kernel performance:
TDigest is good at estimate skewed quantiles, like 0.01, 0.99. For normal quantiles between 0.1 ~ 0.9, the old school online quantile algorithm P-Square [1] is good enough, it should be faster than TDigest. Maybe we can add a p-square kernel for additional cases, or auto select tdigest/p-square per input options.
[1] https://www.cse.wustl.edu/~jain/papers/ftp/psqr.pdf

@cyb70289
Copy link
Contributor Author

CI error not related

@pitrou
Copy link
Member

pitrou commented Feb 23, 2021

Rebasing from master should allow fixing the CI failure.

@pitrou
Copy link
Member

pitrou commented Feb 23, 2021

@cyb70289 Should we merge this or do you want to keep experimenting?

@cyb70289
Copy link
Contributor Author

@cyb70289 Should we merge this or do you want to keep experimenting?

I would like to merge this patch. Take tdigest optimization as followup task. Guess only code in util lib needs change, not kernel related code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants