[Data] Add approximate quantile to aggregator #57598

Merged
alexeykudinkin merged 12 commits into ray-project:master from owenowenisme:data/add-approximate-quantile-to-aggregrator on Oct 16, 2025

Conversation

Member

@owenowenisme owenowenisme commented Oct 9, 2025

Why are these changes needed?

Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL.

Reason:
• Enables efficient support for the summary API.
• More scalable than exact Quantile on large datasets.

Note:
• DataSketches is not added as a Ray dependency; if missing, users are prompted to install it.
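
The optional-dependency behaviour described in the note can be sketched roughly as follows. This is a minimal illustration, not the helper from this PR: `require_module` and its arguments are hypothetical names, and the standard-library `json` module stands in for the optional package so the snippet runs without extra installs.

```python
import importlib
from types import ModuleType


def require_module(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, or raise an error with an install hint.

    Hypothetical helper for illustration; not part of Ray.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"This feature requires `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


# A module that is present is simply returned.
json_mod = require_module("json", "json")

# A module that is missing produces an actionable message.
try:
    require_module("surely_not_installed_pkg_xyz", "surely-not-installed-pkg-xyz")
except ImportError as err:
    message = str(err)
```

Keeping the import inside a helper like this means the base install stays unchanged, and only callers of the feature ever see the install prompt.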


Here's a simple test that shows the efficiency difference between ApproximateQuantile and Quantile:

import time

import ray
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

# Approximate median via the sketch-based aggregator.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

# Exact median on a fresh dataset of the same size, for comparison.
ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")

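In this run with 1e8 rows, the approximate median returned 49,979,428.0 in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s. The gap between the two values is the accuracy the sketch trades for significant speed and scalability gains.

With k=800 (the default), the sketch's error is bounded at roughly 0.45%. In this test the relative error of the returned value is (49,999,999.5 - 49,979,428.0) / 49,999,999.5 = 0.00041143 = 0.041143%, well under 0.45%, and the approximate median was computed 13.11x faster. Note that the KLL bound is stated in terms of normalized rank and holds with high probability rather than as a hard guarantee; on this evenly spaced column the rank error at the median is about half the relative value error, so the comparison still holds.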

{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
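
The arithmetic behind those figures can be reproduced with a few lines. The numbers are copied from the run above; the variable names are illustrative, and the rank-error line assumes the column holds the evenly spaced values 0..n-1 (as `ray.data.range` produces), so a value gap divided by n approximates a normalized rank gap.

```python
n_rows = 10**8
exact_median = 49_999_999.5
approx_median = 49_979_428.0
t_approx = 12.457247257232666   # seconds, ApproximateQuantile
t_exact = 163.32705521583557    # seconds, Quantile

abs_diff = exact_median - approx_median

# Relative error of the returned value, as computed in the description.
relative_value_error = abs_diff / exact_median

# Approximate normalized rank error, valid because the values are 0..n-1.
normalized_rank_error = abs_diff / n_rows

speedup = t_exact / t_approx

print(f"relative value error:  {relative_value_error:.6%}")
print(f"normalized rank error: {normalized_rank_error:.6%}")
print(f"speedup:               {speedup:.2f}x")
```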

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@owenowenisme owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from e0584b6 to 45381b1 on October 9, 2025 13:20
@owenowenisme owenowenisme added the go label (add ONLY when ready to merge, run all tests) Oct 9, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the data/add-approximate-quantile-to-aggregrator branch from 45381b1 to 024f199 on October 9, 2025 23:55
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@owenowenisme owenowenisme marked this pull request as ready for review October 10, 2025 08:27
@owenowenisme owenowenisme requested a review from a team as a code owner October 10, 2025 08:27
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
@ray-gardener ray-gardener bot added the data label (Ray Data-related issues) Oct 10, 2025
"""
self._require_datasketches()
self._quantiles = quantiles
self._k = k
Contributor

Instead of k, let's use capacity_per_level.

Member Author

@owenowenisme owenowenisme Oct 11, 2025

capacity_per_level doesn't feel accurate to me. I don't think we need to hide the detail of k, since users will have to consult the DataSketches docs anyway.

I added a link to the k parameter's description to point users to those docs for more info.

Contributor

The problem is that it's not obvious to a user what k represents.

They have to look up the algorithm to build intuition. Curious: why do you say capacity_per_level is inaccurate?

Member Author

I just think the concept of "accuracy" should be in the parameter name.
From a user's point of view, "capacity" might be confusing.
How about accuracy_factor?

Contributor

quantile_precision?

Member Author

SG!

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@bveeramani bveeramani enabled auto-merge (squash) October 15, 2025 17:33
@github-actions github-actions bot disabled auto-merge October 15, 2025 17:33
)

def zero(self, quantile_precision: int):
    sketch_cls = self._require_datasketches()
Contributor

This should only be needed in the ctor

Contributor

Ditto everywhere
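
One way to read the suggestion above is sketched here: resolve the sketch class once in the constructor, store it, and let `zero` and the other methods reuse it. This is an assumption-laden illustration, not the code merged in this PR. `SketchQuantileAggregator` and `ListSketch` are hypothetical names, and `ListSketch` is an exact stand-in that keeps every value (it is not KLL); only the zero/accumulate/merge/finalize shape is the point.

```python
from typing import List, Sequence


class ListSketch:
    """Stand-in for a real quantile sketch. It stores every value, so it is
    exact and NOT a KLL sketch; only the update/merge/query shape matters."""

    def __init__(self, k: int) -> None:
        self.k = k  # mirrors a sketch size parameter; unused by this stand-in
        self.values: List[float] = []

    def update(self, value: float) -> None:
        self.values.append(value)

    def merge(self, other: "ListSketch") -> None:
        self.values.extend(other.values)

    def get_quantile(self, q: float) -> float:
        ordered = sorted(self.values)
        index = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[index]


class SketchQuantileAggregator:
    """Hypothetical aggregator for illustration; not the class from this PR."""

    def __init__(self, quantiles: Sequence[float], quantile_precision: int = 800,
                 sketch_cls=ListSketch) -> None:
        # The (possibly optional) sketch class is resolved once, here.
        self._sketch_cls = sketch_cls
        self._quantiles = list(quantiles)
        self._quantile_precision = quantile_precision

    def zero(self) -> ListSketch:
        # Reuses the class stored by the constructor; no dependency check.
        return self._sketch_cls(self._quantile_precision)

    def accumulate(self, sketch: ListSketch, block) -> ListSketch:
        for value in block:
            sketch.update(value)
        return sketch

    def merge(self, left: ListSketch, right: ListSketch) -> ListSketch:
        left.merge(right)
        return left

    def finalize(self, sketch: ListSketch) -> List[float]:
        return [sketch.get_quantile(q) for q in self._quantiles]


# Two partial sketches built from separate blocks, then merged and queried.
agg = SketchQuantileAggregator(quantiles=[0.5])
part_a = agg.accumulate(agg.zero(), range(0, 50))
part_b = agg.accumulate(agg.zero(), range(50, 100))
result = agg.finalize(agg.merge(part_a, part_b))
```

With the dependency check confined to the constructor, a missing package fails fast when the aggregator is created rather than partway through a job.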

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
@alexeykudinkin alexeykudinkin merged commit 81cf351 into ray-project:master Oct 16, 2025
6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

4 participants