[data] Support CountDistinct aggregate #59030
alexeykudinkin merged 7 commits into ray-project:master
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Code Review
This pull request introduces a CountDistinct aggregation, which is a useful addition. The implementation correctly leverages the existing Unique aggregation. However, I've identified a high-severity issue where the ignore_nulls parameter is not being respected due to the parent class's implementation. I've suggested a fix by overriding aggregate_block in the new CountDistinct class. Additionally, I've pointed out a minor error in the docstring and proposed corrections to the tests to ensure the ignore_nulls functionality is properly validated.
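To make the `ignore_nulls` concern concrete, here is a minimal standalone sketch of a block-wise distinct count. It is not Ray's implementation: `ToyCountDistinct`, `run`, and the list-of-lists "blocks" are hypothetical names used only for illustration, and `None` stands in for a null value. The point is that the null filter has to be applied when each block is aggregated, which is where the review suggests the override.

```python
from typing import Iterable, List, Optional, Set


class ToyCountDistinct:
    """Toy stand-in for a block-wise distinct count; not Ray's actual class."""

    def __init__(self, ignore_nulls: bool = True):
        self.ignore_nulls = ignore_nulls

    def aggregate_block(self, block: Iterable[Optional[int]]) -> Set[Optional[int]]:
        # Partial result for one block: the set of values seen in it.
        # Nulls (None here) are dropped only when ignore_nulls is set.
        return {v for v in block if not (self.ignore_nulls and v is None)}

    def combine(self, left: Set[Optional[int]], right: Set[Optional[int]]) -> Set[Optional[int]]:
        # Merging partial results is a set union, so duplicates across blocks collapse.
        return left | right

    def finalize(self, accumulator: Set[Optional[int]]) -> int:
        # The distinct count is the size of the merged set.
        return len(accumulator)


def run(agg: ToyCountDistinct, blocks: List[List[Optional[int]]]) -> int:
    """Fold all blocks through the aggregation (hypothetical driver)."""
    acc: Set[Optional[int]] = set()
    for block in blocks:
        acc = agg.combine(acc, agg.aggregate_block(block))
    return agg.finalize(acc)


blocks = [[1, 2, None], [2, 3, None]]
distinct_without_nulls = run(ToyCountDistinct(ignore_nulls=True), blocks)
distinct_with_nulls = run(ToyCountDistinct(ignore_nulls=False), blocks)
```

In this toy, a version of `aggregate_block` that ignored the flag would return the same count for both settings, which is the kind of behavior the review flags.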
    .groupby("A")
    .aggregate(
        Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
        CountDistinct("B", alias_name="count_distinct_b"),
The CountDistinct aggregation in this test should be parameterized with the ignore_nulls variable to properly test both cases (True and False), just like the other aggregations in this test.
    - CountDistinct("B", alias_name="count_distinct_b"),
    + CountDistinct("B", alias_name="count_distinct_b", ignore_nulls=ignore_nulls),
    aggs = [
        Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
        CountDistinct("B", alias_name="count_distinct_b"),
    {
        "B": [
            ("count_b", lambda s: s.count() if ignore_nulls else len(s)),
            ("count_distinct_b", lambda s: s.nunique(dropna=False)),
Bug: Test uses wrong pandas comparison for CountDistinct null handling
The test compares CountDistinct (using default ignore_nulls=True) against s.nunique(dropna=False), which have opposite semantics. With ignore_nulls=True, nulls are excluded from the count, but dropna=False tells pandas to include NaN as a unique value. Since the test data contains NaN values, this mismatch will cause test failures. The pandas comparison needs to use dropna=True to match, or CountDistinct needs to pass ignore_nulls=ignore_nulls to be consistent with other aggregations in the test.
Additional Locations (1)
    {
        "B": [
            ("count", lambda s: s.count() if ignore_nulls else len(s)),
            ("count_distinct", lambda s: s.nunique(skipna=False)),
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
alexeykudinkin left a comment
LGTM, please address feedback from Gemini
    aggs = [
        Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
        CountDistinct("B", alias_name="count_distinct_b"),
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Hi @alexeykudinkin, thanks for the review. I have resolved the problems, but CI fails. I can't find the actual failure info in microcheck. Could you please re-triage this test?
Some of the errors were transient (5xx errors). I just merged master into your branch. Those should resolve.
Thanks @goutamvenkat-anyscale !!!
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
[2026-01-01T21:27:19Z] Public APIs that are NOT documented:
[2026-01-01T21:27:19Z] ray.data.aggregate.CountDistinct
Looks like you're missing some documentation on
Adds CountDistinct to the list of aggregation functions in the API reference to improve documentation completeness. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Done
Thank you all!
## Description
`CountDistinct` allows users to compute the number of distinct values in a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues
close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
## Description
`CountDistinct` allows users to compute the number of distinct values in a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues
close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
`CountDistinct` allows users to compute the number of distinct values in a column, similar to SQL's `COUNT(DISTINCT ...)`.
Related issues
close #58252
Additional information
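For readers less familiar with the SQL behavior the description refers to, the sketch below runs `COUNT(DISTINCT ...)` next to a plain `COUNT(...)` using Python's built-in `sqlite3` module. The table `t`, its columns `a` and `b`, and the sample rows are made up for illustration; this shows SQL semantics only and does not call Ray. Note that SQL skips NULLs in both aggregates, which corresponds to the `ignore_nulls=True` behavior discussed in the review.

```python
import sqlite3

# In-memory database with a hypothetical two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", 1), ("x", 1), ("x", 2), ("x", None), ("y", 3), ("y", None)],
)

# Per-group row count (non-null b) versus number of distinct non-null b values.
rows = conn.execute(
    "SELECT a, COUNT(b), COUNT(DISTINCT b) FROM t GROUP BY a ORDER BY a"
).fetchall()
conn.close()

for group, count_b, count_distinct_b in rows:
    print(group, count_b, count_distinct_b)
```

Group `x` has three non-null values but only two distinct ones, so the two aggregates differ there, while the NULL rows contribute to neither.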