Skip to content

[data] Support CountDistinct aggregate#59030

Merged
alexeykudinkin merged 7 commits intoray-project:masterfrom
my-vegetable-has-exploded:count-distinct
Jan 7, 2026
Merged

[data] Support CountDistinct aggregate#59030
alexeykudinkin merged 7 commits intoray-project:masterfrom
my-vegetable-has-exploded:count-distinct

Conversation

@my-vegetable-has-exploded
Copy link
Copy Markdown
Contributor

Description

CountDistinct allow users to compute the number of distinct values in a column, similar to SQL's COUNT(DISTINCT ...).

Related issues

close #58252

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
@my-vegetable-has-exploded my-vegetable-has-exploded requested a review from a team as a code owner November 27, 2025 08:13
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a CountDistinct aggregation, which is a useful addition. The implementation correctly leverages the existing Unique aggregation. However, I've identified a high-severity issue where the ignore_nulls parameter is not being respected due to the parent class's implementation. I've suggested a fix by overriding aggregate_block in the new CountDistinct class. Additionally, I've pointed out a minor error in the docstring and proposed corrections to the tests to ensure the ignore_nulls functionality is properly validated.

.groupby("A")
.aggregate(
Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
CountDistinct("B", alias_name="count_distinct_b"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The CountDistinct aggregation in this test should be parameterized with the ignore_nulls variable to properly test both cases (True and False), just like the other aggregations in this test.

Suggested change
CountDistinct("B", alias_name="count_distinct_b"),
CountDistinct("B", alias_name="count_distinct_b", ignore_nulls=ignore_nulls),


aggs = [
Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
CountDistinct("B", alias_name="count_distinct_b"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure the ignore_nulls functionality of CountDistinct is tested, you should pass the ignore_nulls parameter from the test's parameterization.

Suggested change
CountDistinct("B", alias_name="count_distinct_b"),
CountDistinct("B", alias_name="count_distinct_b", ignore_nulls=ignore_nulls),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

{
"B": [
("count_b", lambda s: s.count() if ignore_nulls else len(s)),
("count_distinct_b", lambda s: s.nunique(dropna=False)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Test uses wrong pandas comparison for CountDistinct null handling

The test compares CountDistinct (using default ignore_nulls=True) against s.nunique(dropna=False), which have opposite semantics. With ignore_nulls=True, nulls are excluded from the count, but dropna=False tells pandas to include NaN as a unique value. Since the test data contains NaN values, this mismatch will cause test failures. The pandas comparison needs to use dropna=True to match, or CountDistinct needs to pass ignore_nulls=ignore_nulls to be consistent with other aggregations in the test.

Additional Locations (1)

Fix in Cursor Fix in Web

{
"B": [
("count", lambda s: s.count() if ignore_nulls else len(s)),
("count_distinct", lambda s: s.nunique(skipna=False)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Test uses wrong parameter name for pandas nunique

The test uses s.nunique(skipna=False) but pandas Series.nunique() uses the parameter dropna, not skipna. This will raise a TypeError when the test executes. The parameter name needs to be dropna.

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 27, 2025
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
@omatthew98 omatthew98 self-assigned this Dec 3, 2025
Copy link
Copy Markdown
Contributor

@alexeykudinkin alexeykudinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please address feedback from Gemini


aggs = [
Count("B", alias_name="count_b", ignore_nulls=ignore_nulls),
CountDistinct("B", alias_name="count_distinct_b"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
@my-vegetable-has-exploded
Copy link
Copy Markdown
Contributor Author

Hi @alexeykudinkin, thanks for review. I have resolved problems, but ci fails. I don't find actual fail info in microcheck. Could you please re-triage this test?

@goutamvenkat-anyscale
Copy link
Copy Markdown
Contributor

Hi @alexeykudinkin, thanks for review. I have resolved problems, but ci fails. I don't find actual fail info in microcheck. Could you please re-triage this test?

Some of the errors were transient (5xx errors). I just merge master into your branch. Those should resolve

@my-vegetable-has-exploded
Copy link
Copy Markdown
Contributor Author

Hi @alexeykudinkin, thanks for review. I have resolved problems, but ci fails. I don't find actual fail info in microcheck. Could you please re-triage this test?

Some of the errors were transient (5xx errors). I just merge master into your branch. Those should resolve

Thanks @goutamvenkat-anyscale !!!

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 1, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 1, 2026
@richardliaw richardliaw added go add ONLY when ready to merge, run all tests and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Jan 1, 2026
@goutamvenkat-anyscale
Copy link
Copy Markdown
Contributor

@my-vegetable-has-exploded

[2026-01-01T21:27:19Z] Public APIs that are NOT documented:

[2026-01-01T21:27:19Z] ray.data.aggregate.CountDistinct

Looks like you're missing some documentation on CountDistinct which is the root cause of the CI failure

Adds CountDistinct to the list of aggregation functions in the API reference
to improve documentation completeness.

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
@my-vegetable-has-exploded
Copy link
Copy Markdown
Contributor Author

@my-vegetable-has-exploded

[2026-01-01T21:27:19Z] Public APIs that are NOT documented:

[2026-01-01T21:27:19Z] ray.data.aggregate.CountDistinct

Looks like you're missing some documentation on CountDistinct which is the root cause of the CI failure

Done

@alexeykudinkin alexeykudinkin merged commit 6ee6e69 into ray-project:master Jan 7, 2026
6 checks passed
@my-vegetable-has-exploded
Copy link
Copy Markdown
Contributor Author

Thank you all!

AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
## Description

`CountDistinct` allow users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues

close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
## Description

`CountDistinct` allow users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues

close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
## Description

`CountDistinct` allow users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues

close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
## Description

`CountDistinct` allow users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues

close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
## Description

`CountDistinct` allow users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.

## Related issues

close ray-project#58252

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[data] count distinct aggregator

5 participants