Skip to content

FIX raise an error if user defined categories contain duplicate values#27328

Merged
betatim merged 14 commits intoscikit-learn:mainfrom
xuefeng-xu:duplicate
Nov 1, 2023
Merged

FIX raise an error if user defined categories contain duplicate values#27328
betatim merged 14 commits intoscikit-learn:mainfrom
xuefeng-xu:duplicate

Conversation

@xuefeng-xu
Copy link
Copy Markdown
Contributor

Reference Issues/PRs

Follow up #27309
Mentioned in #27088

What does this implement/fix? Explain your changes.

In encoders, check user defined categories and raise an error if they have duplicate values.

Any other comments?

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 10, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a852df5. Link to the linter CI: here

@glemaitre glemaitre self-requested a review October 31, 2023 09:52
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also acknowledge this change in the changelog.

)
raise ValueError(msg)

if len(cats) != len(set(cats)):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we deal with a NumPy array, let's use numpy to solve this issue.

Suggested change
if len(cats) != len(set(cats)):
_, n_unique_categories = np.unique(cats, return_counts=True)
if cats.size != n_unique_categories:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this causes the error. How about if cats.size != np.unique(cats).size:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep n_unique_categories = array([1, 1, 1, 1, 1]) is older NumPy version. I am not really sure why. The change that you propose is fine with me.

Copy link
Copy Markdown
Contributor Author

@xuefeng-xu xuefeng-xu Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this still causes error due to None, see the fails https://github.com/scikit-learn/scikit-learn/pull/27328/checks?check_run_id=18232777028.
For example, np.unique(np.array([None, 'a', 'z'], dtype=object)) will raise an error.

How about if cats.size != len(set(cats)):?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so let's revert to the set then. However, we need a test where we specify several time nan in the category as well to check this corner case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to use _unique function instead, see the code below.

import numpy as np
from sklearn.utils._encode import _unique

print(set(np.array(['a', None, None])))            # {'a', None}
print(set(np.array(['a', np.nan, np.nan])))        # {'a', 'nan'}
print(set(np.array([1., np.nan, np.nan])))         # {nan, 1.0, nan}

print(_unique(np.array(['a', None, None])))        # ['a' None]
print(_unique(np.array(['a', np.nan, np.nan])))    # ['a' 'nan']
print(_unique(np.array([1., np.nan, np.nan])))     # [ 1. nan]

For several nan in the category, I don't think it's necessary. Since now we assume nan must at the last, and PR #27309 will resolve this.

BTW, which way do you think is better? First check if nan is at the last or first check if category contain duplicated values?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thomasjpfan, would you like take a look at this PR and also #27309 ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, which way do you think is better? First check if nan is at the last or first check if category contain duplicated values?

Maybe the check for nan first since it is less expensive.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update code after #27309 is merged.

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

@glemaitre glemaitre added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 31, 2023
@glemaitre glemaitre added this to the 1.4 milestone Oct 31, 2023
@betatim betatim enabled auto-merge (squash) November 1, 2023 13:18
@betatim
Copy link
Copy Markdown
Member

betatim commented Nov 1, 2023

Looks good to me, I enabled auto merge.

auto-merge was automatically disabled November 1, 2023 14:31

Head branch was pushed to by a user without write access

@xuefeng-xu
Copy link
Copy Markdown
Contributor Author

I just resolved some conflicts, could you take a look again? @betatim

@betatim betatim enabled auto-merge (squash) November 1, 2023 14:38
@betatim betatim merged commit a55f167 into scikit-learn:main Nov 1, 2023
@xuefeng-xu xuefeng-xu deleted the duplicate branch November 2, 2023 02:10
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
scikit-learn#27328)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tim Head <betatim@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:preprocessing Waiting for Second Reviewer First reviewer is done, need a second one!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants