FIX raise an error if user defined categories contain duplicate values by xuefeng-xu · Pull Request #27328 · scikit-learn/scikit-learn

xuefeng-xu · 2023-09-10T05:52:21Z

Reference Issues/PRs

Follow up #27309
Mentioned in #27088

What does this implement/fix? Explain your changes.

In encoders, check user defined categories and raise an error if they have duplicate values.

Any other comments?

github-actions · 2023-09-10T05:53:54Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a852df5.

glemaitre

We should also acknowledge this change in the changelog.

glemaitre · 2023-10-31T13:25:07Z

sklearn/preprocessing/_encoders.py

                    )
                    raise ValueError(msg)

+                if len(cats) != len(set(cats)):


Since we deal with a NumPy array, let's use numpy to solve this issue.

Suggested change

-                if len(cats) != len(set(cats)):
+                _, n_unique_categories = np.unique(cats, return_counts=True)
+                if cats.size != n_unique_categories:

I think this causes the error. How about if cats.size != np.unique(cats).size:

Yep, n_unique_categories = array([1, 1, 1, 1, 1]) in older NumPy versions. I am not really sure why. The change that you propose is fine with me.
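For context on the failure discussed here: with return_counts=True, np.unique returns a pair (unique values, per-value counts), so the second element is an array of counts rather than the number of distinct values. A minimal sketch, using a made-up cats array, of what the comparison against that array produces and what the size-based comparison gives instead:

```python
import numpy as np

# Made-up categories for illustration; "b" appears twice.
cats = np.array(["a", "b", "c", "b"], dtype=object)

# return_counts=True yields (sorted unique values, how often each occurs).
values, counts = np.unique(cats, return_counts=True)

# Comparing a scalar size with the counts array broadcasts elementwise,
# so the result is an array, not a single True/False.
elementwise = cats.size != counts

# The number of distinct values is the size of the first return value.
has_duplicates = cats.size != values.size
```

Here has_duplicates is a plain boolean, which is what an if statement needs; elementwise is an array with one entry per distinct value.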

Unfortunately, this still causes an error due to None; see the failures at https://github.com/scikit-learn/scikit-learn/pull/27328/checks?check_run_id=18232777028.
For example, np.unique(np.array([None, 'a', 'z'], dtype=object)) will raise an error.

How about if cats.size != len(set(cats)):?
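A small sketch of the np.unique failure described above and of the set-based comparison proposed here. np.unique sorts its input, and Python does not define an ordering between None and str, whereas a set only needs hashing and equality. The cats array is the one from the comment; the variable names are illustrative.

```python
import numpy as np

cats = np.array([None, "a", "z"], dtype=object)

try:
    np.unique(cats)
    unique_error = None
except TypeError as exc:
    # Sorting needs `<` between None and str, which Python does not define.
    unique_error = exc

# A set relies on hashing and equality only, so mixed None/str is fine.
n_distinct = len(set(cats))
has_duplicates = cats.size != n_distinct
```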

OK, so let's revert to the set then. However, we need a test where we specify nan several times in the categories, to check this corner case.

I am going to use the _unique function instead; see the code below.

import numpy as np
from sklearn.utils._encode import _unique

print(set(np.array(['a', None, None])))       # {'a', None}
print(set(np.array(['a', np.nan, np.nan])))   # {'a', 'nan'}
print(set(np.array([1., np.nan, np.nan])))    # {nan, 1.0, nan}

print(_unique(np.array(['a', None, None])))       # ['a' None]
print(_unique(np.array(['a', np.nan, np.nan])))   # ['a' 'nan']
print(_unique(np.array([1., np.nan, np.nan])))    # [ 1. nan]
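The set outputs above show one catch: for a float array, set(...) keeps both nan entries, because nan does not compare equal to itself and each element read from the array is a separate object. As an illustration only (this is not scikit-learn's _unique; has_duplicate_categories is a hypothetical helper written for this note), a duplicate check that also treats a repeated nan as a duplicate could look like this:

```python
import math


def has_duplicate_categories(cats):
    """Hypothetical helper: True if any value, including nan, repeats."""
    seen = set()
    seen_nan = False
    for value in cats:
        # nan != nan, so a set cannot detect a repeated nan; track it by hand.
        if isinstance(value, float) and math.isnan(value):
            if seen_nan:
                return True
            seen_nan = True
        elif value in seen:
            return True
        else:
            seen.add(value)
    return False


# Two separate nan objects both survive in a plain set.
plain_set_size = len({float("nan"), float("nan")})
```

With plain lists, has_duplicate_categories([1.0, float("nan"), float("nan")]) and has_duplicate_categories(["a", None, None]) both report a duplicate, while the plain set above still holds two nan entries.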

As for a test with several nan in the categories, I don't think it's necessary, since we now assume nan must be the last element, and PR #27309 will resolve this.

BTW, which way do you think is better: first check whether nan is the last element, or first check whether the categories contain duplicate values?

Hi @thomasjpfan, would you like to take a look at this PR and also #27309?

BTW, which way do you think is better: first check whether nan is the last element, or first check whether the categories contain duplicate values?

Maybe check for nan first, since it is less expensive.

Sure, I will update the code after #27309 is merged.
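A rough sketch of the ordering agreed on above, under stated assumptions: validate_user_categories and both error messages are hypothetical and written for plain Python lists; they are not the code or wording used in sklearn/preprocessing/_encoders.py. The point it illustrates is that once nan is only allowed in the last position, at most one nan can reach the duplicate check, so a plain set comparison is enough there.

```python
import math


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def validate_user_categories(cats):
    """Hypothetical validation: nan-position check first, then duplicates."""
    # 1) nan, if present, may only be the last element.
    if any(_is_nan(value) for value in cats[:-1]):
        raise ValueError("nan should be the last element of the categories")
    # 2) After step 1 at most one nan remains, so a set comparison is safe.
    if len(cats) != len(set(cats)):
        raise ValueError("the categories contain duplicate elements")
```

For example, ["a", "b", float("nan")] passes, ["a", float("nan"), "b"] fails the first check, and ["a", "a"] fails the second.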

sklearn/preprocessing/_encoders.py

sklearn/preprocessing/tests/test_encoders.py

doc/whats_new/v1.4.rst

glemaitre

Otherwise LGTM.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

doc/whats_new/v1.4.rst

betatim · 2023-11-01T13:18:42Z

Looks good to me, I enabled auto merge.

xuefeng-xu · 2023-11-01T14:34:32Z

I just resolved some conflicts, could you take a look again? @betatim

scikit-learn#27328) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Tim Head <betatim@gmail.com>

FIX raise an error if user defined categories contain duplicate values

2bff9a9

github-actions bot added the module:preprocessing label Sep 10, 2023

xuefeng-xu mentioned this pull request Sep 10, 2023

Wrong infrequent categories and error in OrdinalEncoder #27088

Closed

add test

977201f

glemaitre self-requested a review October 31, 2023 09:52

glemaitre reviewed Oct 31, 2023

sklearn/preprocessing/tests/test_encoders.py (outdated, resolved)

xuefeng-xu added 4 commits October 31, 2023 22:26

Merge branch 'main' into duplicate

caab684

update according to suggested change

c3bd2b3

add an entry in changelog v1.4

82ab785

update error msg

f8dac59

glemaitre reviewed Oct 31, 2023

doc/whats_new/v1.4.rst (outdated, resolved)

glemaitre approved these changes Oct 31, 2023

glemaitre added the Waiting for Second Reviewer label Oct 31, 2023

glemaitre added this to the 1.4 milestone Oct 31, 2023

xuefeng-xu and others added 5 commits October 31, 2023 10:01

Update doc/whats_new/v1.4.rst

63993e6

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

fix linting issue

a817911

fix error

a10c9d5

fix error

e33b184

use _unique function

f1104f5

betatim reviewed Nov 1, 2023

doc/whats_new/v1.4.rst (outdated, resolved)

Update doc/whats_new/v1.4.rst

7eea198

betatim enabled auto-merge (squash) November 1, 2023 13:18

betatim approved these changes Nov 1, 2023

xuefeng-xu added 2 commits November 1, 2023 22:24

fix conflict

d131ce6

fix conflict

a852df5

auto-merge was automatically disabled November 1, 2023 14:31
Head branch was pushed to by a user without write access

betatim enabled auto-merge (squash) November 1, 2023 14:38

betatim merged commit a55f167 into scikit-learn:main Nov 1, 2023

xuefeng-xu deleted the duplicate branch November 2, 2023 02:10
