MAINT validate parameter in sklearn.preprocessing._encoders by Diadochokinetic · Pull Request #23579 · scikit-learn/scikit-learn

Diadochokinetic · 2022-06-10T10:28:41Z

Reference Issues/PRs

Make all estimators use _validate_params #23462

What does this implement/fix? Explain your changes.

Implements _validate_params for sklearn.preprocessing._encoders. Shared parameters will be implemented in _BaseEncoder.

Any other comments?

This is my first contribution. Tips and feedback are highly appreciated.

Diadochokinetic · 2022-06-12T12:46:38Z

I'm currently stuck because _InstancesOf doesn't support NumPy data types; see #23599.

thomasjpfan · 2022-06-12T14:27:37Z

I think we should leave off validating dtype for now. In NumPy, the dtype can be specified in many ways. For example, this currently works:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(dtype="i4")
X = [['dog'], ['cat'], ['snake']]
enc.fit_transform(X)
# array([[1],
#        [0],
#        [2]], dtype=int32)

We can let NumPy handle bad dtypes. For example, this will raise already:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(dtype="i423e12")
X = [['dog'], ['cat'], ['snake']]
enc.fit_transform(X)
# TypeError: data type 'i423e12' not understood

For reference, what NumPy considers a valid "dtype" is quite complex according to NumPy's typing system.
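To illustrate the point about delegating to NumPy, here is a minimal sketch that uses only np.dtype. The helper name is_valid_dtype is hypothetical and is not part of scikit-learn or NumPy; it assumes that NumPy raises TypeError for string specs it does not understand, as in the traceback above.

```python
import numpy as np

# Several spellings that NumPy resolves to the same 32-bit integer dtype.
specs = ["i4", "int32", np.int32, np.dtype("int32")]
resolved = [np.dtype(spec) for spec in specs]
all_same = all(dt == np.dtype(np.int32) for dt in resolved)


def is_valid_dtype(spec):
    """Hypothetical helper: let NumPy decide whether `spec` is a dtype."""
    try:
        np.dtype(spec)
    except TypeError:
        # NumPy raises TypeError for specs it does not understand.
        return False
    return True


print(all_same)
print(is_valid_dtype("i4"), is_valid_dtype("i423e12"))
```

Because NumPy already knows every accepted spelling, re-implementing that logic as a list of allowed types would either reject valid inputs or duplicate NumPy's parsing.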

Diadochokinetic · 2022-06-12T15:21:26Z

Since parameter validation requires at least one constraint for each parameter, what is the best way to approach the dtype parameter? Is there some "meta" class/type that always returns True?

thomasjpfan · 2022-06-12T18:22:13Z

What is the best way to approach the dtype parameter?

I think we can use object. For future contributors, include a comment near the validate_params for dtype that states that we are allowing NumPy to do the validation.

Diadochokinetic · 2022-06-12T19:42:44Z

I think we can use object.

Unfortunately, this collides with sklearn.utils.estimator_checks.check_param_validation. That function tests whether an artificial "bad" parameter raises an appropriate error message, but the artificial bad parameter yields True for isinstance against object:

>>> param_with_bad_type = type("BadType", (), {})()
>>> isinstance(param_with_bad_type, object)
True

Maybe this check should be disabled when object is the parameter constraint.
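The collision can be sketched without scikit-learn. The function names below are hypothetical and only mimic the idea of the check described in this comment: build a value of a brand-new type and see whether a list of allowed types would reject it.

```python
def make_bad_value():
    """Create an instance of a fresh type that no real constraint expects."""
    return type("BadType", (), {})()


def constraint_rejects_bad_value(allowed_types):
    """Return True if none of the allowed types accepts the bad value."""
    bad = make_bad_value()
    return not any(isinstance(bad, t) for t in allowed_types)


# Ordinary constraints reject the artificial value, so an error can be tested.
print(constraint_rejects_bad_value([int, str]))
# `object` accepts everything, so no bad value exists to trigger the error.
print(constraint_rejects_bad_value([object]))
```

Since every Python value is an instance of object, a check that expects some value to be rejected cannot pass for that constraint.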

thomasjpfan · 2022-06-12T19:50:27Z

The current design of _validate_params is not flexible enough to turn off validation. Let's see what @jeremiedbb thinks about adding such functionality. For context, I think it's better to delegate validation of dtype to NumPy because the dtype parameter can accept many inputs.

jeremiedbb · 2022-06-13T10:26:58Z

The current design of _validate_params is not flexible enough to turn off validation. Let's see what @jeremiedbb thinks about adding such functionality. For context, I think it's better to delegate validation of dtype to NumPy because the dtype parameter can accept many inputs.

@thomasjpfan I guess we can add a special case "no validation" or "any" or "delegate validation". This is a case where using typing would have made it easier 😄. I'll make a PR to implement that.
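A rough sketch of what such a special case could look like follows. It is not scikit-learn's implementation: validate_params, the NO_VALIDATION sentinel, and the error wording are all assumptions made up for illustration.

```python
NO_VALIDATION = "no_validation"  # hypothetical sentinel name


def validate_params(constraints, params):
    """Check each parameter against its list of allowed types."""
    for name, value in params.items():
        allowed = constraints[name]
        if allowed == NO_VALIDATION:
            # Validation is delegated elsewhere (e.g. to NumPy for dtype).
            continue
        if not any(isinstance(value, t) for t in allowed):
            raise TypeError(
                f"The {name!r} parameter must be an instance of one of "
                f"{[t.__name__ for t in allowed]}, got {value!r} instead."
            )


constraints = {"dtype": NO_VALIDATION, "sparse": [bool]}
# dtype is skipped entirely; sparse is still checked.
validate_params(constraints, {"dtype": "i4", "sparse": True})
```

An explicit sentinel, unlike object, lets a test harness recognise that the parameter is intentionally unchecked and skip the bad-value test for it.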

Diadochokinetic · 2022-06-13T12:49:18Z

I'll make a PR to implement that.

Okay, then I'll resume working on this, when the PR is done :)

jeremiedbb · 2022-06-24T13:12:21Z

@Diadochokinetic I directly pushed some changes to take into account the recent improvements in the validation mechanism.

jeremiedbb

LGTM

glemaitre

Otherwise LGTM

glemaitre · 2022-06-24T15:52:46Z

Thanks @Diadochokinetic

Add minimal example for parameter 'categories' in OneHotEncoder() and successfully run tests

adc5db9

github-actions bot added the module:preprocessing label Jun 10, 2022

Diadochokinetic added 9 commits June 10, 2022 16:31

Add all parameter constraints and run tests successfully

7520d3e

formatted code with black

200ade8

Remove old paramater checks

835cccd

remove OneHotEncode and OridnalEncoder form PARAM_VALIDATION_ESTIMATORS_TO_IGNORE

c20ccdc

remove simple tests and move self._infrequent_enabled to __init__

e240f8c

change paramter dtype to type

a3f9633

try different approach for dtype

8225384

Add type and np.dtype to parameter_constraints of dtype

48db2c4

Merge branch 'ohe_validate_params' of https://github.com/Diadochokinetic/scikit-learn into ohe_validate_params

e7579ec

This was referenced Jun 12, 2022

sklearn.utils._param_validation._InstancesOf is insufficient for numpy data types #23599

Closed

[MRG] Fix sklearn.utils._param_validation._InstancesOf is insufficient for numpy data types #23600

Closed

Diadochokinetic added 5 commits June 12, 2022 14:56

Remove KernelCenterer from PARAM_VALIDATION_TO_IGNORE

732c009

Add StrOptions first and if_binary to parameter drop

19afd9b

Remove simple tests for error messages of parameter constraints

2f60d6e

Add StrOptions for paramater dtype of OridinalEncoder, it allows to pass numpy dtypes as string and internally converts them

cd14e09

Format with black

cc93146

Diadochokinetic added 3 commits June 12, 2022 17:08

Undo changes to _InstancesOf

0d9e2f4

Undo changes in _InstacnesOf docstring

073bc0f

Format with black

5409276

Diadochokinetic added 3 commits June 12, 2022 22:19

Change parameter_constraints of dtpye to object, the most abstract form

243434f

disable error message tests for object as parameter constraint

604954f

forgot to run flake8...

ced1030

jeremiedbb added the Validation related to input validation label Jun 13, 2022

jeremiedbb mentioned this pull request Jun 13, 2022

MNT Param validation: Allow to skip validation of a parameter #23602

Merged

Diadochokinetic added 3 commits June 17, 2022 09:32

Merge remote-tracking branch 'upstream/main' into ohe_validate_params

030efea

undo changes in estimator_checks

b75f62c

Fix inline comments

e3f230b

Diadochokinetic force-pushed the ohe_validate_params branch from ea27732 to e3f230b Compare June 17, 2022 08:04

Replace constraint [bool] with [boolean].

de91812

thomasjpfan reviewed Jun 23, 2022

jeremiedbb added 3 commits June 24, 2022 14:58

no validation for dtype

455140c

Merge remote-tracking branch 'upstream/main' into pr/Diadochokinetic/23579

e61c5ce

update

00c8eae

jeremiedbb added the No Changelog Needed label Jun 24, 2022

jeremiedbb approved these changes Jun 24, 2022

glemaitre changed the title ~~towards #23462 [WIP] implement _validate_params for sklear.preprocessing._encoders~~ MAINT validate parameter in sklearn.preprocessing._encoders Jun 24, 2022

glemaitre self-requested a review June 24, 2022 14:31

glemaitre reviewed Jun 24, 2022

address review comment

04e65ab

glemaitre merged commit 8a8d068 into scikit-learn:main Jun 24, 2022

Diadochokinetic deleted the ohe_validate_params branch June 25, 2022 07:15

jeremiedbb mentioned this pull request Jul 6, 2022

Param validation for Dictvectorizer #23820

Merged

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022

MAINT validate parameter in OneHotEncoder and OrdinalEncoder (scikit-learn#23579) Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

edc858b

4 participants