MAINT Common parameter validation by jeremiedbb · Pull Request #22722 · scikit-learn/scikit-learn

jeremiedbb · 2022-03-07T16:21:09Z

This PR proposes a unified design for parameter validation across estimators, classes and functions.

The goal is to have a consistent way to raise an informative error message when a parameter does not have a valid type/value. Here's an example:

>>> KMeans(init="wrong").fit(X)
ValueError: The 'init' parameter of KMeans must be a str among {'k-means++', 'random'}, a callable or an array-like. Got 'wrong' instead.

It's also meant to centralize all these checks in one place, namely as the first instruction of fit or of a function. Currently they can be spread throughout fit, which makes them hard to follow and slow to fail. I also find that having all this boilerplate inside fit makes the actually interesting code of the algorithm hard to find, since it is mixed up with irrelevant code.
In addition, these checks are currently often done for only a small subset of the parameters and are often not tested. When they are tested, the tests are often spread across several test functions.

This PR only deals with constraints on types and values that are not co-dependent between parameters. Co-dependent constraints, for instance when a value of one parameter is valid only if another parameter is set to some value, are out of scope.

I propose to add to BaseEstimator a method _validate_params that performs validation for all parameters of estimators and a decorator validate_params for public functions. Validation is made against a dict param_name: constraint where constraint is a list of valid types/values.

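# param validation of an estimator
# (import paths follow the module added by this PR; it is private API and may change)
from numbers import Integral, Real

from sklearn.base import BaseEstimator
from sklearn.utils._param_validation import Interval, StrOptions


class SomeEstimator(BaseEstimator):
    _parameter_constraints = {
        "n_clusters": [Interval(Integral, 1, None, closed="left")],
        "init": [StrOptions({"k-means++", "random"}), callable, "array-like"],
        "tol": [Interval(Real, 0, None, closed="left")],
        "algorithm": [StrOptions({"lloyd", "elkan", "auto", "full"}, deprecated={"auto", "full"})],
        "max_no_improvement": [None, Interval(Integral, 0, None, closed="left")],
    }

    def fit(self, X, y):
        # validate all parameters first, before any other work
        self._validate_params()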

# param validation of a function
# (import paths follow the module added by this PR; it is private API and may change)
from numbers import Integral, Real

from sklearn.utils._param_validation import Interval, StrOptions, validate_params


@validate_params(
    {
        "n_clusters": [Interval(Integral, 1, None, closed="left")],
        "init": [StrOptions({"k-means++", "random"}), callable, "array-like"],
        "tol": [Interval(Real, 0, None, closed="left")],
        "algorithm": [StrOptions({"lloyd", "elkan", "auto", "full"}, deprecated={"auto", "full"})],
        "max_no_improvement": [None, Interval(Integral, 0, None, closed="left")],
    }
)
def some_func(n_clusters, init, tol, algorithm, max_no_improvement):
    ...
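To make the mechanism concrete, here is a minimal pure-Python sketch of how a decorator could validate arguments against a dict of constraints and raise an error message in the format shown earlier. It is not the implementation in this PR: ToyInterval, ToyStrOptions and toy_validate_params are hypothetical stand-ins, and the interval is always closed on the left.

```python
import functools
import inspect
from numbers import Integral, Real


class ToyInterval:
    """Hypothetical stand-in for an interval constraint: [left, right)."""

    def __init__(self, type_, left, right=None):
        self.type_, self.left, self.right = type_, left, right

    def is_satisfied_by(self, value):
        if not isinstance(value, self.type_):
            return False
        if value < self.left:
            return False
        return self.right is None or value < self.right

    def __str__(self):
        right = "inf" if self.right is None else self.right
        return f"an instance of {self.type_.__name__} in the range [{self.left}, {right})"


class ToyStrOptions:
    """Hypothetical stand-in for a constraint on a fixed set of strings."""

    def __init__(self, options):
        self.options = set(options)

    def is_satisfied_by(self, value):
        return isinstance(value, str) and value in self.options

    def __str__(self):
        return f"a str among {sorted(self.options)}"


def toy_validate_params(constraints):
    """Decorator factory: validate the bound arguments before calling func."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()  # defaults are validated too
            for name, value in bound.arguments.items():
                allowed = constraints.get(name)
                if allowed is None:
                    continue  # no constraint declared for this parameter
                # a value is valid if at least one constraint accepts it
                if not any(c.is_satisfied_by(value) for c in allowed):
                    expected = " or ".join(str(c) for c in allowed)
                    raise ValueError(
                        f"The {name!r} parameter of {func.__name__} must be "
                        f"{expected}. Got {value!r} instead."
                    )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@toy_validate_params(
    {
        "n_clusters": [ToyInterval(Integral, 1)],
        "init": [ToyStrOptions(["k-means++", "random"])],
        "tol": [ToyInterval(Real, 0)],
    }
)
def some_func(n_clusters, init="k-means++", tol=1e-4):
    return n_clusters, init, tol
```

Calling some_func(3, init="wrong") in this sketch fails before the function body runs, which is the "fail fast" behaviour the description argues for.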

I also propose to add a new common test that makes sure this is done for all estimators (almost all of them being skipped right now).
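As an illustration of what such a common test could look like, the sketch below sets each constructor parameter to a value that no constraint accepts and checks that fit rejects it. ToyEstimator and check_param_validation are hypothetical names, and the constraints are plain predicates rather than the constraint objects of this PR.

```python
import inspect


class ToyEstimator:
    # Hypothetical constraints: a value is valid if any predicate accepts it.
    _parameter_constraints = {
        "n_clusters": [lambda v: isinstance(v, int) and v >= 1],
        "init": [lambda v: v in ("k-means++", "random"), callable],
    }

    def __init__(self, n_clusters=8, init="k-means++"):
        self.n_clusters = n_clusters
        self.init = init

    def _validate_params(self):
        for name, allowed in self._parameter_constraints.items():
            value = getattr(self, name)
            if not any(check(value) for check in allowed):
                raise ValueError(
                    f"The {name!r} parameter of {type(self).__name__} is "
                    f"invalid. Got {value!r} instead."
                )

    def fit(self, X, y=None):
        self._validate_params()  # first instruction of fit
        return self


def check_param_validation(estimator_cls):
    """Return the names of parameters whose invalid value was NOT rejected."""
    bad_value = object()  # satisfies none of the toy constraints
    unchecked = []
    init_params = inspect.signature(estimator_cls.__init__).parameters
    for name in init_params:
        if name == "self":
            continue
        estimator = estimator_cls()
        setattr(estimator, name, bad_value)
        try:
            estimator.fit([[0.0], [1.0]])
        except ValueError as exc:
            if name not in str(exc):
                unchecked.append(name)  # an error was raised, but not about this parameter
        else:
            unchecked.append(name)  # fit silently accepted the bad value
    return unchecked
```

An estimator passes when check_param_validation returns an empty list; estimators that have not been converted yet would be skipped, as the description notes.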

closes #14721

jeremiedbb

Here are some comments for possible extensions of this work.

sklearn/cluster/_kmeans.py

jeremiedbb · 2022-03-07T16:27:52Z

+        "n_clusters": [(numbers.Integral, Interval(1, None, closed="left"))],
+        "init": [
+            (str, {"k-means++", "random"}),
+            (callable,),


Future work: we can imagine defining the subset of callables with a specific signature to pass here as valid values
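One way this future extension might look is a constraint that inspects the callable's signature. The sketch below only checks that a callable can be bound to a given number of positional arguments; accepts_positional_args is a hypothetical helper, and the three-argument (X, n_clusters, random_state) shape used in the example is an assumption based on how KMeans documents a callable init.

```python
import inspect


def accepts_positional_args(func, n_args):
    """Return True if `func` can be called with `n_args` positional arguments."""
    if not callable(func):
        return False
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError):
        # Some builtins expose no signature: do not reject them here.
        return True
    try:
        signature.bind(*[None] * n_args)
    except TypeError:
        return False
    return True


def custom_init(X, n_clusters, random_state):
    # Placeholder initializer used only to exercise the check.
    return X[:n_clusters]
```

Such a check only looks at arity, not at what the callable returns, so it would complement rather than replace validation of the returned centers.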

sklearn/cluster/_kmeans.py

jeremiedbb · 2022-03-07T16:31:25Z

        )

-        self._check_params(X)
+        self._check_params_vs_input(X)


This is a second round of checks after data validation that deals with valid values that depend on the data or on other parameters.
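A minimal sketch of such a data-dependent check, assuming the rule that the number of clusters cannot exceed the number of samples. The function name and signature here are illustrative, not the method from the diff.

```python
def check_params_vs_input(n_clusters, X):
    """Second-round check: runs after X is validated, since it needs its shape."""
    n_samples = len(X)
    if n_samples < n_clusters:
        raise ValueError(
            f"n_samples={n_samples} should be >= n_clusters={n_clusters}."
        )
```

This kind of check cannot live in the static constraint dict, because whether n_clusters is valid is only known once the data has been seen.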

thomasjpfan

Thanks for opening the PR on this topic!

The way the specification is defined is a dictionary of lists of tuples where each tuple is: (valid_type, constraint). I like thinking of everything as a constraint.

As for the developer API, I see two parts:

Defining the constraints
Actually performing the validation.

In this PR, item 1 is a dictionary, and item 2 is a function call to validate_param. There is also another API for validate_params that combines item 1 and item 2 that is used in function calls. My preference is to have one API instead of two.

As we are already defining an Interval object, I think it's okay to go straight to defining a Validator object:

validator = Validator(
    n_clusters=[Interval(Integral, 1, None, closed="left")],
    init=[Options(["k-means++", "random"]), callable, "array-like"],
    tol=[Interval(Real, 0, None, closed="left")],
    algorithm=[Options(["lloyd", "elkan", "auto", "full"], deprecated={"auto", "full"})],
    max_no_improvement=[None,  Interval(Integral, 0, None, closed="left")]
)
validator.validate(n_clusters=2, ...)

The above can be used directly in functions.

For estimators:

class MyEstimator:
    _validator = Validator(...)

    def fit(self, X, y):
        self._validator.validate(**self.get_params())

I think the dictionary of lists of tuples has semantics that makes it harder to parse and a validator object makes the semantics clear.
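For comparison, a validator object along these lines could be sketched as follows. ToyValidator is a hypothetical name and this is only one possible reading of the proposal: each constraint is None, a type, or a predicate, and a value is valid if any constraint matches.

```python
class ToyValidator:
    """Hypothetical validator object holding constraints per parameter name."""

    def __init__(self, **constraints):
        self.constraints = constraints

    @staticmethod
    def _matches(value, constraint):
        if constraint is None:
            return value is None
        if isinstance(constraint, type):
            return isinstance(value, constraint)
        return bool(constraint(value))  # any other constraint is a predicate

    def validate(self, **params):
        for name, value in params.items():
            allowed = self.constraints.get(name)
            if allowed is None:
                continue  # parameter has no declared constraint
            if not any(self._matches(value, c) for c in allowed):
                raise ValueError(
                    f"Invalid value for parameter {name!r}: got {value!r}."
                )


class MyEstimator:
    _validator = ToyValidator(
        n_clusters=[int],
        init=[str, callable],
        max_no_improvement=[None, int],
    )

    def __init__(self, n_clusters=8, init="k-means++", max_no_improvement=None):
        self.n_clusters = n_clusters
        self.init = init
        self.max_no_improvement = max_no_improvement

    def fit(self, X, y=None):
        # vars(self) stands in for get_params() in this sketch
        self._validator.validate(**vars(self))
        return self
```

The same ToyValidator instance can be called directly inside a function, which is the single-API property the comment argues for.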

thomasjpfan

I like the dictionary of constraints idea!

jjerphan

Thank you for tackling this, @jeremiedbb.

I think this is very valuable for long-term maintenance.

Here is a first review.

adrinjalali

This looks quite interesting, and I'm quite happy with it. But I really would like to see what @jnothman thinks about it. I don't think this adds too much complexity and it's not a required API for developers.

adrinjalali

This looks really nice. I'm the second approver here, but since it's quite major, I'd like another set of eyes giving a thumbs-up before merging.

adrinjalali · 2022-05-13T11:09:17Z

Seems like there haven't been any objections since I left my last comment a month ago. I will update the branch and merge afterwards if CI passes.

adrinjalali · 2022-05-13T11:30:44Z

@jeremiedbb CI fails here.

lorentzenchr · 2022-05-13T11:57:41Z

+1 for this change. It is a step in the right direction! Thanks, @jeremiedbb, for your effort!

jjerphan · 2022-05-16T10:09:13Z

29c12fe resolves the failing tests. Feel free to cherry-pick (I've tried to do a PR on top of the branch of this PR but I can't).

Thank you once again, @jeremiedbb. I am looking forward to the merge.

jeremiedbb · 2022-05-16T10:15:45Z

Thanks @jjerphan. I was also looking at this :)
It actually needed a few more fixes. It should be OK now.

jjerphan · 2022-05-16T11:29:08Z

I would wait for a third approval before merging this one. What do you think, @adrinjalali?

adrinjalali · 2022-05-16T11:54:53Z

We also have @lorentzenchr 's +1 here. I think we can merge. I think there's been enough time to object if there were concerns.

lorentzenchr · 2022-05-16T16:03:56Z

How about opening a follow-up issue to track progress on the modules (making PARAM_VALIDATION_ESTIMATORS_TO_IGNORE smaller)?
Maybe also some PR/issue for the documentation?

* common parameter validation
* black
* cln
* wip
* wip
* rework
* renaming and cleaning
* lint
* re lint
* cln
* add tests
* lint
* make random_state constraint
* lint
* closed positional
* increase coverage + validate constraints
* exp typing
* trigger ci ?
* lint
* cln
* rev type hints
* cln
* interval closed kwarg only
* address comments
* address comments + more tests + cln + improve err msg
* lint
* cln
* cln
* address comments
* address comments
* lint
* adapt or skip new estimators
* lint

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>

…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) removed SVDD param-validation exception from test_common.py since scikit-learn#23462 is go (scikit-learn#22722)

…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) TST ensure SVDD passes param-validation test_common.py due to scikit-learn#23462 (scikit-learn#22722)

jeremiedbb added 2 commits March 7, 2022 16:55

common parameter validation

4484ccd

black

8e7535f

github-actions bot added module:cluster module:utils labels Mar 7, 2022

jeremiedbb commented Mar 7, 2022

jeremiedbb added module:test-suite everything related to our tests No Changelog Needed labels Mar 7, 2022

jeremiedbb mentioned this pull request Mar 7, 2022

Include entire range in check_scalar error message #22691

Open

cln

603d2d1

thomasjpfan reviewed Mar 7, 2022

jeremiedbb added 5 commits March 8, 2022 22:52

Merge branch 'master' into common-check-params

2570403

wip

84e8507

wip

87ae50e

Merge branch 'master' into common-check-params

7a797a5

rework

6de07ec

thomasjpfan reviewed Mar 9, 2022

sklearn/cluster/_kmeans.py (outdated, resolved)

sklearn/utils/_param_validation.py (outdated, resolved)

sklearn/utils/_param_validation.py (outdated, resolved)

jeremiedbb added 4 commits March 10, 2022 00:50

renaming and cleaning

ff5842a

lint

5664590

re lint

f2c2f4d

cln

3a03980

lorentzenchr reviewed Mar 10, 2022

sklearn/utils/_param_validation.py (resolved)

jjerphan reviewed Mar 10, 2022

sklearn/utils/_param_validation.py (outdated, resolved)

sklearn/utils/_param_validation.py (outdated, resolved)

sklearn/utils/_param_validation.py (outdated, resolved)

sklearn/utils/_param_validation.py (resolved)

jjerphan changed the title from "Common parameter validation" to "MAINT Common parameter validation" Mar 11, 2022

jeremiedbb added 3 commits March 11, 2022 16:20

Merge branch 'master' into common-check-params

70249cd

add tests

a79c19f

lint

33515b1

jeremiedbb mentioned this pull request Mar 12, 2022

Use decorators for simple input validations #14721

Closed

adrinjalali reviewed Mar 14, 2022

sklearn/cluster/_kmeans.py (outdated, resolved)

jeremiedbb added 2 commits March 14, 2022 17:15

make random_state constraint

f9be7c1

lint

9713bb7

lint

543182b

adrinjalali approved these changes Apr 14, 2022

Merge branch 'main' into common-check-params

40b9ccd

adapt or skip new estimators

f771dc0

lint

3b47006

adrinjalali merged commit 2b09fa0 into scikit-learn:main May 16, 2022

harupy mentioned this pull request May 17, 2022

Fix test_get_params_returns_dict_that_has_more_keys_than_max_params_tags_per_bat mlflow/mlflow#5886

Merged

29 tasks

jeremiedbb mentioned this pull request May 25, 2022

Make all estimators use _validate_params #23462

Closed

This was referenced Jul 26, 2022

MAINT validate parameter in KernelPCA #24020

Merged

MAINT add parameters validation for SplineTransformer #24057

Merged

kasmith11 mentioned this pull request Aug 3, 2022

MAINT Parameters validation for SpectralEmbedding #24103

Merged

naoise-h mentioned this pull request Sep 2, 2022

ENH Adding TypeError support to param validation #24327

Closed

glemaitre mentioned this pull request Nov 8, 2022

Make automatic validation for all scikit-learn public functions #24862

Closed

naoise-h mentioned this pull request Dec 8, 2022

[MNT] Diffprivlib 0.6.2 IBM/differential-privacy-library#77

Merged

jovan-stojanovic mentioned this pull request Dec 20, 2022

Encoders do not raise parameter Value Error at initialisation skrub-data/skrub#442

Closed

This was referenced Aug 2, 2023

API: consistency in input validation scipy/scipy#18972

Open

Document developer utils for parameter validation #27038

Open

AcylSilane mentioned this pull request Apr 26, 2024

Parameter Validation Documentation? #28903

Closed

jeromedockes mentioned this pull request Jun 18, 2024

MAINT add parameter validation using BaseEstimator skrub-data/skrub#958

Draft
