MAINT Common parameter validation#22722
Conversation
jeremiedbb
left a comment
There was a problem hiding this comment.
Here are some comments for possible extensions of this work.
sklearn/cluster/_kmeans.py
Outdated
| "n_clusters": [(numbers.Integral, Interval(1, None, closed="left"))], | ||
| "init": [ | ||
| (str, {"k-means++", "random"}), | ||
| (callable,), |
There was a problem hiding this comment.
Future work: we can imagine defining the subset of callables with a specific signature to pass here as valid values
| ) | ||
|
|
||
| self._check_params(X) | ||
| self._check_params_vs_input(X) |
There was a problem hiding this comment.
This is a second round of checks after data validation that deals with valid values that depend on the data or on other parameters.
There was a problem hiding this comment.
Thanks for opening the PR on this topic!
The way the specification is defined is a dictionary of list of tuples where each tuple is: (valid_type, constraint). I like thinking of everything as a constraint.
As for the developer API, I see two parts:
- Defining the constraints
- Actually performing the validation.
In this PR, item 1 is a dictionary, and item 2 is a function call to validate_param. There is also another API for validate_params that combines item 1 and item 2 that is used in function calls. My preference is to have one API instead of two.
As we are already defining a Interval object, I think it's okay to go straight to defining a Validator object:
validator = Validator(
n_clusters=[Interval(Integral, 1, None, closed="left")],
init=[Options(["k-means++", "random"]), callable, "array-like")],
tol=[Interval(Real, 0, None, closed="left")],
algorithm=[Options(["lloyd", "elkan", "auto", "full"], deprecated={"auto", "full"})],
max_no_improvement=[None, Interval(Integral, 0, None, closed="left")]
)
validator.validate(n_clusters=2, ...)The above can be used directly in functions.
For estimators:
class MyEstimator:
_validator = Validator(...)
def fit(self, X, y):
self._validator.validate(self.get_params())I think the dictionary of lists of tuples has semantics that makes it harder to parse and a validator object makes the semantics clear.
thomasjpfan
left a comment
There was a problem hiding this comment.
I like the dictionary of constraints idea!
jjerphan
left a comment
There was a problem hiding this comment.
Thank you for tackling this, @jeremiedbb.
I think this very valuable for maintenance on the long term.
Here is a first review.
adrinjalali
left a comment
There was a problem hiding this comment.
This looks quite interesting, and I'm quite happy with it. But I really would like to see what @jnothman thinks about it. I don't think this adds too much complexity and it's not a required API for developers.
adrinjalali
left a comment
There was a problem hiding this comment.
This looks really nice. I'm the second approver here, but since it's quite major, I'd like another set of eyes giving a thumb up before merging.
|
Seems like there hasn't been any objections since I left my last comment a month ago. Will update the branch, and merge after if CI passes. |
|
@jeremiedbb CI fails here. |
|
+1 for this change. It is a step in the right direction! Thank @jeremiedbb for your effort! |
|
29c12fe resolves the failing tests. Feel free to cherry-pick (I've tried to do a PR on top of the branch of this PR but I can't). Thank you once again, @jeremiedbb. I am looking forward to the merge. |
|
thanks @jjerphan. I was also looking at this :) |
|
I would wait for another 3rd approval before merging this one. What do you think, @adrinjalali? |
|
We also have @lorentzenchr 's +1 here. I think we can merge. I think there's been enough time to object if there were concerns. |
|
How about opening a follow-up issue to track progress on the modules (making |
* common parameter validation * black * cln * wip * wip * rework * renaming and cleaning * lint * re lint * cln * add tests * lint * make random_state constraint * lint * closed positional * increase coverage + validate constraints * exp typing * trigger ci ? * lint * cln * rev type hints * cln * interval closed kwarg only * address comments * address comments + more tests + cln + improve err msg * lint * cln * cln * address comments * address comments * lint * adapt or skip new estimators * lint Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) removed SVDD param-validation exception from test_common.py since scikit-learn#23462 is go (scikit-learn#22722)
…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) TST ensure SVDD passes param-validation test_common.py due to scikit-learn#23462 (scikit-learn#22722)
This PR proposes a unified design for parameter validation across estimators, classes and functions.
The goal is to have a consistent way to raise an informative error message when a parameter does not have a valid type/value. Here's an example:
It's also meant to centralize all these checks in one place, i.e. being the first instruction of fit or of a function. Currently they can be spread throughout fit making it hard to follow and slow to fail. I also find that having all this boilerplate inside fit makes the actual interesting code of the algorithm hard to find and mixed up with non-relevant code.
In addition, these checks are currently often done for a small subset of the parameters and often not tested. And when tested, it's often spread inside several tests.
This PR only deals with non co-dependent types and values between parameters. For instance if a value of a parameter is valid only if some value of another parameter is set.
I propose to add to
BaseEstimatora method_validate_paramsthat performs validation for all parameters of estimators and a decoratorvalidate_paramsfor public functions. Validation is made against a dictparam_name: constraintwhere constraint is a list of valid types/values.I also propose to add a new common test that makes sure this is done for all estimators (almost all of them being skipped right now).
closes #14721