[MRG + 1] Add check for estimator: parameters not modified by fit#7846
[MRG + 1] Add check for estimator: parameters not modified by fit#7846jnothman merged 4 commits intoscikit-learn:masterfrom kiote:feature-test_fit_adds_private_attributes
fit#7846Conversation
| if hasattr(estimator, "n_clusters"): | ||
| estimator.n_clusters = 1 | ||
|
|
||
| set_random_state(estimator, 1) |
There was a problem hiding this comment.
this part seems to be very repetitive, so maybe it's possible to refactor / extract method here.. not sure though
|
|
||
| def check_fit2d_predict1d(name, Estimator): | ||
| # check by fitting a 2d array and prediting with a 1d array | ||
| # check by fitting a 2d array and predicting with a 1d array |
There was a problem hiding this comment.
not relevant to the current commit, just caught my eye. Hope it's innocent enough!
|
there are some amount of estimators, which is not follow this rule from the beginning, that's the reason of failing tests |
sklearn/utils/estimator_checks.py
Outdated
| for val in substracted_dicts: | ||
| assert_true(val.startswith('_') or val.endswith('_'), | ||
| ('Estimator sets invalid attributes during the fit method' | ||
| 'should either start with an _ or end with a _')) |
There was a problem hiding this comment.
Could you please list the attributes that are set inappropriately here? Then the test logs will be more informative.
fit
sklearn/utils/estimator_checks.py
Outdated
| estimator.fit(X, y) | ||
|
|
||
| dict_after_fit = estimator.__dict__ | ||
| # leave only attributes which have been set by fit |
There was a problem hiding this comment.
I think we'd also like to check that attributes set in __init__ were not modified by fit
There was a problem hiding this comment.
okay, that uncovers my total lack of knowledge of how the things work 😕 . I wonder how this sentence and this comment: #7553 (comment) are working together?
There was a problem hiding this comment.
why the tears? Sorry response times are slow atm... I'm watching, but everything is just queued up for the moment!
There was a problem hiding this comment.
haha I understand! Just trying to figure out how to make this run and for me your sentence saying: "attributes set in __init__ were not modified by fit" and @amueller comment some time ago saying: "We certainly change the __dict__ during fit" conflicting in my mind, so I can't really continue here without resolving this conflict.
So the real question is: should we or shouldn't we modify __dict__ during fit for any given estimator?
Do I understand correctly, that we can add new attributes to __dict__ during fit, but we can't modify the old one (set by __init__)?
There was a problem hiding this comment.
In fit, we should only usually be modifying (usually adding, but not necessarily) attributes that start or end with _.
|
Made error message more informative, also add one more test, checking that A lot of estimators violate this rules, though. |
|
the trailing underscore is not about public or private, it's a scikit-learn convention. |
amueller
left a comment
There was a problem hiding this comment.
This looks pretty good already :)
sklearn/utils/estimator_checks.py
Outdated
|
|
||
| dict_after_fit = estimator.__dict__ | ||
| # leave only attributes which have been set by fit | ||
| substracted_dicts_keys = [k for k in dict_after_fit.keys() |
There was a problem hiding this comment.
I think you should check that they are the same before and after.
There was a problem hiding this comment.
yes, this we check this in the previous test here
sklearn/utils/estimator_checks.py
Outdated
| substracted_dicts_keys = [k for k in dict_after_fit.keys() | ||
| if k not in dict_before_fit.keys()] | ||
|
|
||
| for val in substracted_dicts_keys: |
|
|
||
| def check_fit2d_predict1d(name, Estimator): | ||
| # check by fitting a 2d array and prediting with a 1d array | ||
| # check by fitting a 2d array and predicting with a 1d array |
sklearn/utils/estimator_checks.py
Outdated
| if not (attr.startswith('_') or attr.endswith('_'))] | ||
|
|
||
| for attr in public_attrs: | ||
| assert_equal(dict_before_fit[attr], dict_after_fit[attr], |
There was a problem hiding this comment.
why don't you just do this as part of the test above, instead of removing the entries in the dict? There is a lot of code duplication otherwise as you saw.
There was a problem hiding this comment.
okay, we actually don't need the second tests, as you pointed out before, cause we have this test for fit doesn't change __dict__. So looks like I mislead myself here :)
sklearn/utils/estimator_checks.py
Outdated
|
|
||
|
|
||
| def check_fit_changes_private_attributes_only(name, Estimator): | ||
| if name in ['GaussianProcess', 'GaussianProcessRegressor', |
There was a problem hiding this comment.
This is pretty bad. Can you remove this line so that we can see the errors in continuous integration? These seems problematic and we might want to fix them.
|
I uncommented passing of the estimators. So now those ones which do not follow the rule "do not change public attributes" are failing in the CI. |
|
Should I try to do something with these estimators? |
|
These appear to be |
|
here:
|
|
Thanks! That is a very useful summary. Will look into it soon. |
|
This is great, thanks @kiote, although I can tell that it's only erroring for one attribute even if multiple are set. Would be a good idea to list all added. In my opinion:
Individual PRs welcome. If you don't want to make that effort, please post this as one or more new issues so we can seek contributors. |
|
Thanks Joel! To make this issue bounded somehow, I can suggest:
What do you think? |
|
I'd like to try and fix the issues as far as possible before merging this PR. It shouldn't be too hard to do most of those changes. |
|
okay than I'm on that |
|
Maybe skip classes that are deprecated? (you can find out if something is deprecated if |
|
If only we had some way to easily rename attributes cough futurepastcough. Maybe I should finish that at some point? |
|
After skipping depreciated estimators and printing all attributes we have the same result as before but without
error for |
|
y_train_mean can become _y_train_mean. y_train_ is already stored.
…On 2 December 2016 at 00:48, Ekaterina Krivich ***@***.***> wrote:
After skipping depreciated estimators and printing all attributes we have
the same result as before
<#7846 (comment)>
but without GaussianProcess and with
Estimators are only allowed to add private attributes either started with
_ or ended with _ but rng, y_train_mean added
error for GaussianProcessRegressor. I can follow Joel's recommendations
or just stop here and wait for Andreas [image:
|
sklearn/covariance/graph_lasso_.py
Outdated
| self.store_precision = True | ||
|
|
||
| @property | ||
| @deprecated("Attribute grid_scores was deprecated in version 0.__ and " |
| " before making predictions`.") | ||
|
|
||
| @property | ||
| @deprecated("Attribute n_features was deprecated in version 0.__ and " |
sklearn/gaussian_process/gpr.py
Outdated
| return self._rng | ||
|
|
||
| @property | ||
| @deprecated("Attribute y_train_mean was deprecated in version 0.__ and " |
| self.n_jobs = int(n_jobs) | ||
|
|
||
| @property | ||
| @deprecated("Attribute loss_function was deprecated in version 0.__ and " |
sklearn/manifold/t_sne.py
Outdated
| skip_num_points=skip_num_points) | ||
|
|
||
| @property | ||
| @deprecated("Attribute n_iter_final was deprecated in version 0.__ and " |
|
Set right versions in the deprecation messages |
| # if is_classification | ||
| if self.n_classes_ > 1: | ||
| max_features = max(1, int(np.sqrt(self.n_features))) | ||
| max_features = max(1, int(np.sqrt(self._n_features))) |
There was a problem hiding this comment.
I'm not an expert of that code but it is important to keep n_features in memory.
Can we change the signature of _check_params(self) -> _check_params(self, X, y) ?
X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) can also be done in _check_params.
| sklearn.decomposition.sparse_encode | ||
|
|
||
| """ | ||
| method = 'lar' |
There was a problem hiding this comment.
self.method is it used in another place than l678 and l699.
I don't like defined like that.
There was a problem hiding this comment.
I know it is done in other part of the code but is it absolutely necessary to put it here ?
There was a problem hiding this comment.
we were discussing this change before with @jnothman I suppose, that was his idea to move it here, so we'll be able to make new tests. Also he pointed that it's better to define it this way, so I'm a bit confused, honestly.
I didn't understand what l678 and l699, thought it means "line 678", but not sure what should I see on those lines then 😕
There was a problem hiding this comment.
Sorry for that, I've missed the discussion indeed. Forget my comment :)
There was a problem hiding this comment.
It's not wonderful, but I don't think it's this PR's job to fix it. The problem derives from the use of inheritance, but non-use of super...__init__. In the current code, method is twice set, like this, as a class attribute (in the CV objects), and twice as an instance attribute. It might make more sense to have it as a private attribute, but this approach at least ensures a consistency that is not in master, and that the instance's __dict__ only has true parameters.
There was a problem hiding this comment.
Ok I see. Thanks for the explanation @jnothman.
| sklearn.decomposition.sparse_encode | ||
|
|
||
| """ | ||
| method = 'lasso' |
There was a problem hiding this comment.
Sorry for that, I've missed the discussion indeed. Forget my comment :)
ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule see #7763
|
Thanks for your review, but honestly I was trying to keep this PR as much specific as possible, so with that changes you've pointed to (like changing the signature of _check_params) it would be much harder 😟 |
|
@kiote We can change that in another PR indeed. Thanks for your answers and your work. LGTM |
|
@tguillemot, did you happen to check that deprecated attributes etc are no longer in use in e.g. examples? I can't remember if I did that check, but it was @amueller's last stated concern since he gave his premature +1. |
|
I have missed one Can you replace it @kiote ? |
|
okay, I also noticed that I promised to rename |
|
Thanks a lot, @kiote! |
|
Thx @kiote |
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
Reference Issue
Fixes #7763
What does this implement/fix? Explain your changes.
Add simple test to estimator_tests, which checks that
__dict__does not have non-private attributes afterfitAny other comments?
didn't add any documentation for that, do I need to?