[MRG + 1] Add check for estimator: parameters not modified by `fit` by kiote · Pull Request #7846 · scikit-learn/scikit-learn

kiote · 2016-11-09T08:11:34Z

Reference Issue

Fixes #7763

What does this implement/fix? Explain your changes.

Add simple test to estimator_tests, which checks that __dict__ does not have non-private attributes after fit

Any other comments?

didn't add any documentation for that, do I need to?

kiote · 2016-11-09T08:13:01Z

sklearn/utils/estimator_checks.py

+    if hasattr(estimator, "n_clusters"):
+        estimator.n_clusters = 1
+
+    set_random_state(estimator, 1)


this part seems to be very repetitive, so maybe it's possible to refactor / extract method here.. not sure though

sounds good

kiote · 2016-11-09T08:14:09Z

sklearn/utils/estimator_checks.py

+
 def check_fit2d_predict1d(name, Estimator):
-    # check by fitting a 2d array and prediting with a 1d array
+    # check by fitting a 2d array and predicting with a 1d array


not relevant to the current commit, just caught my eye. Hope it's innocent enough!

Sure that's fine :)

kiote · 2016-11-10T08:30:28Z

A number of estimators have not followed this rule from the beginning; that is why the tests are failing.

jnothman · 2016-11-10T12:38:16Z

sklearn/utils/estimator_checks.py

+    for val in substracted_dicts:
+        assert_true(val.startswith('_') or val.endswith('_'),
+                    ('Estimator sets invalid attributes during the fit method'
+                     'should either start with an _ or end with a _'))


Could you please list the attributes that are set inappropriately here? Then the test logs will be more informative.

jnothman · 2016-11-10T12:41:02Z

sklearn/utils/estimator_checks.py

+    estimator.fit(X, y)
+
+    dict_after_fit = estimator.__dict__
+    # leave only attributes which have been set by fit


I think we'd also like to check that attributes set in __init__ were not modified by fit

okay, that uncovers my total lack of knowledge of how things work 😕 . I wonder how this sentence and this comment: #7553 (comment) fit together?

why the tears? Sorry response times are slow atm... I'm watching, but everything is just queued up for the moment!

haha I understand! I'm just trying to figure out how to make this run. For me, your sentence saying "attributes set in __init__ were not modified by fit" and @amueller's comment some time ago saying "We certainly change the __dict__ during fit" conflict in my mind, so I can't really continue here without resolving this conflict.

So the real question is: should we or shouldn't we modify __dict__ during fit for any given estimator?

Do I understand correctly that we can add new attributes to __dict__ during fit, but we can't modify the old ones (set by __init__)?

In fit, we should usually only be modifying (usually adding, but not necessarily) attributes that start or end with _.

kiote · 2016-11-18T16:52:55Z

Made the error message more informative, and also added one more test, checking that fit does not change public attributes. So, with this we have two new tests: fit does not add new non-private attributes, and fit does not change any non-private attributes added before.

A lot of estimators violate these rules, though.

amueller · 2016-11-18T17:25:47Z

the trailing underscore is not about public or private, it's a scikit-learn convention.
Leading underscores are indicating private attributes.
In __init__ we should only add attributes that have neither a leading nor a trailing underscore. In fit we should not change any of these, but we can add (or change if we call fit repeatedly) attributes with a trailing underscore. Adding or changing private attributes is also fine.
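The convention described above can be sketched as a standalone check. This is only an illustration, not the code added in this PR: the helper name `check_fit_respects_attribute_convention` and the two toy estimators are hypothetical, and the plain `!=` comparison assumes parameter values are simple Python objects (array-valued parameters would need a different comparison).

```python
import copy


def check_fit_respects_attribute_convention(estimator, X, y):
    """Hypothetical helper: fit may only add or change `_x` / `x_` attributes."""
    before = copy.deepcopy(vars(estimator))
    estimator.fit(X, y)
    after = vars(estimator)

    def is_public(name):
        # "Public" here means neither a leading nor a trailing underscore.
        return not (name.startswith('_') or name.endswith('_'))

    # Every attribute added by fit must start or end with an underscore.
    added = sorted(k for k in after if k not in before and is_public(k))
    assert not added, "fit added public attributes: %s" % ', '.join(added)

    # Attributes set in __init__ must be left untouched by fit.
    changed = sorted(k for k in before
                     if is_public(k)
                     and (k not in after or before[k] != after[k]))
    assert not changed, "fit changed public attributes: %s" % ', '.join(changed)


class GoodEstimator:
    """Toy estimator that follows the convention."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.coef_ = [self.alpha * v for v in y]   # trailing underscore: fine
        self._n_samples = len(X)                   # private: fine
        return self


class BadEstimator(GoodEstimator):
    """Toy estimator that adds public attributes during fit."""

    def fit(self, X, y):
        self.n_features = len(X[0])
        self.rng = 0
        return self
```

Collecting all offending names before asserting means the failure message lists every attribute at once rather than stopping at the first one.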

amueller

This looks pretty good already :)

amueller · 2016-11-18T17:26:50Z

sklearn/utils/estimator_checks.py

+
+    dict_after_fit = estimator.__dict__
+    # leave only attributes which have been set by fit
+    substracted_dicts_keys = [k for k in dict_after_fit.keys()


I think you should check that they are the same before and after.

yes, we check this in the previous test here

amueller · 2016-11-18T17:27:04Z

sklearn/utils/estimator_checks.py

+    substracted_dicts_keys = [k for k in dict_after_fit.keys()
+                              if k not in dict_before_fit.keys()]
+
+    for val in substracted_dicts_keys:


This is the right test :)

amueller · 2016-11-18T17:27:20Z

sklearn/utils/estimator_checks.py

+
 def check_fit2d_predict1d(name, Estimator):
-    # check by fitting a 2d array and prediting with a 1d array
+    # check by fitting a 2d array and predicting with a 1d array


Sure that's fine :)

amueller · 2016-11-18T17:28:44Z

sklearn/utils/estimator_checks.py

+                    if not (attr.startswith('_') or attr.endswith('_'))]
+
+    for attr in public_attrs:
+        assert_equal(dict_before_fit[attr], dict_after_fit[attr],


why don't you just do this as part of the test above, instead of removing the entries in the dict? There is a lot of code duplication otherwise as you saw.

okay, we actually don't need the second test, as you pointed out before, because we already have the test that fit doesn't change __dict__. So it looks like I misled myself here :)

amueller · 2016-11-18T17:29:39Z

sklearn/utils/estimator_checks.py

+
+
+def check_fit_changes_private_attributes_only(name, Estimator):
+    if name in ['GaussianProcess', 'GaussianProcessRegressor',


This is pretty bad. Can you remove this line so that we can see the errors in continuous integration? These seem problematic and we might want to fix them.

kiote · 2016-11-22T07:11:36Z

I removed the skipping of those estimators. So now the ones which do not follow the rule "do not change public attributes" are failing in the CI.

kiote · 2016-11-23T07:28:42Z

Should I try to do something with these estimators?

jnothman · 2016-11-23T08:07:22Z

These appear to be KeyErrors, not AssertionErrors. Could you please start by making the error message more informative as to what needs to be fixed? Then please list the errors in a comment here so that we can review if they are straightforward to fix.

kiote · 2016-11-24T07:35:57Z

here:

| Estimator | Error |
| --- | --- |
| GaussianProcess | Estimator adds public attribute during the fit method. Estimators are only allowed to add private attributes either started with _ or ended with _ but X_mean added |
| GaussianProcessRegressor | ... rng added |
| GradientBoostingClassifier | ... n_features added |
| GradientBoostingRegressor | ... n_features added |
| GraphLassoCV | ... grid_scores added |
| LarsCV | ... fit_path added |
| LassoLarsCV | ... fit_path added |
| LassoLarsIC | ... fit_path added |
| PassiveAggressiveClassifier | ... loss_function added |
| Perceptron | ... loss_function added |
| SGDClassifier | ... loss_function added |
| TSNE | ... n_iter_final added |

jnothman · 2016-11-24T21:33:59Z

Thanks! That is a very useful summary. Will look into it soon.

jnothman · 2016-11-30T03:04:17Z

This is great, thanks @kiote, although I can tell that it's only erroring for one attribute even if multiple are set. Would be a good idea to list all added.

In my opinion:

| Estimator | Attribute | Solution |
| --- | --- | --- |
| GaussianProcess | X_mean | we could deprecate this and other similar attributes (`X_std` etc.), so that the tests pass, but the class is deprecated in entirety. |
| GaussianProcessRegressor | rng | Deprecate for removal to a local variable, I think |
| GradientBoostingClassifier | n_features | deprecate and rename to `_n_features`; or abolish it and use `len(self.feature_importances_)` instead at prediction time. |
| GradientBoostingRegressor | n_features | " |
| GraphLassoCV | grid_scores | Deprecate and rename to `grid_scores_`. We may replace this with `cv_results_` some time soon, anyway. |
| LarsCV | fit_path | Move initialisation to `__init__`, I think |
| LassoLarsCV | fit_path | " |
| LassoLarsIC | fit_path | " |
| PassiveAggressiveClassifier | loss_function | Deprecate and rename to `loss_function_` |
| Perceptron | loss_function | " |
| SGDClassifier | loss_function | " |
| TSNE | n_iter_final | deprecate and rename to `n_iter_` |

Individual PRs welcome. If you don't want to make that effort, please post this as one or more new issues so we can seek contributors.
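The "deprecate and rename" approach suggested in the table can be sketched roughly as follows. This is a minimal illustration using only the standard-library `warnings` module rather than scikit-learn's own `deprecated` decorator; the class `TinySGD` and the stored value are hypothetical.

```python
import warnings


class TinySGD:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def __init__(self, loss="hinge"):
        self.loss = loss

    def fit(self, X, y):
        # The fitted value is stored under the new trailing-underscore name.
        self.loss_function_ = "resolved-" + self.loss
        return self

    @property
    def loss_function(self):
        # The old public name survives as a read-only alias that warns on
        # access. Being a property on the class, it never appears in the
        # instance __dict__, so a check on attributes added by fit passes.
        warnings.warn(
            "Attribute loss_function is deprecated; "
            "use loss_function_ instead.",
            DeprecationWarning, stacklevel=2)
        return self.loss_function_
```

Existing user code that reads `est.loss_function` keeps working during the deprecation period, while `fit` itself only writes `loss_function_`.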

kiote · 2016-11-30T06:20:58Z

Thanks Joel!

To make this issue bounded somehow, I can suggest:

1. As you said, add all attributes to the error message instead of just one;
2. New input may be needed after that;
3. Make a deprecation message instead of error messages while testing, to be able to merge this PR;
4. Create a new issue saying "replace deprecation message with failing tests" or something like that, and fix all estimators in that new issue (hopefully I'll be able to do that as well).

What do you think?

jnothman · 2016-11-30T06:28:51Z

I'd like to try and fix the issues as far as possible before merging this PR. It shouldn't be too hard to do most of those changes.

kiote · 2016-11-30T06:37:50Z

okay, then I'm on that

amueller · 2016-11-30T22:12:10Z

Maybe skip classes that are deprecated? You can find out if something is deprecated by checking whether object.__init__ has an attribute _deprecated_original. That would get rid of the GaussianProcess failure.
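A rough sketch of that detection idea, under stated assumptions: the decorator `mark_deprecated` below is a hypothetical stand-in that only mimics the one detail mentioned in the comment (the wrapped `__init__` carries a `_deprecated_original` attribute); it is not scikit-learn's implementation.

```python
import functools


def mark_deprecated(cls):
    """Hypothetical decorator: wrap __init__ and tag the wrapper."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def wrapped(self, *args, **kwargs):
        return original_init(self, *args, **kwargs)

    # Assumed marker, named as in the comment above.
    wrapped._deprecated_original = original_init
    cls.__init__ = wrapped
    return cls


def is_deprecated(cls):
    """Return True if the class's __init__ carries the marker attribute."""
    return hasattr(cls.__init__, "_deprecated_original")


@mark_deprecated
class OldEstimator:
    def __init__(self):
        self.alpha = 1.0


class NewEstimator:
    def __init__(self):
        self.alpha = 1.0
```

A test loop could then skip any class for which `is_deprecated(cls)` is true.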

amueller · 2016-11-30T22:14:15Z

If only we had some way to easily rename attributes *cough* futurepast *cough*. Maybe I should finish that at some point?

kiote · 2016-12-01T13:48:12Z

After skipping deprecated estimators and printing all attributes, we have the same result as before, but without GaussianProcess and with this error for GaussianProcessRegressor:

Estimators are only allowed to add private attributes either started with _ or ended with _ but rng, y_train_mean added

I can follow Joel's recommendations or just stop here and wait for Andreas.

jnothman · 2016-12-02T02:51:28Z

y_train_mean can become _y_train_mean. y_train_ is already stored.


amueller · 2016-12-06T19:21:30Z

sklearn/covariance/graph_lasso_.py

        self.store_precision = True

+    @property
+    @deprecated("Attribute grid_scores was deprecated in version 0.__ and "


0.19 and 0.21

amueller · 2016-12-06T19:21:49Z

sklearn/ensemble/gradient_boosting.py

                                 " before making predictions`.")

+    @property
+    @deprecated("Attribute n_features was deprecated in version 0.__ and "


0.19 and 0.21

amueller · 2016-12-06T19:22:02Z

sklearn/gaussian_process/gpr.py

+        return self._rng
+
+    @property
+    @deprecated("Attribute y_train_mean was deprecated in version 0.__ and "


amueller · 2016-12-06T19:22:10Z

sklearn/linear_model/stochastic_gradient.py

        self.n_jobs = int(n_jobs)

+    @property
+    @deprecated("Attribute loss_function was deprecated in version 0.__ and "


amueller · 2016-12-06T19:22:37Z

sklearn/manifold/t_sne.py

                          skip_num_points=skip_num_points)

+    @property
+    @deprecated("Attribute n_iter_final was deprecated in version 0.__ and "


kiote · 2016-12-07T06:13:05Z

Set right versions in the deprecation messages

tguillemot · 2017-01-18T16:36:12Z

sklearn/ensemble/gradient_boosting.py

                # if is_classification
                if self.n_classes_ > 1:
-                    max_features = max(1, int(np.sqrt(self.n_features)))
+                    max_features = max(1, int(np.sqrt(self._n_features)))


I'm not an expert in that code, but it is important to keep n_features in memory.
Can we change the signature from _check_params(self) to _check_params(self, X, y)?

X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) can also be done in _check_params.

tguillemot · 2017-01-18T16:47:14Z

sklearn/linear_model/least_angle.py

    sklearn.decomposition.sparse_encode

    """
+    method = 'lar'


Is self.method used anywhere other than l678 and l699?
I don't like it being defined like that.

I know it is done in other parts of the code, but is it absolutely necessary to put it here?

We were discussing this change before with @jnothman, I think. It was his idea to move it here, so that we'd be able to make the new tests pass. He also pointed out that it's better to define it this way, so I'm a bit confused, honestly.

I didn't understand what l678 and l699 mean. I thought it meant "line 678", but I'm not sure what I should see on those lines then 😕

Sorry for that, I've missed the discussion indeed. Forget my comment :)

It's not wonderful, but I don't think it's this PR's job to fix it. The problem derives from the use of inheritance, but non-use of super...__init__. In the current code, method is twice set, like this, as a class attribute (in the CV objects), and twice as an instance attribute. It might make more sense to have it as a private attribute, but this approach at least ensures a consistency that is not in master, and that the instance's __dict__ only has true parameters.

Ok I see. Thanks for the explanation @jnothman.
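The class-attribute approach explained in this thread can be illustrated with a small sketch. The class names `BaseLars` and `BaseLassoLars` are hypothetical and the `fit` body is a placeholder; the point is only that a class-level `method` stays out of the instance `__dict__`.

```python
class BaseLars:
    """Hypothetical base class; not the scikit-learn implementation."""

    method = 'lar'  # class attribute: shared, not stored per instance

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # Placeholder fit: only a trailing-underscore attribute is added.
        self.coef_ = [0.0 for _ in X[0]]
        return self


class BaseLassoLars(BaseLars):
    # Subclasses override the class attribute without touching __init__.
    method = 'lasso'
```

Because `method` lives on the class, `vars(BaseLassoLars())` contains only the true parameter `fit_intercept`, while `self.method` still resolves to `'lasso'` inside methods.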

tguillemot · 2017-01-18T16:48:04Z

sklearn/linear_model/least_angle.py

    sklearn.decomposition.sparse_encode

    """
+    method = 'lasso'


I don't like it.

Sorry for that, I've missed the discussion indeed. Forget my comment :)

ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule see #7763

kiote · 2017-01-19T07:42:22Z

Thanks for your review, but honestly I was trying to keep this PR as specific as possible, so with the changes you've pointed to (like changing the signature of _check_params) it would be much harder 😟

tguillemot · 2017-01-19T08:36:20Z

@kiote We can change that in another PR indeed. Thanks for your answers and your work.

LGTM

jnothman · 2017-01-19T09:34:12Z

@tguillemot, did you happen to check that deprecated attributes etc are no longer in use in e.g. examples? I can't remember if I did that check, but it was @amueller's last stated concern since he gave his premature +1.

tguillemot · 2017-01-19T10:16:04Z

I have missed one n_features from partial_dependence.py -- line 261

   if gbrt.n_features != X.shape[1]:
        raise ValueError('X.shape[1] does not match gbrt.n_features')

Can you replace it @kiote ?

kiote · 2017-01-19T10:55:06Z

okay, I also noticed that I promised to rename _n_features to n_features_ to @amueller, but didn't! Sorry, will fix that, too.

jnothman · 2017-01-20T00:33:22Z

Thanks a lot, @kiote!

tguillemot · 2017-01-20T08:29:11Z

Thx @kiote


kiote commented Nov 9, 2016

kiote changed the title ~~Add check for estimator~~ [WIP] Add check for estimator Nov 10, 2016

jnothman requested changes Nov 10, 2016

jnothman changed the title ~~[WIP] Add check for estimator~~ [WIP] Add check for estimator: parameters not modified by fit Nov 10, 2016

jnothman reviewed Nov 10, 2016

amueller requested changes Nov 18, 2016

jnothman mentioned this pull request Nov 30, 2016

rng attribute is set in sklearn.gaussian_process.GaussianProcessRegressor at fit time #7752

Closed

amueller reviewed Dec 6, 2016

tguillemot suggested changes Jan 18, 2017

kiote added 2 commits January 19, 2017 10:25

Add check for estimator: parameters not modified by fit

286ffb7

ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule see #7763

Fix typos

bbe8612

tguillemot approved these changes Jan 19, 2017

kiote added 2 commits January 19, 2017 22:05

Rename _n_features to n_features_

139c9d9

Rename n_features to n_features_

5c86f74

jnothman merged commit be305ce into scikit-learn:master Jan 20, 2017

Przemo10 mentioned this pull request Mar 17, 2017

update fork (#1) #8606

Closed

trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jul 26, 2017

rebase and catch up to scikit-learn#6762, scikit-learn#7673, scikit-l…

e5dc8eb

…earn#7846

amueller mentioned this pull request Aug 3, 2017

sklearn-0.19: deprecation warnings of y_train_mean scikit-optimize/scikit-optimize#462

Closed

trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Aug 12, 2017

rebase and catch up to scikit-learn#6762, scikit-learn#7673, scikit-l…

10fa1bd

…earn#7846

wxchan mentioned this pull request Sep 3, 2017

[python] improved sklearn interface lightgbm-org/LightGBM#870

Merged

sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 16, 2017

Set self.features_ instead of self.features

0e58a65

See scikit-learn/scikit-learn#7846

sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 16, 2017

Set self.features_ instead of self.features

deeaeef

See scikit-learn/scikit-learn#7846

sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 30, 2017

Set self.features_ instead of self.features

f029ab5

See scikit-learn/scikit-learn#7846

sebp added a commit to sebp/scikit-survival that referenced this pull request Nov 18, 2017

Set self.features_ instead of self.features

c21f7f5

See scikit-learn/scikit-learn#7846

trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request May 25, 2018

rebase and catch up to scikit-learn#6762, scikit-learn#7673, scikit-l…

152a190

…earn#7846

abenbihi mentioned this pull request Feb 25, 2019

[MRG] Changed self.rng to private (self.rng_) in sklearn/gaussian_process/g… #7766

Closed


