[MRG] Added check for idempotence of fit() #12328
TomDLT merged 12 commits into scikit-learn:master
Conversation
sklearn/utils/estimator_checks.py
Outdated diff:

```python
# Fit again
est.fit(X_train, y_train)

if hasattr(est, 'predict'):
```
Can we consider resuscitating the concept of assert_same_model (#4841)?
Ideally yes...
I tried to do that by comparing the attributes ending with _ but ran into tons of edge cases, so I went with the prediction comparison, which is probably enough for this issue.
Basically the question is, is it really worth it? It's doable but there's no doubt the code is going to be a pain to maintain.
The code in #4841 ignores the comparison of some attributes, in particular the attributes that are themselves estimators (see here), and it's already pretty involved!
I don't mind it being approximate and efficient in most cases, rather than having the complexity of #4841... but it should be reusable, rather than repeated ad-hoc.
A further option may be to make each estimator responsible for its tests of equivalence.
> A further option may be to make each estimator responsible for its tests of equivalence
But I guess for those that have some deeply nested attributes that need to be checked for equality, the code would still be rather complex / adhoc, and duplicated across all those estimators?
I think most of the time we can presume that identical predictions/transformation is strong enough (although if all attributes can be tested for equality that would be faster in many cases). Where those methods do not exist, then we may need an alternative.
Asserting that two models are not equal would be harder. A function is_model_equal returning {yes, likely, no, unknown} would be useful and could be strengthened over time
So what do you suggest exactly? Should I turn this test into is_model_equal and return either likely if it passes or unknown if it fails?
If transform etc are not equal we know the model is not equal. If we can test that the objects are identical we know the models are identical. If we show that predict, transform, etc, match on random data then it is likely they are equal, and if we have no way to test, then it is unknown.
Here the test would pass if anything but 'no' was returned.
Don't we have similar assertions elsewhere? I'd expect this to be used multiple times already.
> Don't we have similar assertions elsewhere?
I have no idea to be honest.
I still don't understand what you want to do regarding this PR:
- Do you want to have is_model_equal?
- If yes, in this PR? Another one? Should I create a new issue so we can discuss requirements / implementation details there?
- Should we still merge this PR, and if not, do you want to reimplement it once we have is_model_equal?
Omg. I'm sorry. I was coming at this thinking that we had tests with similar checks of model equivalence... But maybe they were only in pull requests, or my imagination, or aren't expressed like this. Okay. Let's not bother about making it generic in this issue.
Python 2 tests are failing... Do we care?

It's actually a numpy error,
jnothman left a comment:
I think I'd still rather this structured to use a generic model comparator. It would be more readable apart from anything else. But I'm okay with it as it stands.
sklearn/utils/estimator_checks.py
Outdated diff:

```python
if hasattr(est, 'predict'):
    pred_1 = est.predict(X_test)
if hasattr(est, 'predict_proba'):
    pred_proba_1 = est.predict_proba(X_test)
```
Note that we don't need to test predict if predict_proba is tested. This is the kind of rationalisation and optimisation that writing a generic function would benefit from where here it would look ugly.
jnothman left a comment:
Actually I've realised where we have logic like this. And I think the code is cleaner to read, if not something we should refactor into a reusable function: check_estimators_pickle. Can we make this look more like that, or refactor?
Thanks for the ref, it looks much better like this indeed. I still kept my original data generation part, because using
sklearn/utils/estimator_checks.py
Outdated diff:

```python
X_train, X_test, y_train, _ = train_test_split(X, y)
# some estimators expect a square matrix
X_train = pairwise_estimator_convert_X(X_train, estimator)
X_test = pairwise_estimator_convert_X(X_train, estimator)
```

Suggested fix:

```python
X_test = pairwise_estimator_convert_X(X_test, estimator)
```
Maybe we could improve the test by adding more diverse samples in
Yes, I agree that ideally the tests should be more thorough. In practice though, I found it a bit tricky to come up with a dataset generation that would work for all estimators. In a lot of cases some would not converge or would not reach a consensus, some expect non-negative data, etc. Maybe that's just an indication that those tests aren't really appropriate and that we should resort to a more in-depth approach like in #4841.
I think they are appropriate, we just need to have estimator tags to check which data the estimators want ;)
sklearn/utils/estimator_checks.py
Outdated diff:

```python
# Fit for the first time
estimator.fit(X_train, y_train)

result = dict()
```
sklearn/utils/estimator_checks.py
Outdated diff:

```python
                                     random_state=rng)
# some estimators expect a square matrix
X_train = pairwise_estimator_convert_X(X_train, estimator)
X_test = pairwise_estimator_convert_X(X_test, estimator)
```
This can't be right. X_test for pairwise needs to be of shape (test samples, train samples)
We're using test_size=.5 so it works...
But otherwise is there a builtin utility like pairwise_estimator_convert_X that I could use instead?
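To illustrate the shape concern, here is a small sketch using a precomputed kernel directly (rbf_kernel and SVC, chosen as an arbitrary example; this is not the pairwise_estimator_convert_X helper from estimator_checks):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# With a precomputed kernel, each row of the matrix passed to predict()
# holds similarities to the *training* samples, so the test matrix has
# shape (n_test, n_train) and is only square when the split is 50/50.
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(10, 3), rng.rand(4, 3)
y_train = np.array([0, 1] * 5)

K_train = rbf_kernel(X_train, X_train)  # (10, 10): square
K_test = rbf_kernel(X_test, X_train)    # (4, 10): test rows vs train columns

est = SVC(kernel='precomputed').fit(K_train, y_train)
assert est.predict(K_test).shape == (4,)
```

With test_size=.5 both matrices happen to be square, which is why passing a square X_test works in the current test even though the shapes only coincide by construction.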
jnothman left a comment:
Please add a what's new entry under "changes to estimator checks" (or whatever the heading was) in v0.20.
Thanks @NicolasHug!
…arn#12328)" This reverts commit beafb49.
Reference Issues/PRs
Follows #12305 (comment)
What does this implement/fix? Explain your changes.
This PR checks that fit() is idempotent for all non-meta estimators, that is, est.fit(X) is equivalent to est.fit(X).fit(X).

Any other comments?
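The property being checked can be demonstrated on a single estimator. A minimal sketch, using LinearRegression as an arbitrary example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# fit() is idempotent: fitting twice on the same data should yield the
# same model as fitting once.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

once = LinearRegression().fit(X, y)
twice = LinearRegression().fit(X, y).fit(X, y)

assert np.allclose(once.coef_, twice.coef_)
assert np.allclose(once.predict(X), twice.predict(X))
```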