[MRG+2] Add a test for sample weights for estimators #11558
GaelVaroquaux merged 13 commits into scikit-learn:master
Conversation
GaelVaroquaux
left a comment
Looks good, aside from 2 minor comments.
sklearn/utils/estimator_checks.py
Outdated
X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]])
I would force the dtype to be float here.
sklearn/utils/estimator_checks.py
Outdated
| "sample_weight=ones" % name) | ||
| if hasattr(estimator_orig, "transform"): | ||
| X_pred1 = estimator1.transform(X) | ||
| X_pred2 = estimator2.transform(X) |
nitpick: I would call these X_trans1 and X_trans2
Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.
sklearn/utils/estimator_checks.py
Outdated
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
Actually, I would have kept y as an integer. Only X as float.
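Taken together, the two dtype comments amount to something like this (a sketch; the data values are taken from the diff above):

import numpy as np

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.float64)  # X forced to float
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2])  # y left with its default integer dtype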
glemaitre
left a comment
I think it would be great to add an entry in the what's new as well.
sklearn/utils/estimator_checks.py
Outdated
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
It could be worth adding a one-line comment regarding the purpose of the test.
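For example (wording assumed, based on the PR description below):

@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
    # check that estimators yield the same results for
    # sample_weight=None and sample_weight=np.ones(n_samples)
    ...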
sklearn/utils/estimator_checks.py
Outdated
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
You can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state parameter.
sklearn/utils/estimator_checks.py
Outdated
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)
so basically:

from sklearn.utils.testing import set_random_state
set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)
Thanks! Used it as it is.
sklearn/utils/estimator_checks.py
Outdated
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

if hasattr(estimator_orig, "predict"):
make a loop here:

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...
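Spelled out with getattr, the suggested loop would look roughly like this (a sketch; variable names follow the diffs above, and assert_allclose is assumed to be in scope as elsewhere in the check):

for method in ("predict", "transform"):
    if hasattr(estimator_orig, method):
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)
        assert_allclose(X_pred1, X_pred2)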
sklearn/utils/estimator_checks.py
Outdated
X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
    assert_allclose(X_pred1, X_pred2, rtol=0.5)
maybe you might want to use assert_allclose_dense_sparse if the output could be sparse.
Pass the err_msg to assert_allclose_dense_sparse, avoiding the try ... except ..., so the proper error is raised.
I don't think we expect the output to be sparse.
err_msg is done!
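In other words, something like this (a sketch; the error wording is assumed from the diff near the top of the thread):

assert_allclose(X_pred1, X_pred2,
                err_msg="For %s sample_weight=None is not equivalent to "
                        "sample_weight=ones" % name)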
Tests are passing! Congratulations. But you had a PEP8 failure; I fixed it on your branch. We need to wait for the tests to run again.
@glemaitre: can you reassess your review? We think that this is good to go.
I can't understand why we need to skip "KMeans" and "MiniBatchKMeans". It seems that the test passes locally on my PC with these two classes. Also, at least we have things like …
sklearn/utils/estimator_checks.py
Outdated
# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise) and
Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).
Done! Basically, the data we are testing is not pairwise. Estimator tests with _pairwise fail otherwise.
Indeed, I think that it's good to add comments. I canceled the corresponding Travis builds to save time in the queue.
sklearn/utils/estimator_checks.py
Outdated
             and estimator_orig._pairwise) and
        name not in ["KMeans", "MiniBatchKMeans"]):
    # We skip pairwise because the data is not pairwise
    # KMeans and MiniBatchKMeans were unstable; hence skipped.
I'm confused about it. Are they still unstable under a fixed random_state? If so, is there a bug?
(Sorry if I made a mistake :) Go ahead if you have enough confidence)
This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder if estimator.fit(..., random_state=0) misses something that set_random_state handles. I can include those estimators in the test. @GaelVaroquaux WDYT?
It is weird to pass random_state in fit. The state should be passed when instantiating the object, or by setting the parameter, which is the aim of set_random_state.
Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that the random state is not actually used when passed to fit.
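To illustrate the point (a sketch; KMeans is used here purely as an example):

from sklearn.cluster import KMeans
from sklearn.utils.testing import set_random_state

km = KMeans(n_clusters=2)
set_random_state(km, random_state=0)  # sets the random_state __init__ parameter
# equivalent to KMeans(n_clusters=2, random_state=0); KMeans.fit has no
# random_state argument, so the seed cannot be supplied at fit time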
@qinhanmin2014: "KMeans" and "MiniBatchKMeans" are skipped because they were unstable.

I would expect estimators in scikit-learn to be deterministic with the same input and a fixed random_state, or am I wrong?
Oh, you added back KMeans; cool, so this is fine then.
qinhanmin2014
left a comment
LGTM if CIs are green. Thanks @sergulaydore
OK, Travis is green. Merging. Hurray!

Why do we need an rtol as high as 0.5?

This doesn't assert that weighting makes any difference ever, does it?

@jnothman Actually we don't need rtol as high as 0.5; I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend, I can submit another PR with random weights.

+1 to use the default rtol; @sergulaydore, please submit a PR, thanks.

Done! Please see #11621.
Reference Issues/PRs
Fixes the invariant test for sample weights as mentioned in issue #11316 (Refactor tests for sample weights).
What is new?
This is a generic test for estimators that makes sure the sample weights yield consistent results.
What does this implement/fix? Explain your changes.
The test checks that the outputs of the estimators are the same for
sample_weight=None and sample_weight=np.ones().
Any other comments?
Pairwise methods are skipped as they require pairwise data.
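Putting the pieces from the review together, a minimal sketch of the check (illustrative only, not the exact merged code; the data and error message follow the diffs above):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_sample_weight_invariance(name, estimator_orig):
    # check that sample_weight=None and sample_weight=np.ones(n_samples)
    # yield the same fitted model
    X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
                  [2, 1], [2, 1], [2, 1], [2, 1],
                  [3, 3], [3, 3], [3, 3], [3, 3],
                  [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.float64)
    y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
                  1, 1, 1, 1, 2, 2, 2, 2])
    estimator1 = clone(estimator_orig)
    estimator2 = clone(estimator_orig)
    set_random_state(estimator1, random_state=0)
    set_random_state(estimator2, random_state=0)
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
    estimator2.fit(X, y=y, sample_weight=None)
    for method in ("predict", "transform"):
        if hasattr(estimator_orig, method):
            X_pred1 = getattr(estimator1, method)(X)
            X_pred2 = getattr(estimator2, method)(X)
            assert_allclose(X_pred1, X_pred2,
                            err_msg="For %s sample_weight=None is not"
                                    " equivalent to sample_weight=ones" % name)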