[MRG+1] Fixes #7578 added check_decision_proba_consistency in estimator_checks #8253
lesteve merged 32 commits into scikit-learn:master
Conversation
sklearn/utils/estimator_checks.py
Outdated
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]
        # TODO some complication with -1 label
        and name not in ["DecisionTreeClassifier",
You need to put the `and` on the previous line to make flake8 happy. The error from Travis is (it gives you a hint as to what to do):

    ./sklearn/utils/estimator_checks.py:119:9: W503 line break before binary operator
            and name not in ["DecisionTreeClassifier",
sklearn/utils/estimator_checks.py
Outdated
    @ignore_warnings(category=DeprecationWarning)
    def check_rank_corr(name, Estimator):
perhaps call it check_decision_proba_consistency.
sklearn/utils/estimator_checks.py
Outdated
        predict_proba methods has outputs with perfect rank correlation.
        """

    X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
Why are we using multilabel data? Why not just binary?
sklearn/utils/estimator_checks.py
Outdated
    try:
        classif = OneVsRestClassifier(estimator)
        classif.fit(X, Y)
        a = classif.predict_proba([i for i in range(20)])
Usually the input would be 2d. Why is it 1d? For test data, you can just generate something random and uniform, or draw from a similar distribution to the training data.
sklearn/utils/estimator_checks.py
Outdated
    if hasattr(estimator, "predict_proba"):
        try:
            classif = OneVsRestClassifier(estimator)
sklearn/utils/estimator_checks.py
Outdated
    a = classif.predict_proba([i for i in range(20)])
    b = classif.decision_function([i for i in range(20)])
    assert_equal(
        rankdata(a, method='average'), rankdata(b, method='average'))
The method shouldn't matter as long as tied values have tied ranks. But if we're working with non-binary classification, we need to do this comparison column-wise. Use assert_array_equal rather than assert_equal.
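As a sketch of the column-wise comparison being suggested (the arrays here are illustrative stand-ins, not output from a real estimator):

```python
import numpy as np
from scipy.stats import rankdata

# Stand-ins for predict_proba and decision_function outputs of shape
# (n_samples, n_classes); each column pair is rank-consistent.
a = np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]])
b = np.array([[-2.0, 2.0], [-0.5, 0.5], [1.5, -1.5]])

# Compare ranks one column at a time; assert_array_equal checks elementwise.
for col in range(a.shape[1]):
    np.testing.assert_array_equal(rankdata(a[:, col], method='average'),
                                  rankdata(b[:, col], method='average'))
```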
sklearn/utils/estimator_checks.py
Outdated
        assert_equal(
            rankdata(a, method='average'), rankdata(b, method='average'))

    except ValueError:
What is this meant to catch? The try block should span the smallest scope we want to exclude; otherwise this test can pass even when all estimators raise a ValueError, because the test itself is broken.
sklearn/utils/estimator_checks.py
Outdated
    if hasattr(estimator, "decision_function"):

        if hasattr(estimator, "predict_proba"):
@jnothman Travis CI fails even though no errors were found.

Yeah, I have opened an issue on Travis about that (this is under investigation):
You seem to have confused a number of different patches in your PR. You should be using a branch in your fork to avoid this, not master.

You can keep this one on master, but you need to revert or otherwise remove your changes pertaining to other issues.

@jnothman anything else needed?
sklearn/utils/estimator_checks.py
Outdated
    CROSS_DECOMPOSITION = ['PLSCanonical', 'PLSRegression', 'CCA', 'PLSSVD']
    MULTI_OUTPUT = ['CCA', 'DecisionTreeRegressor', 'ElasticNet',
                    'ExtraTreeRegressor', 'ExtraTreesRegressor', 'GaussianProcess',
    MULTI_OUTPUT = ['CCA', 'DecisionTreeClassifier', 'DecisionTreeRegressor',
all others here are regressors. What makes you sure it's appropriate to include multioutput classifiers here?
In the function check_supervised_y_2d, the line inside the warning section, `estimator.fit(X, y[:, np.newaxis])`, does not give any warnings for the classifiers I included, so I added them to the MULTI_OUTPUT list. Otherwise it raises an error that `expected 1 DataConversionWarning, got:`. I checked the scikit-learn docs for DecisionTreeClassifier; they say y can accept [n_samples, n_outputs].
Is this change relevant to the rest of the PR? Perhaps it should be a separate PR.
Nosetests fail if I do not include them. Can one PR reference two issues? Otherwise this will not pass. Maybe I will open a new issue and reference both it and this issue from this PR. What do you say?
Could you be more specific what fails if you do not include them?
sklearn/utils/estimator_checks.py
Outdated
    yield check_classifiers_train
    yield check_classifiers_regression_target
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]
    if (name not in ["MultinomialNB", "LabelPropagation", "LabelSpreading"]):
please don't add these parentheses
sklearn/utils/estimator_checks.py
Outdated
    # TODO some complication with -1 label
    and name not in ["DecisionTreeClassifier",
                     "ExtraTreeClassifier"]):
    if (name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]):
please don't add these parentheses
build_tools/travis/test_script.sh
Outdated
    set -e
This was added by mistake during the days when Travis went down, when I foolishly tried to make Travis work in order to get my PR to pass tests, as it was my first one. Will correct it.
sklearn/utils/estimator_checks.py
Outdated
    return (p.name != 'self'
            and p.kind != p.VAR_KEYWORD
            and p.kind != p.VAR_POSITIONAL)
    return (p.name != 'self' and p.kind != p.VAR_KEYWORD and
travis.log
Outdated
    @@ -0,0 +1,87 @@
    Command line:
sklearn/utils/estimator_checks.py
Outdated
    @ignore_warnings(category=DeprecationWarning)
    def check_decision_proba_consistency(name, Estimator):
        """
We don't usually add docstrings to checks, because nose doesn't play nicely.
sklearn/utils/estimator_checks.py
Outdated
        predict_proba methods has outputs with perfect rank correlation.
        """
    rnd = np.random.RandomState(0)
    X_train = (3*rnd.uniform(size=(10, 4))).astype(int)
we usually use integer features, or binary, in common tests in case estimators can't deal with real-valued features.
Absolutely I added .astype(int).
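A numpy-only sketch of that data generation, mirroring the line under review (the sizes are illustrative):

```python
import numpy as np

rnd = np.random.RandomState(0)
# uniform draws in [0, 3) truncated to integers: every feature is 0, 1 or 2,
# so estimators that cannot handle real-valued features still work
X_train = (3 * rnd.uniform(size=(10, 4))).astype(int)
```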
sklearn/utils/estimator_checks.py
Outdated
    if (hasattr(estimator, "decision_function") and
            hasattr(estimator, "predict_proba")):

        estimator.fit(X_train, y)
Nothing seems to be entering this case: I've modified it to say assert False, but nothing fails in sklearn/tests/test_common.py.
Ah. You've put this in as a regressor check.
sklearn/utils/estimator_checks.py
Outdated
    yield check_regressors_no_decision_function
    yield check_supervised_y_2d
    yield check_supervised_y_no_nan
    yield check_decision_proba_consistency
This should be in ...classifier_checks not regressors
Awesome, that's great, I get what you are trying to say: likelihood(point belongs to A) / likelihood(point belongs to B). If both are around 0.5, e.g. 0.6/0.4, the point can belong to either side, so the values won't peak so much. Thanks a lot. It should work, definitely. I will make the changes and update.
@jnothman I did not exactly take the points in the middle; I just brought the cluster centres nearer so they kind of overlap. Another idea was to draw ellipses around both blobs and take all the points outside them as the test set, but this worked. Is it fine or should I improvise?
sklearn/utils/estimator_checks.py
Outdated
    # TODO some complication with -1 label
    and name not in ["DecisionTreeClassifier",
                     "ExtraTreeClassifier"]):
    if name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]:
Why did you change this from `and` to a separate `if`? That's what creates the errors in your screenshot.
sklearn/utils/estimator_checks.py
Outdated
    centers = [(2, 2), (4, 4)]
    X, y = make_blobs(n_samples=100, random_state=0, n_features=4,
                      centers=centers, cluster_std=1.0, shuffle=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
With this approach, the probabilities are again going to be very peaked around 0 and 1, since the blobs are more-or-less linearly separable, encouraging numerical precision errors etc. For test, I'd just use np.random.randn() + 3 or something.
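A numpy-only sketch of that suggestion, assuming training blobs around (2, 2) and (4, 4) as in the diff; the exact constants and sample counts are illustrative:

```python
import numpy as np

rnd = np.random.RandomState(0)
# two overlapping training blobs, one per class
X_train = np.vstack([rnd.randn(50, 2) + 2, rnd.randn(50, 2) + 4])
y_train = np.array([0] * 50 + [1] * 50)

# test points near the midpoint (3, 3), where class probabilities stay
# away from 0 and 1 and rank comparisons are numerically meaningful
X_test = rnd.randn(20, 2) + 3
```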
sklearn/utils/estimator_checks.py
Outdated
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                        random_state=0)

    X_test = np.random.randn(20, 2)+4
Thanks a lot for all the reviews on the PR. Learnt a lot. Making changes.
Please add an entry in what's new. Put it in API changes to say "Estimators with both x and y are now required ..."
Got my Network exam now; will surely do it by evening.
@shubham0704 please use "Fix #issueNumber" in your PR description; this way the associated issue gets closed automatically when the PR is merged. For more details, look at this. I have edited your description, but please remember to do it next time.
Sure. Thanks @lesteve.
[RFC] - request for close :)
    return (p.name != 'self'
            and p.kind != p.VAR_KEYWORD
            and p.kind != p.VAR_POSITIONAL)
    return (p.name != 'self' and
For next time, try not to change things that are not related to your PR. This adds noise into the diff and makes it harder for the review to be efficient.
LGTM, merging, thanks a lot!
…y in estimator_checks (scikit-learn#8253)
Reference Issue
Fix #7578
What does this implement/fix? Explain your changes.
It adds a test function that checks whether the outputs of predict_proba and decision_function are perfectly rank-correlated.
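A hedged sketch of the property this check enforces, using LogisticRegression purely as an example of an estimator that exposes both methods (the data generation here is illustrative, not the check's actual code):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

rnd = np.random.RandomState(0)
# two overlapping blobs for training, test points near the midpoint
X_train = np.vstack([rnd.randn(50, 2) + 2, rnd.randn(50, 2) + 4])
y_train = np.array([0] * 50 + [1] * 50)
X_test = rnd.randn(20, 2) + 3

clf = LogisticRegression().fit(X_train, y_train)
a = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
b = clf.decision_function(X_test)    # signed distance to the hyperplane

# predict_proba is a monotonic (sigmoid) transform of decision_function,
# so both must rank the test samples identically
assert np.array_equal(rankdata(a), rankdata(b))
```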
Any other comments?
I need to understand the testing part of this function. I have done the pep8 linting and pyflakes, but received one error from the nose check stating set_testing_parameters() takes exactly 1 value (0 given) and error=1. Also, where is the best place to yield this function? I did not make that change because I was unsure.