[MRG] TST Unskip test_importances in forest and loop over 64/32 bit for testing#9242
[MRG] TST Unskip test_importances in forest and loop over 64/32 bit for testing#9242raghavrv wants to merge 3 commits intoscikit-learn:masterfrom
Conversation
| @@ -228,18 +231,20 @@ def check_importances(name, criterion, X, y): | |||
| assert_less(np.abs(importances - importances_bis).mean(), 0.001) | |||
There was a problem hiding this comment.
There was a problem hiding this comment.
Apparently only the regressors with "mae" fail here. Classifiers and other regression scores pass. It's probably the case that mae is more sensitive to changes in scale of the weights. I think it's ok to increase the threshold. I did some experiments in an interactive shell and the first three features (the informative onces) are always more important that the others.
Maybe you can replace this check by the following:
# The forest estimator can detect that only the first 3 features of the dataset
# are informative:
expected_mask = np.zeros(20, dtype=bool)
expected_mask[:3] = True
assert_array_equal(est.feature_importances_ > 0.1, expected_mask)You call also add a check to rescale X instead of the sample_weight array and check that the features importances are exactly equal to the reference model.
There was a problem hiding this comment.
Ah okay. In addition to your proposed test, is it okay if I check the criterion and iff it's mae, change the tolerance to 0.01?
There was a problem hiding this comment.
Why not simply
assert np.all(importances[:3] > 0.1)|
(I'm marking this blocker as the issue was marked blocker as well. Please feel free to remove if this need not make into 0.19) |
|
|
||
|
|
||
| def check_importances(name, criterion, X, y): | ||
| def check_importances(name, criterion, X, y, dtype): |
There was a problem hiding this comment.
To make the test reports easier to read, please put dtype before X and y. Or even put X, y as global variables at the beginning of the module.
There was a problem hiding this comment.
Sorry I don't follow you. I thought since there were check_* functions, the parameters of the functions show up in pytest and it's easier to debug? Which is why I included them as parameters rather than looping them.
There was a problem hiding this comment.
It's just that the arrays are long and the end of the parameters are truncated. Putting dtype before X would render the report more informative. The actual content of X and y is not very informative.
ogrisel
left a comment
There was a problem hiding this comment.
LGTM once my comments are addressed and CI is green.
|
The RandomForest + mae test are really slow (more than 4s). It would be great to find a way to speed them up while still having the existing checks pass. For instance by reducing the dataset size (maybe the number of features), or the number of trees in the forest. |
|
But actually maybe this reveals a performance issue in our code. If so it would be great to open a new issue to track it and skip the MAE case in this test. |
|
Closing in favor of #9282. |
Reference Issue
Attempts to fix #7656
What does this implement/fix? Explain your changes.
Right now I'm unskipping the test and waiting for travis to throw errors