
[MRG] TST Unskip test_importances in forest and loop over 64/32 bit for testing#9242

Closed
raghavrv wants to merge 3 commits into scikit-learn:master from raghavrv:fix_feature_importance_test

Conversation

@raghavrv
Member

@raghavrv raghavrv commented Jun 28, 2017

Reference Issue

Attempts to fix #7656

What does this implement/fix? Explain your changes.

Right now I'm unskipping the test and waiting for Travis to throw errors.

@@ -228,18 +231,20 @@ def check_importances(name, criterion, X, y):
assert_less(np.abs(importances - importances_bis).mean(), 0.001)
Member Author


@ogrisel @jnothman All the builds fail with this error message at this line:

AssertionError: 0.0067716825379038909 not less than 0.001

I'm not sure this has anything to do with the dataset per se.

@glouppe @jmschrei is it safe to relax the tolerance of this test to 0.01 so that it passes?

Member


Apparently only the regressors with "mae" fail here; the classifiers and the other regression criteria pass. It's probably the case that MAE is more sensitive to changes in the scale of the weights. I think it's OK to increase the threshold. I did some experiments in an interactive shell, and the first three features (the informative ones) are always more important than the others.

Maybe you can replace this check with the following:

# The forest estimator can detect that only the first 3 features of the dataset
# are informative:
expected_mask = np.zeros(20, dtype=bool)
expected_mask[:3] = True
assert_array_equal(est.feature_importances_ > 0.1, expected_mask)

You could also add a check that rescales X instead of the sample_weight array and verifies that the feature importances are exactly equal to those of the reference model.
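A minimal standalone sketch of that rescaling check, assuming a RandomForestClassifier on a synthetic dataset (the actual test would use the module's shared X and y instead):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset: the first 3 of 10 features are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=0,
                           shuffle=False, random_state=0)

# Reference model on the raw data.
est = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Rescaling X by a positive constant only rescales the split thresholds;
# the tree structures, and hence the importances, should be unchanged.
est_scaled = RandomForestClassifier(n_estimators=10, random_state=0)
est_scaled.fit(3.0 * X, y)

assert np.allclose(est.feature_importances_,
                   est_scaled.feature_importances_)
```

With a fixed random_state the bootstrap samples and candidate feature subsets are identical, so the scaled fit should reproduce the same trees up to threshold values.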

Member Author


Ah okay. In addition to your proposed test, is it okay if I check the criterion and, if it's "mae", change the tolerance to 0.01?

Member


Yes, let's do that.

Member Author


Why not simply

assert np.all(importances[:3] > 0.1)

@raghavrv raghavrv added this to the 0.19 milestone Jun 29, 2017
@raghavrv
Member Author

(I'm marking this a blocker since the issue was marked blocker as well. Please feel free to remove the label if this need not make it into 0.19.)



def check_importances(name, criterion, X, y):
def check_importances(name, criterion, X, y, dtype):
Member


To make the test reports easier to read, please put dtype before X and y. Or even make X and y global variables at the beginning of the module.

Member Author


Sorry, I don't follow. I thought that since these are check_* functions, their parameters show up in the pytest report, which makes failures easier to debug; that's why I included them as parameters rather than looping over them.

Member


It's just that the arrays are long, so the end of the parameter list is truncated. Putting dtype before X would make the report more informative; the actual contents of X and y are not very informative.
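A toy illustration of the ordering point (a hypothetical, stripped-down check_importances; only the signature order matters here):

```python
import numpy as np

# Short scalar parameters first, long arrays last, so a truncated
# test report still shows name, criterion, and dtype.
def check_importances(name, criterion, dtype, X, y):
    X = np.asarray(X, dtype=dtype)
    # ... fit the estimator named by `name` with `criterion` here ...
    return X.dtype

X = np.ones((4, 3))
y = np.zeros(4)
assert check_importances("RandomForestRegressor", "mse",
                         np.float32, X, y) == np.float32
```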

Member Author


Okay thanks

Member

@ogrisel ogrisel left a comment


LGTM once my comments are addressed and CI is green.

@ogrisel
Member

ogrisel commented Jun 30, 2017

The RandomForest + mae tests are really slow (more than 4 s each). It would be great to find a way to speed them up while still having the existing checks pass, for instance by reducing the dataset size (maybe the number of features) or the number of trees in the forest.

@ogrisel
Member

ogrisel commented Jun 30, 2017

But actually, maybe this reveals a performance issue in our code. If so, it would be great to open a new issue to track it and skip the MAE case in this test.

@ogrisel
Member

ogrisel commented Jul 5, 2017

Closing in favor of #9282.

@ogrisel ogrisel closed this Jul 5, 2017