Skip to content

[MRG+1] Stronger tests for variable importances#5261

Merged
glouppe merged 5 commits intoscikit-learn:masterfrom
glouppe:check-importances
Sep 14, 2015
Merged

[MRG+1] Stronger tests for variable importances#5261
glouppe merged 5 commits intoscikit-learn:masterfrom
glouppe:check-importances

Conversation

@glouppe
Copy link
Copy Markdown
Contributor

@glouppe glouppe commented Sep 12, 2015

This PR adds stronger tests for variable importances in forests, including:

  • checks for all forests and all criteria (as proposed earlier by @arjoly);
  • checks for invariance with respect to sample weight scaling;
  • checks for the convergence of variable importances of totally randomized trees towards their true values.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I disagree, but what is the reason for this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to make the tree construction more stable by automatically rescaling the sample weights. Nothing worked with much success (it just moved the issue to someplace else...), but this test failed a few times because of this. Using more samples made that test more stable.

@jmschrei
Copy link
Copy Markdown
Member

I've made some comments, but I support this.

@glouppe
Copy link
Copy Markdown
Contributor Author

glouppe commented Sep 12, 2015

Thanks for your review @jmschrei ! I believe I addressed all your suggestions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for including my suggestion. I apologize for being pedantic, but I meant a scale of more like [1e-8, 1e-4, 1e-1, 1e1, 1e4, 1e8]. Do you think this test is sufficient?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, which such larger/smaller scale factors, tests dont pass anymore. Numerical discrepancies creep in, leading to slightly different impurity values... I am not sure we can do anything about that. (My idea was to internally scale down sample_weight by 1. / sample_weight.max(), but this just moves the numerical issues somewhere else...)

So for me this test is more a safeguard that nothing obviously wrong is happening, rather than a bulletproof test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. Thanks for checking!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but 1e-4 would still work, right?

@jmschrei
Copy link
Copy Markdown
Member

Other than my one comment, LGTM.

@jmschrei
Copy link
Copy Markdown
Member

With the last comment resolved, this LGTM, +1.

@glouppe glouppe changed the title [MRG] Stronger tests for variable importances [MRG+1] Stronger tests for variable importances Sep 12, 2015
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line confuses me.
X_new.shape[1] == 3
So he first expression evaluates to True, which is cast to 1, which is less than X.shape[1] == 10
So the assert would hold even if X_new.shape[1] == 11. Actually it succeeds for all values of X_new.shape[1] as long as X.shape[1] > 1

I guess remove the 0 <?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops :) This one has been there for long without anybody noticing

@amueller
Copy link
Copy Markdown
Member

thanks for more tests :) looks good though I didn't check the math in the theoretical importances.

@glouppe
Copy link
Copy Markdown
Contributor Author

glouppe commented Sep 13, 2015

@amueller Thanks for the comments! I fixed those

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use assert_less over assert_array_almost_equal?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it easier to control the quality of the approximation, rather than having all single importance values match up to some digit (but doing this, I only require their mean to do so).

@arjoly
Copy link
Copy Markdown
Member

arjoly commented Sep 13, 2015

I would add a test to check the parallel computation of importances.
Something like

importance = est.feature_importance_
est.set_params(n_jobs=2)
importance_parrallel =  est.feature_importance_
assert_array_equal(importance, importance_parrallel)

@arjoly
Copy link
Copy Markdown
Member

arjoly commented Sep 13, 2015

Otherwise looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants