DOC Rewrite user-guide to clarify feature_importance_ are impurity based#16237
rth merged 7 commits into scikit-learn:master from
Conversation
NicolasHug
left a comment
Thanks @ysunmi0427 , a few comments
@NicolasHug I fixed the line you mentioned. Thank you for your reviews!
7445d91 to 170ca6b
rth
left a comment
Thanks, a few comments otherwise LGTM.
@rth I fixed the line you mentioned. Thanks for your reviews!
doc/modules/ensemble.rst (Outdated)

> to the prediction function.
>
> The impurity-based feature importance suffers from being computed
> on statistics derived from the training dataset.
I think the issue is not only that the statistics are derived from the training set, but also the bias toward high-cardinality features.
@glemaitre I fixed the line you mentioned to note that impurity-based feature importance favors high-cardinality features (typically numerical features).
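As a quick illustration of the bias discussed above (this sketch is not part of the PR itself): a purely random column with many unique values can still pick up a non-trivial impurity-based importance, because the tree can keep splitting on it to overfit the training set. The dataset and sizes below are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=1, random_state=0)
# Append a random feature with many unique values (high cardinality);
# it carries no information about y.
noise = rng.randint(0, 300, size=(X.shape[0], 1)).astype(float)
X = np.hstack([X, noise])

forest = RandomForestClassifier(random_state=0).fit(X, y)
# The last column is pure noise, yet its impurity-based importance
# is typically well above zero.
print(forest.feature_importances_)
```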
rth
left a comment
Thanks @ysunmi0427, LGTM.
impurity-based feature importance favors high cardinality features (typically numerical features).
Pushed a quick fix removing the last part, because high-cardinality features are usually categorical, not numeric (at least before an encoding is applied to them).
Will merge when CI is green.
Reference Issues/PRs
Closes #14528. See also #14530
What does this implement/fix? Explain your changes.
I clarify the feature importance in every tree method by mentioning that it is impurity-based. In the user guide, examples, and docstrings, it is now clear that feature importance comes from the impurity concept. At some point, I mention the Permutation Importance vs Random Forest Feature Importance (MDI) example to show that impurity-based importance is not the only choice and that we have a good alternative.
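For context, a minimal sketch of the alternative mentioned above: permutation importance is computed on held-out data (via `sklearn.inspection.permutation_importance`), whereas the `feature_importances_` attribute is impurity-based (MDI) and derived from training-set statistics only. The dataset choice here is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importance (MDI), computed from training-set splits.
mdi = forest.feature_importances_

# Permutation importance: drop in test-set score when each feature's
# values are shuffled, so it reflects generalization, not training fit.
result = permutation_importance(forest, X_test, y_test, n_repeats=5,
                                random_state=0)
print(mdi.argmax(), result.importances_mean.argmax())
```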