DOC Rewrite user-guide to clarify feature_importance_ are impurity based#16237

Merged
rth merged 7 commits into scikit-learn:master from ysunmi0427:doc-update
Feb 1, 2020

Conversation

@ysunmi0427
Contributor

Reference Issues/PRs

Closes #14528. See also #14530

What does this implement/fix? Explain your changes.

I clarify feature importance in every tree method by mentioning that it is impurity-based. In the user guide, the examples, and the docstrings, it is now clear that feature importance comes from the impurity concept. In a few places, I point to the "Permutation Importance vs Random Forest Feature Importance (MDI)" example to show that impurity-based importance is not the only choice and that we have a good alternative.
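To illustrate the distinction the doc changes are making, here is a minimal sketch (dataset and parameters are illustrative, not from this PR) comparing the impurity-based `feature_importances_` attribute with `permutation_importance` computed on held-out data:

```python
# Sketch: impurity-based (MDI) importances vs permutation importances.
# The model, dataset, and hyperparameters are arbitrary examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Impurity-based importances (Mean Decrease in Impurity), computed from
# statistics derived from the training set only.
mdi = clf.feature_importances_

# Permutation importances, computed on a held-out test set, do not
# suffer from the training-set bias discussed in this PR.
perm = permutation_importance(clf, X_test, y_test, n_repeats=5,
                              random_state=0)
print(mdi)
print(perm.importances_mean)
```

The two rankings often agree on strong features but can diverge for features the trees overfit on, which is exactly why the user guide now names the importance as impurity-based.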

Member

@NicolasHug NicolasHug left a comment


Thanks @ysunmi0427, a few comments

@ysunmi0427
Contributor Author

@NicolasHug I fixed the lines you mentioned. Thank you for your reviews!

Member

@NicolasHug NicolasHug left a comment


Nit but LGTM

Member

@rth rth left a comment


Thanks, a few comments, otherwise LGTM.

@rth rth changed the title Rewrite user-guide to clarify feature_importance_ are impurity based DOC Rewrite user-guide to clarify feature_importance_ are impurity based Jan 27, 2020
@ysunmi0427
Copy link
Copy Markdown
Contributor Author

@rth I fixed the lines you mentioned. Thanks for your reviews!

to the prediction function.

The impurity-based feature importance suffers from being computed
on statistics derived from the training dataset.
Member


I think that the issue is not only about deriving statistics from the training set, but also the bias toward high-cardinality features.
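The high-cardinality bias mentioned here can be sketched as follows (a toy demonstration with made-up data, not part of the PR): an uninformative continuous column, having many possible split thresholds, typically receives a larger impurity-based importance than an equally uninformative binary column.

```python
# Sketch of the bias of MDI toward high-cardinality features.
# Both appended columns are pure noise; the continuous one (many unique
# values) tends to collect more impurity-based importance than the
# binary one (two unique values).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
low_card = rng.randint(0, 2, size=(300, 1))   # binary noise column
high_card = rng.randn(300, 1)                 # continuous noise column
X_aug = np.hstack([X, low_card, high_card])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_aug, y)
mdi = clf.feature_importances_
print(mdi)  # last entry (continuous noise) usually beats the binary one
```

Permutation importance on held-out data largely avoids this artifact, since shuffling either noise column leaves test performance essentially unchanged.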

Member

@glemaitre glemaitre left a comment


LGTM otherwise

@rth rth added the Needs work label Jan 29, 2020
@ysunmi0427
Contributor Author

@glemaitre I fixed the line you mentioned, noting that impurity-based feature importance favors high-cardinality features (typically numerical features).

Member

@rth rth left a comment


Thanks @ysunmi0427, LGTM.

impurity-based feature importance favors high cardinality features (typically numerical features).

Pushed a quick fix removing the last part, because high cardinality features are usually categorical, not numeric (at least before an encoding is applied to them).

Will merge when CI is green.

@rth rth merged commit 4a18796 into scikit-learn:master Feb 1, 2020
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

Development

Successfully merging this pull request may close these issues.

Rewrite user-guide to clarify feature_importances_ are impurity based

5 participants