Skip to content

DOC plot gradient boosting regression changed to diabetes dataset#16400

Merged
ogrisel merged 23 commits intoscikit-learn:masterfrom
maikia:boston_plot_gradient_boosting_regression
Mar 19, 2020
Merged

DOC plot gradient boosting regression changed to diabetes dataset#16400
ogrisel merged 23 commits intoscikit-learn:masterfrom
maikia:boston_plot_gradient_boosting_regression

Conversation

@maikia
Copy link
Copy Markdown
Contributor

@maikia maikia commented Feb 6, 2020

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Towards #16155

Exchanged the Boston dataset for Breast cancer dataset (although it is classification dataset it does not matter in this example)

Improved the comments and the layout of the example

The figure
Before:
before

After:
after

Any other comments?

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @maika. Overall LGTM. Here are a few suggestions.

One could also add the permutation feature importance to the plot to address the last remaining point of #14528. Feel free to do this as part of this PR or keep it for a later PR (as you prefer).

@maikia
Copy link
Copy Markdown
Contributor Author

maikia commented Feb 17, 2020

Thanks for the contribution @maika. Overall LGTM. Here are a few suggestions.

One could also add the permutation feature importance to the plot to address the last remaining point of #14528. Feel free to do this as part of this PR or keep it for a later PR (as you prefer).

Thanks @ogrisel
I will keep the permutation feature importance for the next PR once this one is accepted :-)

@maikia maikia requested a review from thomasjpfan February 19, 2020 09:44
Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @maikia.

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Feb 19, 2020

For the next PR it would be interesting to:

  • add a short paragraph to suggest to use HistGradientBoostingClassifier instead of GradientBoostingClassifier for larger datasets (e.g. more than 10000 observations);
  • compare results of impurity-based feature importance, permutation feature importance and shap-based feature importances: with shap.summary_plot(shap_values_test, X_test, plot_type="bar").

This requires an external dependency on shap but I think this is fine in examples.

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we have a problem:

The Breast cancer dataset is a binary classification task. Target is 0-1. So instead it would make more sense to use a GradientBoostingClassifier.

The problem is that we would need to change the title to "Gradient Boosting Regression Trees for classification" or something. But then the filename (plot_gradient_boosting_regression.py) is misleading but changing it would also change the URL of the example break some links to our documentation...

i am not sure what to do.


mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computing the MSE loss on a classification task is misleading. We should rather use accuracy of ROC AUC. The data is approximately balanced: 62% for the positive class.

@maikia
Copy link
Copy Markdown
Contributor Author

maikia commented Feb 19, 2020

Actually we have a problem:

The Breast cancer dataset is a binary classification task. Target is 0-1. So instead it would make more sense to use a GradientBoostingClassifier.

The problem is that we would need to change the title to "Gradient Boosting Regression Trees for classification" or something. But then the filename (plot_gradient_boosting_regression.py) is misleading but changing it would also change the URL of the example break some links to our documentation...

i am not sure what to do.

How about instead of Cancer dataset we take yet another one..? Diabetes or Ames would be ok? I chose Cancer simply to differentiate a bit and not choose Diabetes each time

@maikia
Copy link
Copy Markdown
Contributor Author

maikia commented Feb 24, 2020

Here how it looks like now with the diabetes dataset:

after_diabetes

Copy link
Copy Markdown
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan changed the title plot gradient boosting regression changed to breast cancer dataset DOC plot gradient boosting regression changed to breast cancer dataset Feb 24, 2020
@maikia maikia changed the title DOC plot gradient boosting regression changed to breast cancer dataset DOC plot gradient boosting regression changed to diabetes dataset Mar 4, 2020
@nilichen
Copy link
Copy Markdown
Contributor

hmm. I'm working on #16023 and plan to update all relevant examples to favor permutation_importance, including this one. May I ask why it is still not merged?

@ogrisel ogrisel merged commit b65c53d into scikit-learn:master Mar 19, 2020
@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Mar 19, 2020

Merged! Thank you @maikia!

@nilichen feel free to go ahead with #16023.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants