DOC Replace boston in ensemble.rst (#16876)
Conversation
May we have each section in its own PR? It would make it easier to review, which would result in faster merging.

Thanks @thomasjpfan. I have separated the other changed files into their own PRs (#16896, #16895 and #16894).
thomasjpfan left a comment
Thank you for the PR @lucyleeow !
```
>>> # Training classifiers
>>> reg1 = GradientBoostingRegressor(random_state=1, n_estimators=10)
>>> reg2 = RandomForestRegressor(random_state=1, n_estimators=10)
```
I am +0.5 on this change. I think the original intention of setting n_estimators was to reduce the doctest runtime.
No problem, I will amend here and in the corresponding example (#16895 (comment))
I am so sorry for being unclear.
I do prefer removing n_estimators here as long as the increase in doctest runtime is not too great. In this case, the fit runtime increases by ~1 second, which is acceptable.
Okay I will change it back!
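For reference, the runtime tradeoff being discussed can be checked directly. A minimal sketch (timings are machine-dependent; the diabetes dataset is used here as in the updated docs):

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)

# Compare fit time with n_estimators=10 against the default (100)
timings = {}
for n in (10, 100):
    start = time.perf_counter()
    GradientBoostingRegressor(random_state=1, n_estimators=n).fit(X, y)
    timings[n] = time.perf_counter() - start
print(timings)
```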
doc/modules/ensemble.rst
Outdated
```
- >>> from sklearn.datasets import load_boston
- >>> X, y = load_boston(return_X_y=True)
+ >>> from sklearn.datasets import fetch_california_housing
+ >>> X, y = fetch_california_housing(return_X_y=True)
```
Is there an issue with using load_diabetes here?
I should have mentioned this. Not really, but with the diabetes dataset the first 'single' stacking regressor reg gives an R2 score of 0.55, whereas the multiple stacking example (multi_layer_regressor) gives an R2 score of 0.52. Since we state that 'a stacking predictor predicts as good as the best predictor of the base layer', I thought this might be a bad example, or require explaining why the score was a little worse. Though the California housing dataset also produced a slightly lower score with multi_layer_regressor (0.77 vs 0.78).
Hmm for the user guide we can adapt the following:

```python
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42)
)
```

As for the comment about in practice, I think it is suggesting this:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42)

predictors = {'stacker': reg}
for name, est in estimators:
    predictors[name] = est

scores = {}
for name, est in predictors.items():
    y_pred = est.fit(X_train, y_train).predict(X_test)
    scores[name] = r2_score(y_test, y_pred)
```

Which results in:

```python
scores
# {'stacker': 0.5533268189282505,
#  'ridge': 0.49182837290627335,
#  'lasso': 0.4866763449729544,
#  'knr': 0.44659346214225026}
```
> As for the comment about in practice, I think it is suggesting this:

Thanks for clearing that up @thomasjpfan. Would you also expect the result from 'multiple' stacked layers, e.g. multi_layer_regressor here:

```python
final_layer = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)),
                ('gbrt', GradientBoostingRegressor(random_state=42))],
    final_estimator=RidgeCV()
)
multi_layer_regressor = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('lasso', LassoCV(random_state=42)),
                ('knr', KNeighborsRegressor())],
    final_estimator=final_layer
)
```

to be similar to/at least as good as reg here:

```python
estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42))
```

or is that more difficult to determine because the final_estimator differs between reg and multi_layer_regressor?

With the diabetes dataset and the switch to KNeighborsRegressor, reg gives an R2 score of 0.55 and multi_layer_regressor gives an R2 score of 0.51.
From my past experience with stacking, there are times when stacking does not improve the results, as you have demonstrated.
No problem, will amend as you suggest!
Should we check for overfitting and tune the model parameters here?
For this specific case, the user guide is trying to demonstrate the stacking API and I would not want to add tuning into the mix. (Specifically, I would not want to introduce GridSearchCV here.)
It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.
> It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.

Sorry, that is what I meant! By 'here' I meant for this PR.
Will do, thanks for your advice!
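For what it's worth, a simple way to check for the overfitting mentioned above is to compare train and test R2. A sketch using one of the base estimators (not the exact check run for this PR):

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

est = KNeighborsRegressor().fit(X_train, y_train)
train_r2 = r2_score(y_train, est.predict(X_train))
test_r2 = r2_score(y_test, est.predict(X_test))
# A large train/test gap suggests overfitting
print(train_r2, test_r2)
```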
@thomasjpfan I've tuned KNeighborsRegressor on the diabetes dataset, and GradientBoostingRegressor and RandomForestRegressor on the output of the stacked estimators. The R2 value hasn't really changed, but it isn't overfitting now.
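A minimal sketch of that kind of local tuning, assuming GridSearchCV over n_neighbors (the actual grid and values used for this PR are not recorded in the thread):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)

# Illustrative parameter grid; not the one actually used in the PR
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [5, 10, 20, 30]},
    scoring='r2', cv=5)
grid.fit(X, y)
print(grid.best_params_)
```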
doc/modules/ensemble.rst
Outdated
```
>>> final_estimator = GradientBoostingRegressor(
...     n_estimators=25, subsample=0.5, min_samples_leaf=25,
...     max_features=1, random_state=42)
```
doc/modules/ensemble.rst
Outdated
```
>>> final_layer_rfr = RandomForestRegressor(
...     n_estimators=10, max_features=1,
...     max_leaf_nodes=5,random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
...     n_estimators=10, max_features=1,
...     max_leaf_nodes=5,random_state=42)
```

Suggested change:

```
>>> final_layer_rfr = RandomForestRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5,
...     random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5,
...     random_state=42)
```
Thank you! I'm always unsure of formatting in these situations; this was helpful.
Thanks @lucyleeow
Towards #16155
Replace Boston dataset in doc/modules/ensemble.rst.

Section: '1.11.4.2. Regression'
This section uses code from the example plot_gradient_boosting_regression.py, which has already been updated to use the diabetes dataset (#16400). Minor change to wording to reflect this change.

Section: '1.11.7. Voting Regressor'
This section uses code from plot_voting_regressor.py, which has already been updated to use the diabetes dataset (#16387). Wording change to reflect this.

Section: '1.11.8. Stacked generalization'
Amended to use a subset of the California housing dataset. Note that ('svr', SVR(C=1, gamma=1e-6)) was changed to use the default gamma='scale'. The multiple stacking example performs slightly worse (R2=0.77 vs R2=0.78) than the single stacking example.
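Regarding the gamma change above: dropping gamma=1e-6 means SVR falls back to the default gamma='scale', i.e. 1 / (n_features * X.var()). A small sketch, using load_diabetes as a stand-in because fetch_california_housing requires a download:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# SVR(C=1) leaves gamma at its default, 'scale'
reg = StackingRegressor(
    estimators=[('svr', SVR(C=1)), ('ridge', RidgeCV())],
    final_estimator=RidgeCV())
reg.fit(X, y)
print(reg.named_estimators_['svr'].gamma)
```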