
DOC Replace boston in ensemble.rst #16876

Merged
glemaitre merged 11 commits into scikit-learn:master from lucyleeow:DOC_ensemble
May 19, 2020

Conversation

@lucyleeow
Member

@lucyleeow lucyleeow commented Apr 8, 2020

Towards #16155

Replace Boston dataset in doc/modules/ensemble.rst

Section: '1.11.4.2. Regression'
This section uses code from the example plot_gradient_boosting_regression.py, which has already been updated to use the diabetes dataset (#16400). Minor wording change to reflect this.

Section: '1.11.7. Voting Regressor'
This section uses code from plot_voting_regressor.py, which has already been updated to use the diabetes dataset (#16387). Wording change to reflect this.

Section: '1.11.8. Stacked generalization'
Amended to use a subset of the California housing dataset. Note that ('svr', SVR(C=1, gamma=1e-6)) was changed to use the default gamma='scale'. The multiple stacking example performs slightly worse (R² = 0.77 vs R² = 0.78) than the single stacking example.
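For context, the change described for this section amounts to something like the sketch below: stacking on a subset of the California housing data, with SVR left at its default gamma='scale'. The subset size and estimator choices here are illustrative, not the exact code merged in the PR.

```python
# Hedged sketch of the 1.11.8 change: stacking on a subset of the
# California housing data, SVR at the default gamma='scale'.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = fetch_california_housing(return_X_y=True)
X, y = X[:1000], y[:1000]  # illustrative subset to keep the runtime small

reg = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('svr', SVR(C=1))],  # gamma='scale' is the default
    final_estimator=GradientBoostingRegressor(random_state=42))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
score = reg.fit(X_train, y_train).score(X_test, y_test)
print(f"R^2 on held-out data: {score:.2f}")
```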

@lucyleeow lucyleeow changed the title DOC Replace boston in ensemble.rst [MRG] DOC Replace boston in ensemble.rst Apr 8, 2020
@lucyleeow lucyleeow changed the title [MRG] DOC Replace boston in ensemble.rst DOC Replace boston in ensemble.rst Apr 10, 2020
@thomasjpfan
Member

May we have each section in its own PR? It would make review easier, which results in faster merging.

@lucyleeow
Member Author

Thanks @thomasjpfan. I have separated the other changed files into their own PRs (#16896, #16895 and #16894).
This PR only includes changes to ensemble.rst now.

Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @lucyleeow !


>>> # Training classifiers
>>> reg1 = GradientBoostingRegressor(random_state=1, n_estimators=10)
>>> reg2 = RandomForestRegressor(random_state=1, n_estimators=10)
Member


I am +0.5 on this change. I think the original intention of setting n_estimators was to reduce the runtime of the doctest.

Member Author


No problem, I will amend here and in the corresponding example (#16895 (comment))

Member


I am so sorry for being unclear.

I do prefer removing n_estimators here, as long as the increase in doctest runtime is not too great. In this case, the fit runtime is increased by ~1 second, which is acceptable.
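The runtime trade-off being discussed can be checked with a quick sketch like this (the dataset and timings here are illustrative):

```python
# Quick check of the doctest-runtime trade-off: fit time with a
# reduced n_estimators versus the default (100).
from time import perf_counter

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)
timings = {}
for n in (10, 100):  # 100 is the default n_estimators
    est = GradientBoostingRegressor(n_estimators=n, random_state=1)
    start = perf_counter()
    est.fit(X, y)
    timings[n] = perf_counter() - start
    print(f"n_estimators={n}: fit took {timings[n]:.3f}s")
```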

Member Author


Okay I will change it back!

>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> from sklearn.datasets import fetch_california_housing
>>> X, y = fetch_california_housing(return_X_y=True)
Member


Is there an issue with using load_diabetes here?

Member Author

@lucyleeow lucyleeow Apr 11, 2020


I should have mentioned this. Not really, but with the diabetes dataset the first 'single' stacking regressor reg gives an R² score of 0.55, whereas the multiple stacking example (multi_layer_regressor) gives an R² score of 0.52. Since we state that 'a stacking predictor predicts as good as the best predictor of the base layer', I thought this might be a bad example, or would require explaining why the score was a little worse. (The California housing dataset also produced a slightly lower score with multi_layer_regressor: 0.77 vs 0.78.)

Member


Hmm, for the user guide we can adapt the following:

from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

reg = StackingRegressor(
        estimators=estimators,
        final_estimator=GradientBoostingRegressor(random_state=42)
)

As for the comment about 'in practice', I think it is suggesting this:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42)

predictors = {'stacker': reg}
for name, est in estimators:
    predictors[name] = est

scores = {}
for name, est in predictors.items():
    y_pred = est.fit(X_train, y_train).predict(X_test)
    scores[name] = r2_score(y_test, y_pred)

Which results in:

scores
# {'stacker': 0.5533268189282505,
#  'ridge': 0.49182837290627335,
#  'lasso': 0.4866763449729544,
#  'knr': 0.44659346214225026}

Member Author


As for the comment about 'in practice', I think it is suggesting this:

Thanks for clearing that up @thomasjpfan. Would you also expect the result from 'multiple' stacked layers, e.g. multi_layer_regressor here:

final_layer = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)),
                ('gbrt', GradientBoostingRegressor(random_state=42))],
    final_estimator=RidgeCV()
    )
multi_layer_regressor = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('lasso', LassoCV(random_state=42)),
                ('knr', KNeighborsRegressor())],
    final_estimator=final_layer
)

to be similar/at least as good as reg here:

estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42))

or is that more difficult to determine because the final_estimator is different between reg and multi_layer_regressor?

With the diabetes dataset and the switch to KNeighborsRegressor, reg gives an R² score of 0.55 and multi_layer_regressor gives an R² score of 0.51.
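Assembling the snippets above, the comparison being discussed can be reproduced with a self-contained sketch like this (exact scores depend on the scikit-learn version):

```python
# Self-contained sketch of the single-layer vs multi-layer stacking
# comparison on the diabetes data, assembled from the snippets above.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base = [('ridge', RidgeCV()),
        ('lasso', LassoCV(random_state=42)),
        ('knr', KNeighborsRegressor())]

# Single stacking layer: gradient boosting as the final estimator.
single = StackingRegressor(
    estimators=base,
    final_estimator=GradientBoostingRegressor(random_state=42))

# Two stacking layers: the final estimator is itself a StackingRegressor.
final_layer = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)),
                ('gbrt', GradientBoostingRegressor(random_state=42))],
    final_estimator=RidgeCV())
multi = StackingRegressor(estimators=base, final_estimator=final_layer)

scores = {name: est.fit(X_train, y_train).score(X_test, y_test)
          for name, est in [('single', single), ('multi', multi)]}
print(scores)  # the thread reports roughly 0.55 vs 0.51
```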

Member


From my past experience with stacking, there are times when stacking does not improve the results, as you have demonstrated.

Member Author


No problem, will amend as you suggest!

Member Author

@lucyleeow lucyleeow Apr 16, 2020


Should I check for overfitting and tune the model parameters here?

Member


For this specific case, the user guide is trying to demonstrate the stacking API and I would not want to add tuning into the mix. (Specifically, I would not want to introduce GridSearchCV here.)

It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.

Member Author


It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.

Sorry, that is what I meant! By 'here' I meant for this PR.

Will do, thanks for your advice!

Member Author


@thomasjpfan I've tuned KNeighborsRegressor with diabetes, and GradientBoostingRegressor and RandomForestRegressor with the output of the stacked estimators. The R² value hasn't really changed, but it isn't overfitting now.
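The local tuning described here could look something like the sketch below. The parameter grid and the use of GridSearchCV are illustrative assumptions, not the tuning actually performed for the PR.

```python
# Hedged sketch of tuning KNeighborsRegressor locally on the diabetes
# data. The parameter grid here is illustrative only.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [5, 10, 20, 40],
                'weights': ['uniform', 'distance']},
    cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 2))
```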

Member

@glemaitre glemaitre left a comment


2 nitpicks, otherwise LGTM

Comment on lines +1402 to +1404
>>> final_estimator = GradientBoostingRegressor(
... n_estimators=25, subsample=0.5, min_samples_leaf=25,
... max_features=1, random_state=42)
Member


Suggested change
>>> final_estimator = GradientBoostingRegressor(
... n_estimators=25, subsample=0.5, min_samples_leaf=25,
... max_features=1, random_state=42)
>>> final_estimator = GradientBoostingRegressor(
... n_estimators=25, subsample=0.5, min_samples_leaf=25,
... max_features=1, random_state=42)

Comment on lines +1466 to +1471
>>> final_layer_rfr = RandomForestRegressor(
... n_estimators=10, max_features=1,
... max_leaf_nodes=5,random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
... n_estimators=10, max_features=1,
... max_leaf_nodes=5,random_state=42)
Member


Suggested change
>>> final_layer_rfr = RandomForestRegressor(
... n_estimators=10, max_features=1,
... max_leaf_nodes=5,random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
... n_estimators=10, max_features=1,
... max_leaf_nodes=5,random_state=42)
>>> final_layer_rfr = RandomForestRegressor(
... n_estimators=10, max_features=1, max_leaf_nodes=5,
... random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
... n_estimators=10, max_features=1, max_leaf_nodes=5,
... random_state=42)

Member Author


Thank you! I'm always unsure of formatting in these situations; this was helpful.

Member

@thomasjpfan thomasjpfan left a comment


Thank you @lucyleeow !

@glemaitre glemaitre merged commit 3ff3981 into scikit-learn:master May 19, 2020
@glemaitre
Member

thanks @lucyleeow

@lucyleeow lucyleeow deleted the DOC_ensemble branch May 19, 2020 11:00
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020


3 participants