DOC Replace boston in ensemble.rst (#16876)
Conversation
May we have each section in its own PR? It would make it easier to review, which would result in faster merging.

Thanks @thomasjpfan. I have separated the other changed files into their own PRs (#16896, #16895 and #16894).
thomasjpfan left a comment
Thank you for the PR @lucyleeow !
```
>>> # Training classifiers
>>> reg1 = GradientBoostingRegressor(random_state=1, n_estimators=10)
>>> reg2 = RandomForestRegressor(random_state=1, n_estimators=10)
```
I am +0.5 on this change. I think the original intention of setting n_estimators was to reduce the doctest runtime.
No problem, I will amend here and in the corresponding example (#16895 (comment))
I am so sorry for being unclear.
I do prefer removing n_estimators here as long as the increase in doctest runtime is not too great. In this case, the fit runtime increases by ~1 second, which is acceptable.
Okay I will change it back!
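For reference, the runtime tradeoff being discussed can be checked directly. A minimal sketch (timings are machine-dependent; the diabetes dataset is used here as in the updated docs):

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)

# Compare fit time with n_estimators=10 against the default (100)
timings = {}
for n in (10, 100):
    start = time.perf_counter()
    GradientBoostingRegressor(random_state=1, n_estimators=n).fit(X, y)
    timings[n] = time.perf_counter() - start
print(timings)
```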
doc/modules/ensemble.rst
Outdated
```
- >>> from sklearn.datasets import load_boston
- >>> X, y = load_boston(return_X_y=True)
+ >>> from sklearn.datasets import fetch_california_housing
+ >>> X, y = fetch_california_housing(return_X_y=True)
```
Is there an issue with using load_diabetes here?
I should have mentioned this. Not really, but with the diabetes dataset the first 'single' stacking regressor reg gives an R2 score of 0.55, whereas the multiple stacking example (multi_layer_regressor) gives an R2 score of 0.52. Since we state that 'a stacking predictor predicts as good as the best predictor of the base layer', I thought this might be a bad example, or require explaining why the score was a little worse. Though the California housing dataset also produced a slightly lower score with multi_layer_regressor (0.77 vs 0.78).
Hmm for the user guide we can adapt the following:

```python
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42)
)
```

As for the comment about in practice, I think it is suggesting this:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42)

predictors = {'stacker': reg}
for name, est in estimators:
    predictors[name] = est

scores = {}
for name, est in predictors.items():
    y_pred = est.fit(X_train, y_train).predict(X_test)
    scores[name] = r2_score(y_test, y_pred)
```

Which results in:

```python
scores
# {'stacker': 0.5533268189282505,
#  'ridge': 0.49182837290627335,
#  'lasso': 0.4866763449729544,
#  'knr': 0.44659346214225026}
```
> As for the comment about in practice, I think it is suggesting this:

Thanks for clearing that up @thomasjpfan. Would you also expect the result from 'multiple' stacked layers, e.g. multi_layer_regressor here:

```python
final_layer = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)),
                ('gbrt', GradientBoostingRegressor(random_state=42))],
    final_estimator=RidgeCV()
)
multi_layer_regressor = StackingRegressor(
    estimators=[('ridge', RidgeCV()),
                ('lasso', LassoCV(random_state=42)),
                ('knr', KNeighborsRegressor())],
    final_estimator=final_layer
)
```

to be similar to/at least as good as reg here:

```python
estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42))
```

or is that more difficult to determine because the final_estimator differs between reg and multi_layer_regressor?

With the diabetes dataset and the switch to KNeighborsRegressor, reg gives an R2 score of 0.55 and multi_layer_regressor gives an R2 score of 0.51.
From my past experience with stacking, there are times when stacking does not improve the results, as you have demonstrated.
No problem, will amend as you suggest!
Should we check for overfitting and tune the model parameters here?
For this specific case, the user guide is trying to demonstrate the stacking API and I would not want to add tuning into the mix. (Specifically, I would not want to introduce GridSearchCV here.)
It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.
> It would be nice to tune the parameters locally and then update the user guide with the tuned parameters.

Sorry, that is what I meant! By 'here' I meant for this PR.
Will do, thanks for your advice!
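For what it's worth, a simple way to check for the overfitting mentioned above is to compare train and test R2. A sketch using one of the base estimators (not the exact check run for this PR):

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

est = KNeighborsRegressor().fit(X_train, y_train)
train_r2 = r2_score(y_train, est.predict(X_train))
test_r2 = r2_score(y_test, est.predict(X_test))
# A large train/test gap suggests overfitting
print(train_r2, test_r2)
```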
@thomasjpfan I've tuned KNeighborsRegressor on the diabetes dataset, and GradientBoostingRegressor and RandomForestRegressor on the output of the stacked estimators. The R2 value hasn't really changed, but it isn't overfitting now.
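A minimal sketch of that kind of local tuning, assuming GridSearchCV over n_neighbors (the actual grid and values used for this PR are not recorded in the thread):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)

# Illustrative parameter grid; not the one actually used in the PR
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [5, 10, 20, 30]},
    scoring='r2', cv=5)
grid.fit(X, y)
print(grid.best_params_)
```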
doc/modules/ensemble.rst
Outdated
```
>>> final_estimator = GradientBoostingRegressor(
...     n_estimators=25, subsample=0.5, min_samples_leaf=25,
...     max_features=1, random_state=42)
```
doc/modules/ensemble.rst
Outdated
```
>>> final_layer_rfr = RandomForestRegressor(
...     n_estimators=10, max_features=1,
...     max_leaf_nodes=5,random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
...     n_estimators=10, max_features=1,
...     max_leaf_nodes=5,random_state=42)
```

Suggested change:

```
>>> final_layer_rfr = RandomForestRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5,
...     random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
...     n_estimators=10, max_features=1, max_leaf_nodes=5,
...     random_state=42)
```
Thank you! I'm always unsure of formatting in these situations; this was helpful.
Thanks @lucyleeow
Towards #16155
Replace Boston dataset in doc/modules/ensemble.rst.

Section: '1.11.4.2. Regression'
This section uses code from the example plot_gradient_boosting_regression.py, which has already been updated to use the diabetes dataset (#16400). Minor change to wording to reflect this change.

Section: '1.11.7. Voting Regressor'
This section uses code from plot_voting_regressor.py, which has already been updated to use the diabetes dataset (#16387). Wording change to reflect this.

Section: '1.11.8. Stacked generalization'
Amended to use a subset of the California housing dataset. Note that ('svr', SVR(C=1, gamma=1e-6)) was changed to use the default gamma='scale'. The multiple stacking example performs slightly worse (R2=0.77 vs R2=0.78) than the single stacking example.
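Regarding the gamma change above: dropping gamma=1e-6 means SVR falls back to the default gamma='scale', i.e. 1 / (n_features * X.var()). A small sketch, using load_diabetes as a stand-in because fetch_california_housing requires a download:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# SVR(C=1) leaves gamma at its default, 'scale'
reg = StackingRegressor(
    estimators=[('svr', SVR(C=1)), ('ridge', RidgeCV())],
    final_estimator=RidgeCV())
reg.fit(X, y)
print(reg.named_estimators_['svr'].gamma)
```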