[MRG] Speed up plot_stack_predictors.py#21726
[MRG] Speed up plot_stack_predictors.py#21726chritter wants to merge 9 commits intoscikit-learn:mainfrom
Conversation
The current code uses three predictors (Random Forest, Lasso, Gradient Boosting) and then combine them with StackingRegressor to show an automatic selection. Currently, with the default of 5 folds for cross-validation, the run time can get considerably long, in particular in settings where only one (v)CPU is available (like CI tests). Experiments have shown that switching to 2 folds instead of the default 5, decreases the execution time around a factor of 10: The following table gives run times (real run time) on my system with n_jobs = 1 added to all computations (to have comparable output) with adding the `cv` parameter to the `cross_val(_predict)` calls (rows) and to the stacking regressor (columns) | 5 | 2 | 5 | 22 | 11 | 2 | 7.5 | 2 | In particular, changing the `cv` value didn't have any influence on the final outcome.
|
Modified from @norbusan commit message: The current code uses three predictors (Random Forest, Lasso,
Experiments have shown that switching to 2 folds instead of the The following table gives run times (real run time) on my system
In particular, changing the |
|
Please run |
|
@adrinjalali I would appreciate your help with the review. I addressed your comment. Thanks! |
thomasjpfan
left a comment
There was a problem hiding this comment.
Thank you for the PR @chritter !
| start_time = time.time() | ||
| score = cross_validate( | ||
| est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0 | ||
| est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0, cv=2 |
There was a problem hiding this comment.
I think setting cv=2 cross_validate would promote bad ML practices. In this case, I prefer to have a longer running example with the default cv=5.
There was a problem hiding this comment.
That is fine with me, it still would cut the execution time to about the half, so it would anyway be a win.
There was a problem hiding this comment.
I pushed a commit that removes the above cv=2 part.
| elapsed_time = time.time() - start_time | ||
|
|
||
| y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0) | ||
| y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0, cv=2) |
There was a problem hiding this comment.
I feel like the cv setting should be consistent with cross_validate. Maybe @glemaitre has some input on this?
There was a problem hiding this comment.
I addressed this, using consistently 5-fold cv for cross_val_predict and cross_validate.
There was a problem hiding this comment.
Yes, I confirm that most performance gain is lost once we put the cv=5's back in place.
|
An alternative options to get a significant speedup: A) Remove Let me know which approach you deem reasonable to still follow good practices. |
I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low since we semi-recently moved the default from 10 to 100 in RandomForest. How much does the runtime lower when |
It takes only 28s, compared to the original 41s. If that is sufficient I would only implement this change. thanks! |
That sounds reasonable. Going down to 25 sounds reasonable to me, if the metric still hold. |
|
I mistakenly worked on this for a while. @chritter using |
|
|
I'll work on it if that's okay. |
|
Of course, I just didn't see anything here since this comment: #21726 (comment) |
|
superseded |
Reference Issues/PRs
Towards #21598
What does this implement/fix? Explain your changes.
These changes are accelerating example
../examples/ensemble/plot_stack_predictors.pyAny other comments?
Working with @norbusan
#DataUmbrella Sprint