Speed up example on plot_gradient_boosting_categorical.py #21634
Conversation
I'm not sure why the CI fails with the following error, since I didn't modify the corresponding line:
TomDLT
left a comment
Thanks for the PR.
Here are a few comments:
The CI error seems unrelated to this PR and caused by an unfortunate incompatibility between new numpy and old pandas (see numpy/numpy#18355).
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
ogrisel
left a comment
Thanks for the PR. However, I am not 100% sure the added complexity of the column filtering is worth the speed benefit.
Also, limiting max_iter to 50 for the first part of the example makes the difference with the last part a bit less visible, since the models in the first part are now underfitting a bit as well (the native categorical splits are now always better than OHE and ordinal encoding, while that was not the case before).
/cc @NicolasHug.
```python
mape = [np.mean(np.abs(item)) for item in items]
std_pred = [np.std(item) for item in items]
```
Not sure what std_pred stands for. Maybe the following would be more explicit:
```diff
-mape = [np.mean(np.abs(item)) for item in items]
-std_pred = [np.std(item) for item in items]
+mape_cv_mean = [np.mean(-item) for item in items]
+mape_cv_std = [np.std(item) for item in items]
```
the lines below will also need to be adjusted.
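A minimal, self-contained sketch of the suggested renaming. The data here is illustrative: each array stands in for one model's cross-validation "test_score" entry, where (as in the example) the scores are negated mean absolute percentage errors.

```python
import numpy as np

# Illustrative stand-in for the example's cross-validation results:
# each array mimics the "test_score" entry for one model, where the
# scores are negated mean absolute percentage errors (MAPE).
items = [
    np.array([-0.10, -0.12, -0.11]),
    np.array([-0.20, -0.22, -0.21]),
]

# Negate the scores to recover positive MAPE values, then aggregate.
mape_cv_mean = [np.mean(-item) for item in items]
mape_cv_std = [np.std(item) for item in items]
```

The explicit `_cv_mean`/`_cv_std` suffixes make it clear that these are aggregates over cross-validation folds, which is what the ambiguous `std_pred` name obscured.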
```diff
-print(f"Number of categorical features: {n_categorical_features}")
-print(f"Number of numerical features: {n_numerical_features}")
+print(f"Number of categorical features: {n_columns}")
+print(f"Number of numerical features: {n_columns}")
```
This will become wrong if n_columns is adjusted to another value. Please move back n_categorical_features and n_numerical_features here instead.
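A hedged sketch of one way to keep these counts correct regardless of how the selection changes: derive them from the dataframe's dtypes instead of hard-coding a number. The tiny dataframe below is an illustrative stand-in, not the Ames housing data used by the example.

```python
import pandas as pd

# Tiny illustrative dataframe standing in for the example's feature matrix.
X = pd.DataFrame(
    {
        "Neighborhood": pd.Categorical(["A", "B", "A"]),  # categorical column
        "LotArea": [8450, 9600, 11250],                   # numerical column
    }
)

# Derive the counts from the data instead of hard-coding them, so the
# printed numbers stay correct if the column selection changes.
n_categorical_features = int((X.dtypes == "category").sum())
n_numerical_features = X.shape[1] - n_categorical_features

print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")
```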
Thanks for your comments. I thought that reducing the number of columns would be the most promising option, since 80 columns seem unnecessary for the example to be relevant. I agree that the selection procedure adds too much complexity. Would you have any idea on either how to do the column selection in a clear and simple way (I was thinking we could manually select the columns and store them in a list at the beginning of the example, but it doesn't look very clean either), or more generally how to reduce the training time? Since the first part seems to be underfitting a bit with the new number of iterations, maybe I could increase the
Maybe you can subselect 10 categorical columns and 5 numerical columns manually by name and only change 1 line in the example? Try to focus on the columns that are most informative by running a permutation importance analysis on this dataset (outside of the example).
@ogrisel I ran the permutation importance analysis you describe and selected 10 categorical variables and 10 numerical variables. They don't exactly match the previous ones, in particular because the version dropping categorical features now performs significantly worse than the other versions (since there are fewer numerical variables, it may be harder for the 'dropped' version to learn enough). However, the benefit of using the native handling of categorical features is shown. The new version now takes 4 seconds to run (instead of ~15). What do you think?
It makes sense. I think this is not a problem.
And in my opinion, more importantly, the good predictive performance of the OHE and ordinal encoding strategies when the model does not underfit is still there. Can you confirm that you do not limit
This looks good, please feel free to update your PR accordingly.
Ok, I just updated the PR. I can confirm that I restored
TomDLT
left a comment
LGTM
The running time gain only goes from 12 sec to 9 sec, but I am not sure how much further we can go while keeping all the insights.
This PR also cleans up a few things in the example. Thanks!
This is a nice improvement, but we have this warning:
It would be nice if you could investigate the issue @pedugnat. Merging this one as is, since the warning is not caused by this PR.
…-learn#21634)
* sped up the example by reducing the number of columns and the number of iterators, along with code changes
* removed warning filter as it breaks the CI
* change in a comment
* fixed typo
* made code more robust as per PR comment
* applied black
* applied black
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>


Reference Issues/PRs
Contributes to #21598.
What does this implement/fix? Explain your changes.
In order to speed up the example, I did two main things:
* reduced the number of columns of the dataset;
* reduced max_iter of the HistGradientBoostingRegressor from 100 to 50.
Please see the before-after for the plots, and the cProfile of the old and new versions: on my computer, the new version takes 4.5 s vs 15.5 s, and the ranking of the fit times and the errors (plots) is the same for both figures:
cProfile, new version: 3894443 function calls (3808982 primitive calls) in 4.671 seconds
cProfile, old version: 6640438 function calls (6468083 primitive calls) in 15.580 seconds
Plots, new version:


Plots, old version:


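A minimal sketch of how such call-count and timing summaries can be produced with the standard library's cProfile; the `run_example` workload below is a placeholder standing in for actually executing the example script.

```python
import cProfile
import io
import pstats


def run_example():
    # Placeholder workload standing in for running the example script.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
run_example()
profiler.disable()

# Render a report like "N function calls (M primitive calls) in S seconds".
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```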