Speed up example on plot_gradient_boosting_categorical.py #21634
Conversation
I'm not sure why the CI fails with the following error, since I didn't modify the corresponding line:
TomDLT
left a comment
Thanks for the PR.
Here are a few comments:
The CI error seems unrelated to this PR and caused by an unfortunate incompatibility between new numpy and old pandas (see numpy/numpy#18355).
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
ogrisel
left a comment
Thanks for the PR. However, I am not 100% sure the added complexity of the column filtering is worth the speed benefit.
Also, limiting max_iter to 50 for the first part of the example makes the difference with the last part a bit less visible, since the models in the first part are now underfitting a bit as well (the native categorical splits are now always better than OHE and ordinal encoding, while that was not the case before).
/cc @NicolasHug.
```python
mape = [np.mean(np.abs(item)) for item in items]
std_pred = [np.std(item) for item in items]
```
Not sure what std_pred stands for. Maybe the following would be more explicit:
```diff
-mape = [np.mean(np.abs(item)) for item in items]
-std_pred = [np.std(item) for item in items]
+mape_cv_mean = [np.mean(-item) for item in items]
+mape_cv_std = [np.std(item) for item in items]
```
the lines below will also need to be adjusted.
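A minimal, self-contained sketch of the suggested renaming. The data here is illustrative: each array stands in for one model's cross-validation "test_score" entry, where (as in the example) the scores are negated mean absolute percentage errors.

```python
import numpy as np

# Illustrative stand-in for the example's cross-validation results:
# each array mimics the "test_score" entry for one model, where the
# scores are negated mean absolute percentage errors (MAPE).
items = [
    np.array([-0.10, -0.12, -0.11]),
    np.array([-0.20, -0.22, -0.21]),
]

# Negate the scores to recover positive MAPE values, then aggregate.
mape_cv_mean = [np.mean(-item) for item in items]
mape_cv_std = [np.std(item) for item in items]
```

The explicit `_cv_mean`/`_cv_std` suffixes make it clear that these are aggregates over cross-validation folds, which is what the ambiguous `std_pred` name obscured.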
```diff
-print(f"Number of categorical features: {n_categorical_features}")
-print(f"Number of numerical features: {n_numerical_features}")
+print(f"Number of categorical features: {n_columns}")
+print(f"Number of numerical features: {n_columns}")
```
This will become wrong if n_columns is adjusted to another value. Please move back n_categorical_features and n_numerical_features here instead.
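A hedged sketch of one way to keep these counts correct regardless of how the selection changes: derive them from the dataframe's dtypes instead of hard-coding a number. The tiny dataframe below is an illustrative stand-in, not the Ames housing data used by the example.

```python
import pandas as pd

# Tiny illustrative dataframe standing in for the example's feature matrix.
X = pd.DataFrame(
    {
        "Neighborhood": pd.Categorical(["A", "B", "A"]),  # categorical column
        "LotArea": [8450, 9600, 11250],                   # numerical column
    }
)

# Derive the counts from the data instead of hard-coding them, so the
# printed numbers stay correct if the column selection changes.
n_categorical_features = int((X.dtypes == "category").sum())
n_numerical_features = X.shape[1] - n_categorical_features

print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")
```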
Thanks for your comments. I thought that reducing the number of columns would be the most promising option, since 80 columns seem unnecessary for the example to be relevant. I agree that the selection procedure adds too much complexity. Would you have any idea on either how to do the column selection in a clear and simple way (I was thinking we could manually select the columns and store them in a list at the beginning of the example, but it doesn't look very clean either), or more generally how to reduce the training time? Since the first part seems to be underfitting a bit with the new number of iterations, maybe I could increase the
Maybe you can subselect 10 categorical columns and 5 numerical columns manually by name and only change 1 line in the example? Try to focus on the columns that are most informative by running a permutation importance analysis on this dataset (outside of the example).
@ogrisel I ran the permutation importance analysis you describe and selected 10 categorical variables and 10 numerical variables. They don't exactly match the previous ones, in particular because the version dropping categorical features now performs significantly worse than the other versions (since there are fewer numerical variables, it may be harder for the 'dropped' version to learn enough). However, the benefit of using the native handling of categorical features is shown. The new version now takes 4 seconds to run (instead of ~15). What do you think?
It makes sense. I think this is not a problem.
And in my opinion, more importantly, the good predictive performance of the OHE and ordinal encoding strategies when the model does not underfit is still there. Can you confirm that you do not limit
This looks good, please feel free to update your PR accordingly.
Ok, I just updated the PR. I can confirm that I restored
TomDLT
left a comment
LGTM
The running time gain only goes from 12 sec to 9 sec, but I am not sure how much further we can go while keeping all the insights.
This PR also cleans up a few things in the example. Thanks!
This is a nice improvement, but we have this warning:
It would be nice if you could investigate the issue @pedugnat. Merging this one as is, since the warning is not caused by this PR.
…-learn#21634)
* sped up the example by reducing the number of columns and the number of iterators, along with code changes
* removed warning filter as it breaks the CI
* change in a comment
* fixed typo
* made code more robust as per PR comment
* applied black
* applied black
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>


Reference Issues/PRs
Contributes to #21598.
What does this implement/fix? Explain your changes.
In order to speed up the example, I did two main things:
* reduced the number of columns of the dataset;
* reduced max_iter of the HistGradientBoostingRegressor from 100 to 50.
Please see the before-after for the plots, and the cProfile of the old and new versions: on my computer, the new version takes 4.5 s vs 15.5 s, and the ranking of the fit times and the errors (plots) is the same for both figures:
cProfile, new version: 3894443 function calls (3808982 primitive calls) in 4.671 seconds
cProfile, old version: 6640438 function calls (6468083 primitive calls) in 15.580 seconds
Plots, new version:


Plots, old version:


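A minimal sketch of how such call-count and timing summaries can be produced with the standard library's cProfile; the `run_example` workload below is a placeholder standing in for actually executing the example script.

```python
import cProfile
import io
import pstats


def run_example():
    # Placeholder workload standing in for running the example script.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
run_example()
profiler.disable()

# Render a report like "N function calls (M primitive calls) in S seconds".
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```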