[MRG+2] Faster Gradient Boosting Decision Trees with binned features #12807
ogrisel merged 276 commits into scikit-learn:master from
Conversation
So what you mean is that the fix is OK?

The fix is okay; we should include a comment on why the fix is needed.
coverage run is looking up its configuration file by default from
@NicolasHug I fixed the coverage thingy. Reading the coverage report, there are a bunch of things that we could improve with respect to test coverage. But this is experimental and I don't want to delay the merge further. Let's merge once my last commit is green on CI.
I cannot get coverage to ignore the setup.py files for some reason. Anyway, let's merge.
\o/ Awesome, thanks a lot everyone for the help and the reviews!!

@ogrisel @NicolasHug really nice job on this feature!!!
…eatures (scikit-learn#12807)" This reverts commit 5392234.
Congratulations on this work! This is so important. Also, I love the way that the "experimental" import was handled. Beautiful choice! It should stand as a reference for the future in similar situations. |
This PR proposes a new implementation for Gradient Boosting Decision Trees. This isn't meant to be a replacement of the current sklearn implementation but rather an addition.
This addresses the second bullet point from #8231.
This is a port from pygbm (with @ogrisel, in Numba), which itself uses lots of the optimizations from LightGBM.
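To make the "binned features" idea concrete, here is a hedged NumPy sketch of the quantile-binning step (an illustration only, not the actual Cython code in this PR): features are discretized into a small number of bins, so that split finding only needs fixed-size histograms of gradients and hessians instead of repeatedly sorting feature values.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 1))

n_bins = 16
# Bin edges are interior quantiles of the feature, so each bin receives
# roughly the same number of training samples.
edges = np.quantile(X[:, 0], np.linspace(0, 1, n_bins + 1)[1:-1])

# Map each value to its bin index; a uint8 is enough for <= 256 bins,
# which keeps the binned data small and histogram building cache-friendly.
binned = np.searchsorted(edges, X[:, 0]).astype(np.uint8)

print(binned.min(), binned.max())  # bin indices span 0 .. n_bins - 1
```

After this one-time O(n log n) binning pass, each boosting iteration only scans the `uint8` matrix, which is where most of the speedup over the sort-based implementation comes from.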
Algorithm details and refs
The main differences with the current sklearn implementation are:
Notes to reviewers
This is going to be a lot of work to review, so please feel free to tell me if there's anything I can do / add that could ease reviewing.
Here's a list of things that probably need to be discussed at some point or that are worth pointing out.
The code is a port of pygbm (from numba to cython). I've ported all the tests as well, so a huge part of the code has already been carefully reviewed (or written) by @ogrisel. There are still a few non-trivial changes to pygbm's code, to accommodate the numba -> cython translation.
Like [MRG] new K-means implementation for improved performances #11950, this PR uses OpenMP parallelism with Cython
- The code is in `sklearn/ensemble/_hist_gradient_boosting` and the estimators are exposed in `sklearn.experimental` (which is created here, as a result of a discussion during the Paris sprint).
- Like in LightGBM, the targets y, gains, values, and sums of gradients / hessians are doubles, and the gradients and hessians arrays are floats to save space (14c7d47). `Y_DTYPE` and the associated C type for targets `y` is double and not float, because with float the numerical checks (test_loss.py) would not pass. Maybe at some point we'll want to also allow floats, since using doubles takes twice as much space (which is not negligible, see the attributes of the `Splitter` class).
- I have only added a short note in the User Guide about the new estimators. I think that the gradient boosting section of the user guide could benefit from an in-depth rewrite. I'd be happy to do that, but in a later PR.
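For reference, this is roughly how the experimental opt-in looks from the user side (a sketch: the `enable_hist_gradient_boosting` import was required while the estimators were experimental, and is harmless or absent on later releases, hence the `try/except`):

```python
import numpy as np

# Opt-in import: required on scikit-learn versions where these estimators
# were still experimental; a no-op or absent on later releases.
try:
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
except ImportError:
    pass

from sklearn.ensemble import HistGradientBoostingClassifier

# Tiny, perfectly separable toy problem. The default min_samples_leaf is 20,
# so we use 40 samples to leave room for at least one split.
X = np.array([[0.0], [1.0], [2.0], [3.0]] * 10)
y = np.array([0, 0, 1, 1] * 10)

clf = HistGradientBoostingClassifier(max_iter=10).fit(X, y)
print(clf.score(X, y))
```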
- Currently the parallel code uses all possible threads. Do we want to expose `n_jobs` (OpenMP-wise, not joblib of course)?
- The estimator names are currently `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`.

API differences with current implementation:
Happy to discuss these points of course. In general I tried to match the parameter names with those of the current GBDTs.
New features:

- `validation_fraction` can also be an int to specify the absolute size of the validation set (not just a proportion).

Changed parameters and attributes:
- `n_estimators` parameter has been changed to `max_iter` because, unlike the current GBDT implementations, the underlying "predictors" aren't estimators. They are private and have no `fit` method. Also, in multiclass classification we build C * `max_iter` predictors.
- `estimators_` attribute has been removed for the same reason.
- `train_score_` is of size `n_estimators + 1` instead of `n_estimators` because it contains the score of the 0th iteration (before the boosting process).
- `oob_improvement_` is replaced by `validation_score_`, also with size `n_estimators + 1`.

Unsupported parameters and attributes:
- `subsample` (doesn't really make sense here)
- `criterion` (same)
- `min_samples_split` is not supported, but `min_samples_leaf` is supported.
- `sample_weight`-related parameters
- `min_impurity_decrease` is not supported (we have `min_gain_to_split` but it is not exposed in the public API)
- `warm_start`
- `max_features` (probably not needed)
- `staged_decision_function`, `staged_predict_proba`, etc.
- `init` estimator
- `feature_importances_`
- `loss_` attribute is not exposed.

Future improvements, for later PRs (no specific order):
- `_in_fit` hackish attribute.

Benchmarks
Done on my laptop: Intel i5 7th Gen, 4 cores, 8 GB RAM.
TLDR:
Details
Comparison between proposed PR and current estimators:
On binary classification only; I don't think it's really needed to do more since the performance difference is striking. Note that for larger sample sizes the current estimators simply cannot run because of the sorting step that never terminates. I don't provide the benchmark code; it's exactly the same as that of `benchmarks/bench_fast_gradient_boosting.py`.

Comparison between proposed PR and LightGBM / XGBoost:
On the Higgs-Boson dataset:
`python benchmarks/bench_hist_gradient_boosting_higgsboson.py --lightgbm --xgboost --subsample 5000000 --n-trees 50`

Sklearn: done in 28.787s, ROC AUC: 0.7330, ACC: 0.7346
LightGBM: done in 27.595s, ROC AUC: 0.7333, ACC: 0.7349
XGBoost: done in 41.726s, ROC AUC: 0.7335, ACC: 0.7351
Entire log:
Regression task:

`python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem regression --n-samples-max 5000000 --n-trees 50`

Binary classification task:

`python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 2 --n-samples-max 5000000 --n-trees 50`

Multiclass classification task:

`python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 3 --n-samples-max 5000000 --n-trees 50`
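The actual benchmark scripts live in the repo; as a rough, hypothetical stand-in that anyone can run quickly, here is a small synthetic timing comparison between the new estimator and the existing `GradientBoostingClassifier` (absolute numbers will vary a lot by machine, and a toy dataset this small understates the gap seen at millions of samples):

```python
from time import perf_counter

import numpy as np

try:  # opt-in only needed on releases where the estimators were experimental
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
except ImportError:
    pass

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

rng = np.random.RandomState(42)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

results = {}
for name, est in [
    ("GradientBoosting", GradientBoostingClassifier(n_estimators=50)),
    ("HistGradientBoosting", HistGradientBoostingClassifier(max_iter=50)),
]:
    tic = perf_counter()
    est.fit(X, y)
    results[name] = (perf_counter() - tic, est.score(X, y))

for name, (duration, acc) in results.items():
    print(f"{name}: done in {duration:.3f}s, train ACC: {acc:.4f}")
```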