Current situation
The scikit-learn gradient boosting is relatively slow compared to other implementations such as xgboost and lightgbm.
Which bottlenecks have been identified
During benchmarking, we identified a major difference between scikit-learn and xgboost.
In xgboost, all samples for a given feature are scanned in a single pass; each sample updates the impurity statistics of the node to which it belongs (cf. here).
In scikit-learn, by contrast, only the samples belonging to the current node are selected (cf. here or here), which leads to repeated, unnecessary scans of the data.
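To make the difference concrete, here is a minimal, hypothetical sketch of the two scanning strategies (the function names and the use of a plain gradient sum as the "statistic" are illustrative assumptions, not the actual scikit-learn or xgboost code):

```python
import numpy as np

def scan_xgboost_style(sorted_idx, node_of_sample, gradients, n_nodes):
    """One pass over all samples (pre-sorted by feature value):
    each sample updates the running statistic of its own node."""
    stats = np.zeros(n_nodes)
    for i in sorted_idx:  # a single scan covers every active node
        stats[node_of_sample[i]] += gradients[i]
    return stats

def scan_sklearn_style(sorted_idx, node_of_sample, gradients, n_nodes):
    """One pass per node: only that node's samples are selected,
    so the sorted array is re-scanned once for every node."""
    stats = np.zeros(n_nodes)
    for node in range(n_nodes):  # repeated scans of the same array
        for i in sorted_idx:
            if node_of_sample[i] == node:
                stats[node] += gradients[i]
    return stats
```

Both strategies compute the same per-node statistics, but the second performs `n_nodes` scans where the first performs one.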
What has been tested
In an effort to speed up gradient boosting, @ogrisel experimented with subsampling at each split to lower the cost of sorting and computing the statistics. The results did not yield a satisfactory time/accuracy trade-off to compete with the other implementations.
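As a rough illustration of what split-time subsampling could look like (this is a simplified sketch under my own assumptions, not @ogrisel's actual experiment; the function name, the `max_samples` parameter, and the squared-sum impurity proxy are all hypothetical):

```python
import numpy as np

def best_split_subsampled(feature_values, gradients, sample_idx,
                          max_samples=1000, seed=0):
    """Sketch: subsample a node's samples before the O(n log n)
    sort, trading split accuracy for computation time."""
    rng = np.random.default_rng(seed)
    if len(sample_idx) > max_samples:
        sample_idx = rng.choice(sample_idx, size=max_samples, replace=False)
    # sort the (sub)sample by feature value, then scan candidate splits
    order = np.argsort(feature_values[sample_idx])
    idx = sample_idx[order]
    g = gradients[idx]
    total = g.sum()
    left = np.cumsum(g)[:-1]          # gradient sum left of each split
    right = total - left              # gradient sum right of each split
    n_left = np.arange(1, len(g))
    n_right = len(g) - n_left
    # simplistic score: squared-sum proxy for impurity improvement
    score = left**2 / n_left + right**2 / n_right
    best = int(np.argmax(score))
    return feature_values[idx[best]], score[best]
```

The trade-off is visible directly: the sort and scan run on at most `max_samples` points, but the chosen threshold is only an estimate of the split on the full node.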
Proposal
I would like to contribute two changes, notably an alternative scanning architecture for the exact method, proposed in [WIP] Alternative architecture for regression tree #8458.
Related issue
#5212
@GaelVaroquaux @glouppe @ogrisel @jmschrei @raghavrv If you have any comments, I will be happy to hear them.