[New Feature] Fast Histogram Optimized Grower, 8x to 10x Speedup #1950
Description
This issue is to document a series of recent improvements in the tree grower. Related: #1673.
Faster training via feature binning
Histogram aggregation is a major computational bottleneck in tree growing. We introduce a new tree growing method, hist, where only a subset of possible split values are considered. As in FastBDT and LightGBM, continuous features are bucketed into discrete bins. Histogram accumulation then becomes faster due to fewer index manipulations.
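To make the binning step concrete, here is a simplified sketch (not xgboost's internal implementation) of quantile-based bucketing of one continuous feature into at most max_bin discrete bins:

```python
import numpy as np

def build_bins(feature_values, max_bin=256):
    """Compute quantile-based cut points for one continuous feature."""
    # Interior quantiles only: max_bin bins need at most max_bin - 1 cuts.
    quantiles = np.linspace(0, 100, max_bin + 1)[1:-1]
    return np.unique(np.percentile(feature_values, quantiles))

def bin_feature(feature_values, cuts):
    """Replace raw values with small integer bin IDs.

    In the hist method this binned representation is computed once and
    cached, then reused across all boosting iterations."""
    return np.searchsorted(cuts, feature_values, side="right").astype(np.uint8)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
cuts = build_bins(x, max_bin=256)
binned = bin_feature(x, cuts)
```

Replacing 8-byte floats with 1-byte bin IDs shrinks the working set and turns split finding into a scan over at most 256 histogram slots per feature.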
How does the new method differ from tree_method=approx? The existing approximate splitting method in xgboost also buckets continuous features into discrete bins to speed up training. The approx method generates a new set of bins for each iteration, whereas the hist method re-uses the bins over multiple iterations.
The latter approach enables additional optimizations that are not possible for the former, as follows:
- Cache of the bins: replace continuous feature values in the training data with bin IDs and cache the resulting data structure.
- Histogram subtraction trick: to compute the histogram for a node, simply take the difference between the histograms of its parent and sibling.
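The subtraction trick works because a parent's histogram is exactly the sum of its two children's histograms, so only the smaller child needs a data pass. A toy sketch (illustrative only, not xgboost's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, max_bin = 1000, 16
bin_ids = rng.integers(0, max_bin, size=n_rows)  # cached bin IDs per row
grad = rng.normal(size=n_rows)                   # per-row gradient values

def histogram(rows):
    """Accumulate per-bin gradient sums over a subset of rows."""
    return np.bincount(bin_ids[rows], weights=grad[rows], minlength=max_bin)

all_rows = np.arange(n_rows)
left_rows, right_rows = all_rows[:600], all_rows[600:]

parent = histogram(all_rows)   # already built when the parent was split
left = histogram(left_rows)    # one child built the expensive way
right = parent - left          # the sibling comes for free: no second data pass
```

In practice the grower builds the histogram for the child with fewer rows and derives the other by subtraction, roughly halving histogram work per level.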
Besides the above improvements, some highlights:
- Efficient representation of sparse matrices: as elsewhere in xgboost, sparse matrices are supported naturally, with an effective speedup for mixed sparse+dense matrices.
- Extensible to other existing features in xgboost, such as monotonic constraints and language bindings.
How to use
Simply set tree_method to hist. You may also want to set max_bin, which represents the (maximum) number of discrete bins to bucket continuous features. By default, max_bin is set to 256. Increasing this number improves the optimality of splits at the cost of higher computation time.
Flexible tree growing policies
The existing tree grower in xgboost grows a tree in a depth-wise fashion, executing all splits at the first level before any at the second, and so forth. The new grower lets you control the way new nodes are added to the tree:
- grow_policy=depthwise (default): split at nodes closest to the root, i.e. grow depth-wise.
- grow_policy=lossguide: split at nodes with the highest loss change. This behavior mimics that of LightGBM.
It has been reported that the lossguide policy often results in faster convergence in loss, though there is also a risk of over-fitting (see the preliminary results).
How to use
For now, only the hist grower supports multiple growing policies; we may extend the support to other growers in the future. So you should set tree_method to hist. The grow_policy parameter lets you select the growing policy; if unspecified, it defaults to depthwise. Two other parameters are relevant:
- max_leaves: maximum number of nodes to be added. Only relevant for the lossguide policy.
- max_depth: maximum depth. Behaves as usual when grow_policy is set to depthwise. If grow_policy is lossguide, max_depth can be set to zero, indicating the lack of limit on depth.
Benchmark
Configurations
- CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 6 physical cores
- Memory: 48 GB
- Training parameters: the evaluation metric is turned on during the benchmark, since that is how users commonly use the code
- exact:
  tree_method=exact, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
- approx:
  tree_method=approx, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
- hist + depthwise:
  tree_method=hist, grow_policy=depthwise, eta=0.1, gamma=1.0, max_depth=8, min_child_weight=100
- hist + lossguide:
  tree_method=hist, grow_policy=lossguide, eta=0.1, gamma=1.0, max_depth=0, max_leaves=255, min_child_weight=100
- LightGBM:
  learning_rate=0.1, max_bin=255, num_leaves=255, min_sum_hessian_in_leaf=100, min_data_in_leaf=0
Model accuracy
For allstate and higgs, we plot AUC (Area Under Curve) on the validation set. For yahoo, we plot NDCG (Normalized Discounted Cumulative Gain) at 10.

Notice that favoring splits with the highest loss change leads to overfitting in the case of allstate, which is closer to a real-world insurance dataset, while the validation and training curves are more consistent on higgs, which consists of simulated data points.
We tried different options, such as min_child_weight, but they didn't move the curves much. This suggests that depthwise is more conservative and more robust to overfitting; it should be used when the training and validation distributions may not be identical, or when overfitting is a problem on the dataset.
Single-threaded performance
Keep in mind that the tree growing policy of LightGBM is equivalent to setting grow_policy=lossguide in xgboost, so the LightGBM column should be compared with the hist + lossguide column.
Per-iteration runtime (seconds)
The average was taken over 300 iterations.
| Dataset | exact | approx | hist + depthwise | hist + lossguide | LightGBM |
|---|---|---|---|---|---|
| allstate | 26.47 | 14.34 | 3.82 | 5.24 | 5.41 |
| higgs | 61.38 | 25.94 | 6.17 | 6.44 | 5.41 |
| yahoo | 13.39 | 7.32 | 1.37 | 1.75 | 2.24 |
Cumulative runtime over 300 iterations

Multi-threaded performance (not yet fully optimized)
The parallel implementation of tree_method=hist needs more work for the time being.
Per-iteration runtime (seconds)
| Dataset | exact | approx | hist + depthwise | hist + lossguide | LightGBM |
|---|---|---|---|---|---|
| allstate | 7.50 | 4.21 | 2.39 | 2.94 | 3.26 |
| higgs | 15.55 | 6.96 | 3.38 | 3.37 | 3.07 |
| yahoo | 2.68 | 1.46 | 0.66 | 0.96 | 0.53 |
