Bug in Gradient Boosting: Feature Importances do not sum to 1

#### Description

I found conditions when Feature Importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees.  

This error occurs once the ensemble reaches a large number of estimators.  The exact conditions depend variously.  For example, the error shows up sooner with a smaller amount of training samples.  Or, if the depth of the tree is large.  

When this error appears, the predicted value seems to have converged.  But it’s unclear if the error is causing the predicted value not to change with more estimators.  In fact, the feature importance sum goes lower and lower with more estimators thereafter.  

Consequently, it's questionable if the tree ensemble code is functioning as expected.  

Here's sample code to reproduce this:

``` python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 720
# Note: From 712 onwards, the feature importance sum is less than 1

params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)

feature_importance_sum = np.sum(clf.feature_importances_)
print("At n_estimators = %i, feature importance sum = %f" % (n_estimators , feature_importance_sum))
```

_Output:_

```
At n_estimators = 720, feature importance sum = 0.987500
```

In fact, if we examine the tree at each staged prediction, we'll see that the feature importance goes to 0 after we hit a certain number of estimators.  (For the code above, it's 712.)

Here's code to describe what I mean:

``` python
for i, tree in enumerate(clf.estimators_):
    feature_importance_sum = np.sum(tree[0].feature_importances_)
    print("At n_estimators = %i, feature importance sum = %f" % (i , feature_importance_sum))
```

_Output:_

```
...
At n_estimators = 707, feature importance sum = 1.000000
At n_estimators = 708, feature importance sum = 1.000000
At n_estimators = 709, feature importance sum = 1.000000
At n_estimators = 710, feature importance sum = 1.000000
At n_estimators = 711, feature importance sum = 0.000000
At n_estimators = 712, feature importance sum = 0.000000
At n_estimators = 713, feature importance sum = 0.000000
At n_estimators = 714, feature importance sum = 0.000000
At n_estimators = 715, feature importance sum = 0.000000
At n_estimators = 716, feature importance sum = 0.000000
At n_estimators = 717, feature importance sum = 0.000000
At n_estimators = 718, feature importance sum = 0.000000
...
```

I wonder if we’re hitting some floating point calculation error. 

BTW, I've posted this issue on the mailing list [Link](https://mail.python.org/pipermail/scikit-learn/2016-September/000508.html).  There aren't a lot of discussion, but others seem to think there's a bug here too.

Hope we can get this fixed or clarified.

Thank you!
-Doug
#### Versions

Windows-7;'Python', '2.7.9 ;'NumPy', '1.9.2';'SciPy', '0.15.1';'Scikit-Learn', '0.16.1' 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug in Gradient Boosting: Feature Importances do not sum to 1 #7406

Description

Versions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Bug in Gradient Boosting: Feature Importances do not sum to 1 #7406

Description

Description

Versions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions