
Bug in Gradient Boosting: Feature Importances do not sum to 1 #7406

@dsc888

Description


I found conditions under which feature importance values do not add up to 1 in ensemble tree methods, such as Gradient Boosting Trees or AdaBoost Trees.

This error occurs once the ensemble reaches a large number of estimators. The exact threshold varies: for example, the error shows up sooner with fewer training samples, or with deeper trees.

When this error appears, the predicted value seems to have converged, but it's unclear whether the error is what causes the prediction to stop changing with more estimators. In fact, the feature importance sum keeps decreasing as more estimators are added thereafter.

Consequently, it's questionable if the tree ensemble code is functioning as expected.

Here's sample code to reproduce this:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 720
# Note: From 712 onwards, the feature importance sum is less than 1

params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)

feature_importance_sum = np.sum(clf.feature_importances_)
print("At n_estimators = %i, feature importance sum = %f" % (n_estimators, feature_importance_sum))

Output:

At n_estimators = 720, feature importance sum = 0.987500

In fact, if we examine the tree at each staged prediction, we'll see that the feature importance goes to 0 after we hit a certain number of estimators. (For the code above, it's 712.)

Here's code to describe what I mean:

for i, tree in enumerate(clf.estimators_):
    feature_importance_sum = np.sum(tree[0].feature_importances_)
    print("At n_estimators = %i, feature importance sum = %f" % (i, feature_importance_sum))

Output:

...
At n_estimators = 707, feature importance sum = 1.000000
At n_estimators = 708, feature importance sum = 1.000000
At n_estimators = 709, feature importance sum = 1.000000
At n_estimators = 710, feature importance sum = 1.000000
At n_estimators = 711, feature importance sum = 0.000000
At n_estimators = 712, feature importance sum = 0.000000
At n_estimators = 713, feature importance sum = 0.000000
At n_estimators = 714, feature importance sum = 0.000000
At n_estimators = 715, feature importance sum = 0.000000
At n_estimators = 716, feature importance sum = 0.000000
At n_estimators = 717, feature importance sum = 0.000000
At n_estimators = 718, feature importance sum = 0.000000
...
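A possible workaround in the meantime (just a sketch, assuming the zero-sum trees can simply be skipped): average the per-tree importances only over the trees that actually contribute a nonzero sum. Synthetic data via `make_regression` stands in for the Boston dataset here; the dataset choice is not essential to the idea.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
clf = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                random_state=0).fit(X, y)

# Per-tree importances sum to 1 for trees that actually split,
# and to 0 for degenerate trees that never split.
per_tree = np.array([stage[0].feature_importances_
                     for stage in clf.estimators_])
active = per_tree.sum(axis=1) > 0           # trees with at least one split
importances = per_tree[active].mean(axis=0)
print(np.isclose(importances.sum(), 1.0))   # True
```

Averaging only over the "active" trees restores a sum of 1 by construction, though of course it papers over whatever is producing the zero-importance trees in the first place.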

I wonder if we’re hitting some floating point calculation error.
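One way a per-tree sum can become exactly 0 (a sketch of a plausible mechanism, not a confirmed diagnosis): once the ensemble has essentially fit the training data, the residuals a later tree sees may be (near) constant, and a regression tree fit on a constant target never splits, so there is no impurity decrease to attribute to any feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A tree fit on a constant target is a single leaf: nothing to split on,
# so no importance is attributed to any feature.
X = np.arange(20, dtype=float).reshape(-1, 1)
y_constant = np.zeros(20)

stump = DecisionTreeRegressor(max_depth=6).fit(X, y_constant)
print(stump.tree_.node_count)            # 1 -- a lone leaf node
print(stump.feature_importances_.sum())  # 0.0
```

If the zero-sum trees above turn out to be single-leaf trees like this one, the ensemble-level sum falling below 1 would follow directly from averaging in their all-zero importance vectors.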

BTW, I've posted this issue on the mailing list (Link). There isn't much discussion there, but others also seem to think this is a bug.

Hope we can get this fixed or clarified.

Thank you!
-Doug

Versions

Windows 7; Python 2.7.9; NumPy 1.9.2; SciPy 0.15.1; Scikit-Learn 0.16.1
