Bug in Gradient Boosting: Feature Importances do not sum to 1 #7406
Description
I found conditions under which feature importance values do not add up to 1 in tree ensemble methods such as Gradient Boosting Trees or AdaBoost Trees.
This error occurs once the ensemble reaches a large number of estimators. The exact threshold varies with the setup. For example, the error shows up sooner with fewer training samples, or when the tree depth is large.
When this error appears, the predicted value seems to have converged. But it’s unclear if the error is causing the predicted value not to change with more estimators. In fact, the feature importance sum goes lower and lower with more estimators thereafter.
Consequently, it's questionable if the tree ensemble code is functioning as expected.
Here's sample code to reproduce this:
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
boston = datasets.load_boston()
X, Y = (boston.data, boston.target)
n_estimators = 720
# Note: From 712 onwards, the feature importance sum is less than 1
params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)
feature_importance_sum = np.sum(clf.feature_importances_)
print("At n_estimators = %i, feature importance sum = %f" % (n_estimators, feature_importance_sum))
Output:
At n_estimators = 720, feature importance sum = 0.987500
In fact, if we examine the tree at each boosting stage, we'll see that the per-tree feature importance sum drops to 0 once we pass a certain number of estimators. (For the code above, this starts at the 712th tree, which is index 711 in the zero-based output.)
Here's code to describe what I mean:
for i, tree in enumerate(clf.estimators_):
    # Each stage holds one regression tree; i is the zero-based tree index.
    feature_importance_sum = np.sum(tree[0].feature_importances_)
    print("At n_estimators = %i, feature importance sum = %f" % (i, feature_importance_sum))
Output:
...
At n_estimators = 707, feature importance sum = 1.000000
At n_estimators = 708, feature importance sum = 1.000000
At n_estimators = 709, feature importance sum = 1.000000
At n_estimators = 710, feature importance sum = 1.000000
At n_estimators = 711, feature importance sum = 0.000000
At n_estimators = 712, feature importance sum = 0.000000
At n_estimators = 713, feature importance sum = 0.000000
At n_estimators = 714, feature importance sum = 0.000000
At n_estimators = 715, feature importance sum = 0.000000
At n_estimators = 716, feature importance sum = 0.000000
At n_estimators = 717, feature importance sum = 0.000000
At n_estimators = 718, feature importance sum = 0.000000
...
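The two outputs above look consistent with each other if the ensemble value is simply the mean of the per-tree values. The sketch below is only an illustration of that arithmetic, not scikit-learn's actual code: the helper `ensemble_importance_sum` is hypothetical, and it assumes that trees 0 to 710 each sum to 1 and that all remaining trees (711 to 719, including the ones elided by "...") sum to 0.

```python
# Hypothetical illustration, not scikit-learn's implementation.
def ensemble_importance_sum(per_tree_sums):
    """Return the plain mean of the per-tree importance sums."""
    return sum(per_tree_sums) / len(per_tree_sums)

# Assumption: 711 trees whose importances sum to 1, followed by
# 9 trees whose importances sum to 0 (n_estimators = 720 in total).
per_tree_sums = [1.0] * 711 + [0.0] * 9

total = ensemble_importance_sum(per_tree_sums)
print("feature importance sum = %f" % total)
```

Under these assumptions the mean is 711/720 = 0.9875, the same value reported for n_estimators = 720, which would also explain why the sum keeps dropping as more zero-importance trees are added.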
I wonder if we’re hitting some floating point calculation error.
BTW, I've posted this issue on the mailing list (Link). There hasn't been a lot of discussion, but others seem to think there's a bug here too.
Hope we can get this fixed or clarified.
Thank you!
-Doug
Versions
Windows-7; Python 2.7.9; NumPy 1.9.2; SciPy 0.15.1; Scikit-Learn 0.16.1