FIX several bugs in initial predictions for GBDT #12983

jnothman merged 46 commits into scikit-learn:master from
Conversation
```python
class QuantileEstimator(object):
# 0.23
    dtype=np.float64)
else:
    try:
        self.init_.fit(X, y, sample_weight=sample_weight)
```
I think that we are using something like this to check the support for `sample_weight`:

```python
support_sample_weight = has_fit_parameter(self.init_, "sample_weight")
if not support_sample_weight and not sample_weight_is_none:
    raise ValueError("xxxx")
```

```
If None it uses ``loss.init_estimator``.
init : estimator or 'zero', optional (default=None)
    An estimator object that is used to compute the initial predictions.
    ``init`` has to provide `fit` and `predict_proba`. If 'zero', the
```
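The check the reviewer sketches relies on inspecting the estimator's `fit` signature. Here is a minimal stdlib-only sketch of what sklearn's `has_fit_parameter` helper does (the toy estimator classes are hypothetical, for illustration only):

```python
import inspect

def has_fit_parameter(estimator, parameter):
    # Mirrors sklearn.utils.validation.has_fit_parameter: True if the
    # estimator's fit() signature accepts the given keyword argument.
    return parameter in inspect.signature(estimator.fit).parameters

# Hypothetical toy estimators, just for illustration:
class WithWeights:
    def fit(self, X, y, sample_weight=None):
        return self

class WithoutWeights:
    def fit(self, X, y):
        return self

print(has_fit_parameter(WithWeights(), "sample_weight"))    # True
print(has_fit_parameter(WithoutWeights(), "sample_weight")) # False
```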
are we using the double or single backticks?
I think we use single backticks for anything that has a glossary entry (that will automatically link it) and double backticks for everything else (to avoid Sphinx warnings).
@glemaitre Once #13000 gets merged, we do not need to worry about this anymore! Single backticks without a reference will automatically be treated as double backticks!
sklearn/ensemble/losses.py
Outdated
```
Number of classes
"""
def init_estimator(self):
    # Predicts the median
```
I think that you can remove this comment. The 'quantile' strategy with 0.5 should be self-explanatory (and the same goes for the fact that this is the Least Absolute Error).
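For context, a minimal stdlib-only sketch of what a median-predicting init estimator, i.e. `DummyRegressor(strategy='quantile', quantile=0.5)`, boils down to (the class name here is hypothetical):

```python
import statistics

class MedianDummy:
    # Memorizes the training median at fit time and predicts it for
    # every sample, which is the LAD (least absolute error) init value.
    def fit(self, X, y):
        self.constant_ = statistics.median(y)
        return self

    def predict(self, X):
        return [self.constant_] * len(X)

est = MedianDummy().fit([[0]] * 5, [1, 2, 3, 10, 100])
print(est.predict([[0], [0]]))  # [3, 3]
```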
sklearn/ensemble/losses.py
Outdated
```
raw_predictions : array, shape (n_samples, K)
    The raw_predictions (i.e. values from the tree leaves)

sample_weight : array-like, shape (n_samples,), optional
```
sklearn/ensemble/losses.py
Outdated
```
the does not support probabilities raises AttributeError.
"""
raise TypeError(
    '%s does not support predict_proba' % type(self).__name__)
```
`self.__class__.__name__`, this might be the same actually
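As a quick check of the reviewer's question: `type(self).__name__` and `self.__class__.__name__` are indeed the same for ordinary classes (toy class, hypothetical name):

```python
class Loss:
    # Two ways of getting the instance's class name; for normal classes
    # (no __class__ shenanigans) they return the same string.
    def name_via_type(self):
        return type(self).__name__

    def name_via_class(self):
        return self.__class__.__name__

loss = Loss()
print(loss.name_via_type())   # Loss
print(loss.name_via_class())  # Loss
```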
No, the problem isn't related to the validation data. It's related to the training data. Basically, the ValueErrors raised in e2e1a5c may happen in 2 scenarios:

I'm sorry, I find it hard to explain clearly and concisely. I think that using stratify in
The init predictor's output can be fixed by aligning it up through `classes_`, I think. But stratifying does seem a reasonable fix. Does this happen elsewhere in the library with validation splits?
Yes, but we still need to identify which of the classes are missing in
I don't think the same issue happens in other estimators. This issue is happening because our main estimator (GBDT) relies on another estimator (the init estimator) which is passed only a subset of the data, where some classes might be missing because of early stopping. GBDTs are the only estimators that use early stopping + another estimator, as far as I know. I took a look at different early stopping implementations:
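A toy illustration of the failure mode described above (data and split are made up): an unstratified early-stopping split can leave the training subset without a rare class, so an init estimator fitted on that subset never sees it:

```python
# 100 samples where class "b" is rare and happens to sit at the end.
y = ["a"] * 98 + ["b"] * 2

# Naive, unstratified 90/10 early-stopping split: the last 10 samples
# are held out for validation.
y_train, y_val = y[:90], y[90:]

print(set(y_train))  # {'a'} -- the init estimator never sees class "b"
print(set(y_val))    # both classes end up in the validation set only
```

A stratified split would instead keep the class proportions in both subsets, which is the fix discussed in this thread.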
Would it be acceptable that I open a new PR to "bugfix" the multilayer perceptron and gradient boosting by making the splits stratified, and wait for it to be merged?
fine with me.

I am still +1 here

Thanks for a phenomenal effort, @NicolasHug!

awesome! thanks heaps @NicolasHug <https://github.com/NicolasHug>!
Reference Issues/PRs
Closes #12436, a continuation of the work from @jeremiedbb
What does this implement/fix? Explain your changes.
This PR:
- fixes bugs related to the `init` estimator parameter in GBDTs
- documents `init='zero'`, which was supported but undocumented
- moves the losses out of `gradient_boosting.py` into a `losses.py` file, where they have a new `get_init_raw_predictions` method
- uses `DummyClassifier` or `DummyRegressor` as the default init estimators
- renames `y_pred` into `raw_predictions`, and relevant (private) methods as well
- observes an average absolute difference of `0.000388` (std = `0.008266`) between the 2 methods, with `n_estimators=1`, `n_samples=10000`

The `loss.get_init_raw_predictions(X, predictor)` methods return a `raw_prediction` with the correct shape `(n_samples, K)`, where `K` is the number of classes in multiclass classification, else `1`.

Those `raw_predictions` are homogeneous to what the trees predict, not to what the ensemble predicts. For regression GBDTs this is the same thing, but for e.g. binary classification, `raw_prediction` is homogeneous to a log-odds ratio.

For more, see #12436 (comment) and the following discussion (not sure if this is much clearer ^^)
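As a sketch of the "homogeneous to a log-odds ratio" point for binary classification (the helper name below is hypothetical): the init estimator's predicted positive-class probability would be mapped to log-odds before the trees add their corrections:

```python
import math

def proba_to_raw_prediction(proba_positive):
    # Map a probability to log-odds, the space the tree leaves live in
    # for binary classification. Clip to avoid log(0).
    eps = 1e-12
    p = min(max(proba_positive, eps), 1 - eps)
    return math.log(p / (1 - p))

print(proba_to_raw_prediction(0.5))      # 0.0 -> balanced classes start at zero
print(proba_to_raw_prediction(0.9) > 0)  # True -> positive-leaning prior
```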
Any other comments?