ENH Poisson loss for HistGradientBoostingRegressor #16692

thomasjpfan merged 23 commits into scikit-learn:master
Conversation
ping @NicolasHug
Thanks @lorentzenchr , this looks good!
I mostly have minor comments.
Should we check that we always have y >= 0?
Please make a minor update to ensemble.rst (around line 955) to document the new loss.
This will also need an entry in the what's new.
```
# than least squares measured in Poisson deviance as score.
rng = np.random.RandomState(42)
X, y, coef = make_regression(n_samples=500, coef=True, random_state=rng)
coef /= np.max(np.abs(coef))
```
Why is this needed?
Also, at this point, since we're also overriding y, should we still be using make_regression?
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py (outdated; resolved)
```
assert_almost_equal(np.mean(y_baseline), y_train.mean())

# Test baseline for y_true = 0
y_train.fill(0.)
```
Suggested change:

```diff
- y_train.fill(0.)
+ y_train = np.zeros(100)
```
Why do you prefer it this way?
```
assert gbdt.score(X, y) > .9


def test_poisson_loss():
```
Would it make sense to also test that the score is above a given threshold?
Unlike the R² score, it is hard to give an absolute "good" value for the Poisson deviance. With the "D² score" of #15244 this would make more sense.
I added a DummyRegressor with the mean as prediction, which is (almost) equivalent to a D² score, and I added out-of-sample tests.
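The mean-predicting DummyRegressor baseline mentioned above can be sketched roughly as follows (a sketch with illustrative data; `dev_dummy` and the D²-like formula are my own naming, not code from this PR):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_poisson_deviance

rng = np.random.RandomState(42)
X_train = rng.standard_normal(size=(200, 2))
X_test = rng.standard_normal(size=(100, 2))
y_train = rng.poisson(lam=3.0, size=200)
y_test = rng.poisson(lam=3.0, size=100)

# A mean-predicting dummy model plays the role of the "null model" in a
# D^2-like score: any useful model should beat it out of sample.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
dev_dummy = mean_poisson_deviance(y_test, dummy.predict(X_test))

# For a fitted model `model`, a D^2-like score would then be:
#   d2 = 1 - mean_poisson_deviance(y_test, model.predict(X_test)) / dev_dummy
```

This mirrors the idea of the later D² score: a relative improvement over the deviance of the trivial mean prediction.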
@NicolasHug Thanks for your fast first review pass. I think I addressed all comments. I have to say, the histogram gradient boosting implementation seems like a piece of art. I wish it had been that easy to include Poisson for linear models 😄
NicolasHug
left a comment
Thanks @lorentzenchr , a few more nits but looks good!
Thanks for the fast work!
(I wonder why the CI doesn't show the test suite instances... tests pass locally at least)
pinging @ogrisel who will be interested.
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py (five outdated, resolved threads)
Can someone explain why the test suddenly fails and how to resolve it? An unfitted …
You'll need to call `check_is_fitted` first. I think the error comes from the fact that you added a `predict` override that accesses `self.loss_`, so now the error is "this estimator doesn't have a loss_ attribute" instead of "this estimator isn't fitted" (as would be raised by `check_is_fitted`).

Alternatively, this should also work:

```
pred = self._raw_predict(X).ravel()
return self.loss_.inverse_link_function(pred)
```
@NicolasHug Thanks. That solves it. In particular, I was wondering why the tests did pass before. Never mind.
thomasjpfan
left a comment
Nice work here @lorentzenchr
```
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# TODO: For speed, we could remove the constant xlogy(y_true, y_true)
# Advantage of this form: minimum of zero at raw_predictions = y_true.
```
Are we taking advantage of this advantage somewhere?
Not that I know of. Might be interesting to see if it matters (at all).
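For illustration, the loss form being discussed can be sketched as follows (a sketch, not the PR's actual implementation; `half_poisson_deviance` is an illustrative name). Keeping the constant `xlogy(y_true, y_true)` term does not affect the gradients, but it shifts the loss so that it is exactly zero when `exp(raw_predictions) == y_true`:

```python
import numpy as np
from scipy.special import xlogy

def half_poisson_deviance(y_true, raw_predictions):
    # Half Poisson deviance with a log link: mu = exp(raw_predictions).
    # xlogy(y_true, y_true) is a constant w.r.t. raw_predictions; including
    # it makes the minimum exactly zero at mu == y_true (handy for checks).
    return (xlogy(y_true, y_true) - y_true * raw_predictions
            + np.exp(raw_predictions) - y_true)

y = np.array([1.0, 2.0, 5.0])
# At the optimal raw prediction log(y), the loss vanishes pointwise.
loss_at_optimum = half_poisson_deviance(y, np.log(y))
```

Without the constant term the gradients are unchanged, so the "advantage" is purely about interpretability of the loss values.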
```
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
```
Suggested change:

```diff
- gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
+ gbdt_pois = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
```

And below.
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
```
Suggested change:

```diff
- gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
+ gbdt_ls = HistGradientBoostingRegressor(loss='least_squares',
```

And below.
```
# log(0)
assert y_train.sum() > 0
baseline_prediction = loss.get_baseline_prediction(y_train, None, 1)
assert baseline_prediction.shape == tuple()  # scalar
```
Nit:

```diff
- assert baseline_prediction.shape == tuple()  # scalar
+ assert np.isscalar(baseline_prediction)
```
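One caveat worth noting (my observation, not from the thread): for a 0-d NumPy array the two assertions are not equivalent, because `np.isscalar` returns False for 0-d arrays even though their shape is `()`:

```python
import numpy as np

a = np.array(1.5)    # 0-d array, shape ()
s = np.float64(1.5)  # NumPy scalar type, also reports shape ()

# The shape check accepts both the 0-d array and the NumPy scalar ...
assert a.shape == tuple() and s.shape == tuple()
# ... but np.isscalar only accepts the scalar type, not the 0-d array.
assert np.isscalar(s)
assert not np.isscalar(a)
```

So whether the suggested change is a drop-in replacement depends on whether `get_baseline_prediction` returns a NumPy scalar or a 0-d array.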
Thank you @lorentzenchr!
@thomasjpfan Thank you for your review and merging. 👍 Now, the good old, brand new Poisson GLM will come out in the same release as this Poisson HGB. That is a strong competitor! 😄
Thanks for all the work by the three of you in this PR! Looking forward to the release :)
Reference Issues/PRs
This PR partly addresses #16668 and #5975.
What does this implement/fix? Explain your changes.
This PR implements the Poisson loss for
HistGradientBoostingRegressor, i.e. splitting based on improvement in Poisson deviance.