ENH Poisson loss for HistGradientBoostingRegressor #16692

thomasjpfan merged 23 commits into scikit-learn:master
Conversation
ping @NicolasHug
Thanks @lorentzenchr , this looks good!
I mostly have minor comments.
Should we check that we always have y >= 0?
Please make a minor update to ensemble.rst (around line 955) to document the new loss.
This will also need an entry in the what's new.
```
# than least squares measured in Poisson deviance as score.
rng = np.random.RandomState(42)
X, y, coef = make_regression(n_samples=500, coef=True, random_state=rng)
coef /= np.max(np.abs(coef))
```
Why is this needed?
Also, at this point, since we're also overriding y, should we still be using make_regression?
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py (outdated; resolved)
```
assert_almost_equal(np.mean(y_baseline), y_train.mean())

# Test baseline for y_true = 0
y_train.fill(0.)
```
Suggested change:

```diff
- y_train.fill(0.)
+ y_train = np.zeros(100)
```
Why do you prefer it this way?
```
assert gbdt.score(X, y) > .9


def test_poisson_loss():
```
Would it make sense to also test that the score is above a given threshold?
Unlike the R² score, it is hard to give an absolute "good" value for the Poisson deviance. With the "D² score" of #15244 this would make more sense.
I added a DummyRegressor with the mean as prediction, which is (almost) equivalent to a D² score, and I added out-of-sample tests.
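The mean-predicting DummyRegressor baseline mentioned above can be sketched roughly as follows (a sketch with illustrative data; `dev_dummy` and the D²-like formula are my own naming, not code from this PR):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_poisson_deviance

rng = np.random.RandomState(42)
X_train = rng.standard_normal(size=(200, 2))
X_test = rng.standard_normal(size=(100, 2))
y_train = rng.poisson(lam=3.0, size=200)
y_test = rng.poisson(lam=3.0, size=100)

# A mean-predicting dummy model plays the role of the "null model" in a
# D^2-like score: any useful model should beat it out of sample.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
dev_dummy = mean_poisson_deviance(y_test, dummy.predict(X_test))

# For a fitted model `model`, a D^2-like score would then be:
#   d2 = 1 - mean_poisson_deviance(y_test, model.predict(X_test)) / dev_dummy
```

This mirrors the idea of the later D² score: a relative improvement over the deviance of the trivial mean prediction.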
@NicolasHug Thanks for your fast first review pass. I think I addressed all comments. I have to say, the histogram gradient boosting implementation seems like a piece of art. I wish it had been that easy to include Poisson for linear models 😄
NicolasHug
left a comment
Thanks @lorentzenchr , a few more nits but looks good!
Thanks for the fast work!
(I wonder why the CI doesn't show the test suite instances... tests pass locally at least)
pinging @ogrisel who will be interested.
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py (five outdated, resolved threads)
Can someone explain why the test suddenly fails and how to resolve it? An unfitted …
You'll need to call `check_is_fitted` first. I think the error comes from the fact that you added a `predict` override that accesses `self.loss_`, so now the error is "this estimator doesn't have a loss_ attribute" instead of "this estimator isn't fitted" (as would be raised by `check_is_fitted`).

Alternatively, this should also work:

```
pred = self._raw_predict(X).ravel()
return self.loss_.inverse_link_function(pred)
```
@NicolasHug Thanks. That solves it. In particular, I was wondering why the tests did pass before. Never mind.
thomasjpfan
left a comment
Nice work here @lorentzenchr
```
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# TODO: For speed, we could remove the constant xlogy(y_true, y_true)
# Advantage of this form: minimum of zero at raw_predictions = y_true.
```
Are we taking advantage of this advantage somewhere?
Not that I know of. Might be interesting to see if it matters (at all).
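For illustration, the loss form being discussed can be sketched as follows (a sketch, not the PR's actual implementation; `half_poisson_deviance` is an illustrative name). Keeping the constant `xlogy(y_true, y_true)` term does not affect the gradients, but it shifts the loss so that it is exactly zero when `exp(raw_predictions) == y_true`:

```python
import numpy as np
from scipy.special import xlogy

def half_poisson_deviance(y_true, raw_predictions):
    # Half Poisson deviance with a log link: mu = exp(raw_predictions).
    # xlogy(y_true, y_true) is a constant w.r.t. raw_predictions; including
    # it makes the minimum exactly zero at mu == y_true (handy for checks).
    return (xlogy(y_true, y_true) - y_true * raw_predictions
            + np.exp(raw_predictions) - y_true)

y = np.array([1.0, 2.0, 5.0])
# At the optimal raw prediction log(y), the loss vanishes pointwise.
loss_at_optimum = half_poisson_deviance(y, np.log(y))
```

Without the constant term the gradients are unchanged, so the "advantage" is purely about interpretability of the loss values.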
```
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
```
Suggested change:

```diff
- gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
+ gbdt_pois = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
```

And below.
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)
gbdt1 = HistGradientBoostingRegressor(loss='poisson', random_state=rng)
gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
```
Suggested change:

```diff
- gbdt2 = HistGradientBoostingRegressor(loss='least_squares',
+ gbdt_ls = HistGradientBoostingRegressor(loss='least_squares',
```

And below.
```
# log(0)
assert y_train.sum() > 0
baseline_prediction = loss.get_baseline_prediction(y_train, None, 1)
assert baseline_prediction.shape == tuple()  # scalar
```
Nit:

```diff
- assert baseline_prediction.shape == tuple()  # scalar
+ assert np.isscalar(baseline_prediction)
```
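One caveat worth noting (my observation, not from the thread): for a 0-d NumPy array the two assertions are not equivalent, because `np.isscalar` returns False for 0-d arrays even though their shape is `()`:

```python
import numpy as np

a = np.array(1.5)    # 0-d array, shape ()
s = np.float64(1.5)  # NumPy scalar type, also reports shape ()

# The shape check accepts both the 0-d array and the NumPy scalar ...
assert a.shape == tuple() and s.shape == tuple()
# ... but np.isscalar only accepts the scalar type, not the 0-d array.
assert np.isscalar(s)
assert not np.isscalar(a)
```

So whether the suggested change is a drop-in replacement depends on whether `get_baseline_prediction` returns a NumPy scalar or a 0-d array.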
Thank you @lorentzenchr!
@thomasjpfan Thank you for your review and merging. 👍 Now, the good old, brand new Poisson GLM will come out in the same release as this Poisson HGB. That is a strong competitor! 😄
Thanks for all the work by the three of you in this PR! Looking forward to the release :)
Reference Issues/PRs
This PR partly addresses #16668 and #5975.
What does this implement/fix? Explain your changes.
This PR implements the Poisson loss for
HistGradientBoostingRegressor, i.e. splitting based on improvement in Poisson deviance.