Minimal Generalized Linear Models implementation (L2 + lbfgs) #14300
rth merged 292 commits into scikit-learn:master
Conversation
* smarter initialization of intercept
* PEP 257 -- Docstring Conventions
* minor docstring changes
* P2 also accepts 1d array and interprets it as a diagonal matrix
* improved input checks for P1 and P2
…hecks
* adapt examples of GeneralizedLinearModel to new defaults for P1, P2 and selection
* fix precision/decimal issue in test_poisson_enet
* use more robust least squares instead of solve in IRLS
* fix sign error in input checks
* add Binomial distribution
* add Logit link
* tests for binomial against LogisticRegression
* option 'auto' for link
* reduce code duplication by replacing @abstractproperty by @property
* refactor into function _irls_solver
* refactor into function _cd_solver
* replace safe_sparse_dot by matmul operator @
* more efficient handling of Fisher matrix
* sparse coo matrices are converted to csc or csr
* sample weights don't accept sparse matrices
* minor doc changes
* renamed option irls to guess
* removed option least_squares
* updated tests
* new estimator GeneralizedLinearRegressor
* loss functions for Tweedie family and Binomial
* elastic net penalties
* control of penalties by matrix P2 and vector P1
* new solvers: coordinate descent, irls
* tests
* documentation
* example for Poisson regression
* make glm module private
* typos
* improve tests
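One commit above notes that P2 may be passed as a 1d array and interpreted as a diagonal penalty matrix. A minimal numpy sketch of that equivalence (variable names are hypothetical, not the PR's actual API): the quadratic penalty `w @ diag(p2) @ w` is the same as the cheaper elementwise form `w @ (p2 * w)`.

```python
import numpy as np

# hypothetical coefficient vector and per-feature L2 weights
w = np.array([0.5, -1.0, 2.0])
p2_vec = np.array([1.0, 2.0, 0.5])

# full diagonal matrix form vs. 1d-array shortcut
P2 = np.diag(p2_vec)
pen_matrix = w @ P2 @ w          # general matrix penalty
pen_vector = w @ (p2_vec * w)    # equivalent diagonal shortcut, O(d) not O(d^2)

assert np.isclose(pen_matrix, pen_vector)
```

Accepting the 1d form avoids materializing a dense d-by-d matrix when the penalty is purely per-feature.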
NicolasHug left a comment
I made minor pushes to compare the perf of the 2 models side by side in the example (293214c)
CI is green, LGTM. I think most of the comments are either minor or addressed.
Thank you very much for the great work and for your patience @lorentzenchr @rth . Much appreciated. Looking forward to seeing the future improvements!
Feel free to review my last changes and merge when you see fit
And the second prize of chocolate goes to ... @NicolasHug 🥈 🍫 🎉
Thanks a lot for the review @NicolasHug and for quickly addressing comments @lorentzenchr ! Let's give it another couple of days in case @ogrisel has other suggestions. For the person who is going to merge this please edit the merge commit description to remove individual commits but to keep one of each of
(Sorry, accidental key press; updated message above with the missing part.)
Very excited for this! Sorry I've not been able to review in a while.
@ogrisel As a last remark: how do you feel about reversing the sorting of the x-axis in the last plot of plot_tweedie_regression_insurance_claims.py from "safest to riskiest" to "riskiest to safest"? This way, both examples would have a comparable "rank" plot at the end.
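For reference, reversing such a sort order is a one-liner. A sketch with numpy (the array contents here are hypothetical, not the example's actual data):

```python
import numpy as np

# hypothetical per-group risk scores and their group labels
risk = np.array([0.2, 0.9, 0.5, 0.1])
labels = np.array(["A", "B", "C", "D"])

safest_first = labels[np.argsort(risk)]          # current x-axis order
riskiest_first = labels[np.argsort(risk)[::-1]]  # proposed reversed order
```

Passing the reversed label order to the plot's x-axis would give both examples the same "riskiest first" ranking convention.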
OK, merging with +2 and several partial reviews. Thanks @lorentzenchr and to all reviewers! I'm really glad this is done.
Feel free to open a follow-up PR. Also looking forward to other additional features we could add from #9405. Will comment there.
🍺 !
Many thanks to all of you! This was hard work and with enough perseverance we got it done. 👏
Awesome! Might be worth announcing on the mailing list and asking people to take it for a ride.
Definitely some impressive perseverance. But you forgot to credit the influential effect of timely chocolate, too.
@lorentzenchr Would you like to write that announcement email? Users can try it with nightly wheels (https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds) and the dev documentation (https://scikit-learn.org/dev/modules/linear_model.html#generalized-linear-regression).
@rth Thanks for asking. As I'm not (yet) on the scikit-learn mailing list and as I can't find a precedent for new-feature announcements to use as a template, I'd rather let you go ahead 😊
…ikit-learn#14300) Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
A minimal implementation of Generalized Linear Models (GLM) from #9405 by @lorentzenchr.

This only includes the L2 penalty with the lbfgs solver. In other words, it excludes L1 penalties (and the CD solver), matrix penalties, warm start, some distributions (e.g. BinomialDistribution), and the newton-cg & irls solvers from the original GLM PR. The goal is to get an easier-to-review initial implementation. Benchmarks were done in #9405 (comment).
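To illustrate what "L2 + lbfgs" means here, the objective can be sketched directly with scipy: a Poisson deviance with log link plus an L2 penalty, minimized by L-BFGS. This is a simplified stand-in under assumed data and an assumed `alpha`, not the PR's actual code.

```python
import numpy as np
from scipy.optimize import minimize

# synthetic Poisson-distributed targets with a log link (hypothetical data)
rng = np.random.default_rng(0)
n_samples, n_features = 500, 3
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([0.3, -0.2, 0.1])
y = rng.poisson(np.exp(X @ true_coef))

alpha = 1e-4  # L2 regularization strength (hypothetical value)

def objective(w):
    # Poisson negative log-likelihood (dropping terms constant in w) + L2
    lin_pred = X @ w
    return np.mean(np.exp(lin_pred) - y * lin_pred) + 0.5 * alpha * (w @ w)

def gradient(w):
    mu = np.exp(X @ w)
    return X.T @ (mu - y) / n_samples + alpha * w

res = minimize(objective, np.zeros(n_features), jac=gradient, method="L-BFGS-B")
coef = res.x  # should approximately recover true_coef
```

The merged estimator wraps exactly this kind of penalized-deviance minimization behind a scikit-learn fit/predict API, with the link and deviance chosen by the distribution family.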
TODO

* `sample_weight` validations (MAINT Common sample_weight validation #14307)
* `sklearn.linear_model`
* consider using a diagonal preconditioner as discussed in "Scaling issues in l-bfgs for LogisticRegression" (#15556) and "MRG Logistic regression preconditioning" (#15583) for logistic regression. Edit: Not in this PR -- Roman