
[MRG] Ridge can use sample weights in feature space (X.T.dot(X) gram matrix)#3034

Merged
ogrisel merged 2 commits into scikit-learn:master from eickenberg:ridge_sample_weights_in_feature_space
Apr 15, 2014

Conversation

@eickenberg
Contributor

Hello everybody,

here is an enhancement to the ridge_regression function: up until now, if sample_weight is given, the solver-selection logic imposes a treatment in sample space, which can lead to MemoryErrors if the number of samples is large, even though the number of features is small. See http://stackoverflow.com/questions/22766978/sklearn-ridge-and-sample-weight-gives-memory-error for an external report of this issue.

All I added was the possibility to treat sample weights in _solve_dense_cholesky and not only in _solve_dense_cholesky_kernel.
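To make the memory argument concrete, here is a minimal sketch (not the actual scikit-learn code; the function name and the sqrt-of-weights rescaling are illustrative assumptions) of weighted ridge solved in feature space, where the gram matrix is only (n_features, n_features) regardless of n_samples:

```python
import numpy as np

def ridge_feature_space(X, y, sample_weight, alpha):
    # One standard way to fold in sample weights: rescale rows by sqrt(w).
    # The gram matrix Xw.T.dot(Xw) then stays (n_features, n_features),
    # so memory use does not grow with the number of samples.
    sw = np.sqrt(sample_weight)
    Xw = X * sw[:, np.newaxis]
    yw = y * sw
    A = Xw.T.dot(Xw)                   # X.T.dot(X)-style gram matrix
    A.flat[::A.shape[0] + 1] += alpha  # ridge penalty on the diagonal
    return np.linalg.solve(A, Xw.T.dot(yw))
```

With all weights equal to 1 this reduces to ordinary ridge, and giving a sample weight of 2 is equivalent to duplicating that sample.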

I provide a batch of tests checking that the results of these two functions are the same given sample weights.

Further, I provide a test to ensure that the correct branch is chosen in ridge_regression (i.e. use _solve_dense_cholesky if n_samples > n_features even if there are sample weights). I implemented it so that it will cause a MemoryError if the wrong branch is taken. By TDD I made sure it fails first; now it passes. Please comment on this practice: evidently, in the current, corrected case, the MemoryError will not be raised, but if ever the branching is wrong and the computer running the tests has 2TB of memory available, this test will no longer raise a MemoryError. If there are better ways of checking the right branching, please let me know.

PS: There are some inconsistencies in function signature default values: _solve_dense_cholesky_kernel has sample_weight=None as default, which will, if the function is called directly, be interpreted as though sample weights are actually present. This has never caused an error, because the public function ridge_regression always calls it with sample_weight=1. if no sample weights are given. Should I correct this in the present PR or elsewhere?

@coveralls

Coverage Status

Coverage remained the same when pulling fd81c72 on eickenberg:ridge_sample_weights_in_feature_space into 42663e4 on scikit-learn:master.

Member

I think this should raise a ValueError if sample_weight.ndim > 1 instead of calling ravel. I do not see any valid case for a higher than 1-d sample_weight.

Member

I think we should raise a ValueError if sample_weight.ndim > 1 instead of calling ravel.

👍

Contributor Author

The sample weight could have been 0-d, i.e. just a scalar; this was my way of making it an array. But it is true that the subsequent multiplications work with scalars as well as 1-d arrays. The crucial point is that the array must not be higher-dimensional, so yes, I will add this verification.

Contributor Author

Is using np.atleast_1d, in order to be sure to obtain something on which I can call .shape, justifiable, or do you prefer that I check isinstance(sample_weight, numbers.Number) vs isinstance(sample_weight, np.ndarray)?

@ogrisel
Member

ogrisel commented Apr 3, 2014

About the test that checks the right branching in Ridge by giving data that would cause a memory error if Ridge decides to pick the wrong solver, I think it's fine.

If other people think that this is too dangerous (for instance if someone introduces a regression while refactoring the Ridge class on a developer box with 2TB+ of RAM), then we would need to find a way to instrument the ridge_regression function but that sounds too complicated.

@ogrisel
Member

ogrisel commented Apr 3, 2014

The docstrings for the "solver" and "sample_weight" params need to be updated to document the new behavior in all the Ridge* classes and in the ridge_regression function.

@GaelVaroquaux
Member

If other people think that this is too dangerous (for instance if someone introduces a regression while refactoring the Ridge class on a developer box with 2TB+ of RAM),

Yes, on a shared box (like drago), which would then screw up everybody on the box. I am not too excited about it :$

@ogrisel
Member

ogrisel commented Apr 3, 2014

About the sample_weight=1. default, I think we should switch to sample_weight=None by default to be consistent with other models such as SGDClassifier for instance, while still handling the sample_weight=1 case internally to keep backward compat...

@GaelVaroquaux
Member

About the sample_weight=1. default, I think we should switch to sample_weight=None by default to be consistent with other models

+1

@ogrisel
Member

ogrisel commented Apr 3, 2014

I think the checks at line:

https://github.com/eickenberg/scikit-learn/blob/ridge_sample_weights_in_feature_space/sklearn/linear_model/ridge.py#L271

are not consistent: if the user passes a sparse matrix as input along with a sample_weight array, the dense_cholesky solver will be used with a warning and will then probably crash. I think this should raise a ValueError with an explicit error message (and tests).

@ogrisel
Member

ogrisel commented Apr 3, 2014

Yes, on a shared box (like drago), which would then screw up everybody on the box. I am not too excited about it :$

We can just remove that test then. Monkey-patching is too cumbersome just for a test I think.

@eickenberg
Contributor Author

Both branches in solver="dense_cholesky" use safe_sparse_dot, but it is true that I added a non-sparse-safe multiplication by sample weights in there. I will try to make it safe as elegantly as possible (it amounts to multiplication by a diagonal matrix, which necessarily needs to be sparse, because a dense one would be enormous).
Independently of this, I can also change the warning to a ValueError.

@ogrisel
Member

ogrisel commented Apr 3, 2014

I think I prefer a ValueError to a warning when the solver is explicitly passed by the user (i.e. set to a non-default value). But @mblondel might prefer to keep the current behavior.

+1 for a test for the sample_weight + sparse data case if you can make it work.

Contributor Author

The sample weight error is already raised in fit. Otherwise np.average, which is called by _center_data, will raise an equivalent error message.

Contributor Author

This may be useful on a more abstract level

@coveralls

Coverage Status

Coverage remained the same when pulling 8a5e2a9 on eickenberg:ridge_sample_weights_in_feature_space into 42663e4 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Apr 4, 2014

This PR seems to fix #1190 but maybe not as @mblondel intended.

@ogrisel
Member

ogrisel commented Apr 4, 2014

@eickenberg if you think you are done with the changes you planned on your side please prefix the title of the PR with the "[MRG]" tag to let reviewers know that this PR is ready for final review. Otherwise please put the "[WIP]" tag instead.

@eickenberg eickenberg changed the title [ENH] Ridge can use sample weights in feature space (X.T.dot(X) gram matrix) [MRG] Ridge can use sample weights in feature space (X.T.dot(X) gram matrix) Apr 4, 2014
@eickenberg
Contributor Author

Switched to MRG. Happy to discuss further extensions / refactorings if they make the ridge story more consistent.
WRT issue #1190: what I did was implement the same loss function that is presented there, but I do not distribute square roots of the sample weights, since they multiply back together at a later stage (or can in fact be kept together). I address sparse-safe multiplication by using a scipy.sparse.dia_matrix instead of working directly on the data. This should not cause overhead (except for a copy of the data, of course; if that is an issue, we are in gradient-descent territory anyway).
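The sparse-safe multiplication described above can be sketched as follows (illustrative helper name, not the PR's exact code); scipy.sparse.diags builds the sparse diagonal (dia_matrix) of weights:

```python
import numpy as np
import scipy.sparse as sp

def scale_rows(X, sample_weight):
    # Row i of X is scaled by sample_weight[i] via a sparse diagonal
    # matrix. This works for dense and sparse X alike, whereas a dense
    # (n_samples, n_samples) diagonal matrix would be enormous.
    sw_diag = sp.diags(sample_weight)  # sparse diagonal of the weights
    return sw_diag.dot(X)
```

Sparse input stays sparse through this operation, which is exactly why the dia_matrix trick avoids both the memory blow-up and the non-sparse-safe elementwise multiplication.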

@eickenberg
Contributor Author

I don't completely understand the (non-)failure of the Travis CI builds on some of my commits: I wrote a failing test for sparse matrices, and it definitely fails on my machine, but Travis says it passes: eickenberg@8c9410d

Conversely, after fixing the failing test, Travis says that one of its builds fails, but two pass: eickenberg@ecb8a99

The same goes for the commit after the PEP 8 / PEP 257 corrections.

Member

Why are those 2 commented out? Please include them (the code should internally convert them to one of csr / csc).

Contributor Author

Should this conversion be done within ridge? Should I warn that the format is not appropriate?

Certain array-checking functions (safe_asarray, I think) seem to choke on these because they don't expose .data properly.

Member

I think in other estimators we do not issue a warning when a sparse matrix conversion is required. The fact that safe_asarray chokes might be a bug, but it should probably be fixed in a separate PR. Please add an inline comment such as "XXX: add support for other sparse matrix formats without data copy if possible".

Contributor Author

safe_asarray is already called in _BaseRidge.fit, so the conversion needs to go before this. I'd prefer to check this within ridge_regression, but that would call for placing the safe_asarray logic within ridge_regression as well.

Member

safe_asarray is already called in _BaseRidge.fit, so the conversion needs to go before this. I'd prefer to check this within ridge_regression, but that would call for placing the safe_asarray logic within ridge_regression as well.

Sounds good to me.

@mblondel
Member

mblondel commented Apr 4, 2014

The solution suggested in #1190 would still be useful for the "lsqr" solver (which is very fast on sparse data).

@ogrisel
Member

ogrisel commented Apr 4, 2014

Indeed, the test history looks weird. I checked that in the last build the sparse + sample_weight test is actually running. However, it raises a warning:

Sample weights must work with sparse matrices ... /home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/ridge.py:283: UserWarning: sample_weight and class_weight not supported in sparse_cg, fall back to dense_cholesky.
"dense_cholesky." % solver)
ok

I think using the model with the default parameters should not raise a warning. If the input is sparse, sample_weight is given, and the solver is left to "auto", the model should automatically select the cholesky solver without a warning, and this should be documented in the docstring.

@mblondel
Member

mblondel commented Apr 4, 2014

I'd suggest renaming dense_cholesky to just cholesky. dense_cholesky is misleading in that one may think it cannot handle sparse data, but it can.

@GaelVaroquaux
Member

I'd suggest renaming dense_cholesky to just cholesky. dense_cholesky is misleading in that one may think that it cannot handle sparse data but it can.

+1

@larsmans
Member

I think you should rebase again. git pull --rebase upstream master (where upstream is the main repo), then git push -f origin ridge_sample_weights_in_feature_space.

@eickenberg
Contributor Author

Did another rebase, which led to an interesting telescopic sum of conflicting diffs :)

But I don't know if it now looks as it is supposed to.

@larsmans
Member

Pretty much, but before merging I do prefer a further squash (though that's just my opinion). Ideally there should at least be no commits that leave the code base in a broken state, i.e. where the tests don't pass: it should be possible to run git bisect to fix bugs without being bothered by stuff not related to the bug you're chasing.

Member

Please restrict lines to 80 columns; my editor doesn't like longer lines. Also, the stuff about #1190 is an implementation detail, and this log is primarily for users to see whether they need to upgrade anything.

@coveralls

Coverage Status

Coverage remained the same when pulling 878d264 on eickenberg:ridge_sample_weights_in_feature_space into fb795eb on scikit-learn:master.

@ogrisel
Member

ogrisel commented Apr 11, 2014

Apart from #3034 (comment), I think @larsmans would like most commits of this PR to be squashed, and then we are good to go for merging.

@eickenberg
Contributor Author

Simplified has_sw to has_sw = sample_weight is not None for ridge_regression, Ridge, RidgeClassifier and _BaseRidge.

I didn't touch it in _RidgeGCV etc., because there sample_weight is implicitly assumed to be 1 in the code if nothing was set (there are multiplications by sample_weight). As indicated before, I am very willing to undertake a full refactoring of this code, but right now my understanding of the analytical ridge leave-one-out calculations is purely theoretical (and the coding style in that part of the file is somewhat different from what I have edited).

I can tackle it in this PR if it adds sufficiently to consistency.

@eickenberg
Contributor Author

Ok, it ended up being easy to make this work for RidgeGCV etc with sample_weight=None as default, but I had to set it to 1.0 if initialized with None in the body of RidgeClassifierCV.fit.

@ogrisel
Member

ogrisel commented Apr 14, 2014

@eickenberg do you need help for the squashing? Here is some doc:

http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

… commit over around 20 commits to avoid failing tests
@eickenberg
Contributor Author

Thanks for that. I think it worked. @fabianp encouraged me through the sticky parts :)

@ogrisel
Member

ogrisel commented Apr 15, 2014

Looks good; for some reason Travis did not pick up your new commit. I will run the tests on my box. If they pass I will merge.

@eickenberg
Contributor Author

Yes, something is weird; I am not getting update emails anymore either (I saw your comment 20 hours later), so probably some hooks aren't running?

@ogrisel
Member

ogrisel commented Apr 15, 2014

Hm, the cause is a conflict in the whats_new file, as usual. I am on it.

@eickenberg
Contributor Author

Ok, thanks!

@ogrisel ogrisel merged commit 8307eee into scikit-learn:master Apr 15, 2014
@ogrisel
Member

ogrisel commented Apr 17, 2014

I forgot to say: thanks @eickenberg for the non-trivial fix!

@eickenberg
Contributor Author

Thanks for seeing it through with me :) @ogrisel @mblondel @larsmans @GaelVaroquaux @agramfort @fabianp


@mblondel
Member

You should add your name to the header next time.

@ogrisel
Member

ogrisel commented Apr 17, 2014

Indeed, please @eickenberg feel free to open a new PR for this and I will merge it right away.

@eickenberg
Contributor Author

Haha OK I'll do that, because I love making pull requests! :)
