[MRG] Ridge can use sample weights in feature space (X.T.dot(X) Gram matrix) #3034
Conversation
sklearn/linear_model/ridge.py
I think this should raise a ValueError if sample_weight.ndim > 1 instead of calling ravel. I do not see any valid case for a sample_weight with more than one dimension.
> I think we should raise a ValueError if sample_weight.ndim > 1 instead of calling ravel.

👍
The sample weight could have been 0-d, i.e. just a scalar; this was my way of making it an array. But it is true that the subsequent multiplications work with scalars as well as 1-d arrays. The crucial point is that the array must not be higher-dimensional, so yes, I will add this verification.
Is using np.atleast_1d, in order to be sure to obtain something on which I can call .shape, justifiable, or do you prefer that I check isinstance(sample_weight, numbers.Number) vs isinstance(sample_weight, np.ndarray)?
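A minimal sketch of the validation being discussed, with a hypothetical helper name (this is not the actual scikit-learn code): accept scalars and 1-d arrays, and raise a ValueError for anything higher-dimensional.

```python
import numbers

import numpy as np


def check_sample_weight(sample_weight, n_samples):
    """Accept a scalar or a 1-d array of length n_samples; reject the rest."""
    if isinstance(sample_weight, numbers.Number):
        # Broadcasting handles the scalar case in the subsequent
        # multiplications, so no conversion to an array is needed.
        return float(sample_weight)
    sample_weight = np.atleast_1d(np.asarray(sample_weight))
    if sample_weight.ndim > 1:
        raise ValueError("sample_weight must be 0-d or 1-d, got shape %r"
                         % (sample_weight.shape,))
    if sample_weight.shape[0] != n_samples:
        raise ValueError("sample_weight has %d entries, expected %d"
                         % (sample_weight.shape[0], n_samples))
    return sample_weight
```

The np.atleast_1d call covers 0-d arrays, which otherwise have a shape of () and would slip past a naive shape check.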
About the test that checks the right branching in Ridge by giving data that would cause a MemoryError if Ridge picks the wrong solver: I think it's fine. If other people think this is too dangerous (for instance, if someone introduces a regression while refactoring the Ridge class on a developer box with 2 TB+ of RAM), then we would need to find a way to instrument the …
The docstrings for the "solver" and "sample_weight" params need to be updated to document the new behavior in all the …
Yes, on a shared box (like drago), which would then screw up everybody …
About the …
+1
I think the checks at line … are not consistent: if the user passes a sparse matrix as input along with a …
We can just remove that test then. Monkey-patching is too cumbersome just for a test, I think.
Both branches in solver="dense_cholesky" use safe_sparse_dot. But it is true that I added a non-safe multiplication by sample weights in there. I will try to make it safe as elegantly as possible (it amounts to a multiplication by a diagonal matrix, which necessarily needs to be sparse, because it would otherwise be enormous).
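That diagonal-matrix weighting might be sketched as follows (a hypothetical helper using scipy.sparse, not the PR's actual code): left-multiplying by diag(sqrt(w)) rescales each row, and keeping the diagonal matrix sparse avoids materializing anything dense of size (n_samples, n_samples).

```python
import numpy as np
import scipy.sparse as sp


def apply_sample_weight(X, y, sample_weight):
    # Rescale the rows of X and the entries of y by sqrt(sample_weight).
    # For sparse X, use a sparse diagonal matrix so the weighting stays
    # sparse end to end.
    sw = np.sqrt(sample_weight)
    if sp.issparse(X):
        X_weighted = sp.diags(sw) * X  # sparse diag times sparse X -> sparse
    else:
        X_weighted = sw[:, np.newaxis] * X
    return X_weighted, sw * y
```

Weighting with sqrt(w) on both X and y is equivalent to weighting the squared residuals by w in the least-squares objective, which is why the same rescaling works for both solver branches.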
I think I prefer … +1 for a test for the …
sklearn/linear_model/ridge.py
The sample weight error is already raised in fit. Otherwise np.average, which is called by _center_data, will raise an equivalent error message.
This may be useful on a more abstract level.
@eickenberg if you think you are done with the changes you planned on your side, please prefix the title of the PR with the "[MRG]" tag to let reviewers know that this PR is ready for final review. Otherwise please put the "[WIP]" tag instead.
Switched to MRG. Happy to discuss further extensions / refactorings if they make the ridge story more consistent.
I don't completely understand the (non-)failure of the Travis CI builds at some of my commits: I wrote a failing test for sparse matrices, and it definitely fails on my machine, but Travis says it passes (eickenberg@8c9410d). Conversely, after fixing the failing test, Travis says that one of its builds fails, but two pass (eickenberg@ecb8a99). The same goes for the commit after the PEP8/PEP257 corrections.
Why are those 2 commented out? Please include them (the code should internally convert them to one of csr / csc).
Should this conversion be done within the ridge code? Should I warn that the format is not appropriate?
Certain array checking functions (safe_asarray, I think) seem to choke on these because they don't expose .data properly.
I think in other estimators we do not issue a warning when a sparse matrix conversion is required. The fact that safe_asarray chokes might be a bug, but it should probably be fixed in a separate PR. Please add an inline comment such as "XXX: add support for other sparse matrix formats without data copy if possible".
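That convert-without-warning pattern might look like the following sketch (illustrative helper name; the exact scikit-learn logic differs):

```python
import scipy.sparse as sp


def as_csr_or_csc(X):
    # Internally convert other sparse formats (coo, lil, dok, ...) to csr,
    # silently, as other estimators do. Dense inputs pass through untouched.
    # XXX: add support for other sparse matrix formats without data copy
    # if possible.
    if sp.issparse(X) and X.format not in ('csr', 'csc'):
        X = X.tocsr()
    return X
```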
safe_asarray is already called in _BaseRidge.fit, so the conversion needs to go before this. I'd prefer to check this within ridge_regression, but this would call for placing the safe_asarray logic within ridge_regression as well.
> safe_asarray is already called in _BaseRidge.fit, so the conversion needs to go before this. I'd prefer to check this within ridge_regression, but this would call for placing the safe_asarray logic within ridge_regression as well.

Sounds good to me.
The solution suggested in #1190 would still be useful for the "lsqr" solver (which is very fast on sparse data).
Indeed, the test history looks weird. I checked that in the last build the sparse + sample_weight test is actually running. However, it raises a warning, and I think using the model with the default parameters should not raise a warning. If the input is sparse, sample_weight is given, and the solver is left to "auto", the model should automatically select the cholesky solver without warning, and this should be documented in the docstring.
I'd suggest renaming …
+1
I think you should rebase again.
Did another rebase, which led to an interesting telescopic sum of conflicting diffs :) But I don't know if it now looks as it is supposed to.
Pretty much, but before merging I do prefer a further squash (though that's just my opinion). Ideally there should at least be no commits that leave the code base in a broken state, i.e. where the tests don't pass, since it should be possible to run …
doc/whats_new.rst
Please restrict lines to 80 columns; my editor doesn't like longer lines. Also, the stuff about #1190 is an implementation detail, and this log is primarily for users to see if they need to upgrade anything.
Apart from #3034 (comment), I think @larsmans would like most commits of this PR to be squashed; then we should be good to go for merging.
Simplified … I didn't touch it in … I can tackle it in this PR if it adds sufficiently to consistency.
Ok, it ended up being easy to make this work for …
@eickenberg do you need help with the squashing? Here is some doc: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html
… commit over around 20 commits to avoid failing tests
Thanks for that. I think it worked. @fabianp encouraged me through the sticky parts :)
Looks good. For some reason Travis did not pick up your new commit, so I will run the tests on my box; if they pass, I will merge.
Yes, something is weird: I am not getting update emails anymore either (I saw your comment 20 hours later), so probably some hooks aren't running?
Hum, the cause is a conflict in the whats_new file, as usual. I am on it.
Ok, thanks!
I forgot to say: thanks @eickenberg for the non-trivial fix!
Thanks for seeing it through with me :) @ogrisel @mblondel @larsmans @GaelVaroquaux @agramfort @fabianp
You should add your name to the header next time.
Indeed, please @eickenberg feel free to open a new PR for this and I will merge it right away.
Haha OK I'll do that, because I love making pull requests! :) |
Hello everybody,
Here is an enhancement to the ridge_regression function: up until now, if sample_weight is given, the branching logic imposes a treatment in sample space, which can lead to MemoryErrors if the number of samples is large, even though the number of features is small. See http://stackoverflow.com/questions/22766978/sklearn-ridge-and-sample-weight-gives-memory-error for an external report of this.
All I added was the possibility to treat sample weights in _solve_dense_cholesky and not only in _solve_dense_cholesky_kernel.
I provide a batch of tests checking that the results of these two functions are the same given sample weights.
Further, I provide a test to ensure that the correct branch is chosen in ridge_regression (i.e. use _solve_dense_cholesky if n_samples > n_features, even if there are sample weights). I implemented it so that it causes a MemoryError if the wrong branch is taken; following TDD, I made sure it fails first, and now it passes. Please comment on this practice: evidently, in the current, corrected case the MemoryError will not be raised, but if the branching ever regresses and the machine running the tests has 2 TB+ of memory available, this test will no longer raise a MemoryError. If there are better ways of checking the right branching, please let me know.
PS: There are some inconsistencies in function signature default values: _solve_dense_cholesky_kernel has sample_weight=None as default, which will, if called directly, be interpreted as though sample weights are actually present. This has never caused an error, because the public function ridge_regression always calls it with sample_weight=1. when no sample weights are given. Should I correct this in the present PR or elsewhere?