optimisations to Ridge Regression GCV #650
Conversation
Optimisations follow from noting that we do not need to explicitly form the matrix for the regularised inverse; we just need to compute its action upon the response vector y. Similarly, we don't need to compute the whole explicit matrix in order to evaluate its diagonal; we can just evaluate the diagonal entries directly. This leads to a large performance increase when GCV is applied to training datasets with larger numbers of points. For example, I observe a ~11x speedup on the UCI Communities and Crime dataset even for a modest training set of only 398 examples (20% of the data).
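A minimal sketch of the idea, using the Q, v, alpha eigendecomposition notation that appears later in this thread (illustrative only, not the literal patch):

```python
import numpy as np

# Eigendecomposition of the linear kernel K = X X^T: K = Q diag(v) Q^T,
# so the regularised inverse is G = (K + alpha * I)^{-1} = Q diag(1/(v+alpha)) Q^T.

def apply_G_and_diag(Q, v, alpha, y):
    """Compute dot(G, y) and diag(G) without ever forming G (y assumed 1-d)."""
    w = 1.0 / (v + alpha)                  # eigenvalues of G
    G_y = np.dot(Q, w * np.dot(Q.T, y))    # action of G on y: two O(n^2) matvecs
    G_diag = np.dot(Q ** 2, w)             # diag(G)[i] = sum_j w[j] * Q[i, j]**2
    return G_y, G_diag
```

Forming G explicitly would instead add an O(n_samples^3) matrix product for every candidate alpha, which is presumably what dominates on larger training sets.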
|
thanks for sharing! I suspect there could be a trade-off between the extreme cases n_samples << n_features and n_samples >> n_features. Could you bench? See the Lasso bench on ml-benchmarks for an example. |
sklearn/linear_model/ridge.py
Outdated
Isn't that simply the squared 2-norm of Q[i, :]? If so, I would think that it can be computed without the for loop.
Each iteration computes a weighted squared 2-norm of Q[i, :], with the dimensions weighted by the vector v_prime, i.e.
diag[i] := \sum_{j=1}^{n} v_prime_j Q_{i,j}^2
It could be rewritten as an unweighted squared 2-norm, since the elements of v_prime are non-negative:
np.dot((v_prime ** 0.5) * Q[i, :], (v_prime ** 0.5) * Q[i, :])
How would you compute it without the for loop?
Correct me if I am wrong, but I think that:
diag = (v_prime * Q**2).sum(axis=-1)
should do the trick.
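For the record, a quick self-contained check (with stand-in arrays, not the actual Q and v_prime from ridge.py) that the vectorized expression matches the weighted per-row loop:

```python
import numpy as np

rng = np.random.RandomState(0)
Q = rng.randn(6, 6)       # stand-in for the eigenvector matrix
v_prime = rng.rand(6)     # stand-in non-negative weights

# per-row loop: weighted squared 2-norm of each row of Q
diag_loop = np.empty(6)
for i in range(6):
    diag_loop[i] = np.dot(v_prime * Q[i, :], Q[i, :])

# vectorized form suggested above
diag_vec = (v_prime * Q ** 2).sum(axis=-1)

assert np.allclose(diag_loop, diag_vec)
```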
|
I agree with @agramfort that there is probably an inherent trade-off, in which case we should automatically choose the best option if possible, and if not, provide an option to the user to make the choice along with a sane default. |
|
Thanks @agramfort for directing me toward ml-benchmarks. Here are benchmark results comparing the base sklearn version of RidgeCV to the patched version from this pull request:
Both versions of the code seem quick for the n_features >> n_samples case. For the case of large n_samples, both versions of code seem slow, although the patched version is less slow. I suspect a problem with the large n_samples case is the cost of computing the eigen decomposition of K, which has shape (n_samples, n_samples). |
|
For the n_samples > n_features case, see my comment in issue #624. But this doesn't need to be addressed in this PR. |
|
RidgeCV didn't support the n_samples > n_features case efficiently to begin with so I think the improvements brought by this PR are good to take. To handle the n_samples > n_features case efficiently, we can use the SVD trick mentioned by @npinto or we might be able to use the trick I mentioned in #624. But again, this is orthogonal to this PR. |
|
+1 for merge then. |
GCV sped up by using the SVD of X in the special case n_samples > n_features, instead of the eigen decomposition of the linear kernel matrix XX^T.
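For context, the identity that presumably underlies this, sketched under the assumption that X = U diag(s) V^T is the thin SVD and n_samples > n_features: K = X X^T has eigenvalues s^2 plus (n_samples - n_features) zeros, so G = (K + alpha I)^{-1} = U diag(1/(s^2 + alpha) - 1/alpha) U^T + I/alpha, which only needs the (n_samples, n_features) matrix U. A rough Python sketch, not the literal code in the PR:

```python
import numpy as np

def apply_G_and_diag_svd(U, s, alpha, y):
    """dot(G, y) and diag(G) from the thin SVD of X, for n_samples > n_features.

    G = (X X^T + alpha I)^{-1} = U diag(1/(s**2 + alpha) - 1/alpha) U^T + I/alpha
    """
    w = 1.0 / (s ** 2 + alpha) - 1.0 / alpha
    G_y = np.dot(U, w * np.dot(U.T, y)) + y / alpha      # y assumed 1-d
    G_diag = np.dot(U ** 2, w) + 1.0 / alpha
    return G_y, G_diag
```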
|
(Thanks @mblondel -- I don't understand how the other trick from Chapelle's paper that you mention in #624 might apply to generalised cross validation, although I can see how it would apply if you are solving a ridge regression problem with a fixed value of the regularisation parameter.) I got motivated and implemented the SVD trick for the n_samples > n_features case. Here are some more benchmarks:
"sklearn" is the base version of RidgeCV. "patched-eigen" is the version in this pull request. "patched-svd" is a version using SVD trick all the time (note how it runs slower on arcene, where n_features >> n_samples). "patched-auto" uses SVD if n_samples > n_features, otherwise eigen. The latter two versions are not in this pull request. Should I add the commits for the SVD trick to this pull request, or open a new one to keep things orthogonal? |
|
Thanks @fcostin for being so reactive, are you re-running the benchmarks with the SVD trick? |
|
yes, the benchmarks for the "patched-svd" and "patched-auto" in the table above include the SVD trick ;) |
|
I like the results for the patched auto :). Do they correspond to what is in this pull request? Actually, you said that they weren't. I think that it would be great if you added those commits to this pull request.
|
The PR does seem to include the code for "auto" mode. Thanks for the SVD trick implementation. BTW, don't forget to add your name to the credits of the file. I'll have a look at Chapelle's paper later. |
|
Thanks @fcostin, sorry for my seemingly dumb last message. I must have posted it while you were posting the new benchmark results (I might not have refreshed the PR page). It looks like a 👍 for merging. As @mblondel said, please add yourself to the credits of the file and also add a new entry in the 'doc/whats_new.rst' file. |
|
Whoops - @mblondel is correct - github automagically added the two new commits for the SVD stuff & auto mode to this pull request. I didn't expect that. I'll append some lines to the credits and whats_new and commit that too. |
The line above should be
gcv_mode: {None, 'auto', 'svd', 'eigen'}
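For reference, one plausible way the mode resolution could look; the helper name `_resolve_gcv_mode` and the exact validation are hypothetical, not taken from the PR:

```python
def _resolve_gcv_mode(gcv_mode, n_samples, n_features):
    # hypothetical helper illustrating the 'auto' rule discussed in this thread
    if gcv_mode is None or gcv_mode == 'auto':
        return 'svd' if n_samples > n_features else 'eigen'
    if gcv_mode not in ('svd', 'eigen'):
        raise ValueError("unknown gcv_mode %r" % gcv_mode)
    return gcv_mode
```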
|
Apart from the few remarks, I am 👍 for merge. This is great work! |
|
+1 for merge too. |
|
A test is failing on my box. The failing line is line 145 of _test_ridge_loo. |
|
@mblondel -- good find, I can replicate that on my machine. I ran the tests before via but I guess that didn't run the ridge tests for whatever reason. When I run nosetests from inside sklearn/linear_model I see the error. I'll have a look at it. EDIT: fixed it, I'll commit it once I work through Gael's remarks and the other issue I noted below |
sklearn/linear_model/ridge.py
Outdated
Here's a mistake:
w * QT_y
fails when y is 2d. I imagine something like
w[(slice(None), ) + (numpy.newaxis, ) * (len(y.shape) - 1)] * QT_y
could force the broadcasting to work properly. Is there a better way of doing this?
You could use an if-else statement instead because the above one-liner looks hard to read.
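For example, an if/else version of the fix might look like this (a sketch assuming QT_y is either 1-d or 2-d; not necessarily the committed code):

```python
import numpy as np

def scale_by_w(w, QT_y):
    # w has shape (n_samples,); QT_y is (n_samples,) for a single target
    # or (n_samples, n_targets) for multi-output y.
    if QT_y.ndim == 1:
        return w * QT_y
    # add a trailing axis so w broadcasts across the target columns
    return w[:, np.newaxis] * QT_y
```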
|
OK, all the ridge tests are passing now and I've incorporated the previous comments. "svd" mode breaks when sample weights are used, so if that case is detected a warning is printed and the "eig" mode is forced instead. I think there is probably a way to get the SVD to work properly with sample weights, but I will have to stare at http://en.wikipedia.org/wiki/Woodbury_matrix_identity and look over the LOO-CV references again. EDIT: note that the benchmarks remain the same as the most recent ones I posted -- there is still a good speedup on the madelon dataset. |
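The fallback described above presumably amounts to a guard along these lines; the flag name `with_sample_weight` and the warning text are placeholders, not the actual code:

```python
import warnings

if with_sample_weight and gcv_mode == 'svd':
    warnings.warn("sample weights are not supported in 'svd' gcv_mode; "
                  "falling back to the eigendecomposition of the kernel matrix")
    gcv_mode = 'eigen'
```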
|
In Ridge, I also force the use of kernel ridge when sample_weight is enabled. |
|
Alright, I'm merging with minor fixes. |
|
Merged. |
|
Awesome! Great work. |
Also I wonder if we could further speed things up with sklearn's fast_svd instead.
Yes I agree that might help even more for particularly low rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low rank SVD algorithm - very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?
Yes: https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_plot_svd.py
If you want to add a benchmark for arpack on sparse data, please feel free to do so as well :)
> Yes I agree that might help even more for particularly low rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low rank SVD algorithm - very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?
I'm curious whether this can work or not. I would tend to think it requires an exact SVD. BTW, keep in mind that using fast_svd would require an extra argument (e.g. named "n_components").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Adding an extra hyperparameter is not user friendly. But that might make it possible to scale to datasets with very large n_samples.
Perhaps this can work well in the special case where n_samples >> n_features >> rank. In this case the low rank approximation obtained via fast_svd could be exact, by requesting n_components = rank. If the approximation obtained by fast_svd is not exact then we miss the smallest singular values which make the largest contributions when we apply the inverse matrix.
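A quick numeric check of that last point, with an arbitrarily chosen alpha:

```python
alpha = 0.1
for s in (10.0, 1.0, 0.1, 0.01):
    print(s, 1.0 / (s ** 2 + alpha))
# 10.0  -> ~0.01   (large singular value: negligible contribution to G)
# 1.0   -> ~0.91
# 0.1   -> ~9.09
# 0.01  -> ~9.99   (small singular value: dominates the regularised inverse)
```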
Hi all,
I've made a few optimisations to the ridge regression generalised cross validation (GCV) code:
These changes eliminate unnecessary construction of the explicit matrix G = dot(Q, dot(diag(1/(v+alpha)), Q.T)) when we just want dot(G, y) and diag(G) inside the _error method. Similarly, inside the _values method we can do the same thing to avoid explicitly constructing dot(K, G) = dot(Q, dot(diag(v/(v+alpha)), Q.T)).
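A rough sketch of the same pattern for the _values case, using the notation of the paragraph above (illustrative, not the literal diff):

```python
import numpy as np

def apply_KG_and_diag(Q, v, alpha, y):
    """Action and diagonal of dot(K, G) = Q diag(v/(v+alpha)) Q^T, computed
    without building the (n_samples, n_samples) product (y assumed 1-d)."""
    w = v / (v + alpha)
    KG_y = np.dot(Q, w * np.dot(Q.T, y))   # dot(dot(K, G), y)
    KG_diag = np.dot(Q ** 2, w)            # diag(dot(K, G))
    return KG_y, KG_diag
```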
I've tested the speedup on the "communities and crime" dataset (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), and I get a ~ 11x speedup when using a small training set consisting of the first 398 rows. I expect the speedup is even more convincing for larger training sets.
The scripts I'm using to benchmark the code are here: https://gist.github.com/1893136
I should probably add a new test for this too.