optimisations to Ridge Regression GCV #650
Conversation
Optimisations follow from noting that we do not need to explicitly form the matrix for the regularised inverse; we just need to compute its action upon the response vector y. Similarly, we don't need to compute the whole explicit matrix in order to evaluate its diagonal; we can just evaluate the diagonal entries directly. This leads to a large performance increase when GCV is applied to training datasets with larger numbers of points. For example, I observe a ~11x speedup on the UCI Communities and Crime dataset even for a modest training set of only 398 examples (20% of the data).
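A minimal sketch of the idea, using the Q, v, alpha eigendecomposition notation that appears later in this thread (illustrative only, not the literal patch):

```python
import numpy as np

# Eigendecomposition of the linear kernel K = X X^T: K = Q diag(v) Q^T,
# so the regularised inverse is G = (K + alpha * I)^{-1} = Q diag(1/(v+alpha)) Q^T.

def apply_G_and_diag(Q, v, alpha, y):
    """Compute dot(G, y) and diag(G) without ever forming G (y assumed 1-d)."""
    w = 1.0 / (v + alpha)                  # eigenvalues of G
    G_y = np.dot(Q, w * np.dot(Q.T, y))    # action of G on y: two O(n^2) matvecs
    G_diag = np.dot(Q ** 2, w)             # diag(G)[i] = sum_j w[j] * Q[i, j]**2
    return G_y, G_diag
```

Forming G explicitly would instead add an O(n_samples^3) matrix product for every candidate alpha, which is presumably what dominates on larger training sets.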
|
thanks for sharing! I suspect there could be a trade-off between the extreme cases n_samples << n_features and n_samples >> n_features. Could you bench? See the Lasso bench on ml-benchmarks for an example. |
sklearn/linear_model/ridge.py
Outdated
Isn't that simply the squared 2-norm of Q[i, :]? If so, I would think that it can be computed without the for loop.
Each iteration computes a weighted squared 2-norm of Q[i, :], with the dimensions weighted by the vector v_prime, i.e.
diag[i] := \sum_{j=1}^{n} v_prime_j Q_{i,j}^2
It could be rewritten as an unweighted squared 2-norm, since the elements of v_prime are non-negative:
np.dot((v_prime ** 0.5) * Q[i, :], (v_prime ** 0.5) * Q[i, :])
How would you compute it without the for loop?
Correct me if I am wrong, but I think that:
diag = (v_prime * Q**2).sum(axis=-1)
should do the trick.
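For the record, a quick self-contained check (with stand-in arrays, not the actual Q and v_prime from ridge.py) that the vectorized expression matches the weighted per-row loop:

```python
import numpy as np

rng = np.random.RandomState(0)
Q = rng.randn(6, 6)       # stand-in for the eigenvector matrix
v_prime = rng.rand(6)     # stand-in non-negative weights

# per-row loop: weighted squared 2-norm of each row of Q
diag_loop = np.empty(6)
for i in range(6):
    diag_loop[i] = np.dot(v_prime * Q[i, :], Q[i, :])

# vectorized form suggested above
diag_vec = (v_prime * Q ** 2).sum(axis=-1)

assert np.allclose(diag_loop, diag_vec)
```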
|
I agree with @agramfort that there is probably an inherent trade-off, in which case we should automatically choose the best option if possible, and if not, provide an option to the user to make the choice along with a sane default. |
|
Thanks @agramfort for directing me toward ml-benchmarks. Here are benchmark results comparing the base sklearn version of RidgeCV to the patched version from this pull request:
Both versions of the code seem quick for the n_features >> n_samples case. For the case of large n_samples, both versions of code seem slow, although the patched version is less slow. I suspect a problem with the large n_samples case is the cost of computing the eigen decomposition of K, which has shape (n_samples, n_samples). |
|
For the n_samples > n_features case, see my comment in issue #624. But this doesn't need to be addressed in this PR. |
|
RidgeCV didn't support the n_samples > n_features case efficiently to begin with so I think the improvements brought by this PR are good to take. To handle the n_samples > n_features case efficiently, we can use the SVD trick mentioned by @npinto or we might be able to use the trick I mentioned in #624. But again, this is orthogonal to this PR. |
|
+1 for merge then. |
GCV sped up by using the SVD of X in the special case n_samples > n_features, instead of the eigen decomposition of the linear kernel matrix XX^T.
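For context, the identity that presumably underlies this, sketched under the assumption that X = U diag(s) V^T is the thin SVD and n_samples > n_features: K = X X^T has eigenvalues s^2 plus (n_samples - n_features) zeros, so G = (K + alpha I)^{-1} = U diag(1/(s^2 + alpha) - 1/alpha) U^T + I/alpha, which only needs the (n_samples, n_features) matrix U. A rough Python sketch, not the literal code in the PR:

```python
import numpy as np

def apply_G_and_diag_svd(U, s, alpha, y):
    """dot(G, y) and diag(G) from the thin SVD of X, for n_samples > n_features.

    G = (X X^T + alpha I)^{-1} = U diag(1/(s**2 + alpha) - 1/alpha) U^T + I/alpha
    """
    w = 1.0 / (s ** 2 + alpha) - 1.0 / alpha
    G_y = np.dot(U, w * np.dot(U.T, y)) + y / alpha      # y assumed 1-d
    G_diag = np.dot(U ** 2, w) + 1.0 / alpha
    return G_y, G_diag
```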
|
(Thanks @mblondel -- I don't understand how the other trick from Chapelle's paper that you mention in #624 might apply to generalised cross validation, although I can see how it would apply if you are solving a ridge regression problem with a fixed value of the regularisation parameter.) I got motivated and implemented the SVD trick for the n_samples > n_features case. Here are some more benchmarks:
"sklearn" is the base version of RidgeCV. "patched-eigen" is the version in this pull request. "patched-svd" is a version using SVD trick all the time (note how it runs slower on arcene, where n_features >> n_samples). "patched-auto" uses SVD if n_samples > n_features, otherwise eigen. The latter two versions are not in this pull request. Should I add the commits for the SVD trick to this pull request, or open a new one to keep things orthogonal? |
|
Thanks @fcostin for being so reactive, are you re-running the benchmarks with the SVD trick? |
|
yes, the benchmarks for the "patched-svd" and "patched-auto" in the table above include the SVD trick ;) |
|
I like the results for the patched auto :). Do they correspond to what is in this pull request? Actually, you said that they weren't. I think that it would be great if you added those commits to this pull request.
|
The PR does seem to include the code for "auto" mode. Thanks for the SVD trick implementation. BTW, don't forget to add your name to the credits of the file. I'll have a look at Chapelle's paper later. |
|
Thanks @fcostin, sorry for my seemingly dumb last message. I must have posted it while you were posting the new benchmark results (I might not have refreshed the PR page). It looks like a 👍 for merging. As @mblondel said, please add yourself to the credits of the file and also add a new entry in the 'doc/whats_new.rst' file. |
|
Whoops - @mblondel is correct - github automagically added the two new commits for the SVD stuff & auto mode to this pull request. I didn't expect that. I'll append some lines to the credits and whats_new and commit that too. |
The line above should be
gcv_mode: {None, 'auto', 'svd', 'eigen'}
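For reference, one plausible way the mode resolution could look; the helper name `_resolve_gcv_mode` and the exact validation are hypothetical, not taken from the PR:

```python
def _resolve_gcv_mode(gcv_mode, n_samples, n_features):
    # hypothetical helper illustrating the 'auto' rule discussed in this thread
    if gcv_mode is None or gcv_mode == 'auto':
        return 'svd' if n_samples > n_features else 'eigen'
    if gcv_mode not in ('svd', 'eigen'):
        raise ValueError("unknown gcv_mode %r" % gcv_mode)
    return gcv_mode
```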
|
Apart from the few remarks, I am 👍 for merge. This is great work! |
|
+1 for merge too. |
|
A test is failing on my box. The failing line is line 145 of _test_ridge_loo. |
|
@mblondel -- good find, I can replicate that on my machine. I ran the tests before via but I guess that didn't run the ridge tests for whatever reason. When I run nosetests from inside sklearn/linear_model I see the error. I'll have a look at it. EDIT: fixed it, I'll commit it once I work through Gael's remarks and the other issue I noted below |
sklearn/linear_model/ridge.py
Outdated
Here's a mistake:
w * QT_y
fails when y is 2d. I imagine something like
w[(slice(None), ) + (numpy.newaxis, ) * (len(y.shape) - 1)] * QT_y
could force the broadcasting to work properly. Is there a better way of doing this?
You could use an if-else statement instead because the above one-liner looks hard to read.
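For example, an if/else version of the fix might look like this (a sketch assuming QT_y is either 1-d or 2-d; not necessarily the committed code):

```python
import numpy as np

def scale_by_w(w, QT_y):
    # w has shape (n_samples,); QT_y is (n_samples,) for a single target
    # or (n_samples, n_targets) for multi-output y.
    if QT_y.ndim == 1:
        return w * QT_y
    # add a trailing axis so w broadcasts across the target columns
    return w[:, np.newaxis] * QT_y
```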
|
OK, all the ridge tests are passing now and I've incorporated the previous comments. "svd" mode breaks when sample weights are used, so if that case is detected a warning is printed and the "eig" mode is forced instead. I think there is probably a way to get the SVD to work properly with sample weights, but I will have to stare at http://en.wikipedia.org/wiki/Woodbury_matrix_identity and look over the LOO-CV references again. EDIT: note that the benchmarks remain the same as the most recent ones I posted -- there is still a good speedup on the madelon dataset. |
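The fallback described above presumably amounts to a guard along these lines; the flag name `with_sample_weight` and the warning text are placeholders, not the actual code:

```python
import warnings

if with_sample_weight and gcv_mode == 'svd':
    warnings.warn("sample weights are not supported in 'svd' gcv_mode; "
                  "falling back to the eigendecomposition of the kernel matrix")
    gcv_mode = 'eigen'
```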
|
In Ridge, I also force the use of kernel ridge when sample_weight is enabled. |
|
Alright, I'm merging with minor fixes. |
|
Merged. |
|
Awesome! Great work. |
Also I wonder if we could further speed things up with sklearn's fast_svd instead.
Yes I agree that might help even more for particularly low rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low rank SVD algorithm - very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?
Yes: https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_plot_svd.py
If you want to add a benchmark for arpack on sparse data, please feel free to do so as well :)
> Yes I agree that might help even more for particularly low rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low rank SVD algorithm - very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?
I'm curious whether this can work or not. I would tend to think it requires an exact SVD. BTW, keep in mind that using fast_svd would require an extra argument (e.g. named "n_components").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Adding an extra hyperparameter is not user friendly. But that might make it possible to scale to datasets with very large n_samples.
Perhaps this can work well in the special case where n_samples >> n_features >> rank. In this case the low rank approximation obtained via fast_svd could be exact, by requesting n_components = rank. If the approximation obtained by fast_svd is not exact then we miss the smallest singular values which make the largest contributions when we apply the inverse matrix.
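A quick numeric check of that last point, with an arbitrarily chosen alpha:

```python
alpha = 0.1
for s in (10.0, 1.0, 0.1, 0.01):
    print(s, 1.0 / (s ** 2 + alpha))
# 10.0  -> ~0.01   (large singular value: negligible contribution to G)
# 1.0   -> ~0.91
# 0.1   -> ~9.09
# 0.01  -> ~9.99   (small singular value: dominates the regularised inverse)
```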
Hi all,
I've made a few optimisations to the ridge regression generalised cross validation (GCV) code:
These changes eliminate unnecessary construction of the explicit matrix G = dot(Q, dot(diag(1/(v+alpha)), Q.T)) when we just want dot(G, y) and diag(G) inside the _error method. Similarly, inside the _values method we can do the same thing to avoid explicitly constructing dot(K, G) = dot(Q, dot(diag(v/(v+alpha)), Q.T)).
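A rough sketch of the same pattern for the _values case, using the notation of the paragraph above (illustrative, not the literal diff):

```python
import numpy as np

def apply_KG_and_diag(Q, v, alpha, y):
    """Action and diagonal of dot(K, G) = Q diag(v/(v+alpha)) Q^T, computed
    without building the (n_samples, n_samples) product (y assumed 1-d)."""
    w = v / (v + alpha)
    KG_y = np.dot(Q, w * np.dot(Q.T, y))   # dot(dot(K, G), y)
    KG_diag = np.dot(Q ** 2, w)            # diag(dot(K, G))
    return KG_y, KG_diag
```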
I've tested the speedup on the "communities and crime" dataset (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), and I get a ~ 11x speedup when using a small training set consisting of the first 398 rows. I expect the speedup is even more convincing for larger training sets.
The scripts I'm using to benchmark the code are here: https://gist.github.com/1893136
I should probably add a new test for this too.