
Conversation

@fcostin (Contributor) commented Feb 23, 2012

Hi all,

I've made a few optimisations to the ridge regression generalised cross validation (GCV) code:

These changes eliminate unnecessary construction of the explicit matrix G = dot(Q, dot(diag(1/(v + alpha)), Q.T)) when we only need dot(G, y) and diag(G) inside the _error method. Similarly, inside the _values method, we can apply the same idea to avoid explicitly constructing dot(K, G) = dot(Q, dot(diag(v/(v + alpha)), Q.T)).
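Roughly, the idea is the following (a minimal sketch with a made-up function name, not the PR's exact code), assuming K = dot(X, X.T) has the eigendecomposition dot(Q, dot(diag(v), Q.T)) and y is a 1-d response vector:

```python
import numpy as np

def g_action_and_diag(Q, v, alpha, y):
    # G = dot(Q, dot(diag(1.0 / (v + alpha)), Q.T)); we never form it.
    w = 1.0 / (v + alpha)                # per-eigenvalue weights
    G_y = np.dot(Q, w * np.dot(Q.T, y))  # dot(G, y) via two O(n^2) products
    G_diag = np.dot(Q ** 2, w)           # diag(G)[i] = sum_j w[j] * Q[i, j] ** 2
    return G_y, G_diag
```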

I've tested the speedup on the "communities and crime" dataset (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), and I get a ~ 11x speedup when using a small training set consisting of the first 398 rows. I expect the speedup is even more convincing for larger training sets.

The scripts I'm using to benchmark the code are here: https://gist.github.com/1893136

I should probably add a new test for this too.

The optimisations follow from noting that we do not need to explicitly form the matrix for the regularised inverse; we just need to compute its action upon the response vector y. Similarly, we don't need to compute the whole explicit matrix in order to evaluate its diagonal; we can evaluate the diagonal entries directly.

This leads to a large performance increase when GCV is applied to training datasets with larger numbers of points, e.g. I observe a ~11x speedup on the UCI Communities and Crime dataset even for a modest training set of only 398 examples (20% of the data).
@agramfort (Member) commented:

Thanks for sharing! I suspect there could be a trade-off between the extreme cases n_samples << n_features and n_samples >> n_features. Could you bench? See the Lasso bench on ml-benchmarks, for example.

Member:

Isn't that simply the squared 2-norm of Q[i, :]? If so, I would think that it can be computed without the for loop.

@fcostin (Author):

Each iteration computes a weighted squared 2-norm of Q[i, :], with the dimensions weighted by the vector v_prime, e.g.

diag[i] := \sum_{j=1}^n v_prime_j Q_{i,j}^2

It could be rewritten as an unweighted squared 2-norm, since the elements of v_prime are non-negative:

np.dot((v_prime ** 0.5) * Q[i, :], (v_prime ** 0.5) * Q[i, :])

How would you compute it without the for loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, but I think that:

diag = (v_prime * Q**2).sum(axis=-1)

should do the trick.
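A quick sanity check on made-up data, just to illustrate that the loop and the vectorized expression agree:

```python
import numpy as np

rng = np.random.RandomState(0)
Q = rng.randn(6, 4)
v_prime = rng.rand(4)

# for-loop version: weighted squared 2-norm of each row of Q
diag_loop = np.array([np.dot(v_prime * Q[i, :], Q[i, :])
                      for i in range(Q.shape[0])])

# vectorized version suggested above
diag_vec = (v_prime * Q ** 2).sum(axis=-1)

print(np.allclose(diag_loop, diag_vec))  # True
```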

@GaelVaroquaux (Member) commented:

I agree with @agramfort that there is probably an inherent trade-off, in which case we should automatically choose the best option if possible; if not, we should give the user an option to make the choice, with a sane default.

@fcostin (Contributor) commented Feb 26, 2012

Thanks @agramfort for directing me toward ml-benchmarks. Here are benchmark results comparing the base sklearn version of RidgeCV to the patched version from this pull request:

| ridge version | dataset | n_samples | n_features | time mean (s) | time std dev (s) | MSE |
| --- | --- | --- | --- | --- | --- | --- |
| sklearn | arcene | 100 | 10000 | 0.21 | 0.00 | 46.3827841963 |
| patched | arcene | 100 | 10000 | 0.20 | 0.00 | 46.3827597987 |
| sklearn | madelon | 2000 | 500 | 895.10 | 63.83 | 691.602867089 |
| patched | madelon | 2000 | 500 | 18.27 | 1.40 | 691.602867089 |

Both versions of the code seem quick in the n_features >> n_samples case. For large n_samples, both versions seem slow, although the patched version is less slow. I suspect the bottleneck in the large n_samples case is the cost of computing the eigendecomposition of K, which has shape (n_samples, n_samples).

@mblondel (Member) commented:

For the n_samples > n_features case, see my comment in issue #624. But this doesn't need to be addressed in this PR.

@mblondel (Member) commented:

RidgeCV didn't support the n_samples > n_features case efficiently to begin with, so I think the improvements brought by this PR are good to take.

To handle the n_samples > n_features case efficiently, we can use the SVD trick mentioned by @npinto or we might be able to use the trick I mentioned in #624. But again, this is orthogonal to this PR.

@agramfort (Member) commented:

+1 for merge then.

GCV sped up by using the SVD of X, in the special case where n_samples > n_features, instead of the eigendecomposition of the linear kernel matrix XX^T.
@fcostin (Contributor) commented Feb 26, 2012

(Thanks @mblondel -- I don't understand how the other trick from Chapelle's paper that you mention in #624 might apply to generalised cross validation, although I can see how it would apply if you are solving a ridge regression problem with a fixed value of the regularisation parameter.)

I got motivated and implemented the SVD trick for the n_samples > n_features case.
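Roughly, the identity it relies on is this (a sketch with a made-up function name and a 1-d y for simplicity, not the exact code in the commits): with the thin SVD X = U diag(s) V.T, the kernel K = XX^T = U diag(s^2) U^T has at most n_features nonzero eigenvalues, and (K + alpha I)^-1 = U diag(1/(s^2 + alpha) - 1/alpha) U^T + I/alpha.

```python
import numpy as np

def g_action_and_diag_svd(X, y, alpha):
    # G = inverse of (dot(X, X.T) + alpha * I), never formed explicitly.
    U, s, _ = np.linalg.svd(X, full_matrices=False)  # U: (n_samples, n_features)
    # w corrects G on the range of X; off that range G acts as I / alpha
    w = 1.0 / (s ** 2 + alpha) - 1.0 / alpha
    G_y = np.dot(U, w * np.dot(U.T, y)) + y / alpha
    G_diag = np.dot(U ** 2, w) + 1.0 / alpha
    return G_y, G_diag
```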

Here are some more benchmarks:

| ridge version | dataset | n_samples | n_features | time mean (s) | time std dev (s) | MSE |
| --- | --- | --- | --- | --- | --- | --- |
| sklearn | arcene | 100 | 10000 | 0.21 | 0.00 | 46.3827841963 |
| patched-auto | arcene | 100 | 10000 | 0.21 | 0.00 | 46.3827597987 |
| patched-eigen | arcene | 100 | 10000 | 0.24 | 0.03 | 46.3827597987 |
| patched-svd | arcene | 100 | 10000 | 1.10 | 0.05 | 46.3803702145 |
| sklearn | madelon | 2000 | 500 | 895.10 | 63.83 | 691.602867089 |
| patched-auto | madelon | 2000 | 500 | 1.89 | 0.15 | 691.602867089 |
| patched-eigen | madelon | 2000 | 500 | 18.68 | 0.51 | 691.602867089 |
| patched-svd | madelon | 2000 | 500 | 1.95 | 0.22 | 691.602867089 |

"sklearn" is the base version of RidgeCV. "patched-eigen" is the version in this pull request. "patched-svd" is a version using SVD trick all the time (note how it runs slower on arcene, where n_features >> n_samples). "patched-auto" uses SVD if n_samples > n_features, otherwise eigen. The latter two versions are not in this pull request.

Should I add the commits for the SVD trick to this pull request, or open a new one to keep things orthogonal?

@ogrisel (Member) commented Feb 26, 2012

Thanks @fcostin for being so reactive; are you re-running the benchmarks with the SVD trick?

@fcostin (Contributor) commented Feb 26, 2012

Yes, the benchmarks for "patched-svd" and "patched-auto" in the table above include the SVD trick ;)

@GaelVaroquaux (Member) commented:

I like the results for the patched auto :). Do they correspond to what is in the pull request?

@GaelVaroquaux (Member) commented:

> I like the results for the patched auto :). Do they correspond to what is in the pull request?

Actually, you said that they weren't. I think that it would be great if you added them, and then we can review and merge everything at once. I am eager to merge such an improvement!

@mblondel (Member) commented:

The PR does seem to include the code for "auto" mode.

Thanks for the SVD trick implementation. BTW, don't forget to add your name to the credits of the file.

I'll have a look at Chapelle's paper later.

@ogrisel (Member) commented Feb 26, 2012

Thanks @fcostin, sorry for my seemingly dumb last message. I must have posted it while you were posting the new benchmark results (I might not have refreshed the PR page).

It looks like a 👍 for merging. As @mblondel said, please add yourself to the credits of the file, and also add a new entry in the 'doc/whats_new.rst' file.

@fcostin (Contributor) commented Feb 26, 2012

Whoops -- @mblondel is correct: GitHub automagically added the two new commits for the SVD stuff & auto mode to this pull request. I didn't expect that. I'll append some lines to the credits and whats_new and commit those too.

Member:

The line above should be

gcv_mode: {None, 'auto', 'svd', 'eigen'}

@GaelVaroquaux (Member) commented:

Apart from the few remarks, I am 👍 for merge. This is great work!

@agramfort (Member) commented:

+1 for merge too.

@mblondel (Member) commented:

A test is failing on my box:

```
======================================================================
FAIL: sklearn.linear_model.tests.test_ridge.test_dense_sparse
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/home/mathieu/Desktop/projects/scikit-learn/sklearn/linear_model/tests/test_ridge.py", line 263, in test_dense_sparse
    ret_dense = test_func(DENSE_FILTER)
  File "/home/mathieu/Desktop/projects/scikit-learn/sklearn/linear_model/tests/test_ridge.py", line 145, in _test_ridge_loo
    assert_almost_equal(values, values2)
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 262, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 537, in assert_array_almost_equal
    header='Arrays are not almost equal')
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 395, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal

(mismatch 100.0%)
 x: array([ 364.50424893,  463.80230164,  216.93458374,  137.6709951 ,
        286.08220951,  166.62880025,  314.37446323,  212.73833979,
        188.69268162,  159.13806598,  146.14023593,  208.27922575,...
 y: array([ 27.49575107,  52.19769836, -36.93458374, -39.6709951 ,
         5.91779049,   1.37119975, -26.37446323, -24.73833979,
       -40.69268162, -15.13806598, -42.14023593, -34.27922575,...
```

The failing line is line 145, in _test_ridge_loo.

@fcostin (Contributor) commented Feb 28, 2012

@mblondel -- good find; I can replicate that on my machine. I ran the tests before via

python -c "import sklearn; sklearn.test()"

but I guess that didn't run the ridge tests for whatever reason. When I run nosetests from inside sklearn/linear_model I see the error. I'll have a look at it.

EDIT: fixed it; I'll commit once I work through Gael's remarks and the other issue I noted below.

@fcostin (Author):

Here's a mistake:

w * QT_y

fails when y is 2d. I imagine something like

w[(slice(None), ) + (numpy.newaxis, ) * (len(y.shape) - 1)] * QT_y

could force the broadcasting to work properly. Is there a better way of doing this?

Member:

You could use an if-else statement instead, because the above one-liner is hard to read.
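For instance, something along these lines (just a sketch; the helper name is made up, with w and QT_y as in the diff above):

```python
import numpy as np

def apply_eigenvalue_weights(w, QT_y):
    # multiply row i of QT_y by w[i], whether y was 1-d or 2-d
    if QT_y.ndim == 1:
        return w * QT_y
    # QT_y has shape (n_samples, n_targets): give w a trailing axis
    # so it broadcasts across the targets
    return w[:, np.newaxis] * QT_y
```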

@fcostin (Contributor) commented Feb 28, 2012

OK, all the ridge tests are passing now and I've incorporated the previous comments.

"svd" mode breaks when sample weights are used, so if that case is detected, a warning is printed and "eigen" mode is forced instead. I think there is probably a way to get the SVD to work properly with sample weights, but I will have to stare at http://en.wikipedia.org/wiki/Woodbury_matrix_identity and look over the LOO-CV references again.

EDIT: note that the benchmarks remain the same as the most recent ones I posted -- there is still a good speedup on the madelon dataset.

@mblondel (Member) commented:

In Ridge, I also force the use of kernel ridge when sample_weight is enabled.

@mblondel (Member) commented:

Alright, I'm merging with minor fixes.

@mblondel (Member) commented:

Merged.

@mblondel closed this Feb 28, 2012
@GaelVaroquaux (Member) commented:

Awesome! Great work.

Member:

Also I wonder if we could further speed things up with sklearn's fast_svd instead.

@fcostin (Author):

Yes, I agree that might help even more for particularly low-rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low-rank SVD algorithm -- very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?

Member:

Yes: https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_plot_svd.py

If you want to add a benchmark for arpack on sparse data, please feel free to do so as well :)

Member:

> Yes, I agree that might help even more for particularly low-rank matrices X. Not sure how it would work in general. I didn't realise sklearn had an implementation of that randomised low-rank SVD algorithm -- very nice! Are there any benchmarks comparing fast_svd to other SVD approaches?

I'm curious whether this can work or not. I would tend to think it requires an exact SVD. BTW, keep in mind that using fast_svd would require an extra argument (e.g. named "n_components").

Member:

True. Adding an extra hyperparameter is not user-friendly, but it might make it possible to scale to very large n_samples datasets.

@fcostin (Author):

Perhaps this could work well in the special case where n_samples >> n_features >> rank. In that case the low-rank approximation obtained via fast_svd could be exact, by requesting n_components = rank. If the approximation obtained by fast_svd is not exact, then we miss the smallest singular values, which make the largest contributions when we apply the inverse matrix (since 1/(s^2 + alpha) is largest for the smallest singular values s).
