Add svd_flip (#6599) #6613

Merged
TomAugspurger merged 1 commit into dask:master from eric-czech:svd_flip on Sep 10, 2020

Conversation

@eric-czech (Contributor) commented Sep 9, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

This adds the function mentioned in #6599.

Notes on it:

  • Sign correction in SVD is a heuristic process, though there seems to be some consensus that an idealized correction identifies eigenvector directions that point in the direction of the data vectors [1] (this gets cited in a lot of Stack Exchange threads / GitHub issues)
  • The "use the sign of the largest absolute value in a singular vector" approach taken in scikit-learn seems to be fairly common and is convenient when you want deterministic but arbitrary behavior (i.e. a data-independent heuristic)
  • I don't think this needs to be a meaningful correction, just a deterministic one
  • The scikit-learn svd_flip function allows the correction to be based on U or V, and as best I can tell this is only relevant when you're running SVD on transposed inputs in the first place. I don't see any reason to need that, and it appears to be used rarely, so I simply left out the option to correct based on V instead of U.
  • The trivial dask implementation of sklearn's svd_flip replaces np.sign(u[max_abs_cols, range(u.shape[1])]) with np.sign(u.vindex[max_abs_cols, range(u.shape[1])]), where vindex is used for fancy indexing. I found this to be unacceptably slow, though. Instead, a similarly arbitrary but more efficient correction is to make sure all singular vectors fall in the same half-space as any arbitrarily chosen vector (see the sketch after this list). I can't find an authoritative citation for this, but it's a common-sense approach. Here is a notebook that compares the two approaches on equally sized short-fat or tall-skinny arrays and contains benchmark results: [benchmark screenshot, Sep 9, 2020]
    • svd_flip1 = scikit-learn svd_flip adapted to dask arrays
    • svd_flip2 = the implementation in this PR
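For reference, here is a minimal sketch of that half-space correction (the name svd_flip_sketch and the choice of the all-ones reference vector are illustrative assumptions, not necessarily the exact code in this PR):

import dask.array as da

def svd_flip_sketch(u, v):
    # Dot each left singular vector (column of u) with the all-ones
    # vector; a sum across rows is that matrix-vector multiplication.
    signs = da.sign(u.sum(axis=0))
    signs = da.where(signs == 0, 1, signs)  # guard against an exactly-zero sum
    # Flip each column of u together with the matching row of v, so the
    # reconstruction u @ diag(s) @ v is unchanged.
    return u * signs, v * signs[:, None]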

Interestingly, the transposition needed to make SVD work for short-fat arrays (#6591) has almost no effect at all. The scikit-learn correction using vindex on dask arrays nearly doubles the time the whole SVD + correction routine takes to run, while the approach here, which uses a matrix-vector multiplication instead (as a sum across rows), adds a fairly negligible amount of time.

tl;dr This method is efficient, but I still don't know quite where to put it. I would prefer for it to be controlled by an option in the linalg.svd signature that is set to True by default, though that isn't going to make sense until #3576 is done. It will also mean that the combination full_matrices=True, correct_signs=True isn't a valid one. For now, I put the function in array.utils and the test in test_linalg.py, which is a little weird, but it seemed to make the most sense there alongside the other SVD tests.

@eric-czech force-pushed the svd_flip branch 2 times, most recently from 1f3e58c to d6a0b4a on September 9, 2020 19:07
@eric-czech (Contributor, Author) commented Sep 10, 2020

Another reason this could be useful is that singular vectors currently have different signs depending on the chunking:

import numpy as np
import dask.array as da
rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))

u, s, v = da.linalg.svd(da.asarray(x, chunks=(4, 4)))  # two 4-row chunks
print(u.compute().round(1))
[[ 0.3 -0.1  0.2 -0.4]
 [ 0.1  0.3 -0.3 -0.2]
 [ 0.3  0.4 -0.5  0.3]
 [ 0.4  0.4  0.5 -0.4]
 [ 0.3 -0.2 -0.1  0.3]
 [ 0.5  0.  -0.3 -0.3]
 [ 0.4 -0.7 -0.1  0.1]
 [ 0.3  0.2  0.5  0.6]]

u, s, v = da.linalg.svd(da.asarray(x, chunks=(1, 4)))  # eight 1-row chunks
print(u.compute().round(1))
[[-0.3  0.1 -0.2 -0.4]
 [-0.1 -0.3  0.3 -0.2]
 [-0.3 -0.4  0.5  0.3]
 [-0.4 -0.4 -0.5 -0.4]
 [-0.3  0.2  0.1  0.3]
 [-0.5 -0.   0.3 -0.3]
 [-0.4  0.7  0.1  0.1]
 [-0.3 -0.2 -0.5  0.6]]
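
Reusing the svd_flip_sketch helper sketched earlier in the thread (hypothetical usage), the two chunkings become comparable after correction:

u1, s1, v1 = da.linalg.svd(da.asarray(x, chunks=(4, 4)))
u2, s2, v2 = da.linalg.svd(da.asarray(x, chunks=(1, 4)))
u1, v1 = svd_flip_sketch(u1, v1)
u2, v2 = svd_flip_sketch(u2, v2)
# True unless some column of u sums to ~0:
print(np.allclose(u1.compute(), u2.compute()))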

This would also come up in #6616 when comparing chunked vs. un-chunked results.

In any case, if/when this goes through, I would happily integrate it into dask's SVD as part of #6616. Let me know if you have any concerns or objections @mrocklin / @TomAugspurger.

Update

FYI, svd_compressed also has this same behavior:

import numpy as np
import dask.array as da
rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))

u, s, v = da.linalg.svd_compressed(da.asarray(x, chunks=(6, 3)), k=4, n_power_iter=10, compute=True, seed=1)  # chunked in both dimensions
print(u.compute().round(1))
[[ 0.3  0.1 -0.3  0.5]
 [ 0.1 -0.3  0.4 -0.4]
 [ 0.3 -0.4  0.5 -0.1]
 [ 0.4 -0.4 -0.5 -0.1]
 [ 0.3  0.2  0.   0.3]
 [ 0.5 -0.   0.3  0.4]
 [ 0.4  0.7  0.1 -0.5]
 [ 0.3 -0.2 -0.5 -0.4]]

u, s, v = da.linalg.svd_compressed(da.asarray(x, chunks=(6, 4)), k=4, n_power_iter=10, compute=True, seed=1)  # a single column chunk
print(u.compute().round(1))
[[-0.3 -0.1 -0.2  0.9]
 [-0.1  0.3  0.3  0.1]
 [-0.3  0.4  0.5  0.2]
 [-0.4  0.4 -0.5 -0.2]
 [-0.3 -0.2  0.1  0. ]
 [-0.5  0.   0.4 -0.3]
 [-0.4 -0.7  0.1 -0.2]
 [-0.3  0.2 -0.5 -0.2]]

@TomAugspurger (Member)

Thanks for the PR and the analysis. The changes here look great.

@eric-czech when do you think a hypothetical correct_signs=False would be useful? Just for performance?

@eric-czech (Contributor, Author) commented Sep 10, 2020

> @eric-czech when do you think a hypothetical correct_signs=False would be useful? Just for performance?

That's about the only good reason I can think of. Hypothetically, it might also be useful to turn the correction off if somebody wanted to try to support the full_matrices=True behavior of np.linalg.svd, since this sign correction wouldn't handle that. The sklearn svd_flip function breaks in that case too (as we saw in dask/dask-ml#732), but I can see how this svd_flip function could be extended to work in that scenario if it's ever necessary.

EDIT

Oh, also: there is a chance somebody might like to run SVD on transposed inputs as a minor performance improvement (to avoid a transposition, as they do in scikit-learn), but as far as I know there is no reason they couldn't simply transpose the inputs and use the right singular vectors instead of the left ones. I believe that's why the u_based_decision flag exists in sklearn's svd_flip, and the equivalent here would be to set correct_signs=False and then run svd_flip yourself with u_based_decision=False. That's one more potential reason for the option, but I would also call it a performance concern.
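
For concreteness, a sketch of that transpose trick, assuming a short-fat input x (shapes and chunking here are illustrative):

import dask.array as da

x = da.random.random((4, 100), chunks=(4, 25))  # short-fat input
# x = u @ diag(s) @ v implies x.T = v.T @ diag(s) @ u.T, so run SVD on
# the tall-and-skinny transpose and swap/transpose the factors:
ut, st, vt = da.linalg.svd(x.T)
u, s, v = vt.T, st, ut.T  # now x ≈ u @ diag(s) @ v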

@TomAugspurger (Member)

OK, thanks! I don't have a strong opinion, but right now my preferences align with yours (on by default, with a keyword to disable).

Let's revisit that once the other PRs and full_matrices issues are resolved.

@TomAugspurger TomAugspurger merged commit 3327b2e into dask:master Sep 10, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
