Add svd_flip (#6599) #6613
Conversation
Another reason this could be useful is that singular vectors currently have different signs depending on the chunking:

```python
import numpy as np
import dask.array as da

rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))
u, s, v = da.linalg.svd(da.asarray(x, chunks=(4, 4)))
print(u.compute().round(1))
```
```
[[ 0.3 -0.1  0.2 -0.4]
 [ 0.1  0.3 -0.3 -0.2]
 [ 0.3  0.4 -0.5  0.3]
 [ 0.4  0.4  0.5 -0.4]
 [ 0.3 -0.2 -0.1  0.3]
 [ 0.5  0.  -0.3 -0.3]
 [ 0.4 -0.7 -0.1  0.1]
 [ 0.3  0.2  0.5  0.6]]
```
```python
u, s, v = da.linalg.svd(da.asarray(x, chunks=(1, 4)))
print(u.compute().round(1))
```
```
[[-0.3  0.1 -0.2 -0.4]
 [-0.1 -0.3  0.3 -0.2]
 [-0.3 -0.4  0.5  0.3]
 [-0.4 -0.4 -0.5 -0.4]
 [-0.3  0.2  0.1  0.3]
 [-0.5 -0.   0.3 -0.3]
 [-0.4  0.7  0.1  0.1]
 [-0.3 -0.2 -0.5  0.6]]
```

This would also come up with #6616 in chunked vs. not-chunked results. In any case, if/when this goes through I would happily integrate this in `dask.svd` as a part of #6616. Let me know if you have any concerns or objections @mrocklin / @TomAugspurger.

**Update**: FYI, the same sign difference shows up with `svd_compressed`:

```python
import numpy as np
import dask.array as da

rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))
u, s, v = da.linalg.svd_compressed(da.asarray(x, chunks=(6, 3)), k=4, n_power_iter=10, compute=True, seed=1)
print(u.compute().round(1))
```
```
[[ 0.3  0.1 -0.3  0.5]
 [ 0.1 -0.3  0.4 -0.4]
 [ 0.3 -0.4  0.5 -0.1]
 [ 0.4 -0.4 -0.5 -0.1]
 [ 0.3  0.2  0.   0.3]
 [ 0.5 -0.   0.3  0.4]
 [ 0.4  0.7  0.1 -0.5]
 [ 0.3 -0.2 -0.5 -0.4]]
```
```python
u, s, v = da.linalg.svd_compressed(da.asarray(x, chunks=(6, 4)), k=4, n_power_iter=10, compute=True, seed=1)
print(u.compute().round(1))
```
```
[[-0.3 -0.1 -0.2  0.9]
 [-0.1  0.3  0.3  0.1]
 [-0.3  0.4  0.5  0.2]
 [-0.4  0.4 -0.5 -0.2]
 [-0.3 -0.2  0.1  0. ]
 [-0.5  0.   0.4 -0.3]
 [-0.4 -0.7  0.1 -0.2]
 [-0.3  0.2 -0.5 -0.2]]
```
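For reference, the sign ambiguity shown above can be removed with a deterministic flip. Here is a minimal NumPy sketch (the helper name `sign_flip` is hypothetical, not the PR's actual implementation) of the scikit-learn-style correction, which forces the largest-magnitude entry of each left singular vector to be non-negative:

```python
import numpy as np

def sign_flip(u, v):
    # Hypothetical helper illustrating the scikit-learn-style correction:
    # make the entry of largest magnitude in each column of u non-negative,
    # applying the matching flip to the rows of v so u @ diag(s) @ v is unchanged.
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, range(u.shape[1])])
    return u * signs, v * signs[:, None]

rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))
u, s, v = np.linalg.svd(x, full_matrices=False)

u1, v1 = sign_flip(u, v)
u2, v2 = sign_flip(-u, -v)  # simulate an opposite-sign factorization

assert np.allclose(u1, u2) and np.allclose(v1, v2)  # signs now deterministic
assert np.allclose((u1 * s) @ v1, x)                # reconstruction unchanged
```

Because the flip is applied to matching columns of `u` and rows of `v`, the product `u @ diag(s) @ v` is preserved while the per-vector sign choice becomes independent of how the factorization was computed.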
Thanks for the PR and the analysis. The changes here look great. @eric-czech when do you think a hypothetical …
That's about the only good reason I can think of. Hypothetically, it might also be useful to turn off if somebody wanted to try to support the …

**EDIT**: Oh, also there is a chance that somebody might like to run SVD on transposed inputs as some minor performance improvement too (to avoid a transposition, as they do in scikit-learn), but as far as I know there is no reason that they couldn't simply transpose the inputs and use the right vectors instead of the left vectors. I believe that's why the …
OK, thanks! I don't have a strong opinion, but right now my preferences align with yours (on by default, with a keyword to disable). Let's revisit that once the other PRs and …
This adds the function mentioned in #6599.
Notes on it:
- The scikit-learn version supports making the sign correction based on either `U` or `V`, and as best I can tell, this is only relevant when you're running SVD on transposed inputs in the first place. I don't see any reason to need that, and it appears to be used rarely, so I simply left out the correct-based-on-`V` option (instead of `U`).
- A direct port of the scikit-learn correction would replace `np.sign(u[max_abs_cols, range(u.shape[1])])` with `np.sign(u.vindex[max_abs_cols, range(u.shape[1])])`, where `vindex` is used for fancy indexing. I found this to be unacceptably slow though. Instead, a similarly arbitrary but more efficient correction is to make sure all singular vectors fall in the same half-space as an arbitrarily chosen vector. I can't find an authoritative citation for this, but it's a common-sense approach. Here is a notebook that compares the two approaches on equally sized short-fat and tall-skinny arrays. The benchmark results show that, interestingly, the transposition needed to make SVD work for short-fat arrays (#6591) has almost no effect at all. The scikit-learn correction using `vindex` on dask arrays almost doubles the time it takes for the whole SVD + correction routine to run, while this approach, which uses a matrix-vector multiplication instead (as a sum across rows), accounts for a fairly negligible increase in time taken.

tl;dr: This method is efficient, but I still don't know quite where to put it. I would prefer it be controlled by an option in the `linalg.svd` signature set to `True` by default, though that's not going to make sense until #3576 is done. Also, it will mean that the combination `full_matrices=True, correct_signs=True` isn't a valid one. For now though, I put this in `array.utils` and the test in `test_linalg.py`, which is a little weird, but it seemed to make the most sense there alongside the other SVD tests.
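To illustrate the half-space idea described above, here is a small NumPy sketch (an assumed implementation for illustration, not the PR's exact code): each singular vector is flipped so it lies in the same half-space as the all-ones vector, i.e. so the sum of its entries is non-negative. The column sums are a single matrix-vector product, which avoids the fancy indexing that made the `vindex` approach slow on dask arrays:

```python
import numpy as np

def halfspace_flip(u, v):
    # Sketch of the half-space correction (hypothetical helper name):
    # flip each column of u so its dot product with the all-ones vector
    # is non-negative; u.sum(axis=0) is equivalent to ones @ u, a single
    # matrix-vector product. The matching flip is applied to the rows of v.
    signs = np.sign(u.sum(axis=0))
    signs[signs == 0] = 1  # avoid zeroing out a column in the degenerate case
    return u * signs, v * signs[:, None]

rs = np.random.RandomState(1)
x = rs.random(size=(8, 4))
u, s, v = np.linalg.svd(x, full_matrices=False)

u1, v1 = halfspace_flip(u, v)
u2, v2 = halfspace_flip(-u, -v)  # an equally valid SVD with opposite signs

assert np.allclose(u1, u2) and np.allclose(v1, v2)  # deterministic result
assert np.allclose((u1 * s) @ v1, x)                # product is preserved
```

The choice of reference vector is arbitrary (any fixed vector not orthogonal to a singular vector would do); the all-ones vector is convenient because the correction reduces to a sum across rows.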