Improve SVD consistency and small array handling #6616
TomAugspurger merged 4 commits into dask:master
dask/array/linalg.py
Outdated
# Short-and-fat case
# Single-chunk case
if nb[0] == nb[1] == 1:
    u, s, v = map(asarray, np.linalg.svd(np.asarray(a), full_matrices=False))
This will trigger computation. I think that instead we probably want to keep things lazy, and instead return three single-chunked dask arrays. Thoughts?
I gave it a shot with 2d7194f#diff-d190ab15173165ce6615a5df4e3067ebR862-R868. That look ok?
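For context, the lazy approach suggested here can be sketched by wrapping the NumPy call in `dask.delayed` and rebuilding single-chunk dask arrays from the delayed results (this is an illustrative sketch, not the exact code from the linked commit; dtype handling is simplified):

```python
import numpy as np
import dask
import dask.array as da

def svd_single_chunk(a):
    """Sketch: lazy SVD for a single-chunk dask array (illustrative only)."""
    m, n = a.shape
    k = min(m, n)
    # Defer the factorization into the graph; nothing computes here.
    usv = dask.delayed(np.linalg.svd, nout=3)(a, full_matrices=False)
    u = da.from_delayed(usv[0], shape=(m, k), dtype=a.dtype)
    s = da.from_delayed(usv[1], shape=(k,), dtype=a.dtype)
    v = da.from_delayed(usv[2], shape=(k, n), dtype=a.dtype)
    return u, s, v

x = da.random.random((10, 4), chunks=(10, 4))
u, s, v = svd_single_chunk(x)  # still lazy; .compute() triggers the SVD
```

Each result is a single-chunk dask array, so the SVD only runs when something downstream computes.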
TomAugspurger left a comment
Thanks, just a couple comments / questions.
dask/array/linalg.py
Outdated
coerce_signs : bool
    Whether or not to apply sign correction to singular vectors in
    order to maintain deterministic results, by default True.
This should go under the Parameters section.
dask/array/linalg.py
Outdated
# Short-and-fat case
# Single-chunk case
if nb[0] == nb[1] == 1:
    u, s, v = map(asarray, np.linalg.svd(np.asarray(a), full_matrices=False))
Is the np.asarray here to avoid dispatching back to dask's svd?
We might instead consider
-    u, s, v = map(asarray, np.linalg.svd(np.asarray(a), full_matrices=False))
+    a = a.compute()
+    u, s, v = map(asarray, np.linalg.svd(a, full_matrices=False))
For the case where we have a Dask array backed by a cupy array on the GPU, this would (I think) keep things on the GPU the whole time.
I had originally just done that to avoid dispatch to the same function but tried to make it all lazy instead (as in @mrocklin's comment above).
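A sketch of why the `np.asarray` was there in the first place (assuming NEP-18 `__array_function__` dispatch, which is the default in recent NumPy):

```python
import numpy as np
import dask.array as da

x = da.random.random((4, 6), chunks=(4, 6))

# np.asarray materializes the dask array as a concrete ndarray (this is
# what "triggers computation"); the subsequent np.linalg.svd call then
# goes straight to NumPy's LAPACK-backed routine rather than dispatching
# back to dask's own svd implementation via __array_function__.
u, s, v = np.linalg.svd(np.asarray(x), full_matrices=False)
```

This is also why the coercion is hostile to GPU-backed arrays: `np.asarray` pins the result to host memory, which motivated the `a.compute()` suggestion above.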
Thanks @mrocklin / @TomAugspurger, a few other notes:
if nb[0] == nb[1] == 1:
    m, n = a.shape
    k = min(a.shape)
    mu, ms, mv = np.linalg.svd(np.ones((1, 1), dtype=a.dtype))
We might still have the issue of a being on a GPU here. IIUC, this is to determine the appropriate dtypes after the SVD for the meta later on. Hopefully this suffices:
-    mu, ms, mv = np.linalg.svd(np.ones((1, 1), dtype=a.dtype))
+    mu, ms, mv = np.linalg.svd(a._meta)
That's a size-0 array, but at least for ndarrays np.linalg.svd seems to be OK with it.
Ah interesting, that does clear all the tests (change committed). Good to know.
What's the concern with a being on a GPU? Does anything in that code look like it would pull it back onto the CPU?
Actually that doesn't work on Python 3.6 apparently (https://github.com/dask/dask/pull/6616/checks?check_run_id=1099031906):
dask\array\linalg.py:864: in svd
mu, ms, mv = np.linalg.svd(a._meta)
C:\Miniconda3\envs\test-environment\lib\site-packages\numpy\linalg\linalg.py:1542: in svd
_assertNoEmpty2d(a)
On linux https://travis-ci.org/github/dask/dask/jobs/726090041:
E numpy.linalg.linalg.LinAlgError: Arrays cannot be empty
Any objections to me changing it back? I'm not sure how to work around that otherwise.
Hey @TomAugspurger I reverted it just to give CI time to run in case you're OK with it as it was. Let me know if there's something else that can be done to avoid any issues on a GPU (I don't understand what they are or I'd happily experiment with other solutions).
Yeah, I'm fine with changing it back for now.
I'm not 100% sure what would happen if a were on the GPU. u, s, and v will be fine, since they're returning CuPy arrays. But I don't know what from_delayed does when it gets GPU memory and meta is a NumPy array rather than a CuPy array.
cc @jakirkham or @pentschev if you think there's any followup needed here.
Ah I see, maybe something like this would make more sense?
np.linalg.svd(np.ones_like(a._meta, shape=(1,1)))
> Ah I see, maybe something like this would make more sense?
> np.linalg.svd(np.ones_like(a._meta, shape=(1,1)))
Yeah, we should do that, otherwise mu, ms, mv will all be NumPy-arrays, and using them below, e.g. meta=mu, becomes completely irrelevant, as it will default anyway to a NumPy array.
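A sketch of the proposed probe (using a plain NumPy meta stand-in here; the point is that `np.ones_like` with the `shape=` keyword, available since NumPy 1.17, preserves the meta's array type and dtype, so on a CuPy-backed array the probe SVD would stay on the GPU):

```python
import numpy as np

# Stand-in for a._meta: a size-0 array carrying only type and dtype info.
meta = np.empty((0, 0), dtype=np.float32)

# ones_like(..., shape=(1, 1)) builds a tiny non-empty array of the same
# backend and dtype as the meta, avoiding both the empty-array SVD error
# and a silent fallback to NumPy for GPU-backed inputs.
probe = np.ones_like(meta, shape=(1, 1))
mu, ms, mv = np.linalg.svd(probe)
```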
It should actually use ones_like_safe (see lines 314 to 323 in 519b62f).
Ok thanks @pentschev. To clarify on this:
> using them below, e.g. meta=mu, becomes completely irrelevant, as it will default anyway to a NumPy array.
I was only after the dtype on the mu, ms, mv arrays. I'll submit another PR that just sets the dtype in from_delayed instead of using meta so that the array backend shouldn't matter.
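To illustrate the distinction being discussed (a minimal sketch; the hypothetical `singular_values` task is just for demonstration): `da.from_delayed` can take either a declared `dtype` or a full `meta` array, and only the dtype is needed here.

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def singular_values():
    # The SVD runs inside the task at compute time.
    return np.linalg.svd(np.ones((3, 2)), full_matrices=False)[1]

# Declaring shape and dtype is enough to build the lazy array; no meta
# array (and hence no probe SVD) is required just to carry dtype info.
s = da.from_delayed(singular_values(), shape=(2,), dtype=np.float64)
```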
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
* SVD single chunk bypass and sign coercion
* black dask
* flake8 dask

This addresses #6612, builds on #6599, and is related to #3576.

For #6612, this calls numpy.linalg.svd for single-chunk arrays, which avoids the performance issues and result-shape inconsistencies that occur when the array has more columns than rows.

This uses the svd_flip function added for #6599 by default and adds a coerce_signs argument to disable that behavior.

For #3576, this ensures that all results are consistent with np.linalg.svd when full_matrices=False.

Note: The code I added here related to the truncate flag is a bit of a band-aid on top of not knowing what's going on in tsqr. To be clear, that code addresses a corner case that should never happen on real datasets: it only occurs when an array has column blocks but is still wider than it is tall (e.g. as shown in this example: #3576 (comment)). Let me know if anybody thinks that is worthy of a separate tsqr-specific issue. I'm on the fence about it but leaning towards no; this is actually more of an SVD-specific problem, so having the logic here may not be a bad solution anyway.
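For context on the sign coercion mentioned above: an SVD is unique only up to paired sign flips of u's columns and v's rows, so a deterministic convention can be imposed without changing the factorization. A rough sketch of that idea (the `flip_signs` helper below is hypothetical; dask's actual `svd_flip` may choose signs by a different rule):

```python
import numpy as np

def flip_signs(u, v):
    """Make SVD output deterministic: force the largest-magnitude entry
    of each column of u to be positive, flipping v's rows to match so
    that u @ diag(s) @ v is unchanged."""
    max_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_rows, np.arange(u.shape[1])])
    return u * signs, v * signs[:, None]

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 3))
u, s, v = np.linalg.svd(a, full_matrices=False)
u2, v2 = flip_signs(u, v)
```

Because each column/row pair is multiplied by the same ±1, the reconstruction is preserved while runs on different chunkings (or backends) agree on signs.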