Quick fix for unbounded memory usage in tensordot #7980
Conversation
|
Here's a teeny tiny example to show with the graph visualization where the problem is.

```python
import dask.array as da

x = da.random.random((10, 10), chunks=(10, 1))
y = da.random.random((10, 10), chunks=(1, 10))
z = da.tensordot(x, y, axes=(1, 0))
z.visualize()
```

After #6846 everything gets shoved into a single tensordot operation at the very end of the computation graph. This happens regardless of the size of the input arrays, and is the reason we can run out of memory. This is what it looks like before that change (and again with the fix for non-sparse arrays in this PR). The layers of sum aggregation really take the pressure off. |
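A minimal sketch (my own, not the PR's implementation) of why those sum layers help: the contraction in `tensordot` can be expressed as a blockwise product followed by a chunked `sum`, and it's the sum's tree reduction that keeps per-task memory bounded.

```python
import numpy as np
import dask.array as da

x = da.random.random((10, 10), chunks=(10, 1))
y = da.random.random((10, 10), chunks=(1, 10))

# Direct tensordot over the contracted axis.
z_direct = da.tensordot(x, y, axes=(1, 0))

# The same contraction written as an outer blockwise product followed
# by a sum over the contracted axis; the sum is built as a layered
# tree reduction, which is what bounds per-task memory.
z_summed = (x[:, :, None] * y[None, :, :]).sum(axis=1)

assert np.allclose(z_direct.compute(), z_summed.compute())
```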
|
I had been assuming that sparse arrays on the CPU would behave equivalently to sparse arrays on the GPU. Looking at the GPU CI (thanks for setting this up @quasiben & team), it appears that assumption is faulty. Unfortunately I haven't been able to get cupy working on my laptop again since the last time it broke. As I recall it was a pretty fragile installation to begin with, and following the (very detailed) notes I took about how we got it working earlier isn't producing the same result this time. I'm having problems selecting the right NVIDIA driver - I need it to (1) make cupy happy, and (2) still allow me to use my external monitor. |
|
Giving up on cupy for tonight - I have CUDA but not cupy working. I'm getting an incompatible version, not sure how to force it to be the correct version since it looks like I'm doing the right things to pin it already. I've posted on the RAPIDS slack to see if anyone has more advice & I'll come back to it tomorrow. (Would Coiled be useful as a workaround for these install issues @mrocklin? You suggested I consider using it for some of the tensordot work, but I think GPU compute was still a little experimental last time I looked.) |
@quasiben does NVIDIA have anyone who could look at restoring cupy sparse support here? |
|
@anaruse or @pentschev if you have time to comment on this it would be appreciated -- currently things are failing in cupy's `sparse/_index.py`. When running locally I get some more output: |
|
From a quick check, this seems to have been introduced in CuPy 9.0.0, there are no errors in 8.6.0. |
|
I must admit I'm a bit ignorant of the code change, but it seems to be due to a change in the input/output dimensions. @GenevieveBuckley could you help me understand what this change should mean w.r.t. dimensionality changes? |
|
Perhaps this test is suspect (though it does pass now). I say suspect because when I convert that test to numpy / scipy it also fails with a similar error:

```python
import numpy as np
import scipy.sparse
import pytest

import dask.array as da
from dask.array.utils import assert_eq


@pytest.mark.parametrize("sp_format", ["csr", "csc"])
def test_np_sparse_dot(sp_format):
    if sp_format == "csr":
        sp_matrix = scipy.sparse.csr_matrix
    elif sp_format == "csc":
        sp_matrix = scipy.sparse.csc_matrix
    dtype = "f"
    density = 0.3
    x_shape, x_chunks = (4, 8), (2, 4)
    y_shape, y_chunks = (8, 6), (4, 3)
    x = np.random.random(x_shape)
    y = np.random.random(y_shape)
    x[x < 1 - density] = 0
    y[y < 1 - density] = 0
    z = x.dot(y)
    da_x = da.from_array(x, chunks=x_chunks, asarray=False, fancy=False)
    da_y = da.from_array(y, chunks=y_chunks, asarray=False, fancy=False)
    da_x = da_x.map_blocks(sp_matrix, dtype=dtype)
    da_y = da_y.map_blocks(sp_matrix, dtype=dtype)
    da_z = da.dot(da_x, da_y).compute()
    assert scipy.sparse.isspmatrix(da_z)
    assert_eq(z, da_z.todense())
```
|
What do you mean by "though it does pass now"? Passes locally? (The gpuCI still marks a lot of failed tests.) I'll look more at […]

When I make a similar version using the pydata sparse library (not […]):

```python
import numpy as np
import sparse

import dask.array as da
from dask.array.utils import assert_eq


def test_pydata_sparse_dot():
    density = 0.3
    x_shape, x_chunks = (4, 8), (2, 4)
    y_shape, y_chunks = (8, 6), (4, 3)
    x = np.random.random(x_shape)
    y = np.random.random(y_shape)
    x[x < 1 - density] = 0
    y[y < 1 - density] = 0
    z = x.dot(y)
    da_x = da.from_array(x, chunks=x_chunks, asarray=False, fancy=False)
    da_y = da.from_array(y, chunks=y_chunks, asarray=False, fancy=False)
    da_x = da_x.map_blocks(sparse.COO.from_numpy, dtype=x.dtype)
    da_y = da_y.map_blocks(sparse.COO.from_numpy, dtype=y.dtype)
    da_z = da.dot(da_x, da_y).compute()
    assert isinstance(da_z, sparse.COO)
    assert_eq(z, da_z.todense())
```
|
I don't think there should be changes to the input/output dimensions, so that's a bit worrying. EDIT: It looks like that line is trying to do the equivalent of […] |
Sorry, I thought they were run again already. Those other tests should have been fixed when we upgraded to NumPy 1.21 in rapidsai/dask-build-environment#3 (which is the case for gpuCI now), but also resolved in #7982 by skipping them with NumPy < 1.20. I'll rerun gpuCI, those other failures should hopefully go away. |
|
rerun tests |
|
In the latest gpuCI run, all those other tests that were previously failing now pass; the only failures left are the […]

@GenevieveBuckley I'm not super familiar with the […] |
|
I think I've got it now - it's only failing when we input mixed array types into tensordot (e.g. a numpy and a sparse array). I need to pass in another argument to […]

Thanks for your help here @pentschev |
|
Here's some tests that can be run locally to check different input array types (you can uncomment the cupyx lines if you have cupy available).

```python
import pytest
import numpy as np
import sparse
import scipy.sparse
# import cupyx.scipy.sparse  # uncomment if cupy is available

import dask.array as da
from dask.array.utils import assert_eq, same_keys


@pytest.mark.parametrize("func, arraytype", [
    (lambda x: x, 'numpy'),
    (sparse.COO.from_numpy, 'sparse'),
    (scipy.sparse.csr_matrix, 'sparse'),
    (scipy.sparse.csc_matrix, 'sparse'),
    # (cupyx.scipy.sparse.csr_matrix, 'sparse'),
    # (cupyx.scipy.sparse.csc_matrix, 'sparse'),
])
def test_tensordot_arraytypes(func, arraytype):
    x = np.arange(400).reshape((20, 20))
    a = da.from_array(x, chunks=(5, 4))
    y = np.arange(200).reshape((20, 10))
    b = da.from_array(y, chunks=(4, 5))
    a = a.map_blocks(func)
    b = b.map_blocks(func)
    assert arraytype in str(type(a._meta))
    assert arraytype in str(type(b._meta))
    z = np.tensordot(x, y, axes=(1, 0))
    c = da.tensordot(a, b, axes=(1, 0))
    assert_eq(c, z, check_meta=False, check_type=False)
    for axes in [1, (1, 0)]:
        assert_eq(da.tensordot(a, b, axes=axes),
                  np.tensordot(x, y, axes=axes),
                  check_meta=False, check_type=False)
        assert_eq(da.tensordot(x, b, axes=axes),
                  np.tensordot(x, y, axes=axes),
                  check_meta=False, check_type=False)
        assert_eq(da.tensordot(a, y, axes=axes),
                  np.tensordot(x, y, axes=axes),
                  check_meta=False, check_type=False)
    assert same_keys(da.tensordot(a, b, axes=(1, 0)),
                     da.tensordot(a, b, axes=(1, 0)))
```
|
|
Ah, and the final assert statement in […] I can pass in […]

Separately, a lot of the sparse array stuff doesn't necessarily propagate the metadata for the array type properly through the whole dask computation. We might look at that more broadly later and try to preserve this information. |
|
Here's what the task graphs look like again.

```python
import sparse
import scipy.sparse
import dask.array as da

x = da.random.random((2, 3, 4), chunks=(1, 2, 2))
x[x < 0.8] = 0
y = da.random.random((4, 3, 2), chunks=(2, 2, 1))
y[y < 0.8] = 0

# dask arrays with numpy chunks
z = da.tensordot(x, y, axes=((1, 2), (1, 0)))

# dask arrays with pydata sparse chunks
x_sparse = x.map_blocks(sparse.COO.from_numpy)
y_sparse = y.map_blocks(sparse.COO.from_numpy)
z_sparse = da.tensordot(x_sparse, y_sparse, axes=((1, 2), (1, 0)))

# dask arrays with scipy sparse csr_matrix chunks
x_csrmatrix = x.map_blocks(scipy.sparse.csr_matrix, dtype=x.dtype)
y_csrmatrix = y.map_blocks(scipy.sparse.csr_matrix, dtype=y.dtype)
z_csrmatrix = da.tensordot(x_csrmatrix, y_csrmatrix, axes=((1, 2), (1, 0)))

# dask arrays with scipy sparse csc_matrix chunks
x_sparse_csc_matrix = x.map_blocks(scipy.sparse.csc_matrix, dtype=x.dtype)
y_sparse_csc_matrix = y.map_blocks(scipy.sparse.csc_matrix, dtype=y.dtype)
z_sparse_csc_matrix = da.tensordot(x_sparse_csc_matrix, y_sparse_csc_matrix, axes=((1, 2), (1, 0)))

z.visualize()                  # numpy chunks
z_sparse.visualize()           # pydata sparse chunks
z_csrmatrix.visualize()        # scipy sparse csr_matrix chunks
z_sparse_csc_matrix.visualize()  # scipy sparse csc_matrix chunks
```
|
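If graphviz isn't available, a lighter-weight way to compare the graphs (my suggestion, not from the thread) is to count tasks directly, which also shows whether the aggregation layers are present:

```python
import dask.array as da

x = da.random.random((10, 10), chunks=(10, 1))
y = da.random.random((10, 10), chunks=(1, 10))
z = da.tensordot(x, y, axes=(1, 0))

# __dask_graph__() returns the full task graph as a mapping;
# its size reflects how many intermediate tasks were generated.
print(len(dict(z.__dask_graph__())))
```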
|
Not the cleanest way to resolve the issue with […], but see dask/dask/array/tests/test_cupy.py, lines 1165 to 1166 at c59a509:

```python
da_x = da_x.map_blocks(sp_matrix, meta=sp_matrix(cupy.array([0], dtype=dtype)))
da_y = da_y.map_blocks(sp_matrix, meta=sp_matrix(cupy.array([0], dtype=dtype)))
```

Could you try that @GenevieveBuckley? |
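For anyone without a GPU, here is a hypothetical CPU analogue of that suggestion using scipy.sparse: passing an explicit sparse `meta` tells dask the chunk type up front instead of leaving it to inference.

```python
import numpy as np
import scipy.sparse
import dask.array as da

x = np.random.random((4, 8))
x[x < 0.7] = 0
d = da.from_array(x, chunks=(2, 4))

# Hand dask a tiny sparse matrix as ``meta`` so the chunk type is
# known when the graph is built (mirrors the CuPy snippet above).
ds = d.map_blocks(
    scipy.sparse.csr_matrix,
    meta=scipy.sparse.csr_matrix(np.array([[0.0]])),
)

assert isinstance(ds._meta, scipy.sparse.csr_matrix)
```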
|
It's worth noting that the pydata sparse arrays are being caught by my check for sparse arrays and forced down the less efficient code path - but there's no real technical reason we need to do this. Pydata sparse arrays work just as well as numpy arrays do, because they can be n-dimensional. Writing this out tells me I should go back and make my check for sparse arrays exclude the pydata sparse arrays. EDIT: Done |
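A quick illustration (mine, not from the PR) of why scipy chunks need the special code path while pydata sparse chunks don't: scipy sparse matrices are strictly 2-D, so they can't hold the n-dimensional blocks that tensordot's intermediate products produce.

```python
import numpy as np
import scipy.sparse

# scipy.sparse matrices are strictly 2-D; constructing one from a
# 3-D block fails, which is what forces the special sparse path.
x3 = np.zeros((2, 3, 4))
try:
    scipy.sparse.csr_matrix(x3)
    only_2d = False
except (ValueError, TypeError):
    only_2d = True

assert only_2d  # an n-dimensional container like sparse.COO would not fail
```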
|
I was able to confirm @pentschev's fixes for the test […], though I know that's not what's being discussed in your last few comments @GenevieveBuckley |
Thanks @quasiben |
|
It's possible we still need to check against this concern, too
EDIT: yes, this is still a thing with this PR #6907 |
|
Marking this PR as a draft, so it doesn't accidentally get merged. Needs a simultaneous fix for the unbounded memory usage while also not reintroducing this error #6907 EDIT: looked into 6907 and uncovered more weirdness, detailed here #6907 (comment) |
Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>
|
Thank you for the excellent explanation @zklaus I just tried your suggestions, but it seems the problem in issue 6907 still exists. I've added this example as a regression test in this PR (I figured it's helpful to have included in the CI runs). |
|
I think maybe it has to do with the way dask names the task keys? |
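For background on key naming (a general dask property, not specific to this PR): task-key names are derived by hashing the operation and its inputs, so an identical expression built twice reuses the same keys, while changing any input changes them.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(16.0).reshape(4, 4), chunks=(2, 2))

# The same expression built twice gets identical (deterministic) names.
a = da.tensordot(x, x, axes=(1, 0))
b = da.tensordot(x, x, axes=(1, 0))
assert a.name == b.name

# Changing an input changes the hash, and therefore the task keys.
c = da.tensordot(x, x + 1, axes=(1, 0))
assert c.name != a.name
```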
|
Since we've been blocked here for a while and it seems that the issue only happens for square matrices, if people (perhaps @mrocklin) would like to have the issue solved even if not completely for now, I propose doing the following: […]
Thoughts on the above? |
|
Thoughts: It's possible it isn't a square/non-square chunk thing. I was playing with very tiny arrays there, we might find that it's a case between "chunk spans a full row / chunk spans a full column" vs "chunk does not span a full row / a full column". I could do some more experimenting to find out if that's the case? |
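A sketch of that experiment (hypothetical shapes, not from the thread): run tensordot on a non-square array across chunkings that do and don't span a full row or column, and check each result against numpy.

```python
import numpy as np
import dask.array as da

x = np.arange(24.0).reshape(4, 6)
y = np.arange(30.0).reshape(6, 5)
expected = np.tensordot(x, y, axes=(1, 0))

# Chunkings chosen to cover: single chunk, chunks spanning a full
# row/column, and chunks spanning neither.
for x_chunks, y_chunks in [((4, 6), (6, 5)),
                           ((4, 1), (1, 5)),
                           ((2, 3), (3, 2)),
                           ((1, 2), (2, 5))]:
    a = da.from_array(x, chunks=x_chunks)
    b = da.from_array(y, chunks=y_chunks)
    result = da.tensordot(a, b, axes=(1, 0)).compute()
    np.testing.assert_allclose(result, expected)
```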
|
@GenevieveBuckley can you merge […] |
|
@dcherian , tests are running and looking good |
|
Wonderful. Thanks @martindurant ! |
|
Windows failure is about missing PDF backend for graphviz |
|
@dask/maintenance - I don't know much about the implementation here, but things seem to pass aside from something completely unrelated. Does anyone have a chance to look this over? |
|
To try and capture the potential breaking points we discussed in #7980 (comment) and #7980 (comment), I opened GenevieveBuckley#1 against @GenevieveBuckley 's branch. Locally the tests all pass, so it would be great if we could have them in before merging this, but otherwise I think it covers the issues we've seen with CuPy as well. Thanks everyone for the work in this very extensive issue! |
|
If @GenevieveBuckley is not available, we can either push to this branch (maintainers can), or start a new one; but only she can merge your PR into her branch. |
|
Thank you @pentschev @martindurant and @dcherian! I am going to pull in @pentschev's changes to this branch. |
|
Ok this is in!! |
|
Great thanks everyone |
|
Thanks for picking this up @pentschev |
This is a quick fix for issue #6916
It removes the memory problem for non-sparse array types. Note that tensordot with sparse arrays still needs a proper fix for the memory issues observed after #6846
- Tests added / passed
- Passes `black dask` / `flake8 dask` / `isort dask`