
Quick fix for unbounded memory usage in tensordot #7980

Merged
jsignell merged 13 commits into dask:main from GenevieveBuckley:issue-6916
Mar 15, 2022

Conversation

@GenevieveBuckley
Contributor

This is a quick fix for issue #6916

It removes the memory problem for non-sparse array types. Note that tensordot with sparse arrays still needs a proper fix for the memory issues observed after #6846

@github-actions github-actions bot added the array label Aug 3, 2021
@GenevieveBuckley
Contributor Author

Here's a teeny tiny example to show with the graph visualization where the problem is.

import dask.array as da

x = da.random.random((10, 10), chunks=(10, 1))
y = da.random.random((10, 10), chunks=(1, 10))
z = da.tensordot(x, y, axes=(1, 0))
z.visualize()

After #6846 everything gets shoved into a single tensordot operation at the very end of the computation graph. This happens regardless of the size of the input arrays, and is the reason we can run out of memory.

[task graph image: everything funnels into a single tensordot node]

This is what it looks like before that change (and again with the fix for non-sparse arrays in this PR). The layers of sum aggregation really take the pressure off.

[task graph image: blockwise tensordot followed by layered sum aggregation]
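The effect of those aggregation layers can be sketched outside dask: form one partial product per chunk of the contracted axis, then combine the partials with pairwise sums so only a few intermediates are live at once. This is a numpy-only illustration; the chunk size and reduction shape are arbitrary, not dask's actual heuristics.

```python
import numpy as np

# Illustrative numpy-only sketch of the aggregation layers: compute one
# partial product per chunk of the contracted axis, then combine the
# partials with pairwise (tree) sums rather than one monolithic sum.
# The chunk size of 2 is arbitrary, not dask's actual heuristic.
rng = np.random.default_rng(0)
x = rng.random((10, 10))
y = rng.random((10, 10))

chunk = 2
partials = [
    x[:, i:i + chunk] @ y[i:i + chunk, :]  # one small task per chunk pair
    for i in range(0, x.shape[1], chunk)
]

# Tree reduction: each round roughly halves the number of live intermediates.
while len(partials) > 1:
    paired = [a + b for a, b in zip(partials[::2], partials[1::2])]
    if len(partials) % 2:  # carry an unpaired leftover forward
        paired.append(partials[-1])
    partials = paired

assert np.allclose(partials[0], x @ y)
```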

@GenevieveBuckley
Contributor Author

I had been assuming that sparse arrays on the CPU would behave equivalently to sparse arrays on the GPU. Looking at the GPU CI (thanks for setting this up @quasiben & team), it appears that assumption is faulty.

Unfortunately I haven't been able to get cupy working on my laptop again since the last time it broke. As I recall it was a pretty fragile installation to begin with, and following the (very detailed) notes I took about how we got it working earlier isn't producing the same result this time. I'm having problems selecting the right NVIDIA driver - I need it to (1) make cupy happy, and (2) still allow me to use my external monitor.

@GenevieveBuckley
Contributor Author

Giving up on cupy for tonight - I have CUDA working but not cupy. I'm getting an incompatible version, and I'm not sure how to force the correct one since it looks like I'm already pinning it properly. I've posted on the RAPIDS slack to see if anyone has more advice and I'll come back to it tomorrow.

(Would Coiled be useful as a workaround for these install issues @mrocklin? You suggested I consider using it for some of the tensordot work, but I think GPU compute was still a little experimental last time I looked.)

@mrocklin
Member

mrocklin commented Aug 3, 2021

I had been assuming that sparse arrays on the CPU would behave equivalently to sparse arrays on the GPU. Looking at the GPU CI (thanks for setting this up @quasiben & team), it appears that assumption is faulty.

@quasiben does NVIDIA have anyone who could look at restoring cupy sparse support here?

@quasiben
Member

quasiben commented Aug 3, 2021

@anaruse or @pentschev, if you have time to comment on this it would be appreciated -- currently things are failing in CuPy's sparse/_index.py.

When running locally I get some more output:

dask/array/tests/test_cupy.py::test_sparse_dot[csr]  ** On entry to cusparseCreateDnMat() parameter number 2 (rows) had an illegal value: 0 <= 0

Traceback (most recent call last):
  File "cupy_backends/cuda/libs/cusparse.pyx", line 1490, in cupy_backends.cuda.libs.cusparse.check_status
cupy_backends.cuda.libs.cusparse.CuSparseError: CUSPARSE_STATUS_INVALID_VALUE
 ** On entry to cusparseCreateCsr() parameter number 2 (rows) had an illegal value: 0 <= 0

Traceback (most recent call last):
  File "cupy_backends/cuda/libs/cusparse.pyx", line 1490, in cupy_backends.cuda.libs.cusparse.check_status
cupy_backends.cuda.libs.cusparse.CuSparseError: CUSPARSE_STATUS_INVALID_VALUE
 ** On entry to cusparseDenseToSparse_bufferSize() parameter number 2 (mat_in) had an illegal value: NULL pointer

Traceback (most recent call last):
  File "cupy_backends/cuda/libs/cusparse.pyx", line 1490, in cupy_backends.cuda.libs.cusparse.check_status
cupy_backends.cuda.libs.cusparse.CuSparseError: CUSPARSE_STATUS_INVALID_VALUE
 ** On entry to cusparseDenseToSparse_analysis() parameter number 2 (mat_in) had an illegal value: NULL pointer

 ** On entry to cusparseCreateDnMat() parameter number 2 (rows) had an illegal value: 0 <= 0

@pentschev
Member

From a quick check, this seems to have been introduced in CuPy 9.0.0, there are no errors in 8.6.0.

@pentschev
Member

pentschev commented Aug 3, 2021

It also seems that the CUSPARSE_STATUS_INVALID_VALUE issue only occurs when Dask is involved; the same kind of computation in a standalone example causes no errors, which makes me think this is some sort of race condition with multithreading/multiprocessing. Despite those errors/warnings, all those tests pass as they did in #7982, even though CuPy raised CUSPARSE_STATUS_INVALID_VALUE. Locally I see the same: all tests pass, but as soon as I switch to the branch in this PR, those two tests fail at this line: https://github.com/dask/dask/pull/7980/files#diff-be1ec6725c3829aa2f919897e0d4cfc1e00b8db1e70e90c82531c5a12fd727afR268.

I must admit I'm a bit ignorant of the code change, but it seems to be due to a change in the input/output dimensions. @GenevieveBuckley could you help me understand what this change should mean w.r.t. dimensionality changes?

@quasiben
Member

quasiben commented Aug 3, 2021

Perhaps this test is suspect (though it does pass now). I say suspect because when I convert that test to numpy/scipy it also fails with a similar IndexError:

import numpy as np
import pytest

import dask.array as da
from dask.array.utils import assert_eq

@pytest.mark.parametrize("sp_format", ["csr", "csc"])
def test_np_sparse_dot(sp_format):
    import scipy.sparse
    if sp_format == "csr":
        sp_matrix = scipy.sparse.csr_matrix
    elif sp_format == "csc":
        sp_matrix = scipy.sparse.csc_matrix
    dtype = "f"
    density = 0.3
    x_shape, x_chunks = (4, 8), (2, 4)
    y_shape, y_chunks = (8, 6), (4, 3)
    x = np.random.random(x_shape)
    y = np.random.random(y_shape)
    x[x < 1 - density] = 0
    y[y < 1 - density] = 0
    z = x.dot(y)
    da_x = da.from_array(x, chunks=x_chunks, asarray=False, fancy=False)
    da_y = da.from_array(y, chunks=y_chunks, asarray=False, fancy=False)
    da_x = da_x.map_blocks(sp_matrix, dtype=dtype)
    da_y = da_y.map_blocks(sp_matrix, dtype=dtype)
    da_z = da.dot(da_x, da_y).compute()
    assert scipy.sparse.isspmatrix(da_z)
    assert_eq(z, da_z.todense())

@GenevieveBuckley
Contributor Author

Perhaps this test is suspect (though it does pass now). I say suspect because when I convert that test to numpy/scipy it also fails with a similar IndexError

What do you mean by "though it does pass now"? Passes locally? The gpuCI still marks a lot of failed tests.

I'll look more at test_np_sparse_dot.

When I make a similar version, using the pydata sparse library (not scipy.sparse), the test passes for me. I did have to make a slight change to how we pass in the dtype information to the dask array, but I don't think that has much to do with anything we're discussing here.

import numpy as np

import dask.array as da
from dask.array.utils import assert_eq

def test_pydata_sparse_dot():
    import sparse

    density = 0.3
    x_shape, x_chunks = (4, 8), (2, 4)
    y_shape, y_chunks = (8, 6), (4, 3)
    x = np.random.random(x_shape)
    y = np.random.random(y_shape)
    x[x < 1 - density] = 0
    y[y < 1 - density] = 0
    z = x.dot(y)
    da_x = da.from_array(x, chunks=x_chunks, asarray=False, fancy=False)
    da_y = da.from_array(y, chunks=y_chunks, asarray=False, fancy=False)
    da_x = da_x.map_blocks(sparse.COO.from_numpy, dtype=x.dtype)
    da_y = da_y.map_blocks(sparse.COO.from_numpy, dtype=y.dtype)
    da_z = da.dot(da_x, da_y).compute()
    assert isinstance(da_z, sparse.COO)
    assert_eq(z, da_z.todense())

@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Aug 4, 2021

I must admit I'm a bit ignorant of the code change, but it seems to be due to a change in the input/output dimensions. @GenevieveBuckley could you help me understand what this change should mean w.r.t. dimensionality changes?

I don't think there should be changes to the input/output dimensions, so that's a bit worrying.

EDIT: It looks like that line is trying to do the equivalent of expand_dims. I'll keep looking.
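For reference, here is a small numpy-only illustration of expand_dims and its common equivalents (the reshape form is what that line appears to be doing); this is just to clarify the operation, not the dask code itself:

```python
import numpy as np

# Three equivalent ways to insert a length-one axis, i.e. what
# expand_dims does; reshape is the form the tensordot code appears
# to be using.
a = np.arange(6).reshape(2, 3)

v1 = np.expand_dims(a, axis=0)
v2 = a[np.newaxis, ...]
v3 = a.reshape((1,) + a.shape)

assert v1.shape == v2.shape == v3.shape == (1, 2, 3)
```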

@pentschev
Member

What do you mean by "though it does pass now"? Passes locally? The gpuCI still marks a lot of failed tests.

Sorry, I thought they were run again already. Those other tests should have been fixed when we upgraded to NumPy 1.21 in rapidsai/dask-build-environment#3 (which is the case for gpuCI now), but also resolved in #7982 by skipping them with NumPy < 1.20. I'll rerun gpuCI, those other failures should hopefully go away.

@pentschev
Member

rerun tests

@pentschev
Member

In the latest gpuCI run, all those other tests that were previously failing now pass, the only failures left are the test_sparse_dot ones.

@GenevieveBuckley I'm not super familiar with the tensordot code, but please let me know if I can assist debugging this in some way.

@GenevieveBuckley
Contributor Author

I think I've got it now - it's only failing when we input mixed array types into tensordot (e.g. a numpy array and a sparse array). I need to pass another argument to _tensordot instead of trying to figure out whether the whole computation is sparse based on only one of the input arrays.
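The idea can be sketched in plain Python; `SparseStandIn` and `choose_path` are hypothetical names for illustration, not dask's real `_tensordot` API:

```python
import numpy as np

# Hypothetical sketch only: `SparseStandIn` and `choose_path` are
# made-up names illustrating the fix, not dask's real _tensordot API.
class SparseStandIn:
    """Stand-in for a sparse chunk type such as scipy.sparse.csr_matrix."""

def is_sparse(meta):
    return isinstance(meta, SparseStandIn)

def choose_path(left_meta, right_meta):
    # Inspect BOTH inputs: a mixed numpy/sparse pair must also take
    # the sparse-safe code path.
    if is_sparse(left_meta) or is_sparse(right_meta):
        return "sparse"
    return "dense"

assert choose_path(np.empty(0), np.empty(0)) == "dense"
assert choose_path(np.empty(0), SparseStandIn()) == "sparse"  # mixed inputs
```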

Thanks for your help here @pentschev

@GenevieveBuckley
Contributor Author

Here's some tests that can be run locally to check different input array types (you can uncomment the cupyx lines if you have cupy available).

import pytest
import dask.array as da
from dask.array.utils import assert_eq, same_keys
import numpy as np
import sparse
import scipy.sparse
#import cupyx.sparse

@pytest.mark.parametrize("func, arraytype", [
    (lambda x: x, 'numpy'),
    (sparse.COO.from_numpy, 'sparse'),
    (scipy.sparse.csr_matrix, 'sparse'),
    (scipy.sparse.csc_matrix, 'sparse'),
    #(cupyx.sparse.csr_matrix, 'sparse'),
    #(cupyx.sparse.csc_matrix, 'sparse'),
])
def test_tensordot_arraytypes(func, arraytype):
    x = np.arange(400).reshape((20, 20))
    a = da.from_array(x, chunks=(5, 4))
    y = np.arange(200).reshape((20, 10))
    b = da.from_array(y, chunks=(4, 5))

    a = a.map_blocks(func)
    b = b.map_blocks(func)

    assert arraytype in str(type(a._meta))
    assert arraytype in str(type(b._meta))

    z = np.tensordot(x, y, axes=(1, 0))
    c = da.tensordot(a, b, axes=(1, 0))
    assert_eq(c, z, check_meta=False, check_type=False)

    for axes in [1, (1, 0)]:
        assert_eq(da.tensordot(a, b, axes=axes), np.tensordot(x, y, axes=axes), check_meta=False, check_type=False)
        assert_eq(da.tensordot(x, b, axes=axes), np.tensordot(x, y, axes=axes), check_meta=False, check_type=False)
        assert_eq(da.tensordot(a, y, axes=axes), np.tensordot(x, y, axes=axes), check_meta=False, check_type=False)

    assert same_keys(da.tensordot(a, b, axes=(1, 0)), da.tensordot(a, b, axes=(1, 0)))

@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Aug 4, 2021

Ah, and the final assert statement in test_sparse_dot is failing because it is comparing a regular numpy array and a densified sparse array and finding that the array types don't match - that's to be expected (we're comparing a cupy.ndarray with a densified cupy.matrix).

I can pass in check_type=False to stop it failing.

Separately, a lot of the sparse array handling doesn't necessarily propagate the array-type metadata properly through the whole dask computation. We might look at that more broadly later and try to preserve this information.

@GenevieveBuckley
Contributor Author

Here's what the task graphs look like again

import sparse
import scipy.sparse
import dask.array as da
from dask.array.utils import assert_eq

x = da.random.random((2, 3, 4), chunks=(1, 2, 2))
x[x < 0.8] = 0
y = da.random.random((4, 3, 2), chunks=(2, 2, 1))
y[y < 0.8] = 0

# dask arrays with numpy chunks
z = da.tensordot(x, y, axes=((1, 2), (1, 0)))

# dask arrays with pydata sparse chunks
x_sparse = x.map_blocks(sparse.COO.from_numpy)
y_sparse = y.map_blocks(sparse.COO.from_numpy)
z_sparse = da.tensordot(x_sparse, y_sparse, axes=((1, 2), (1, 0)))

# dask arrays with scipy sparse csr_matrix chunks
x_csrmatrix = x.map_blocks(scipy.sparse.csr_matrix, dtype=x.dtype)
y_csrmatrix = y.map_blocks(scipy.sparse.csr_matrix, dtype=y.dtype)
z_csrmatrix = da.tensordot(x_csrmatrix, y_csrmatrix, axes=((1, 2), (1, 0)))

# dask arrays with scipy sparse csc_matrix chunks
x_sparse_csc_matrix = x.map_blocks(scipy.sparse.csc_matrix, dtype=x.dtype)
y_sparse_csc_matrix = y.map_blocks(scipy.sparse.csc_matrix, dtype=y.dtype)
z_sparse_csc_matrix = da.tensordot(x_sparse_csc_matrix, y_sparse_csc_matrix, axes=((1, 2), (1, 0)))

z.visualize()  # numpy chunks
z_sparse.visualize()  # pydata sparse chunks
z_csrmatrix.visualize()  # scipy sparse csr_matrix chunks
z_sparse_csc_matrix.visualize()  # scipy sparse csc_matrix chunks

[task graph images for the cases above]

@pentschev
Member

pentschev commented Aug 4, 2021

Not the cleanest way to resolve the issue with _meta, but it seems that replacing

da_x = da_x.map_blocks(sp_matrix, dtype=dtype)
da_y = da_y.map_blocks(sp_matrix, dtype=dtype)
with the code below will work for now:

da_x = da_x.map_blocks(sp_matrix, meta=sp_matrix(cupy.array([0], dtype=dtype)))
da_y = da_y.map_blocks(sp_matrix, meta=sp_matrix(cupy.array([0], dtype=dtype)))

Could you try that @GenevieveBuckley ?

@GenevieveBuckley
Contributor Author

We do still see the bottleneck happen if we pass in sparse arrays (that's why this is a quick fix workaround, rather than a proper solution).

import dask.array as da
import numpy as np
import sparse
import scipy.sparse

x = da.random.random((1, 100), chunks=(1, 50))
y = da.random.random((100, 1), chunks=(50, 1))
z = da.tensordot(x, y, axes=(1, 0))
z.visualize()

This is good: for numpy chunked arrays we no longer have a bottleneck in the compute graph
[task graph image: numpy chunks, no bottleneck]

x_csrmatrix = x.map_blocks(scipy.sparse.csr_matrix, dtype=x.dtype)
y_csrmatrix = y.map_blocks(scipy.sparse.csr_matrix, dtype=y.dtype)
z_csrmatrix = da.tensordot(x_csrmatrix, y_csrmatrix, axes=(1, 0))
z_csrmatrix.visualize()

This is not good: pushing everything into that final tensordot operation at the very end gives us memory problems.
[task graph image: sparse chunks, single tensordot bottleneck]

@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Aug 4, 2021

It's worth noting that the pydata sparse arrays are being caught by my check for sparse arrays and forced down the less efficient code path - but there's no real technical reason we need to do this. Pydata sparse arrays work just as well as numpy arrays do, because they can be n-dimensional.

... writing this out tells me I should go back and make my check for sparse arrays exclude the pydata sparse arrays.

EDIT: Done
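A minimal sketch of such a check, using module names as a stand-in for the real condition (the strings and logic here are illustrative assumptions, not dask's actual test):

```python
# Illustrative sketch: classify chunk types by module name.  The exact
# strings and condition are assumptions, not dask's actual check.
def needs_sparse_workaround(module_name):
    # pydata/sparse arrays are n-dimensional like numpy, so they can
    # take the efficient path and are excluded from the workaround.
    if module_name.startswith("sparse"):
        return False
    # scipy.sparse and cupyx.scipy.sparse matrices are 2-D only.
    return "sparse" in module_name

assert not needs_sparse_workaround("numpy")
assert needs_sparse_workaround("scipy.sparse._csr")
assert not needs_sparse_workaround("sparse._coo.core")
```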

@quasiben
Member

quasiben commented Aug 4, 2021

I was able to confirm @pentschev's fixes for the test, though I know that's not what's being discussed in your last few comments @GenevieveBuckley

@GenevieveBuckley
Contributor Author

I was able to confirm @pentschev's fixes for the test, though I know that's not what's being discussed in your last few comments @GenevieveBuckley

Thanks @quasiben
I got cupy working locally (thanks to Paul Taylor at NVIDIA) and all of the dask/array/tests/test_cupy.py tests pass with cupy 8.6
I do get fatal floating point errors with cupy 9.2, but I think this is probably the same thing you and Peter have been looking into already.

Details:
dask/array/tests/test_cupy.py::test_sparse_hstack_vstack_csr Fatal Python error: Floating point exception

dask/array/tests/test_cupy.py::test_sparse_dot[csr] Fatal Python error: Floating point exception

Current thread 0x00007f4b5a95f740 (most recent call first):
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/cupy/cusparse.py", line 1263 in create
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/cupy/cusparse.py", line 1644 in denseToSparse
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/cupyx/scipy/sparse/csr.py", line 1131 in dense2csr
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/cupyx/scipy/sparse/csr.py", line 77 in _convert_dense
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/cupyx/scipy/sparse/compressed.py", line 263 in __init__
  File "/home/genevieve/GitHub/dask/dask/array/utils.py", line 144 in compute_meta
  File "/home/genevieve/GitHub/dask/dask/array/blockwise.py", line 279 in blockwise
  File "/home/genevieve/GitHub/dask/dask/array/core.py", line 713 in map_blocks
  File "/home/genevieve/GitHub/dask/dask/array/core.py", line 2378 in map_blocks
  File "/home/genevieve/GitHub/dask/dask/array/tests/test_cupy.py", line 1162 in test_sparse_dot
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/home/genevieve/anaconda3/envs/dask-dev/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/genevieve/anaconda3/envs/dask-dev/bin/pytest", line 11 in <module>
Floating point exception (core dumped)

@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Aug 5, 2021

It's possible we still need to check against this concern, too

#6916 (comment)

Note that there seems to have been a hidden bug in dask's tensordot for a long time that was coincidentally resolved by #6846. For a few more details see #6907.

#6916 (comment)

Good to know. Maybe we should include a test then? Would you be willing to submit a PR? 🙂

I would. The problem is that I am not completely sure what is going on. In

dask.config.set(scheduler='single-threaded')
x = da.random.random((4,4), chunks=(2,2))
x[x < 0.65] = 0
y = x.persist()
r1 = da.dot(x, x).compute()
r2 = da.dot(y, y).compute()

r1 is wrong and r2 is correct (with a dask version prior to fbccc4e). This is not due to special values; when I take the problematic values and create an array from a numpy array with them, the problem does not occur. The problem is also not with the (tensor)dot implementation of the blocks, but rather that they get fed wrong blocks. But I was not able yet to track down why and therefore I am not sure how to set up the test. I think having random stuff in tests should rather be avoided.

What do you think? Include a test with random as above? Track down the problem in more detail before writing the test?

EDIT: yes, this is still a thing with this PR #6907

@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Aug 5, 2021

Marking this PR as a draft, so it doesn't accidentally get merged.

Needs a simultaneous fix for the unbounded memory usage while also not reintroducing this error #6907

EDIT: looked into 6907 and uncovered more weirdness, detailed here #6907 (comment)

GenevieveBuckley and others added 4 commits September 9, 2021 14:00
Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>
Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>
Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>
@GenevieveBuckley
Contributor Author

GenevieveBuckley commented Sep 9, 2021

Thank you for the excellent explanation @zklaus

I just tried your suggestions, but it seems the problem in issue 6907 still exists. I've added this example as a regression test in this PR (I figured it's helpful to have it included in the CI runs).

@GenevieveBuckley
Contributor Author

I think maybe it has to do with the way dask names the task keys?

@pentschev
Member

Since we've been blocked here for a while and it seems that the issue only happens for square matrices, if people (perhaps @mrocklin) would like to have the issue solved even if not completely for now, I propose doing the following:

  1. Special-casing square matrices with the previous behavior, which still has the unbounded memory usage issue;
  2. Have the fix in for non-square matrices;
  3. (Potentially add a runtime warning for square matrices to let users know of the memory usage issue);
  4. File an issue where we can continue the discussion for edge case(s) and fix them later.

Thoughts on the above?
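Item 3 could look something like this sketch; the trigger condition, function name, and message are illustrative only, not a proposed final implementation:

```python
import warnings

# Sketch of item 3 above; the trigger and message are illustrative only.
def warn_if_square(shape):
    if len(shape) == 2 and shape[0] == shape[1]:
        warnings.warn(
            "tensordot on square matrices currently uses a code path "
            "with unbounded memory usage",
            RuntimeWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_square((4, 4))  # square: warns
    warn_if_square((4, 6))  # non-square: silent

assert len(caught) == 1
assert issubclass(caught[0].category, RuntimeWarning)
```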

@GenevieveBuckley
Contributor Author

Thoughts: It's possible it isn't a square/non-square chunk thing. I was playing with very tiny arrays there, we might find that it's a case between "chunk spans a full row / chunk spans a full column" vs "chunk does not span a full row / a full column". I could do some more experimenting to find out if that's the case?

@dcherian
Collaborator

dcherian commented Mar 7, 2022

@GenevieveBuckley can you merge main and restart the tests please? Perhaps whatever fixed #8131 has also fixed the blocker here 🤞🏾

@martindurant
Member

@dcherian , tests are running and looking good

@dcherian
Collaborator

dcherian commented Mar 9, 2022

Wonderful. Thanks @martindurant !

@martindurant
Member

Windows failure is about missing PDF backend for graphviz

@martindurant
Member

@dask/maintenance - I don't know much about the implementation here, but things seem to pass aside from something completely unrelated. Does anyone have a chance to look this over?

@pentschev
Member

To try and capture the potential breaking points we discussed in #7980 (comment) and #7980 (comment), I opened GenevieveBuckley#1 against @GenevieveBuckley 's branch. Locally the tests all pass, so it would be great if we could have them in before merging this, but otherwise I think it covers the issues we've seen with CuPy as well.

Thanks everyone for the work in this very extensive issue!

@martindurant
Member

If @GenevieveBuckley is not available, we can either push to this branch (maintainers can), or start a new one; but only she can merge your PR into her branch.

@jsignell
Member

Thank you @pentschev @martindurant and @dcherian! I am going to pull @pentschev's changes into this branch.

@jsignell jsignell added the almost done Work is almost done! label Mar 14, 2022
@jsignell jsignell merged commit a7e57ba into dask:main Mar 15, 2022
@jsignell
Member

Ok this is in!!

@jsignell jsignell added bug Something is broken and removed almost done Work is almost done! labels Mar 15, 2022
@dcherian
Collaborator

Great thanks everyone

@GenevieveBuckley
Contributor Author

Thanks for picking this up @pentschev

@galipremsagar galipremsagar mentioned this pull request Apr 6, 2022