Skip to content

Informative error if chunks are unknown while slicing#9188

Closed
pavithraes wants to merge 1 commit intodask:mainfrom
pavithraes:unknown-chunk-slicing
Closed

Informative error if chunks are unknown while slicing#9188
pavithraes wants to merge 1 commit intodask:mainfrom
pavithraes:unknown-chunk-slicing

Conversation

@pavithraes
Copy link
Member

Thanks to @ncclementi and @phobson who figured out the exact solution!

@github-actions github-actions bot added the array label Jun 15, 2022
@pavithraes pavithraes marked this pull request as draft June 15, 2022 15:15
@ncclementi
Copy link
Member

It looks like this change is causing bunch of other tests to fail. I think some of them are not to hard to fix, because they are marked as xfail due to the problem this PR is solving. But I'm a bit confused about certain cases where it seems we are ok, having nan chunks ? For example

the test dask/array/tests/test_slicing.py::test_getitem_avoids_large_chunks_missing it seems that works when the first element in chunks is doesn't include nans.

Is this the expected behavior? because if that's the case then the fix we added here breaks this. As any appearance of nan in self.chunks will raise a ValueError now.

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pavithraes! The fix looks good to me, I just have a suggestion to improve the test


def test_slicing_and_unknown_chunks():
a = da.ones((10, 5), chunks=5)
a._chunks = ((np.nan, np.nan), (5,))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than setting the chunks to nan, I think it would be better to write some user code that we know results in unknown chunks (like slicing with a boolean):

a = da.random.uniform(size=10, chunks=2)
a = a[a > 0.5]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't seem to result in unknown chunks, I tried this and it has no problems. There are certain situations where we have nan in the chunks but we have ways of solving this. I'm not sure what's the right example that triggers the type of unknown chunks that causes this problem.

@ncclementi
Copy link
Member

We have multiple situations where we were able to check for

X = np.arange(24).reshape((4, 6))
x = da.from_array(X, chunks=(2, 2))
x._chunks = ((2, 2), (np.nan, np.nan, np.nan))

assert_eq(x[0], X[0]) #where here x has unknown chunks

but now this check will fail with the fix on this PR. So test like the one below fail, because the ValueError raises one step before the pytest.raises .

test_index_array_with_array_1d

I think we might need to discuss how to fix all these tests, and what we expect to work or not. Basically with this change any time that there is an unknown chunk, it'll raise as soon as we do any operation that involves __getitem__ it seems this might be very restrictive, as I see that in certain situations we can work with unkown chunks. For example, the case of test_index_array_with_array_2d

@ncclementi
Copy link
Member

After chatting with Julia for a bit, it seems that finding the right place where to raise the error is not quite possible, or at least we weren't able to sort this out. We have multiple situations where dask can do its magic even though there are nan values in the chunks but not always. It seems the best situation, in this case, is to catch the exception in the compute, and give a more informative error there.

Because I don't have push access to this PR, I continued the work in #9285

@pavithraes
Copy link
Member Author

I continued the work in #9285

Thank you, @ncclementi! I'll close this one then. :)

@pavithraes pavithraes closed this Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IndexError when slicing dask Array with unknown chunk sizes.

3 participants