Skip to content

Automatic rechunking and coercion of numpy.ndarrays#622

Merged
shoyer merged 3 commits intodask:masterfrom
shoyer:coerce-ndarray
Sep 1, 2015
Merged

Automatic rechunking and coercion of numpy.ndarrays#622
shoyer merged 3 commits intodask:masterfrom
shoyer:coerce-ndarray

Conversation

@shoyer
Copy link
Member

@shoyer shoyer commented Aug 24, 2015

Fixes #290

The implementation of automatic rechunking here is the most basic version possible: it only succeeds if at most one of the arrays has multiple chunks defined along an axis.

This suffices for coercing ndarrays -- in the future we can consider more complete solutions.

Also includes a change to rechunking, where it doesn't bother if the chunk size is unchanged. We could remove this (and do it in unify_chunks instead), but it seemed like a general improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is not really the right logic here. The dtype signature is fixed independent of the number of dimensions except for 0d arrays, for which the dtype depends on the value according to some complex set of rules: numpy/numpy#6240

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does numpy use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 for np.array
10 for np.matrix

@mrocklin
Copy link
Member

This ended up being a lot simpler than I anticipated.

@mrocklin
Copy link
Member

Looks like there are some errors on travis.ci

@shoyer
Copy link
Member Author

shoyer commented Aug 26, 2015

Ah, I missed that dask.dataframe also uses partial_by_order. Given that only dask.dataframe needs it now, I think I'll move over the function (and tests), unless you have a better place for it.

@mrocklin
Copy link
Member

Yes, that seems reasonable.

@shoyer shoyer force-pushed the coerce-ndarray branch 4 times, most recently from 6e2cfaa to c84a184 Compare August 29, 2015 17:02
Fixes GH290

The implementation of rechunking here is the most basic version possible: it
only succeeds if at most one of the arrays has multiple chunks defined along
an axis.

This suffices for coercing ndarrays -- in the future we can consider more
complete solutions.

Also includes a change to rechunking, where it doesn't bother if the chunk
size is unchanged. We could remove this (and do it in unify_chunks instead),
but it seemed like a general improvement.
@shoyer
Copy link
Member Author

shoyer commented Aug 31, 2015

Did another refactor. Decided to keep partial_by_order in dask.array.core after all, because it's useful for handling scalar arguments (which need to be treated differently to reproduce NumPy's dtype promotion rules).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, what's the type of this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a numpy.ndarray

But actually, I think I meant to be checking y here, not x...

@mrocklin
Copy link
Member

mrocklin commented Sep 1, 2015

This all looks pretty sane to me.

@mrocklin
Copy link
Member

mrocklin commented Sep 1, 2015

+1

@shoyer
Copy link
Member Author

shoyer commented Sep 1, 2015

Tweaked things a little bit. Now elemwise does not consider objects without a shape attribute to be ndarrays -- those arguments will be passed directly into the graph.

shoyer added a commit that referenced this pull request Sep 1, 2015
Automatic rechunking and coercion of numpy.ndarrays
@shoyer shoyer merged commit 6f290d8 into dask:master Sep 1, 2015
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this doctest skipped?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual error includes the repr of a set object (for which the order of elements will differ on different Python versions), and doctests can't catch generic error messages.

@sinhrks sinhrks added this to the 0.7.1 milestone Sep 4, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants