Efficiently serialize zero strided NumPy arrays#3180
Conversation
|
That's a neat trick. Any reason you marked this as a WIP? No idea what is going on with the CI: |
I wanted to run the full CI with these changes...which is currently experiencing some problems. I've not seen this problem before. Seems to only happen on the 3.5 build, I wonder if it's related to dropping Python 3.5 in dask |
|
@jrbourbeau should this go in? |
|
Or rather, in principle this seems fine to me. If you're comfortable merging, then please do. |
|
Merging this later today if there are no further comments. |
|
Thanks for giving this a look @mrocklin! The only additional change I think we should make here is handling |
|
OK. I'm happy to defer to your judgement here. |
distributed/sizeof.py
Outdated
| try: | ||
| # Handle zero-strided NumPy arrays | ||
| if np is not None and isinstance(obj, np.ndarray) and 0 in obj.strides: | ||
| from .protocol.numpy import _zero_strided_slice |
There was a problem hiding this comment.
It seems off to include this special case in this function. Is this a case that we can fix upstream?
There was a problem hiding this comment.
I agree this seems a little off. My main motivation for including this special case here instead of upstream in dask.sizeof was because this is tied specifically to how distributed handles zero-strided arrays. In theory, dask.sizeof could be used independently of distributed. Although I'm not sure how common that is (I suspect probably not very common).
We could move this upstream to dask. Does that seem more reasonable than what I have here?
There was a problem hiding this comment.
To be honest I'm not sure what you mean by zero-strided numpy array. This is an array that is stored as, say a 2d array, but uses stride tricks to appear to be 3d or 4d or somethiing?
In that case does dask.sizeof report the correct size of the object in memory? I guess I'm not sure how what distributed does here affects things. If we have one of these objects in memory, how much RAM does it take up? I would think that this would be independent of which library is asking for the size.
|
Checking in again here @jrbourbeau . I know that you're busy these days so no rush. Should we remove the sizeof change here, merge the rest of this in, and change the sizeof implementation in dask/dask? |
|
Yeah, following our conversation the other day I think that sounds good. Just pushed a commit to revert the |
|
Thanks @jrbourbeau . This is in. |
Use code written by jrbourbeau in a private thread. Add tests. Related to dask/distributed#3180
Use code written by jrbourbeau in a private thread. Add tests. Related to dask/distributed#3180
This PR updates how NumPy arrays with zero-strided axes are serialized in order to avoid transmitting the entire array where possible. Instead dimensions with a zero stride are collapsed, this collapsed array is serialized, and upon deserialization the smaller array is then broadcasted to the appropriate shape.
Taking the following as an example:
Instead of serializing the entire
xarray (withx.shape = (5,3)), we would serialize the smaller[0, 1, 2]array. Then after deserializing the smaller array, we usenp.broadcast_toto reproduce the expectedxarray with shape(5,3).