Skip to content

Efficiently serialize zero strided NumPy arrays#3180

Merged
mrocklin merged 7 commits intodask:masterfrom
Quansight-Labs:serialize-zero-strided
Nov 14, 2019
Merged

Efficiently serialize zero strided NumPy arrays#3180
mrocklin merged 7 commits intodask:masterfrom
Quansight-Labs:serialize-zero-strided

Conversation

@jrbourbeau
Copy link
Member

This PR updates how NumPy arrays with zero-strided axes are serialized in order to avoid transmitting the entire array where possible. Instead dimensions with a zero stride are collapsed, this collapsed array is serialized, and upon deserialization the smaller array is then broadcasted to the appropriate shape.

Taking the following as an example:

In [1]: import numpy as np

In [2]: a = np.arange(3)

In [3]: x = np.broadcast_to(a, (5, 3))

In [4]: x
Out[4]:
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])

Instead of serializing the entire x array (with x.shape = (5,3)), we would serialize the smaller [0, 1, 2] array. Then after deserializing the smaller array, we use np.broadcast_to to reproduce the expected x array with shape (5,3).

@TomAugspurger
Copy link
Member

TomAugspurger commented Oct 29, 2019

That's a neat trick.

Any reason you marked this as a WIP?

No idea what is going on with the CI:

Traceback (most recent call last):

  File "/home/travis/miniconda/envs/test-environment/lib/python3.5/site-packages/_pytest/config/__init__.py", line 451, in _importconftest

    return self._conftestpath2mod[key]

KeyError: PosixPath('/home/travis/build/dask/distributed/conftest.py')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/travis/miniconda/envs/test-environment/lib/python3.5/site-packages/_pytest/config/__init__.py", line 567, in import_plugin

    __import__(importspec)

  File "/home/travis/build/dask/distributed/distributed/__init__.py", line 3, in <module>

    from .actor import Actor, ActorFuture

  File "/home/travis/build/dask/distributed/distributed/actor.py", line 7, in <module>

    from .client import Future, default_client

  File "/home/travis/build/dask/distributed/distributed/client.py", line 28, in <module>

    from dask.optimization import SubgraphCallable

ImportError: cannot import name 'SubgraphCallable'

During handling of the above exception, another exception occurred:

@jrbourbeau
Copy link
Member Author

Any reason you marked this as a WIP?

I wanted to run the full CI with these changes...which is currently experiencing some problems. I've not seen this problem before. Seems to only happen on the 3.5 build, I wonder if it's related to dropping Python 3.5 in dask

@jrbourbeau jrbourbeau changed the title [WIP] Efficiently serialize zero strided NumPy arrays Efficiently serialize zero strided NumPy arrays Oct 31, 2019
@mrocklin
Copy link
Member

mrocklin commented Nov 4, 2019

@jrbourbeau should this go in?

@mrocklin
Copy link
Member

mrocklin commented Nov 4, 2019

Or rather, in principle this seems fine to me. If you're comfortable merging, then please do.

@mrocklin
Copy link
Member

mrocklin commented Nov 5, 2019

Merging this later today if there are no further comments.

@jrbourbeau
Copy link
Member Author

Thanks for giving this a look @mrocklin! The only additional change I think we should make here is handling sizeof(x) when x is a zero-strided array. As dask.sizeof.sizeof can be used outside the context of distributed, it probably makes more sense for me to update safe_sizeof here in distributed.

@mrocklin
Copy link
Member

mrocklin commented Nov 5, 2019

OK. I'm happy to defer to your judgement here.

try:
# Handle zero-strided NumPy arrays
if np is not None and isinstance(obj, np.ndarray) and 0 in obj.strides:
from .protocol.numpy import _zero_strided_slice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems off to include this special case in this function. Is this a case that we can fix upstream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this seems a little off. My main motivation for including this special case here instead of upstream in dask.sizeof was because this is tied specifically to how distributed handles zero-strided arrays. In theory, dask.sizeof could be used independently of distributed. Although I'm not sure how common that is (I suspect probably not very common).

We could move this upstream to dask. Does that seem more reasonable than what I have here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I'm not sure what you mean by zero-strided numpy array. This is an array that is stored as, say a 2d array, but uses stride tricks to appear to be 3d or 4d or somethiing?

In that case does dask.sizeof report the correct size of the object in memory? I guess I'm not sure how what distributed does here affects things. If we have one of these objects in memory, how much RAM does it take up? I would think that this would be independent of which library is asking for the size.

@mrocklin
Copy link
Member

Checking in again here @jrbourbeau . I know that you're busy these days so no rush. Should we remove the sizeof change here, merge the rest of this in, and change the sizeof implementation in dask/dask?

@jrbourbeau
Copy link
Member Author

Yeah, following our conversation the other day I think that sounds good. Just pushed a commit to revert the sizeof.py changes (although I did leave the added safe_sizeof tests)

@mrocklin mrocklin merged commit fd98e30 into dask:master Nov 14, 2019
@mrocklin
Copy link
Member

Thanks @jrbourbeau . This is in.

@jrbourbeau jrbourbeau deleted the serialize-zero-strided branch November 14, 2019 17:07
Carreau added a commit to Carreau/dask that referenced this pull request Jun 23, 2020
Use code written by jrbourbeau in a private thread. Add tests.
Related to dask/distributed#3180
Carreau added a commit to Carreau/dask that referenced this pull request Jun 23, 2020
Use code written by jrbourbeau in a private thread. Add tests.
Related to dask/distributed#3180
TomAugspurger pushed a commit to dask/dask that referenced this pull request Jun 24, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants