Blockwise array creation redux #7417
Conversation
@rjzamora @jrbourbeau your work on import checks is already paying dividends: https://github.com/dask/dask/pull/7417/checks?check_run_id=2149288104
rjzamora
left a comment
There was a problem hiding this comment.
Thanks for working on this @ian-r-rose ! I know this is still WIP, but I left some initial thoughts since I was excited to take a look.
dask/layers.py
Outdated
```python
chunk_shape = tuple(chunk[i] for i, chunk in zip(idx, self.chunks))
array_location = tuple(
    (start[i], start[i + 1]) for i, start in zip(idx, self.starts)
)
return {
    "shape": self.shape,
    "num-chunks": self.num_chunks,
    "array-location": array_location,
    "chunk-shape": chunk_shape,
}
```
Perhaps we should make it possible to avoid unnecessary logic for "simple" cases like zeros/full/ones?
This block_info dict is modeled after that in Array.map_blocks (though it's not identical, since that can include dtypes). My thinking was that, since they are similar operations, the API should be similar, and similar data should be provided to the tasks.
That being said, it's true that in simpler cases there is some unnecessary logic here when constructing the block info. We could revert these changes to keep CreateArrayDeps simple, and then make a new blockwise layer for things like from_array. So to me it seems the choice is between having a larger number of more specific layer classes, or a smaller number of more general ones like this one.
Or __init__ could take something like block_info=True to indicate that it should provide more than just the chunk shape. I generally like to have more predictable outputs for functions, but can be talked out of it :)
I generally like to have more predictable outputs for functions, but can be talked out of it :)
Makes sense - I do prefer consistency (as you already have it), so I'm not really motivated to talk you out of this unless we discover a need for performance optimizations.
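To make the discussion concrete, here is a minimal sketch (hypothetical function and values, not code from this PR) of how a creation function might consume a `block_info` dict with the keys shown in the snippet above; a "simple" case like `zeros` only ever touches `"chunk-shape"`:

```python
import numpy as np

def create_chunk(block_info):
    # Hypothetical creation function consuming the block_info dict
    # discussed above (keys taken from the snippet).
    shape = block_info["chunk-shape"]  # shape of this chunk
    # array-location gives the (start, stop) slice of the full array
    # per dimension; a zeros/full-style function never needs it.
    return np.zeros(shape)

info = {
    "shape": (4, 4),
    "num-chunks": (2, 2),
    "array-location": ((0, 2), (0, 2)),
    "chunk-shape": (2, 2),
}
chunk = create_chunk(info)
```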
dask/layers.py
Outdated
```python
module,
name,
chunks,
serialize(seeds),
```
It may be easiest to separately serialize each seed, so that we can still do `block_info["seed"] = self.seeds[n]` in the getitem definition without needing any `__dask_distributed_unpack__` logic in this class. However, we would then need to wrap the underlying creation function to handle deserialization of `block_info["seed"]`.
If we do use a function wrapper, it probably makes sense to fully move the random_state_data operation from the client to the worker (and avoid the need to ship the seeds through the scheduler altogether). Do we know any RNG experts? Can we simply shift the default/random seed by some factor related to the chunk index?
Yeah, I haven't fully thought through the serialization story here. But at least, I think that serialize does drill down into the list and serialize the seeds separately, c.f. https://github.com/dask/distributed/blob/bef0308962345303123aba7ec6730757a61a4dfc/distributed/protocol/serialize.py#L263-L284 , and it shouldn't be necessary to deserialize them on the scheduler.
It would definitely be nice to be able to move a smaller representation of the random state to the workers, but at least this implementation doesn't ship more seed data than the current implementation.
But at least, I think that serialize does drill down into the list and serialize the seeds separately, c.f. https://github.com/dask/distributed/blob/bef0308962345303123aba7ec6730757a61a4dfc/distributed/protocol/serialize.py#L263-L284 , and it shouldn't be necessary to deserialize them on the scheduler.
Ah - you are correct that `serialize` will internally handle the elements of `seeds` separately. I hadn't considered the possibility of simply splitting the header/frames on the scheduler, but that may actually be a good way to go.
Regarding the question of moving random state generation to the workers: I like the idea, but I also don't think you should worry about it in this PR.
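On the RNG question above, one possibility (a sketch of the idea, not what this PR implements) is NumPy's `SeedSequence`, which can derive statistically independent streams from a single root seed plus the flat chunk index, so only one integer would need to be shipped to the workers:

```python
import numpy as np

def chunk_random(root_seed, flat_chunk_index, shape):
    # Derive an independent stream for this chunk from the root seed
    # and the chunk's position; no per-chunk seed data is shipped.
    ss = np.random.SeedSequence([root_seed, flat_chunk_index])
    rng = np.random.default_rng(ss)
    return rng.random(shape)

a = chunk_random(42, 0, (2, 3))
b = chunk_random(42, 1, (2, 3))
```

This is deterministic per (seed, chunk) pair, and `SeedSequence` is designed so that sibling streams are independent, unlike naively shifting the seed by the chunk index.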
I was just experimenting with this. It seems that we will need to add an explicit `iterate_collection=` argument to `serialize`. Otherwise, we cannot guarantee that the scheduler will be able to access individual elements of a list/dict without needing to deserialize. I can/will submit a PR to distributed to add this feature, but this means a proper `__dask_distributed_unpack__` solution here will not work with the latest release of distributed.
I can/will submit a PR to distributed to add this feature, but this means a proper `__dask_distributed_unpack__` solution here will not work with the latest release of distributed.
On second thought, I don't think it makes sense to do this. In general, it seems like a much cleaner solution to just call `serialize` separately on each element on the client, rather than adding an extra step of splitting the header/frames on the scheduler.
On second thought, I don't think it makes sense to do this. In general, it seems like a much cleaner solution to just call `serialize` separately on each element on the client,
Yep, after taking another look at it, I agree with you
rather than adding an extra step of splitting the header/frames on the scheduler.
Can you elaborate more on what you mean here?
In general, I'm having a tough time reasoning about how deserialization should look on the worker. What would the idiomatic way of doing that look like? i.e.,
- How does one identify whether one should deserialize? If it is coming back as `distributed.protocol.Serialized`, what would be the best way to identify that without relying on `distributed` being installed?
- I ran into a couple of problems around making sure the numpy deserializer is registered on the worker. What am I doing wrong?

(I'm not demanding answers to the above, just what I'm thinking about :) )
rather than adding an extra step of splitting the header/frames on the scheduler.
Can you elaborate more on what you mean here?
If you call serialize on an entire collection, you will get a (header, frames) tuple, and so the scheduler will need to process/split the header and then convert this single tuple to a list of (header, frame) tuples. The logic is not that complicated, but it is messier than simply serializing the individual elements to begin with.
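The tradeoff can be illustrated with `pickle` standing in for distributed's `serialize` (which similarly returns an opaque header-plus-frames payload): serializing the whole collection yields one blob the scheduler would have to split apart, while serializing per element lets the scheduler forward the payload for a single chunk untouched:

```python
import pickle

seeds = [11, 22, 33]

# One call on the whole collection: a single blob; extracting
# element n later means deserializing (or splitting) everything.
whole = pickle.dumps(seeds)

# One call per element: the payload for chunk n can be routed to
# the right task without ever deserializing it in between.
frames = [pickle.dumps(s) for s in seeds]
```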
In general, I'm having a tough time reasoning about how deserialization should look on the worker. What would the idiomatic way of doing that look like? i.e.,
I have experimented with ways to do this, but I haven't established anything that I would call idiomatic yet...
- How does one identify whether one should deserialize? If it is coming back as `distributed.protocol.Serialized`, what would be the best way to identify that without relying on `distributed` being installed?
You have certainly isolated the crux of the problem here. It is easy to decide when to serialize (within a `__dask_distributed_pack__` definition). However, it is not obvious when the worker will need to deserialize one or all of its arguments. In my first pass, I was relying on pickle rather than serialize/deserialize, so the function simply needed to check if the argument(s) in question was a `bytes` object. In #7415, I just moved over to serialize/deserialize, and I am relying on an explicit "serialized" label. I don't "love" that solution, but it seems to work.
- I ran into a couple of problems around making sure the numpy deserializer is registered on the worker. What am I doing wrong?
Not sure - but I'm hoping we won't need to do anything like this to check if the data is serialized.
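A rough sketch of the explicit-label approach described above (names are hypothetical, and `pickle` stands in for distributed's deserialize): arguments arrive either raw or as a `("serialized", payload)` pair, and the wrapper only unpacks the labeled ones, so it never has to guess:

```python
import pickle

def maybe_deserialize(arg):
    # Only unpack arguments that were explicitly labeled when packed;
    # everything else passes through untouched.
    if isinstance(arg, tuple) and len(arg) == 2 and arg[0] == "serialized":
        return pickle.loads(arg[1])
    return arg

def wrapped(func, *args):
    # Worker-side wrapper around the real task function.
    return func(*[maybe_deserialize(a) for a in args])

packed = ("serialized", pickle.dumps([1, 2, 3]))
```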
Yeah, I am probably doing something wrong, but I am finding that the first time I need to deserialize a numpy array on a worker it fails because that protocol hasn't been registered yet. Subsequent deserializations work fine.
I'm currently forcing the registration by relying on the side effect of this import, but I don't view that as a real solution. Perhaps @jrbourbeau has some insight as to what might be going wrong here?
This is very cool to see @ian-r-rose
James has made a quick fix for the failing sparse test in #7421, btw
```diff
-dsk = getem(
-    get_from,
+dsk = graph_from_arraylike(
```
@rjzamora wanted to flag this as where we'll have to think through some of the array inlining logic.
Got it. So, the original code allows you to include the array in the graph once (with a dedicated key), or to inline it in every task. So far, this PR is effectively inlining the array in every task by including it in the IO function (in graph_from_arraylike). Is that correct?
It seems like you could support `inline_array=False`, but you may need to further expand `CreateArrayDeps` to include optional args/kwargs that should be the same for all keys (and make it possible to specify these arguments in `BlockwiseCreateArray`). I guess if you can do this, you could avoid using partial to embed the array in the IO function altogether?
Got it. So, the original code allows you to include the array in the graph once (with a dedicated key), or to inline it in every task. So far, this PR is effectively inlining the array in every task by including it in the IO function (in `graph_from_arraylike`). Is that correct?

It seems like you could support `inline_array=False`, but you may need to further expand `CreateArrayDeps` to include optional args/kwargs that should be the same for all keys (and make it possible to specify these arguments in `BlockwiseCreateArray`). I guess if you can do this, you could avoid using partial to embed the array in the IO function altogether?
Yes, you have this exactly right -- I'm planning to re-introduce non-inlining, but haven't tackled it yet. For the case of lazy arrays like Zarr, or small ones, this should already do what we want.
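The two graph shapes under discussion can be sketched with plain task dicts (toy keys, not the real layer output): with inlining, the source array object is baked into every task; with a dedicated key, it appears once and the tasks reference that key:

```python
import operator
import numpy as np

arr = np.arange(4)
getter = operator.getitem

# Inlined: the array object is embedded in each task.
inlined = {
    ("x", 0): (getter, arr, slice(0, 2)),
    ("x", 1): (getter, arr, slice(2, 4)),
}

# Not inlined: one dedicated key, referenced by every task.
keyed = {
    "original-x": arr,
    ("x", 0): (getter, "original-x", slice(0, 2)),
    ("x", 1): (getter, "original-x", slice(2, 4)),
}
```

Inlining avoids an extra key and a scheduler round-trip per access, at the cost of shipping the (possibly large, or lazily-opened) array with every task, which is why lazy stores like Zarr are the easy case here.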
Thanks for the xref @GenevieveBuckley! Hopefully this will indeed help with some of their pain points (though not anything to do with
Thanks, I'll try to rebase
high-level graph. Will need to think about what happens when the user materializes client side.
dependencies for a random graph.
@ian-r-rose just double-checking, we're just waiting on #8542 to merge this, right?

Yes, that's right

Shall we rerun tests and merge?

I've just merged

All passing except for windows37
jrbourbeau
left a comment
Thanks @ian-r-rose @gjoseph92 @rjzamora
This is a small follow-up to #7417 which moves `import`s of `cached_cumsum` to be from `dask.utils` (its new location) instead of `dask.array.slicing` (where it used to be)
This is due to a change which was introduced in dask-2022.1.1. Seems to be coming from: dask/dask#7417 Related pull request: hyperspy#2888
Dask 2022.01.1 renamed `da.core.getem` to `da.core.graph_from_arraylike` and changed the interface (see dask/dask#7417). These functions are still the most convenient way to create a dask array from `get_chunk` calls and a chunk specification, so add a shim that picks the appropriate function. We also keep the more convenient `getem` API (I guess until the minimum version of dask for katdal no longer has `getem`).
This is an attempt to improve on the slight performance drop of `da.map_blocks`. Taking our cue from the internals of the new `da.graph_from_arraylike` function, we call `dask.blockwise.blockwise` directly. This allows us to construct a custom graph with three arguments to our putter function (adding the actual chunk to the usual two getter arguments). We need this because the returned graph is now a high-level one that is harder to modify than an old-school dict. This improves performance to equal that of the original getem version, even surpassing it when there are many chunks. The only downside is that classes like `ArraySliceDep` are brand new and might change again in the near future, having been added in the recent dask/dask#7417 PR.
Follow-on work to #6931 and #7281, supersedes #6984. This introduces more blockwise array creation routines, including `from_array`/`from_zarr`, and (at least a subset of) random arrays. Following #7381, I am avoiding numpy/pandas imports in all graph materialization routines, as well as deserializing any helper arrays.