
[WIP] Optimization to swap getitem and elementwise operations #755

Closed

jcrist wants to merge 2 commits into dask:master from jcrist:opt_ufunc_getitem

Conversation

@jcrist (Member) commented Sep 25, 2015

Currently swaps `getitem` and ufuncs properly, but doesn't merge `getitem_broadcast` with lower `getitem`/`getarray` operations. Also still needs tests.

Fixes #746.

Currently swaps getitem and ufuncs properly, but doesn't merge
`getitem_broadcast` with lower getitem operators. Also still needs
tests.
@jcrist (Member, Author) commented Sep 25, 2015

@shoyer, I'm curious about your use case for this. Right now in dask master we do some optimizations to reduce disk I/O - chunks that are never used aren't read from disk. But the reduction in I/O is never more granular than the chunk size of the dask.array. Generally (IMO) the chunks of a dask array should already be small enough that I'm not sure how much performance gain finishing up this optimization will get us. I may be misunderstanding your problem here, though.

Once I figured out some edge case behaviors, this wasn't that difficult to get working, so it can certainly be finished up. I just want to make sure it's worth it.

@shoyer (Member) commented Sep 25, 2015

So the use case here is gigantic 3D arrays (say ~1 TB on disk) representing movie-like data. This is typical for both climate/weather data and the neuroimaging data @freeman-lab works with.

We need to be able to do generic preprocessing on such datasets before we know what our access pattern will be, which may be either as 2D images or 1D time-series or even both (this is especially common for exploratory analysis). For datasets of this size, it's impossible to pick a single chunk size a priori that provides reasonable performance in both cases.

@jcrist (Member, Author) commented Sep 25, 2015

So you're asking for rechunking based on access patterns, then? That sounds like something much better handled by a higher-level API (xray/blaze type things) rather than dask itself. We can certainly fuse getitem slices, but we can't (easily) change the chunking - it's built into the graph.

@shoyer (Member) commented Sep 26, 2015

If it's a single big file on disk (or a collection of medium-sized files, such as in my xray + dask blogpost), it's possible to do very coarse chunking at first, and then rechunk when necessary. I don't need a high-level library to do the chunking, but it's convenient to be able to defer it until necessary. In some cases chunking won't even be necessary at all (e.g., if I'm just plotting a single image or time series).

@jcrist (Member, Author) commented Oct 2, 2015

I think it would help me to have a few examples of your expected behavior, as right now I think we're talking past each other. From my point of view, the only thing that swapping these operations will do is get a slightly higher level of granularity on the disk I/O, but I'm unsure how valuable this will actually be in practice.

What I'm looking for is a few simple common situations and expected results, including the following information:

  1. On-disk storage format (e.g. dimensions and chunking for netcdf4/hdf5 formats)
  2. Expression being run
  3. Expected result of the optimization pass

A few simple cases of this would help me better understand what you're trying to do.

Member (inline review comment)

Thoughts on repeated application as in ((x + 1) * 2)[:5]? Perhaps this needs another loop?
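The repeated case is just the elementwise/getitem swap applied twice; the identity is easy to check with plain NumPy (a sketch of the algebra only, not the optimization pass itself):

```python
import numpy as np

x = np.arange(10)

# Push the getitem through each elementwise op in turn:
# ((x + 1) * 2)[:5]  ==  ((x + 1)[:5]) * 2  ==  (x[:5] + 1) * 2
eager = ((x + 1) * 2)[:5]
pushed = (x[:5] + 1) * 2

assert np.array_equal(eager, pushed)
```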

Member Author (inline review comment)

That could definitely be added. I'm more interested in getting a set of example use cases right now, as I'm unsure how valuable this optimization will be in practice. As it stands, this gets us sub-chunk granularity, but for all test cases I've played with this only slightly reduces I/O, with the cost of added complexity.

Member (inline review comment)

OK, sounds reasonable to wait on @shoyer then.

@shoyer (Member) commented Oct 4, 2015

Here's one to get you started (borrowed from my dask + xray blogpost):

  1. On disk, we have one or more uncompressed arrays in a binary (mmap-compatible) file format (e.g., netcdf3 or npy, but there are a million domain-specific file formats like this). Concatenated and loaded into dask, the original array looks like: `dask.array<concate..., shape=(21916, 256, 512), dtype=int16, chunksize=(124, 256, 512)>`
  2. The original array is compressed as int16, but it really should be floating-point data in memory. To decompress, we write something like `scaled = scale_factor * original + add_offset` with, e.g., `scale_factor = 0.01` and `add_offset = 300`. Note: this step is done automatically by xray, based on metadata associated with the array.
  3. Now I want to plot a map (for one time, e.g., `scaled[0, :, :]`) and a time series (for one location, e.g., `scaled[:, 0, 0]`). Each of these should move the indexing operation to the inner-most array, e.g., `scaled[0, :, :] = (scale_factor * original + add_offset)[0, :, :] = scale_factor * original[0, :, :] + add_offset`.
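The pushdown in step 3 is just a slice distributing over an affine transform, which can be sanity-checked on a toy array (made-up small sizes standing in for the real dataset):

```python
import numpy as np

scale_factor, add_offset = 0.01, 300
original = np.arange(2 * 3 * 4, dtype=np.int16).reshape(2, 3, 4)

scaled = scale_factor * original + add_offset

# Slicing the result equals scaling the slice, so only the sliced
# portion of `original` ever needs to be read from disk.
assert np.allclose(scaled[0, :, :],
                   scale_factor * original[0, :, :] + add_offset)
assert np.allclose(scaled[:, 0, 0],
                   scale_factor * original[:, 0, 0] + add_offset)
```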

Another example: Instead of my array being stored on disk, it's actually hosted by a remote server via OpenDAP. The limiting factor is bandwidth, not disk IO.

If the user knows exactly what sort of streaming analysis they want to run ahead of time (e.g., time-series vs. image based), then chunking preemptively makes sense. But it's highly convenient to be able to do these sorts of operations automatically and even ahead of chunking, because (1) the ideal chunking varies depending on the use case and (2) for many use cases that involve subsetting the data it's actually perfectly fine to load everything being used into memory in a single chunk. For these reasons, I want to defer the explicit chunking until step 3 (or not bother with it at all).

@shoyer (Member) commented Oct 4, 2015

So I just realized that my example here (rewriting after a concatenate) wouldn't even work with this change. That's actually OK -- in reality, we concatenate after scaling in xray. But this broader set of optimizations is an excellent example of something that works much more smoothly with biggus than dask.

Edit: I realized now that because of how dask's stack and concatenate work, by only manipulating the task graph, these sorts of optimizations would actually work fine.

@jcrist (Member, Author) commented Oct 5, 2015

For your example above, only considering what data is read:

Current Behavior

  • scaled[0, :, :] reads in scaled[:124, :, :] (1 chunk, shape = (124, 256, 512))
  • scaled[:, 0, 0] reads in scaled[:, :, :] (176 chunks, shape = (124, 256, 512))

What could be done (not quite covered in this PR)

  • scaled[0, :, :] reads in scaled[0, :, :] (1 chunk, shape = (256, 512))
  • scaled[:, 0, 0] reads in scaled[:, 0, 0] (176 chunks, shape = (124,))

What can't be done (easily)

  • Rechunking into larger chunks. Chunks of (124,) are small, and could probably be handled more efficiently in larger sizes. rechunk can handle this though.
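The accounting above can be sketched with a small helper that counts the chunks touched by a selection; `touched_chunks` is a hypothetical illustration of the bookkeeping, not anything in dask:

```python
import math

def touched_chunks(shape, chunkshape, sel):
    """Number of chunks loaded when each axis is either kept whole
    ('all') or indexed at a single position (an int).  Chunk-granular
    I/O loads every touched chunk in full."""
    total = 1
    for n, c, s in zip(shape, chunkshape, sel):
        total *= math.ceil(n / c) if s == "all" else 1
    return total

shape, chunkshape = (21916, 256, 512), (124, 256, 512)

# scaled[0, :, :] touches a single chunk either way, but scaled[:, 0, 0]
# touches every chunk along the time axis, so without the pushdown a
# whole (124, 256, 512) block is read per chunk just to keep 124 values.
one_image = touched_chunks(shape, chunkshape, (0, "all", "all"))
one_series = touched_chunks(shape, chunkshape, ("all", 0, 0))
```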

It also might be easier to only handle the case where the slice is being pushed down to a load operation (da.from_array), although I haven't thought about that too much. I'm unsure if this optimization pass would be useful except in minimizing the loaded data.

Is this suitable/what you wanted?

@shoyer (Member) commented Oct 5, 2015

@jcrist Yes, that sounds great! I agree that rechunking into larger chunks automatically can't be done easily, but that's perfectly fine. I'm also not sure if there are other use cases for pushing down slices.

- Handles `None` in slices
- Attempts to fuse slices (does it wrong for now)
- Handles larger set of elemwise functions

Still no tests, still doesn't work properly, but now it also errors :)
@jcrist (Member, Author) commented Oct 8, 2015

As discussed on gitter, this is actually less correct now after the latest commit - fusing slices without knowledge of the underlying shapes causes problems in the presence of broadcasting. I'm inclined to push the fusing down to runtime, by making `getitem_broadcast` take an iterable of indices.
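For the straightforward part of the problem (composing two positive-step slices along one axis, shapes known, no broadcasting involved), the fusion can be sketched as below; `fuse_slices` is a hypothetical helper, not the code in this PR:

```python
def fuse_slices(outer_len, s1, s2):
    """Return one slice equivalent to applying s1 then s2 to a 1-D
    sequence of length outer_len.  Sketch only: assumes positive
    (or None) steps, which is where fusion is straightforward."""
    start1, stop1, step1 = s1.indices(outer_len)
    inner_len = max(0, (stop1 - start1 + step1 - 1) // step1)
    start2, stop2, step2 = s2.indices(inner_len)
    return slice(start1 + start2 * step1,
                 start1 + stop2 * step1,
                 step1 * step2)

# Check against applying the two slices one after the other.
data = list(range(20))
for s1, s2 in [(slice(2, 18, 2), slice(1, None, 3)),
               (slice(None), slice(5, 15))]:
    assert data[s1][s2] == data[fuse_slices(len(data), s1, s2)]
```

Negative steps, `None` (newaxis) entries, and broadcast dimensions are exactly where this gets tricky, which matches the commit note above.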

@mrocklin (Member)

This seems to be stale. Closing for now.
