
[WIP] Optimization to swap getitem and elementwise operations #755

Closed

jcrist wants to merge 2 commits into dask:master from jcrist:opt_ufunc_getitem

Conversation

@jcrist (Member) commented Sep 25, 2015

Currently swaps `getitem` and ufuncs properly, but doesn't merge `getitem_broadcast` with lower `getitem`/`getarray` operations. Also still needs tests.

Fixes #746.

Currently swaps getitem and ufuncs properly, but doesn't merge
`getitem_broadcast` with lower getitem operators. Also still needs
tests.
@jcrist (Member, Author) commented Sep 25, 2015

@shoyer, I'm curious about your use case for this. Right now in dask master we do some optimizations to reduce disk I/O - chunks that are never used aren't read from disk. But the reduction in I/O is never more granular than the chunk size of the dask.array. Generally (IMO) the chunks of a dask array should already be small enough that I'm not sure how much performance gain finishing up this optimization will get us. I may be misunderstanding your problem here, though.

Once I figured out some edge case behaviors, this wasn't that difficult to get working, so it can certainly be finished up. I just want to make sure it's worth it.

@shoyer (Member) commented Sep 25, 2015

So the use case here is gigantic 3D arrays (say ~1 TB on disk) representing movie-like data. This is typical for both climate/weather data and the neuroimaging data @freeman-lab works with.

We need to be able to do generic preprocessing on such datasets before we know what our access pattern will be, which may be either as 2D images or 1D time-series or even both (this is especially common for exploratory analysis). For datasets of this size, it's impossible to pick a single chunk size a priori that provides reasonable performance in both cases.

@jcrist (Member, Author) commented Sep 25, 2015

So you're asking for rechunking based on access patterns, then? That sounds like something much better handled by a higher-level API (xray/blaze type things) rather than dask itself. We can certainly fuse getitem slices, but we can't (easily) change the chunking - it's built into the graph.

@shoyer (Member) commented Sep 26, 2015

If it's a single big file on disk (or a collection of medium-sized files, such as in my xray + dask blogpost), it's possible to do very coarse chunking at first, and then rechunk when necessary. I don't need a high-level library to do the chunking, but it's convenient to be able to defer it until necessary. In some cases chunking won't even be necessary at all (e.g., if I'm just plotting a single image or time series).

@jcrist (Member, Author) commented Oct 2, 2015

I think it would help me to have a few examples of your expected behavior, as right now I think we're talking past each other. From my point of view, the only thing that swapping these operations will do is get a slightly higher level of granularity on the disk I/O, but I'm unsure how valuable this will actually be in practice.

What I'm looking for is a few simple common situations and expected results, including the following information:

  1. On-disk storage format (e.g. dimensions and chunking for netcdf4/hdf5 formats)
  2. Expression being run
  3. Expected result of the optimization pass

A few simple cases of this would help me better understand what you're trying to do.

Member (inline review comment)

Thoughts on repeated application as in ((x + 1) * 2)[:5]? Perhaps this needs another loop?
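The repeated case is just the elementwise/getitem swap applied twice; the identity is easy to check with plain NumPy (a sketch of the algebra only, not the optimization pass itself):

```python
import numpy as np

x = np.arange(10)

# Push the getitem through each elementwise op in turn:
# ((x + 1) * 2)[:5]  ==  ((x + 1)[:5]) * 2  ==  (x[:5] + 1) * 2
eager = ((x + 1) * 2)[:5]
pushed = (x[:5] + 1) * 2

assert np.array_equal(eager, pushed)
```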

Member Author (inline review comment)

That could definitely be added. I'm more interested in getting a set of example use cases right now, as I'm unsure how valuable this optimization will be in practice. As it stands, this gets us sub-chunk granularity, but for all test cases I've played with this only slightly reduces I/O, with the cost of added complexity.

Member (inline review comment)

OK, sounds reasonable to wait on @shoyer then.

@shoyer (Member) commented Oct 4, 2015

Here's one to get you started (borrowed from my dask + xray blogpost):

  1. On disk, we have one or more uncompressed arrays in a binary (mmap-compatible) file format (e.g., netcdf3 or npy, but there are a million domain-specific file formats like this). Concatenated and loaded into dask, the original array looks like: `dask.array<concate..., shape=(21916, 256, 512), dtype=int16, chunksize=(124, 256, 512)>`
  2. The original array is compressed as int16, but it really should be floating-point data in memory. To decompress, we write something like `scaled = scale_factor * original + add_offset` with, e.g., `scale_factor = 0.01` and `add_offset = 300`. Note: this step is done automatically by xray, based on metadata associated with the array.
  3. Now I want to plot a map (for one time, e.g., `scaled[0, :, :]`) and a time series (for one location, e.g., `scaled[:, 0, 0]`). Each of these should move the indexing operation to the inner-most array, e.g., `scaled[0, :, :] = (scale_factor * original + add_offset)[0, :, :] = scale_factor * original[0, :, :] + add_offset`.
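The pushdown in step 3 is just a slice distributing over an affine transform, which can be sanity-checked on a toy array (made-up small sizes standing in for the real dataset):

```python
import numpy as np

scale_factor, add_offset = 0.01, 300
original = np.arange(2 * 3 * 4, dtype=np.int16).reshape(2, 3, 4)

scaled = scale_factor * original + add_offset

# Slicing the result equals scaling the slice, so only the sliced
# portion of `original` ever needs to be read from disk.
assert np.allclose(scaled[0, :, :],
                   scale_factor * original[0, :, :] + add_offset)
assert np.allclose(scaled[:, 0, 0],
                   scale_factor * original[:, 0, 0] + add_offset)
```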

Another example: Instead of my array being stored on disk, it's actually hosted by a remote server via OpenDAP. The limiting factor is bandwidth, not disk IO.

If the user knows exactly what sort of streaming analysis they want to run ahead of time (e.g., time-series vs. image based), then chunking preemptively makes sense. But it's highly convenient to be able to do these sorts of operations automatically and even ahead of chunking, because (1) the ideal chunking varies depending on the use case and (2) for many use cases that involve subsetting the data it's actually perfectly fine to load everything being used into memory in a single chunk. For these reasons, I want to defer the explicit chunking until step 3 (or not bother with it at all).

@shoyer (Member) commented Oct 4, 2015

So I just realized that my example here (rewriting after a concatenate) wouldn't even work with this change. That's actually OK -- in reality, we concatenate after scaling in xray. But this broader set of optimizations is an excellent example of something that works much more smoothly with biggus than dask.

Edit: I realized now that because of how dask's stack and concatenate work, by only manipulating the task graph, these sorts of optimizations would actually work fine.

@jcrist (Member, Author) commented Oct 5, 2015

For your example above, only considering what data is read:

Current Behavior

  • scaled[0, :, :] reads in scaled[:124, :, :] (1 chunk, shape = (124, 256, 512))
  • scaled[:, 0, 0] reads in scaled[:, :, :] (176 chunks, shape = (124, 256, 512))

What could be done (not quite covered in this PR)

  • scaled[0, :, :] reads in scaled[0, :, :] (1 chunk, shape = (256, 512))
  • scaled[:, 0, 0] reads in scaled[:, 0, 0] (176 chunks, shape = (124,))

What can't be done (easily)

  • Rechunking into larger chunks. Chunks of (124,) are small, and could probably be handled more efficiently in larger sizes. rechunk can handle this though.
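The accounting above can be sketched with a small helper that counts the chunks touched by a selection; `touched_chunks` is a hypothetical illustration of the bookkeeping, not anything in dask:

```python
import math

def touched_chunks(shape, chunkshape, sel):
    """Number of chunks loaded when each axis is either kept whole
    ('all') or indexed at a single position (an int).  Chunk-granular
    I/O loads every touched chunk in full."""
    total = 1
    for n, c, s in zip(shape, chunkshape, sel):
        total *= math.ceil(n / c) if s == "all" else 1
    return total

shape, chunkshape = (21916, 256, 512), (124, 256, 512)

# scaled[0, :, :] touches a single chunk either way, but scaled[:, 0, 0]
# touches every chunk along the time axis, so without the pushdown a
# whole (124, 256, 512) block is read per chunk just to keep 124 values.
one_image = touched_chunks(shape, chunkshape, (0, "all", "all"))
one_series = touched_chunks(shape, chunkshape, ("all", 0, 0))
```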

It also might be easier to only handle the case where the slice is being pushed down to a load operation (da.from_array), although I haven't thought about that too much. I'm unsure if this optimization pass would be useful except in minimizing the loaded data.

Is this suitable/what you wanted?

@shoyer (Member) commented Oct 5, 2015

@jcrist Yes, that sounds great! I agree that rechunking into larger chunks automatically can't be done easily, but that's perfectly fine. I'm also not sure if there are other use cases for pushing down slices.

- Handles `None` in slices
- Attempts to fuse slices (does it wrong for now)
- Handles larger set of elemwise functions

Still no tests, still doesn't work properly, but now it also errors :)
@jcrist (Member, Author) commented Oct 8, 2015

As discussed on gitter, this is actually less correct now after the latest commit - fusing slices without knowledge of the underlying shapes causes problems in the presence of broadcasting. I'm inclined to push the fusing down to runtime, by making `getitem_broadcast` take an iterable of indices.
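For the straightforward part of the problem (composing two positive-step slices along one axis, shapes known, no broadcasting involved), the fusion can be sketched as below; `fuse_slices` is a hypothetical helper, not the code in this PR:

```python
def fuse_slices(outer_len, s1, s2):
    """Return one slice equivalent to applying s1 then s2 to a 1-D
    sequence of length outer_len.  Sketch only: assumes positive
    (or None) steps, which is where fusion is straightforward."""
    start1, stop1, step1 = s1.indices(outer_len)
    inner_len = max(0, (stop1 - start1 + step1 - 1) // step1)
    start2, stop2, step2 = s2.indices(inner_len)
    return slice(start1 + start2 * step1,
                 start1 + stop2 * step1,
                 step1 * step2)

# Check against applying the two slices one after the other.
data = list(range(20))
for s1, s2 in [(slice(2, 18, 2), slice(1, None, 3)),
               (slice(None), slice(5, 15))]:
    assert data[s1][s2] == data[fuse_slices(len(data), s1, s2)]
```

Negative steps, `None` (newaxis) entries, and broadcast dimensions are exactly where this gets tricky, which matches the commit note above.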

@mrocklin (Member)

This seems to be stale. Closing for now.
