[WIP] Optimization to swap getitem and elementwise operations #755
jcrist wants to merge 2 commits into dask:master from
Conversation
Currently swaps getitem and ufuncs properly, but doesn't merge `getitem_broadcast` with lower getitem operators. Also still needs tests.
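The identity this PR exploits can be checked directly with NumPy (standing in here for a chunked dask array; this is an illustrative check, not code from the PR): for pure elementwise operations, slicing the result equals operating on the slice, but the second form only computes (and, for on-disk data, only reads) the requested elements.

```python
import numpy as np

x = np.arange(100, dtype=float)

# Elementwise then slice vs. slice then elementwise: identical results.
# Pushing the getitem inward means only 5 elements are ever touched.
a = (np.sin(x) + 1)[:5]
b = np.sin(x[:5]) + 1

assert np.allclose(a, b)
```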
@shoyer, I'm curious about your use case for this. Right now in dask master we do some optimizations to reduce disk I/O: chunks that are never used aren't read from disk. But the reduction in I/O is never more granular than the chunk size. Once I figured out some edge-case behaviors, this wasn't that difficult to get working, so it can certainly be finished up. I just want to make sure it's worth it.
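The chunk-granularity point above can be made concrete with a small sketch. `chunks_touched` is a hypothetical helper (not a dask function) showing that chunk-level pruning skips unused chunks, but any chunk a slice grazes is read in full:

```python
def chunks_touched(key, chunksize):
    """Chunk indices a 1-D slice needs (hypothetical illustration of
    chunk-level pruning: untouched chunks are never read, but touched
    chunks are read whole)."""
    start = key.start or 0
    stop = key.stop
    return list(range(start // chunksize, -(-stop // chunksize)))

# A 5-element slice of an array with 100-element chunks still reads
# one full 100-element chunk; straddling a boundary reads two.
assert chunks_touched(slice(0, 5), 100) == [0]
assert chunks_touched(slice(95, 105), 100) == [0, 1]
```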
So the use case here is gigantic 3D arrays (say ~1TB on disk) representing movie-like data. This is typical for both climate/weather data and the neuroimaging data @freeman-lab works with. We need to be able to do generic preprocessing on such datasets before we know what our access pattern will be, which may be either as 2D images or as 1D time series, or even both (this is especially common for exploratory analysis). For datasets of this size, it's impossible to pick a single chunk size a priori that provides reasonable performance in both cases.
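The chunking conflict described above can be quantified with some simple arithmetic (illustrative sizes and a made-up `nchunks` helper, assuming chunk-aligned reads): a layout that makes image reads cheap makes time-series reads expensive, and vice versa.

```python
import math

def nchunks(extent, chunks):
    """Chunks intersected by an axis-aligned region of the given
    extent (hypothetical helper; assumes chunk-aligned regions)."""
    return math.prod(math.ceil(e / c) for e, c in zip(extent, chunks))

shape        = (10_000, 1_000, 1_000)   # (time, y, x), illustrative
time_chunks  = (100, 1_000, 1_000)      # whole images per chunk
space_chunks = (10_000, 100, 100)       # whole time series per chunk

image       = (1, 1_000, 1_000)         # one 2D frame
time_series = (10_000, 1, 1)            # one pixel over all time

# Each layout is 100x worse for the other access pattern.
assert nchunks(image, time_chunks) == 1
assert nchunks(time_series, time_chunks) == 100
assert nchunks(time_series, space_chunks) == 1
assert nchunks(image, space_chunks) == 100
```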
So you're asking for rechunking based on access patterns, then? This sounds like something that would be much better handled by a higher-level API (xray/blaze-type things) rather than by dask itself. We can certainly fuse getitem slices, but we can't (easily) change the chunking: it's built into the graph.
If it's a single big file on disk (or a collection of medium sized files, such as in my xray + dask blogpost), it's possible to do very coarse chunking at first, and then do rechunking when necessary. I don't need a high level library to do the chunking, but it's convenient to be able to defer it until necessary. In some cases chunking won't even be necessary at all (e.g., if I'm just plotting a single image or time series).
I think it would help me to have a few examples of your expected behavior, as right now I think we're talking past each other. From my point of view, the only thing that swapping these operations will do is get a slightly higher level of granularity on the disk I/O, but I'm unsure how valuable this will actually be in practice. What I'm looking for is a few simple common situations and their expected results. A few simple cases of this would help me better understand what you're trying to do.
dask/array/optimization.py
Thoughts on repeated application, as in `((x + 1) * 2)[:5]`? Perhaps this needs another loop?
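Repeated application can be handled by rewriting to a fixed point. This toy rewriter (a sketch on a made-up tuple AST, not dask's actual optimization pass) pushes a `getitem` through binary elementwise ops until no rule fires, handling nested cases like `((x + 1) * 2)[:5]`:

```python
ELEMWISE = {"add", "mul"}

def _push(expr):
    """One rewrite pass; returns (new_expr, changed).
    Toy AST: ('getitem', sub, key) | (op, sub, const) | 'x'."""
    if not isinstance(expr, tuple):
        return expr, False
    head = expr[0]
    if head == "getitem" and isinstance(expr[1], tuple) and expr[1][0] in ELEMWISE:
        op, sub, const = expr[1]
        # (op(sub, const))[key]  ->  op(sub[key], const)
        return (op, ("getitem", sub, expr[2]), const), True
    parts, changed = [], False
    for part in expr[1:]:
        new, c = _push(part)
        parts.append(new)
        changed = changed or c
    return (head, *parts), changed

def push_slices(expr):
    """Apply _push until nothing changes -- the extra loop suggested above."""
    changed = True
    while changed:
        expr, changed = _push(expr)
    return expr

# ((x + 1) * 2)[:5]
e = ("getitem", ("mul", ("add", "x", 1), 2), slice(0, 5))
# -> ('mul', ('add', ('getitem', 'x', slice(0, 5)), 1), 2)
print(push_slices(e))
```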
That could definitely be added. I'm more interested in getting a set of example use cases right now, as I'm unsure how valuable this optimization will be in practice. As it stands, this gets us sub-chunk granularity, but for all test cases I've played with this only slightly reduces I/O, with the cost of added complexity.
Here's one to get you started (borrowed from my dask + xray blogpost):
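The example itself was not preserved in this transcript. A representative sketch consistent with the workload described in the thread (an elementwise anomaly calculation over a large 3D array, followed by a tiny selection), with NumPy standing in for a chunked dask array and all names illustrative:

```python
import numpy as np

# Stand-in for a large on-disk (time, y, x) dataset.
data = np.random.rand(365, 50, 50)
climatology = data.mean(axis=0)

# Elementwise preprocessing defined over the whole array...
anomaly = data - climatology
# ...followed by a tiny selection: one pixel's time series.
series = anomaly[:, 10, 10]

# With slice pushdown, only column (10, 10) would ever need loading:
assert np.allclose(series, data[:, 10, 10] - climatology[10, 10])
```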
Another example: instead of my array being stored on disk, it's actually hosted by a remote server via OpenDAP. The limiting factor is bandwidth, not disk I/O. If the user knows exactly what sort of streaming analysis they want to do ahead of time (e.g., time-series vs. image based), then chunking preemptively makes sense. But it's highly convenient to be able to do these sorts of operations automatically and even ahead of chunking, because (1) the ideal chunking varies depending on the use case and (2) for many use cases that involve subsetting the data, it's actually perfectly fine to load everything being used into memory in a single chunk. For these reasons, I want to defer the explicit chunking until step 3 (or not even bother at all).
Edit: I realize now that because of how dask's
For your example above, considering only what data is read:

Current behavior

What could be done (not quite covered in this PR)

What can't be done (easily)

It also might be easier to only handle the case where the slice is being pushed down to a load operation. Is this suitable / what you wanted?
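The difference between the current behavior and pushing the slice all the way into the load can be sketched with some back-of-the-envelope arithmetic (illustrative sizes, not measurements from this PR):

```python
import math

chunksize = 1_000_000          # elements per chunk (illustrative)
want = slice(0, 5)             # tiny slice of an elementwise result,
                               # falling inside the first chunk

# Current behavior: unused chunks are pruned, but every chunk the
# slice touches is read in full.
touched = math.ceil((want.stop - (want.start or 0)) / chunksize)
read_now = touched * chunksize

# With the slice pushed down into the load operation, only the
# requested elements are read.
read_pushed = want.stop - (want.start or 0)

assert read_now == 1_000_000
assert read_pushed == 5
```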
@jcrist Yes, that sounds great! I agree that rechunking into larger chunks automatically can't be done easily, but that's perfectly fine. I'm also not sure if there are other use cases for pushing down slices.
- Handles `None` in slices
- Attempts to fuse slices (does it wrong for now)
- Handles larger set of elemwise functions

Still no tests, still doesn't work properly, but now it also errors :)
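For the simplest case, the slice fusion this commit attempts looks like the following sketch (`fuse_slices` is a hypothetical helper handling only non-negative start/stop slices with step 1; the commit's actual version is noted above as still incorrect):

```python
def fuse_slices(outer, inner):
    """Compose two 1-D slices so that x[outer][inner] == x[fused].

    Sketch only: assumes integer start/stop, step 1, no negatives.
    """
    start = outer.start + inner.start
    stop = min(outer.stop, outer.start + inner.stop)
    return slice(start, stop)

x = list(range(20))
f = fuse_slices(slice(2, 10), slice(1, 4))
assert f == slice(3, 6)
assert x[2:10][1:4] == x[f] == [3, 4, 5]
```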
As discussed on gitter, this is actually less correct now after the latest commit: fusing slices without knowledge of the underlying shapes causes problems in the presence of broadcasting. I'm apt to push the fusing down to runtime, by making
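The broadcasting problem mentioned above can be demonstrated with a small NumPy example (illustrative, not from the PR): naively applying the same getitem key to every operand is wrong when one operand was broadcast, so a correct pushdown needs the operands' shapes.

```python
import numpy as np

a = np.arange(20.0).reshape(4, 5)
b = np.arange(5.0)             # broadcasts along axis 0 of a

# Correct result: index after the elementwise op.
want = (a + b)[2]

# Naive pushdown applies the key to both operands, but b has no
# axis 0, so b[2] selects entirely the wrong values.
wrong = a[2] + b[2]

# A shape-aware pushdown must drop the key for the broadcast operand.
right = a[2] + b

assert np.allclose(want, right)
assert not np.allclose(want, wrong)
```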
This seems to be stale. Closing for now.
Fixes #746.