Conversation

Branch force-pushed from aaca41c to 6ab8fd9.
Pinging @dask/array folks since I suspect someone in there will be able to review.
FirefoxMetzger left a comment:
Just adding two small points that I noticed while testing the PR. Not sure if they are relevant or not, so feel free to ignore them if not applicable :)
Co-authored-by: Sebastian Wallkötter <sebastian@wallkoetter.net>
Is there a helper function to validate that a valid array graph is being constructed? For example, with this branch, block sizes change after compute, which results in some hard-to-debug errors. Earlier, I had an issue where the shape was changing before/after compute. It'd be nice if there was a function to check these things.
I don't think that will check that the individual chunks have the sizes that they're supposed to though (and which I also have my doubts about after reading the code), just that the pasted-together array does. I'm an infrequent dask developer so I don't know if there are any tools to do that, but if you wanted to write your own, you could probably do something with …
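(A rough sketch of what such a hand-rolled check could look like — `check_chunks` is a hypothetical helper, not an existing dask utility. It computes every block individually and compares the result against the declared `.chunks` and dtype:)

```python
import numpy as np
import dask.array as da

def check_chunks(arr):
    # Hypothetical helper: compute each block separately and verify that
    # its actual shape and dtype match what the graph declares.
    for idx in np.ndindex(*arr.numblocks):
        block = arr.blocks[idx].compute()
        expected = tuple(arr.chunks[d][i] for d, i in enumerate(idx))
        assert block.shape == expected, (idx, block.shape, expected)
        assert block.dtype == arr.dtype, (idx, block.dtype, arr.dtype)

check_chunks(da.ones((10, 10), chunks=4))  # passes silently
```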
bmerry left a comment:
I've added comments. I'm not a core dev, so it's up to them what needs to be addressed.
```
.. autosummary::
   overlap.overlap
   overlap.map_overlap
   lib.stride_tricks.sliding_window_view
```
Question for dask core devs: even if it has to be implemented in this namespace for dispatching reasons, would it make sense for the public API to be `overlap.sliding_window_view`?
I like this idea, so I moved it to `overlap.py`.
I've pushed an experiment to https://github.com/bmerry/sliding-window which greatly simplifies the implementation, and I think fixes a few bugs too (in particular, it should handle repeated axes). It passes the tests you wrote, but I can't say for sure whether it gets the …
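(For reference on the repeated-axes point: numpy's `sliding_window_view` accepts the same axis more than once and applies the windows cumulatively, which is the behaviour a dask version would need to match. A quick numpy-only illustration:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(5)
# axis=(0, 0) windows axis 0 twice: 5 -> 4 -> 3 window positions remain,
# and two window dimensions of length 2 are appended
v = sliding_window_view(x, (2, 2), axis=(0, 0))
print(v.shape)     # (3, 2, 2)
print(v[1, 0, 1])  # all window strides address axis 0, so this is x[1 + 0 + 1]
```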
- Put overlap on +ve side, so that the last chunk is the one that is trimmed.
- Ensure that the last chunk is big enough to leave chunks of positive size after windowing.
- Remove trim function and just trim the last chunk.
- Remove wrapper that tries to slice the chunks.
This was inspired by a question in dask#7234. At present `assert_eq` checks that the computed array has the right size and dtype, but it's still possible that the individual chunks have the wrong size (as long as they add up to the right total size) and maybe the wrong dtype (as long as they coerce to produce the right dtype, although I haven't actually checked that this would slip through). Now `persist` is used to access the individual computed chunks, which are checked for shape and dtype.

This is a draft because there are a couple of failures. One is in `test_zarr_return_stored[False]` (which I'll need to investigate further, but possibly `persist` just doesn't make sense for this case) and the other is in `test_blockwise_concatenate`, which looks like it might just be a test bug.
Thanks @bmerry. I now have a simpler implementation too, using symmetric overlaps and boundary padding. I'll take a closer look at yours soon; it looks cleaner!
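(For context, `da.overlap.overlap` is the primitive these approaches build on; a small illustration of how it grows each chunk by the requested depth, including at the array boundary when a pad mode like `'reflect'` is used — values here are just an example:)

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)
# depth=1 with a padded boundary grows every chunk by 1 on each side
g = da.overlap.overlap(x, depth={0: 1}, boundary='reflect')
print(g.chunks)  # ((7, 7),)
```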
#7277 should help with this.
Very nice implementation @bmerry! This version passes all of xarray's tests too, so it should be ready to go.
* Check that computed chunks have right size and dtype

  This was inspired by a question in #7234. At present `assert_eq` checks that the computed array has the right size and dtype, but it's still possible that the individual chunks have the wrong size (as long as they add up to the right total size) and maybe the wrong dtype (as long as they coerce to produce the right dtype, although I haven't actually checked that this would slip through). Now `persist` is used to access the individual computed chunks, which are checked for shape and dtype.

* Fix test_blockwise_concatenate

  It had a `blockwise` function that returned a shape inconsistent with that inferred from the input block sizes.

* Fix dtype handling for *_like_safe on old numpy

  It was being used with non-arrays, so explicitly force things to be arrays.
* upstream/master: (43 commits)
  - bump version to 2021.03.0
  - Bump minimum version of distributed (dask#7328)
  - Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  - Temporarily revert recent Array.__setitem__ updates (dask#7326)
  - Blockwise.clone (dask#7312)
  - NEP-35 duck array update (dask#7321)
  - Don't allow setting `.name` for array (dask#7222)
  - Use nearest interpolation for creating percentiles of integer input (dask#7305)
  - Test `exp` with CuPy arrays (dask#7322)
  - Check that computed chunks have right size and dtype (dask#7277)
  - pytest.mark.flaky (dask#7319)
  - Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  - Support for Python 3.9 (dask#7289)
  - Add broadcast-based merge implementation (dask#7143)
  - Add split_every to graph_manipulation (dask#7282)
  - Typo in optimize docs (dask#7306)
  - dask.graph_manipulation support for xarray.Dataset (dask#7276)
  - Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  - Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  - Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  - ...
It's looking good. The only thing I think still left is a test for a window size of 0, which is legal in numpy and has some slightly odd behaviour that I suspect might not work as is. It's also pretty useless (you get an array with a length-zero dimension) so if it's difficult to make work I don't think it would be unreasonable to just raise an exception stating that it's not supported by Dask.
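(To make the odd numpy behaviour concrete: a window of size 0 is accepted, and since the number of windows along an axis is `n - window + 1`, you get one *more* window than there are elements, each of length zero. A quick sketch, assuming numpy >= 1.20 where `sliding_window_view` was added:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(5)
v = sliding_window_view(x, 0)
print(v.shape)  # (6, 0): 5 - 0 + 1 = 6 windows, each of length zero
```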
I think we can leave that for someone who really needs it :) . I am raising …
FWIW I have been working on something similar, but only came across this PR today as I was cleaning up my code. I have uploaded it here: https://github.com/dtpc/dask-windows. This PR is obviously much cleaner. I am only posting my version as it is based on …
It seems like this is very close to being ready. I am not quite sure if the failures are known ones or not (ping @jrbourbeau who will know), but it's probably worth merging main in to see if this passes.
* upstream/main:
  - Change default branch from master to main (dask#7198)
  - Add Xarray to CI software environment (dask#7338)
  - Update repartition argument name in error text (dask#7336)
  - Run upstream tests based on commit message (dask#7329)
  - Use pytest.register_assert_rewrite on util modules (dask#7278)
  - add example on using specific chunk sizes in from_array() (dask#7330)
  - Move numpy skip into test (dask#7247)
Yay, tests pass!
Thanks for persevering! :)
Just testing out this new function. Are you aware that it can easily result in humongous chunks? Transforming it to a one-year sliding window results in three 500 GB chunks.
This is misleading, unfortunately. It does rechunk to your window size, but after that it uses numpy's …
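(To illustrate why the trailing window dimension is "fake": numpy's `sliding_window_view` returns a strided view, so the reported `nbytes` can be enormous while no extra memory is actually allocated. A numpy-only sketch:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.zeros(1_000_000)          # 8 MB of real data
v = sliding_window_view(x, 365)  # nominal shape (999636, 365)
print(v.nbytes)                  # ~2.9 GB reported...
print(np.shares_memory(v, x))    # ...but it's just a view over x: True
```

The "mega" chunk sizes reported for the dask array are this same nominal size; memory only materializes once something forces a copy of the view.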
@dcherian True, the last dimension is fake. Yet, if I try to compute anything with these "mega" chunks, serialization fails (probably because dask realizes early that it can't hold that much memory). This effectively prevents me from using the sliding window function. Anyways, I believe it should be possible to circumvent the …

My current approach is to call …
While it is possible to circumvent the additional …

```python
import numpy as np
import dask.array as da
from numpy.lib.stride_tricks import sliding_window_view

org = np.broadcast_to(np.arange(8)[:, None], (8, 3))
X = da.from_array(org, chunks=(3, 3))
>>> org
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7]])

window_size = 3
X_overlap = da.overlap.overlap(X, depth={0: (0, window_size - 1)}, boundary='none')
>>> X_overlap.chunks
((5, 5, 2), (3,))

# this has to be calculated algorithmically in a full implementation:
# subtract (window_size - 1) from each of X_overlap's chunks along the
# window axis, and set negative ones to zero afterwards
new_chunks = ((3, 3, 0), (3,), (window_size,))
mapped = da.map_blocks(
    sliding_window_view,
    X_overlap,
    chunks=new_chunks,
    new_axis=2,
    meta=X_overlap,
    window_shape=window_size,
    axis=0,
)
>>> mapped
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>
>>> mapped.chunks
((3, 3, 0), (3,), (3,))

# mapped has the correct shape, but zero chunks will get computed anyway
# and will produce errors when computed.
# in the simple / usual case where all zero chunks are at the end, we can
# just opt not to include them
fixed = da.Array(mapped.dask, mapped.name, ((3, 3), (3,), (3,)), meta=mapped)
>>> fixed
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>

computed = fixed.compute()
>>> computed.shape
(6, 3, 3)
>>> computed
array([[[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]],

       [[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]],

       [[2, 3, 4],
        [2, 3, 4],
        [2, 3, 4]],

       [[3, 4, 5],
        [3, 4, 5],
        [3, 4, 5]],

       [[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]],

       [[5, 6, 7],
        [5, 6, 7],
        [5, 6, 7]]])

# for a robust solution also working on irregular chunks, we have to
# "reindex" the chunks by creating a new task graph
mapped_irregular_chunks = ((3, 0, 3, 0), (3,), (3,))
fixed2_chunks = ((3, 3), (3,), (3,))
new_tasks = {
    ('fixed2', 0, 0, 0): (mapped.name, 0, 0, 0),
    ('fixed2', 1, 0, 0): (mapped.name, 2, 0, 0),
}
# note: dicts don't support +, so merge the graphs explicitly
fixed2 = da.Array({**dict(mapped.dask), **new_tasks}, 'fixed2', fixed2_chunks, meta=mapped)
```