Conversation

Branch force-pushed from aaca41c to 6ab8fd9.
Pinging @dask/array folks since I suspect someone in there will be able to review.
FirefoxMetzger left a comment:
Just adding two small points that I noticed while testing the PR. Not sure if they are relevant or not, so feel free to ignore them if not applicable :)
Co-authored-by: Sebastian Wallkötter <sebastian@wallkoetter.net>
Is there a helper function to validate that a valid array graph is being constructed? For example, with this branch, block sizes change after compute, which results in some hard-to-debug errors. Earlier, I had an issue where the shape was changing before/after compute. It'd be nice if there was a function to check these things.
I don't think that will check that the individual chunks have the sizes that they're supposed to though (and which I also have my doubts about after reading the code), just that the pasted-together array does. I'm an infrequent dask developer so I don't know if there are any tools to do that, but if you wanted to write your own, you could probably do something with …
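(A rough sketch of what such a hand-rolled check could look like — `check_chunks` is a hypothetical helper, not an existing dask utility. It computes every block individually and compares the result against the declared `.chunks` and dtype:)

```python
import numpy as np
import dask.array as da

def check_chunks(arr):
    # Hypothetical helper: compute each block separately and verify that
    # its actual shape and dtype match what the graph declares.
    for idx in np.ndindex(*arr.numblocks):
        block = arr.blocks[idx].compute()
        expected = tuple(arr.chunks[d][i] for d, i in enumerate(idx))
        assert block.shape == expected, (idx, block.shape, expected)
        assert block.dtype == arr.dtype, (idx, block.dtype, arr.dtype)

check_chunks(da.ones((10, 10), chunks=4))  # passes silently
```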
bmerry left a comment:
I've added comments. I'm not a core dev, so it's up to them what needs to be addressed.
```
.. autosummary::
   overlap.overlap
   overlap.map_overlap
   lib.stride_tricks.sliding_window_view
```
Question for dask core devs: even if it has to be implemented in this namespace for dispatching reasons, would it make sense for the public API to be `overlap.sliding_window_view`?
I like this idea, so I moved it to `overlap.py`.
I've pushed an experiment to https://github.com/bmerry/sliding-window which greatly simplifies the implementation, and I think fixes a few bugs too (in particular, it should handle repeated axes). It passes the tests you wrote, but I can't say for sure whether it gets the …
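(For reference on the repeated-axes point: numpy's `sliding_window_view` accepts the same axis more than once and applies the windows cumulatively, which is the behaviour a dask version would need to match. A quick numpy-only illustration:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(5)
# axis=(0, 0) windows axis 0 twice: 5 -> 4 -> 3 window positions remain,
# and two window dimensions of length 2 are appended
v = sliding_window_view(x, (2, 2), axis=(0, 0))
print(v.shape)     # (3, 2, 2)
print(v[1, 0, 1])  # all window strides address axis 0, so this is x[1 + 0 + 1]
```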
- Put overlap on +ve side, so that the last chunk is the one that is trimmed.
- Ensure that the last chunk is big enough to leave chunks of positive size after windowing.
- Remove trim function and just trim the last chunk.
- Remove wrapper that tries to slice the chunks.
This was inspired by a question in dask#7234. At present `assert_eq` checks that the computed array has the right size and dtype, but it's still possible that the individual chunks have the wrong size (as long as they add up to the right total size) and maybe the wrong dtype (as long as they coerce to produce the right dtype, although I haven't actually checked that this would slip through). Now `persist` is used to access the individual computed chunks, which are checked for shape and dtype.

This is a draft because there are a couple of failures. One is in `test_zarr_return_stored[False]` (which I'll need to investigate further, but possibly `persist` just doesn't make sense for this case) and the other is in `test_blockwise_concatenate`, which looks like it might just be a test bug.
Thanks @bmerry. I now have a simpler implementation too, using symmetric overlaps and boundary padding. I'll take a closer look at yours soon; it looks cleaner!
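(For context, `da.overlap.overlap` is the primitive these approaches build on; a small illustration of how it grows each chunk by the requested depth, including at the array boundary when a pad mode like `'reflect'` is used — values here are just an example:)

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)
# depth=1 with a padded boundary grows every chunk by 1 on each side
g = da.overlap.overlap(x, depth={0: 1}, boundary='reflect')
print(g.chunks)  # ((7, 7),)
```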
#7277 should help with this.
Very nice implementation @bmerry! This version passes all of xarray's tests too, so it should be ready to go.
* Check that computed chunks have right size and dtype

  This was inspired by a question in #7234. At present `assert_eq` checks that the computed array has the right size and dtype, but it's still possible that the individual chunks have the wrong size (as long as they add up to the right total size) and maybe the wrong dtype (as long as they coerce to produce the right dtype, although I haven't actually checked that this would slip through). Now `persist` is used to access the individual computed chunks, which are checked for shape and dtype.

* Fix test_blockwise_concatenate

  It had a `blockwise` function that returned a shape inconsistent with that inferred from the input block sizes.

* Fix dtype handling for *_like_safe on old numpy

  It was being used with non-arrays, so explicitly force things to be arrays.
* upstream/master: (43 commits)
  - bump version to 2021.03.0
  - Bump minimum version of distributed (dask#7328)
  - Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  - Temporarily revert recent Array.__setitem__ updates (dask#7326)
  - Blockwise.clone (dask#7312)
  - NEP-35 duck array update (dask#7321)
  - Don't allow setting `.name` for array (dask#7222)
  - Use nearest interpolation for creating percentiles of integer input (dask#7305)
  - Test `exp` with CuPy arrays (dask#7322)
  - Check that computed chunks have right size and dtype (dask#7277)
  - pytest.mark.flaky (dask#7319)
  - Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  - Support for Python 3.9 (dask#7289)
  - Add broadcast-based merge implementation (dask#7143)
  - Add split_every to graph_manipulation (dask#7282)
  - Typo in optimize docs (dask#7306)
  - dask.graph_manipulation support for xarray.Dataset (dask#7276)
  - Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  - Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  - Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  - ...
It's looking good. The only thing I think still left is a test for a window size of 0, which is legal in numpy and has some slightly odd behaviour that I suspect might not work as is. It's also pretty useless (you get an array with a length-zero dimension) so if it's difficult to make work I don't think it would be unreasonable to just raise an exception stating that it's not supported by Dask.
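(To make the odd numpy behaviour concrete: a window of size 0 is accepted, and since the number of windows along an axis is `n - window + 1`, you get one *more* window than there are elements, each of length zero. A quick sketch, assuming numpy >= 1.20 where `sliding_window_view` was added:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(5)
v = sliding_window_view(x, 0)
print(v.shape)  # (6, 0): 5 - 0 + 1 = 6 windows, each of length zero
```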
I think we can leave that for someone who really needs it :) . I am raising …
FWIW I have been working on something similar, but only came across this PR today as I was cleaning up my code. I have uploaded it here: https://github.com/dtpc/dask-windows. This PR is obviously much cleaner. I am only posting my version as it is based on …
It seems like this is very close to being ready. I am not quite sure if the failures are known ones or not (ping @jrbourbeau who will know), but it's probably worth merging main in to see if this passes.
* upstream/main:
  - Change default branch from master to main (dask#7198)
  - Add Xarray to CI software environment (dask#7338)
  - Update repartition argument name in error text (dask#7336)
  - Run upstream tests based on commit message (dask#7329)
  - Use pytest.register_assert_rewrite on util modules (dask#7278)
  - add example on using specific chunk sizes in from_array() (dask#7330)
  - Move numpy skip into test (dask#7247)
Yay, tests pass!
Thanks for persevering! :)
Just testing out this new function. Are you aware that it can easily result in humongous chunks? Transforming it to a one-year sliding window results in three 500 GB chunks.
This is misleading, unfortunately. It does rechunk to your window size, but after that it uses numpy's …
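(To illustrate why the trailing window dimension is "fake": numpy's `sliding_window_view` returns a strided view, so the reported `nbytes` can be enormous while no extra memory is actually allocated. A numpy-only sketch:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.zeros(1_000_000)          # 8 MB of real data
v = sliding_window_view(x, 365)  # nominal shape (999636, 365)
print(v.nbytes)                  # ~2.9 GB reported...
print(np.shares_memory(v, x))    # ...but it's just a view over x: True
```

The "mega" chunk sizes reported for the dask array are this same nominal size; memory only materializes once something forces a copy of the view.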
@dcherian True, the last dimension is fake. Yet, if I try to compute anything with these "mega" chunks, serialization fails (probably because dask realizes early that it can't hold that much memory). This effectively prevents me from using the sliding window function. Anyways, I believe it should be possible to circumvent the …

My current approach is to call …
While it is possible to circumvent the additional …

```python
import numpy as np
import dask.array as da
from numpy.lib.stride_tricks import sliding_window_view

org = np.broadcast_to(np.arange(8)[:, None], (8, 3))
X = da.from_array(org, chunks=(3, 3))
>>> org
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7]])

window_size = 3
X_overlap = da.overlap.overlap(X, depth={0: (0, window_size - 1)}, boundary='none')
>>> X_overlap.chunks
((5, 5, 2), (3,))

# this has to be calculated algorithmically in a full implementation:
# subtract (window_size - 1) from each of X_overlap's chunks along the
# window axis, and set negative ones to zero afterwards
new_chunks = ((3, 3, 0), (3,), (window_size,))
mapped = da.map_blocks(
    sliding_window_view,
    X_overlap,
    chunks=new_chunks,
    new_axis=2,
    meta=X_overlap,
    window_shape=window_size,
    axis=0,
)
>>> mapped
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>
>>> mapped.chunks
((3, 3, 0), (3,), (3,))

# mapped has the correct shape, but zero chunks will get computed anyway
# and will produce errors when computed.
# in the simple / usual case where all zero chunks are at the end, we can
# just opt not to include them
fixed = da.Array(mapped.dask, mapped.name, ((3, 3), (3,), (3,)), meta=mapped)
>>> fixed
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>

computed = fixed.compute()
>>> computed.shape
(6, 3, 3)
>>> computed
array([[[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]],

       [[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]],

       [[2, 3, 4],
        [2, 3, 4],
        [2, 3, 4]],

       [[3, 4, 5],
        [3, 4, 5],
        [3, 4, 5]],

       [[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]],

       [[5, 6, 7],
        [5, 6, 7],
        [5, 6, 7]]])

# for a robust solution also working on irregular chunks, we have to
# "reindex" the chunks by creating a new task graph
mapped_irregular_chunks = ((3, 0, 3, 0), (3,), (3,))
fixed2_chunks = ((3, 3), (3,), (3,))
new_tasks = {
    ('fixed2', 0, 0, 0): (mapped.name, 0, 0, 0),
    ('fixed2', 1, 0, 0): (mapped.name, 2, 0, 0),
}
# note: dicts don't support +, so merge the graphs explicitly
fixed2 = da.Array({**dict(mapped.dask), **new_tasks}, 'fixed2', fixed2_chunks, meta=mapped)
```