
Add sliding_window_view#7234

Merged
jsignell merged 22 commits into dask:main from dcherian:sliding-window
Mar 18, 2021

Conversation

@dcherian
Collaborator

@dcherian dcherian commented Feb 16, 2021

@jsignell
Member

Pinging @dask/array folks since I suspect someone in there will be able to review.

Contributor

@FirefoxMetzger FirefoxMetzger left a comment


Just adding two small points that I noticed while testing the PR. Not sure if they are relevant or not, so feel free to ignore them if not applicable :)

dcherian and others added 3 commits February 23, 2021 06:24
Co-authored-by: Sebastian Wallkötter <sebastian@wallkoetter.net>
@dcherian
Collaborator Author

Is there a helper function to validate that a valid array graph is being constructed?

For example, with this branch, block sizes change after compute. This results in some hard-to-debug errors.

before compute: (20, 3) | after compute: (17, 3)
before compute: (8, 3) | after compute: (11, 3)

Earlier, I had an issue where the shape was changing before/after compute.

It'd be nice if there was a function to check these things.
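In the absence of a dedicated helper, one rough way to sanity-check this (a sketch, not an existing dask utility) is to compute each block individually and compare its shape against the declared chunk sizes:

```python
import dask.array as da

# An array with uneven chunks along axis 0: (7, 7, 6)
a = da.arange(20, chunks=7)

# Compare each declared chunk size against the shape actually computed.
for i in range(a.numblocks[0]):
    declared = a.chunks[0][i]
    actual = a.blocks[i].compute().shape[0]
    assert declared == actual, (declared, actual)
```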

@jsignell
Member

> Is there a helper function to validate that a valid array graph is being constructed?

assert_eq should check the shape before and after compute.
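For reference, `assert_eq` lives in `dask.array.utils`; a minimal usage sketch:

```python
import numpy as np
import dask.array as da
from dask.array.utils import assert_eq

x = np.arange(10)
d = da.from_array(x, chunks=3)

# Checks shape and dtype both before and after compute, plus the values.
assert_eq(d, x)
```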

@bmerry
Contributor

bmerry commented Feb 25, 2021

> assert_eq should check the shape before and after compute.

I don't think that will check that the individual chunks have the sizes they're supposed to, though (and which I also have my doubts about after reading the code), just that the pasted-together array does. I'm an infrequent dask developer, so I don't know if there are any tools to do that, but if you wanted to write your own, you could probably do something with map_blocks and block_info to assert that the expected shape from block_info matches the actual shape for each chunk.
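A sketch of that idea (the helper name `_check_block` is made up here); `block_info[None]` describes the output block each call is expected to produce:

```python
import dask.array as da

def _check_block(x, block_info=None):
    # block_info[None]["chunk-shape"] is the shape dask expects for this block.
    expected = tuple(block_info[None]["chunk-shape"])
    assert x.shape == expected, (x.shape, expected)
    return x

a = da.ones((10, 6), chunks=(4, 3))
a.map_blocks(_check_block, dtype=a.dtype).compute()
```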

Contributor

@bmerry bmerry left a comment


I've added comments. I'm not a core dev, so it's up to them what needs to be addressed.

.. autosummary::
overlap.overlap
overlap.map_overlap
lib.stride_tricks.sliding_window_view
Contributor


Question for dask core devs: even if it has to be implemented in this namespace for dispatching reasons, would it make sense for the public API to be overlap.sliding_window_view?

Collaborator Author


I like this idea, so I moved it to overlap.py.

@bmerry
Contributor

bmerry commented Feb 25, 2021

I've pushed an experiment to https://github.com/bmerry/sliding-window which greatly simplifies the implementation, and I think fixes a few bugs too (in particular, it should handle repeated axes). It passes the tests you wrote, but I can't say for sure whether it gets the newchunks correct. You're welcome to cherry-pick it.

- Put overlap on +ve side, so that last chunk is the one that is trimmed.
- Ensure that last chunk is big enough to leave chunks of positive size
  after windowing.
- Remove trim function and just trim the last chunk.
- Remove wrapper that tries to slice the chunks.
bmerry added a commit to bmerry/dask that referenced this pull request Feb 26, 2021
This was inspired by a question in dask#7234. At present `assert_eq` checks
that the computed array has the right size and dtype, but it's still
possible that the individual chunks have the wrong size (as long as they
add up to the right total size) and maybe the wrong dtype (as long as
they coerce to produce the right dtype - although I haven't actually
checked that this would slip through).

Now `persist` is used to access the individual computed chunks and
checked for shape and dtype.

This is a draft because there are a couple of failures. One is in
`test_zarr_return_stored[False]` (which I'll need to investigate
further, but possibly `persist` just doesn't make sense for this case)
and the other is in `test_blockwise_concatenate` which looks like it
might just be a test bug.
@dcherian
Collaborator Author

dcherian commented Feb 26, 2021

Thanks @bmerry.

I have a simpler implementation too now by using symmetric overlaps and boundary padding.

I'll take a closer look at yours soon, it looks cleaner!

@dcherian dcherian marked this pull request as draft February 26, 2021 15:10
@bmerry
Contributor

bmerry commented Feb 26, 2021

> I don't think that will check that the individual chunks have the sizes they're supposed to, though (and which I also have my doubts about after reading the code), just that the pasted-together array does. I'm an infrequent dask developer, so I don't know if there are any tools to do that, but if you wanted to write your own, you could probably do something with map_blocks and block_info to assert that the expected shape from block_info matches the actual shape for each chunk.

#7277 should help with this.

@dcherian
Collaborator Author

dcherian commented Mar 1, 2021

Very nice implementation @bmerry !

This version passes all of xarray's tests too, so it should be ready to go.

@dcherian dcherian marked this pull request as ready for review March 1, 2021 15:06
jsignell pushed a commit that referenced this pull request Mar 4, 2021
* Check that computed chunks have right size and dtype

This was inspired by a question in #7234. At present `assert_eq` checks
that the computed array has the right size and dtype, but it's still
possible that the individual chunks have the wrong size (as long as they
add up to the right total size) and maybe the wrong dtype (as long as
they coerce to produce the right dtype - although I haven't actually
checked that this would slip through).

Now `persist` is used to access the individual computed chunks and
checked for shape and dtype.

* Fix test_blockwise_concatenate

It had a `blockwise` function that returned a shape inconsistent with
that inferred from the input block sizes.

* Fix dtype handling for *_like_safe on old numpy

It was being used with non-arrays, so explicitly force things to be
arrays.
dcherian added 2 commits March 8, 2021 08:01
* upstream/master: (43 commits)
  bump version to 2021.03.0
  Bump minimum version of distributed (dask#7328)
  Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  Temporarily revert recent Array.__setitem__ updates (dask#7326)
  Blockwise.clone (dask#7312)
  NEP-35 duck array update (dask#7321)
  Don't allow setting `.name` for array (dask#7222)
  Use nearest interpolation for creating percentiles of integer input (dask#7305)
  Test `exp` with CuPy arrays (dask#7322)
  Check that computed chunks have right size and dtype (dask#7277)
  pytest.mark.flaky (dask#7319)
  Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  Support for Python 3.9 (dask#7289)
  Add broadcast-based merge implementation (dask#7143)
  Add split_every to graph_manipulation (dask#7282)
  Typo in optimize docs (dask#7306)
  dask.graph_manipulation support for xarray.Dataset (dask#7276)
  Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  ...
Base automatically changed from master to main March 8, 2021 20:19
@bmerry
Contributor

bmerry commented Mar 9, 2021

It's looking good. The only thing I think is still left is a test for a window size of 0, which is legal in numpy and has some slightly odd behaviour that I suspect might not work as-is. It's also pretty useless (you get an array with a length-zero dimension), so if it's difficult to make work, I don't think it would be unreasonable to just raise an exception stating that it's not supported by Dask.
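For illustration, NumPy's behaviour with a zero-size window (assuming a NumPy recent enough to have `sliding_window_view`, 1.20+): you get one more "window" than elements, each of length zero.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(5)
v = sliding_window_view(x, 0)
print(v.shape)  # (6, 0): 5 - 0 + 1 positions, each an empty window
```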

@dcherian
Collaborator Author

dcherian commented Mar 9, 2021

> The only thing I think still left is a test for a window size of 0, which is legal in numpy and has some slightly odd behaviour that I suspect might not work as is

I think we can leave that for someone who really needs it :) . I am raising ValueError now. I could change it to NotImplementedError if someone felt strongly about it.

@dtpc

dtpc commented Mar 18, 2021

FWIW I have been working on something similar, but only came across this PR today as I was cleaning up my code. I have uploaded it here https://github.com/dtpc/dask-windows

This PR is obviously much cleaner. I am only posting my version because it is based on skimage.util.view_as_windows and so supports a step argument as well, in case anybody is looking for that. :)
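If you only need a `step` occasionally, it can also be emulated by slicing the windows produced by `sliding_window_view` (a sketch; `skimage.util.view_as_windows(x, 4, step=3)` should give the same result):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10)
step = 3
# Take every `step`-th window along the windowed axis.
windows = sliding_window_view(x, 4)[::step]
print(windows)
# [[0 1 2 3]
#  [3 4 5 6]
#  [6 7 8 9]]
```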

@jsignell
Member

It seems like this is very close to being ready. I am not quite sure if the failures are known ones or not (ping @jrbourbeau who will know), but it's probably worth merging main in to see if this passes.

* upstream/main:
  Change default branch from master to main (dask#7198)
  Add Xarray to CI software environment (dask#7338)
  Update repartition argument name in error text (dask#7336)
  Run upstream tests based on commit message (dask#7329)
  Use pytest.register_assert_rewrite on util modules (dask#7278)
  add example on using specific chunk sizes in from_array() (dask#7330)
  Move numpy skip into test (dask#7247)
@dcherian
Collaborator Author

Yay, tests pass!

@jsignell jsignell merged commit 80afa97 into dask:main Mar 18, 2021
@jsignell
Member

Thanks for persevering! :)

@dcherian dcherian deleted the sliding-window branch March 18, 2021 16:15
@dcherian
Collaborator Author

Thanks @jsignell and @bmerry

@sehoffmann

sehoffmann commented Mar 19, 2021

Just testing out this new function. Are you aware that it can easily result in humongous chunks?
I have three years' worth of wind-velocity measurements as lat/long images:

dataset:

         Array            Chunk
Bytes    574.62 MB        16.78 MB
Shape    (8768, 64, 128)  (256, 64, 128)
Count    36 Tasks         36 Chunks
Type     float64          numpy.ndarray

Transforming it to a one-year sliding window results in three ~560 GB chunks.

sliding_window_view(dataset, 2920, axis=0):

         Array                  Chunk
Bytes    1.12 TB                560.32 GB
Shape    (5849, 64, 128, 2920)  (2928, 64, 128, 2920)
Count    52 Tasks               3 Chunks
Type     float64                numpy.ndarray

safe_chunks are ((2920, 2928, 2920), (64,), (128,))
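The ~560 GB per chunk follows directly from the chunk shape (simple float64 arithmetic, shown for clarity):

```python
# One output chunk of shape (2928, 64, 128, 2920) of float64, if materialized:
nbytes = 2928 * 64 * 128 * 2920 * 8
print(nbytes / 1e9)  # ≈ 560.32 GB, matching the repr above
```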

@dcherian
Collaborator Author

This is misleading, unfortunately. It does rechunk to your window size, but after that it uses numpy's sliding_window_view to create a view that looks like a giant 2928×64×128×2920 array but really only uses the memory of the underlying 2928×64×128 chunk (≈200 MB).
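A small-scale illustration of that point (sizes shrunk from the example above): the window view shares memory with its base array, so no new allocation happens until something materializes it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.zeros((100, 4))  # 3.2 kB of float64
v = sliding_window_view(x, 90, axis=0)

print(v.shape)                 # (11, 4, 90): looks ~10x larger than x...
print(np.shares_memory(v, x))  # True: a strided view, no extra memory
```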

@sehoffmann

@dcherian True, the last dimension is fake. Yet, if I try to compute anything with these "mega" chunks, serialization fails (probably because dask realizes early that it can't hold that much memory). This effectively prevents me from using the sliding window function.

Anyway, I believe it should be possible to circumvent the rechunk() entirely (which also incurs additional overhead). I thought I had a working solution yesterday (by reusing the tasks from map_overlap and adjusting the chunks afterwards by creating a new da.Array with an identical name); however, on closer inspection it fails on some edge cases.

My current approach is to call da.overlap.overlap and da.map_blocks directly. I will report back once I have a working proof of concept.

@sehoffmann

sehoffmann commented Mar 20, 2021

While it is possible to circumvent the additional rechunk(), the final chunks will of course be as big as in the current solution. Still, if you are interested in improving the current solution by removing the need for the rechunk, here is rough concept code for a single axis only:

import numpy as np
import dask.array as da
from numpy.lib.stride_tricks import sliding_window_view

org = np.broadcast_to(np.arange(8)[:, None], (8, 3))
X = da.from_array(org, chunks=(3, 3))

>>> org
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7]])

window_size = 3
X_overlap = da.overlap.overlap(X, depth={0: (0, window_size - 1)}, boundary='none')

>>> X_overlap.chunks
((5, 5, 2), (3,))

# this has to be calculated algorithmically in a full impl:
# subtract (window_size-1) from each of X_overlap's chunks along window axis, and set negative ones to zero afterwards
new_chunks = ((3,3,0), (3,), (window_size,))

mapped = da.map_blocks(
    sliding_window_view,
    X_overlap,
    window_shape=window_size,
    axis=0,
    chunks=new_chunks,
    new_axis=2,
    meta=X_overlap._meta,  # a concrete (empty) numpy array, not a dask array
)
>>> mapped
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>

>>> mapped.chunks
((3, 3, 0), (3,), (3,))

# mapped has correct shape, but zero chunks will get computed anyways and will produce errors when computed
# in the simple / usual case where all zero chunks are at the end, we can just opt to not include them

fixed = da.Array(mapped.dask, mapped.name, ((3, 3), (3,), (3,)), meta=mapped._meta)
>>> fixed
dask.array<sliding_window_view, shape=(6, 3, 3), dtype=int64, chunksize=(3, 3, 3), chunktype=numpy.ndarray>

computed = fixed.compute()
>>> computed.shape
(6, 3, 3)
>>> computed
array([[[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]],

       [[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]],

       [[2, 3, 4],
        [2, 3, 4],
        [2, 3, 4]],

       [[3, 4, 5],
        [3, 4, 5],
        [3, 4, 5]],

       [[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]],

       [[5, 6, 7],
        [5, 6, 7],
        [5, 6, 7]]])

# for a robust solution also working on irregular chunks, we have to "reindex"
# the chunks by creating a new task graph
mapped_irregular_chunks = ((3, 0, 3, 0), (3,), (3,))
fixed2_chunks = ((3, 3), (3,), (3,))

# point each output block at the corresponding non-empty block of `mapped`
new_tasks = {
    ('fixed2', 0, 0, 0): (mapped.name, 0, 0, 0),
    ('fixed2', 1, 0, 0): (mapped.name, 2, 0, 0),
}

# dicts cannot be added with `+`; merge them instead
fixed2 = da.Array({**new_tasks, **dict(mapped.dask)}, 'fixed2', fixed2_chunks, meta=mapped._meta)



Successfully merging this pull request may close these issues.

Support sliding window computations

8 participants