Add an rewrite rule for handling (getitem, (elemwise, x), i)#410
Add an rewrite rule for handling (getitem, (elemwise, x), i)#410shoyer wants to merge 1 commit intodask:masterfrom
Conversation
|
It's cool to see other people pick up the rewrite rule framework. This came up before when we first added it, but can we verify that adding these rules doesn't introduce substantial overhead in the common case? I think that the way @jcrist implemented this it should be quite robust to matching against many rules but it'd be good to verify. |
|
Rewriting is O(n) in the length of the first pattern to be matched. If no match, it's a little harder to figure out due to backtracking (non-deterministic algorithm). I have another version that is deterministic, which would bound failing to find a match to O(n) longest pattern that could be matched. Either way, there is negligible overhead for many-to-one matching. In this case, there actually is no overhead, as the variation is in the functions instead of the args. |
|
I'll note that it would very nice to extend the matching system to automatically handle even unrecognized NumPy ufuncs (or ufunc like objects like those provided by Numba's vectorize). These have a |
|
@jcrist are you working on |
|
I probably won't get to the pattern matcher improvements until later this week. We can always push a new PR cleaning up the rules you add here, or you can wait until this process is simplified. |
These should be rewritten from `(getitem, (elemwise, x), i)` to `(elemwise, (getitem, x, i))`. In many cases, this translates into a significant reduction in IO. This is particularly relevant for computed coordinates, which are ubiquitous in the fields of meteorology and occeanography. These case typically be calculated in a straightforward way from ufuncs, and we would like to support these using dask in xray (see pydata/xarray#462).
|
Another limitation of the current pattern matcher is that it has no way to recognize It also occurs to me that my current logic is flawed, because It presumes that I want to apply getitem to all ufunc or binop arguments. In fact, these arguments only need to be broadcast compatible and could even include scalars (which cannot be indexed). So it seems we will need to figure out some tricks so that we can move the getitem inside elemwise in these cases. |
|
I've resolved what was causing the python 3 errors in travis. The pains of developing in multiple git repositories at once. Things should be smoother after a rebase/merge with master |
|
I'd like revisit this, to follow up on my refactor of elemwise in #622. My plan is to broadcast all arrays before calling @jcrist To make this work well, it would be very helpful to have support for rules that match based on types. In particular, I would like to be able to match all tasks with a ufunc or |
|
Actually, it looks like using isinstance checks for ufuncs alone won't suffice. I'm also need to verify that the ufunc signature is |
|
@shoyer can I close this as stale? |
|
Fair enough, I'll make another issue just so we don't forget about it |
|
Lovely. Getting close to zero-lingering-pull-requests. |
These should be rewritten from
(getitem, (elemwise, x), i)to(elemwise, (getitem, x, i)). In many cases, this translates into a significant reduction in IO.This is particularly relevant for computed coordinates, which are ubiquitous in the fields of meteorology and occeanography. These case typically be calculated in a straightforward way from ufuncs, and we would like to support these using dask in xray (see pydata/xarray#462).
I'll note that this PR is not entirely complete yet -- there are a few other dask array operations that will require wrapping. Also, we would like to optimize the tasks resulting from stacking or concatenation followed by indexing for the same reasons (e.g., so we can apply a coordinate transformation after loading a multi-file dataset).