Skip to content

Add an rewrite rule for handling (getitem, (elemwise, x), i)#410

Closed
shoyer wants to merge 1 commit intodask:masterfrom
shoyer:elemwise-rewrite
Closed

Add an rewrite rule for handling (getitem, (elemwise, x), i)#410
shoyer wants to merge 1 commit intodask:masterfrom
shoyer:elemwise-rewrite

Conversation

@shoyer
Copy link
Member

@shoyer shoyer commented Jul 11, 2015

These should be rewritten from (getitem, (elemwise, x), i) to (elemwise, (getitem, x, i)). In many cases, this translates into a significant reduction in IO.

This is particularly relevant for computed coordinates, which are ubiquitous in the fields of meteorology and occeanography. These case typically be calculated in a straightforward way from ufuncs, and we would like to support these using dask in xray (see pydata/xarray#462).

I'll note that this PR is not entirely complete yet -- there are a few other dask array operations that will require wrapping. Also, we would like to optimize the tasks resulting from stacking or concatenation followed by indexing for the same reasons (e.g., so we can apply a coordinate transformation after loading a multi-file dataset).

@mrocklin
Copy link
Member

It's cool to see other people pick up the rewrite rule framework. This came up before when we first added it, but can we verify that adding these rules doesn't introduce substantial overhead in the common case?

I think that the way @jcrist implemented this it should be quite robust to matching against many rules but it'd be good to verify.

@jcrist
Copy link
Member

jcrist commented Jul 11, 2015

Rewriting is O(n) in the length of the first pattern to be matched. If no match, it's a little harder to figure out due to backtracking (non-deterministic algorithm). I have another version that is deterministic, which would bound failing to find a match to O(n) longest pattern that could be matched. Either way, there is negligible overhead for many-to-one matching. In this case, there actually is no overhead, as the variation is in the functions instead of the args.

@shoyer
Copy link
Member Author

shoyer commented Jul 11, 2015

I'll note that it would very nice to extend the matching system to automatically handle even unrecognized NumPy ufuncs (or ufunc like objects like those provided by Numba's vectorize). These have a nin attributre that specifies how many input arguments they take.

@shoyer
Copy link
Member Author

shoyer commented Jul 12, 2015

@jcrist are you working on isinstance checks so I can register all ufuncs? or should I finish this up?

@jcrist
Copy link
Member

jcrist commented Jul 12, 2015

I probably won't get to the pattern matcher improvements until later this week. We can always push a new PR cleaning up the rules you add here, or you can wait until this process is simplified.

These should be rewritten from `(getitem, (elemwise, x), i)` to
`(elemwise, (getitem, x, i))`. In many cases, this translates into a
significant reduction in IO.

This is particularly relevant for computed coordinates, which are ubiquitous
in the fields of meteorology and occeanography. These case typically be
calculated in a straightforward way from ufuncs, and we would like to support
these using dask in xray (see pydata/xarray#462).
@shoyer shoyer force-pushed the elemwise-rewrite branch from 983d4d4 to fc1dcb0 Compare July 13, 2015 00:19
@shoyer
Copy link
Member Author

shoyer commented Jul 13, 2015

Another limitation of the current pattern matcher is that it has no way to recognize partial arguments. This means that it can't match in the presence of optional keyword arguments (usually written with partial, at least in the current version of dask).

It also occurs to me that my current logic is flawed, because It presumes that I want to apply getitem to all ufunc or binop arguments. In fact, these arguments only need to be broadcast compatible and could even include scalars (which cannot be indexed). So it seems we will need to figure out some tricks so that we can move the getitem inside elemwise in these cases.

@mrocklin
Copy link
Member

I've resolved what was causing the python 3 errors in travis. The pains of developing in multiple git repositories at once.

Things should be smoother after a rebase/merge with master

@shoyer
Copy link
Member Author

shoyer commented Aug 31, 2015

I'd like revisit this, to follow up on my refactor of elemwise in #622.

My plan is to broadcast all arrays before calling atop (e.g., using broadcast_to). Then, this pattern matching rule can move the indexing inside the element-wise operations. I don't plan to turn broadcast scalar arguments, because this would cause a discrepancy with NumPy's dtype promotion rules, which special case scalars (see #622 for more details).

@jcrist To make this work well, it would be very helpful to have support for rules that match based on types. In particular, I would like to be able to match all tasks with a ufunc or partial_by_order object as the first element. Otherwise, the only way to properly handle the array + scalar case would be to update matching rules at runtime, which seems non-ideal.

@shoyer
Copy link
Member Author

shoyer commented Aug 31, 2015

Actually, it looks like using isinstance checks for ufuncs alone won't suffice. I'm also need to verify that the ufunc signature is None. Otherwise, it could be a gufunc, for which this rewrite rule would not apply.

@mrocklin
Copy link
Member

@shoyer can I close this as stale?

@shoyer
Copy link
Member Author

shoyer commented Sep 22, 2015

Fair enough, I'll make another issue just so we don't forget about it

@shoyer shoyer closed this Sep 22, 2015
@mrocklin
Copy link
Member

Lovely. Getting close to zero-lingering-pull-requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants