Skip to content

MAINT: implement median#3819

Closed
stsievert wants to merge 5 commits intodask:mainfrom
stsievert:median
Closed

MAINT: implement median#3819
stsievert wants to merge 5 commits intodask:mainfrom
stsievert:median

Conversation

@stsievert
Copy link
Copy Markdown
Member

@stsievert stsievert commented Jul 26, 2018

This PR implements da.median and dask.array.Array.median. It is a simple wrapper around da.percentile.

This would close #46

  • Tests added and pass
  • Passes flake8 dask

TODO

@jakirkham
Copy link
Copy Markdown
Member

The documentation of percentile says it is approximate, which makes sense. How important is it to have the exact median? Should we add a similar note in da.median's docs?

@stsievert
Copy link
Copy Markdown
Member Author

I've modified the docstring, and copied it to both functions.

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Jul 30, 2018

I don't think it's a good idea to implement even an approximate version of median() without any accuracy guarantees. See here for discussion about dask's percentile: #1225

A version based on dask.array.apply_gufunc would be fine, but it won't work in cases where the aggregated dimensions exists in multiple chunks.

@jakirkham
Copy link
Copy Markdown
Member

Should we drop percentile then? It seems weird for us to provide one and not the other.

Could use da.apply_along_axis for this sort of thing. Though that only works for a single axis. Definitely wouldn't work for the whole array.

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Aug 2, 2018 via email

@jakirkham
Copy link
Copy Markdown
Member

How does that work with median?

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Aug 2, 2018

Something like:

import dask.array as da
import numpy as np

x = da.random.random((100, 100), chunks=(10, 100))
res = da.apply_gufunc(lambda x: np.median(x, axis=-1), '(x)->()', x, output_dtypes=x.dtype)

Applying this along axes other than the last would require a bit of dimension shuffling but would not be too difficult. We could/should probably add this into the apply_gufunc() interface (#3843).

@jakirkham
Copy link
Copy Markdown
Member

Though there would need to be an explicit rechunking step for the axis (or axes) in question, right?

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Aug 2, 2018 via email

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Aug 2, 2018

Yes, automatic rechunking would be something to consider, perhaps based on whatever we use in general for apply_gufunc() (the same issues apply).

@jakirkham
Copy link
Copy Markdown
Member

To return to the question of percentile and median, it sounds like we are ok with one of the following:

  1. Drop percentile and exclude median.
  2. Fix percentile to work correctly and allow median.

Does that sound like an accurate summary? Anything missing above?

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Aug 2, 2018 via email

@shoyer
Copy link
Copy Markdown
Member

shoyer commented Aug 2, 2018

So perhaps it would make more sense to fix percentile first to work correctly (e.g., by using apply_gufunc()). This would be a good idea even if the new algorithm doesn't work in every case (which it won't, since it requires that the aggregated dimension is unchunked).

For sufficiently randomly ordered inputs, the current percentile algorithm would work fine. But for some unknown fraction of cases, it returns silently incorrect results with no error bound.

I would be OK imposing some pain on existing users to eliminate silent bugs, e.g., by introducing a new keyword argument assume_randomly_ordered_input=False and requiring it to be explicitly flipped to True to use the existing algorithm.

@jakirkham
Copy link
Copy Markdown
Member

Agreed fixing it sounds like the right move. Seems there have already been several issues logged with the current implementation. ( #731 ) ( #1212 ) ( #1225 ) ( #3099 ) ( #3115 ) A few of these have been closed, but more out of understanding of the known limitations than a real fix.

That said, of course this is a hard problem and solving it exactly on arbitrarily large N-D data is probably not worth the effort (if it is even possible). It would be good if we could collect some user feedback about when people apply percentile. Do they actually want percentiles on the entire data or are they looking for some piece of it? Is it primarily 1-D data (e.g. time series) or is it higher dimensionality? What compromises are users willing to make ( as there certainly will be a few ;)?

Agree an exact algorithm that can only work on a portion of the data might be pretty useful.

Agree that mandatory shuffling could be pretty useful. Though I don't know the existing algorithm well. So don't know if there are more specific requirements about the shuffle.

Also @jcrist has crick, which we may consider adding as a dependency or (with his ok) moving over.

@stsievert
Copy link
Copy Markdown
Member Author

I think this PR is ready for merge now given @shoyer's at pydata/xarray#2999 (comment)

@mrocklin
Copy link
Copy Markdown
Member

There are also lots of cases where exact median is doable, specifically if we're computing the median only along some axis. My guess is that this is more likely to be the common case for dask array users, (particularly among image processing use cases) but I'm not sure. Given this, I'm not sure how best to handle the approximate situation.

@mrocklin
Copy link
Copy Markdown
Member

See also #5575

@jakirkham
Copy link
Copy Markdown
Member

I wonder also if doing something like histogram could be a good way of approximating the median in the cases where an exact value is hard to come by. For small integral types, this could even compute the median exactly on large data.

@jsignell
Copy link
Copy Markdown
Member

Another issue came up recently dealing with medians (#4362). I think people would derive benefit from even an approximate median.

@jakirkham
Copy link
Copy Markdown
Member

@shoyer, does this implementation seem better?

@jangorecki
Copy link
Copy Markdown

What seems optimal approach is to provide median function having extra argument that decides if approximation should be computed or exact median. Ideally defaulting to exact median to avoid surprises.

@jrbourbeau
Copy link
Copy Markdown
Member

Closing in favor of #9483 -- thanks all for engaging here

@jrbourbeau jrbourbeau closed this Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

da.percentile and np.percentile mismatch median/nanmedian

7 participants