MAINT: implement median by stsievert · Pull Request #3819 · dask/dask

stsievert · 2018-07-26T17:26:36Z

This PR implements da.median and dask.array.Array.median. It is a simple wrapper around da.percentile.
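
For readers unfamiliar with the wrapper idea, here is a minimal sketch, not the PR's actual diff. It assumes a 1-D input (da.percentile only accepts 1-D arrays), and the helper name approx_median is made up for illustration.

```python
import dask.array as da
import numpy as np


def approx_median(x):
    """Approximate median of a 1-D dask array (hypothetical helper)."""
    if x.ndim != 1:
        raise ValueError("da.percentile only supports 1-D arrays")
    # da.percentile returns a length-1 array for a single q; take its only element.
    return da.percentile(x, 50)[0]


rng = np.random.default_rng(0)
data = rng.random(10_000)
x = da.from_array(data, chunks=1_000)

approx = float(approx_median(x).compute())
exact = float(np.median(data))
print(approx, exact)
```

On uniformly random data like this the two values should be close, but the result is an approximation without a stated error bound.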

This would close #46

Tests added and pass
Passes flake8 dask

TODO

implement dask.dataframe._Frame.median
(possibly) close da.percentile and np.percentile mismatch #3099

jakirkham · 2018-07-27T14:38:18Z

The documentation of percentile says it is approximate, which makes sense. How important is it to have the exact median? Should we add a similar note in da.median's docs?

stsievert · 2018-07-27T17:19:01Z

I've modified the docstring, and copied it to both functions.

shoyer · 2018-07-30T00:14:45Z

I don't think it's a good idea to implement even an approximate version of median() without any accuracy guarantees. See here for discussion about dask's percentile: #1225

A version based on dask.array.apply_gufunc would be fine, but it won't work in cases where the aggregated dimension spans multiple chunks.

jakirkham · 2018-08-02T14:45:04Z

Should we drop percentile then? It seems weird for us to provide one and not the other.

Could use da.apply_along_axis for this sort of thing. Though that only works for a single axis. Definitely wouldn't work for the whole array.

shoyer · 2018-08-02T15:03:06Z

Yes, I'd support dropping percentile. It would be better to use apply_gufunc than apply_along_axis because it does vectorised calculations over each chunk.

jakirkham · 2018-08-02T17:45:19Z

How does that work with median?

shoyer · 2018-08-02T18:16:36Z

Something like:

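import dask.array as da
import numpy as np

# Only axis 0 is split into chunks; the reduced (last) axis sits in a single chunk.
x = da.random.random((100, 100), chunks=(10, 100))
# '(x)->()' declares one core dimension, which the function reduces away.
res = da.apply_gufunc(lambda block: np.median(block, axis=-1), '(x)->()', x, output_dtypes=x.dtype)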

Applying this along axes other than the last would require a bit of dimension shuffling but would not be too difficult. We could/should probably add this into the apply_gufunc() interface (#3843).

jakirkham · 2018-08-02T19:49:49Z

Though there would need to be an explicit rechunking step for the axis (or axes) in question, right?
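
A rough sketch of what the dimension shuffling plus an explicit rechunk could look like. The helper name median_along_axis is hypothetical, and this is only one way to arrange the steps, not what dask itself does.

```python
import dask.array as da
import numpy as np


def median_along_axis(x, axis):
    """Exact median along one axis of a dask array (hypothetical helper)."""
    # Move the reduced axis to the end, where apply_gufunc expects core dimensions.
    moved = da.moveaxis(x, axis, -1)
    # Collapse that axis into a single chunk; -1 means "the full dimension".
    moved = moved.rechunk({moved.ndim - 1: -1})
    return da.apply_gufunc(
        lambda block: np.median(block, axis=-1),
        "(i)->()",
        moved,
        output_dtypes=moved.dtype,
    )


data = np.arange(24, dtype=float).reshape(4, 6)
x = da.from_array(data, chunks=(2, 3))  # axis 0 is split across two chunks
result = median_along_axis(x, axis=0).compute()
```

The rechunk is the costly part: every chunk along the reduced axis has to be brought together, so this only scales when that axis fits in memory for each block.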

mrocklin · 2018-08-02T19:51:24Z

It might be interesting to investigate auto-rechunking for apply_gufunc

shoyer · 2018-08-02T19:56:16Z

Yes, automatic rechunking would be something to consider, perhaps based on whatever we use in general for apply_gufunc() (the same issues apply).

jakirkham · 2018-08-02T20:15:55Z

To return to the question of percentile and median, it sounds like we are ok with one of the following:

1. Drop percentile and exclude median.
2. Fix percentile to work correctly and allow median.

Does that sound like an accurate summary? Anything missing above?

mrocklin · 2018-08-02T20:18:00Z

I'm not entirely comfortable with dropping percentile. It sees active use, despite the lack of theory around it.

shoyer · 2018-08-02T20:27:47Z

So perhaps it would make more sense to fix percentile first to work correctly (e.g., by using apply_gufunc()). This would be a good idea even if the new algorithm doesn't work in every case (which it won't, since it requires that the aggregated dimension is unchunked).

For sufficiently randomly ordered inputs, the current percentile algorithm would work fine. But for some unknown fraction of cases, it returns silently incorrect results with no error bound.
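
To illustrate the ordering sensitivity, here is a toy example. It uses a deliberately simplified stand-in (the median of per-chunk medians), which is an assumption for illustration and not dask's actual percentile merging algorithm.

```python
from statistics import median


def median_of_chunk_medians(chunks):
    """Naive merge: take the median of each chunk, then the median of those."""
    return median(median(chunk) for chunk in chunks)


def exact_median(chunks):
    """Reference answer computed on the flattened data."""
    return median(value for chunk in chunks for value in chunk)


# The same nine values (four 0s, five 10s), arranged two different ways.
mixed = [[0, 10, 10], [0, 10, 10], [0, 0, 10]]      # values spread across chunks
clustered = [[0, 0, 10], [0, 0, 10], [10, 10, 10]]  # values grouped by chunk

print(median_of_chunk_medians(mixed), exact_median(mixed))          # 10 10
print(median_of_chunk_medians(clustered), exact_median(clustered))  # 0 10
```

The clustered layout returns 0 where the true median is 10, with nothing to signal that the answer is wrong.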

I would be OK imposing some pain on existing users to eliminate silent bugs, e.g., by introducing a new keyword argument assume_randomly_ordered_input=False and requiring it to be explicitly flipped to True to use the existing algorithm.
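
A sketch of how such an opt-in flag might behave. Only the keyword name comes from the comment above; the function name, error message, and the simplified merge are all hypothetical.

```python
from statistics import median


def approximate_median(chunks, assume_randomly_ordered_input=False):
    """Chunk-wise approximate median that must be opted into (hypothetical API)."""
    if not assume_randomly_ordered_input:
        # Refuse by default so callers cannot get unbounded error silently.
        raise ValueError(
            "the approximate algorithm has no error bound; pass "
            "assume_randomly_ordered_input=True to use it anyway"
        )
    # Simplified stand-in for the real merge step: median of per-chunk medians.
    return median(median(chunk) for chunk in chunks)


demo_chunks = [[0, 10, 10], [0, 10, 10], [0, 0, 10]]
opted_in = approximate_median(demo_chunks, assume_randomly_ordered_input=True)
```

Calling approximate_median(demo_chunks) without the flag raises ValueError, which is the "pain" being traded for the removal of silent errors.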

jakirkham · 2018-08-02T20:47:49Z

Agreed, fixing it sounds like the right move. There have already been several issues logged against the current implementation (#731, #1212, #1225, #3099, #3115). A few of these have been closed, but more out of an understanding of the known limitations than because of a real fix.

That said, of course this is a hard problem and solving it exactly on arbitrarily large N-D data is probably not worth the effort (if it is even possible). It would be good if we could collect some user feedback about when people apply percentile. Do they actually want percentiles on the entire data or are they looking for some piece of it? Is it primarily 1-D data (e.g. time series) or is it higher dimensionality? What compromises are users willing to make (as there certainly will be a few ;)?

Agree an exact algorithm that can only work on a portion of the data might be pretty useful.

Agree that mandatory shuffling could be pretty useful. Though I don't know the existing algorithm well. So don't know if there are more specific requirements about the shuffle.

Also @jcrist has crick, which we may consider adding as a dependency or (with his ok) moving over.

stsievert · 2019-11-10T02:24:31Z

I think this PR is ready for merge now, given @shoyer's comment at pydata/xarray#2999.

mrocklin · 2019-11-10T15:30:52Z

There are also lots of cases where an exact median is doable, specifically if we're computing the median only along some axis. My guess is that this is more likely to be the common case for dask array users (particularly among image processing use cases), but I'm not sure. Given this, I'm not sure how best to handle the approximate situation.

mrocklin · 2019-11-10T15:56:56Z

See also #5575

jakirkham · 2019-11-10T21:52:50Z

I wonder also if doing something like histogram could be a good way of approximating the median in the cases where an exact value is hard to come by. For small integral types, this could even compute the median exactly on large data.
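
A small sketch of the histogram idea for integer data, written in plain Python with lists standing in for chunks. The function name is made up; the point is that per-chunk counts merge by addition, so the exact median can be read off the combined histogram.

```python
from collections import Counter
from statistics import median


def histogram_median(chunks):
    """Exact median of integer data from merged per-chunk value counts."""
    total = Counter()
    for chunk in chunks:
        total += Counter(chunk)  # each chunk is reduced to a small histogram
    n = sum(total.values())
    if n == 0:
        raise ValueError("no data")
    # 0-based ranks of the middle element(s); equal when n is odd.
    lo_rank, hi_rank = (n - 1) // 2, n // 2
    seen = 0
    lo_val = hi_val = None
    for value in sorted(total):
        seen += total[value]
        if lo_val is None and seen > lo_rank:
            lo_val = value
        if seen > hi_rank:
            hi_val = value
            break
    return (lo_val + hi_val) / 2


chunks = [[3, 1, 2], [2, 2, 5], [0, 4]]
hist_result = histogram_median(chunks)
reference = median(v for chunk in chunks for v in chunk)
```

For a type like uint8 the merged histogram has at most 256 bins no matter how large the data is. For floats or wide integer ranges the bins would have to be coarsened, and the result becomes approximate.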

jsignell · 2020-05-27T14:16:25Z

Another issue came up recently dealing with medians (#4362). I think people would derive benefit from even an approximate median.

jakirkham · 2020-07-20T03:33:33Z

@shoyer, does this implementation seem better?

jangorecki · 2020-12-28T17:09:13Z

The optimal approach seems to be a median function with an extra argument that selects between an approximate and an exact computation, ideally defaulting to the exact median to avoid surprises.

jrbourbeau · 2022-09-13T21:45:42Z

Closing in favor of #9483 -- thanks all for engaging here

MAINT: implement median

320cb46

stsievert force-pushed the median branch from 60dd691 to 320cb46 on July 26, 2018 17:30

DOC: modify docstrings

91a142d

shoyer mentioned this pull request Aug 2, 2018

Support axis/axes/keepdims in apply_gufunc() #3843

Closed

shoyer mentioned this pull request Sep 24, 2019

dask.dataframe quantile fails spectacularly in some edge cases #731

Open

stsievert mentioned this pull request Nov 10, 2019

median on dask arrays pydata/xarray#2999

Closed

stsievert added 3 commits November 9, 2019 20:01

Merge branch 'master' into median

94b9f25

Update implementation of median

7cb6bdb

Implement for Pandas Series

6c8f974

Base automatically changed from master to main March 8, 2021 20:19

This was referenced Aug 19, 2021

Possible docstring error for DataFrame.quantile #8065

Closed

Doc error on the comparison page - Dask does support a quantile method (albeit approximate) modin-project/modin#3354

Closed

jrbourbeau mentioned this pull request Sep 12, 2022

Add DataFrame and Series median method #9483

Merged

3 tasks

jrbourbeau closed this Sep 13, 2022

7 participants
