dask.dataframe quantile fails spectacularly in some edge cases #731

@shoyer

import pandas as pd
import dask.dataframe as dd

s = pd.Series([-1, 0, 0, 0, 1, 1])
print(s.median())  # 0.0
print(dd.from_pandas(s, 2).quantile(0.5).compute())  # 1.0, but 0.0 is expected

This is also true for arbitrarily large repetitions of this data, e.g.,

s = pd.Series([-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000)
# the same wrong result appears for every other chunk size I tested, not only 20
dd.from_pandas(s, 20).quantile(0.5).compute()  # 1.0
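
For reference, the expected answer for the larger example can be checked without pandas or dask. This is only a sketch using the standard library's `statistics.median`; the variable names (`data`, `ordered`, `lower_mid`, `upper_mid`, `exact_median`) are illustrative and do not come from either library.

```python
import statistics

# Same values as the larger example above, as a plain Python list.
data = [-1] * 1000 + [0, 0, 0] * 1000 + [1, 1] * 1000

# 6000 values in total: 1000 of -1, 3000 zeros, 2000 ones.
ordered = sorted(data)
n = len(ordered)

# With an even count, the median averages the two middle values.
lower_mid = ordered[n // 2 - 1]
upper_mid = ordered[n // 2]
exact_median = statistics.median(data)

print(n, lower_mid, upper_mid, exact_median)
```

Both middle values fall inside the run of zeros, so the exact median is 0 and a result of 1.0 is off by a whole data value rather than by a small approximation error.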

cc @ogrisel
