Errors in quantile calculation #3115

@grzes314

I generated simple data: numbers 1-100, split into 10 shards, so that the first shard contains numbers 1-10, the second 11-20, and so on. I read the data with
df = dd.read_csv(os.path.join('/home/grzegorz/shards', 'shard_*.csv'))
and then tried to summarize it:
df.describe().compute()
The result is:

                a
count  100.000000
mean    50.500000
std     29.011492
min      1.000000
25%     33.250000
50%     65.500000
75%     97.750000
max    100.000000

Clearly the quantiles are far from the correct values: for the numbers 1-100 the median should be about 50.5, not 65.5, and the 75% quantile should be about 75, not 97.75. I understand that calculating quantiles on a distributed data set is difficult and that obtaining the exact value might even be impossible, but this seems like a relatively large error for such a simple data set. It calls into question whether the results are reliable for, say, data with a trend.
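
For reference, here is a minimal sketch that rebuilds the sharded data and computes the exact quartiles with the Python standard library only, so the numbers above have something to be compared against. The file names `shard_0.csv` ... `shard_9.csv`, the temporary directory, and the helper names are assumptions made for illustration; only the `shard_*.csv` pattern and the column name `a` come from the report. `statistics.quantiles` needs Python 3.8 or newer, which is later than the Python 3.5.2 used above.

```python
import csv
import glob
import os
import statistics
import tempfile


def write_shards(directory, n_shards=10, shard_size=10):
    """Write 1..n_shards*shard_size into sorted CSV shards with one column 'a'."""
    for i in range(n_shards):
        # Hypothetical file naming; the report only shows the pattern shard_*.csv
        path = os.path.join(directory, "shard_{}.csv".format(i))
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["a"])
            start = i * shard_size + 1
            for value in range(start, start + shard_size):
                writer.writerow([value])


def read_column(directory):
    """Read column 'a' from every shard into one in-memory list."""
    values = []
    for path in sorted(glob.glob(os.path.join(directory, "shard_*.csv"))):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row["a"]))
    return values


def exact_quartiles(values):
    """Exact quartiles, interpolating linearly between neighbouring values."""
    return statistics.quantiles(values, n=4, method="inclusive")


with tempfile.TemporaryDirectory() as shard_dir:
    write_shards(shard_dir)
    data = read_column(shard_dir)

q25, q50, q75 = exact_quartiles(data)
print(len(data), q25, q50, q75)  # 100 25.75 50.5 75.25
```

Pointing `dd.read_csv` at the same directory (before it is cleaned up) and calling `df.describe().compute()` should make it possible to compare the reported quantiles with these exact values.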

Please let me know whether quantile computation is going to be improved.

Python 3.5.2
dask 0.16.0
