Closed
Description
I generated simple data: numbers 1-100, split into 10 shards, so that the first shard contains numbers 1-10, the second 11-20, and so on. I read the data with
df = dd.read_csv(os.path.join('/home/grzegorz/shards', 'shard_*.csv'))
and then I try to summarize it:
df.describe().compute()
The result is:
                a
count  100.000000
mean    50.500000
std     29.011492
min      1.000000
25%     33.250000
50%     65.500000
75%     97.750000
max    100.000000
Clearly the quantiles are far from the right values: for the numbers 1-100 the median should be about 50.5, not 65.5. I understand that calculating quantiles on a distributed data set is very troublesome and that obtaining the exact value might even be impossible, but this seems like a relatively big error for such a simple data set. It calls into question whether the results are reliable, for instance for data with a trend.
Please let me know whether quantile computation is going to be improved.
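For anyone who wants to reproduce this, here is a minimal sketch of the setup described above. The file names (`shard_00.csv` and so on), the temporary directory, and the helper functions are my own illustrative choices, not the exact script I ran. The reference quartiles are computed exactly with the standard library (`statistics.quantiles` needs Python 3.8 or newer), using linear interpolation, which I believe matches what pandas does by default.

```python
import csv
import glob
import os
import statistics
import tempfile


def make_shards(directory, n_shards=10, shard_size=10):
    """Write 1..n_shards*shard_size into sequential CSV shards with one column, 'a'.

    The shard_XX.csv naming is a hypothetical choice for this sketch.
    """
    for i in range(n_shards):
        path = os.path.join(directory, "shard_{:02d}.csv".format(i))
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["a"])
            start = i * shard_size + 1
            for value in range(start, start + shard_size):
                writer.writerow([value])


def exact_quartiles(directory):
    """Read every shard into memory and return the exact 25%, 50%, 75% quantiles."""
    values = []
    for path in sorted(glob.glob(os.path.join(directory, "shard_*.csv"))):
        with open(path, newline="") as f:
            values.extend(float(row["a"]) for row in csv.DictReader(f))
    # method="inclusive" interpolates linearly between the min and max of the data
    return statistics.quantiles(values, n=4, method="inclusive")


def dask_describe(directory):
    """Same calls as in the report above; requires dask, so it is not run by default."""
    import dask.dataframe as dd

    df = dd.read_csv(os.path.join(directory, "shard_*.csv"))
    return df.describe().compute()


with tempfile.TemporaryDirectory() as tmp:
    make_shards(tmp)
    reference = exact_quartiles(tmp)
    print(reference)  # [25.75, 50.5, 75.25]
```

Calling `dask_describe(tmp)` inside the `with` block gives the dask summary to compare against these reference values.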
Python 3.5.2
dask 0.16.0