Hi all,
I'd like to report some unexpectedly slow read_parquet behavior between dask versions.
On all examples I'll be using:
- CentOS 7.7
- Python 3.6
- Fastparquet 0.4.0 (installed with pip)
Dask versions were also installed with pip:

```
pip3 install dask=={version} dask[dataframe]=={version}
```
The code I'm using is:

```python
import datetime
import dask.dataframe as dd

filters = [('x', '>=', time_from), ('x', '<', time_to)]
ddf = dd.read_parquet('path', columns=['x', 'y'], filters=filters, engine='fastparquet')

s = datetime.datetime.now().timestamp()
df = ddf.compute()
e = datetime.datetime.now().timestamp()
print('took', e - s, 'secs')
```

Test 1 <--- slow
dask==2.20.0
fastparquet==0.4.0
Output:
took 62.25287580490112 secs
Test 2 <--- slow
dask==2.18.1
fastparquet==0.4.0
Output:
took 62.15111708641052 secs
Test 3 <--- fast
dask==1.2.2
fastparquet==0.4.0
Output:
took 1.4967741966247559 secs
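As an aside on the timing itself: subtracting `datetime.now().timestamp()` values works, but `time.perf_counter()` is the stdlib clock meant for measuring elapsed time. A small helper, sketched here (the `timed` name is my own, not part of dask):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# A cheap stand-in for ddf.compute(), just to show the pattern:
result, secs = timed(sum, range(1000))
print('took', secs, 'secs')
```

In the tests above, `timed(ddf.compute)` would report the same ~62 s vs ~1.5 s gap either way; this only avoids wall-clock skew from clock adjustments.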
The data schema is as shown below:
df.head()
|   | x | y |
|---|---|---|
| 0 | 2020-06-24 13:00:05 | 1003 |
| 1 | 2020-06-24 13:00:10 | 1083 |
| 2 | 2020-06-24 13:00:15 | 1247 |
| 3 | 2020-06-24 13:00:20 | 1173 |
| 4 | 2020-06-24 13:00:25 | 1260 |
df.dtypes

```
x    datetime64[ns]
y           float64
dtype: object
```
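For anyone trying to reproduce, a tiny frame with the same schema can be built with pandas (the values below just mirror the head() output above; the real dataset is of course much larger):

```python
import numpy as np
import pandas as pd

# Synthetic rows matching the reported schema: x at 5-second
# intervals (datetime64[ns]), y as float64.
df = pd.DataFrame({
    'x': pd.date_range('2020-06-24 13:00:05', periods=5, freq='5s'),
    'y': np.array([1003, 1083, 1247, 1173, 1260], dtype='float64'),
})
print(df.dtypes)
```

Writing this out with fastparquet and reading it back with the filters shown earlier should exercise the same code path.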
What happened:
The read_parquet call takes over a minute on recent dask versions (2.18.1, 2.20.0).

What you expected to happen:
read_parquet to be as fast as on dask==1.2.2, where the same query took ~1.5 seconds.
Regards,
C.