Read_Parquet too slow between versions 1.* and 2.* #6376

@CrashLaker

Description

Hi all,

I'd like to report unexpectedly slow read_parquet behavior across dask versions.
All examples use:

  • CentOS 7.7
  • Python 3.6
  • fastparquet 0.4.0 (installed with pip)

Dask versions were also installed with pip.
pip3 install dask=={version} dask[dataframe]=={version}

The code I'm using is:

import datetime
import dask.dataframe as dd

filters = [('x', '>=', time_from), ('x', '<', time_to)]
ddf = dd.read_parquet('path', columns=['x', 'y'], filters=filters, engine='fastparquet')
s = datetime.datetime.now().timestamp()
df = ddf.compute()
e = datetime.datetime.now().timestamp()
print('took', e - s, 'secs')
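As an aside on the measurement itself: datetime.now() works for coarse timing, but a monotonic clock is not affected by system clock adjustments. A small sketch of an equivalent helper (timed is my own name, not a Dask API; sum stands in for ddf.compute()):

```python
import time

def timed(fn, *args, **kwargs):
    # Return (result, elapsed_seconds) for a single call, using a monotonic clock.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage with a cheap stand-in for ddf.compute():
result, secs = timed(sum, range(1000))
print('took', secs, 'secs')
```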

Test 1 <--- slow
dask==2.20.0
fastparquet==0.4.0
the output is:
took 62.25287580490112 secs

Test 2 <--- slow
dask==2.18.1
fastparquet==0.4.0
output:
took 62.15111708641052 secs

Test 3 <--- fast
dask==1.2.2
fastparquet==0.4.0
output:
took 1.4967741966247559 secs

The data schema is as shown below:
df.head()

x y
0 2020-06-24 13:00:05 1003
1 2020-06-24 13:00:10 1083
2 2020-06-24 13:00:15 1247
3 2020-06-24 13:00:20 1173
4 2020-06-24 13:00:25 1260

df.dtypes

x    datetime64[ns]
y           float64
dtype: object
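For anyone trying to reproduce without the original files, data of the same shape can be generated from the standard library alone; a minimal sketch (the values are illustrative, not the real dataset):

```python
import random
from datetime import datetime, timedelta

random.seed(0)
start = datetime(2020, 6, 24, 13, 0, 5)
# One row every 5 seconds, mirroring the x (timestamp) / y (float) schema above.
rows = [
    {"x": start + timedelta(seconds=5 * i), "y": float(random.randint(1000, 1300))}
    for i in range(5)
]
for row in rows:
    print(row["x"], row["y"])
```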

What happened:
read_parquet is slow on recent dask versions: over 1 minute per read.

What you expected to happen:
A faster read_parquet; on dask==1.2.2 the same read took about 1.5 seconds.
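For context, the filters argument is intended to let the engine skip whole row groups via predicate pushdown against column statistics, which is where the two orders of magnitude could plausibly come from. A rough pure-Python sketch of that selection logic (hypothetical min/max statistics, not fastparquet's actual code):

```python
from datetime import datetime

# Hypothetical per-row-group (min, max) statistics for column 'x'.
row_groups = [
    (datetime(2020, 6, 24, 12, 0), datetime(2020, 6, 24, 13, 0)),
    (datetime(2020, 6, 24, 13, 0), datetime(2020, 6, 24, 14, 0)),
    (datetime(2020, 6, 24, 14, 0), datetime(2020, 6, 24, 15, 0)),
]

time_from = datetime(2020, 6, 24, 13, 0)
time_to = datetime(2020, 6, 24, 14, 0)

# A row group can be skipped when its statistics prove that no row
# satisfies x >= time_from and x < time_to.
selected = [
    i for i, (lo, hi) in enumerate(row_groups)
    if hi >= time_from and lo < time_to
]
print(selected)  # → [0, 1]: only these row groups must be read
```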

Regards,
C.

Metadata

Assignees: no one assigned
Labels: io, needs info (Needs further information from the user)
