dask.dataframe.read_csv('./filepath/*.csv') returning tuple #7777

@evanharwin

What happened:
Loading a dataframe with read_csv seemingly returned a tuple rather than a Dask DataFrame; computing on it raised this exception:
AttributeError: 'tuple' object has no attribute 'sample'

What you expected to happen:
I expected the code below to return a pandas.DataFrame containing the correlations I'm looking for.

Minimal Complete Verifiable Example:

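import os

import dask.dataframe as daskdf
from dask.distributed import Client

# In-process local cluster (threads, not separate worker processes)
client = Client(memory_limit='4GB', processes=False)

# input_file_path (defined elsewhere) is the directory holding the CSV files
raw_df = daskdf.read_csv(os.path.join(input_file_path, '*.csv'))
# Sample 1% of the rows and drop the columns not wanted in the correlation
df = raw_df.sample(frac=0.01).drop(['gaugeid', 'time', 'input', 'labels'], axis=1)
correlations = df.corr().compute()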

Anything else we need to know?:
The example runs fine on my local machine (Windows 10, Dask 2021.1.1, Python 3.8.5). It only fails when run in containerised compute provided by Azure, where the Dask and Python versions differ (see Environment).

The full traceback is here:

Traceback (most recent call last):
  File "correlation_analysis.py", line 43, in <module>
    correlations = df.corr().compute()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 2673, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1982, in gather
    return self.sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
    return sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1847, in _gather
    raise exception.with_traceback(traceback)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/dataframe/methods.py", line 352, in sample
    return df.sample(random_state=rs, frac=frac, replace=replace) if len(df) > 0 else df
AttributeError: 'tuple' object has no attribute 'sample'

Environment:

  • Dask version: 2021.6.0
  • Python version: 3.8.1
  • Operating System: Linux
  • Install method (conda, pip, source): conda
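Since the local and Azure environments report different Dask and Python versions, a quick side-by-side comparison may help narrow this down. The following is only a minimal sketch: `collect_versions` and `diff_versions` are hypothetical helper names (not part of Dask), and the package list is an assumption about what is relevant here. It uses only the standard library.

```python
import platform
from importlib import metadata


def collect_versions(packages=("dask", "distributed", "pandas", "numpy", "tornado")):
    """Return the Python version and installed package versions for this environment.

    The default package list is an assumption; packages that are not
    installed are reported as None.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(a, b):
    """Return {name: (version_in_a, version_in_b)} for every entry that differs."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Run this in each environment and compare the printed dictionaries.
    print(collect_versions())
```

Running the script in both places and passing the two dictionaries to `diff_versions` lists only the entries that differ, which would show whether anything besides Dask and Python has drifted between the two setups.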
