Skip to content

[BUG] Dask_cudf alters the metadata processing on CPU dask dataframe apply if metadata is passed #7946

@beckernick

Description

@beckernick

When dask_cudf is imported and a user calls apply on a pandas backed dask dataframe, dask-cudf alters the metadata creation step to use cudf if metadata is supplied. This can cause confusing downstream errors as the user will unexpectedly be operating on the GPU. If metadata is not explicitly supplied, Dask will continue to use pandas as expected. This does not happen if dask_cudf is not imported.

import pandas as pd
import dask.dataframe as dd
import dask_cudfdf = pd.DataFrame({'a': [3,4], 'b': [1, 2]})
ddf = dd.from_pandas(df, npartitions=1)
emb = ddf['a'].apply(pd.Series, meta={'c0': 'int64', 'c1': 'int64'})
print(type(emb._meta))
print(type(emb))
<class 'cudf.core.dataframe.DataFrame'>
<class 'dask_cudf.core.DataFrame'>
import pandas as pd
import dask.dataframe as dd
import dask_cudfdf = pd.DataFrame({'a': [3,4], 'b': [1, 2]})
ddf = dd.from_pandas(df, npartitions=1)
emb = ddf['a'].apply(pd.Series)
print(type(emb._meta))
print(type(emb))
<class 'pandas.core.frame.DataFrame'>
<class 'dask.dataframe.core.DataFrame'>
/raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210413/lib/python3.8/site-packages/dask/dataframe/core.py:3519: UserWarning: 
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta={0: 'int64'})

  warnings.warn(meta_warning(meta))
conda list | grep "rapids\|dask\|pandas\|arrow\|numpy\|scipy"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210413:
arrow-cpp                 1.0.1           py38h9018dff_36_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      0.20.0a210413   cuda_11.2_py38_gd6479a20d8_137    rapidsai-nightly
cuml                      0.20.0a210413   cuda11.2_py38_g5f61a3519_74    rapidsai-nightly
dask                      2021.4.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.4.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.20.0a210413            py38_9    rapidsai-nightly
dask-cudf                 0.20.0a210413   py38_gd6479a20d8_137    rapidsai-nightly
libcudf                   0.20.0a210413   cuda11.2_gd6479a20d8_137    rapidsai-nightly
libcuml                   0.20.0a210413   cuda11.2_g5f61a3519_74    rapidsai-nightly
libcumlprims              0.20.0a210408   cuda11.2_g7f19636_2    rapidsai-nightly
librmm                    0.20.0a210413   cuda11.2_g80bfeb2_13    rapidsai-nightly
numpy                     1.20.2           py38h9894fe3_0    conda-forge
pandas                    1.2.4            py38h1abd341_0    conda-forge
pyarrow                   1.0.1           py38hb53058b_36_cuda    conda-forge
rmm                       0.20.0a210413   cuda_11.2_py38_g80bfeb2_13    rapidsai-nightly
scipy                     1.6.2            py38h7b17777_0    conda-forge
ucx                       1.9.0+gcd9efd3       cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.20.0a210413   py38_gcd9efd3_2    rapidsai-nightly

cc @viclafargue

Metadata

Metadata

Assignees

Labels

PythonAffects Python cuDF API.bugSomething isn't workingdaskDask issue

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions