dask.dataframe.core.Series.corr fails when other Series has the same name #4906

@Chilipp

Hey awesome dask developers!

For some reason, computing the correlation between two Series fails when both Series have the same name.

Consider the following simple example:

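import numpy as np
import pandas as pd
import dask.dataframe as dd

# One column named 'a' with 1e6 rows, split into partitions of 1000 rows
df = pd.DataFrame(np.arange(1e6).reshape((-1, 1)), columns=['a'])
ddf = dd.from_pandas(df, chunksize=int(1e3))

# Correlate the Series with itself, so both operands are named 'a'
ddf.a.corr(ddf.a).compute(scheduler='single-threaded')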

It produces the following error:

ValueError: could not broadcast input array from shape (2,2) into shape (2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 503, in get_sync
    return get_async(apply_sync, 1, dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 449, in get_async
    fire_task()
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 445, in fire_task
    callback=queue.put)
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 492, in apply_sync
    res = func(*args, **kwds)
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 235, in execute_task
    result = pack_exception(e, dumps)
  File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/opt/conda/lib/python3.7/site-packages/dask/optimization.py", line 942, in __call__
    dict(zip(self.inkeys, args)))
  File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/opt/conda/lib/python3.7/site-packages/dask/dataframe/core.py", line 4152, in cov_corr_chunk
    m[idx] = np.nansum(mu_discrepancy, axis=0)
ValueError: could not broadcast input array from shape (2,2) into shape (2)
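
The shapes in the error message are consistent with a duplicated column label, although this is only a guess and has not been checked against the dask source. The sketch below uses plain pandas and NumPy (no dask) to show two facts that could combine into such an error: selecting a duplicated label from a DataFrame returns a 2-D DataFrame rather than a 1-D Series, and assigning a (2, 2) array into a single row of a (2, 2) array raises a broadcast ValueError. The variable names are illustrative and do not come from dask.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.0), name="a")

# Two columns that share the label 'a', as when a Series is paired with itself
both = pd.concat([s, s], axis=1)
# Two columns with distinct labels, as in the rename workaround
renamed = pd.concat([s, s.rename("b")], axis=1)

dup_selection = both["a"]      # DataFrame: the label matches both columns
uniq_selection = renamed["a"]  # Series: the label matches exactly one column

print(type(dup_selection).__name__, dup_selection.shape)
print(type(uniq_selection).__name__, uniq_selection.shape)

# Assigning a 2-D result into one row of a 2-D array cannot broadcast
m = np.zeros((2, 2))
try:
    m[0] = np.ones((2, 2))
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print("ValueError:", exc)
```

If the failing code does select columns by label, this would also explain why renaming one Series avoids the error, but that remains an assumption.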

However, it does work when you rename the second Series from a to b, i.e. when you replace the last line with

ddf.a.corr(ddf.a.rename('b')).compute(scheduler='single-threaded')
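
Until this is fixed, the rename workaround can be wrapped in a small helper. The following is a minimal sketch: `corr_distinct_names` and its `suffix` parameter are hypothetical names, not part of dask or pandas. It relies only on `.name`, `.rename` and `.corr`, which both pandas and dask Series provide, and it is demonstrated here on a pandas Series so that it runs without dask.

```python
import numpy as np
import pandas as pd


def corr_distinct_names(left, right, suffix="_other"):
    """Correlate two Series, renaming `right` first if the names collide.

    Hypothetical helper: with dask Series the result is lazy and still
    needs .compute(); with pandas Series it is a plain float.
    """
    if left.name == right.name:
        right = right.rename(f"{right.name}{suffix}")
    return left.corr(right)


# Demonstration with pandas; a Series correlated with itself
s = pd.Series(np.arange(10.0), name="a")
value = corr_distinct_names(s, s)
print(value)
```

With the dask objects from the example above, the call would presumably be `corr_distinct_names(ddf.a, ddf.a).compute(scheduler='single-threaded')`.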

You can reproduce this bug with docker using the following

Dockerfile
FROM continuumio/miniconda3

RUN conda install pandas dask -y

Here is some information about the conda environment in which this error occurs:

conda info -a

     active environment : None
       user config file : /root/.condarc
 populated config files : 
          conda version : 4.6.14
    conda-build version : not installed
         python version : 3.7.3.final.0
       base environment : /opt/conda  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /root/.conda/pkgs
       envs directories : /opt/conda/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.6.14 requests/2.21.0 CPython/3.7.3 Linux/4.9.125-linuxkit debian/9 glibc/2.24
                UID:GID : 0:0
             netrc file : None
           offline mode : False

# conda environments:
#
base                  *  /opt/conda

sys.version: 3.7.3 (default, Mar 27 2019, 22:11:17) 
...
sys.prefix: /opt/conda
sys.executable: /opt/conda/bin/python
conda location: /opt/conda/lib/python3.7/site-packages/conda
conda-build: None
conda-env: /opt/conda/bin/conda-env
user site dirs: 

CIO_TEST: <not set>
CONDA_ROOT: /opt/conda
PATH: /opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
REQUESTS_CA_BUNDLE: <not set>
SSL_CERT_FILE: <not set>

conda list

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
asn1crypto                0.24.0                   py37_0  
blas                      1.0                         mkl  
bokeh                     1.2.0                    py37_0  
ca-certificates           2019.5.15                     0  
certifi                   2019.3.9                 py37_0  
cffi                      1.12.2           py37h2e261b9_1  
chardet                   3.0.4                    py37_1  
click                     7.0                      py37_0  
cloudpickle               1.1.1                      py_0  
conda                     4.6.14                   py37_0  
cryptography              2.6.1            py37h1ba5d50_0  
cytoolz                   0.9.0.1          py37h14c3975_1  
dask                      1.2.2                      py_0  
dask-core                 1.2.2                      py_0  
distributed               1.28.1                   py37_0  
freetype                  2.9.1                h8a8886c_1  
heapdict                  1.0.0                    py37_2  
idna                      2.8                      py37_0  
intel-openmp              2019.4                      243  
jinja2                    2.10.1                   py37_0  
jpeg                      9b                   h024ee3a_2  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 8.2.0                hdf63c60_1  
libgfortran-ng            7.3.0                hdf63c60_0  
libpng                    1.6.37               hbc83047_0  
libstdcxx-ng              8.2.0                hdf63c60_1  
libtiff                   4.0.10               h2733197_2  
locket                    0.2.0                    py37_1  
markupsafe                1.1.1            py37h7b6447c_0  
mkl                       2019.4                      243  
mkl_fft                   1.0.12           py37ha843d7b_0  
mkl_random                1.0.2            py37hd81dba3_0  
msgpack-python            0.6.1            py37hfd86e86_1  
ncurses                   6.1                  he6710b0_1  
numpy                     1.16.4           py37h7e9f1db_0  
numpy-base                1.16.4           py37hde5b4d6_0  
olefile                   0.46                     py37_0  
openssl                   1.1.1c               h7b6447c_1  
packaging                 19.0                     py37_0  
pandas                    0.24.2           py37he6710b0_0  
partd                     0.3.10                   py37_1  
pillow                    6.0.0            py37h34e0f95_0  
pip                       19.0.3                   py37_0  
psutil                    5.6.2            py37h7b6447c_0  
pycosat                   0.6.3            py37h14c3975_0  
pycparser                 2.19                     py37_0  
pyopenssl                 19.0.0                   py37_0  
pyparsing                 2.4.0                      py_0  
pysocks                   1.6.8                    py37_0  
python                    3.7.3                h0371630_0  
python-dateutil           2.8.0                    py37_0  
pytz                      2019.1                     py_0  
pyyaml                    5.1              py37h7b6447c_0  
readline                  7.0                  h7b6447c_5  
requests                  2.21.0                   py37_0  
ruamel_yaml               0.15.46          py37h14c3975_0  
setuptools                41.0.0                   py37_0  
six                       1.12.0                   py37_0  
sortedcontainers          2.1.0                    py37_0  
sqlite                    3.27.2               h7b6447c_0  
tblib                     1.4.0                      py_0  
tk                        8.6.8                hbc83047_0  
toolz                     0.9.0                    py37_0  
tornado                   6.0.2            py37h7b6447c_0  
urllib3                   1.24.1                   py37_0  
wheel                     0.33.1                   py37_0  
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7                had09818_2  
zict                      0.1.4                    py37_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.3.7                h0b5b093_0  

Thanks for your help!
