Closed
Hey awesome dask developers!
For some reason, computing the correlation between two Series that share the same name does not work.
Consider the following simple example:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(np.arange(1e6).reshape((-1, 1)), columns=['a'])
ddf = dd.from_pandas(df, chunksize=int(1e3))
ddf.a.corr(ddf.a).compute(scheduler='single-threaded')
It produces the following error:
ValueError: could not broadcast input array from shape (2,2) into shape (2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/dask/base.py", line 398, in compute
results = schedule(dsk, keys, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 503, in get_sync
return get_async(apply_sync, 1, dsk, keys, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 449, in get_async
fire_task()
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 445, in fire_task
callback=queue.put)
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 492, in apply_sync
res = func(*args, **kwds)
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 235, in execute_task
result = pack_exception(e, dumps)
File "/opt/conda/lib/python3.7/site-packages/dask/local.py", line 230, in execute_task
result = _execute_task(task, data)
File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/opt/conda/lib/python3.7/site-packages/dask/optimization.py", line 942, in __call__
dict(zip(self.inkeys, args)))
File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/opt/conda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/opt/conda/lib/python3.7/site-packages/dask/dataframe/core.py", line 4152, in cov_corr_chunk
m[idx] = np.nansum(mu_discrepancy, axis=0)
ValueError: could not broadcast input array from shape (2,2) into shape (2)
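A possible explanation for the (2,2) shape, which is my assumption and not something this report confirms: when two Series with the same name are combined column-wise, the resulting frame has duplicate column labels, and selecting by such a label returns a two-column DataFrame instead of a single Series. The pandas-only sketch below shows that behaviour, plus a NumPy assignment that fails in the same way; it does not reproduce dask's internal code.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.0), name="a")

# Two Series that share a name give duplicate column labels when combined.
dup = pd.concat([s, s], axis=1)
# Renaming one of them keeps the labels unique.
uniq = pd.concat([s, s.rename("b")], axis=1)

dup_selection = dup["a"]    # both 'a' columns come back as a DataFrame
uniq_selection = uniq["a"]  # a single Series

print(type(dup_selection).__name__, dup_selection.shape)    # DataFrame (5, 2)
print(type(uniq_selection).__name__, uniq_selection.shape)  # Series (5,)

# Writing a 2-D result into a 1-D row raises the same kind of ValueError.
m = np.zeros((2, 2))
try:
    m[0] = np.zeros((2, 2))
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)  # True
```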
However, the correlation does work when you rename the second Series from a to b, i.e. when you replace the last line with
ddf.a.corr(ddf.a.rename('b')).compute(scheduler='single-threaded')
You can reproduce this bug with Docker using the following
Dockerfile
FROM continuumio/miniconda3
RUN conda install pandas dask -y
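Until this is fixed, the renaming workaround mentioned above could be wrapped in a small helper. This is only a sketch: `corr_with_unique_names` and its `suffix` parameter are hypothetical names of mine, not part of pandas or dask. It is shown with a pandas Series so it runs without dask installed.

```python
import numpy as np
import pandas as pd


def corr_with_unique_names(left, right, suffix="_other"):
    """Correlate two Series, renaming `right` first if the names collide.

    Hypothetical helper for illustration; not part of pandas or dask.
    """
    if left.name == right.name:
        right = right.rename(f"{right.name}{suffix}")
    return left.corr(right)


s = pd.Series(np.arange(10.0), name="a")
r = corr_with_unique_names(s, s)
print(round(r, 6))  # 1.0
```

With dask, I would expect `corr_with_unique_names(ddf.a, ddf.a).compute(scheduler='single-threaded')` to take the same path as the renamed call above, assuming the dask Series exposes `name`, `rename` and `corr` the way the pandas one does.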
Here is some information about the conda environment in which this error occurs:
conda info -a
active environment : None
user config file : /root/.condarc
populated config files :
conda version : 4.6.14
conda-build version : not installed
python version : 3.7.3.final.0
base environment : /opt/conda (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/conda/pkgs
/root/.conda/pkgs
envs directories : /opt/conda/envs
/root/.conda/envs
platform : linux-64
user-agent : conda/4.6.14 requests/2.21.0 CPython/3.7.3 Linux/4.9.125-linuxkit debian/9 glibc/2.24
UID:GID : 0:0
netrc file : None
offline mode : False
# conda environments:
#
base * /opt/conda
sys.version: 3.7.3 (default, Mar 27 2019, 22:11:17)
...
sys.prefix: /opt/conda
sys.executable: /opt/conda/bin/python
conda location: /opt/conda/lib/python3.7/site-packages/conda
conda-build: None
conda-env: /opt/conda/bin/conda-env
user site dirs:
CIO_TEST: <not set>
CONDA_ROOT: /opt/conda
PATH: /opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
REQUESTS_CA_BUNDLE: <not set>
SSL_CERT_FILE: <not set>
conda list
# packages in environment at /opt/conda:
#
# Name Version Build Channel
asn1crypto 0.24.0 py37_0
blas 1.0 mkl
bokeh 1.2.0 py37_0
ca-certificates 2019.5.15 0
certifi 2019.3.9 py37_0
cffi 1.12.2 py37h2e261b9_1
chardet 3.0.4 py37_1
click 7.0 py37_0
cloudpickle 1.1.1 py_0
conda 4.6.14 py37_0
cryptography 2.6.1 py37h1ba5d50_0
cytoolz 0.9.0.1 py37h14c3975_1
dask 1.2.2 py_0
dask-core 1.2.2 py_0
distributed 1.28.1 py37_0
freetype 2.9.1 h8a8886c_1
heapdict 1.0.0 py37_2
idna 2.8 py37_0
intel-openmp 2019.4 243
jinja2 2.10.1 py37_0
jpeg 9b h024ee3a_2
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h2733197_2
locket 0.2.0 py37_1
markupsafe 1.1.1 py37h7b6447c_0
mkl 2019.4 243
mkl_fft 1.0.12 py37ha843d7b_0
mkl_random 1.0.2 py37hd81dba3_0
msgpack-python 0.6.1 py37hfd86e86_1
ncurses 6.1 he6710b0_1
numpy 1.16.4 py37h7e9f1db_0
numpy-base 1.16.4 py37hde5b4d6_0
olefile 0.46 py37_0
openssl 1.1.1c h7b6447c_1
packaging 19.0 py37_0
pandas 0.24.2 py37he6710b0_0
partd 0.3.10 py37_1
pillow 6.0.0 py37h34e0f95_0
pip 19.0.3 py37_0
psutil 5.6.2 py37h7b6447c_0
pycosat 0.6.3 py37h14c3975_0
pycparser 2.19 py37_0
pyopenssl 19.0.0 py37_0
pyparsing 2.4.0 py_0
pysocks 1.6.8 py37_0
python 3.7.3 h0371630_0
python-dateutil 2.8.0 py37_0
pytz 2019.1 py_0
pyyaml 5.1 py37h7b6447c_0
readline 7.0 h7b6447c_5
requests 2.21.0 py37_0
ruamel_yaml 0.15.46 py37h14c3975_0
setuptools 41.0.0 py37_0
six 1.12.0 py37_0
sortedcontainers 2.1.0 py37_0
sqlite 3.27.2 h7b6447c_0
tblib 1.4.0 py_0
tk 8.6.8 hbc83047_0
toolz 0.9.0 py37_0
tornado 6.0.2 py37h7b6447c_0
urllib3 1.24.1 py37_0
wheel 0.33.1 py37_0
xz 5.2.4 h14c3975_4
yaml 0.1.7 had09818_2
zict 0.1.4 py37_0
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
Thanks for your help!