DataFrame correlations are broken? #1086

@mrocklin

`Series.corr` seems to break when the divisions are unknown (although perhaps this is appropriate):

>>> df.tip_amount.corr(df.payment_type == 2).compute()
/opt/anaconda/lib/python2.7/site-packages/dask/dataframe/core.pyc in corr(self, other, method, min_periods)
   1235             raise NotImplementedError("Only Pearson correlation has been "
   1236                                       "implemented")
-> 1237         df = concat([self, other], axis=1)
   1238         return cov_corr(df, min_periods, corr=True, scalar=True)
   1239 

/opt/anaconda/lib/python2.7/site-packages/dask/dataframe/multi.pyc in concat(dfs, axis, join, interleave_partitions)
    552     else:
    553         if axis == 1:
--> 554              raise ValueError('Unable to concatenate DataFrame with unknown '
    555                               'division specifying axis=1')
    556         else:

ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1

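Also, I seem to be getting wrong results from `DataFrame.corr`. The diagonal of a correlation matrix should be exactly 1, but the dask result below has 0.999169 and 8.420302 there (compare with pandas on the first 1000 rows):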

>>> df[['tip_amount', 'payment_type']].corr().compute()
               tip_amount    payment_type
tip_amount       0.999169       -0.033852
payment_type    -0.033852        8.420302

>>> df.head(1000)[['tip_amount', 'payment_type']].corr()
               tip_amount    payment_type
tip_amount       1.000000       -0.559584
payment_type    -0.559584        1.000000
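
For reference, here is a minimal pure-Python sketch of how a Pearson correlation can be assembled from per-partition sums, which is roughly the kind of result a chunked `corr` should reproduce. It is not dask's implementation; the helper names (`partition_stats`, `combine`, `pearson`, `chunked_corr`) and the sample data are hypothetical and only serve as a sanity check that a column correlated with itself gives 1.

```python
import math


def partition_stats(xs, ys):
    """Sufficient statistics for one partition: n, sum(x), sum(y), sum(x^2), sum(y^2), sum(x*y)."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(stats):
    """Add the per-partition statistics element-wise."""
    return tuple(sum(column) for column in zip(*stats))


def pearson(total):
    """Pearson correlation from the combined statistics."""
    n, sx, sy, sxx, syy, sxy = total
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)


def chunked_corr(x_parts, y_parts):
    """Correlation of two columns that are split into row-aligned partitions."""
    stats = [partition_stats(xs, ys) for xs, ys in zip(x_parts, y_parts)]
    return pearson(combine(stats))


# Hypothetical data, split into two partitions (not taken from the dataset above).
tip_amount = [[0.0, 1.5, 2.0], [3.0, 0.0, 4.5, 1.0]]
payment_type = [[2, 1, 1], [1, 2, 1, 1]]

self_corr = chunked_corr(tip_amount, tip_amount)      # should be 1 up to rounding
cross_corr = chunked_corr(tip_amount, payment_type)   # same value regardless of partitioning
```

Whatever the partitioning, the diagonal entries must come out as 1 (up to floating-point rounding) and every entry must lie in [-1, 1], so a value such as 8.420302 cannot be a valid correlation. The one-pass sums used here are the simplest form and can lose precision on large or badly scaled data.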
