Support mixed operations between arrays and dataframes #3230
mrocklin merged 3 commits into dask:master
Conversation
This allows mixing of Dask array and dataframe objects in element-wise
computations *if* they are well aligned:

```python
df.x - df.y.values
```
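For context, this matches pandas' own semantics: `df.y.values` is a plain NumPy array, so the subtraction pairs elements by position rather than by index label. A minimal pandas-only sketch (the column names `x` and `y` are just illustrative):

```python
import numpy as np
import pandas as pd

# `df.y.values` is a bare NumPy array, so the subtraction below is
# positional: row i of `x` minus element i of the array.
df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]})
out = df.x - df.y.values
print(out.tolist())  # -> [9, 18, 27]
```

The PR extends this same behavior to well-aligned Dask collections.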
Here is a possible approach to #3227. Currently this is a proof of concept; there is plenty of extra testing, hardening, and informative error reporting that should happen. I wanted to check with @jcrist and @TomAugspurger about the approach before going further.
I've removed the WIP label. This is more ready for review now.
I'll take a closer look later, but overall this seems not too invasive. I think it's an acceptable level of complexity to take on for maintaining compatibility with pandas.
I'm excited to see this. I have a use case that involves converting between dataframes and arrays that I know will be aligned due to construction. I'm traveling this week, but I should be able to test and share my use case this weekend.
Ah, my use case for mixing arrays and dataframes is different from what is supported here. I can open a new issue if you prefer. I want to be able to extract arrays from dataframes, perform operations on these arrays, then construct a new dataframe using the same index and some columns from the original dataframe. The array operations don't change the shape of the zeroth axis, so we know the arrays and dataframes will always align. It's possible I'm going about this the wrong way. Here's an example:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    np.arange(30).reshape((5, 6)),
    columns=['partitioned_key', 'other_key', 'L1', 'L2', 'R1', 'R2'],
).set_index('partitioned_key', drop=False)
ddf = dd.from_pandas(df, npartitions=2)

# Convert dask.dataframe to dask.arrays
A = ddf[['L1', 'L2']].values
B = ddf[['R1', 'R2']].values

# Perform (possibly complex) operations on dask.arrays
# The result will always align with ddf.index
result = A + B

# Create new dask.dataframe combining result with keys from ddf
ddf2 = dd.from_dask_array(result)
ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS
ddf2['other_key'] = ddf['other_key']  # FAILS

# We still need a way to set the index of ddf2 to be the same as ddf.
# Once they have the same index, we probably want to use dd.concat:
# >>> dd.concat([ddf[['partitioned_key', 'other_key']], ddf2], axis='columns')

# Alternative approach (assign one column at a time)
ddf3 = ddf[['partitioned_key', 'other_key']]
ddf3['R1'] = result[:, 0]  # FAILS
ddf3['R2'] = dd.from_dask_array(result[:, 1])  # FAILS
```
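For contrast, the same round trip is unproblematic in plain pandas, because the rebuilt frame can reuse `df.index` directly; a minimal in-memory sketch of the pattern described above (no dask involved):

```python
import numpy as np
import pandas as pd

# Same frame as the dask example above, but in-memory
df = pd.DataFrame(
    np.arange(30).reshape((5, 6)),
    columns=['partitioned_key', 'other_key', 'L1', 'L2', 'R1', 'R2'],
).set_index('partitioned_key', drop=False)

A = df[['L1', 'L2']].values
B = df[['R1', 'R2']].values
result = A + B  # shape along axis 0 is preserved, so rows still align

# Rebuild a frame on the original index, then pull a key column across;
# the shared index makes the assignment align correctly
df2 = pd.DataFrame(result, columns=['R1', 'R2'], index=df.index)
df2['other_key'] = df['other_key']
print(df2['R1'].tolist())  # -> [6, 18, 30, 42, 54]
```

The dask version fails precisely because `dd.from_dask_array` cannot know that its result shares an index with `ddf`, which is what the divisions discussion below is about.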
This seems possible if we were to relax dask.dataframe's checks on mismatched known/unknown divisions. This works fine, for example:

```python
In [7]: ddf2.divisions = ddf.divisions

In [8]: ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS
```

I suspect that small changes here could resolve this:

```python
~/workspace/dask/dask/dataframe/multi.py in align_partitions(*dfs)
    101         raise ValueError("dfs contains no DataFrame and Series")
    102     if not all(df.known_divisions for df in dfs1):
--> 103         raise ValueError("Not all divisions are known, can't align "
    104                          "partitions. Please use `set_index` "
    105                          "to set the index.")
```

This is probably decently easy to do if you're interested @eriknw
I plan to merge this tomorrow if there are no further comments.
Great. I'll pursue my use case sometime this year, but it's currently low priority. |
- Passes `flake8 dask`
- `docs/source/changelog.rst` for all changes and one of the `docs/source/*-api.rst` files for new API