Skip to content

Support mixed operations between arrays and dataframes#3230

Merged
mrocklin merged 3 commits intodask:masterfrom
mrocklin:array-dataframe-mixed
Mar 7, 2018
Merged

Support mixed operations between arrays and dataframes#3230
mrocklin merged 3 commits intodask:masterfrom
mrocklin:array-dataframe-mixed

Conversation

@mrocklin
Copy link
Copy Markdown
Member

@mrocklin mrocklin commented Feb 28, 2018

This allows mixing of Dask array and dataframe objects in element-wise
computations if they are well aligned.

df.x - df.y.values
  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

This allows mixing of Dask array and dataframe objects in element-wise
computations *if* they are well aligned.

    df.x - df.y.values
@mrocklin mrocklin changed the title Support mixed operations between arrays and dataframes [WIP] Support mixed operations between arrays and dataframes Feb 28, 2018
@mrocklin
Copy link
Copy Markdown
Member Author

Here is a possible approach to #3227

Currently this is a proof of concept. There is plenty of extra testing, hardening, and informative error reporting that should happen. I wanted to check with @jcrist and @TomAugspurger about the approach before going further.

@mrocklin mrocklin changed the title [WIP] Support mixed operations between arrays and dataframes Support mixed operations between arrays and dataframes Mar 1, 2018
@mrocklin
Copy link
Copy Markdown
Member Author

mrocklin commented Mar 1, 2018

I've removed the WIP label. This is more ready for review now.

@TomAugspurger
Copy link
Copy Markdown
Member

I'll take a closer look later, but overall this seems not too invasive. I think it's an acceptable level of complexity to take on for maintaining compatibility with pandas.

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Mar 1, 2018

I'm excited to see this. I have a use case that involves converting between dataframes and arrays that I know will be aligned due to construction. I'm traveling this week, but I should be able to test and share my use case this weekend.

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Mar 5, 2018

Ah, my use case for mixing arrays and dataframes is different than what is supported here. I can open a new issue if you prefer.

I want to be able to extract arrays from dataframes, perform operations on these arrays, then construct a new dataframe using the same index and some columns from the original dataframe. The array operations don't change the shape of the zeroth axis, so we know the arrays and dataframes will always align.

It's possible I'm going about this the wrong way. Here's an example:

import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(
    np.arange(30).reshape((5, 6)),
    columns=['partitioned_key', 'other_key', 'L1', 'L2', 'R1', 'R2'],
).set_index('partitioned_key', drop=False)
ddf = dd.from_pandas(df, npartitions=2)

# Convert dask.dataframe to dask.arrays
A = ddf[['L1', 'L2']].values
B = ddf[['R1', 'R2']].values

# Perform (possibly complex) operations on dask.arrays
# The result will always align with ddf.index
result = A + B

# Create new dask.dataframe combining result with keys from ddf
ddf2 = dd.from_dask_array(result)
ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS
ddf2['other_key'] = ddf['other_key']  # FAILS

# We still need a way to set the index of ddf2 to be the same as ddf.
# Once they have the same index, we probably want to use dd.concat:
# >>> dd.concat([ddf[['partitioned_key', 'other_key']], ddf2], axis='columns')

# Alternative approach (assign one column at a time)
ddf3 = ddf[['partitioned_key', 'other_key']]
ddf3['R1'] = result[:, 0]  # FAILS
ddf3['R2'] = dd.from_dask_array(result[:, 1])  # FAILS

@mrocklin
Copy link
Copy Markdown
Member Author

mrocklin commented Mar 5, 2018

This seems possible if we were to relax restrictions about dask.dataframe's checks on mismatched known/unknown divisions. This works fine for example:

In [7]: ddf2.divisions = ddf.divisions

In [8]: ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS

I suspect that small changes here could resolve this:

~/workspace/dask/dask/dataframe/multi.py in align_partitions(*dfs)
    101         raise ValueError("dfs contains no DataFrame and Series")
    102     if not all(df.known_divisions for df in dfs1):
--> 103         raise ValueError("Not all divisions are known, can't align "
    104                          "partitions. Please use `set_index` "
    105                          "to set the index.")

This is probably decently easy to do if you're interested @eriknw

@mrocklin
Copy link
Copy Markdown
Member Author

mrocklin commented Mar 5, 2018

I plan to merge this tomorrow if there are no further comments

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Mar 5, 2018

Great. I'll pursue my use case sometime this year, but it's currently low priority.

@mrocklin mrocklin merged commit 4ad9622 into dask:master Mar 7, 2018
@mrocklin mrocklin deleted the array-dataframe-mixed branch March 7, 2018 14:24
@jacobtomlinson jacobtomlinson mentioned this pull request Oct 3, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants