Support mixed operations between arrays and dataframes #3230
mrocklin merged 3 commits into dask:master
Conversation
This allows mixing of Dask array and dataframe objects in element-wise
computations *if* they are well aligned:

```python
df.x - df.y.values
```
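For context, this matches pandas' own semantics: `df.y.values` is a plain NumPy array, so the subtraction pairs elements by position rather than by index label. A minimal pandas-only sketch (the column names `x` and `y` are just illustrative):

```python
import numpy as np
import pandas as pd

# `df.y.values` is a bare NumPy array, so the subtraction below is
# positional: row i of `x` minus element i of the array.
df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]})
out = df.x - df.y.values
print(out.tolist())  # -> [9, 18, 27]
```

The PR extends this same behavior to well-aligned Dask collections.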
Here is a possible approach to #3227. Currently this is a proof of concept; there is plenty of extra testing, hardening, and informative error reporting that should happen. I wanted to check with @jcrist and @TomAugspurger about the approach before going further.
I've removed the WIP label. This is more ready for review now.
I'll take a closer look later, but overall this seems not too invasive. I think it's an acceptable level of complexity to take on for maintaining compatibility with pandas.
I'm excited to see this. I have a use case that involves converting between dataframes and arrays that I know will be aligned due to construction. I'm traveling this week, but I should be able to test and share my use case this weekend.
Ah, my use case for mixing arrays and dataframes is different from what is supported here. I can open a new issue if you prefer. I want to be able to extract arrays from dataframes, perform operations on these arrays, then construct a new dataframe using the same index and some columns from the original dataframe. The array operations don't change the shape of the zeroth axis, so we know the arrays and dataframes will always align. It's possible I'm going about this the wrong way. Here's an example:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    np.arange(30).reshape((5, 6)),
    columns=['partitioned_key', 'other_key', 'L1', 'L2', 'R1', 'R2'],
).set_index('partitioned_key', drop=False)
ddf = dd.from_pandas(df, npartitions=2)

# Convert dask.dataframe to dask.arrays
A = ddf[['L1', 'L2']].values
B = ddf[['R1', 'R2']].values

# Perform (possibly complex) operations on dask.arrays
# The result will always align with ddf.index
result = A + B

# Create new dask.dataframe combining result with keys from ddf
ddf2 = dd.from_dask_array(result)
ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS
ddf2['other_key'] = ddf['other_key']  # FAILS

# We still need a way to set the index of ddf2 to be the same as ddf.
# Once they have the same index, we probably want to use dd.concat:
# >>> dd.concat([ddf[['partitioned_key', 'other_key']], ddf2], axis='columns')

# Alternative approach (assign one column at a time)
ddf3 = ddf[['partitioned_key', 'other_key']]
ddf3['R1'] = result[:, 0]  # FAILS
ddf3['R2'] = dd.from_dask_array(result[:, 1])  # FAILS
```
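For contrast, the same round trip is unproblematic in plain pandas, because the rebuilt frame can reuse `df.index` directly; a minimal in-memory sketch of the pattern described above (no dask involved):

```python
import numpy as np
import pandas as pd

# Same frame as the dask example above, but in-memory
df = pd.DataFrame(
    np.arange(30).reshape((5, 6)),
    columns=['partitioned_key', 'other_key', 'L1', 'L2', 'R1', 'R2'],
).set_index('partitioned_key', drop=False)

A = df[['L1', 'L2']].values
B = df[['R1', 'R2']].values
result = A + B  # shape along axis 0 is preserved, so rows still align

# Rebuild a frame on the original index, then pull a key column across;
# the shared index makes the assignment align correctly
df2 = pd.DataFrame(result, columns=['R1', 'R2'], index=df.index)
df2['other_key'] = df['other_key']
print(df2['R1'].tolist())  # -> [6, 18, 30, 42, 54]
```

The dask version fails precisely because `dd.from_dask_array` cannot know that its result shares an index with `ddf`, which is what the divisions discussion below is about.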
This seems possible if we were to relax dask.dataframe's checks on mismatched known/unknown divisions. This works fine, for example:

```python
In [7]: ddf2.divisions = ddf.divisions

In [8]: ddf2['partitioned_key'] = ddf['partitioned_key']  # FAILS
```

I suspect that small changes here could resolve this:

```python
~/workspace/dask/dask/dataframe/multi.py in align_partitions(*dfs)
    101         raise ValueError("dfs contains no DataFrame and Series")
    102     if not all(df.known_divisions for df in dfs1):
--> 103         raise ValueError("Not all divisions are known, can't align "
    104                          "partitions. Please use `set_index` "
    105                          "to set the index.")
```

This is probably decently easy to do if you're interested @eriknw
I plan to merge this tomorrow if there are no further comments.
Great. I'll pursue my use case sometime this year, but it's currently low priority. |
- Passes `flake8 dask`
- `docs/source/changelog.rst` for all changes and one of the `docs/source/*-api.rst` files for new API