Description
What happened:
A computation from a delayed call returning two outputs is run twice when one output is an array and the other is a dataframe. Interestingly, if both outputs are arrays, or both outputs are dataframes, the duplication does not occur.
If the two outputs (one array, one dataframe) are fed to another delayed call, computing the output of that second delayed call does not trigger the extra computation.
In the code below, compute is printed twice under 'Individually'.
What you expected to happen:
compute should be printed once under 'Individually'.
Minimal Complete Verifiable Example:
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
def calc():
    print('compute')
    return np.zeros(1), pd.DataFrame({'z': [1]})
res = dask.delayed(calc, nout=2)()
res0 = da.from_delayed(res[0], shape=(1,), dtype=np.float64)
res1 = dd.from_delayed(res[1], meta=[('z', np.int64)])
print('Individually')
_ = dask.compute(res0, res1, scheduler='synchronous')
def comb(a, b):
    return a, b
res_comb = dask.delayed(comb)(res0, res1)
print()
print('Combined')
_ = dask.compute(res_comb, scheduler='synchronous')

Individually
compute
compute
Combined
compute
Anything else we need to know?:
I do not think this is a meta/dtype/shape inference issue, as duplication isn't happening with the "Combined" computation.
Environment:
- Dask version: 2021.04.0+16.ge83379d5
- Python version: 3.8.8
- Operating System: MacOS 11.2.3
- Install method (conda, pip, source): conda