Unexpected chunking behavior when using xr.align with join='outer' #4112

@jbusecke


I just came across some unexpected behavior when using xr.align with the option join='outer' on two DataArrays that contain dask arrays and have different dimension lengths.

MCVE Code Sample

import numpy as np
import xarray as xr

# Two time axes that start on the same date but differ in length
short_time = xr.cftime_range('2000', periods=12)
long_time = xr.cftime_range('2000', periods=120)

data_short = np.random.rand(len(short_time))
data_long = np.random.rand(len(long_time))
# Both arrays are dask-backed, with chunks of length 3 along time
a = xr.DataArray(data_short, dims=['time'], coords={'time': short_time}).chunk({'time': 3})
b = xr.DataArray(data_long, dims=['time'], coords={'time': long_time}).chunk({'time': 3})

# The outer join pads a with missing values up to the length of b
a, b = xr.align(a, b, join='outer')

Expected Output

As expected, a is padded with missing values:

a.plot()
b.plot()

image

But the padded values do not replicate the chunking of b along the time dimension. Instead, they end up in one single chunk, which can be substantially larger than the others.

a.data

image

b.data

image
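For readers without the rendered images, the chunk layout can also be read from the `.chunks` attribute. The following is a minimal sketch of the same setup, not the exact code from this report: it assumes `pd.date_range` as a stand-in for `xr.cftime_range` so that cftime is not needed. On the versions listed below, the largest chunk of `a` would be the padded one; other versions may behave differently.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Assumption: pd.date_range replaces xr.cftime_range so cftime is not required.
short_time = pd.date_range("2000-01-01", periods=12)
long_time = pd.date_range("2000-01-01", periods=120)

a = xr.DataArray(
    np.random.rand(12), dims=["time"], coords={"time": short_time}
).chunk({"time": 3})
b = xr.DataArray(
    np.random.rand(120), dims=["time"], coords={"time": long_time}
).chunk({"time": 3})

a, b = xr.align(a, b, join="outer")

# .chunks holds one tuple of chunk lengths per dimension
a_chunks = a.chunks[0]
b_chunks = b.chunks[0]
print("a:", len(a_chunks), "chunks, largest =", max(a_chunks))
print("b:", len(b_chunks), "chunks, largest =", max(b_chunks))
```

Comparing `max(a_chunks)` with `max(b_chunks)` gives a quick check for an oversized padded chunk.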

(Quick shoutout for the amazing HTML representation. It made diagnosing this problem super easy! 🥳)

Problem Description

I think for many problems it would be more appropriate if the padded portion of the array had the same chunking scheme as the longer array.

A practical example (which brought me to this issue) comes from the CMIP6 data archive, where some models provide output for several members, some of which run longer than others, leading to problems when these are combined (see intake-esm/#225).
For that particular model, there are 5 members with a runtime of 100 years and one member with a runtime of 300 years. I think using xr.align immediately produces a chunk that is 200 years long, which blows up the memory on every system I have tried this on.

Is there a way to work around this, or is this behavior intended and I am missing something?
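One possible workaround, offered only as a sketch and not as an xarray-endorsed fix, is to build the padding explicitly with the desired chunk size and concatenate it, instead of relying on `xr.align`. The names `chunk_len`, `missing_time`, `pad`, and `a_padded` are illustrative, `pd.date_range` again stands in for `xr.cftime_range`, and the approach assumes the shorter array covers the start of the longer time axis.

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

chunk_len = 3  # illustrative chunk length, matching the example above

# Assumption: pd.date_range replaces xr.cftime_range so cftime is not required.
short_time = pd.date_range("2000-01-01", periods=12)
long_time = pd.date_range("2000-01-01", periods=120)

a = xr.DataArray(
    np.random.rand(12), dims=["time"], coords={"time": short_time}
).chunk({"time": chunk_len})
b = xr.DataArray(
    np.random.rand(120), dims=["time"], coords={"time": long_time}
).chunk({"time": chunk_len})

# Time steps present in b but missing from a
missing_time = b.indexes["time"].difference(a.indexes["time"])

# NaN padding, built lazily with the desired chunk length
pad = xr.DataArray(
    da.full((len(missing_time),), np.nan, chunks=chunk_len),
    dims=["time"],
    coords={"time": missing_time},
)

# Assumption: a covers the start of b's time axis, so appending keeps time sorted.
a_padded = xr.concat([a, pad], dim="time")
print(a_padded.chunks)
```

Calling `a.chunk({'time': 3})` after `xr.align` would also normalize the chunk layout, but it is not verified here whether that avoids materializing the large padded chunk first.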

cc'ing @dcherian @andersy005

Versions

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.15.0
distributed: 2.15.2
matplotlib: 3.2.1
cartopy: 0.18.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.2
IPython: 7.14.0
sphinx: None
