Behaviour change in xarray.Dataset.sortby/sel between dask==2.25.0 and dask==2.26.0 #4428

@JSKenyon

What happened:
A project of mine suddenly broke with:

ValueError: Object has inconsistent chunks along dimension row. This can be fixed by calling unify_chunks().

where previously it had worked.
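For context, here is a minimal sketch of one way this ValueError can arise and how `unify_chunks()` resolves it. It is not the failing project: the variable names `A` and `B` and the sizes are made up for illustration, and it assumes that accessing `Dataset.chunks` is what raises when two variables disagree on chunking along a shared dimension.

```python
import dask.array as da
import xarray as xr

# Two hypothetical variables sharing the "row" dimension but chunked differently.
a = da.zeros(8, chunks=8)  # one chunk of 8 along row
b = da.zeros(8, chunks=4)  # two chunks of 4 along row
ds = xr.Dataset({"A": ("row", a), "B": ("row", b)})

# Asking for the dataset-level chunks fails because A and B disagree.
try:
    ds.chunks
    raised = False
except ValueError as err:
    raised = True
    print(err)

# unify_chunks() rechunks the variables so they agree along each dimension.
unified = ds.unify_chunks()
unified_chunks = dict(unified.chunks)
print(unified_chunks)
```

Calling `unify_chunks()` hides the symptom, but it does not explain why the chunking diverged in the first place.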

What you expected to happen:
There should have been no change.

Minimal Complete Verifiable Example:
This is very difficult to reproduce. I have tried, but it isn't triggered by relatively simple xarray.Datasets. In my code, the Datasets in question are the result of multiple concatenation, selection and chunking operations. Instead, I will demonstrate the change in the hope that someone more knowledgeable has some intuition for what has gone wrong.

dask==2.25.0

I have a dataset, foo, with a number of different variables, most indexed by row. I will focus on one variable, FLAG, to demonstrate the change in behaviour. This is what FLAG looks like prior to a foo.sortby("row") call. Note that there is only a single chunk (this is intentional).

<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
  * row      (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr

After the foo.sortby("row") call:

<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
  * row      (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr

Note that the chunksize is unchanged.

dask==2.26.0

Repeating exactly the same experiment, prior to the call:

<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
  * row      (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr

After the foo.sortby("row") call:

<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(20355, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
  * row      (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr

Note the change in the chunksize: along row it has gone from 40710 to 20355, i.e. the single chunk has been split.
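The comparison above can be sketched as a small check that records the row chunking before and after `sortby`. This is a stand-in, not the real data: the dataset below is synthetic and much smaller than foo, the helper `row_chunks` is hypothetical, and a simple dataset like this may well not show the split at all. The last step restores a single chunk along row with `chunk({"row": -1})`, which is a possible workaround rather than a fix.

```python
import dask.array as da
import numpy as np
import xarray as xr


def row_chunks(ds, name="FLAG", dim="row"):
    """Return the dask chunk sizes of variable `name` along `dim`."""
    var = ds[name]
    return var.chunks[var.get_axis_num(dim)]


# Synthetic stand-in for `foo`: one boolean variable in a single chunk,
# with an unsorted row coordinate.
n_row, n_chan, n_corr = 1000, 16, 4
rng = np.random.default_rng(0)
row = rng.permutation(n_row)
flag = da.zeros((n_row, n_chan, n_corr), dtype=bool,
                chunks=(n_row, n_chan, n_corr))
foo = xr.Dataset({"FLAG": (("row", "chan", "corr"), flag)},
                 coords={"row": row})

before = row_chunks(foo)
sorted_foo = foo.sortby("row")
after = row_chunks(sorted_foo)
print("row chunks before sortby:", before)
print("row chunks after sortby: ", after)

# Possible workaround: force a single chunk along row again.
rechunked = sorted_foo.chunk({"row": -1})
print("row chunks after rechunk:", row_chunks(rechunked))
```

Running this under both dask versions and comparing the printed tuples would show whether a given dataset is affected.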

Anything else we need to know?:
I have seen similar behaviour when using xarray.Dataset.sel.
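The same kind of check can be applied to `sel`. This is a sketch under assumptions: the dataset is synthetic, and selecting every other row label is an arbitrary choice that may not match the selections in my project.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Synthetic single-chunk dataset with a descending row coordinate.
n_row = 100
row = np.arange(n_row)[::-1].copy()
flag = da.zeros((n_row, 4), dtype=bool, chunks=(n_row, 4))
foo = xr.Dataset({"FLAG": (("row", "chan"), flag)}, coords={"row": row})

# Select every other row label, in ascending order.
labels = np.sort(row)[::2]
selected = foo.sel(row=labels)

# Chunk sizes along row after the selection.
sel_row_chunks = selected["FLAG"].chunks[0]
print("row chunks after sel:", sel_row_chunks)
```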

Environment:

dask==2.25.0

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.25.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None

dask==2.26.0

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None
