Unconverted NetCDF4Array in Dask compute after collapse #856

@sadielbartholomew

Description

After (at least) a grouped collapse in which the time axis boundaries lie within the group (see the MRE below), attempting to access the underlying data array raises an error. Ultimately, the Dask compute operation expects a numpy-like array but encounters a cf.NetCDF4Array object: AttributeError: 'NetCDF4Array' object has no attribute 'astype'. Did you mean: 'dtype'?

Note that I have done some pdb debugging on this (just getting the MRE was quite tricky), which I will summarise below in a follow-on comment. It suggests that a concatenate operation may be a prerequisite for the bug to emerge. I will also investigate the Dask task graph, but have not yet done so because of an all-day meeting today and other release-related concerns yesterday.

Traceback and field context

The error report is at the end of the output, but the form of the field and of the affected coordinate is also useful context, hence the dump and print in the MRE below. The MRE produces the following output:

----------------------------------------------------
Field: long_name=Sea surface temperature (ncvar%sst)
----------------------------------------------------
Conventions = 'CF-1.6'
_FillValue = np.int16(-32767)
history = '2023-07-26 18:11:43 GMT by grib_to_netcdf-2.25.1: /opt/ecmwf/mars-
           client/bin/grib_to_netcdf.bin -S param -o /cache/data8/adaptor.mars.
           internal-1690395052.5048163-16972-6-318f8028-8b02-478b-b30e-
           fde96e4f8d57.nc /cache/tmp/318f8028-8b02-478b-b30e-fde96e4f8d57-
           adaptor.mars.internal-1690392897.08971-16972-10-tmp.grib'
long_name = 'Sea surface temperature'
missing_value = np.int16(-32767)
units = 'K'

Data(long_name=time(996), long_name=latitude(721), long_name=longitude(1440)) = [[[271.4597342469336, ..., --]]] K

Domain Axis: long_name=latitude(721)
Domain Axis: long_name=longitude(1440)
Domain Axis: long_name=time(996)

Dimension coordinate: long_name=time
    calendar = 'gregorian'
    long_name = 'time'
    units = 'hours since 1900-01-01 00:00:00.0'
    Data(long_name=time(996)) = [1940-01-01 00:00:00, ..., 2022-12-01 00:00:00] gregorian

Dimension coordinate: long_name=latitude
    long_name = 'latitude'
    units = 'degrees_north'
    Data(long_name=latitude(721)) = [90.0, ..., -90.0] degrees_north

Dimension coordinate: long_name=longitude
    long_name = 'longitude'
    units = 'degrees_east'
    Data(long_name=longitude(1440)) = [0.0, ..., 359.75] degrees_east

Time cordinate is long_name=time(44) hours since 1900-01-01 00:00:00.0 gregorian
Traceback (most recent call last):
  File "/home/slb93/git-repos/cf-python/docs/source/recipes/mre-astype-bug.py", line 13, in <module>
    t.data.array  # or t.array directly
    ^^^^^^^^^^^^
  File "/home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/python3.11/site-packages/cfdm/data/data.py", line 2629, in array
    a = self.compute().copy()
        ^^^^^^^^^^^^^^
  File "/home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/python3.11/site-packages/cfdm/data/data.py", line 3917, in compute
    a = dx.compute()
        ^^^^^^^^^^^^
  File "/home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/python3.11/site-packages/dask/base.py", line 370, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/python3.11/site-packages/dask/base.py", line 656, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/python3.11/site-packages/dask/array/chunk.py", line 279, in astype
    return x.astype(astype_dtype, **kwargs)
           ^^^^^^^^
AttributeError: 'NetCDF4Array' object has no attribute 'astype'. Did you mean: 'dtype'?
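
The final frame of the traceback calls `x.astype(...)` on whatever object the chunk holds, so any chunk that is still a file-backed wrapper rather than an in-memory array fails at that point. The following is a minimal, standard-library-only sketch of that failure mode. It is not the cf, cfdm or Dask implementation: the classes `FileBackedArray` and `InMemoryArray`, the `load` method, and both `astype_chunk` functions are hypothetical names introduced purely for illustration, and where such a conversion would actually belong in cf is not established here.

```python
class InMemoryArray:
    """Hypothetical stand-in for an in-memory, numpy-like array."""

    _casts = {"float64": float, "int64": int}

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    def astype(self, dtype):
        """Return a copy with every element cast to the requested dtype."""
        cast = self._casts[dtype]
        return InMemoryArray([cast(v) for v in self.values], dtype)


class FileBackedArray:
    """Hypothetical stand-in for a lazy, file-backed array wrapper.

    It knows its dtype and can load its values on request, but it does
    not implement the wider array interface (there is no `astype`).
    """

    def __init__(self, values, dtype):
        self._values = list(values)
        self.dtype = dtype

    def load(self):
        """Hypothetical conversion hook returning the in-memory values."""
        return InMemoryArray(self._values, self.dtype)


def astype_chunk(x, dtype):
    """Mimic a chunk-level cast that assumes `x` is already array-like."""
    return x.astype(dtype)


def astype_chunk_guarded(x, dtype):
    """Same cast, but load a file-backed wrapper into memory first."""
    if not hasattr(x, "astype"):
        x = x.load()
    return x.astype(dtype)


lazy = FileBackedArray([1, 2, 3], "int64")

# The unguarded cast fails in the same way as the traceback above
try:
    astype_chunk(lazy, "float64")
except AttributeError as error:
    unguarded_error = str(error)
    print("Unguarded cast failed:", unguarded_error)

# Converting to memory before casting avoids the error
converted = astype_chunk_guarded(lazy, "float64")
print(converted.dtype, converted.values)
```

The guarded variant only illustrates that the cast succeeds once the wrapper has been converted. It says nothing about which step of the collapse leaves the wrapper unconverted in the first place.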

MRE

Using an environment from cf.environment(paths=False) as printed below. In short, this uses the latest main branch of both cfdm and cf-python:

import cf
sst = cf.read("~/recipes/ERA5_monthly_averaged_SST.nc")[0]  # sea surface temp
sst.dump()
am_max = sst.collapse("area: maximum")
am_max = am_max.subspace(T=cf.ge(cf.dt("1980-01-01")))

# Note a cf.mam() grouped collapse works, but cf.djf() errors!
am_max_collapse = am_max.collapse("T: mean", group=cf.djf())
t = am_max_collapse.dimension_coordinate("long_name=time")
print("Time cordinate is", t)
t.data.array  # or t.array directly, the AttributeError is raised here

Note that if the group to collapse on is changed to another season, e.g. cf.mam() or cf.son(), the error does not appear! The sst field has time data starting in January and ending in December, so it is likely that the problem is related to the time boundaries lying within the DJF season.
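
To make the season-boundary reasoning concrete, here is a small standard-library sketch that counts seasonal groups for monthly data running from January 1980 to December 2022, which is the range assumed to remain after the subspace in the MRE. It does not use cf and does not reproduce cf's grouping logic. The helper names (`monthly_range`, `group_by_season`, `summarise`) are hypothetical, and labelling a wrapping season by the year in which it ends is a choice made only for this illustration.

```python
from collections import defaultdict


def monthly_range(start_year, end_year):
    """Yield (year, month) for each month from Jan start_year to Dec end_year."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield year, month


def group_by_season(months, season_months):
    """Group (year, month) pairs into consecutive seasons.

    `season_months` is the ordered tuple of months in the season, e.g.
    (12, 1, 2) for DJF. A season that wraps over the year boundary is
    labelled by the year in which it ends.
    """
    first = season_months[0]
    wraps = season_months[0] > season_months[-1]
    groups = defaultdict(list)
    for year, month in months:
        if month not in season_months:
            continue
        # In a wrapping season, the months before the year boundary
        # belong to the season that ends in the following year
        label = year + 1 if wraps and month >= first else year
        groups[label].append((year, month))
    return dict(groups)


def summarise(season_months, start_year=1980, end_year=2022):
    """Return the number of groups and the labels of the incomplete ones."""
    groups = group_by_season(monthly_range(start_year, end_year), season_months)
    partial = sorted(
        label for label, members in groups.items()
        if len(members) < len(season_months)
    )
    return len(groups), partial


djf_count, djf_partial = summarise((12, 1, 2))
mam_count, mam_partial = summarise((3, 4, 5))
print("DJF groups:", djf_count, "partial:", djf_partial)
print("MAM groups:", mam_count, "partial:", mam_partial)
```

Under these assumptions DJF yields 44 groups, of which the first (January and February 1980) and the last (December 2022 only) are incomplete, whereas MAM yields 43 complete groups. The count of 44 matches the time(44) reported for the collapsed coordinate in the output above, which suggests that the incomplete groups at either end are retained, although this sketch cannot confirm how cf handles them internally.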

Environment (as tested)

Platform: Linux-6.6.65-1-MANJARO-x86_64-with-glibc2.41
HDF5 library: 1.14.2
netcdf library: 4.9.4-development
udunits2 library: /home/slb93/miniconda3/envs/old-sphinx-cf-doc-build-only/lib/libudunits2.so.0
esmpy/ESMF: 8.7.0
Python: 3.11.8
dask: 2025.3.0
netCDF4: 1.7.2
h5netcdf: 1.6.1
h5py: 3.13.0
s3fs: 2025.3.0
psutil: 7.0.0
packaging: 24.2
numpy: 2.2.4
scipy: 1.15.2
matplotlib: 3.8.4
cftime: 1.6.4.post1
cfunits: 3.3.7
cfplot: 3.3.0
cfdm: 1.12.0.0
cf: 3.17.0


Labels: bug (Something isn't working)
