Filters should be a list of dictionaries #65

@abarciauskas-bgse

I believe filters should be an optional list of dictionaries, at least for netCDF4 files, which kerchunk reads with the h5py library. Further, the Zarr spec indicates that filters should be a list of JSON objects.

Without this datatype change, I get pydantic type errors which I first reported in https://github.com/TomNicholas/VirtualiZarr/issues/60.
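
For illustration, here is a minimal standard-library sketch of the shape being proposed: `filters` is either `None` or a list of dictionaries, each carrying an `"id"` key as in the Zarr v2 array metadata. The helper name `validate_filters` and the example codec values are hypothetical and are not part of VirtualiZarr or kerchunk.

```python
import json
from typing import Any, Dict, List, Optional


def validate_filters(
    filters: Optional[List[Dict[str, Any]]],
) -> Optional[List[Dict[str, Any]]]:
    """Hypothetical helper: accept None or a list of codec-config dicts."""
    if filters is None:
        return None
    if not isinstance(filters, list):
        raise TypeError(
            f"filters must be a list of dicts or None, got {type(filters).__name__}"
        )
    for codec in filters:
        if not isinstance(codec, dict):
            raise TypeError(f"each filter must be a dict, got {type(codec).__name__}")
        if "id" not in codec:
            raise ValueError("each filter dict needs an 'id' key naming the codec")
    return filters


# Example values only: a shuffle filter followed by zlib, as JSON objects.
example_filters = [{"id": "shuffle", "elementsize": 4}, {"id": "zlib", "level": 4}]
zarray_fragment = json.dumps({"filters": validate_filters(example_filters)})
print(zarray_fragment)
```

A single dictionary, rather than a list of them, is rejected by this check, which is the kind of mismatch that could surface as a type error in a stricter model.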

Reproducible example

This example uses two files: an artificial dataset that I created with filters, and the air dataset from the Usage docs, which I knew worked. Interestingly, the netCDF4 library appears to read filters from both files, while the h5py library only reads filters from the artificially generated dataset. I have not yet tracked down why.

import h5py
import numpy as np
import xarray as xr
from netCDF4 import Dataset
from virtualizarr import open_virtual_dataset

# Create some artificial data
data = np.random.rand(100, 100)  # 100x100 array of random numbers

# Create a new NetCDF file
nc_filename = 'artificial_with_filter.nc'
nc_file = Dataset(nc_filename, 'w', format='NETCDF4')

# Define the dimensions of the data
nc_file.createDimension('x', data.shape[0])
nc_file.createDimension('y', data.shape[1])

# Create a variable with zlib compression
data_var = nc_file.createVariable('data', np.float32, ('x', 'y'), zlib=True)

# Assign the data to the variable
data_var[:] = data

# Close the file
nc_file.close()

print(f"NetCDF file '{nc_filename}' created successfully with zlib compression.")

# Create an example netCDF4 file from an xarray tutorial dataset
ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')

files = ['air.nc', 'artificial_with_filter.nc']
var_keys = ['air', 'data']
for file in files:
    # Open the same file with both libraries; both handles close on exit
    with h5py.File(file, 'r') as h5file, Dataset(file, 'r') as nc_file:
        for group_name in h5file.keys():
            if group_name in var_keys:
                group = h5file[group_name]

                # _filters is a private h5py attribute, used here for inspection only
                h5filters = group._filters
                print(f"Filters found with hdf5 for '{group_name}': {h5filters}")

                var = nc_file.variables[group_name]
                ncfilters = var.filters()
                print(f"Filters found with netcdf for '{group_name}': {ncfilters}")

    # Open the same file as a virtual dataset
    open_virtual_dataset(file)
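
As a rough sketch of what a "list of dictionaries" could look like for the artificial file, the snippet below turns a dictionary in the shape returned by netCDF4's `var.filters()` into a list of codec configurations. The function name `nc_filters_to_codec_list` is hypothetical, the sample dictionary is hand-written rather than captured output, and the `shuffle` and `zlib` entries follow the numcodecs configuration style. Whether zlib belongs under `filters` or `compressor` is a separate question that this sketch does not settle.

```python
from typing import Any, Dict, List


def nc_filters_to_codec_list(
    nc_filters: Dict[str, Any], itemsize: int
) -> List[Dict[str, Any]]:
    """Hypothetical conversion from a netCDF4-style filters dict to codec dicts."""
    codecs: List[Dict[str, Any]] = []
    if nc_filters.get("shuffle"):
        # Byte shuffle operates on elements of the variable's item size
        codecs.append({"id": "shuffle", "elementsize": itemsize})
    if nc_filters.get("zlib"):
        # Fall back to level 1 if no compression level is recorded (assumption)
        codecs.append({"id": "zlib", "level": nc_filters.get("complevel", 1)})
    return codecs


# Hand-written sample in the shape of var.filters() for a zlib-compressed variable
sample = {"zlib": True, "shuffle": True, "complevel": 4, "fletcher32": False}
codec_list = nc_filters_to_codec_list(sample, itemsize=4)  # float32 has 4 bytes
print(codec_list)
```

An input with no active filters yields an empty list, which a caller could map to `None` if that is the preferred representation.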

Labels: bug (Something isn't working)