Skip to content

Add support for streaming Zarr stores for hosted datasets #4096

@jacobbieker

Description

@jacobbieker

Is your feature request related to a problem? Please describe.
Lots of geospatial data is stored in the Zarr format. This format works well for n-dimensional data and coordinates, and can have good compression. Unfortunately, HF datasets doesn't support streaming in data in Zarr format as far as I can tell. Zarr stores are designed to be easily streamed in from cloud storage, especially with xarray and fsspec. Since geospatial data tends to be very large, and on the order of TBs of data or 10's of TBs of data for a single dataset, it can be difficult to store the dataset locally for users. Just adding Zarr stores with HF git doesn't work well (see #3823) as Zarr splits the data into lots of small chunks for fast loading, and that doesn't work well with git. I've somewhat gotten around that issue by tarring each Zarr store and uploading them as a single file, which seems to be working (see https://huggingface.co/datasets/openclimatefix/gfs-reforecast for example data files, although the script isn't written yet). This does mean that streaming doesn't quite work though. On the other hand, in https://huggingface.co/datasets/openclimatefix/eumetsat_uk_hrv we stream in a Zarr store from a public GCP bucket quite easily.

Describe the solution you'd like
A way to upload Zarr stores for hosted datasets so that we can stream it with xarray and fsspec.

Describe alternatives you've considered
Tarring each Zarr store individually and just extracting them in the dataset script -> Downside this is a lot of data that probably doesn't fit locally for a lot of potential users.
Pre-prepare examples in a format like Parquet -> Would use a lot more storage, and a lot less flexibility, in the eumetsat_uk_hrv, we use the one Zarr store for multiple different configurations.

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions