-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Description
Is your feature request related to a problem? Please describe.
Lots of geospatial data is stored in the Zarr format. This format works well for n-dimensional data and coordinates, and can have good compression. Unfortunately, HF datasets doesn't support streaming in data in Zarr format as far as I can tell. Zarr stores are designed to be easily streamed in from cloud storage, especially with xarray and fsspec. Since geospatial data tends to be very large, and on the order of TBs of data or 10's of TBs of data for a single dataset, it can be difficult to store the dataset locally for users. Just adding Zarr stores with HF git doesn't work well (see #3823) as Zarr splits the data into lots of small chunks for fast loading, and that doesn't work well with git. I've somewhat gotten around that issue by tarring each Zarr store and uploading them as a single file, which seems to be working (see https://huggingface.co/datasets/openclimatefix/gfs-reforecast for example data files, although the script isn't written yet). This does mean that streaming doesn't quite work though. On the other hand, in https://huggingface.co/datasets/openclimatefix/eumetsat_uk_hrv we stream in a Zarr store from a public GCP bucket quite easily.
Describe the solution you'd like
A way to upload Zarr stores for hosted datasets so that we can stream it with xarray and fsspec.
Describe alternatives you've considered
Tarring each Zarr store individually and just extracting them in the dataset script -> Downside this is a lot of data that probably doesn't fit locally for a lot of potential users.
Pre-prepare examples in a format like Parquet -> Would use a lot more storage, and a lot less flexibility, in the eumetsat_uk_hrv, we use the one Zarr store for multiple different configurations.