Skip to content

Don't write empty chunks to icechunk#745

Merged
TomNicholas merged 4 commits intozarr-developers:mainfrom
TomNicholas:dont_write_empty_chunks_to_icechunk
Jul 30, 2025
Merged

Don't write empty chunks to icechunk#745
TomNicholas merged 4 commits intozarr-developers:mainfrom
TomNicholas:dont_write_empty_chunks_to_icechunk

Conversation

@TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jul 29, 2025

This should fix the problem exposed in #740

  • Closes Trying to append to icechunk fails #740
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas TomNicholas added the Icechunk 🧊 Relates to Icechunk library / spec label Jul 29, 2025
@codecov
Copy link

codecov bot commented Jul 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.55%. Comparing base (8230c36) to head (1de1be0).
⚠️ Report is 32 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #745   +/-   ##
=======================================
  Coverage   87.55%   87.55%           
=======================================
  Files          35       35           
  Lines        1848     1848           
=======================================
  Hits         1618     1618           
  Misses        230      230           
Files with missing lines Coverage Δ
virtualizarr/writers/icechunk.py 90.75% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jacobbieker added a commit to jacobbieker/planetary-datasets that referenced this pull request Jul 29, 2025
@TomNicholas TomNicholas merged commit a08deb2 into zarr-developers:main Jul 30, 2025
15 checks passed
@maxrjones
Copy link
Member

FYI I'm starting to question this choice. It would be nice if Zarr/Icechunk had a better way to denote chunks comprised only of fill values than just using the absence of a chunk reference (i.e., we should make this explicit rather than implicit). I've asked if there is a formal definition of the expectations related "missing" chunks via the core Zarr spec or extensions in #Zarr > specification for the meaning of missing chunks.

@TomNicholas
Copy link
Member Author

I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

@maxrjones
Copy link
Member

I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

Does icechunk have a better way to differentiate? I tried reading through the IC spec but could not tell how that's handled. It seems like pure fill value chunks could fit as a specific type of chunk ref though, even before there's a pure Zarr solution.

@d-v-b
Copy link

d-v-b commented Aug 8, 2025

I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

How would this work exactly? It's pretty important that creating a Zarr array only requires writing a single metadata document. This entails that the entire array (based on just 1 metadata document) must be consistently readable, even though no data has been written. This means that there must be some specification for what's inside the array, and that can only be contained in the metadata document. That value must be consistent with the data type for the array. This leads pretty directly to the fill_value metadata field, and the behavior we have today. I'm not sure how else you could achieve this, given the constraints I mentioned

@TomNicholas TomNicholas deleted the dont_write_empty_chunks_to_icechunk branch August 8, 2025 12:42
@rabernat
Copy link
Collaborator

rabernat commented Aug 8, 2025

In Zarr, the point of fill_value is to indicate that data has not been written. This is the same as HDF5.

If this is a concern, it would be wise to choose a fill_value that can't be confused with actual data.

@dcherian
Copy link

dcherian commented Aug 8, 2025

uninitialized chunks with chunks containing only fill values

FWIW in netCDF this is the historical meaning of _FillValue: https://docs.unidata.ucar.edu/nug/2.0-draft/nug_conventions.html#FillValue.

How would this work exactly? It's pretty important that creating a Zarr array only requires writing a single metadata document. This entails that the entire array (based on just 1 metadata document) must be consistently readable, even though no data has been written.

I agree. You could return uninitialized memory so np.empty instead of np.full. Historically "uninitialized" could mean "just allocated" (like NoFill mode in netCDF or np.empty). This behaviour is useful at write-time when you know that you will write every byte of uninitialized memory, so overwriting with a fill value isn't necessary. But I dont see how this translates to Zarr since we assemble chunks in memory, then flush it out.

@dcherian
Copy link

dcherian commented Aug 8, 2025

In Zarr, the point of fill_value is to indicate that data has not been written.

AFAICT some substantial fraction of the Zarr community uses it to indicate "valid but background value", hence the historical choice of 0 as default fill value (!!!!)

@d-v-b
Copy link

d-v-b commented Aug 8, 2025

In Zarr, the point of fill_value is to indicate that data has not been written.

AFAICT some substantial fraction of the Zarr community uses it to indicate "valid but background value", hence the historical choice of 0 as default fill value (!!!!)

yes, this is very common in life sciences, where images are often intensities, and 0 is on the lower end

@d-v-b
Copy link

d-v-b commented Aug 8, 2025

If this is a concern, it would be wise to choose a fill_value that can't be confused with actual data.

agreed, and we could make this even easier on the data type side by offering nullable numeric data types (e.g., int16 | Null, or {0, ....65,535, Null}). The user would still be responsible for not writing nulls of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Icechunk 🧊 Relates to Icechunk library / spec

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trying to append to icechunk fails

5 participants