Don't write empty chunks to icechunk by TomNicholas · Pull Request #745 · zarr-developers/VirtualiZarr

TomNicholas · 2025-07-29T10:07:05Z

This should fix the problem exposed in #740

- Closes #740 (Trying to append to icechunk fails)
- Tests added
- Tests passing
- Full type hint coverage
- Changes are documented in docs/releases.rst
- ~~New functions/methods are listed in api.rst~~
- ~~New functionality has documentation~~
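
The diff itself is not reproduced in this thread, so the following is only a rough sketch of what "skip writing empty chunks" could look like. It assumes a manifest is a plain mapping from chunk key to a reference, and that an empty path marks a chunk that was never written. `ChunkRef`, `refs_to_write`, and the example paths are hypothetical names for illustration, not VirtualiZarr or Icechunk API.

```python
from typing import NamedTuple


class ChunkRef(NamedTuple):
    """Hypothetical stand-in for one virtual chunk reference."""

    path: str
    offset: int
    length: int


def refs_to_write(manifest: dict[str, ChunkRef]) -> dict[str, ChunkRef]:
    """Keep only references that point at real bytes.

    Assumption: an empty path marks a chunk that was never written.
    """
    return {key: ref for key, ref in manifest.items() if ref.path != ""}


manifest = {
    "0.0": ChunkRef("s3://bucket/file.nc", 100, 50),
    "0.1": ChunkRef("", 0, 0),  # placeholder for a missing chunk
    "1.0": ChunkRef("s3://bucket/file.nc", 150, 50),
}

# Only the two chunks backed by real bytes would be handed to the store.
kept = refs_to_write(manifest)
```

Under this assumption the missing chunk is simply absent from the store, so a reader falls back to the array's fill value for it.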

codecov · 2025-07-29T10:31:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.55%. Comparing base (8230c36) to head (1de1be0).
⚠️ Report is 32 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #745   +/-   ##
=======================================
  Coverage   87.55%   87.55%           
=======================================
  Files          35       35           
  Lines        1848     1848           
=======================================
  Hits         1618     1618           
  Misses        230      230

Files with missing lines	Coverage Δ
virtualizarr/writers/icechunk.py	`90.75% <ø> (ø)`

maxrjones · 2025-08-07T22:32:48Z

FYI I'm starting to question this choice. It would be nice if Zarr/Icechunk had a better way to denote chunks composed only of fill values than just using the absence of a chunk reference (i.e., we should make this explicit rather than implicit). I've asked if there is a formal definition of the expectations related to "missing" chunks via the core Zarr spec or extensions in #Zarr > specification for the meaning of missing chunks.

TomNicholas · 2025-08-08T04:59:08Z

I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

maxrjones · 2025-08-08T12:24:30Z

> I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

Does icechunk have a better way to differentiate? I tried reading through the IC spec but could not tell how that's handled. It seems like pure fill value chunks could fit as a specific type of chunk ref though, even before there's a pure Zarr solution.

d-v-b · 2025-08-08T12:35:51Z

> I agree that Zarr conflating uninitialized chunks with chunks containing only fill values is really annoying. But there is nothing we can do here unless the Zarr format gains a way to differentiate?

How would this work exactly? It's pretty important that creating a Zarr array only requires writing a single metadata document. This entails that the entire array (based on just 1 metadata document) must be consistently readable, even though no data has been written. This means that there must be some specification for what's inside the array, and that can only be contained in the metadata document. That value must be consistent with the data type for the array. This leads pretty directly to the fill_value metadata field, and the behavior we have today. I'm not sure how else you could achieve this, given the constraints I mentioned.
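
To make the constraint above concrete, here is a toy model, not the zarr-python API. `ToyArray` is a hypothetical class that holds one metadata dict and a sparse chunk store, and it returns the fill value for any chunk that has no entry in the store.

```python
import math


class ToyArray:
    """Toy model: one metadata document plus a sparse chunk store."""

    def __init__(self, shape, chunks, fill_value):
        # Creating the array writes metadata only; no chunk exists yet.
        self.metadata = {"shape": shape, "chunks": chunks, "fill_value": fill_value}
        self.store = {}  # chunk index tuple -> flat list of values

    def read_chunk(self, index):
        stored = self.store.get(index)
        if stored is None:
            # Missing chunk: synthesize it from the metadata alone.
            n = math.prod(self.metadata["chunks"])
            return [self.metadata["fill_value"]] * n
        return stored


arr = ToyArray(shape=(4,), chunks=(2,), fill_value=-1)
before = [arr.read_chunk((i,)) for i in range(2)]  # nothing written yet

arr.store[(1,)] = [7, 8]  # write a single chunk
after = [arr.read_chunk((i,)) for i in range(2)]
```

In this model the whole array is readable immediately after creation, and a reader cannot tell from the returned values alone whether a chunk was absent or was written with nothing but fill values.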

rabernat · 2025-08-08T13:48:51Z

In Zarr, the point of fill_value is to indicate that data has not been written. This is the same as HDF5.

If this is a concern, it would be wise to choose a fill_value that can't be confused with actual data.
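
A small sketch of why the choice matters, using plain Python lists rather than any Zarr API. `is_all_fill` and the sentinel `-9999` are illustrative assumptions, not values taken from this thread.

```python
def is_all_fill(chunk, fill_value):
    """Return True if every value in the chunk equals the fill value."""
    return all(v == fill_value for v in chunk)


# Genuine measurements that happen to be zero everywhere.
measured_zeros = [0, 0, 0, 0]

# With fill_value=0 the real data looks exactly like an unwritten chunk.
ambiguous = is_all_fill(measured_zeros, fill_value=0)

# With a sentinel outside the valid data range the two cases stay distinct.
distinct = is_all_fill(measured_zeros, fill_value=-9999)
```

The sentinel only helps if real data can never take that value, which is a property of the dataset rather than of the format.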

dcherian · 2025-08-08T13:50:06Z

> uninitialized chunks with chunks containing only fill values

FWIW in netCDF this is the historical meaning of _FillValue: https://docs.unidata.ucar.edu/nug/2.0-draft/nug_conventions.html#FillValue.

> How would this work exactly? It's pretty important that creating a Zarr array only requires writing a single metadata document. This entails that the entire array (based on just 1 metadata document) must be consistently readable, even though no data has been written.

I agree. You could return uninitialized memory, i.e. np.empty instead of np.full. Historically "uninitialized" could mean "just allocated" (like NoFill mode in netCDF or np.empty). This behaviour is useful at write time when you know that you will write every byte of the uninitialized memory, so overwriting it with a fill value first isn't necessary. But I don't see how this translates to Zarr, since we assemble chunks in memory and then flush them out.
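
The np.full versus np.empty distinction can be sketched with NumPy as follows. The chunk shape, dtype, and fill value are arbitrary choices for illustration.

```python
import numpy as np

chunk_shape = (2, 3)
fill_value = -1

# np.full allocates the buffer and writes fill_value into every element.
filled = np.full(chunk_shape, fill_value, dtype="int16")

# np.empty only allocates; the contents are arbitrary until overwritten,
# so it is safe only when the writer goes on to set every element.
scratch = np.empty(chunk_shape, dtype="int16")
scratch[...] = np.arange(6, dtype="int16").reshape(chunk_shape)
```

Reading `scratch` before the assignment would return whatever happened to be in memory, which is why this mode suits writers and not readers.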

dcherian · 2025-08-08T13:50:56Z

> In Zarr, the point of fill_value is to indicate that data has not been written.

AFAICT some substantial fraction of the Zarr community uses it to indicate "valid but background value", hence the historical choice of 0 as default fill value (!!!!)

d-v-b · 2025-08-08T13:51:42Z

> > In Zarr, the point of fill_value is to indicate that data has not been written.
>
> AFAICT some substantial fraction of the Zarr community uses it to indicate "valid but background value", hence the historical choice of 0 as default fill value (!!!!)

yes, this is very common in life sciences, where images are often intensities, and 0 is on the lower end

d-v-b · 2025-08-08T13:56:04Z

> If this is a concern, it would be wise to choose a fill_value that can't be confused with actual data.

agreed, and we could make this even easier on the data type side by offering nullable numeric data types (e.g., uint16 | Null, or {0, ..., 65535, Null}). The user would still be responsible for not writing nulls of course.

TomNicholas added 2 commits July 29, 2025 11:04

add regression test

24ee9c6

skip writing empty chunks

c095249

TomNicholas added the "Icechunk 🧊" label (Relates to Icechunk library / spec) Jul 29, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e164da

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release July 29, 2025 10:07 Inactive

release note

1de1be0

TomNicholas temporarily deployed to test-release July 29, 2025 10:30 — with GitHub Actions Inactive

TomNicholas mentioned this pull request Jul 29, 2025

Trying to append to icechunk fails #740

Closed

TomNicholas requested a review from maxrjones July 29, 2025 10:31

jacobbieker added a commit to jacobbieker/planetary-datasets that referenced this pull request Jul 29, 2025

Change to do appending with new Virtualizarr version

7e342b7

See zarr-developers/VirtualiZarr#745

maxrjones approved these changes Jul 30, 2025

TomNicholas merged commit a08deb2 into zarr-developers:main Jul 30, 2025
15 checks passed

TomNicholas deleted the dont_write_empty_chunks_to_icechunk branch August 8, 2025 12:42

This was referenced Aug 8, 2025

Add explanation about fill value meaning across software library and standards developmentseed/datacube-guide#1

Open

Decoding of virtual chunks from Icechunk fails when creating an array with an ArrayBytesCodec serializer. #763

Closed

TomNicholas merged 4 commits into zarr-developers:main from TomNicholas:dont_write_empty_chunks_to_icechunk
