Move ObjectStoreRegistry and Reader functionality to obspec_utils by maxrjones · Pull Request #844 · zarr-developers/VirtualiZarr

maxrjones · 2025-12-19T20:32:51Z

I think that ObstoreReader and ObjectStoreRegistry are useful outside of virtualizarr (e.g., for earthaccess) and should therefore be a separate library. I started obspec-utils accordingly, in line with some past discussions.

codecov · 2025-12-19T20:34:33Z

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (2bbd1f9) to head (482958c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
virtualizarr/parsers/dmrpp.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #844      +/-   ##
==========================================
- Coverage   89.33%   88.99%   -0.34%     
==========================================
  Files          34       34              
  Lines        2015     1945      -70     
==========================================
- Hits         1800     1731      -69     
+ Misses        215      214       -1

Files with missing lines	Coverage Δ
virtualizarr/accessor.py	`95.60% <ø> (ø)`
virtualizarr/manifests/store.py	`89.20% <100.00%> (ø)`
virtualizarr/parsers/fits.py	`100.00% <100.00%> (ø)`
virtualizarr/parsers/hdf/hdf.py	`95.52% <100.00%> (+0.10%)`	⬆️
virtualizarr/parsers/kerchunk/json.py	`100.00% <100.00%> (ø)`
virtualizarr/parsers/kerchunk/parquet.py	`90.69% <100.00%> (ø)`
virtualizarr/parsers/netcdf3.py	`100.00% <100.00%> (ø)`
virtualizarr/parsers/typing.py	`100.00% <100.00%> (ø)`
virtualizarr/parsers/zarr.py	`98.75% <100.00%> (ø)`
virtualizarr/registry.py	`100.00% <100.00%> (ø)`
... and 3 more

TomNicholas · 2026-01-09T21:06:04Z

This is technically a breaking change to the API, because the type of the ObjectStoreRegistry has changed. (Note that it literally changes the Parser protocol definition too.) This justifies a minor version bump at least...

TomNicholas

Seems good to me

maxrjones · 2026-01-09T22:43:04Z

> This is technically a breaking change to the API, because the type of the ObjectStoreRegistry has changed. (Note that it literally changes the Parser protocol definition too.) This justifies a minor version bump at least...

I'll think on this a bit more. I forgot that a reason for obspec-utils was not only to have the code available as a lighter dependency but also to change the typing from obstore-based to obspec-based, which would be good to test before obspec-utils becomes a dependency of virtualizarr.

TomNicholas

Shall we just merge this? It all looks great to me.

TomNicholas · 2026-01-24T04:54:31Z

> This is technically a breaking change to the API, because the type of the ObjectStoreRegistry has changed.

I think that while this is technically true, a downstream package would have to have been doing something fairly weird with imports to be affected by this change.

maxrjones · 2026-01-24T05:20:19Z

> Shall we just merge this? It all looks great to me.

Definitely not yet 😞 I'm not sure whether the latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting in obspec-utils' EagerReadableStore, or whether something else is wrong, but I'm now seeing a performance hit rather than a gain when virtualizing "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc":

jovyan@jupyter-maxrjones:~/test-virtualizarr-scripts$ uv run --script test-main.py 
    Updated https://github.com/zarr-developers/VirtualiZarr (2bbd1f9be1d7562e79280e4ee780af6895bf4385)
      Built virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr@2bbd1f9be1d7562e79280e4ee780af6895bf4385
Installed 22 packages in 1.82s
Warming up...

Running 5 iterations...
  Run 1: 1.549s
  Run 2: 1.552s
  Run 3: 1.523s
  Run 4: 1.467s
  Run 5: 1.432s

==================================================
VirtualiZarr main branch benchmark results:
==================================================
  Average: 1.504s
  Min:     1.432s
  Max:     1.552s
==================================================
jovyan@jupyter-maxrjones:~/test-virtualizarr-scripts$ uv run --script test-branch.py 
    Updated https://github.com/maxrjones/VirtualiZarr.git (a53c7911219ff3d553e1c54c0907781120eea08d)
      Built virtualizarr @ git+https://github.com/maxrjones/VirtualiZarr.git@a53c7911219ff3d553e1c54c0907781120eea08d
Installed 24 packages in 1.61s
Warming up...

Running 5 iterations...
  Run 1: 2.961s
  Run 2: 2.953s
  Run 3: 2.967s
  Run 4: 2.966s
  Run 5: 2.974s

==================================================
VirtualiZarr obspec-utils branch benchmark results:
==================================================
  Average: 2.964s
  Min:     2.953s
  Max:     2.974s
==================================================
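
The scripts test-main.py and test-branch.py are not included in this thread. As a rough sketch only, a harness that prints output in the same shape (one warm-up call, a fixed number of timed runs, then average/min/max) might look like the following. The function name `benchmark`, its parameters, and the use of `time.perf_counter` are assumptions for illustration, not the actual scripts.

```python
import time
from statistics import mean
from typing import Callable, List


def benchmark(func: Callable[[], object], label: str, iterations: int = 5) -> List[float]:
    """Hypothetical harness: time `func` a few times and print a summary."""
    print("Warming up...")
    func()  # one untimed call so one-off setup costs do not skew the timed runs

    print(f"\nRunning {iterations} iterations...")
    timings = []
    for i in range(1, iterations + 1):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        print(f"  Run {i}: {elapsed:.3f}s")

    rule = "=" * 50
    print(f"\n{rule}\n{label} benchmark results:\n{rule}")
    print(f"  Average: {mean(timings):.3f}s")
    print(f"  Min:     {min(timings):.3f}s")
    print(f"  Max:     {max(timings):.3f}s")
    print(rule)
    return timings


if __name__ == "__main__":
    # Stand-in workload; in the thread the timed call was virtualizing a NetCDF file on S3.
    benchmark(lambda: sum(range(100_000)), "Stand-in workload")
```

Running each variant in its own process, as the two `uv run --script` invocations above do, keeps the two installed versions from interfering with each other.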

maxrjones · 2026-01-24T05:26:19Z

> > Shall we just merge this? It all looks great to me.
>
> Definitely not yet 😞 I'm not sure whether the latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting in obspec-utils' EagerReadableStore, or whether something else is wrong, but I'm now seeing a performance hit rather than a gain when virtualizing "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc":

We may just need to make the specific reader configurable, but I'll give it some more thought after some rest.

TomNicholas · 2026-01-24T06:09:31Z

> latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting

Adaptive strategy? I really think that copying Icechunk is probably all we should do here. We already know that approach works really well at fetching the whole blob. It's deterministic, predictable, configurable, and already a massive improvement on what's in VZ main 🙂

TomNicholas · 2026-01-24T16:41:27Z

> latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting

How can it be? Isn't developmentseed/obspec-utils#26 basically just setting defaults? Wasn't the previous iteration effectively just the same thing but with a default request_size of 16MB instead of 12MB?

maxrjones · 2026-01-24T17:08:42Z

> Adaptive strategy?

It's a bit adaptive because the request size can increase past the default of 12 MB if the default would push the number of requests beyond the max concurrent requests limit (18). So if the file size is > 216 MB (18 × 12 MB), the request size will be > 12 MB.
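
The arithmetic in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `adaptive_request_size` and `split_into_ranges` are hypothetical, the rule "grow the request size just enough to stay within the concurrency limit" is inferred from the description, and none of this is taken from the obspec-utils source.

```python
import math
from typing import List, Tuple

MB = 1024 * 1024  # assumption: binary megabytes; the thread does not say which unit is meant


def adaptive_request_size(
    file_size: int,
    default_request_size: int = 12 * MB,
    max_concurrent_requests: int = 18,
) -> int:
    """Hypothetical helper: keep the default size unless that would need too many requests."""
    if file_size <= default_request_size * max_concurrent_requests:
        return default_request_size
    # Grow the request size so the file is covered by at most `max_concurrent_requests` ranges.
    return math.ceil(file_size / max_concurrent_requests)


def split_into_ranges(file_size: int, request_size: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into half-open byte ranges of at most `request_size` bytes."""
    return [
        (start, min(start + request_size, file_size))
        for start in range(0, file_size, request_size)
    ]


if __name__ == "__main__":
    for size_mb in (100, 216, 432):
        size = adaptive_request_size(size_mb * MB)
        count = len(split_into_ranges(size_mb * MB, size))
        print(f"{size_mb} MB file: {size / MB:.1f} MB per request, {count} requests")
```

With a fixed request size, as in the earlier 16 MB splitting, only `split_into_ranges` would be needed and the number of requests would grow with the file size.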

> > latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting
>
> How can it be? Isn't virtual-zarr/obspec-utils#26 basically just setting defaults? Wasn't the previous iteration effectively just the same thing but with a default request_size of 16MB instead of 12MB?

I'm currently thinking that the eager strategy is best for very-not-cloud-optimized files like GOES, but worse for only slightly not-cloud-optimized files where there are few variables such that there's not so much dispersed metadata through the file.

My preferred path forward would be to split this PR up into two PRs to isolate the changes that I'm less certain about (changing the default reader in the HDFParser) from the changes that originally motivated this PR (moving ObjectStoreRegistry into obspec_utils). This also isolates the pseudo-breaking changes (ObjectStoreRegistry as external) from internal implementation defaults (which reader we use in the HDFParser).

TomNicholas · 2026-01-24T17:20:39Z

> It's a bit adaptive because the request size can increase past the default of 12 MB if the default would push the number of requests beyond the max concurrent requests limit (18). So if the file size is > 216 MB (18 × 12 MB), the request size will be > 12 MB.

I see. I'm of the opinion that even the less-efficient of these is still much much better than what's in main, and so it's fine to release this now and improve it later.

> I'm currently thinking that the eager strategy is best for very-not-cloud-optimized files like GOES, but worse for only slightly not-cloud-optimized files where there are few variables such that there's not so much dispersed metadata through the file.

Makes sense, though I think we would have to check carefully to be sure.

> My preferred path forward would be to split this PR up into two PRs to isolate the changes

Sounds like a good idea.

In that case how about we:

1. Split this PR up into two (one that makes VZ depend on obspec_utils, one that changes the default reader).
2. Expose the reader to make it configurable. The easiest way is presumably just to make the HDFParser accept an optional reader kwarg.
3. Have it default to the adaptive reader (or the non-adaptive one if you prefer).
4. Release all this.

That way users benefit from a big improvement immediately and we don't have to work off unmerged PRs, but they/we are still free to optimize further later, opt back in to the old behaviour if necessary, or optimize reader strategies for specific uses/scenarios/parsers.
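
As a rough sketch of what "accept an optional reader kwarg" could mean, the toy parser below takes a factory that builds a file-like reader and falls back to a default when none is given. Everything here is hypothetical and self-contained: `SketchParser`, `eager_reader`, `buffered_reader`, and the dict-based `Store` are stand-ins for illustration and do not reflect the actual HDFParser or obspec-utils signatures.

```python
import io
from typing import BinaryIO, Callable, Mapping, Optional

# Hypothetical stand-in for an object store: maps a path to the object's bytes.
Store = Mapping[str, bytes]
# Hypothetical factory signature: build a file-like reader for one object.
ReaderFactory = Callable[[Store, str], BinaryIO]


def eager_reader(store: Store, path: str) -> BinaryIO:
    """Hand back the whole object as an in-memory file (fast to read, memory-hungry)."""
    return io.BytesIO(store[path])


def buffered_reader(store: Store, path: str, buffer_size: int = 4096) -> BinaryIO:
    """Read through a small buffer. BytesIO is only a stand-in for a raw stream here;
    a real store-backed raw stream would issue range requests instead."""
    return io.BufferedReader(io.BytesIO(store[path]), buffer_size=buffer_size)


class SketchParser:
    """Toy parser showing only the optional-reader pattern, not real HDF parsing."""

    def __init__(self, reader_factory: Optional[ReaderFactory] = None) -> None:
        # Fall back to a default when the caller does not choose a reader.
        self.reader_factory = reader_factory or eager_reader

    def __call__(self, path: str, store: Store) -> bytes:
        with self.reader_factory(store, path) as f:
            # Stand-in for parsing: return the first 8 bytes (the file signature).
            return f.read(8)


if __name__ == "__main__":
    store = {"example.h5": b"\x89HDF\r\n\x1a\n" + b"\x00" * 64}
    print(SketchParser()("example.h5", store))
    print(SketchParser(reader_factory=buffered_reader)("example.h5", store))
```

Passing the reader in, rather than hard-coding it, lets the default change between releases while callers who need the old behaviour can still select it explicitly.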

TomNicholas · 2026-01-24T17:26:55Z

Another reason to expose the choice of reader is that the EagerReader essentially trades memory use for speed. A user might want to keep memory use low at the cost of slower reads; main is effectively one extreme of that spectrum, and EagerReader is the other.

maxrjones · 2026-01-24T17:40:13Z

I agree with all of your points. I won't immediately be able to update this PR, so please feel welcome to take it over if you want.

maxrjones · 2026-01-24T18:43:18Z

FYI I just remembered that e657ee7 (this PR) might be pretty much exactly what we want from this PR

maxrjones · 2026-01-24T18:58:15Z

Still 100% on board with making the reader configurable, but wanted to point out that #855 uses the BufferedReadableStore (similar to the old default) but is still faster than the EagerReadableStore by using caching at a level that is available for loadable variables as well. This is in line with what we discussed during yesterday's community meeting.

Depend on obspec_utils for ObstoreReader, ObjectStoreRegistry

bc0e69c

maxrjones temporarily deployed to test-release December 19, 2025 20:33 — with GitHub Actions Inactive

Update docs

82b82d8

maxrjones temporarily deployed to test-release December 20, 2025 23:03 — with GitHub Actions Inactive

Merge branch 'main' into obspec-utils

c2fcb6c

maxrjones temporarily deployed to test-release January 9, 2026 19:08 — with GitHub Actions Inactive

TomNicholas approved these changes Jan 9, 2026

maxrjones mentioned this pull request Jan 12, 2026

ODD PI 26.1 Objective 3: 🤖 Support virtualization of additional data products NASA-IMPACT/veda-odd#246

Closed

3 tasks

maxrjones added 3 commits January 21, 2026 21:27

Sync with upstream changes

bc7369f

Use ParallelStoreReader in HDFParser

502a1ea

Merge branch 'main' into obspec-utils

c3d60ee

maxrjones had a problem deploying to test-release January 22, 2026 02:39 — with GitHub Actions Failure

Fix more references

d660a57

maxrjones had a problem deploying to test-release January 22, 2026 02:42 — with GitHub Actions Failure

More fixes

edc5f20

maxrjones had a problem deploying to test-release January 22, 2026 02:48 — with GitHub Actions Failure

TomNicholas added the test-upstream Run the upstream tests on this PR label Jan 22, 2026

maxrjones added 2 commits January 23, 2026 13:38

Use ReadableFile protocol in HDFParser

d845984

Use EagerStoreReader with 16 MB requests

955c516

maxrjones had a problem deploying to test-release January 23, 2026 22:20 — with GitHub Actions Failure

Fix typing

e657ee7

maxrjones had a problem deploying to test-release January 24, 2026 01:22 — with GitHub Actions Failure

maxrjones changed the title ~~RFC: Depend on obspec_utils for ObstoreReader, ObjectStoreRegistry~~ Move ObjectStoreRegistry and Reader functionality to obspec_utils Jan 24, 2026

maxrjones added 3 commits January 23, 2026 23:08

Use new EagerStoreReader defaults

e703381

Bump minimum dep

bfcbb06

Release notes

ad5a6d1

maxrjones temporarily deployed to test-release January 24, 2026 04:20 — with GitHub Actions Inactive

Bump obstore minimum dep

a53c791

maxrjones temporarily deployed to test-release January 24, 2026 04:35 — with GitHub Actions Inactive

TomNicholas approved these changes Jan 24, 2026

maxrjones marked this pull request as draft January 24, 2026 05:20

maxrjones mentioned this pull request Jan 24, 2026

Add example of virtualizing GOES using caching and request splitting #855

Merged

7 tasks

maxrjones added 2 commits January 24, 2026 14:22

Merge branch 'obspec-utils-readablefile-protocol' into obspec-utils

c2ffdaf

Update docs

b76dab6

maxrjones temporarily deployed to test-release January 24, 2026 19:49 — with GitHub Actions Inactive

maxrjones marked this pull request as ready for review January 24, 2026 19:51

TomNicholas reviewed Jan 24, 2026

docs/api/parsers/hdf5.md (outdated, resolved)

Remove ReaderFactory from public API docs

482958c

TomNicholas temporarily deployed to test-release January 24, 2026 22:35 — with GitHub Actions Inactive

TomNicholas merged commit 5c51877 into zarr-developers:main Jan 24, 2026
15 checks passed
