Skip to content

Move ObjectStoreRegistry and Reader functionality to obspec_utils#844

Merged
TomNicholas merged 18 commits intozarr-developers:mainfrom
maxrjones:obspec-utils
Jan 24, 2026
Merged

Move ObjectStoreRegistry and Reader functionality to obspec_utils#844
TomNicholas merged 18 commits intozarr-developers:mainfrom
maxrjones:obspec-utils

Conversation

@maxrjones
Copy link
Member

I think that ObstoreReader and ObjectStoreRegistry are useful outside of virtualizarr (e.g., for earthaccess) and should therefore be a separate library. I started obspec-utils accordingly, in line with some past discussions.

@codecov
Copy link

codecov bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (2bbd1f9) to head (482958c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
virtualizarr/parsers/dmrpp.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #844      +/-   ##
==========================================
- Coverage   89.33%   88.99%   -0.34%     
==========================================
  Files          34       34              
  Lines        2015     1945      -70     
==========================================
- Hits         1800     1731      -69     
+ Misses        215      214       -1     
Files with missing lines Coverage Δ
virtualizarr/accessor.py 95.60% <ø> (ø)
virtualizarr/manifests/store.py 89.20% <100.00%> (ø)
virtualizarr/parsers/fits.py 100.00% <100.00%> (ø)
virtualizarr/parsers/hdf/hdf.py 95.52% <100.00%> (+0.10%) ⬆️
virtualizarr/parsers/kerchunk/json.py 100.00% <100.00%> (ø)
virtualizarr/parsers/kerchunk/parquet.py 90.69% <100.00%> (ø)
virtualizarr/parsers/netcdf3.py 100.00% <100.00%> (ø)
virtualizarr/parsers/typing.py 100.00% <100.00%> (ø)
virtualizarr/parsers/zarr.py 98.75% <100.00%> (ø)
virtualizarr/registry.py 100.00% <100.00%> (ø)
... and 3 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@TomNicholas
Copy link
Member

This is technically a breaking change to the API, because the type of he ObjectStoreRegistry has changed. (Note that it literally changes the Parser protocol definition too). This justifies a minor version bump at least...

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me

@maxrjones
Copy link
Member Author

This is technically a breaking change to the API, because the type of he ObjectStoreRegistry has changed. (Note that it literally changes the Parser protocol definition too). This justifies a minor version bump at least...

I'll think on this a bit more. I forgot that a reason for obspec-utils was not only to have the code available as a lighter dependency but also to change the typing from obstore-based to obspec-based, which would be good to test before obspec-utils is a dependency of virtualizarr

@maxrjones maxrjones changed the title RFC: Depend on obspec_utils for ObstoreReader, ObjectStoreRegistry Move ObjectStoreRegistry and Reader functionality to obspec_utils Jan 24, 2026
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just merge this? It all looks great to me.

@TomNicholas
Copy link
Member

This is technically a breaking change to the API, because the type of he ObjectStoreRegistry has changed.

I think that while this is technically true, a downstream package would have to have been doing something fairly weird with imports to be affected by this change.

@maxrjones
Copy link
Member Author

Shall we just merge this? It all looks great to me.

Definitely not yet 😞 I'm not sure if the latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting in obspec-util's EagerReadableStore or if something else is wrong, but I'm now seeing a performance hit rather than gain for virtualizing "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc":

jovyan@jupyter-maxrjones:~/test-virtualizarr-scripts$ uv run --script test-main.py 
    Updated https://github.com/zarr-developers/VirtualiZarr (2bbd1f9be1d7562e79280e4ee780af6895bf4385)
      Built virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr@2bbd1f9be1d7562e79280e4ee780af6895bf4385
Installed 22 packages in 1.82s
Warming up...

Running 5 iterations...
  Run 1: 1.549s
  Run 2: 1.552s
  Run 3: 1.523s
  Run 4: 1.467s
  Run 5: 1.432s

==================================================
VirtualiZarr main branch benchmark results:
==================================================
  Average: 1.504s
  Min:     1.432s
  Max:     1.552s
==================================================
jovyan@jupyter-maxrjones:~/test-virtualizarr-scripts$ uv run --script test-branch.py 
    Updated https://github.com/maxrjones/VirtualiZarr.git (a53c7911219ff3d553e1c54c0907781120eea08d)
      Built virtualizarr @ git+https://github.com/maxrjones/VirtualiZarr.git@a53c7911219ff3d553e1c54c0907781120eea08d
Installed 24 packages in 1.61s
Warming up...

Running 5 iterations...
  Run 1: 2.961s
  Run 2: 2.953s
  Run 3: 2.967s
  Run 4: 2.966s
  Run 5: 2.974s

==================================================
VirtualiZarr obspec-utils branch benchmark results:
==================================================
  Average: 2.964s
  Min:     2.953s
  Max:     2.974s
==================================================

@maxrjones maxrjones marked this pull request as draft January 24, 2026 05:20
@maxrjones
Copy link
Member Author

Shall we just merge this? It all looks great to me.

Definitely not yet 😞 I'm not sure if the latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting in obspec-util's EagerReadableStore or if something else is wrong, but I'm now seeing a performance hit rather than gain for virtualizing "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc":

we may just need to make the specific reader configurable, but I'll give it some more thought after rest

@TomNicholas
Copy link
Member

TomNicholas commented Jan 24, 2026

latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting

Adaptive strategy? I really think that copying Icechunk is probably all we should do here. We already know that that works really well at fetching the whole blob, it's deterministic, predictable, configurable, and already a massive improvement on what's in VZ main 🙂

@TomNicholas
Copy link
Member

TomNicholas commented Jan 24, 2026

latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting

How can it be? Isn't developmentseed/obspec-utils#26 basically just setting defaults? Wasn't the previous iteration effectively just the same thing but with a default request_size of 16MB instead of 12MB?

@maxrjones
Copy link
Member Author

Adaptive strategy?

It's a bit adaptive because the chunk size can increase past the default of 12 MB if the default chunk size would increase the number of requests beyond the max concurrent requests limits (18). So if the file size is >216MB the request size will be > 12 MB.

latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting

How can it be? Isn't virtual-zarr/obspec-utils#26 basically just setting defaults? Wasn't the previous iteration effectively just the same thing but with a default request_size of 16MB instead of 12MB?

I'm currently thinking that the eager strategy is best for very-not-cloud-optimized files like GOES, but worse for only slightly not-cloud-optimized files where there are few variables such that there's not so much dispersed metadata through the file.

My preferred path forward would be to split this PR up into two PRs to isolate the changes that I'm less certain about ( changing the default reader in the HDF Parser) from the changes that originally motivated this PR (moving ObjectStoreRegistry into obspec_utils). This also isolates the pseudo-breaking changes (ObjectStoreRegistry as external) from internal implementation defaults (which reader we use in the HDF parser).

@TomNicholas
Copy link
Member

TomNicholas commented Jan 24, 2026

It's a bit adaptive because the chunk size can increase past the default of 12 MB if the default chunk size would increase the number of requests beyond the max concurrent requests limits (18). So if the file size is >216MB the request size will be > 12 MB.

I see. I'm of the opinion that even the less-efficient of these is still much much better than what's in main, and so it's fine to release this now and improve it later.

I'm currently thinking that the eager strategy is best for very-not-cloud-optimized files like GOES, but worse for only slightly not-cloud-optimized files where there are few variables such that there's not so much dispersed metadata through the file.

Makes sense, though I think we would have to check carefully to be sure.

My preferred path forward would be to split this PR up into two PRs to isolate the changes

Sounds like a good idea.

In that case how about we:

  • Split this PR up into two (one that makes VZ depend on obspec_utils, one that changes the default reader)
  • Expose the reader to make it configurable - easiest way is presumably just to make the HDFParser accept an optional reader kwarg.
  • Have it default to the adaptive reader (or the non-adaptive one if you prefer).
  • Release all this.

That way users benefit from a big improvement immediately, we don't have to work off unmerged PRs, but they/we are still free to optimize further later, opt back in to old behaviour if necessary, or to optimize reader strategies for specific uses/scenarios/parsers later.

@TomNicholas
Copy link
Member

TomNicholas commented Jan 24, 2026

Another reason to expose the choice of parser is that the speedup of the EagerParser essentially trades off speed against memory use. A user might want to keep memory use low at the cost of slower speed, and main is effectively one extreme of that spectrum, and EagerReader is the other extreme.

@maxrjones
Copy link
Member Author

I agree with all of your points. I won't immediately be able to update this PR, so please feel welcome to take it over if you want.

@maxrjones
Copy link
Member Author

FYI I just remembered that e657ee7 (this PR) might be pretty much exactly what we want from this PR

@maxrjones
Copy link
Member Author

maxrjones commented Jan 24, 2026

Still 100% on board with making the reader configurable, but wanted to point out that #855 uses the BufferedReadableStore (similar to the old default) but is still faster than the EagerReadableStore by using caching at a level that is available for loadable variables as well. This is in-line with what we discussed during yesterday's community meeting.

@maxrjones maxrjones marked this pull request as ready for review January 24, 2026 19:51
@TomNicholas TomNicholas merged commit 5c51877 into zarr-developers:main Jan 24, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-upstream Run the upstream tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants