Move ObjectStoreRegistry and Reader functionality to obspec_utils#844
Move ObjectStoreRegistry and Reader functionality to obspec_utils#844TomNicholas merged 18 commits intozarr-developers:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #844 +/- ##
==========================================
- Coverage 89.33% 88.99% -0.34%
==========================================
Files 34 34
Lines 2015 1945 -70
==========================================
- Hits 1800 1731 -69
+ Misses 215 214 -1
🚀 New features to boost your workflow:
|
|
This is technically a breaking change to the API, because the type of he ObjectStoreRegistry has changed. (Note that it literally changes the |
I'll think on this a bit more. I forgot that a reason for obspec-utils was not only to have the code available as a lighter dependency but also to change the typing from obstore-based to obspec-based, which would be good to test before obspec-utils is a dependency of virtualizarr |
TomNicholas
left a comment
There was a problem hiding this comment.
Shall we just merge this? It all looks great to me.
I think that while this is technically true, a downstream package would have to have been doing something fairly weird with imports to be affected by this change. |
Definitely not yet 😞 I'm not sure if the latest adaptive strategy was actually a bad idea relative to the simple 16 MB request splitting in obspec-util's EagerReadableStore or if something else is wrong, but I'm now seeing a performance hit rather than gain for virtualizing |
we may just need to make the specific reader configurable, but I'll give it some more thought after rest |
Adaptive strategy? I really think that copying Icechunk is probably all we should do here. We already know that that works really well at fetching the whole blob, it's deterministic, predictable, configurable, and already a massive improvement on what's in VZ |
How can it be? Isn't developmentseed/obspec-utils#26 basically just setting defaults? Wasn't the previous iteration effectively just the same thing but with a default |
It's a bit adaptive because the chunk size can increase past the default of 12 MB if the default chunk size would increase the number of requests beyond the max concurrent requests limits (18). So if the file size is >216MB the request size will be > 12 MB.
I'm currently thinking that the eager strategy is best for very-not-cloud-optimized files like GOES, but worse for only slightly not-cloud-optimized files where there are few variables such that there's not so much dispersed metadata through the file. My preferred path forward would be to split this PR up into two PRs to isolate the changes that I'm less certain about ( changing the default reader in the HDF Parser) from the changes that originally motivated this PR (moving ObjectStoreRegistry into obspec_utils). This also isolates the pseudo-breaking changes (ObjectStoreRegistry as external) from internal implementation defaults (which reader we use in the HDF parser). |
I see. I'm of the opinion that even the less-efficient of these is still much much better than what's in
Makes sense, though I think we would have to check carefully to be sure.
Sounds like a good idea. In that case how about we:
That way users benefit from a big improvement immediately, we don't have to work off unmerged PRs, but they/we are still free to optimize further later, opt back in to old behaviour if necessary, or to optimize reader strategies for specific uses/scenarios/parsers later. |
|
Another reason to expose the choice of parser is that the speedup of the |
|
I agree with all of your points. I won't immediately be able to update this PR, so please feel welcome to take it over if you want. |
|
FYI I just remembered that |
|
Still 100% on board with making the reader configurable, but wanted to point out that #855 uses the BufferedReadableStore (similar to the old default) but is still faster than the EagerReadableStore by using caching at a level that is available for loadable variables as well. This is in-line with what we discussed during yesterday's community meeting. |
I think that
ObstoreReaderandObjectStoreRegistryare useful outside of virtualizarr (e.g., for earthaccess) and should therefore be a separate library. I started obspec-utils accordingly, in line with some past discussions.