Content-address distributions in the cache by distribution hash#16816
Content-address distributions in the cache by distribution hash#16816charliermarsh wants to merge 11 commits into
Conversation
d42bbf0 to
0bf8b5c
Compare
|
Still a few things I want to improve here. |
0bf8b5c to
3bf79e2
Compare
cfa9f83 to
0b8c764
Compare
084d601 to
d111e5c
Compare
| raise BackendUnavailable(data.get('traceback', '')) | ||
| pip._vendor.pyproject_hooks._impl.BackendUnavailable: Traceback (most recent call last): | ||
| File "/Users/example/.cache/uv/archive-v0/3783IbOdglemN3ieOULx2/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend | ||
| File "/Users/example/.cache/uv/archive-v0/97de8790030bbd5c2d96b7ec782fc2f7820ef8dba6db909ccf95449f2d062d4b/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend |
There was a problem hiding this comment.
Another risk here is that this is significantly longer which hurts path length.
There was a problem hiding this comment.
We could base64.urlsafe_b64encode it which would be ~43 characters (less than the 64 here, but more than the 21 we used before).
There was a problem hiding this comment.
A few ideas...
- base64 encoding seems reasonable
- we might want to store it as
{:2}/{2:}? git and npm do this to shard directories. I guess we don't have that problem today but if we're changing it maybe we should consider it? It looks like you did in 3bf79e2 ? - We could do a truncated hash with a package id for collisions?
{:8}/{package-id}(I guess thepackage-idcould come first?). We'd could persist the full hash to a file for a safety check too.
There was a problem hiding this comment.
Yes. I did it as {:2}/{2:4}/{4:} in an earlier commit then rolled it back because it makes various things more complicated (e.g., for cache prune we have to figure out if we can prune the directories recursively). I can re-add it if it seems compelling.
There was a problem hiding this comment.
We could do a truncated hash with a package id for collisions?
I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible. It's meant to be more general (e.g., we also use it for cached environments).
There was a problem hiding this comment.
({:2}/{2:4}/{4:} is what PyPI uses; it looks like pip does {:2}/{2:4}/{4:6}/{6:}?)
There was a problem hiding this comment.
then rolled it back because it makes various things more complicated
Fair enough. I think people do it to avoid directory size limits (i.e., the number of items allowed in a single directory). I think we'd have had this problem already though if it was a concern for us? It seems fairly trivial to check both locations in the future if we determine we need it.
I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible.
I think the idea that there's a "disambiguating" component for collisions if we truncate the hash doesn't need to be tied to "package names" specifically. The most generic way to do it would be to have /0, /1, ... directories with /{id}/HASH files and iterate over them? I sort of don't like that though :)
It's broadly unclear to me how much engineering we should do to avoid a long path length.
There was a problem hiding this comment.
It may not really matter. I can't remember the specifics but what ends up happening here is: we create a temp dir, unzip it, then we move the temp dir into this location and hardlink from this location. So I don't think we end up referencing paths within these archives?
|
How does this relate to #888? |
|
iirc The hash checking ideas of RECORD never materialized, pip doesn't check the RECORD and neither does uv, and there's plans to remove it (https://discuss.python.org/t/discouraging-deprecating-pep-427-style-signatures/94968). The consensus has shifted to using hashes and signature for the entire archive that are presented outside of the archive, on the index page, instead of being shipped with the archive. |
|
What's the likelihood of this being merged anytime soon @charliermarsh? Currently running into the following issue, which this would solve:
If the wheel happens to be, say, |
8a87bf7 to
6d22a38
Compare
|
NB: If we do this, we might want to also remove our |
5204144 to
5418c08
Compare
5418c08 to
873f92a
Compare
|
I likely need to re-benchmark this prior to landing. |
873f92a to
fa879f9
Compare
uv test inventory changesThis PR changes the tests when compared with the latest
|
Co-authored-by: Codex <noreply@openai.com>
19b1dc8 to
ed760bb
Compare
|
There doesn't seem to be much of an effect vs. main for local wheels (which is the case I'm worried about, since we now have to hash those in addition to unzipping them): That said... I have no idea if this will generalize to other machines, since this is a pretty hefty MacBook Pro. |
|
What's the story for compatibility with the existing archive storage? I can try to do some benchmarking on smaller machines too. |
It "just works". If you have existing archives, we continue to use them. If you access a new archive, we generate a content-addressed ID, and those IDs never collide with our existing IDs. |
zanieb
left a comment
There was a problem hiding this comment.
LGTM — I'll report back with some more benchmarks tomorrow.
|
Thanks. I'm also having Codex explore whether there's a way to use blake3 hazmat to do something more clever to avoid reading the file twice. In general I'm nervous about the change as-is due to the performance implications for large local files... |
|
The wheels you tested on a Namespace runner with restricted CPU counts
And some synthetic wheel shapes
And some synthetic "big file" cases
So it looks like the worst-case is the 1 GiB / 16-core case which was about +27 ms / 10.5% slower. I'm a bit confused it got comparatively slower on large files as the CPU count increased, so I'm going to look into that. |
|
Just to double-confirm, these are local wheels already available on disk right? |
|
Yep! |
Summary
The core here is to always compute at least a BLAKE3 hash for all wheels, then store unzipped wheels in a content-address location in the archive directory. This will help with disk space (since we'll avoid storing multiple copies of the same wheel contents) and cache reuse, since we can now reuse unzipped distributions from
uv pip installinuv synccommands (which always require hashes already).We use BLAKE3 since it's especially fast for files that are already present on disk via it's Rayon and memory-mapping features (\cc @oconnor663).
Closes #1061.
Closes #13995.
Closes #16786.