Content-address distributions in the cache by distribution hash by charliermarsh · Pull Request #16816 · astral-sh/uv

charliermarsh · 2025-11-22T03:05:15Z

Summary

The core idea here is to always compute at least a BLAKE3 hash for every wheel, then store unzipped wheels in a content-addressed location in the archive directory. This will help with disk space (since we'll avoid storing multiple copies of the same wheel contents) and with cache reuse, since we can now reuse unzipped distributions from uv pip install in uv sync commands (which already always require hashes).
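As a rough illustration of the idea (not uv's actual code, which is written in Rust), the sketch below derives a cache directory from a wheel's bytes. BLAKE2b from Python's standard library stands in for BLAKE3, and `content_key`, `archive_path`, and the cache root are hypothetical names chosen for this example.

```python
import hashlib
from pathlib import Path


def content_key(wheel_bytes: bytes) -> str:
    """Return a hex digest that identifies the wheel contents.

    Assumption: BLAKE2b with a 32-byte digest stands in for BLAKE3,
    which is not part of the Python standard library.
    """
    return hashlib.blake2b(wheel_bytes, digest_size=32).hexdigest()


def archive_path(cache_root: Path, wheel_bytes: bytes) -> Path:
    """Map wheel contents to a directory under a hypothetical archive root."""
    return cache_root / "archive-v0" / content_key(wheel_bytes)


if __name__ == "__main__":
    root = Path("/tmp/uv-cache-sketch")  # hypothetical cache location
    first = archive_path(root, b"same wheel bytes")
    second = archive_path(root, b"same wheel bytes")
    other = archive_path(root, b"different wheel bytes")
    # Identical contents map to the same directory; different contents do not.
    print(first == second)
    print(first == other)
```

Because the directory name depends only on the contents, two installs of the same wheel resolve to the same archive entry instead of creating a second copy.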

We use BLAKE3 since it's especially fast for files that are already present on disk, via its Rayon and memory-mapping features (cc @oconnor663).

Closes #1061.

Closes #13995.

Closes #16786.

charliermarsh · 2025-11-22T03:55:41Z

Still a few things I want to improve here.

charliermarsh · 2025-11-22T15:09:36Z

    raise BackendUnavailable(data.get('traceback', ''))
 pip._vendor.pyproject_hooks._impl.BackendUnavailable: Traceback (most recent call last):
-  File "/Users/example/.cache/uv/archive-v0/3783IbOdglemN3ieOULx2/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
+  File "/Users/example/.cache/uv/archive-v0/97de8790030bbd5c2d96b7ec782fc2f7820ef8dba6db909ccf95449f2d062d4b/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend


Another risk here is that the directory name is now significantly longer, which hurts path length.

We could base64.urlsafe_b64encode it, which would be ~43 characters (fewer than the 64 here, but more than the 21 we used before).
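The character counts above can be checked with a short sketch. It assumes a 32-byte digest (BLAKE2b stands in for BLAKE3 here, since only the digest length matters) and assumes the trailing `=` padding is stripped; neither detail is confirmed by the thread.

```python
import base64
import hashlib

# Any 32-byte digest works for counting characters; BLAKE2b is a stand-in.
digest = hashlib.blake2b(b"example wheel contents", digest_size=32).digest()

hex_id = digest.hex()
# URL-safe base64 avoids "/" and "+", which matter in directory names.
b64_id = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

print(len(hex_id))  # 64
print(len(b64_id))  # 43
```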

A few ideas...

base64 encoding seems reasonable.

We might want to store it as {:2}/{2:}? git and npm do this to shard directories. I guess we don't have that problem today, but if we're changing it maybe we should consider it? It looks like you did this in 3bf79e2?

We could do a truncated hash with a package id for collisions? {:8}/{package-id} (I guess the package-id could come first?). We could persist the full hash to a file for a safety check too.

Yes. I did it as {:2}/{2:4}/{4:} in an earlier commit then rolled it back because it makes various things more complicated (e.g., for cache prune we have to figure out if we can prune the directories recursively). I can re-add it if it seems compelling.

> We could do a truncated hash with a package id for collisions?

I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible. It's meant to be more general (e.g., we also use it for cached environments).

({:2}/{2:4}/{4:} is what PyPI uses; it looks like pip does {:2}/{2:4}/{4:6}/{6:}?)
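The slicing notation used in this thread can be read as splitting the hex digest into leading prefix directories. The helper below is a hypothetical sketch of that layout, not code from this PR (the author notes above that the sharded layout was rolled back).

```python
from pathlib import PurePosixPath


def shard(hex_digest: str, widths: tuple[int, ...] = (2, 2)) -> PurePosixPath:
    """Split a digest into prefix directories followed by the remainder.

    widths=(2,)      gives {:2}/{2:}
    widths=(2, 2)    gives {:2}/{2:4}/{4:}
    widths=(2, 2, 2) gives {:2}/{2:4}/{4:6}/{6:}
    """
    parts = []
    start = 0
    for width in widths:
        parts.append(hex_digest[start:start + width])
        start += width
    # Whatever is left after the prefixes becomes the final path component.
    parts.append(hex_digest[start:])
    return PurePosixPath(*parts)


if __name__ == "__main__":
    print(shard("abcdef0123"))             # ab/cd/ef0123
    print(shard("abcdef0123", (2,)))       # ab/cdef0123
    print(shard("abcdef0123", (2, 2, 2)))  # ab/cd/ef/0123
```

The trade-off raised in the thread applies here: each extra level keeps directories smaller but means pruning has to walk and clean up nested directories.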

> then rolled it back because it makes various things more complicated

Fair enough. I think people do it to avoid directory size limits (i.e., the number of items allowed in a single directory). I think we'd have had this problem already though if it was a concern for us? It seems fairly trivial to check both locations in the future if we determine we need it.

> I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible.

I think the idea that there's a "disambiguating" component for collisions if we truncate the hash doesn't need to be tied to "package names" specifically. The most generic way to do it would be to have /0, /1, ... directories with /{id}/HASH files and iterate over them? I sort of don't like that though :)

It's broadly unclear to me how much engineering we should do to avoid a long path length.

It may not really matter. I can't remember the specifics, but what ends up happening here is: we create a temp dir, unzip the wheel into it, then move the temp dir into this location and hardlink from this location. So I don't think we end up referencing paths within these archives?
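The flow described above (temp dir, unzip, move, hardlink) can be sketched roughly as follows. This is an illustration under assumptions, not uv's implementation: `unpack_into_archive` and `link_into` are hypothetical names, and the race handling is a simplification.

```python
import os
import shutil
import tempfile
import zipfile
from pathlib import Path


def unpack_into_archive(wheel: Path, archive_dir: Path, key: str) -> Path:
    """Unzip `wheel` into a temp dir, then rename it to archive_dir/key."""
    target = archive_dir / key
    if target.exists():
        return target  # already unpacked by an earlier install
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Create the temp dir next to the target so the rename stays on one filesystem.
    temp = Path(tempfile.mkdtemp(dir=archive_dir))
    with zipfile.ZipFile(wheel) as zf:
        zf.extractall(temp)
    try:
        os.rename(temp, target)
    except OSError:
        # Assume another process populated the target first; drop our copy.
        shutil.rmtree(temp, ignore_errors=True)
        if not target.exists():
            raise
    return target


def link_into(target: Path, site_packages: Path) -> None:
    """Hardlink every file from the archive entry into site_packages."""
    for src in target.rglob("*"):
        dst = site_packages / src.relative_to(target)
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            os.link(src, dst)
```

In this sketch the installed files are reached through paths under `site_packages`, while the archive entry's own (longer) path is only used during the rename and link steps.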

zanieb · 2025-12-02T10:02:42Z

How does this relate to #888?

konstin · 2025-12-02T12:37:10Z

IIRC, the hash-checking ideas of RECORD never materialized: pip doesn't check the RECORD and neither does uv, and there are plans to remove it (https://discuss.python.org/t/discouraging-deprecating-pep-427-style-signatures/94968). The consensus has shifted to using hashes and signatures for the entire archive that are presented outside of the archive, on the index page, instead of being shipped with the archive.

sandyharvie · 2026-03-21T00:59:53Z

What's the likelihood of this being merged anytime soon @charliermarsh? Currently running into the following issue, which this would solve:

Node comes up, global cache is configured
A bunch of workers are scheduled all at once on the node, each of which proceeds to set up its own virtual environment using this global cache
If N of these workers need a particular wheel, we often end up with N copies of that unzipped wheel in the cache

If the wheel happens to be, say, torch, this can cause the cache to balloon in size by ~2N GiB. When there are many such wheels and the node is sufficiently large (i.e., N is large), we see our cache instantly grow to upwards of 1 TiB!

woodruffw · 2026-05-18T22:53:18Z

NB: If we do this, we might want to also remove our seahash dep and use BLAKE3 everywhere 🙂

charliermarsh · 2026-05-30T14:26:13Z

I likely need to re-benchmark this prior to landing.

astral-sh-bot · 2026-05-30T15:24:43Z

uv test inventory changes

This PR changes the tests when compared with the latest main baseline.

Added tests: 1
Removed tests: 0
Changed suites: 1

uv-distribution-types: +1 / -0

Added:

uv-distribution-types::hash::tests::algorithms_include_only_required_hashes

Removed: none

Co-authored-by: Codex <noreply@openai.com>

charliermarsh · 2026-06-02T01:17:32Z

There doesn't seem to be much of an effect vs. main for local wheels (which is the case I'm worried about, since we now have to hash those in addition to unzipping them):

Wheel       Size    main mean    current mean    Result
━━━━━━━  ━━━━━━━━━  ━━━━━━━━━━━  ━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tomli     12 KiB      44.6 ms         42.8 ms    current 1.04x faster
───────  ─────────  ───────────  ──────────────  ────────────────────────────────────
black    1.3 MiB      50.5 ms         50.3 ms    same
───────  ─────────  ───────────  ──────────────  ────────────────────────────────────
numpy     13 MiB     120.8 ms        122.7 ms    current 1.02x slower, within noise
───────  ─────────  ───────────  ──────────────  ────────────────────────────────────
torch     84 MiB    1225.8 ms       1223.1 ms    same

That said... I have no idea if this will generalize to other machines, since this is a pretty hefty MacBook Pro.

zanieb · 2026-06-02T01:47:37Z

What's the story for compatibility with the existing archive storage?

I can try to do some benchmarking on smaller machines too.

charliermarsh · 2026-06-02T01:51:48Z

> What's the story for compatibility with the existing archive storage?

It "just works". If you have existing archives, we continue to use them. If you access a new archive, we generate a content-addressed ID, and those IDs never collide with our existing IDs.

zanieb

LGTM — I'll report back with some more benchmarks tomorrow.

charliermarsh · 2026-06-02T01:54:27Z

Thanks. I'm also having Codex explore whether there's a way to use blake3 hazmat to do something more clever to avoid reading the file twice. In general I'm nervous about the change as-is due to the performance implications for large local files...

zanieb · 2026-06-02T02:47:38Z

Here are the wheels you tested, on a Namespace runner with restricted CPU counts:

CPU	tomli	black	numpy	torch	Geomean
2	0.987x	0.970x	1.010x	1.017x	0.996x
4	1.031x	0.916x	0.973x	0.935x	0.963x
8	0.989x	1.024x	0.998x	0.986x	0.999x
16	1.000x	0.982x	0.998x	1.022x	1.000x
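The Geomean column in the table above can be reproduced from the four per-wheel ratios in each row. The snippet below is only an arithmetic check on the numbers quoted in the comment; how the ratios themselves were measured is not stated here.

```python
from statistics import geometric_mean

# Per-wheel ratios (tomli, black, numpy, torch) copied from the table, keyed by CPU count.
ratios = {
    2: [0.987, 0.970, 1.010, 1.017],
    4: [1.031, 0.916, 0.973, 0.935],
    8: [0.989, 1.024, 0.998, 0.986],
    16: [1.000, 0.982, 0.998, 1.022],
}

geomeans = {cpus: round(geometric_mean(row), 3) for cpus, row in ratios.items()}
for cpus, value in geomeans.items():
    print(f"{cpus:>2} CPUs: {value:.3f}x")
```

A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 2x slowdown then cancel out to 1.0x.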

And some synthetic wheel shapes

Shape	Ratio
single stored 64 MiB	1.023x
single stored 256 MiB	1.026x
10k tiny files	1.007x
1k medium files	1.025x
single deflated zero 512 MiB	0.993x

And some synthetic "big file" cases

CPU	512 MiB	1 GiB
2	0.996x same	1.037x slower
4	1.034x slower	1.060x slower
8	1.039x slower	1.063x slower
16	1.083x slower	1.105x slower

So it looks like the worst case is the 1 GiB / 16-core case, which was about +27 ms / 10.5% slower.

I'm a bit confused that it got comparatively slower on large files as the CPU count increased, so I'm going to look into that.

charliermarsh · 2026-06-02T12:14:50Z

Just to double-confirm, these are local wheels already available on disk, right?

zanieb · 2026-06-02T13:31:24Z

Yep!

charliermarsh force-pushed the charlie/content branch from d42bbf0 to 0bf8b5c Compare November 22, 2025 03:16

charliermarsh force-pushed the charlie/content branch from 0bf8b5c to 3bf79e2 Compare November 22, 2025 04:15

charliermarsh commented Nov 22, 2025

Comment thread crates/uv-distribution/src/distribution_database.rs

charliermarsh force-pushed the charlie/content branch from cfa9f83 to 0b8c764 Compare November 22, 2025 14:39

charliermarsh requested review from konstin and zanieb November 22, 2025 14:45

charliermarsh marked this pull request as ready for review November 22, 2025 14:46

charliermarsh force-pushed the charlie/content branch 2 times, most recently from 084d601 to d111e5c Compare November 22, 2025 15:09

konstin added the enhancement and performance labels Dec 1, 2025

charliermarsh force-pushed the charlie/content branch 8 times, most recently from 8a87bf7 to 6d22a38 Compare March 26, 2026 02:32

charliermarsh force-pushed the charlie/content branch from 5204144 to 5418c08 Compare May 30, 2026 14:01

charliermarsh marked this pull request as draft May 30, 2026 14:06

charliermarsh force-pushed the charlie/content branch from 5418c08 to 873f92a Compare May 30, 2026 14:25

charliermarsh force-pushed the charlie/content branch from 873f92a to fa879f9 Compare May 30, 2026 14:34

charliermarsh marked this pull request as ready for review May 30, 2026 14:59

charliermarsh and others added 11 commits June 1, 2026 19:11

Content-address distributions in the archive

9a7d11d

Use a slash-delimited path

a415a5b

Remove sync zip

6582806

Fix tests

d5a7a7f

Remove prefix

f77aa29

Always use Blake-3

45f89a5

Use mmap

4c28b93

Truncate to 30 characters

c0ff932

Fix post-rebase verification issues

1ebcbff

Co-authored-by: Codex <noreply@openai.com>

Simplify post-rebase fixes

45b6bfe

Use faster path

ed760bb

charliermarsh force-pushed the charlie/content branch from 19b1dc8 to ed760bb Compare June 2, 2026 00:18

zanieb approved these changes Jun 2, 2026

charliermarsh changed the title ~~Content-address distributions in the archive~~ Content-address distributions in the cache by distribution hash Jun 3, 2026

charliermarsh mentioned this pull request Jun 3, 2026

Content-address distributions in the cache by dirhash #19665

Closed

5 participants
