Speedup discovery of large workspaces#18311
Conversation
| /// Collect the workspace member projects from the `members` and `excludes` entries. | ||
| async fn collect_members( | ||
| /// Collect the workspace member projects and build the workspace object. | ||
| async fn build( |
There was a problem hiding this comment.
I think I might still prefer "collect"? or just "new"? I think "build" is overloaded
555d801 to
a17a664
Compare
b9b1c55 to
bdab9cb
Compare
bdab9cb to
ed8c4f6
Compare
EliteTK
left a comment
There was a problem hiding this comment.
Looks good overall, not sure about the update_project stuff. Will need to give it a second look later before I'm confident I've understood it enough to mark it as approved.
| let slf = Arc::try_unwrap(slf).unwrap_or_else(|slf| { | ||
| if cfg!(debug_assertions) { | ||
| panic!( | ||
| "Cannot modify workspace still in use with {} references", | ||
| Arc::strong_count(&slf) | ||
| ); | ||
| } else { | ||
| (*slf).clone() | ||
| } | ||
| }); |
There was a problem hiding this comment.
It feels somewhat like it should be the caller's responsibility to pull the workspace out of the Arc.
There was a problem hiding this comment.
Also, it's a bit worrying that we're just ignoring the potential for a split-brained situation here when debug_assertions are off.
I think if we can shift responsibility to callers this might end up unnecessary anyway?
There was a problem hiding this comment.
I improved the comment a bit.
The idea is that the caller shouldn't be aware of the Arc at all, it should be an implementation detail. OTOH, you shouldn't change the workspace if there's other parts of uv holding on to the cache, editing should always be an exclusive operation (this is the case, but it musn't regress). Then finally, we want uv to never panic, so it's a debug assertion check only. I'm not particularly attached to it, this is just about how this nesting came to be.
There was a problem hiding this comment.
I mean the caller cannot call if the Arc has more than one reference. This seems like an abstraction leak in that case. Unless I missed something and it's impossible to call this from such a context. But this makes me wonder if there should be a "we're building this Workspace" kind of type which doesn't use Arc, and then a second "we've built this Workspace" kind of type which holds onto an Arc...?
I'll need to read a bit further into the surrounding context before I can give a more concrete sounding explanation of what I mean.
There was a problem hiding this comment.
Our current approach is that a workspace-modifying command reads the workspace the normal way, then applies a list of changes, then syncs. These are not performance critical in the cases I looked at, so I'm willing to do whatever is easiest to ensure correctness for.
There was a problem hiding this comment.
I think fundamentally what this code looks like it wants is some kind of RwLock rather than an Arc.
To make that work with #18311 (comment) then you would make the cache store:
enum Slot {
Loading(watch::Receiver<Option<Arc<RwLock<Workspace>>>>),
Ready(Arc<RwLock<Workspace>>),
}
And I think it should more or less handle all the cases?
There was a problem hiding this comment.
TBH I think this fine, it looks worse than it is, the cases that calls call this (uv add/uv remove/uv version) are flat and sequential, so I'm fine with the verbose debug assert and Arc` operation here.
cefa8f6 to
010f996
Compare
Gankra
left a comment
There was a problem hiding this comment.
The general approach is good but I'm worried about the maintainability of the implementation.
| /// The cache makes assumptions about [`DiscoveryOptions`]: | ||
| /// * For a given discovery path, `stop_discovery_at` either always or never sits between a project | ||
| /// and a workspace root. This means that we don't need to key discovery on `stop_discovery_at`. | ||
| /// * TODO(konsti): Support caching for [`MemberDiscovery`] modes that aren't `All`. |
There was a problem hiding this comment.
Do those actually come up much?
| let slf = Arc::try_unwrap(self).unwrap_or_else(|slf| { | ||
| if cfg!(debug_assertions) { | ||
| panic!( | ||
| "Cannot modify workspace still in use with {} references", | ||
| Arc::strong_count(&slf) | ||
| ); | ||
| } else { |
There was a problem hiding this comment.
A very interesting assertion to include only in debug, what's the concern of multiple outstanding references?
There was a problem hiding this comment.
You'd have some other type around that references the workspace you're just editing, which risks using the incorrect cached state later. This is trivially true atm, the Arc::try_unwrap only makes it explicit.
| let project_root = uv_fs::normalize_path(&project_root); | ||
| // Trim trailing slashes. | ||
| let project_root = project_root.components().collect::<PathBuf>(); |
There was a problem hiding this comment.
damn, uv_fs::normalize_path really failing to live up to its name
There was a problem hiding this comment.
More seriously this is copy-pasted in a lot of places and it would be really bad for anyone to not do exactly this so maybe this should be factored into a function?
| let workspace = Arc::new(workspace); | ||
| cache.insert(workspace.clone()); |
There was a problem hiding this comment.
Hmm I'm not a huge fan of "check cache" and "insert into cache" being in separate functions, this seems highly likely to cause errors. Ideally cache code is always shaped like
cache.get(x).or_insert(|| make_new(x))
There was a problem hiding this comment.
Per other comments about races it is hopefully fine that two threads could race on inserting into the cache when they're only doing get_or_insert. The only weirdness about it is they will not end up agreeing on the Arc, but both Arcs should hold the same contents unless someone is concurrently modifying the pyproject.toml or something, in which case uv can't really defend against that.
There was a problem hiding this comment.
Not really sure what a better implementation here is, given that workspace discovery is a slow operation that runs in parallel.
There was a problem hiding this comment.
But when it's running in parallel, I assume it's not useful if two parallel processes are discovering the same workspace simultaneously right?
So I think an option would be to have data structure such that:
getis an asynchronous operation. If someone else is currently discovering the workspace for that entry, it blocks until completion.getreturns some "entry" type which keeps that specific keyed entry locked until it is dropped. Such that,or_insertcan be an operation which carries that lock until the entry is actually written.
The alternative is that two parallel processes try to discover the same workspace ~simultaneously. Which at best is just wasted cycles, and at worst could slow down both operations.
There was a problem hiding this comment.
That's the question I had in the very beginning of the implementation too, and I'm still not entirely sure about:
The problem is that if you discover /foo/a and /foo/b, you can't know whether they'll belong to the same workspace or not until you read the current pyproject.toml and traverse up and (potentially) read the workspace root pyproject.toml. Potentially, all the paths you're trying to look at are part of the workspace defined by /pyproject.toml, or they could all be individual projects.
This leads to those options:
- Make workspace discovery sequential, so each discovery operation is sure that it uses the cache if one exists
- Discover the current project and the workspace root, then lock on discovering the rest of the workspace
- If we skip the current project and workspace root discovery if it's in the cache, we have a (smaller) race still
- If we always perform the current project and workspace root discovery for each new path, it gets noticeably slower. (I think it would undo the speed of this PR?)
To be more clear about the expensive work we have to do, the work to discover a workspace from a project is:
- Discover the current project
pyproject.toml - Traverse upwards and read the workspace root
pyproject.toml - Discover all the other members by reading their
pyproject.toml
Previously, we only cached (3), this PR adds caching for (1) and (2).
There was a problem hiding this comment.
I guess I have a few questions:
- Is there a situation where for a specific workspace, uv would initiate a discovery for more than one member before reading the workspace
pyproject.toml? - If all we need to cache is the
pyproject.tomlcontents, why isn't the cache explicitly designed around that? I don't see a problem with caching the workspaces too, but it seems like we'd get most of the way here by just cachingpyproject.tomlkeyed by path and having an invalidation mechanism? - If two discovery "threads" try to discover
/foo/aand/foo/bin parallel. Isn't the current design such that both of them will still discover the root of the workspace in parallel, then discover the other member in parallel, and only then write the cache entry at the end?
There was a problem hiding this comment.
Is there a situation where for a specific workspace, uv would initiate a discovery for more than one member before reading the workspace pyproject.toml?
Yes, e.g. bar depends on foo-a and foo-b, which are not in its own workspace.
If all we need to cache is the pyproject.toml contents, why isn't the cache explicitly designed around that? I don't see a problem with caching the workspaces too, but it seems like we'd get most of the way here by just caching pyproject.toml keyed by path and having an invalidation mechanism?
Yeah we can try doing that, this could handle a large fraction of what we're currently doing.
If two discovery "threads" try to discover
/foo/aand/foo/bin parallel. Isn't the current design such that both of them will still discover the root of the workspace in parallel, then discover the other member in parallel, and only then write the cache entry at the end?
Yes, changing that would be this option:
Discover the current project and the workspace root, then lock on discovering the rest of the workspace
If we skip the current project and workspace root discovery if it's in the cache, we have a (smaller) race still
There was a problem hiding this comment.
If you think the pyproject.toml cache will work, it'll probably be a simpler approach than trying to implement the "Discover the current project and the workspace root, then lock on discovering the rest of the workspace" approach.
If we skip the current project and workspace root discovery if it's in the cache, we have a (smaller) race still
You mean that foo-a's and foo-b's pyproject.tomls would be both read first by the initial parallel discovery tasks, and then one of the tasks would grab the workspace lock which would mean that the second task would wait, meaning the first task would still have to read the other pyproject.toml twice?
This seems like it is a good example of the kind of problem which would be implicitly avoided if we maintained a pyproject.toml cache. It could mean that foo-a's discovery task, which found the workspace root first, has to wait for foo-b's discovery task to parse the pyproject.toml. But this shouldn't slow anything down because foo-b's discovery task should normally be done parsing sooner than foo-a's discovery task would have if it had also simultaneously tried to parse the file at that point.
|
|
||
| // Update the `pypackage.toml` in-memory. | ||
| target = target.update(&content)?; | ||
| target = target.update(&content, &WorkspaceCache::default())?; |
There was a problem hiding this comment.
Do we really need to make two workspace caches in this function?
There was a problem hiding this comment.
It seems easier to not cache here than to try getting all subtlety correct here (unlike uv run where this performance matters).
| .workspaces | ||
| .lock() | ||
| .expect("there was a panic in another thread"); | ||
| cache.insert(workspace.install_path.clone(), workspace); |
There was a problem hiding this comment.
Notable here that this makes clear kinda redundant since you can just insert over a workspace and we don't care if it was already defined. And removing the workspace before inserting it again just gives another thread permission to go "oh shoot that's not in the cache, I gotta insert that workspace myself" only for this thread to come back around and overwrite it anyway.
Depending on the workloads you expect here it might be correct/necessary to hold the lock between clear and insert so no one gets confused.
There was a problem hiding this comment.
Which clear/insert pair do you mean?
9798dfa to
8b6d5ea
Compare
EliteTK
left a comment
There was a problem hiding this comment.
I think this code should work as is but I've left/expanded on some of the suggestions.
This time I feel pretty confident I understand all of it.
| let workspace = Arc::new(workspace); | ||
| cache.insert(workspace.clone()); |
There was a problem hiding this comment.
But when it's running in parallel, I assume it's not useful if two parallel processes are discovering the same workspace simultaneously right?
So I think an option would be to have data structure such that:
getis an asynchronous operation. If someone else is currently discovering the workspace for that entry, it blocks until completion.getreturns some "entry" type which keeps that specific keyed entry locked until it is dropped. Such that,or_insertcan be an operation which carries that lock until the entry is actually written.
The alternative is that two parallel processes try to discover the same workspace ~simultaneously. Which at best is just wasted cycles, and at worst could slow down both operations.
| let slf = Arc::try_unwrap(slf).unwrap_or_else(|slf| { | ||
| if cfg!(debug_assertions) { | ||
| panic!( | ||
| "Cannot modify workspace still in use with {} references", | ||
| Arc::strong_count(&slf) | ||
| ); | ||
| } else { | ||
| (*slf).clone() | ||
| } | ||
| }); |
There was a problem hiding this comment.
I think fundamentally what this code looks like it wants is some kind of RwLock rather than an Arc.
To make that work with #18311 (comment) then you would make the cache store:
enum Slot {
Loading(watch::Receiver<Option<Arc<RwLock<Workspace>>>>),
Ready(Arc<RwLock<Workspace>>),
}
And I think it should more or less handle all the cases?
7dd1cc5 to
e126ea3
Compare
933e0c5 to
69f8e0b
Compare
uv test inventory changesThis PR changes the tests when compared with the latest
|
50418f3 to
037debf
Compare
9ae5855 to
e5ede0a
Compare
Released on 2026-06-10. ### Enhancements - Add `--emit-index-url` and `--emit-find-links` to `uv export` ([#18370](#18370)) - Add `--find-links` support for `uv pip list` ([#16103](#16103)) - Group executable install errors during `uv python install` ([#19691](#19691)) - Use ICF in macOS release builds to reduce binary sizes ([#19615](#19615)) ### Preview features - Add initial hidden `uv upgrade` command ([#19678](#19678)) - Reject Git revisions in `uv upgrade` ([#19742](#19742)) ### Configuration - Recognize `UV_NO_INSTALL_PROJECT`, `UV_NO_INSTALL_WORKSPACE`, `UV_NO_INSTALL_LOCAL` ([#19323](#19323)) ### Performance - Speed up discovery of large workspaces ([#18311](#18311)) ### Bug fixes - Allow unknown preview flags with a warning again ([#19669](#19669)) - Apply dependency exclusions to direct requirements ([#19699](#19699)) - Avoid following external symlinks during cache clean ([#19682](#19682)) - Avoid following symlinks during cache prune ([#19543](#19543)) - Fix Git cache keys for worktrees and packed refs ([#19706](#19706)) - Make resolver error handling iterative to avoid stack overflows ([#19695](#19695)) - Pass `VIRTUAL_ENV` through `cygpath` inside `fish` on Windows ([#19703](#19703)) - Rebuild explicit local directory tool installs ([#19591](#19591)) - Validate egg top-level entries as identifiers ([#19679](#19679)) ### Documentation - Document `--find-links` caching behavior ([#19585](#19585)) - Add a small section for malware checks ([#19680](#19680))
Adding the `stop_discovery_at.is_none()` at https://github.com/astral-sh/uv/blob/53d3f5443441ff5ef17b66a0d68cf7091dd5a091/crates/uv-workspace/src/workspace.rs#L1734-L1754 in #18311 inadvertently caused a regression when `stop_discovery_at` was missing during workspace discovery. This PR fixes that.
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [ghcr.io/astral-sh/uv](https://github.com/astral-sh/uv) | stage | patch | `0.11.19` → `0.11.20` | --- ### Release Notes <details> <summary>astral-sh/uv (ghcr.io/astral-sh/uv)</summary> ### [`v0.11.20`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#01120) Released on 2026-06-10. ##### Enhancements - Add `--emit-index-url` and `--emit-find-links` to `uv export` ([#​18370](astral-sh/uv#18370)) - Add `--find-links` support for `uv pip list` ([#​16103](astral-sh/uv#16103)) - Group executable install errors during `uv python install` ([#​19691](astral-sh/uv#19691)) - Use ICF in macOS release builds to reduce binary sizes ([#​19615](astral-sh/uv#19615)) ##### Preview features - Add initial hidden `uv upgrade` command ([#​19678](astral-sh/uv#19678)) - Reject Git revisions in `uv upgrade` ([#​19742](astral-sh/uv#19742)) ##### Configuration - Recognize `UV_NO_INSTALL_PROJECT`, `UV_NO_INSTALL_WORKSPACE`, `UV_NO_INSTALL_LOCAL` ([#​19323](astral-sh/uv#19323)) ##### Performance - Speed up discovery of large workspaces ([#​18311](astral-sh/uv#18311)) ##### Bug fixes - Allow unknown preview flags with a warning again ([#​19669](astral-sh/uv#19669)) - Apply dependency exclusions to direct requirements ([#​19699](astral-sh/uv#19699)) - Avoid following external symlinks during cache clean ([#​19682](astral-sh/uv#19682)) - Avoid following symlinks during cache prune ([#​19543](astral-sh/uv#19543)) - Fix Git cache keys for worktrees and packed refs ([#​19706](astral-sh/uv#19706)) - Make resolver error handling iterative to avoid stack overflows ([#​19695](astral-sh/uv#19695)) - Pass `VIRTUAL_ENV` through `cygpath` inside `fish` on Windows ([#​19703](astral-sh/uv#19703)) - Rebuild explicit local directory tool installs ([#​19591](astral-sh/uv#19591)) - Validate egg top-level entries as identifiers ([#​19679](astral-sh/uv#19679)) ##### Documentation - Document `--find-links` caching behavior ([#​19585](astral-sh/uv#19585)) - Add a small section for malware checks ([#​19680](astral-sh/uv#19680)) </details> --- ### Configuration 📅 **Schedule**: (UTC) - Branch creation - At any time (no schedule defined) - Automerge - At any time (no schedule defined) 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4yMjAuMCIsInVwZGF0ZWRJblZlciI6IjQzLjIyMC4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiXX0=--> Co-authored-by: Renovate Bot <renovate@bhamm-lab.com> Reviewed-on: https://codeberg.org/blake-hamm/bhamm-lab/pulls/186
This change avoids reading workspace root and member
pyproject.tomlmultiple times for the same information. It remove all duplicate reading except for cache keys.With the whole stack merged, it's ~1.8x faster for
uv run python -Vin airflow (ignoring the one uncacheable wheel with--offline):hyperfine --warmup 5 --runs 50on airflow:When analyzing the spans, reading the root
airflow/pyproject.tomlgoes from 130 times to 2 times (regular read and cache keys), while for each workspace member, it goes from 4 to 2 reads (regular read and cache keys), removing 2 full reads each.Before
After