
Conversation

@zanieb
Member

@zanieb zanieb commented Jun 3, 2025

Investigating #13744

I tried reproducing here by running the test in a loop, but could not. I presume it's an interaction with other tests.

This drops the snapshot, but I think it's worth it to try to examine the flake?

@zanieb zanieb force-pushed the zb/debug-sync_dry_run branch from b32d18b to 4529f68 Compare June 3, 2025 14:21
@zanieb zanieb temporarily deployed to uv-test-publish June 3, 2025 14:27 — with GitHub Actions Inactive
@zanieb zanieb force-pushed the zb/debug-sync_dry_run branch from 4529f68 to f11d11d Compare June 3, 2025 14:41
@zanieb zanieb temporarily deployed to uv-test-publish June 3, 2025 14:44 — with GitHub Actions Inactive
@zanieb zanieb force-pushed the zb/debug-sync_dry_run branch from 165bf0a to 94571ee Compare June 3, 2025 14:51
@zanieb zanieb had a problem deploying to uv-test-publish June 3, 2025 14:53 — with GitHub Actions Failure
@zanieb zanieb temporarily deployed to uv-test-publish June 3, 2025 16:08 — with GitHub Actions Inactive
@zanieb zanieb temporarily deployed to uv-test-publish June 9, 2025 14:20 — with GitHub Actions Inactive
@zanieb zanieb changed the title from "Debug: Run sync_dry_run on infinite loop" to "Debug sync_dry_run flake by panicking with verbose output on failure" Jun 11, 2025
@zanieb zanieb force-pushed the zb/debug-sync_dry_run branch from f00f018 to 27aa4a3 Compare June 11, 2025 13:54
@zanieb zanieb force-pushed the zb/debug-sync_dry_run branch from 27aa4a3 to 74864ed Compare June 11, 2025 13:56
@zanieb zanieb marked this pull request as ready for review June 11, 2025 13:56
@zanieb zanieb added the testing Internal testing of behavior label Jun 11, 2025
@zanieb zanieb merged commit 806cc5c into main Jun 12, 2025
86 checks passed
@zanieb zanieb deleted the zb/debug-sync_dry_run branch June 12, 2025 13:05
zanieb added a commit that referenced this pull request Jun 26, 2025
In addition to our flake catch, keep a snapshot.

Extends #13817
zanieb added a commit that referenced this pull request Jun 26, 2025
zanieb added a commit that referenced this pull request Jun 30, 2025
This fixes an obscure cache collision in Python interpreter queries,
which we believe to be the root cause of CI flakes we've been seeing
where a project environment is invalidated and recreated.

This work follows from the logs in [this CI
run](https://github.com/astral-sh/uv/actions/runs/15934322410/job/44950599993?pr=14326)
which captured one of the flakes with tracing enabled. There, we can see
that the project environment is invalidated because the Python
interpreter in the environment has a different version than expected:

```
DEBUG Checking for Python environment at `.venv`
TRACE Cached interpreter info for Python 3.12.9, skipping probing: .venv/bin/python3
DEBUG The interpreter in the project environment has different version (3.12.9) than it was created with (3.9.21)
```

(this message is updated to reflect #14329)

The flow is roughly:

- We create an environment with 3.12.9
- We query the environment, and cache the interpreter version for
`.venv/bin/python`
- We create an environment for 3.9.21, replacing the existing one
- We query the environment, and read the cached information

The Python cache entries are keyed by the absolute path to the
interpreter, and rely on the modification time (ctime, nsec resolution)
of the canonicalized path to determine if the cache entry should be
invalidated. The key is a hex representation of a u64 sea hasher output
— which is very unlikely to collide.

After an audit of the Python query caching logic, we determined that the
most likely cause of a collision in cache entries is that the
modification times of the underlying interpreters are identical: the key
only covers the path inside the environment, so the timestamp is the
only thing that distinguishes one underlying interpreter from another.
This seems pretty feasible, especially if the file system does not
support nanosecond precision, though it appears that the GitHub runners
do support it.
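
A toy model of the suspected collision, as a rough sketch rather than
uv's actual code (which is Rust). Everything here is an assumption made
for illustration: the `query` and `cache_key` helpers, the in-memory
dict standing in for the on-disk cache, the paths, the timestamps, and
the use of `blake2b` in place of the sea hasher. The cache is keyed by
the environment path only, and the entry is trusted whenever the stored
timestamp matches the timestamp of the canonicalized interpreter:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CachedInfo:
    timestamp: int  # timestamp of the canonicalized interpreter (ns)
    version: str


def cache_key(absolute_path: str) -> str:
    # Stand-in for "hex representation of a u64 hash" (not the real hasher).
    return hashlib.blake2b(absolute_path.encode(), digest_size=8).hexdigest()


def query(cache, absolute_path, canonical_path, interpreters):
    # `interpreters` maps canonical path -> (timestamp, version) and plays
    # the role of the file system plus an actual interpreter probe.
    timestamp, version = interpreters[canonical_path]
    key = cache_key(absolute_path)
    entry = cache.get(key)
    if entry is not None and entry.timestamp == timestamp:
        return entry.version  # cache hit: probing is skipped
    cache[key] = CachedInfo(timestamp, version)
    return version


# Hypothetical: two different interpreters that share the same timestamp.
interpreters = {
    "/opt/python/3.12.9/bin/python3.12": (1_000, "3.12.9"),
    "/opt/python/3.9.21/bin/python3.9": (1_000, "3.9.21"),
}
cache = {}
venv_python = "/project/.venv/bin/python3"

first = query(cache, venv_python, "/opt/python/3.12.9/bin/python3.12", interpreters)
# The environment is recreated: same path, different underlying interpreter.
second = query(cache, venv_python, "/opt/python/3.9.21/bin/python3.9", interpreters)
print(first, second)
```

In this model the second query returns the stale 3.12.9 entry even
though the environment now points at 3.9.21, which matches the shape of
the log output above.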

The fix here is to include the canonicalized path in the cache key,
which ensures we're looking at the modification time of the _same_
underlying interpreter.
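
As a rough illustration of the idea (a Python model with hypothetical
helper names, paths, and timestamps, not the Rust change itself), the
key can be derived from both the absolute path and the canonicalized
path. `blake2b` stands in for the real hasher, and a dict stands in for
the on-disk cache:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CachedInfo:
    timestamp: int  # timestamp of the canonicalized interpreter (ns)
    version: str


def cache_key(absolute_path: str, canonical_path: str) -> str:
    # Hash both paths; the NUL separator keeps the two fields unambiguous.
    hasher = hashlib.blake2b(digest_size=8)
    hasher.update(absolute_path.encode())
    hasher.update(b"\0")
    hasher.update(canonical_path.encode())
    return hasher.hexdigest()


def query(cache, absolute_path, canonical_path, interpreters):
    # `interpreters` maps canonical path -> (timestamp, version) and plays
    # the role of the file system plus an actual interpreter probe.
    timestamp, version = interpreters[canonical_path]
    key = cache_key(absolute_path, canonical_path)
    entry = cache.get(key)
    if entry is not None and entry.timestamp == timestamp:
        return entry.version  # cache hit: probing is skipped
    cache[key] = CachedInfo(timestamp, version)
    return version


# Hypothetical: two different interpreters that share the same timestamp.
interpreters = {
    "/opt/python/3.12.9/bin/python3.12": (1_000, "3.12.9"),
    "/opt/python/3.9.21/bin/python3.9": (1_000, "3.9.21"),
}
cache = {}
venv_python = "/project/.venv/bin/python3"

first = query(cache, venv_python, "/opt/python/3.12.9/bin/python3.12", interpreters)
second = query(cache, venv_python, "/opt/python/3.9.21/bin/python3.9", interpreters)
print(first, second, len(cache))
```

With the canonicalized path in the key, the two interpreters get
separate entries even when their timestamps are identical, so the
timestamp comparison is only ever made against the same underlying
file.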

This will "invalidate" all existing interpreter cache entries but that's
not a big deal.

This should also have the effect of reducing cache churn for
interpreters in virtual environments. Now, when you change Python
versions, we won't invalidate the previous cache entry so if you change
_back_ to the old version we can re-use our cached information.

It's a bit speculative, since we don't have a deterministic reproduction
in CI, but this is the strongest candidate given the logs and should
increase correctness regardless.

Closes #14160
Closes #13744
Closes #13745

Once it's confirmed the flakes are resolved, we should revert:

- #14275
- #13817