
Handle the hard link limit gracefully instead of failing#17699

Merged
EliteTK merged 8 commits into astral-sh:main from amol-:cache-reset-links
Mar 5, 2026

Conversation

amol- (Contributor) commented Jan 26, 2026

Summary

Handle the case where too many hard links have been created, causing package installation to fail.

Some file systems have a hard link limit, and uv fails when it's hit;
for example, AWS EFS has a limit of 177 hard links ( https://docs.aws.amazon.com/efs/latest/ug/troubleshooting-efs-fileop-errors.html#hardlinkerror ).

This PR addresses this by resetting the hard links when the limit is reached (it does this by replacing the file in the cache, so new hard links can be made).

There can be race conditions around the limit, but those are okay: the links are reset atomically, so in case of a race the worst outcome is that the limit is reset twice, and nothing will break.
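Conceptually, the reset described above can be sketched in Rust as follows. This is a hedged illustration only, not uv's implementation: the function name, the hard-coded errno, and the temp-file naming are all assumptions made for the sketch.

```rust
use std::fs;
use std::io;
use std::path::Path;

// EMLINK ("Too many links") is os error 31 on Linux; the Windows
// equivalent is ERROR_TOO_MANY_LINKS (1142). Assumed here for brevity.
const EMLINK: i32 = 31;

/// Hypothetical sketch (not uv's actual code): hard-link `src` to
/// `dst`; if the file system's link limit is hit, rewrite the cache
/// entry (which resets its link count to 1) and retry the link.
fn link_or_reset(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::hard_link(src, dst) {
        Ok(()) => Ok(()),
        Err(e) if e.raw_os_error() == Some(EMLINK) => {
            let parent = src.parent().unwrap_or_else(|| Path::new("."));
            // Real code would use a uniquely named temp file here.
            let tmp = parent.join(".tmp-link-reset");
            fs::copy(src, &tmp)?;
            // `rename` is atomic, so a concurrent reset by another
            // process only wastes a copy; it cannot corrupt the entry.
            fs::rename(&tmp, src)?;
            fs::hard_link(src, dst)
        }
        Err(e) => Err(e),
    }
}
```

The key property is that readers always see either the old or the new cache entry, never a partially written one, which is why concurrent resets are merely wasteful rather than unsafe.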

Test Plan

Added an install_hardlink_after_emlink test; it successfully reproduced the issue on my system.
The downside is that it's expensive to run, as it has to generate a lot of hard links.

konstin (Member) commented Jan 26, 2026

Before making a change, we need a clearer description of the problem. Can you share instructions on how to reproduce the problem, as well as the error message you received and the logs? Why do we need an extra option for this; can we handle this better internally?

amol- (Contributor, Author) commented Jan 26, 2026

Before making a change, we need a clearer description of the problem
Can you share instructions how to reproduce the problem,

As explained in the description, you can reproduce this by using EFS and installing a package into more than 177 Python environments. That's probably the easiest way.

as well as the error message you received and the logs?

You would get an error like

2026/01/06 09:01:25.755055774       XXXX/archive-v0/XCaM_31VGMkcLZe9KeK-L/pkg_resources/__init__.py
2026/01/06 09:01:25.755067836       to
2026/01/06 09:01:25.755068532       YYYY/builds-v0/.tmp5VSfaw/lib/python3.14/site-packages/pkg_resources/__init__.py:
2026/01/06 09:01:25.755077725       Too many links (os error 31)

Why do we need an extra option for this; can we handle this better internally?

I think we can detect the OS error and handle it automatically, but I wasn't very fond of introducing an install bottleneck without the user being aware of it. Adding an explicit option seemed to ensure the user was aware the bottleneck was there. But I'm happy to switch to automatic handling of the error if you feel that approach is more aligned with uv's user experience.

EliteTK (Contributor) commented Jan 26, 2026

Rather than having the user figure out the hard link limit and specify it themselves, we should just handle EMLINK and ERROR_TOO_MANY_LINKS gracefully.

Moreover, I believe we don't lock the cache for exclusive access in these cases, so there's a chance that two instances of uv end up copying the file at the same time. But the way this is written, this won't cause any corruption, just a bit of unnecessary work.

amol- (Contributor, Author) commented Jan 26, 2026

Rather than having the user figure out the hard link limit and specify it themselves, we should just handle EMLINK and ERROR_TOO_MANY_LINKS gracefully.

Moreover, I believe we don't lock the cache for exclusive access in these cases, so there's a chance that two instances of uv end up copying the file at the same time. But the way this is written, this won't cause any corruption, just a bit of unnecessary work.

Sounds good! I'll move the implementation in that direction, and I'll create a script to reproduce the issue to verify it, since it won't be as easy to reproduce the condition at that point.

@amol- amol- changed the title Add the --link-limit and UV_LINK_LIMIT options and handle limit Handle TooManyLinks by resetting the hard links when it's hit Jan 26, 2026
@amol- amol- changed the title Handle TooManyLinks by resetting the hard links when it's hit Handle TooManyLinks by resetting the hard links when hit Jan 26, 2026
amol- (Contributor, Author) commented Jan 27, 2026

The test is flagged as passing in CI:

PASS [   1.137s] (1749/3484) uv::it pip_sync::install_hardlink_after_emlink

But I don't know if we have a way to tell whether it was skipped, because the logging isn't shown.

PS: On my system I know it's not skipped and works as expected; I can see that via the logging.

@amol- amol- marked this pull request as ready for review January 27, 2026 10:24
@EliteTK EliteTK self-requested a review February 4, 2026 10:59
amol- (Contributor, Author) commented Feb 6, 2026

@EliteTK anything I can do to help with this PR? I have to manually clean up links on one of our systems once in a while when the deploy fails, so I'd love to see this one go in.

EliteTK (Contributor) commented Feb 6, 2026

@amol- Sorry, I just hadn't gotten around to re-reviewing it yet. But I should be able to do that next week.

mconflitti-pbc commented
@EliteTK thanks for reviewing this! Any updates on when you think you will have some bandwidth to finish taking a look?

EliteTK (Contributor) left a comment

Are there any other places where we make a hardlink where we should instead be calling this?

amol- (Contributor, Author) commented Feb 21, 2026

@EliteTK I still plan to move this forward; I'm just stuck with limited internet connectivity until next week. Then I'll address your suggestions.

@amol- amol- marked this pull request as draft February 26, 2026 16:36
amol- (Contributor, Author) commented Feb 26, 2026

Moving this back to a draft until I've finished addressing review comments.

@amol- amol- force-pushed the cache-reset-links branch from d78580e to 8312db4 Compare February 27, 2026 09:57
@amol- amol- force-pushed the cache-reset-links branch from fb47ea2 to 9aa6873 Compare February 27, 2026 14:34
- name: "Create minix filesystem (low hardlink limit)"
  run: |
    truncate -s 16M /tmp/minix.img
    mkfs.minix /tmp/minix.img
amol- (Contributor, Author) commented:

I went for minix because it has a limit of 250 hard links, which makes the test faster to run.
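To make the idea concrete, a low-link-limit file system like the one in the CI step above could be exercised locally along these lines. This is a hedged setup/configuration sketch (paths and the mount point are illustrative, and mounting requires root), not the PR's actual test harness:

```shell
# Create a small minix image; minix caps a file at roughly 250 hard links.
truncate -s 16M /tmp/minix.img
mkfs.minix /tmp/minix.img
sudo mkdir -p /mnt/minix
sudo mount -o loop /tmp/minix.img /mnt/minix
sudo touch /mnt/minix/target
# Linking past the limit eventually fails with EMLINK ("Too many links"):
for i in $(seq 1 300); do
    sudo ln /mnt/minix/target "/mnt/minix/link$i" || break
done
```

Because the limit is in the hundreds rather than the tens of thousands (as on ext4 or NTFS), the failure path is reached quickly and deterministically.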

@amol- amol- marked this pull request as ready for review February 27, 2026 14:47
amol- (Contributor, Author) commented Feb 27, 2026

@EliteTK I rebased against main, addressed your comments, and made the test run faster and more predictably via a dedicated file system. It should now be ready for re-review.

EliteTK (Contributor) commented Feb 27, 2026

This is looking pretty great, actually! Thanks.

I'll try to review this properly soonish (probably Monday at this rate), but one quick thing I noticed: we should still set UV_INTERNAL__TEST_MAXLINKS_FS for testing on Windows; it can just be the default NTFS. But this is definitely the right way to approach testing this!

@EliteTK EliteTK self-requested a review February 27, 2026 16:05
amol- (Contributor, Author) commented Feb 27, 2026

We should still set UV_INTERNAL__TEST_MAXLINKS_FS for testing on windows, it can just be the default NTFS. But this is definitely the right way to approach testing this!

Added the setup for Windows too; the CI seems to confirm it worked -> https://github.com/astral-sh/uv/actions/runs/22497686685/job/65176453709?pr=17699#step:11:1597 + https://github.com/astral-sh/uv/actions/runs/22497686685/job/65176453709?pr=17699#step:11:31

EliteTK (Contributor) left a comment

Looks good; there are a couple of small things. I can probably just nip them myself tomorrow, but feel free to address them yourself.

"Resetting cache entry due to too many hardlinks: {}",
src.display()
);
let parent = src.parent().ok_or_else(|| {
EliteTK (Contributor) commented:

Should probably also handle src.parent() returning an empty path (indicating it was passed a single-component relative path).

amol- (Contributor, Author) commented:

I'm a bit confused by this one; I'd expect the installed file to always be in a package, like /home/user/.cache/uv/.../package-1.0/file.py, so it's not clear when src.parent() could be missing.

But I guess it won't cause any harm to handle an empty parent.
c72de08 should handle both the case of invalid parents (like the parent of /) and the case of missing parents; both will fall back to the current directory.
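For illustration: `Path::parent()` returns `Some("")` (an empty path) for a single-component relative path, and `None` for the root, so both cases need the fallback being discussed. A small hypothetical helper (the name `parent_or_cwd` is invented for this sketch, not taken from the PR):

```rust
use std::path::Path;

/// Hypothetical helper: fall back to "." when `parent()` yields
/// `None` (e.g. for "/") or an empty path (e.g. for "file.py").
fn parent_or_cwd(p: &Path) -> &Path {
    match p.parent() {
        Some(parent) if !parent.as_os_str().is_empty() => parent,
        _ => Path::new("."),
    }
}
```

Matching on the empty path as well as `None` is what makes a single-component relative path resolve to the current directory instead of an unusable empty path.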

amol- and others added 2 commits March 4, 2026 14:51
Co-authored-by: Tomasz Kramkowski <tomasz@kramkow.ski>
amol- (Contributor, Author) commented Mar 4, 2026

@EliteTK it looks like the CI failures are transient errors; do you have a chance to rerun the CI? I don't have the option available.

amol- (Contributor, Author) commented Mar 5, 2026

My last commit yesterday restarted the tests, but they failed again with 401s on requests to set up build infrastructure; not sure if there is a way I can intervene.

Error: failed to solve: dhi.io/alpine-base:3.23: failed to resolve source metadata for dhi.io/alpine-base:3.23: failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized
Error: failed with: Error: failed to solve: dhi.io/alpine-base:3.23: failed to resolve source metadata for dhi.io/alpine-base:3.23: failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized

EliteTK (Contributor) commented Mar 5, 2026

Those are apparently because, as an external contributor, your CI runs don't get the credentials required for those...

So you can just ignore them; I guess once this hits main, if there is an issue then it can be reverted.

Other than that, the PR looks good. I was wrong about needing to copy the permissions; I thought the code wrote into the temp file rather than overwriting with a copy. We discussed internally what path we would want for that Windows test.

I didn't want to waste your time on these last couple of things, so I pushed a commit. Will merge this once the rest of CI passes.

@EliteTK EliteTK changed the title Handle TooManyLinks by resetting the hard links when hit Handle the hard link limit gracefully instead of failing Mar 5, 2026
@EliteTK EliteTK added the bug Something isn't working label Mar 5, 2026
@EliteTK EliteTK merged commit 6036575 into astral-sh:main Mar 5, 2026
232 of 244 checks passed
amol- (Contributor, Author) commented Mar 6, 2026

@EliteTK thanks for all the support in getting this one through. I really appreciated your help!

EliteTK (Contributor) commented Mar 6, 2026

@amol- thanks for the contribution :)

tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Mar 11, 2026
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [uv](https://github.com/astral-sh/uv) | patch | `0.10.7` → `0.10.9` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>astral-sh/uv (uv)</summary>

### [`v0.10.9`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0109)

[Compare Source](astral-sh/uv@0.10.8...0.10.9)

Released on 2026-03-06.

##### Enhancements

- Add `fbgemm-gpu`, `fbgemm-gpu-genai`, `torchrec`, and `torchtune` to the PyTorch list ([#18338](astral-sh/uv#18338))
- Add `torchcodec` to the PyTorch list ([#18336](astral-sh/uv#18336))
- Log the duration we took before erroring ([#18231](astral-sh/uv#18231))
- Warn when using `uv_build` settings without `uv_build` ([#15750](astral-sh/uv#15750))
- Add fallback to `/usr/lib/os-release` on Linux system lookup failure ([#18349](astral-sh/uv#18349))
- Use `cargo auditable` to include SBOM in uv builds ([#18276](astral-sh/uv#18276))

##### Configuration

- Add an environment variable for `UV_VENV_RELOCATABLE` ([#18331](astral-sh/uv#18331))

##### Performance

- Avoid toml `Document` overhead ([#18306](astral-sh/uv#18306))
- Use a single global workspace cache ([#18307](astral-sh/uv#18307))

##### Bug fixes

- Continue on trampoline job assignment failures ([#18291](astral-sh/uv#18291))
- Handle the hard link limit gracefully instead of failing ([#17699](astral-sh/uv#17699))
- Respect build constraints for workspace members ([#18350](astral-sh/uv#18350))
- Revalidate editables and other dependencies in scripts ([#18328](astral-sh/uv#18328))
- Support Python 3.13+ on Android ([#18301](astral-sh/uv#18301))
- Support `cp3-none-any` ([#17064](astral-sh/uv#17064))
- Skip tool environments with broken links to Python on Windows ([#17176](astral-sh/uv#17176))

##### Documentation

- Add documentation for common marker values ([#18327](astral-sh/uv#18327))
- Improve documentation on virtual dependencies ([#18346](astral-sh/uv#18346))

### [`v0.10.8`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0108)

[Compare Source](astral-sh/uv@0.10.7...0.10.8)

Released on 2026-03-03.

##### Python

- Add CPython 3.10.20
- Add CPython 3.11.15
- Add CPython 3.12.13

##### Enhancements

- Add Docker images based on Docker Hardened Images ([#18247](astral-sh/uv#18247))
- Add resolver hint when `--exclude-newer` filters out all versions of a package ([#18217](astral-sh/uv#18217))
- Configure a real retry minimum delay of 1s ([#18201](astral-sh/uv#18201))
- Expand `uv_build` direct build compatibility ([#17902](astral-sh/uv#17902))
- Fetch CPython from an Astral mirror by default ([#18207](astral-sh/uv#18207))
- Download uv releases from an Astral mirror in installers by default ([#18191](astral-sh/uv#18191))
- Add SBOM attestations to Docker images ([#18252](astral-sh/uv#18252))
- Improve hint for installing meson-python when missing as build backend ([#15826](astral-sh/uv#15826))

##### Configuration

- Add `UV_INIT_BARE` environment variable for `uv init` ([#18210](astral-sh/uv#18210))

##### Bug fixes

- Prevent `uv tool upgrade` from installing excluded dependencies ([#18022](astral-sh/uv#18022))
- Promote authentication policy when saving tool receipts ([#18246](astral-sh/uv#18246))
- Respect exclusions in scripts ([#18269](astral-sh/uv#18269))
- Retain default-branch Git SHAs in `pylock.toml` files ([#18227](astral-sh/uv#18227))
- Skip installed Python check for URL dependencies ([#18211](astral-sh/uv#18211))
- Respect constraints during `--upgrade` ([#18226](astral-sh/uv#18226))
- Fix `uv tree` orphaned roots and premature deduplication ([#17212](astral-sh/uv#17212))

##### Documentation

- Mention cooldown and tweak inline script metadata in dependency bots documentation ([#18230](astral-sh/uv#18230))
- Move cache prune in GitLab to `after_script` ([#18206](astral-sh/uv#18206))

</details>
