fix(uv): never gut the venv in place when rebuilding on Windows#38397
fix(uv): never gut the venv in place when rebuilding on Windows#38397lEWFkRAD wants to merge 3 commits into
Conversation
|
this prevent the brick , but does it solve the locked problem? |
`hermes update` can brick a Windows install. cmd_update guards against
concurrent hermes.exe processes and exits, but `hermes update --force` skips
that guard (and the guard's own message tells the user to pass --force).
Forced past it, rebuild_venv runs while the venv is still in use:
shutil.rmtree(venv_dir, ignore_errors=True) # deletes site-packages +
# certifi, fails on the
# locked python.exe ->
# half-gutted venv
uv venv ... (no --clear) # aborts "directory already
# exists" -> never recreated
The venv is left with no pyvenv.cfg; every later HTTPS call dies with
FileNotFoundError (missing cacert.pem) and there is no recovery.
Move the old venv aside atomically with os.replace instead of deleting it in
place, pass --clear, and restore the moved-aside venv if the rebuild fails.
Verified on Windows 11: os.replace of a venv whose interpreter is live
succeeds (Windows tracks a running .exe by handle), so the rebuild actually
completes while the running gateway keeps using the moved-aside copy until it
restarts -- the update succeeds instead of bricking. If the venv genuinely
cannot be moved, the rebuild aborts cleanly and leaves it fully intact.
Root-cause fix for NousResearch#37881. The --clear-only approaches in NousResearch#37895 / NousResearch#38051
don't cover the real lock scenario: when the locked interpreter is inside the
venv being rebuilt, neither rmtree nor `uv venv --clear` can delete it.
- hermes_cli/managed_uv.py: atomic move-aside + --clear + restore-on-failure
- tests/hermes_cli/test_managed_uv.py: in-use venv left intact with no rebuild
attempted; failed rebuild restored; success path asserts --clear
- website/docs/getting-started/updating.md: document --force venv-rebuild safety
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
010cf85 to
6eb5d45
Compare
Yes — it solves the locked problem, not just the brick. Verified on Windows 11: If the venv genuinely can''t be moved (e.g. an AV/indexer holds the directory), Quick repro I ran (temp dirs, interpreter kept alive in the background): This is also why |
Fold in the call-site guards from NousResearch#38511 (@f3rs3n) so they compose with the move-aside fix from the previous commit: - rebuild_venv() no longer reports success if `uv venv` exits 0 without producing an interpreter; it restores the moved-aside venv and returns False. - both `hermes update` venv-rebuild call sites now abort (RuntimeError) instead of silently continuing into dependency install when rebuild_venv() returns False. Tests: rebuild_venv returns False when the rebuilt interpreter is missing; cmd_update raises and skips pip install when the fresh managed-uv rebuild fails. Co-Authored-By: f3rs3n <32328813+f3rs3n@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Folded the two call-site guards from @f3rs3n''s #38511 into this PR (credited as co-author on the commit):
So this PR now covers the Windows-safe move-aside/restore and the call-site abort + interpreter-exists check, with tests for each. Thanks @f3rs3n — happy to restructure however maintainers prefer. |
|
I tried reproduce with the mocked hermes update venv rebuilding step, renaming works but found the need to add the renamed folder to .gitignore down the flow of the hermes update. |
…kips it rebuild_venv moves the live venv aside to <venv>.old (it is not deleted in place) before recreating it. `hermes update` runs `git stash push --include-untracked` before pulling, so an untracked venv.old gets swept into the autostash — a whole moved-aside virtualenv — on every update. The existing `/venv/` ignore entry is anchored to the repo root and does not cover `venv.old`. Add `/venv.old/` to .gitignore (mirroring `/venv/`) plus a regression guard test asserting the backup name rebuild_venv uses is ignored. Addresses review feedback on NousResearch#38397 (the moved-aside folder needs to be gitignored down the flow of the hermes update). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
You're right. rebuild_venv moves the live venv aside to venv.old, and the existing /venv/ rule is anchored so it doesn't cover it. Since hermes update runs git stash push --include-untracked before pulling, that untracked venv.old was getting swept into the autostash (a whole moved-aside venv) every run. Pushed ac9a29e: added /venv.old/ to .gitignore (mirrors /venv/) plus a regression-guard test. Verified with a real venv.old/ dir that git status no longer sees it, so the autostash skips it. |
|
Thanks for the thorough writeup and tests — your root-cause analysis on the locked This is already fixed on Credit noted — your diagnosis matched the failure exactly. Thanks for contributing! |
What does this PR do?
On Windows, a normal
hermes updatecan brick its own install by gutting the venv it is rebuilding.hermes_cli/managed_uv.py:rebuild_venv()deletes the old venv in place and then recreates it:When any
hermes.exe(gateway/desktop) is running — including the updater's owninterpreter — it holds
venv\Scripts\python.exeopen. Then (1) deleteseverything it can (
site-packages, includingcertifi/cacert.pem) butsilently fails on the locked
python.exe, leaving a half-gutted venv with nopyvenv.cfg. (2) then aborts with "A directory already exists", so the venv isnever recreated. The result is a venv that still loads but whose every outbound
HTTPS call dies with
FileNotFoundError(missingcacert.pem), and the CLIreports
No virtual environment found/ModuleNotFoundError: hermes_cli. Noautomatic recovery — the venv must be rebuilt by hand. (This is issue #37881; we
hit it in production on Windows 11.)
This fixes the root cause: never destroy a venv that is in use. Move the old
venv aside atomically with
os.replace, which fails fast and cleanly when thevenv is locked — leaving the working venv fully intact — instead of deleting it
piecemeal. Also pass
--clear, and restore the moved-aside venv if the rebuilditself fails.
Related Issue
Fixes #37881
Type of Change
Changes Made
hermes_cli/managed_uv.py—rebuild_venv(): move the existing venv aside withos.replace(abort without deletion if it's in use), pass--cleartouv venv, and restore the old venv if the rebuild fails. Updated docstring.tests/hermes_cli/test_managed_uv.py— regression tests (below).How to Test
venv\Scripts\python.exeis locked), run
hermes update. Onmainthis bricks the venv(
pyvenv.cfggone, later HTTPS calls raiseFileNotFoundError).rebuild_venvaborts cleanly with "venv is in use; stop thegateway/desktop and retry" and the existing venv is left fully intact.
pytest tests/hermes_cli/test_managed_uv.py -k RebuildVenv— see newtests proving an in-use venv is never deleted and a failed rebuild is restored.
Checklist
Code
fix(uv): ...)pytest tests/ -qand all tests pass — ranTestRebuildVenv(all pass on Windows/Py3.12); did not run the full suite. Note: 6 pre-existing tests intest_managed_uv.py(TestResolveUv/TestEnsureUv/TestUpdateManagedUv/TestInstallUvInternals) fail on Windows independent of this change (POSIXchmod/os.access(X_OK)assumptions)Documentation & Housekeeping
rebuild_venv)cli-config.yaml.example— N/ACONTRIBUTING.md/AGENTS.md— N/Aos.replaceis correct on POSIX too; behavior there is unchanged (move-aside then recreate)Logs
Production failure signature that this prevents: