Skip to content

fix(uv): never gut the venv in place when rebuilding on Windows#38397

Closed
lEWFkRAD wants to merge 3 commits into
NousResearch:mainfrom
lEWFkRAD:fix/rebuild-venv-windows-lock-bricks-install
Closed

fix(uv): never gut the venv in place when rebuilding on Windows#38397
lEWFkRAD wants to merge 3 commits into
NousResearch:mainfrom
lEWFkRAD:fix/rebuild-venv-windows-lock-bricks-install

Conversation

@lEWFkRAD

@lEWFkRAD lEWFkRAD commented Jun 3, 2026

Copy link
Copy Markdown

What does this PR do?

On Windows, a normal hermes update can brick its own install by gutting the venv it is rebuilding.

hermes_cli/managed_uv.py:rebuild_venv() deletes the old venv in place and then recreates it:

if venv_dir.exists():
    shutil.rmtree(venv_dir, ignore_errors=True)      # (1)
result = subprocess.run([uv_bin, "venv", str(venv_dir), "--python", python_version])  # (2)

When any hermes.exe (gateway/desktop) is running — including the updater's own
interpreter — it holds venv\Scripts\python.exe open. Then (1) deletes
everything it can (site-packages, including certifi/cacert.pem) but
silently fails on the locked python.exe, leaving a half-gutted venv with no
pyvenv.cfg. (2) then aborts with "A directory already exists", so the venv is
never recreated. The result is a venv that still loads but whose every outbound
HTTPS call dies with FileNotFoundError (missing cacert.pem), and the CLI
reports No virtual environment found / ModuleNotFoundError: hermes_cli. No
automatic recovery — the venv must be rebuilt by hand. (This is issue #37881; we
hit it in production on Windows 11.)

This fixes the root cause: never destroy a venv that is in use. Move the old
venv aside atomically with os.replace, which fails fast and cleanly when the
venv is locked — leaving the working venv fully intact — instead of deleting it
piecemeal. Also pass --clear, and restore the moved-aside venv if the rebuild
itself fails.

Relationship to #37895 and #38051 (alternative approach). Both of those add
--clear to uv venv while keeping shutil.rmtree(ignore_errors=True). In the
real lock scenario --clear is insufficient: the running interpreter is the
venv's python.exe, so uv venv --clear cannot delete it either, and the
rmtree has already gutted site-packages. Moving the directory aside first
(os.replace) is what actually prevents the brick — it bails out before any
deletion when the venv is in use, and tells the user to stop the gateway. This
PR includes that fix; happy to fold it into whichever PR the maintainers prefer.

Related Issue

Fixes #37881

Type of Change

  • ðŸ�› Bug fix (non-breaking change that fixes an issue)

Changes Made

  • hermes_cli/managed_uv.py — rebuild_venv(): move the existing venv aside with
    os.replace (abort without deletion if it's in use), pass --clear to
    uv venv, and restore the old venv if the rebuild fails. Updated docstring.
  • tests/hermes_cli/test_managed_uv.py — regression tests (below).

How to Test

  1. On Windows, with a Hermes gateway/desktop running (so venv\Scripts\python.exe
    is locked), run hermes update. On main this bricks the venv
    (pyvenv.cfg gone, later HTTPS calls raise FileNotFoundError).
  2. With this patch, rebuild_venv aborts cleanly with "venv is in use; stop the
    gateway/desktop and retry"
    and the existing venv is left fully intact.
  3. Unit: pytest tests/hermes_cli/test_managed_uv.py -k RebuildVenv — see new
    tests proving an in-use venv is never deleted and a failed rebuild is restored.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(uv): ...)
  • I searched for existing PRs — found fix(cli): pass --clear to uv venv in rebuild_venv to avoid Windows brick #37895 and fix(config): add --clear flag to uv venv rebuild to prevent Windows bricking #37881 #38051; this is intentionally a more complete alternative (see note above), not a blind duplicate
  • My PR contains only changes related to this fix
  • I've run pytest tests/ -q and all tests pass — ran TestRebuildVenv (all pass on Windows/Py3.12); did not run the full suite. Note: 6 pre-existing tests in test_managed_uv.py (TestResolveUv/TestEnsureUv/TestUpdateManagedUv/TestInstallUvInternals) fail on Windows independent of this change (POSIX chmod/os.access(X_OK) assumptions)
  • I've added tests for my changes
  • I've tested on my platform: Windows 11 (Python 3.12)

Documentation & Housekeeping

  • I've updated relevant documentation (docstring on rebuild_venv)
  • cli-config.yaml.example — N/A
  • CONTRIBUTING.md / AGENTS.md — N/A
  • I've considered cross-platform impact — os.replace is correct on POSIX too; behavior there is unchanged (move-aside then recreate)
  • tool descriptions/schemas — N/A

Logs

Production failure signature that this prevents:

WARNING hermes_cli.managed_uv: venv rebuild failed: A directory already exists at: ...\venv
ERROR  agent.conversation_loop: API call failed after 3 retries. [Errno 2] No such file or directory
       | provider=custom base_url=http://127.0.0.1:8001/v1 model=qwen3.6-27b-gptq

@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard area/config Config system, migrations, profiles python:uv Pull requests that update python:uv code P1 High — major feature broken, no workaround labels Jun 3, 2026
@rdnot

rdnot commented Jun 3, 2026

Copy link
Copy Markdown

this prevent the brick , but does it solve the locked problem?

`hermes update` can brick a Windows install. cmd_update guards against
concurrent hermes.exe processes and exits, but `hermes update --force` skips
that guard (and the guard's own message tells the user to pass --force).
Forced past it, rebuild_venv runs while the venv is still in use:

    shutil.rmtree(venv_dir, ignore_errors=True)   # deletes site-packages +
                                                  # certifi, fails on the
                                                  # locked python.exe ->
                                                  # half-gutted venv
    uv venv ...            (no --clear)           # aborts "directory already
                                                  # exists" -> never recreated

The venv is left with no pyvenv.cfg; every later HTTPS call dies with
FileNotFoundError (missing cacert.pem) and there is no recovery.

Move the old venv aside atomically with os.replace instead of deleting it in
place, pass --clear, and restore the moved-aside venv if the rebuild fails.
Verified on Windows 11: os.replace of a venv whose interpreter is live
succeeds (Windows tracks a running .exe by handle), so the rebuild actually
completes while the running gateway keeps using the moved-aside copy until it
restarts -- the update succeeds instead of bricking. If the venv genuinely
cannot be moved, the rebuild aborts cleanly and leaves it fully intact.

Root-cause fix for NousResearch#37881. The --clear-only approaches in NousResearch#37895 / NousResearch#38051
don't cover the real lock scenario: when the locked interpreter is inside the
venv being rebuilt, neither rmtree nor `uv venv --clear` can delete it.

- hermes_cli/managed_uv.py: atomic move-aside + --clear + restore-on-failure
- tests/hermes_cli/test_managed_uv.py: in-use venv left intact with no rebuild
  attempted; failed rebuild restored; success path asserts --clear
- website/docs/getting-started/updating.md: document --force venv-rebuild safety

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lEWFkRAD lEWFkRAD force-pushed the fix/rebuild-venv-windows-lock-bricks-install branch from 010cf85 to 6eb5d45 Compare June 3, 2026 20:57
@lEWFkRAD

lEWFkRAD commented Jun 3, 2026

Copy link
Copy Markdown
Author

this prevent the brick , but does it solve the locked problem?

Yes — it solves the locked problem, not just the brick.

Verified on Windows 11: os.replace moving the venv aside succeeds even while the venv''s own python.exe is the running interpreter, because Windows tracks a running executable by handle (not path), so renaming its parent directory is allowed. uv venv --clear then builds a fresh venv in its place, and the still-running gateway/desktop keep using the moved-aside copy until they restart — so hermes update actually completes instead of bricking.

If the venv genuinely can''t be moved (e.g. an AV/indexer holds the directory), os.replace raises and we abort cleanly with the existing venv fully intact — never the half-deleted state.

Quick repro I ran (temp dirs, interpreter kept alive in the background):

NEW (os.replace):  move-aside SUCCEEDED while interpreter live; uv venv --clear rc=0,
                   fresh pyvenv.cfg created, old process still alive  => update SUCCEEDS
OLD (main):        rmtree(ignore_errors=True) deleted site-packages (cacert gone) but
                   left the locked python.exe; uv venv (no --clear) rc=2  => GUTTED == BRICK

This is also why --clear alone isn''t enough for the locked case: when the locked interpreter is inside the venv being rebuilt, neither rmtree nor uv venv --clear can delete it — moving it aside first is what lets the rebuild proceed.

Fold in the call-site guards from NousResearch#38511 (@f3rs3n) so they compose with the
move-aside fix from the previous commit:

- rebuild_venv() no longer reports success if `uv venv` exits 0 without
  producing an interpreter; it restores the moved-aside venv and returns False.
- both `hermes update` venv-rebuild call sites now abort (RuntimeError) instead
  of silently continuing into dependency install when rebuild_venv() returns
  False.

Tests: rebuild_venv returns False when the rebuilt interpreter is missing;
cmd_update raises and skips pip install when the fresh managed-uv rebuild fails.

Co-Authored-By: f3rs3n <32328813+f3rs3n@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lEWFkRAD

lEWFkRAD commented Jun 3, 2026

Copy link
Copy Markdown
Author

Folded the two call-site guards from @f3rs3n''s #38511 into this PR (credited as co-author on the commit):

  • rebuild_venv() now returns False (and restores the moved-aside venv) if uv venv exits 0 without actually producing an interpreter.
  • Both hermes update rebuild call sites now abort instead of continuing into dependency install when rebuild_venv() returns False.

So this PR now covers the Windows-safe move-aside/restore and the call-site abort + interpreter-exists check, with tests for each. Thanks @f3rs3n — happy to restructure however maintainers prefer.

@rdnot

rdnot commented Jun 4, 2026

Copy link
Copy Markdown

I tried reproduce with the mocked hermes update venv rebuilding step, renaming works but found the need to add the renamed folder to .gitignore down the flow of the hermes update.

…kips it

rebuild_venv moves the live venv aside to <venv>.old (it is not deleted in
place) before recreating it. `hermes update` runs
`git stash push --include-untracked` before pulling, so an untracked venv.old
gets swept into the autostash — a whole moved-aside virtualenv — on every
update. The existing `/venv/` ignore entry is anchored to the repo root and
does not cover `venv.old`.

Add `/venv.old/` to .gitignore (mirroring `/venv/`) plus a regression guard
test asserting the backup name rebuild_venv uses is ignored.

Addresses review feedback on NousResearch#38397 (the moved-aside folder needs to be
gitignored down the flow of the hermes update).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lEWFkRAD

lEWFkRAD commented Jun 4, 2026

Copy link
Copy Markdown
Author

I tried reproduce with the mocked hermes update venv rebuilding step, renaming works but found the need to add the renamed folder to .gitignore down the flow of the hermes update.

You're right. rebuild_venv moves the live venv aside to venv.old, and the existing /venv/ rule is anchored so it doesn't cover it. Since hermes update runs git stash push --include-untracked before pulling, that untracked venv.old was getting swept into the autostash (a whole moved-aside venv) every run.

Pushed ac9a29e: added /venv.old/ to .gitignore (mirrors /venv/) plus a regression-guard test. Verified with a real venv.old/ dir that git status no longer sees it, so the autostash skips it.

@teknium1

teknium1 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the thorough writeup and tests — your root-cause analysis on the locked python.exe / gutted site-packages chain was spot on.

This is already fixed on main by #38887 ("fix(update): harden venv rebuild + verify core deps after install"), which lands the --clear retry in rebuild_venv() (gated on uv's "already exists" stderr signal so it doesn't mask real failures) plus a post-install _verify_core_dependencies check to catch partial installs. Closing as redundant.

Credit noted — your diagnosis matched the failure exactly. Thanks for contributing!

@teknium1 teknium1 closed this Jun 4, 2026
@lEWFkRAD lEWFkRAD deleted the fix/rebuild-venv-windows-lock-bricks-install branch June 4, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/config Config system, migrations, profiles comp/cli CLI entry point, hermes_cli/, setup wizard P1 High — major feature broken, no workaround python:uv Pull requests that update python:uv code type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: hermes update bricks the install on Windows — venv rebuild leaves a venv with no pyvenv.cfg, then ModuleNotFoundError: hermes_cli

5 participants