Skip to content

fix: Windows stale-cwd causes exit 126 on all subsequent commands#13397

Open
Stellarrysss wants to merge 1 commit into
NousResearch:mainfrom
Stellarrysss:fix/windows-stale-cwd-exit-126
Open

fix: Windows stale-cwd causes exit 126 on all subsequent commands#13397
Stellarrysss wants to merge 1 commit into
NousResearch:mainfrom
Stellarrysss:fix/windows-stale-cwd-exit-126

Conversation

@Stellarrysss

Copy link
Copy Markdown

Summary

Fix the "every terminal command returns exit 126" wedge that users on Windows hit after a prior command cd's into a directory that later gets deleted (temp build dir, checkpoint, intermediate tempdir). The root cause is the shell wrapper's cd '<self.cwd>' || exit 126 guard combined with pwd -P > <cwd_file> running after the user command — so a failing cd short-circuits the write, and self.cwd stays permanently wedged on the bad path until Hermes restarts.

Root cause

BaseEnvironment._wrap_command builds a per-execute bash script of the shape:

source <snapshot> 2>/dev/null || true
cd '<self.cwd>' || exit 126        # ← sentinel fires here when cwd is gone
eval '<user command>'
__hermes_ec=$?
...
pwd -P > <cwd_file> 2>/dev/null || true   # ← never reached on stale-cwd branch
exit $__hermes_ec

When self.cwd points at a deleted directory, cd fails, the script exits 126 before pwd -P > <cwd_file>, so self._update_cwd() reads the old (stale) path back next turn — and every subsequent execute() repeats the 126 exit until the whole Python process restarts.

On Windows this scenario is especially common because:

  • os.getcwd() (used at construction time) returns C:\Users\... while pwd -P returns /c/Users/... — two representations that only the first command's bash successfully normalises; if that fails for any reason, drift can leave self.cwd in a form Git Bash can't resolve.
  • Many authoring workflows cd into ephemeral build/checkpoint temp dirs that are later cleaned.

Fix

Two complementary layers, both in tools/environments/base.py:

Layer 1 — Python-side self-heal in BaseEnvironment.execute().
A new module-level _is_cwd_stale(cwd) helper calls os.path.isdir (or, on Windows, translates Git Bash's /c/Users/... to C:/Users/... via _posix_drive_to_win before calling isdir). When the check reports stale, effective_cwd is reset to self._startup_cwd (captured at __init__) and a logger.warning is emitted. Conservative on Windows: POSIX-only paths like /tmp/... (whose MSYS mapping isn't reconstructed in Python) are not declared stale — the shell-side fallback handles them.

Layer 2 — Shell-side fallback in _wrap_command.
The single cd '<cwd>' || exit 126 is replaced with

cd '<cwd>' 2>/dev/null || { cd '<startup_cwd>' 2>/dev/null && echo '[hermes] ...' >&2; } || exit 126

If the Python check missed (race with external cleanup, or a path Python couldn't resolve), bash falls back to startup cwd and emits a stderr breadcrumb so the caller can see what happened. The final exit 126 still fires only when both cwds fail — preserving the original signal when the environment is genuinely broken.

self._startup_cwd is captured once in BaseEnvironment.__init__ alongside self.cwd.

Non-breaking on Linux/Mac

  • _is_cwd_stale on non-Windows falls straight through to os.path.isdir — same semantics as a plain fs check.
  • _posix_drive_to_win is a no-op for paths that don't match /<letter>/....
  • The shell wrapper's new form is a superset of the prior cd || exit 126: same sentinel on double failure, no new failure modes.
  • _startup_cwd is purely internal; no public API change.

Test results

All passing on Windows 11 / Python 3.12 / Git for Windows bash:

Test Result
_posix_drive_to_win: /c/Users/...C:/Users/..., /tmp/x unchanged (4 cases) 4/4 PASS
_is_cwd_stale: empty=stale, live=ok, nonexistent Win=stale, /c/<bogus>=stale, /tmp/<bogus>=not-declared 5/5 PASS
Integration: fresh env, echo hello → rc=0 PASS
Integration: cd <tempdir>, delete tempdir externally, echo recovered → rc=0, self.cwd reset to startup PASS
Integration: force self.cwd = /tmp/<not-here> (Python can't verify on Windows), echo shell_fallback_worked → rc=0 (shell fallback fires) PASS
Integration: second command after self-heal remains stable PASS

Total: 10/10 unit + 4/4 integration = 14 assertions, all pass. On a wedged environment without the patch, every command returns 126 including the trivial echo hello.

Related / side finding

While writing the regression test I noticed _wait_for_process uses select.select([pipe_fd], ...) which raises OSError on Windows (select there only supports sockets) — this is orthogonal to the exit-126 bug but does mean local terminal commands return output="" even when rc=0 on Windows. That's a separate issue I can file a follow-up for if useful; it didn't affect the 126 diagnosis because users see the return code, not the captured output.

The terminal tool's command wrapper prefixes every user command with
`cd '<self.cwd>' || exit 126`. When `self.cwd` points at a directory
that has been deleted (temp build artefact, checkpoint cleanup, or an
intermediate path from an earlier `cd ... && rm -rf <same>`), the cd
fails and the script exits 126 before running anything. Because the
`pwd -P > <cwd_file>` line runs *after* the eval, the cwd file is
never refreshed, so `self.cwd` stays wedged on the bad path and every
subsequent `execute()` produces the same 126 — until the process
restarts.

This manifests as the well-known "every terminal command returns
exit 126" bug on Windows, where Git Bash path translation and temp
directory cleanup together make stale-cwd scenarios especially common.

Fix is two complementary layers:

1. Python-side self-heal in `BaseEnvironment.execute()`. Before each
   command, `_is_cwd_stale(effective_cwd)` checks whether the path
   can be stat'd. If stale, `self.cwd` is reset to the newly-captured
   `self._startup_cwd` and a warning is logged. On Windows we handle
   both `C:\Users\...` and Git Bash `/c/Users/...` forms via a small
   `_posix_drive_to_win` helper; POSIX-only paths like `/tmp/...` on
   Windows (whose MSYS mapping we don't resolve in Python) fall
   through to the second layer.

2. Shell-side fallback in `_wrap_command`. The single
   `cd '<cwd>' || exit 126` becomes
   `cd '<cwd>' 2>/dev/null || { cd '<startup_cwd>' 2>/dev/null && echo '...' >&2; } || exit 126`.
   If the Python check missed the stale path (e.g. race with an
   external cleanup, or a POSIX-only path), bash falls back to the
   startup cwd and emits a stderr breadcrumb so the caller sees what
   happened. Only exit 126 if *both* cwds fail.

Non-breaking on Linux/Mac: `_is_cwd_stale` uses `os.path.isdir`
there; shell fallback is a superset of the prior behaviour.

Tests (Windows, Python 3.12):
- `_posix_drive_to_win`: /c/Users conv, /d/foo conv, /tmp/x unchanged (pass)
- `_is_cwd_stale`: empty=stale, live=ok, nonexistent Win=stale,
  `/c/<bogus>`=stale, `/tmp/<bogus>`=not-declared (pass)
- integration: fresh env `echo hello` → rc=0 (pass)
- integration: cd into tempdir, delete tempdir externally, run
  `echo recovered` → rc=0, self.cwd reset to startup (pass)
- integration: force `self.cwd = /tmp/<not-here>` (Python can't
  verify) → shell fallback kicks in, rc=0 (pass)
- integration: consecutive commands after self-heal remain stable (pass)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists tool/terminal Terminal execution and process management type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants