fix(terminal): recover from deleted cwd instead of crashing all sessions #19925

Closed
kshitijk4poor wants to merge 1 commit into main from fix/stale-cwd-recovery

@kshitijk4poor

Collaborator

Problem

When multiple gateway sessions share a single LocalEnvironment (the default, since all subagents map to the "default" task_id), a subagent that cd's into a temp directory poisons self.cwd for ALL sessions if that directory is later deleted.

subprocess.Popen(cwd='/tmp/deleted_dir') raises FileNotFoundError at the Python level — before bash even starts — making terminal and file tools completely unusable across all concurrent sessions until gateway restart.
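The failure mode can be reproduced outside the gateway. The following is a minimal sketch, assuming a POSIX system; the helper name popen_in_deleted_dir is hypothetical and not part of the PR. It only shows that the error is raised by Popen itself, before any child command runs.

```python
import shutil
import subprocess
import sys
import tempfile


def popen_in_deleted_dir() -> str:
    """Launch a child with a cwd that no longer exists; return the exception name.

    Hypothetical helper for illustration only (not code from the PR).
    """
    stale = tempfile.mkdtemp()
    shutil.rmtree(stale)  # simulate the sandbox being cleaned up
    try:
        proc = subprocess.Popen([sys.executable, "-c", "pass"], cwd=stale)
    except FileNotFoundError as exc:
        # On POSIX the failed chdir is reported back to the parent as ENOENT.
        return type(exc).__name__
    proc.wait()
    return "no error"


result = popen_in_deleted_dir()
print(result)
```

Because the exception originates in the Python process, retrying the identical call cannot succeed until the cwd argument changes.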

The retry logic (3 retries with exponential backoff) just repeats the same broken Popen call, which makes things worse by delaying the error response.

Reproduction scenario (from a user's debug report):

  1. Gateway running with Telegram, spawning delegate_task subagents
  2. Subagent cd's into /tmp/dr3_4_a1_46hd1m5g (a temp sandbox)
  3. Temp dir deleted by script cleanup
  4. ALL sessions get: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dr3_4_a1_46hd1m5g'
  5. Every terminal/file operation fails with 3 retries each — gateway is bricked

Fix

Two changes in tools/environments/local.py:

1. _run_bash() — validate cwd before Popen

Before passing cwd to subprocess.Popen, check os.path.isdir(self.cwd). If the directory no longer exists, reset self.cwd to the user's home (or /) as a safe fallback. The shell-level cd -- <path> in _wrap_command handles the logical directory switch, so Popen's cwd only needs to be a valid launch point for bash.
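The check described above might look roughly like the sketch below. It is not the PR's diff: safe_launch_cwd is a hypothetical standalone function standing in for the logic inside _run_bash(), and the fallback order (home, then the filesystem root) follows the description.

```python
import os
import shutil
import tempfile


def safe_launch_cwd(cwd: str) -> str:
    """Return cwd if it is still a directory, else a fallback launch point.

    Hypothetical helper; the PR performs this check inline on self.cwd.
    """
    if os.path.isdir(cwd):
        return cwd
    home = os.path.expanduser("~")
    # Fall back to the filesystem root if even the home directory is missing.
    return home if os.path.isdir(home) else os.sep


# Usage: the sandbox is valid before deletion and replaced afterwards.
sandbox = tempfile.mkdtemp()
before = safe_launch_cwd(sandbox)
shutil.rmtree(sandbox)
after = safe_launch_cwd(sandbox)
```

The returned path is only where the shell process starts; per the description, the logical directory change still happens inside the shell.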

2. _update_cwd() — don't restore stale paths from cwd file

After a command runs, _update_cwd() reads the cwd tracking file. If the recorded path no longer exists on disk, reset to fallback instead of re-setting self.cwd to the stale path. This prevents the exit-126 loop where every command fails because the stale path keeps getting re-read from the file.
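A sketch of the guard, under stated assumptions: the tracking file is taken to be a plain text file holding one absolute path, and next_cwd_from_file is a hypothetical name. The real _update_cwd() may read and parse the file differently.

```python
import os
import shutil
import tempfile


def next_cwd_from_file(cwd_file: str, current: str) -> str:
    """Pick the next cwd from a tracking file, ignoring stale paths.

    Hypothetical helper; assumes the file contains a single path.
    """
    try:
        with open(cwd_file, encoding="utf-8") as fh:
            recorded = fh.read().strip()
    except OSError:
        return current  # file missing or unreadable: keep the current cwd
    if not recorded:
        return current
    if os.path.isdir(recorded):
        return recorded
    # Recorded path was deleted: fall back instead of restoring it.
    home = os.path.expanduser("~")
    return home if os.path.isdir(home) else os.sep


# Usage: record a sandbox path, delete the sandbox, then read the file back.
sandbox = tempfile.mkdtemp()
with tempfile.NamedTemporaryFile("w", suffix=".cwd", delete=False) as tracker:
    tracker.write(sandbox + "\n")
live = next_cwd_from_file(tracker.name, current=os.sep)
shutil.rmtree(sandbox)
recovered = next_cwd_from_file(tracker.name, current=sandbox)
os.unlink(tracker.name)
```

Without the isdir check, the second read would hand the deleted path straight back, which is the loop the PR description refers to.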

Recovery behavior

After the fix:

  • First command after deletion: returns exit 126 with a clear No such file or directory message (model can understand and adapt)
  • All subsequent commands: succeed from home directory (self-healing)
  • No Python crash, no retry storms, no gateway bricking

Test plan

E2E verified:

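import shutil
import tempfile
from tools.environments.local import LocalEnvironment  # import path assumed from tools/environments/local.py

temp_dir = tempfile.mkdtemp()  # stand-in for the subagent's sandbox
env = LocalEnvironment(cwd=temp_dir, timeout=30)
env.execute('echo hello')  # works
shutil.rmtree(temp_dir)    # simulate cleanup
env.execute('echo test')   # rc=126, no crash, cwd resets to ~
env.execute('echo ok')     # rc=0, recovered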

Existing test suite: 50 passed (test_base_environment + test_local_env_blocklist + test_terminal_tool), 973 passed in full tools/ suite (2 pre-existing failures unrelated to this change).

When multiple gateway sessions share a single LocalEnvironment (the
default task_id mapping), a subagent that cd's into a temp directory
poisons self.cwd for ALL sessions if that directory is later deleted.
subprocess.Popen(cwd=<deleted_path>) raises FileNotFoundError at the
Python level — before bash even starts — making terminal and file tools
completely unusable until gateway restart.

Fix: validate self.cwd exists before passing to Popen. If stale, reset
to user home (or /) as fallback. The shell-level 'cd -- <path>' in
_wrap_command handles the logical directory switch, so Popen's cwd just
needs to be a valid launch point for bash.

Also harden _update_cwd to not re-set self.cwd from the cwd tracking
file when the recorded path no longer exists, preventing the exit-126
loop where the stale path gets re-read on every command.

Recovery behavior: first command after deletion returns exit 126 with a
clear 'No such file or directory' message (model can understand and
adapt), then all subsequent commands succeed from home directory.
@alt-glitch added labels on May 4, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/tools (Tool registry, model_tools, toolsets), tool/terminal (Terminal execution and process management), backend/local (Local shell execution)

@alt-glitch

Collaborator

Likely duplicate of #17569 — same root cause: LocalEnvironment._run_bash raises FileNotFoundError when self.cwd is deleted, bricking terminal. This PR is a superset (adds _update_cwd guard). Also overlaps with #17707.

@alt-glitch

Collaborator

Likely duplicate of #17569

@kshitijk4poor

Collaborator Author

Closing as a duplicate of #17569, which is a more thorough implementation of the same fix:

#17707 is also a simpler variant of the same fix (no tests, no ancestor walk, no _update_cwd guard).

Recommend salvaging #17569 — it's the superset. #17707 can be closed once #17569 lands.
