Skip to content

fix(terminal): rewrite A && B & to A && { B & } to prevent subshell leak#12207

Closed
handsdiff wants to merge 1 commit into
NousResearch:mainfrom
handsdiff:fix/compound-background-subshell-leak
Closed

fix(terminal): rewrite A && B & to A && { B & } to prevent subshell leak#12207
handsdiff wants to merge 1 commit into
NousResearch:mainfrom
handsdiff:fix/compound-background-subshell-leak

Conversation

@handsdiff

Copy link
Copy Markdown
Contributor

Problem

bash parses A && B & with && tighter than &, so it forks a subshell for the compound and backgrounds the subshell. Inside the subshell, B runs foreground, so the subshell waits for B to finish. When B is a process that never exits on its own (python3 -m http.server, yes > /dev/null, any daemon), the subshell is stuck in wait4 forever and leaks as an orphan reparented to init.

Observed in production

Across a small fleet over ~3 days:

  • 8 unique stuck-terminal events (Agent idle for 1803s … tool=terminal) on 4 of 5 VMs
  • 7 currently-leaked bash + server pairs across the fleet, oldest from 2026-04-14
  • One agent appeared hung to the user for 30 minutes because the subshell's open stdout pipe kept the terminal tool's drain thread blocked (the subshell never closed FD 1)

Kernel stack of a live leak on each affected VM:

[<0>] do_wait+0x61/0xe0
[<0>] kernel_wait4+0xae/0x150
[<0>] __do_sys_wait4+0x9e/0xb0

Consistent fingerprint across leaks: PID ≠ PGID (bash is not session leader because the leaked one is a subshell, not the Popen child), PPID = 1 (outer bash exited, subshell reparented), same process group as the original setsid'd outer bash.

The common agent pattern that tripped it: cd X && python3 -m http.server 8000 &>/dev/null & sleep 1 && curl … — an agent's "start a local server then verify it" one-liner.

Relation to 933fbd8

Distinct from the set +m fix in 933fbd8, which addressed an interactive shell's job-control waiting at exit. set +m doesn't help here because bash -c is non-interactive and job control is already off — the problem is the subshell's own internal wait for its foreground B, not the outer shell's job tracking.

Fix

Walk the command shell-aware (respecting quotes, parens, brace groups, &>/>& redirects), find A && B & / A || B & at depth 0, and rewrite the tail to A && { B & }:

  • Brace groups don't fork a subshell — they run in the current shell
  • B & inside the group is a simple background (no subshell wait)
  • The outer & is absorbed into the group, so the compound no longer needs an explicit subshell
  • && error-propagation is preserved exactly: if A fails, && short-circuits and B never runs

Called once from BaseEnvironment.execute right after _prepare_command, so it runs for every backend (local, ssh, docker, modal, …) with no per-backend plumbing.

Examples

Before After
cd /tmp && python3 -m http.server 8000 &>/dev/null & cd /tmp && { python3 -m http.server 8000 &>/dev/null & }
A && B && C & A && B && { C & }
A || B & A || { B & }
sleep 5 & unchanged (simple bg doesn't have the bug)
A; B & unchanged
cmd 2>&1 & unchanged (redirect, not compound bg)
echo 'A && B &' unchanged (quoted)
(A && B) & unchanged (paren subshell — same bug class, but left as a separate follow-up)

Edge cases handled

  • Quoted strings (single/double) preserved via shared _read_shell_token
  • Parenthesised subshells (...) skipped (separate bug class)
  • Brace groups { ... } tracked for depth, making the rewrite idempotent
  • &>/dev/null not mistaken for background &
  • 2>&1, >&2, <&3 fd-redirects recognised by looking back for < / >
  • ;, |, newlines reset chain state so the rewrite doesn't cross statements
  • Comment lines (# …) left alone
  • Backslash-escaped \&\& treated as literals

Tests

34 new tests in tests/tools/test_terminal_compound_background.py:

  • TestRewrites — positive cases (10): A && B &, A \|\| B &, chained, mixed, multiline, realistic vela-shape
  • TestPreserved — negative cases (7): simple bg, no compound, empty, whitespace-only
  • TestRedirectsNotConfused — 5 redirect scenarios
  • TestQuotingAndParens — 6 quoting / paren / comment / cmd-substitution / backslash cases
  • TestIdempotence — 2 idempotency checks
  • TestEdgeCases — 4 malformed / leading-whitespace / tab-separated

All 34 pass. Adjacent terminal-tool tests (114 more across test_terminal_tool.py, test_terminal_exit_semantics.py, test_terminal_requirements.py, etc.) still pass unchanged.

End-to-end validation

Provisioned a fresh test VM. Ran the exact vela-incident command through the real LocalEnvironment.execute path:

  • Before: bash stuck in wait4 for 1800 s, server alive as its child, gateway inactivity timeout eventually fired
  • After: execute() returns in 1.31 s with rc=0, curl sees 200, no leaked bash, only the intentional backgrounded server (orphaned to init, which is the agent's expressed intent)

Scope left for follow-up

  • (A && B) & and other paren-subshell backgrounding has the same bug class but isn't the common agent pattern; adding paren coverage would require a different rewrite shape (the subshell is explicit) and is safer as a separate PR.
  • The /stop-doesn't-reach-worker and SIGTERM-orphans-subprocess issues from fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace #11907 are the upstream side of the same family of incidents — this PR complements that one by removing the most common trigger pattern entirely.

🤖 Generated with Claude Code

@handsdiff handsdiff force-pushed the fix/compound-background-subshell-leak branch from e9bcacc to b05e7f6 Compare April 18, 2026 17:28
…ll leak

bash parses `A && B &` with `&&` tighter than `&`, so it forks a subshell
for the compound and backgrounds the subshell. Inside the subshell, B
runs foreground, so the subshell waits for B. When B is a process that
doesn't naturally exit (`python3 -m http.server`, `yes > /dev/null`, a
long-running daemon), the subshell is stuck in `wait4` forever and leaks
as an orphan reparented to init.

Observed in production: agents running `cd X && python3 -m http.server
8000 &>/dev/null & sleep 1 && curl ...` as a "start a local server, then
verify it" one-liner. Outer bash exits cleanly; the subshell never does.
Across ~3 days of use, 8 unique stuck-terminal events and 7 leaked
bash+server pairs accumulated on the fleet, with some sessions appearing
hung from the user's perspective because the subshell's open stdout pipe
kept the terminal tool's drain thread blocked.

This is distinct from the `set +m` fix in 933fbd8 (which addressed
interactive-shell job-control waiting at exit). `set +m` doesn't help
here because `bash -c` is non-interactive and job control is already
off; the problem is the subshell's own internal wait for its foreground
B, not the outer shell's job-tracking.

The fix: walk the command shell-aware (respecting quotes, parens, brace
groups, `&>`/`>&` redirects), find `A && B &` / `A || B &` at depth 0
and rewrite the tail to `A && { B & }`. Brace groups don't fork a
subshell — they run in the current shell. `B &` inside the group is a
simple background (no subshell wait). The outer `&` is absorbed into
the group, so the compound no longer needs an explicit subshell.

`&&` error-propagation is preserved exactly: if A fails, `&&`
short-circuits and B never runs.

- Skips quoted strings, comment lines, and `(…)` subshells
- Handles `&>/dev/null`, `2>&1`, `>&2` without mistaking them for `&`
- Resets chain state at `;`, `|`, and newlines
- Tracks brace depth so already-rewritten output is idempotent
- Walks using the existing `_read_shell_token` tokenizer, matching the
  pattern of `_rewrite_real_sudo_invocations`

Called once from `BaseEnvironment.execute` right after
`_prepare_command`, so it runs for every backend (local, ssh, docker,
modal, etc.) with no per-backend plumbing.

34 new tests covering rewrite cases, preservation cases, redirect
edge-cases, quoting/parens/backticks, idempotency, and empty/edge
inputs. End-to-end verified on a test VM: the exact vela-incident
command now returns in ~1.3s with no leaked bash, only the intentional
backgrounded server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@teknium1

Copy link
Copy Markdown
Contributor

Merged via #12724 — your commit was cherry-picked onto current main with authorship preserved in git log. Thanks @handsdiff!

This stacks well with today's other terminal fixes:

Full PR: #12724

handsdiff added a commit to handsdiff/hermes-agent that referenced this pull request Apr 24, 2026
…g), drop NousResearch#7297, add NousResearch#12207

- Add rows for NousResearch#12234 (match-based model routing, supersedes NousResearch#7297 and draft NousResearch#12227)
  and NousResearch#12207 (compound-background-subshell-leak).
- Move NousResearch#7297 into a new 'Closed / superseded' section; note the branches
  (feat/per-platform-model and feat/per-source-model) are already deleted from origin.
- Update rebase workflow: swap the feat/per-platform-model line for feat/model-routing,
  add fix/compound-background-subshell-leak.
- Update the fork-main rebuild section: document that octopus strategy fails on
  adjacent-region additions and switch the documented command to a sequential-merge
  loop. Note the recurring conflict site (_classify_source_kind vs _is_owner_source)
  and the union-resolve strategy for it.
- Add PR-specific note for NousResearch#12234 covering the helper, classifier, legacy shim, and
  rebase-conflict guidance.
handsdiff added a commit to handsdiff/hermes-agent that referenced this pull request Apr 24, 2026
…12724; slate-dg-proxy-hook superseded

- NousResearch#12207 (compound-background-subshell-leak) cherry-picked upstream as abfc184
  via PR NousResearch#12724 with authorship preserved. Moved to "Merged" section;
  removed from rebase workflow and sequential-merge loop.
- feat/slate-dg-proxy-hook (no PR) added under "Closed / superseded":
  abandoned prototype that built the Discord zero-secrets transport inside
  hermes-agent as an opt-in SLATE_DG_PROXY hook. Provisioner-side .pth
  install (hermes-provisioner/dg_patch.py) superseded it — zero hermes-agent
  patches required. Also mis-cut off fork main so unmergeable as upstream PR.
- Dropped stale fix/compressor-tool-args-valid-json rebase line (branch
  already deleted; the "removed" note below was orphaned).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants