fix(terminal): rewrite `A && B &` to `A && { B & }` to prevent subshell leak by handsdiff · Pull Request #12207 · NousResearch/hermes-agent

handsdiff · 2026-04-18T16:05:59Z

Problem

bash parses A && B & with && tighter than &, so it forks a subshell for the compound and backgrounds the subshell. Inside the subshell, B runs foreground, so the subshell waits for B to finish. When B is a process that never exits on its own (python3 -m http.server, yes > /dev/null, any daemon), the subshell is stuck in wait4 forever and leaks as an orphan reparented to init.

Observed in production

Across a small fleet over ~3 days:

8 unique stuck-terminal events (Agent idle for 1803s … tool=terminal) on 4 of 5 VMs
7 currently-leaked bash + server pairs across the fleet, oldest from 2026-04-14
One agent appeared hung to the user for 30 minutes because the subshell's open stdout pipe kept the terminal tool's drain thread blocked (the subshell never closed FD 1)

Kernel stack of a live leak on each affected VM:

[<0>] do_wait+0x61/0xe0
[<0>] kernel_wait4+0xae/0x150
[<0>] __do_sys_wait4+0x9e/0xb0

Consistent fingerprint across leaks: PID ≠ PGID (bash is not session leader because the leaked one is a subshell, not the Popen child), PPID = 1 (outer bash exited, subshell reparented), same process group as the original setsid'd outer bash.

The common agent pattern that tripped it: cd X && python3 -m http.server 8000 &>/dev/null & sleep 1 && curl … — an agent's "start a local server then verify it" one-liner.

Relation to `933fbd8`

Distinct from the set +m fix in 933fbd8, which addressed an interactive shell's job-control waiting at exit. set +m doesn't help here because bash -c is non-interactive and job control is already off — the problem is the subshell's own internal wait for its foreground B, not the outer shell's job tracking.

Fix

Walk the command shell-aware (respecting quotes, parens, brace groups, &>/>& redirects), find A && B & / A || B & at depth 0, and rewrite the tail to A && { B & }:

Brace groups don't fork a subshell — they run in the current shell
B & inside the group is a simple background (no subshell wait)
The outer & is absorbed into the group, so the compound no longer needs an explicit subshell
&& error-propagation is preserved exactly: if A fails, && short-circuits and B never runs

Called once from BaseEnvironment.execute right after _prepare_command, so it runs for every backend (local, ssh, docker, modal, …) with no per-backend plumbing.

Examples

Before	After
`cd /tmp && python3 -m http.server 8000 &>/dev/null &`	`cd /tmp && { python3 -m http.server 8000 &>/dev/null & }`
`A && B && C &`	`A && B && { C & }`
`A \|\| B &`	`A \|\| { B & }`
`sleep 5 &`	unchanged (simple bg doesn't have the bug)
`A; B &`	unchanged
`cmd 2>&1 &`	unchanged (redirect, not compound bg)
`echo 'A && B &'`	unchanged (quoted)
`(A && B) &`	unchanged (paren subshell — same bug class, but left as a separate follow-up)

Edge cases handled

Quoted strings (single/double) preserved via shared _read_shell_token
Parenthesised subshells (...) skipped (separate bug class)
Brace groups { ... } tracked for depth, making the rewrite idempotent
&>/dev/null not mistaken for background &
2>&1, >&2, <&3 fd-redirects recognised by looking back for < / >
;, |, newlines reset chain state so the rewrite doesn't cross statements
Comment lines (# …) left alone
Backslash-escaped \&\& treated as literals

Tests

34 new tests in tests/tools/test_terminal_compound_background.py:

TestRewrites — positive cases (10): A && B &, A \|\| B &, chained, mixed, multiline, realistic vela-shape
TestPreserved — negative cases (7): simple bg, no compound, empty, whitespace-only
TestRedirectsNotConfused — 5 redirect scenarios
TestQuotingAndParens — 6 quoting / paren / comment / cmd-substitution / backslash cases
TestIdempotence — 2 idempotency checks
TestEdgeCases — 4 malformed / leading-whitespace / tab-separated

All 34 pass. Adjacent terminal-tool tests (114 more across test_terminal_tool.py, test_terminal_exit_semantics.py, test_terminal_requirements.py, etc.) still pass unchanged.

End-to-end validation

Provisioned a fresh test VM. Ran the exact vela-incident command through the real LocalEnvironment.execute path:

Before: bash stuck in wait4 for 1800 s, server alive as its child, gateway inactivity timeout eventually fired
After: execute() returns in 1.31 s with rc=0, curl sees 200, no leaked bash, only the intentional backgrounded server (orphaned to init, which is the agent's expressed intent)

Scope left for follow-up

(A && B) & and other paren-subshell backgrounding has the same bug class but isn't the common agent pattern; adding paren coverage would require a different rewrite shape (the subshell is explicit) and is safer as a separate PR.
The /stop-doesn't-reach-worker and SIGTERM-orphans-subprocess issues from fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace #11907 are the upstream side of the same family of incidents — this PR complements that one by removing the most common trigger pattern entirely.

🤖 Generated with Claude Code

…ll leak bash parses `A && B &` with `&&` tighter than `&`, so it forks a subshell for the compound and backgrounds the subshell. Inside the subshell, B runs foreground, so the subshell waits for B. When B is a process that doesn't naturally exit (`python3 -m http.server`, `yes > /dev/null`, a long-running daemon), the subshell is stuck in `wait4` forever and leaks as an orphan reparented to init. Observed in production: agents running `cd X && python3 -m http.server 8000 &>/dev/null & sleep 1 && curl ...` as a "start a local server, then verify it" one-liner. Outer bash exits cleanly; the subshell never does. Across ~3 days of use, 8 unique stuck-terminal events and 7 leaked bash+server pairs accumulated on the fleet, with some sessions appearing hung from the user's perspective because the subshell's open stdout pipe kept the terminal tool's drain thread blocked. This is distinct from the `set +m` fix in 933fbd8 (which addressed interactive-shell job-control waiting at exit). `set +m` doesn't help here because `bash -c` is non-interactive and job control is already off; the problem is the subshell's own internal wait for its foreground B, not the outer shell's job-tracking. The fix: walk the command shell-aware (respecting quotes, parens, brace groups, `&>`/`>&` redirects), find `A && B &` / `A || B &` at depth 0 and rewrite the tail to `A && { B & }`. Brace groups don't fork a subshell — they run in the current shell. `B &` inside the group is a simple background (no subshell wait). The outer `&` is absorbed into the group, so the compound no longer needs an explicit subshell. `&&` error-propagation is preserved exactly: if A fails, `&&` short-circuits and B never runs. - Skips quoted strings, comment lines, and `(…)` subshells - Handles `&>/dev/null`, `2>&1`, `>&2` without mistaking them for `&` - Resets chain state at `;`, `|`, and newlines - Tracks brace depth so already-rewritten output is idempotent - Walks using the existing `_read_shell_token` tokenizer, matching the pattern of `_rewrite_real_sudo_invocations` Called once from `BaseEnvironment.execute` right after `_prepare_command`, so it runs for every backend (local, ssh, docker, modal, etc.) with no per-backend plumbing. 34 new tests covering rewrite cases, preservation cases, redirect edge-cases, quoting/parens/backticks, idempotency, and empty/edge inputs. End-to-end verified on a test VM: the exact vela-incident command now returns in ~1.3s with no leaked bash, only the intentional backgrounded server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

teknium1 · 2026-04-19T23:53:20Z

Merged via #12724 — your commit was cherry-picked onto current main with authorship preserved in git log. Thanks @handsdiff!

This stacks well with today's other terminal fixes:

[Bug]: Terminal tool hangs indefinitely when using setsid + disown pattern to launch background services #8340 (landed earlier today) — switched _drain() to select-based non-blocking reads so the drain thread exits ~300ms after bash, no longer hanging on a grandchild-held pipe.
terminal: guard foreground server/watch commands (salvage #10810) #12723 (salvaged from terminal: guard foreground server/watch commands and shell background wrappers #10810 by @etherman-os) — foreground guardrail rejects nohup/disown/setsid/& and long-lived patterns (uvicorn, vite, npm run dev, python -m http.server, …) with guidance to use background=true.
fix(terminal): rewrite A && B & to A && { B & } (salvage #12207) #12724 (this one) — rewrites A && B & → A && { B & } at the shell layer so if the rewrite path is taken anyway, no bash subshell leaks.

Full PR: #12724

…g), drop NousResearch#7297, add NousResearch#12207 - Add rows for NousResearch#12234 (match-based model routing, supersedes NousResearch#7297 and draft NousResearch#12227) and NousResearch#12207 (compound-background-subshell-leak). - Move NousResearch#7297 into a new 'Closed / superseded' section; note the branches (feat/per-platform-model and feat/per-source-model) are already deleted from origin. - Update rebase workflow: swap the feat/per-platform-model line for feat/model-routing, add fix/compound-background-subshell-leak. - Update the fork-main rebuild section: document that octopus strategy fails on adjacent-region additions and switch the documented command to a sequential-merge loop. Note the recurring conflict site (_classify_source_kind vs _is_owner_source) and the union-resolve strategy for it. - Add PR-specific note for NousResearch#12234 covering the helper, classifier, legacy shim, and rebase-conflict guidance.

…12724; slate-dg-proxy-hook superseded - NousResearch#12207 (compound-background-subshell-leak) cherry-picked upstream as abfc184 via PR NousResearch#12724 with authorship preserved. Moved to "Merged" section; removed from rebase workflow and sequential-merge loop. - feat/slate-dg-proxy-hook (no PR) added under "Closed / superseded": abandoned prototype that built the Discord zero-secrets transport inside hermes-agent as an opt-in SLATE_DG_PROXY hook. Provisioner-side .pth install (hermes-provisioner/dg_patch.py) superseded it — zero hermes-agent patches required. Also mis-cut off fork main so unmergeable as upstream PR. - Dropped stale fix/compressor-tool-args-valid-json rebase line (branch already deleted; the "removed" note below was orphaned). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

handsdiff force-pushed the fix/compound-background-subshell-leak branch from e9bcacc to b05e7f6 Compare April 18, 2026 17:28

handsdiff force-pushed the fix/compound-background-subshell-leak branch from b05e7f6 to e80028d Compare April 19, 2026 22:11

This was referenced Apr 19, 2026

fix(terminal): reap backgrounded children in LocalEnvironment #12032

Closed

fix(terminal): rewrite A && B & to A && { B & } (salvage #12207) #12724

Merged

teknium1 closed this in #12724 Apr 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(terminal): rewrite `A && B &` to `A && { B & }` to prevent subshell leak#12207

fix(terminal): rewrite `A && B &` to `A && { B & }` to prevent subshell leak#12207
handsdiff wants to merge 1 commit into
NousResearch:mainfrom
handsdiff:fix/compound-background-subshell-leak

handsdiff commented Apr 18, 2026

Uh oh!

teknium1 commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

handsdiff commented Apr 18, 2026

Problem

Observed in production

Relation to 933fbd8

Fix

Examples

Edge cases handled

Tests

End-to-end validation

Scope left for follow-up

Uh oh!

teknium1 commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Relation to `933fbd8`