fix(terminal): rewrite A && B & to A && { B & } to prevent subshell leak#12207
Closed
handsdiff wants to merge 1 commit into
Closed
fix(terminal): rewrite A && B & to A && { B & } to prevent subshell leak#12207handsdiff wants to merge 1 commit into
A && B & to A && { B & } to prevent subshell leak#12207handsdiff wants to merge 1 commit into
Conversation
e9bcacc to
b05e7f6
Compare
…ll leak bash parses `A && B &` with `&&` tighter than `&`, so it forks a subshell for the compound and backgrounds the subshell. Inside the subshell, B runs foreground, so the subshell waits for B. When B is a process that doesn't naturally exit (`python3 -m http.server`, `yes > /dev/null`, a long-running daemon), the subshell is stuck in `wait4` forever and leaks as an orphan reparented to init. Observed in production: agents running `cd X && python3 -m http.server 8000 &>/dev/null & sleep 1 && curl ...` as a "start a local server, then verify it" one-liner. Outer bash exits cleanly; the subshell never does. Across ~3 days of use, 8 unique stuck-terminal events and 7 leaked bash+server pairs accumulated on the fleet, with some sessions appearing hung from the user's perspective because the subshell's open stdout pipe kept the terminal tool's drain thread blocked. This is distinct from the `set +m` fix in 933fbd8 (which addressed interactive-shell job-control waiting at exit). `set +m` doesn't help here because `bash -c` is non-interactive and job control is already off; the problem is the subshell's own internal wait for its foreground B, not the outer shell's job-tracking. The fix: walk the command shell-aware (respecting quotes, parens, brace groups, `&>`/`>&` redirects), find `A && B &` / `A || B &` at depth 0 and rewrite the tail to `A && { B & }`. Brace groups don't fork a subshell — they run in the current shell. `B &` inside the group is a simple background (no subshell wait). The outer `&` is absorbed into the group, so the compound no longer needs an explicit subshell. `&&` error-propagation is preserved exactly: if A fails, `&&` short-circuits and B never runs. - Skips quoted strings, comment lines, and `(…)` subshells - Handles `&>/dev/null`, `2>&1`, `>&2` without mistaking them for `&` - Resets chain state at `;`, `|`, and newlines - Tracks brace depth so already-rewritten output is idempotent - Walks using the existing `_read_shell_token` tokenizer, matching the pattern of `_rewrite_real_sudo_invocations` Called once from `BaseEnvironment.execute` right after `_prepare_command`, so it runs for every backend (local, ssh, docker, modal, etc.) with no per-backend plumbing. 34 new tests covering rewrite cases, preservation cases, redirect edge-cases, quoting/parens/backticks, idempotency, and empty/edge inputs. End-to-end verified on a test VM: the exact vela-incident command now returns in ~1.3s with no leaked bash, only the intentional backgrounded server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b05e7f6 to
e80028d
Compare
This was referenced Apr 19, 2026
Contributor
|
Merged via #12724 — your commit was cherry-picked onto current main with authorship preserved in git log. Thanks @handsdiff! This stacks well with today's other terminal fixes:
Full PR: #12724 |
handsdiff
added a commit
to handsdiff/hermes-agent
that referenced
this pull request
Apr 24, 2026
…g), drop NousResearch#7297, add NousResearch#12207 - Add rows for NousResearch#12234 (match-based model routing, supersedes NousResearch#7297 and draft NousResearch#12227) and NousResearch#12207 (compound-background-subshell-leak). - Move NousResearch#7297 into a new 'Closed / superseded' section; note the branches (feat/per-platform-model and feat/per-source-model) are already deleted from origin. - Update rebase workflow: swap the feat/per-platform-model line for feat/model-routing, add fix/compound-background-subshell-leak. - Update the fork-main rebuild section: document that octopus strategy fails on adjacent-region additions and switch the documented command to a sequential-merge loop. Note the recurring conflict site (_classify_source_kind vs _is_owner_source) and the union-resolve strategy for it. - Add PR-specific note for NousResearch#12234 covering the helper, classifier, legacy shim, and rebase-conflict guidance.
handsdiff
added a commit
to handsdiff/hermes-agent
that referenced
this pull request
Apr 24, 2026
…12724; slate-dg-proxy-hook superseded - NousResearch#12207 (compound-background-subshell-leak) cherry-picked upstream as abfc184 via PR NousResearch#12724 with authorship preserved. Moved to "Merged" section; removed from rebase workflow and sequential-merge loop. - feat/slate-dg-proxy-hook (no PR) added under "Closed / superseded": abandoned prototype that built the Discord zero-secrets transport inside hermes-agent as an opt-in SLATE_DG_PROXY hook. Provisioner-side .pth install (hermes-provisioner/dg_patch.py) superseded it — zero hermes-agent patches required. Also mis-cut off fork main so unmergeable as upstream PR. - Dropped stale fix/compressor-tool-args-valid-json rebase line (branch already deleted; the "removed" note below was orphaned). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
bash parses
A && B &with&&tighter than&, so it forks a subshell for the compound and backgrounds the subshell. Inside the subshell,Bruns foreground, so the subshell waits forBto finish. WhenBis a process that never exits on its own (python3 -m http.server,yes > /dev/null, any daemon), the subshell is stuck inwait4forever and leaks as an orphan reparented to init.Observed in production
Across a small fleet over ~3 days:
Agent idle for 1803s … tool=terminal) on 4 of 5 VMsbash+ server pairs across the fleet, oldest from 2026-04-14Kernel stack of a live leak on each affected VM:
Consistent fingerprint across leaks:
PID ≠ PGID(bash is not session leader because the leaked one is a subshell, not the Popen child),PPID = 1(outer bash exited, subshell reparented), same process group as the original setsid'd outer bash.The common agent pattern that tripped it:
cd X && python3 -m http.server 8000 &>/dev/null & sleep 1 && curl …— an agent's "start a local server then verify it" one-liner.Relation to 933fbd8
Distinct from the
set +mfix in 933fbd8, which addressed an interactive shell's job-control waiting at exit.set +mdoesn't help here becausebash -cis non-interactive and job control is already off — the problem is the subshell's own internal wait for its foregroundB, not the outer shell's job tracking.Fix
Walk the command shell-aware (respecting quotes, parens, brace groups,
&>/>&redirects), findA && B &/A || B &at depth 0, and rewrite the tail toA && { B & }:B &inside the group is a simple background (no subshell wait)&is absorbed into the group, so the compound no longer needs an explicit subshell&&error-propagation is preserved exactly: ifAfails,&&short-circuits andBnever runsCalled once from
BaseEnvironment.executeright after_prepare_command, so it runs for every backend (local, ssh, docker, modal, …) with no per-backend plumbing.Examples
cd /tmp && python3 -m http.server 8000 &>/dev/null &cd /tmp && { python3 -m http.server 8000 &>/dev/null & }A && B && C &A && B && { C & }A || B &A || { B & }sleep 5 &A; B &cmd 2>&1 &echo 'A && B &'(A && B) &Edge cases handled
_read_shell_token(...)skipped (separate bug class){ ... }tracked for depth, making the rewrite idempotent&>/dev/nullnot mistaken for background&2>&1,>&2,<&3fd-redirects recognised by looking back for</>;,|, newlines reset chain state so the rewrite doesn't cross statements# …) left alone\&\&treated as literalsTests
34 new tests in
tests/tools/test_terminal_compound_background.py:TestRewrites— positive cases (10):A && B &,A \|\| B &, chained, mixed, multiline, realistic vela-shapeTestPreserved— negative cases (7): simple bg, no compound, empty, whitespace-onlyTestRedirectsNotConfused— 5 redirect scenariosTestQuotingAndParens— 6 quoting / paren / comment / cmd-substitution / backslash casesTestIdempotence— 2 idempotency checksTestEdgeCases— 4 malformed / leading-whitespace / tab-separatedAll 34 pass. Adjacent terminal-tool tests (114 more across
test_terminal_tool.py,test_terminal_exit_semantics.py,test_terminal_requirements.py, etc.) still pass unchanged.End-to-end validation
Provisioned a fresh test VM. Ran the exact vela-incident command through the real
LocalEnvironment.executepath:wait4for 1800 s, server alive as its child, gateway inactivity timeout eventually firedexecute()returns in 1.31 s withrc=0, curl sees200, no leaked bash, only the intentional backgrounded server (orphaned to init, which is the agent's expressed intent)Scope left for follow-up
(A && B) &and other paren-subshell backgrounding has the same bug class but isn't the common agent pattern; adding paren coverage would require a different rewrite shape (the subshell is explicit) and is safer as a separate PR./stop-doesn't-reach-worker and SIGTERM-orphans-subprocess issues from fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace #11907 are the upstream side of the same family of incidents — this PR complements that one by removing the most common trigger pattern entirely.🤖 Generated with Claude Code