Skip to content

fix(task): terminate parallel siblings on failure via process groups#9655

Merged
jdx merged 2 commits intomainfrom
fix/parallel-sibling-termination
May 6, 2026
Merged

fix(task): terminate parallel siblings on failure via process groups#9655
jdx merged 2 commits intomainfrom
fix/parallel-sibling-termination

Conversation

@jdx
Copy link
Copy Markdown
Owner

@jdx jdx commented May 6, 2026

Summary

Fixes #8459 — when a parallel mise run --jobs N task fails, sibling tasks now terminate promptly instead of running to natural completion.

The previous behaviour: SIGTERM only hit the direct child PID. For sh -c "...sleep 30..." the shell exits but sleep is reparented to init and runs to completion, holding inherited stdout/stderr pipes open and stalling mise's pipe-reader threads. Total runtime was bounded by the slowest sibling, not the failing one.

This re-introduces per-task process groups (the approach reverted in 26ac45323) and gates them so the original Playwright regression (#8345) does not return.

Approach

  • should_use_pgroup() decides whether to setpgid + killpg. It returns false when:
    1. MISE_TASK_PGID_MANAGED=1 is in env (set by an outer mise on every spawned child) — so a nested mise inherits the outer's pgid instead of creating a new one
    2. We're the session leader (getsid(0) == getpid()) — what Node's detached: true does, including Playwright's webServer
  • kill_all uses killpg (with kill fallback) when pgroups are active.
  • PIPE_DRAIN_TIMEOUT = 10s — after a child's ExitStatus arrives, the rx loop switches to recv_timeout. If a grandchild that escaped the kill is still holding pipes open, abandon the readers instead of hanging forever.
  • The signal forwarder now forwards SIGINT too when pgroups are on, since the child is no longer in the terminal's foreground pgid.

Scheduler bugs that meant kill_all was never reaching

While debugging, two bugs surfaced that meant the existing kill_all call site never fired in practice:

  1. Scheduler::join_all was checking result.is_ok() — but JoinSet::join_next returns Ok(Err(...)) for failing tasks (only Err for panics), so failures bypassed the kill entirely.
  2. run_loop's is_stopping check only ran on the next loop iteration, but while siblings were running it was idle in tokio::select! and never iterated. kill_all is now triggered directly from the spawn closure on failure.

Tradeoffs

  • Direct invocation (mise run --jobs 2 fail ::: long): perfect — sleep dies via killpg (~1s).
  • Wrapped invocation ([tasks.repro] run = "mise run --jobs 2 ..."): inner mise can't killpg without also killing itself, so its grandchild sleep gets reparented and runs out the clock as an orphan. The drain timeout still ensures mise exits in ~11s instead of hanging.
  • SIGKILL from an orchestrator can't be caught — the session-leader gate (rather than any handler) is what saves us in the bare-Playwright case.

Test plan

🤖 Generated with Claude Code


Note

Medium Risk
Touches Unix process-group management and signal forwarding in the task runner, which can affect shutdown behavior across platforms and nested invocations. Adds safeguards and timeouts, but regressions could still surface in edge cases (TTY/interactive commands, orchestrator-managed sessions).

Overview
Fixes parallel task fail-fast behavior by reintroducing per-task process groups so SIGTERM reaches task grandchildren (e.g. sh -c ... sleep) and sibling tasks stop promptly when one fails.

Adds should_use_pgroup() gating plus MISE_TASK_PGID_MANAGED propagation to avoid breaking orchestrator setsid/pgroup kills and nested mise runs, updates kill_all to prefer killpg, and forwards signals (including SIGINT) appropriately when pgroups are enabled. Introduces a 10s pipe-drain timeout after child exit to prevent hangs when inherited stdout/stderr pipes remain open.

Fixes/adjusts failure termination triggering in both Run::spawn_sched_job and Scheduler::join_all, and adds two e2e regressions tests covering orchestrator pgroup SIGKILL propagation and parallel sibling termination timing.

Reviewed by Cursor Bugbot for commit 0ff45b8. Bugbot is set up for automated code reviews on this repo. Configure here.

When a parallel task fails, mise was sending SIGTERM only to the direct
child PID. For tasks like `sh -c "...sleep 30..."`, sh exits but its
foreground child (sleep) gets reparented to init and runs to completion,
holding the inherited stdout/stderr pipes open and stalling mise's
pipe-reader threads — total runtime ballooned to however long the
longest sibling ran naturally (#8459).

Re-introduce per-task process groups (the approach reverted in 26ac453)
and gate them so the original Playwright regression (#8345) doesn't return:

- `should_use_pgroup()` skips setpgid/killpg when nested under another
  mise (env var `MISE_TASK_PGID_MANAGED=1` propagated through the env)
  or when we're the session leader (`getsid(0) == getpid()` — what Node's
  `detached: true` does, including Playwright's `webServer`).
- `kill_all` uses `killpg` with `kill` fallback when pgroups are active.
- New `PIPE_DRAIN_TIMEOUT = 10s`: after a child's ExitStatus arrives,
  switch the rx loop to recv_timeout. If a grandchild that escaped the
  kill is still holding pipes open, abandon the readers instead of
  hanging forever.
- Signal forwarder now forwards SIGINT too when pgroups are on (since
  the child is no longer in the terminal's foreground pgid).

Also fixes two scheduler bugs that meant the existing kill_all call site
never fired:

- `Scheduler::join_all` was checking `result.is_ok()`, which is true
  for `Ok(Err(...))` — so failing tasks (the common case, vs panics)
  bypassed kill_all entirely.
- `run_loop`'s `is_stopping` check only ran on the next iteration, but
  while siblings were running it was idle in `tokio::select!` and never
  iterated. Trigger kill_all directly from the spawn closure on failure.

E2E coverage:
- `tasks/test_task_parallel_fail_fast` for #8459 (direct invocation
  exits in ~1s; wrapped invocation falls back to drain timeout).
- `tasks/test_task_orchestrator_kill` for #8345 (`setsid mise run` +
  `kill -KILL -PGID` simulating Playwright's tree-kill).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR fixes parallel fail-fast behavior by ensuring that when one parallel sibling task fails, its running siblings (and their grandchildren) are terminated promptly via Unix process groups rather than running to natural completion and holding pipes open.

  • Introduces should_use_pgroup() (cached via Lazy<bool>) which gates per-task setpgid(0,0) + killpg behind two safety checks: the MISE_TASK_PGID_MANAGED env var (set by an outer mise to prevent nested mise from escaping the outer's pgid) and a session-leader guard (getsid(0) == getpid()) to avoid breaking Playwright/orchestrator setsid + kill -KILL -pgid teardown patterns.
  • Adds PIPE_DRAIN_TIMEOUT = 10s so orphaned grandchildren holding pipes open can't stall the parent's stdout/stderr readers indefinitely; also fixes two pre-existing scheduler bugs where join_all was checking result.is_ok() (missing the Ok(Err(_)) case) and kill_all was never reached from run_loop's idle select! state.
  • Adds two e2e regression tests covering the orchestrator setsid+kill -PGID flow (Mise task hangs when a child processes is spawned with mise (since v2026.2.18) #8345) and the direct/wrapped parallel sibling termination timing (mise run: Parallel tasks are not terminated on failure of sibling #8459).

Confidence Score: 5/5

Safe to merge. The process-group management logic is carefully gated, well-commented, and backed by two targeted e2e regression tests.

The two scheduler bugs (wrong is_ok() check and unreachable kill_all from idle select!) were genuine defects that are now correctly fixed. The pgroup gating via should_use_pgroup() — cached on first access, consistent across execute() and kill_all() — is sound. The PIPE_DRAIN_TIMEOUT fallback handles the nested-mise edge case cleanly. Previous review concerns (missing EXIT trap in the test, separate should_use_pgroup evaluations) have both been addressed in this revision.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(task): cache should_use_pgroup, trap..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements process group management for tasks to ensure that descendant processes are correctly terminated when a task fails or is killed, preventing hangs caused by orphaned processes holding pipes open. It introduces a 10-second pipe drain timeout and logic to handle nested mise instances and TTY-based tasks. Feedback was provided regarding a compilation error in src/cmd.rs due to the use of a non-existent method and an unnecessary unsafe block, with a suggestion to use nix::unistd::isatty instead.

I am having trouble creating individual review comments. Click here to see my feedback.

src/cmd.rs (513-530)

high

The unsafe block is unnecessary here as pre_exec and the functions called within it (setpgid, isatty) are not marked as unsafe in recent versions of nix. Additionally, std::os::fd::BorrowedFd::borrow_raw is not a public method in the standard library and will cause a compilation error. It is better to use nix::unistd::isatty directly with the raw file descriptor (0 for stdin).

            self.cmd.pre_exec(|| {
                // Skip setpgid when stdin is a TTY: interactive tools
                // (e.g. Tilt) need to stay in the terminal's foreground
                // pgid or they hang on SIGTTIN when reading stdin.
                // Use nix::unistd::isatty(0) — pre_exec runs post-fork where
                // OnceLock/malloc are not async-signal-safe.
                if let Ok(false) = nix::unistd::isatty(0) {
                    let _ = nix::unistd::setpgid(
                        nix::unistd::Pid::from_raw(0),
                        nix::unistd::Pid::from_raw(0),
                    );
                }
                Ok(())
            });

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Hyperfine Performance

mise x -- echo

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.5.1 x -- echo 24.3 ± 1.6 20.3 32.4 1.00
mise x -- echo 24.9 ± 1.7 21.2 33.0 1.02 ± 0.10

mise env

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.5.1 env 24.3 ± 1.9 20.7 32.7 1.02 ± 0.09
mise env 23.8 ± 1.1 21.4 31.3 1.00

mise hook-env

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.5.1 hook-env 24.0 ± 1.2 20.7 30.6 1.00
mise hook-env 25.7 ± 1.6 21.4 31.3 1.07 ± 0.09

mise ls

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.5.1 ls 20.8 ± 1.6 16.5 25.7 1.00
mise ls 21.3 ± 1.7 17.2 27.5 1.03 ± 0.11

xtasks/test/perf

Command mise-2026.5.1 mise Variance
install (cached) 148ms 149ms +0%
ls (cached) 75ms 74ms +1%
bin-paths (cached) 81ms 83ms -2%
task-ls (cached) 631ms 624ms +1%

Address PR feedback:

- Cache should_use_pgroup() in a LazyLock<bool>. execute() decides at
  spawn time whether to setpgid; kill_all() decides at signal time
  whether to killpg. The two must agree, or a child placed in its own
  pgid won't be reached by killpg and grandchildren leak. Computing the
  value once eliminates any chance of disagreement if the env later
  mutates.

- Add `trap 'rm -f "$marker"' EXIT` to test_task_parallel_fail_fast so
  the marker tempfile is removed on abnormal exit, not just at the
  explicit `rm -f` call. Without it, a failed run leaves a stale marker
  that would false-positive subsequent runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdx jdx merged commit 4ce7e71 into main May 6, 2026
34 of 35 checks passed
@jdx jdx deleted the fix/parallel-sibling-termination branch May 6, 2026 15:27
mise-en-dev added a commit that referenced this pull request May 7, 2026
### 🚀 Features

- **(aqua)** support registry libc variants by @jdx in
[#9652](#9652)
- **(bin-paths)** add executable names output by @risu729 in
[#9617](#9617)

### 🐛 Bug Fixes

- **(aqua)** preserve configured file extensions by @risu729 in
[#9611](#9611)
- **(aqua)** support registry file links by @risu729 in
[#9610](#9610)
- **(backend)** reject bare package backend names by @risu729 in
[#9608](#9608)
- **(backend)** apply inline tool option overrides by @risu729 in
[#9306](#9306)
- **(backend)** skip versions host for local tool opts by @risu729 in
[#9568](#9568)
- **(github)** chmod explicit archive bin by @risu729 in
[#9609](#9609)
- **(install)** skip remote-versions refresh in prefer-offline mode by
@jdx in [#9627](#9627)
- **(lock)** scope targets to active project root by @risu729 in
[#9319](#9319)
- **(lockfile)** respect existing platforms during auto-lock by @jdx in
[#9621](#9621)
- **(pipx)** filter yanked pypi releases by @risu729 in
[#9607](#9607)
- **(pipx)** declare python as a backend dependency by @jdx in
[#9678](#9678)
- **(schema)** update refs to $defs in mise-registry-tool.json by
@risu729 in [#9671](#9671)
- **(task)** terminate parallel siblings on failure via process groups
by @jdx in [#9655](#9655)
- **(task)** stable MISE_PROJECT_ROOT for monorepo tasks, add
MISE_MONOREPO_ROOT by @jdx in
[#9657](#9657)
- **(trust)** run enter hooks after trusting config by @risu729 in
[#9634](#9634)
- **(ui)** stop clearing screen for prompts by @jdx in
[#9619](#9619)
- use /bin/cp on macos by @pdehlke in
[#9656](#9656)

### 🚜 Refactor

- **(aqua)** store aqua var defaults as strings by @risu729 in
[#9645](#9645)
- **(config)** support structured TOML values in registry backend
options by @risu729 in [#9584](#9584)
- **(deps)** remove serde_derive dependency by @risu729 in
[#9670](#9670)
- **(deps)** remove anyhow dependency by @risu729 in
[#9661](#9661)
- **(deps)** use std::sync::LazyLock instead of once_cell::Lazy by
@risu729 in [#9668](#9668)
- **(schema)** generate task schema from mise schema by @risu729 in
[#9581](#9581)
- **(schema)** reuse task props with unevaluatedProperties by @risu729
in [#9582](#9582)
- **(schema)** reuse registry common types by @risu729 in
[#9648](#9648)
- **(schema)** reuse plugin script config by @risu729 in
[#9647](#9647)
- **(schema)** use $defs in schema files by @risu729 in
[#9646](#9646)

### 📚 Documentation

- **(node)** add tips for enabling node idiomatic by @fu050409 in
[#9675](#9675)

### 🧪 Testing

- **(cli)** remove nondeterministic tool depends assertion by @risu729
in [#9633](#9633)
- **(e2e)** pin uv to 0.11.8 around astral-sh/uv#19278 by @jdx in
[#9618](#9618)
- **(e2e)** wait for docker env cleanup by @risu729 in
[#9631](#9631)
- **(zig)** use official zig instead of mach mirror by @jdx in
[#9659](#9659)

### 📦️ Dependency Updates

- fall through to hash check when providers have no outputs by @jdx in
[#9622](#9622)
- bump Cargo.lock by @jdx in
[#9625](#9625)

### 📦 Registry

- remove registry depends by @risu729 in
[#9571](#9571)
- add code-review-graph (pipx:code-review-graph) by @chautruonglong in
[#9673](#9673)

### Chore

- **(ci)** split large registry test-tool changes by @risu729 in
[#9628](#9628)
- **(ci)** make perf script robust to runner noise by @jdx in
[#9635](#9635)
- **(ci)** skip hyperfine comments without permission by @risu729 in
[#9629](#9629)

### New Contributors

- @chautruonglong made their first contribution in
[#9673](#9673)
- @pdehlke made their first contribution in
[#9656](#9656)

## 📦 Aqua Registry Updates

### New Packages (5)

-
[`anthropics/anthropic-cli`](https://github.com/anthropics/anthropic-cli)
- [`crates.io/wasmi_cli`](https://github.com/wasmi-labs/wasmi)
- [`openclaw/gogcli`](https://github.com/openclaw/gogcli)
- `racket-lang.org/racket-minimal`
- [`runs-on/cli`](https://github.com/runs-on/cli)

### Updated Packages (13)

- [`UpCloudLtd/upcloud-cli`](https://github.com/UpCloudLtd/upcloud-cli)
- [`aristocratos/btop`](https://github.com/aristocratos/btop)
- [`dprint/dprint`](https://github.com/dprint/dprint)
- [`j178/prek`](https://github.com/j178/prek)
- [`jdx/hk`](https://github.com/jdx/hk)
- [`jdx/mise`](https://github.com/jdx/mise)
- [`jdx/usage`](https://github.com/jdx/usage)
- [`jreleaser/jreleaser`](https://github.com/jreleaser/jreleaser)
-
[`jreleaser/jreleaser/standalone`](https://github.com/jreleaser/jreleaser)
- [`pnpm/pnpm`](https://github.com/pnpm/pnpm)
- [`suzuki-shunsuke/cmdx`](https://github.com/suzuki-shunsuke/cmdx)
- [`suzuki-shunsuke/ghir`](https://github.com/suzuki-shunsuke/ghir)
- [`twpayne/chezmoi`](https://github.com/twpayne/chezmoi)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant