fix(task): terminate parallel siblings on failure via process groups#9655
fix(task): terminate parallel siblings on failure via process groups#9655
Conversation
When a parallel task fails, mise was sending SIGTERM only to the direct child PID. For tasks like `sh -c "...sleep 30..."`, sh exits but its foreground child (sleep) gets reparented to init and runs to completion, holding the inherited stdout/stderr pipes open and stalling mise's pipe-reader threads — total runtime ballooned to however long the longest sibling ran naturally (#8459). Re-introduce per-task process groups (the approach reverted in 26ac453) and gate them so the original Playwright regression (#8345) doesn't return: - `should_use_pgroup()` skips setpgid/killpg when nested under another mise (env var `MISE_TASK_PGID_MANAGED=1` propagated through the env) or when we're the session leader (`getsid(0) == getpid()` — what Node's `detached: true` does, including Playwright's `webServer`). - `kill_all` uses `killpg` with `kill` fallback when pgroups are active. - New `PIPE_DRAIN_TIMEOUT = 10s`: after a child's ExitStatus arrives, switch the rx loop to recv_timeout. If a grandchild that escaped the kill is still holding pipes open, abandon the readers instead of hanging forever. - Signal forwarder now forwards SIGINT too when pgroups are on (since the child is no longer in the terminal's foreground pgid). Also fixes two scheduler bugs that meant the existing kill_all call site never fired: - `Scheduler::join_all` was checking `result.is_ok()`, which is true for `Ok(Err(...))` — so failing tasks (the common case, vs panics) bypassed kill_all entirely. - `run_loop`'s `is_stopping` check only ran on the next iteration, but while siblings were running it was idle in `tokio::select!` and never iterated. Trigger kill_all directly from the spawn closure on failure. E2E coverage: - `tasks/test_task_parallel_fail_fast` for #8459 (direct invocation exits in ~1s; wrapped invocation falls back to drain timeout). - `tasks/test_task_orchestrator_kill` for #8345 (`setsid mise run` + `kill -KILL -PGID` simulating Playwright's tree-kill). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR fixes parallel fail-fast behavior by ensuring that when one parallel sibling task fails, its running siblings (and their grandchildren) are terminated promptly via Unix process groups rather than running to natural completion and holding pipes open.
Confidence Score: 5/5Safe to merge. The process-group management logic is carefully gated, well-commented, and backed by two targeted e2e regression tests. The two scheduler bugs (wrong is_ok() check and unreachable kill_all from idle select!) were genuine defects that are now correctly fixed. The pgroup gating via should_use_pgroup() — cached on first access, consistent across execute() and kill_all() — is sound. The PIPE_DRAIN_TIMEOUT fallback handles the nested-mise edge case cleanly. Previous review concerns (missing EXIT trap in the test, separate should_use_pgroup evaluations) have both been addressed in this revision. No files require special attention. Reviews (2): Last reviewed commit: "fix(task): cache should_use_pgroup, trap..." | Re-trigger Greptile |
There was a problem hiding this comment.
Code Review
This pull request implements process group management for tasks to ensure that descendant processes are correctly terminated when a task fails or is killed, preventing hangs caused by orphaned processes holding pipes open. It introduces a 10-second pipe drain timeout and logic to handle nested mise instances and TTY-based tasks. Feedback was provided regarding a compilation error in src/cmd.rs due to the use of a non-existent method and an unnecessary unsafe block, with a suggestion to use nix::unistd::isatty instead.
I am having trouble creating individual review comments. Click here to see my feedback.
src/cmd.rs (513-530)
The unsafe block is unnecessary here as pre_exec and the functions called within it (setpgid, isatty) are not marked as unsafe in recent versions of nix. Additionally, std::os::fd::BorrowedFd::borrow_raw is not a public method in the standard library and will cause a compilation error. It is better to use nix::unistd::isatty directly with the raw file descriptor (0 for stdin).
self.cmd.pre_exec(|| {
// Skip setpgid when stdin is a TTY: interactive tools
// (e.g. Tilt) need to stay in the terminal's foreground
// pgid or they hang on SIGTTIN when reading stdin.
// Use nix::unistd::isatty(0) — pre_exec runs post-fork where
// OnceLock/malloc are not async-signal-safe.
if let Ok(false) = nix::unistd::isatty(0) {
let _ = nix::unistd::setpgid(
nix::unistd::Pid::from_raw(0),
nix::unistd::Pid::from_raw(0),
);
}
Ok(())
});
Hyperfine Performance
|
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
mise-2026.5.1 x -- echo |
24.3 ± 1.6 | 20.3 | 32.4 | 1.00 |
mise x -- echo |
24.9 ± 1.7 | 21.2 | 33.0 | 1.02 ± 0.10 |
mise env
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
mise-2026.5.1 env |
24.3 ± 1.9 | 20.7 | 32.7 | 1.02 ± 0.09 |
mise env |
23.8 ± 1.1 | 21.4 | 31.3 | 1.00 |
mise hook-env
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
mise-2026.5.1 hook-env |
24.0 ± 1.2 | 20.7 | 30.6 | 1.00 |
mise hook-env |
25.7 ± 1.6 | 21.4 | 31.3 | 1.07 ± 0.09 |
mise ls
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
mise-2026.5.1 ls |
20.8 ± 1.6 | 16.5 | 25.7 | 1.00 |
mise ls |
21.3 ± 1.7 | 17.2 | 27.5 | 1.03 ± 0.11 |
xtasks/test/perf
| Command | mise-2026.5.1 | mise | Variance |
|---|---|---|---|
| install (cached) | 148ms | 149ms | +0% |
| ls (cached) | 75ms | 74ms | +1% |
| bin-paths (cached) | 81ms | 83ms | -2% |
| task-ls (cached) | 631ms | 624ms | +1% |
Address PR feedback: - Cache should_use_pgroup() in a LazyLock<bool>. execute() decides at spawn time whether to setpgid; kill_all() decides at signal time whether to killpg. The two must agree, or a child placed in its own pgid won't be reached by killpg and grandchildren leak. Computing the value once eliminates any chance of disagreement if the env later mutates. - Add `trap 'rm -f "$marker"' EXIT` to test_task_parallel_fail_fast so the marker tempfile is removed on abnormal exit, not just at the explicit `rm -f` call. Without it, a failed run leaves a stale marker that would false-positive subsequent runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### 🚀 Features - **(aqua)** support registry libc variants by @jdx in [#9652](#9652) - **(bin-paths)** add executable names output by @risu729 in [#9617](#9617) ### 🐛 Bug Fixes - **(aqua)** preserve configured file extensions by @risu729 in [#9611](#9611) - **(aqua)** support registry file links by @risu729 in [#9610](#9610) - **(backend)** reject bare package backend names by @risu729 in [#9608](#9608) - **(backend)** apply inline tool option overrides by @risu729 in [#9306](#9306) - **(backend)** skip versions host for local tool opts by @risu729 in [#9568](#9568) - **(github)** chmod explicit archive bin by @risu729 in [#9609](#9609) - **(install)** skip remote-versions refresh in prefer-offline mode by @jdx in [#9627](#9627) - **(lock)** scope targets to active project root by @risu729 in [#9319](#9319) - **(lockfile)** respect existing platforms during auto-lock by @jdx in [#9621](#9621) - **(pipx)** filter yanked pypi releases by @risu729 in [#9607](#9607) - **(pipx)** declare python as a backend dependency by @jdx in [#9678](#9678) - **(schema)** update refs to $defs in mise-registry-tool.json by @risu729 in [#9671](#9671) - **(task)** terminate parallel siblings on failure via process groups by @jdx in [#9655](#9655) - **(task)** stable MISE_PROJECT_ROOT for monorepo tasks, add MISE_MONOREPO_ROOT by @jdx in [#9657](#9657) - **(trust)** run enter hooks after trusting config by @risu729 in [#9634](#9634) - **(ui)** stop clearing screen for prompts by @jdx in [#9619](#9619) - use /bin/cp on macos by @pdehlke in [#9656](#9656) ### 🚜 Refactor - **(aqua)** store aqua var defaults as strings by @risu729 in [#9645](#9645) - **(config)** support structured TOML values in registry backend options by @risu729 in [#9584](#9584) - **(deps)** remove serde_derive dependency by @risu729 in [#9670](#9670) - **(deps)** remove anyhow dependency by @risu729 in [#9661](#9661) - **(deps)** use std::sync::LazyLock instead of once_cell::Lazy by @risu729 in [#9668](#9668) - **(schema)** generate task schema from mise schema by @risu729 in [#9581](#9581) - **(schema)** reuse task props with unevaluatedProperties by @risu729 in [#9582](#9582) - **(schema)** reuse registry common types by @risu729 in [#9648](#9648) - **(schema)** reuse plugin script config by @risu729 in [#9647](#9647) - **(schema)** use $defs in schema files by @risu729 in [#9646](#9646) ### 📚 Documentation - **(node)** add tips for enabling node idiomatic by @fu050409 in [#9675](#9675) ### 🧪 Testing - **(cli)** remove nondeterministic tool depends assertion by @risu729 in [#9633](#9633) - **(e2e)** pin uv to 0.11.8 around astral-sh/uv#19278 by @jdx in [#9618](#9618) - **(e2e)** wait for docker env cleanup by @risu729 in [#9631](#9631) - **(zig)** use official zig instead of mach mirror by @jdx in [#9659](#9659) ### 📦️ Dependency Updates - fall through to hash check when providers have no outputs by @jdx in [#9622](#9622) - bump Cargo.lock by @jdx in [#9625](#9625) ### 📦 Registry - remove registry depends by @risu729 in [#9571](#9571) - add code-review-graph (pipx:code-review-graph) by @chautruonglong in [#9673](#9673) ### Chore - **(ci)** split large registry test-tool changes by @risu729 in [#9628](#9628) - **(ci)** make perf script robust to runner noise by @jdx in [#9635](#9635) - **(ci)** skip hyperfine comments without permission by @risu729 in [#9629](#9629) ### New Contributors - @chautruonglong made their first contribution in [#9673](#9673) - @pdehlke made their first contribution in [#9656](#9656) ## 📦 Aqua Registry Updates ### New Packages (5) - [`anthropics/anthropic-cli`](https://github.com/anthropics/anthropic-cli) - [`crates.io/wasmi_cli`](https://github.com/wasmi-labs/wasmi) - [`openclaw/gogcli`](https://github.com/openclaw/gogcli) - `racket-lang.org/racket-minimal` - [`runs-on/cli`](https://github.com/runs-on/cli) ### Updated Packages (13) - [`UpCloudLtd/upcloud-cli`](https://github.com/UpCloudLtd/upcloud-cli) - [`aristocratos/btop`](https://github.com/aristocratos/btop) - [`dprint/dprint`](https://github.com/dprint/dprint) - [`j178/prek`](https://github.com/j178/prek) - [`jdx/hk`](https://github.com/jdx/hk) - [`jdx/mise`](https://github.com/jdx/mise) - [`jdx/usage`](https://github.com/jdx/usage) - [`jreleaser/jreleaser`](https://github.com/jreleaser/jreleaser) - [`jreleaser/jreleaser/standalone`](https://github.com/jreleaser/jreleaser) - [`pnpm/pnpm`](https://github.com/pnpm/pnpm) - [`suzuki-shunsuke/cmdx`](https://github.com/suzuki-shunsuke/cmdx) - [`suzuki-shunsuke/ghir`](https://github.com/suzuki-shunsuke/ghir) - [`twpayne/chezmoi`](https://github.com/twpayne/chezmoi)
Summary
Fixes #8459 — when a parallel
mise run --jobs Ntask fails, sibling tasks now terminate promptly instead of running to natural completion.The previous behaviour: SIGTERM only hit the direct child PID. For
sh -c "...sleep 30..."the shell exits butsleepis reparented to init and runs to completion, holding inherited stdout/stderr pipes open and stalling mise's pipe-reader threads. Total runtime was bounded by the slowest sibling, not the failing one.This re-introduces per-task process groups (the approach reverted in
26ac45323) and gates them so the original Playwright regression (#8345) does not return.Approach
should_use_pgroup()decides whether tosetpgid+killpg. It returns false when:MISE_TASK_PGID_MANAGED=1is in env (set by an outer mise on every spawned child) — so a nested mise inherits the outer's pgid instead of creating a new onegetsid(0) == getpid()) — what Node'sdetached: truedoes, including Playwright'swebServerkill_alluseskillpg(withkillfallback) when pgroups are active.PIPE_DRAIN_TIMEOUT = 10s— after a child'sExitStatusarrives, the rx loop switches torecv_timeout. If a grandchild that escaped the kill is still holding pipes open, abandon the readers instead of hanging forever.Scheduler bugs that meant
kill_allwas never reachingWhile debugging, two bugs surfaced that meant the existing
kill_allcall site never fired in practice:Scheduler::join_allwas checkingresult.is_ok()— butJoinSet::join_nextreturnsOk(Err(...))for failing tasks (onlyErrfor panics), so failures bypassed the kill entirely.run_loop'sis_stoppingcheck only ran on the next loop iteration, but while siblings were running it was idle intokio::select!and never iterated.kill_allis now triggered directly from the spawn closure on failure.Tradeoffs
mise run --jobs 2 fail ::: long): perfect —sleepdies viakillpg(~1s).[tasks.repro] run = "mise run --jobs 2 ..."): inner mise can'tkillpgwithout also killing itself, so its grandchildsleepgets reparented and runs out the clock as an orphan. The drain timeout still ensures mise exits in ~11s instead of hanging.SIGKILLfrom an orchestrator can't be caught — the session-leader gate (rather than any handler) is what saves us in the bare-Playwright case.Test plan
cargo buildcleancargo clippy --all-targetscleanmise run test:e2e tasks/test_task_parallel_fail_fast— new test for mise run: Parallel tasks are not terminated on failure of sibling #8459mise run test:e2e tasks/test_task_orchestrator_kill— new test for Mise task hangs when a child processes is spawned with mise (since v2026.2.18) #8345 usingsetsid mise run+kill -KILL -PGIDmise run test:e2e tasks/test_task_failure_hang tasks/test_task_failure_hang_depends tasks/test_task_sequence_failure tasks/test_task_timeout tasks/test_task_depends_post_on_failure tasks/test_task_depends_post_skipped_on_dep_failure tasks/test_task_parallel_execution tasks/test_task_parallel_replacing tasks/test_task_raw_output tasks/test_task_raw_args— related tests pass🤖 Generated with Claude Code
Note
Medium Risk
Touches Unix process-group management and signal forwarding in the task runner, which can affect shutdown behavior across platforms and nested invocations. Adds safeguards and timeouts, but regressions could still surface in edge cases (TTY/interactive commands, orchestrator-managed sessions).
Overview
Fixes parallel task fail-fast behavior by reintroducing per-task process groups so
SIGTERMreaches task grandchildren (e.g.sh -c ... sleep) and sibling tasks stop promptly when one fails.Adds
should_use_pgroup()gating plusMISE_TASK_PGID_MANAGEDpropagation to avoid breaking orchestratorsetsid/pgroup kills and nestedmiseruns, updateskill_allto preferkillpg, and forwards signals (includingSIGINT) appropriately when pgroups are enabled. Introduces a 10s pipe-drain timeout after child exit to prevent hangs when inherited stdout/stderr pipes remain open.Fixes/adjusts failure termination triggering in both
Run::spawn_sched_jobandScheduler::join_all, and adds two e2e regressions tests covering orchestrator pgroupSIGKILLpropagation and parallel sibling termination timing.Reviewed by Cursor Bugbot for commit 0ff45b8. Bugbot is set up for automated code reviews on this repo. Configure here.