Skip to content

feat(supervisor): use process groups for atomic daemon shutdown#239

Merged
jdx merged 2 commits intomainfrom
feat/process-group-isolation
Feb 16, 2026
Merged

feat(supervisor): use process groups for atomic daemon shutdown#239
jdx merged 2 commits intomainfrom
feat/process-group-isolation

Conversation

@jdx
Copy link
Owner

@jdx jdx commented Feb 16, 2026

Summary

  • Spawn each daemon in its own session via setsid() so the daemon PID becomes the process group leader (PID == PGID)
  • Replace the sequential /proc-walking child kill logic with a single killpg() call that atomically signals the entire process group
  • Eliminates race conditions where a child could fork between the proc scan and the kill, and is faster for multi-child daemons

Test plan

  • cargo build — clean, no warnings
  • cargo nextest run — all 154 tests pass (including test_stop_kills_child_processes)
  • cargo clippy and cargo fmt --check — clean
  • Manual: start a daemon that spawns children (e.g. npm run dev), stop it, verify no orphans
  • Manual: start a daemon that ignores SIGTERM, verify SIGKILL escalation kills the whole group

🤖 Generated with Claude Code


Note

Medium Risk
Changes core daemon lifecycle behavior and low-level signal/session handling, which could affect shutdown reliability or accidentally signal the wrong processes if PGID assumptions break.

Overview
Daemon shutdown is now process-group based. On Unix, daemon spawn now calls setsid() so each daemon becomes its own process-group leader, enabling atomic signaling of the whole tree.

Stopping a daemon switches from /proc-walking and sequentially killing children + parent to a single killpg()-based graceful shutdown path (SIGTERM with ~3s polling, then SIGKILL fallback), with a non-Unix fallback to the existing single-process kill.

Minor test cleanups: simplify kill_port’s lsof handling and drop an unused chrono import.

Written by Cursor Bugbot for commit 5c0dfb6. This will update automatically on new commits. Configure here.

Spawn each daemon in its own session via setsid() so that stopping a
daemon can kill the entire process tree with a single killpg() call
instead of walking /proc to find children sequentially. This eliminates
race conditions where a child could fork between the scan and the kill,
and is faster for multi-child daemons since the whole group receives the
signal atomically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 16, 2026 22:59
@gemini-code-assist
Copy link

Summary of Changes

Hello @jdx, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the daemon termination process to be more reliable and efficient. By leveraging Unix process groups, the system can now ensure that when a daemon is stopped, all its associated child processes are also terminated atomically. This change addresses potential race conditions and improves the overall stability of daemon management.

Highlights

  • Atomic Daemon Shutdown: Implemented a robust mechanism for atomically shutting down daemons and all their descendant processes using Unix process groups.
  • Process Group Creation: Daemons are now spawned in their own session via setsid(), making the daemon's PID the process group leader (PGID).
  • Efficient Termination: Replaced the previous sequential /proc-walking child kill logic with a single killpg() call, which signals the entire process group, eliminating race conditions and improving speed.
Changelog
  • src/procs.rs
    • Added kill_process_group_async and kill_process_group functions for graceful and atomic termination of process groups, including SIGTERM and SIGKILL escalation.
    • Annotated the all_children function with #[allow(dead_code)] and updated its documentation to reflect that it is no longer used in the kill path.
  • src/supervisor/lifecycle.rs
    • Modified the daemon spawning logic to use libc::setsid() on Unix systems, ensuring each daemon runs in its own session and process group.
    • Updated the daemon stopping mechanism to call the new kill_process_group_async function, replacing the previous iterative child process killing.
  • tests/common/mod.rs
    • Refactored a conditional check for command output success using the && operator for improved readability.
  • tests/test_e2e_logs.rs
    • Removed an unused chrono import statement.
Activity
  • No specific activity (comments, reviews, progress) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a more robust and atomic way to shut down daemons by using process groups. The change from iterating through child processes to using killpg is a significant improvement, eliminating potential race conditions. The implementation correctly uses setsid when spawning processes and provides a graceful shutdown sequence with SIGTERM followed by SIGKILL. The code is well-structured, and the changes are also reflected in the tests. I have one suggestion to refactor some duplicated code for better maintainability, but overall this is a solid improvement.

src/procs.rs Outdated
Comment on lines +108 to +132
// Fast check: 10ms intervals for first 100ms
for i in 0..10 {
std::thread::sleep(std::time::Duration::from_millis(10));
self.refresh_pids(&[pid]);
if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) {
debug!(
"process group {pgid} terminated after SIGTERM ({} ms)",
(i + 1) * 10
);
return true;
}
}

// Slower check: 50ms intervals for up to ~3 more seconds
for i in 0..58 {
std::thread::sleep(std::time::Duration::from_millis(50));
self.refresh_pids(&[pid]);
if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) {
debug!(
"process group {pgid} terminated after SIGTERM ({} ms)",
100 + (i + 1) * 50
);
return true;
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The waiting logic with polling after sending SIGTERM is implemented in two separate loops, which leads to some code duplication. This can be refactored into a single loop using an iterator chain to make the code more concise and easier to maintain.

        // Wait for graceful shutdown, with a fast initial check.
        let fast_checks = std::iter::repeat(std::time::Duration::from_millis(10)).take(10);
        let slow_checks = std::iter::repeat(std::time::Duration::from_millis(50)).take(58);
        let mut elapsed_ms = 0;

        for sleep_duration in fast_checks.chain(slow_checks) {
            std::thread::sleep(sleep_duration);
            self.refresh_pids(&[pid]);
            elapsed_ms += sleep_duration.as_millis() as u64;
            if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) {
                debug!(
                    "process group {pgid} terminated after SIGTERM ({} ms)",
                    elapsed_ms
                );
                return true;
            }
        }

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but a Cloud Agent failed to start.

When the group leader is a zombie, child processes may still be running.
The early return on zombie detection would skip killpg(), leaving those
children as orphans. Remove the zombie check before killpg() and let
ESRCH from killpg() handle the "no such process group" case instead.

Also refactors the two polling loops into a single iterator chain per
reviewer feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the supervisor’s daemon shutdown strategy to use Unix process groups, enabling an atomic “kill the whole daemon tree” operation and avoiding races inherent in /proc-walking.

Changes:

  • Spawn daemons in their own session/process group via setsid() (Unix-only) to make the daemon PID the process group ID.
  • Replace per-child termination logic with a single process-group kill path (killpg) with SIGTERM→wait→SIGKILL escalation.
  • Minor test cleanup/refactor (remove unused import; simplify/adjust port cleanup loop).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/supervisor/lifecycle.rs Adds setsid() at spawn time and switches stop logic to process-group kill.
src/procs.rs Introduces process-group kill implementation and keeps old /proc traversal helper for diagnostics.
tests/common/mod.rs Refactors kill_port to avoid sleeping per PID and uses a let-chain style condition.
tests/test_e2e_logs.rs Removes an unused chrono import.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +89 to +92
debug!("killing process group {pgid}");

// Send SIGTERM to the entire process group.
// killpg sends to all processes in the group atomically.
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kill_process_group() uses the leader PID’s status (is_terminated_or_zombie(pid)) as the stopping condition. That can return early/false while other processes in the process group are still running (e.g., the group leader exits quickly but children keep running/ignore SIGTERM), which can leave orphans even though the API reports success. Consider checking process-group existence instead (e.g., poll killpg(pgid, 0) until it returns ESRCH, and don’t treat leader exit/zombie as sufficient).

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +106
if err.raw_os_error() == Some(libc::ESRCH) {
debug!("process group {pgid} no longer exists");
return false;
}
warn!("failed to send SIGTERM to process group {pgid}: {err}");
}

// Wait for graceful shutdown: fast initial check then slower polling.
let fast_checks = std::iter::repeat_n(std::time::Duration::from_millis(10), 10);
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When libc::killpg() fails with an error other than ESRCH, the function only logs a warning and continues, ultimately returning true. That can cause stop() to mark a daemon as Stopped even though no signals were delivered (e.g., EPERM). Consider propagating this as an error/false return, or at least validating the process group is actually gone (again via killpg(pgid, 0) -> ESRCH) before returning success.

Copilot uses AI. Check for mistakes.
Comment on lines +630 to +632
// Kill the entire process group atomically (daemon PID == PGID
// because we called setsid() at spawn time)
if let Err(e) = PROCS.kill_process_group_async(pid).await {
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop() only attempts the new process-group kill inside the if PROCS.is_running(pid) branch. With process groups, the group may still contain running children even if the original leader PID has already exited, so this guard can skip the killpg path and leave orphan processes. Consider attempting the process-group kill based on the stored PGID regardless of leader liveness (and let killpg(..., 0)/ESRCH determine whether anything remains).

Copilot uses AI. Check for mistakes.
@jdx jdx merged commit 1e8f64d into main Feb 16, 2026
5 checks passed
@jdx jdx deleted the feat/process-group-isolation branch February 16, 2026 23:09
@jdx jdx mentioned this pull request Feb 16, 2026
jdx added a commit that referenced this pull request Feb 16, 2026
## 🤖 New release

* `pitchfork-cli`: 1.4.3 -> 1.5.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [1.5.0](v1.4.3...v1.5.0) -
2026-02-16

### Added

- *(supervisor)* use process groups for atomic daemon shutdown
([#239](#239))
- SIGKILL after SIGTERM
([#238](#238))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Release metadata only (version/changelog updates) with no functional
code changes in this PR.
> 
> **Overview**
> Prepares the `v1.5.0` release by bumping the `pitchfork-cli` version
from `1.4.3` to `1.5.0` in `Cargo.toml`/`Cargo.lock`.
> 
> Updates `CHANGELOG.md` with a new `1.5.0` entry noting supervisor
shutdown improvements (process groups for atomic shutdown) and enforcing
`SIGKILL` after `SIGTERM`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
a3def3b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants