feat(supervisor): use process groups for atomic daemon shutdown#239
feat(supervisor): use process groups for atomic daemon shutdown#239
Conversation
Spawn each daemon in its own session via setsid() so that stopping a daemon can kill the entire process tree with a single killpg() call instead of walking /proc to find children sequentially. This eliminates race conditions where a child could fork between the scan and the kill, and is faster for multi-child daemons since the whole group receives the signal atomically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of ChangesHello @jdx, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the daemon termination process to be more reliable and efficient. By leveraging Unix process groups, the system can now ensure that when a daemon is stopped, all its associated child processes are also terminated atomically. This change addresses potential race conditions and improves the overall stability of daemon management. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a more robust and atomic way to shut down daemons by using process groups. The change from iterating through child processes to using killpg is a significant improvement, eliminating potential race conditions. The implementation correctly uses setsid when spawning processes and provides a graceful shutdown sequence with SIGTERM followed by SIGKILL. The code is well-structured, and the changes are also reflected in the tests. I have one suggestion to refactor some duplicated code for better maintainability, but overall this is a solid improvement.
src/procs.rs
Outdated
| // Fast check: 10ms intervals for first 100ms | ||
| for i in 0..10 { | ||
| std::thread::sleep(std::time::Duration::from_millis(10)); | ||
| self.refresh_pids(&[pid]); | ||
| if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) { | ||
| debug!( | ||
| "process group {pgid} terminated after SIGTERM ({} ms)", | ||
| (i + 1) * 10 | ||
| ); | ||
| return true; | ||
| } | ||
| } | ||
|
|
||
| // Slower check: 50ms intervals for up to ~3 more seconds | ||
| for i in 0..58 { | ||
| std::thread::sleep(std::time::Duration::from_millis(50)); | ||
| self.refresh_pids(&[pid]); | ||
| if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) { | ||
| debug!( | ||
| "process group {pgid} terminated after SIGTERM ({} ms)", | ||
| 100 + (i + 1) * 50 | ||
| ); | ||
| return true; | ||
| } | ||
| } |
There was a problem hiding this comment.
The waiting logic with polling after sending SIGTERM is implemented in two separate loops, which leads to some code duplication. This can be refactored into a single loop using an iterator chain to make the code more concise and easier to maintain.
// Wait for graceful shutdown, with a fast initial check.
let fast_checks = std::iter::repeat(std::time::Duration::from_millis(10)).take(10);
let slow_checks = std::iter::repeat(std::time::Duration::from_millis(50)).take(58);
let mut elapsed_ms = 0;
for sleep_duration in fast_checks.chain(slow_checks) {
std::thread::sleep(sleep_duration);
self.refresh_pids(&[pid]);
elapsed_ms += sleep_duration.as_millis() as u64;
if self.is_terminated_or_zombie(sysinfo::Pid::from_u32(pid)) {
debug!(
"process group {pgid} terminated after SIGTERM ({} ms)",
elapsed_ms
);
return true;
}
}When the group leader is a zombie, child processes may still be running. The early return on zombie detection would skip killpg(), leaving those children as orphans. Remove the zombie check before killpg() and let ESRCH from killpg() handle the "no such process group" case instead. Also refactors the two polling loops into a single iterator chain per reviewer feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the supervisor’s daemon shutdown strategy to use Unix process groups, enabling an atomic “kill the whole daemon tree” operation and avoiding races inherent in /proc-walking.
Changes:
- Spawn daemons in their own session/process group via
setsid()(Unix-only) to make the daemon PID the process group ID. - Replace per-child termination logic with a single process-group kill path (
killpg) with SIGTERM→wait→SIGKILL escalation. - Minor test cleanup/refactor (remove unused import; simplify/adjust port cleanup loop).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/supervisor/lifecycle.rs |
Adds setsid() at spawn time and switches stop logic to process-group kill. |
src/procs.rs |
Introduces process-group kill implementation and keeps old /proc traversal helper for diagnostics. |
tests/common/mod.rs |
Refactors kill_port to avoid sleeping per PID and uses a let-chain style condition. |
tests/test_e2e_logs.rs |
Removes an unused chrono import. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| debug!("killing process group {pgid}"); | ||
|
|
||
| // Send SIGTERM to the entire process group. | ||
| // killpg sends to all processes in the group atomically. |
There was a problem hiding this comment.
kill_process_group() uses the leader PID’s status (is_terminated_or_zombie(pid)) as the stopping condition. That can return early/false while other processes in the process group are still running (e.g., the group leader exits quickly but children keep running/ignore SIGTERM), which can leave orphans even though the API reports success. Consider checking process-group existence instead (e.g., poll killpg(pgid, 0) until it returns ESRCH, and don’t treat leader exit/zombie as sufficient).
| if err.raw_os_error() == Some(libc::ESRCH) { | ||
| debug!("process group {pgid} no longer exists"); | ||
| return false; | ||
| } | ||
| warn!("failed to send SIGTERM to process group {pgid}: {err}"); | ||
| } | ||
|
|
||
| // Wait for graceful shutdown: fast initial check then slower polling. | ||
| let fast_checks = std::iter::repeat_n(std::time::Duration::from_millis(10), 10); |
There was a problem hiding this comment.
When libc::killpg() fails with an error other than ESRCH, the function only logs a warning and continues, ultimately returning true. That can cause stop() to mark a daemon as Stopped even though no signals were delivered (e.g., EPERM). Consider propagating this as an error/false return, or at least validating the process group is actually gone (again via killpg(pgid, 0) -> ESRCH) before returning success.
| // Kill the entire process group atomically (daemon PID == PGID | ||
| // because we called setsid() at spawn time) | ||
| if let Err(e) = PROCS.kill_process_group_async(pid).await { |
There was a problem hiding this comment.
stop() only attempts the new process-group kill inside the if PROCS.is_running(pid) branch. With process groups, the group may still contain running children even if the original leader PID has already exited, so this guard can skip the killpg path and leave orphan processes. Consider attempting the process-group kill based on the stored PGID regardless of leader liveness (and let killpg(..., 0)/ESRCH determine whether anything remains).
## 🤖 New release * `pitchfork-cli`: 1.4.3 -> 1.5.0 <details><summary><i><b>Changelog</b></i></summary><p> <blockquote> ## [1.5.0](v1.4.3...v1.5.0) - 2026-02-16 ### Added - *(supervisor)* use process groups for atomic daemon shutdown ([#239](#239)) - SIGKILL after SIGTERM ([#238](#238)) </blockquote> </p></details> --- This PR was generated with [release-plz](https://github.com/release-plz/release-plz/). <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Release metadata only (version/changelog updates) with no functional code changes in this PR. > > **Overview** > Prepares the `v1.5.0` release by bumping the `pitchfork-cli` version from `1.4.3` to `1.5.0` in `Cargo.toml`/`Cargo.lock`. > > Updates `CHANGELOG.md` with a new `1.5.0` entry noting supervisor shutdown improvements (process groups for atomic shutdown) and enforcing `SIGKILL` after `SIGTERM`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit a3def3b. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
Summary
setsid()so the daemon PID becomes the process group leader (PID == PGID)/proc-walking child kill logic with a singlekillpg()call that atomically signals the entire process groupTest plan
cargo build— clean, no warningscargo nextest run— all 154 tests pass (includingtest_stop_kills_child_processes)cargo clippyandcargo fmt --check— cleannpm run dev), stop it, verify no orphans🤖 Generated with Claude Code
Note
Medium Risk
Changes core daemon lifecycle behavior and low-level signal/session handling, which could affect shutdown reliability or accidentally signal the wrong processes if PGID assumptions break.
Overview
Daemon shutdown is now process-group based. On Unix, daemon spawn now calls
setsid()so each daemon becomes its own process-group leader, enabling atomic signaling of the whole tree.Stopping a daemon switches from
/proc-walking and sequentially killing children + parent to a singlekillpg()-based graceful shutdown path (SIGTERM with ~3s polling, then SIGKILL fallback), with a non-Unix fallback to the existing single-process kill.Minor test cleanups: simplify
kill_port’slsofhandling and drop an unusedchronoimport.Written by Cursor Bugbot for commit 5c0dfb6. This will update automatically on new commits. Configure here.