Skip to content

cgroups: reapp child processes before destroying cgroup#13135

Draft
Mic92 wants to merge 1 commit into
NixOS:masterfrom
Mic92:fix-cgroups
Draft

cgroups: reapp child processes before destroying cgroup#13135
Mic92 wants to merge 1 commit into
NixOS:masterfrom
Mic92:fix-cgroups

Conversation

@Mic92

@Mic92 Mic92 commented May 4, 2025

Copy link
Copy Markdown
Member

killing is enough to destroy a cgroup, we need to remove the wait state of zombie processes before we can destroy the cgroup.

Motivation

Context


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

killing is enough to destroy a cgroup, we need to remove the wait state
of zombie processes before we can destroy the cgroup.
@Mic92 Mic92 requested a review from edolstra as a code owner May 4, 2025 04:49
@Mic92 Mic92 added the backport 2.28-maintenance Automatically creates a PR against the branch label May 4, 2025
@Mic92

Mic92 commented May 5, 2025

Copy link
Copy Markdown
Member Author

Tested this on a busy CI machine a bit and didn't any breakage yet that we saw before in nix-community and NixOS infra.

if (kill(pid, SIGKILL) == -1 && errno != ESRCH)
throw SysError("killing member %d of cgroup '%s'", pid, cgroup);

while (waitpid(pid, nullptr, 0) == -1) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it matters, but I think this only works for processes that are direct children of the current process? For others it will return ECHILD according to the manpage.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also seems like it doesn't work correctly because there some other code that uses waitpid and crashes if there is no wait result.

@Mic92 Mic92 marked this pull request as draft May 6, 2025 10:03
@Mic92

Mic92 commented May 7, 2025

Copy link
Copy Markdown
Member Author

My current approach causes this crash:

May 06 11:03:17 web01 nix-daemon[862435]: 25# 0x0000000000480339 in nix-daemon
May 06 11:03:17 web01 nix-daemon[862435]: 26# 0x00007F5898DC647E in /nix/store/vbrdc5wgzn0w1zdp10xd2favkjn5fk7y-glibc-2.40-66/lib/libc.so.6
May 06 11:03:17 web01 nix-daemon[862435]: 27# __libc_start_main in /nix/store/vbrdc5wgzn0w1zdp10xd2favkjn5fk7y-glibc-2.40-66/lib/libc.so.6
May 06 11:03:17 web01 nix-daemon[862435]: 28# 0x0000000000484E65 in nix-daemon
May 06 11:03:18 web01 nix-daemon[1018942]: Nix crashed. This is a bug. Please report this at https://github.com/NixOS/nix/issues with the following information included:
May 06 11:03:18 web01 nix-daemon[1018942]: Exception: nix::SysError: error: cannot get exit status of PID 1035244: No child processes
May 06 11:03:18 web01 nix-daemon[1018942]: Stack trace:
May 06 11:03:18 web01 nix-daemon[1018942]: Nix crashed. This is a bug. Please report this at https://github.com/NixOS/nix/issues with the following information included:
May 06 11:03:18 web01 nix-daemon[1018942]: Exception: nix::SysError: error: cannot get exit status of PID 1035244: No child processes
May 06 11:03:18 web01 nix-daemon[1018942]: Stack trace:
May 06 11:03:18 web01 nix-daemon[1018942]:  0# 0x00000000004AAB09 in nix-daemon
May 06 11:03:18 web01 nix-daemon[1018942]:  1# 0x00007F58990901CA in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
May 06 11:03:18 web01 nix-daemon[1018942]:  2# __cxa_call_terminate in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
May 06 11:03:18 web01 nix-daemon[1018942]:  3# __gxx_personality_v0 in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
May 06 11:03:18 web01 nix-daemon[1018942]:  4# 0x00007F5898FC5A19 in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libgcc_s.so.1
May 06 11:03:18 web01 nix-daemon[1018942]:  5# _Unwind_RaiseException in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libgcc_s.so.1
May 06 11:03:18 web01 nix-daemon[1018942]:  6# __cxa_throw in /nix/store/7n3q3rgy5382di7ccrh3r6gk2xp51dh7-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
May 06 11:03:18 web01 nix-daemon[1018942]:  7# 0x00007F5899C266CB in /nix/store/gx3g56w8ss5n8k9vpbr02gadh335ml1l-nix-util-2.29.0pre/lib/libnixutil.so

@edolstra

edolstra commented May 7, 2025

Copy link
Copy Markdown
Member

Not sure I understand the problem. Zombie processes should be reaped by init, so they should disappear from the cgroup pretty quickly.

@nixos-discourse

Copy link
Copy Markdown

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2025-05-07-nix-team-meeting-minutes-224/63973/1

@Mic92

Mic92 commented May 7, 2025

Copy link
Copy Markdown
Member Author

Not sure I understand the problem. Zombie processes should be reaped by init, so they should disappear from the cgroup pretty quickly.

Are they not parent of nix-daemon still? At least I am able to call waitpid() on some of them.

@NaN-git

NaN-git commented May 12, 2025

Copy link
Copy Markdown
Contributor

Woudn't be the right approach to wait for processes with parents in the cgroup first? Of course before doing this SIGKILL has to be sent. Then no other process can wait on these processes.

The question is whether it's guaranteed that other processes will wait for the killed processes with parents not in the cgroup immediately. Maybe waiting for these processed should be tried after some timeout.

@Mic92

Mic92 commented May 12, 2025

Copy link
Copy Markdown
Member Author

Woudn't be the right approach to wait for processes with parents in the cgroup first? Of course before doing this SIGKILL has to be sent. Then no other process can wait on these processes.

The question is whether it's guaranteed that other processes will wait for the killed processes with parents not in the cgroup immediately. Maybe waiting for these processed should be tried after some timeout.

It probably improve but it also looks like some of the parents are actually other nix-daemon processes, at least looking at the stacktrace. I will try to add more logging to understand better what is going on here.

@NaN-git

NaN-git commented May 12, 2025

Copy link
Copy Markdown
Contributor

It probably improve but it also looks like some of the parents are actually other nix-daemon processes, at least looking at the stacktrace. I will try to add more logging to understand better what is going on here.

Do you know whether

writeFile(killFile, "1");

is executed?
Otherwise your patch executes the killing and waiting for the processes in an order, which is obviously problematic.

@Mic92

Mic92 commented May 13, 2025

Copy link
Copy Markdown
Member Author

It probably improve but it also looks like some of the parents are actually other nix-daemon processes, at least looking at the stacktrace. I will try to add more logging to understand better what is going on here.

Do you know whether

writeFile(killFile, "1");

is executed? Otherwise your patch executes the killing and waiting for the processes in an order, which is obviously problematic.

Could you explain why it is problematic to kill processes out of order? It is not obvious to me when we are going to clear the whole process tree anyhow, why need to follow a certain order.

@NaN-git

NaN-git commented May 13, 2025

Copy link
Copy Markdown
Contributor

It probably improve but it also looks like some of the parents are actually other nix-daemon processes, at least looking at the stacktrace. I will try to add more logging to understand better what is going on here.

Do you know whether

writeFile(killFile, "1");

is executed? Otherwise your patch executes the killing and waiting for the processes in an order, which is obviously problematic.

Could you explain why it is problematic to kill processes out of order? It is not obvious to me when we are going to clear the whole process tree anyhow, why need to follow a certain order.

The order of killing and waiting matters. Let's assume a child is killed while its parent is still running, which can happen when writeFile(killFile, "1") isn't executed. Then the parent might try to wait for the child after this was done here already, causing a crash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.28-maintenance Automatically creates a PR against the branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants