Skip to content

Conversation

@cdown
Copy link
Member

@cdown cdown commented Oct 29, 2018

See #7624 and #10473.

Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with cgroup_disable=cpu on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of [Default]CPUAccounting, [Default]MemoryAccounting, and their
ilk. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. CPUWeight), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.

Test Plan

Overall test doing grep -v memory with MemoryMax=5G in nomem.slice and DisableController=memory in nomemleaf.slice:

# systemctl status nomemleaf
* nomemleaf.service - Nomem Leaf Service
Loaded: loaded (/etc/systemd/system/nomemleaf.service; static; vendor preset: disabled)
Active: inactive (dead)

Nov 26 12:53:48 systemdtest systemd[1]: Starting Nomem Leaf Service...
Nov 26 12:53:48 systemdtest grep[516]: pids
Nov 26 12:53:48 systemdtest systemd[1]: nomemleaf.service: Succeeded.
Nov 26 12:53:48 systemdtest systemd[1]: Started Nomem Leaf Service.
Nov 26 12:53:48 systemdtest systemd[1]: nomemleaf.service: Consumed 1ms CPU time.

I've also tested this on -.slice, where it successfully removes the entry from cgroup.subtree_control.

DisableControllers also correctly shows up in systemctl show, showing DisableControllers=memory for nomem.slice, and nothing in nomem.service.

@yuwata yuwata added the pid1 label Oct 29, 2018
@cdown cdown force-pushed the disable_controller branch 5 times, most recently from 98d88c3 to ba089e1 Compare October 31, 2018 11:44
@cdown cdown changed the title cgroup: Add DisableController= directive to disable controller in subtree cgroup: Add DisableControllers= directive to disable controller in subtree Oct 31, 2018
@cdown cdown force-pushed the disable_controller branch 3 times, most recently from 5ce3495 to 34821a7 Compare October 31, 2018 16:11
@cdown
Copy link
Member Author

cdown commented Nov 1, 2018

CI failures appear unrelated:

This is still fine for review.

@evverx
Copy link
Contributor

evverx commented Nov 1, 2018

i386 failure in startup

That means that qemu turned on yesterday seems to be working :-) Anyway, it seems to be #10093, which is going to annoy everybody now that the test is really run.

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice in general.

@poettering
Copy link
Member

would love to review this, can you rebase on current git, and address @keszybz' points?

@cdown
Copy link
Member Author

cdown commented Nov 21, 2018

Yeah, I've been working on this but found a small issue which I need to fix first. Probably will have an update ready for review tomorrow. :-)

@cdown cdown force-pushed the disable_controller branch 3 times, most recently from 28e8fa3 to 993bb3f Compare November 26, 2018 13:53
@cdown
Copy link
Member Author

cdown commented Nov 26, 2018

@poettering @keszybz CI passes (pending arm64, but nothing specific there). This is ready for review. :)

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 26, 2018
@cdown cdown force-pushed the disable_controller branch 3 times, most recently from 707fe2c to a3d7631 Compare November 27, 2018 15:50
@cdown
Copy link
Member Author

cdown commented Nov 27, 2018

@poettering Ready for next round of reviews.

cdown added 2 commits December 3, 2018 14:37
We always end up doing these together, so just colocate them and require
manager state for unit_create_cgroup.
systemd currently doesn't really expend much effort in disabling
controllers. unit_realize_cgroup_now *may* be able to disable a
controller in the basic case when using cgroup v2, but generally won't
manage as downstream dependents may still use it.

This code doesn't add any logic to fix that, but it starts the process
of moving to have a breadth-first version of unit_realize_cgroup_now for
enabling, and a depth-first version of unit_realize_cgroup_now for
disabling.
This adds a depth-first version of unit_realize_cgroup_now which can
only do depth-first disabling of controllers, in preparation for the
DisableController= directive.
@cdown cdown force-pushed the disable_controller branch from a3d7631 to ca8137d Compare December 3, 2018 14:38
@cdown
Copy link
Member Author

cdown commented Dec 3, 2018

@poettering @keszybz This PR is pretty susceptible to merge conflicts and then having to go through the whole rebase/resolve/go through CI process again, any chance you could take a look soon? :-)

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just two nitpicks

…btree

Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
@cdown cdown force-pushed the disable_controller branch from ca8137d to c72703e Compare December 3, 2018 15:59
@poettering poettering added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Dec 3, 2018
@poettering
Copy link
Member

lgtm! thanks!

@poettering poettering merged commit a365325 into systemd:master Dec 3, 2018
@cdown cdown deleted the disable_controller branch December 4, 2018 13:31
@cdown cdown restored the disable_controller branch December 4, 2018 13:32
facebook-github-bot pushed a commit to facebookarchive/rpm-backports that referenced this pull request Dec 11, 2018
Summary:
Rebase two fixes onto the version merged upstream:
systemd/systemd#10507
systemd/systemd#10567
and backport a few more:
systemd/systemd#10411
systemd/systemd#10493
systemd/systemd#10757
systemd/systemd#10876

These are almost all cgroup2 related.

Reviewed By: cdown

Differential Revision: D13351498

fbshipit-source-id: 87c8428d48dbb0eb2ae7d34f7381fff88f83872f
in its subtree, the controller will be removed from the subtree. This can be used to avoid child units being
able to implicitly or explicitly enable a controller. Defaults to not disabling any controllers.</para>

<para>It may not be possible to successfully disable a controller if the unit or any child of the unit in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this child-over-parents policy chosen? I can see that disabling and delegating a controller on the same unit is non-sensical config, however, I fail to see why children can override with their Delegate= requests.

(Posting to the commit as it's closest to the origin, not sure if this sends notifications, so pasting a handle @cdown. Let me know if this dicussion should be redirected elsewhere.)

Copy link
Contributor

@Werkov Werkov Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I realize. We'd need to know the subtree of delegated children in order to do the "bottom-up" disabling. (Which we in theory know (with races though) but we've commited not to interfere with Delegate=.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cgroups good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed pid1

Development

Successfully merging this pull request may close these issues.

6 participants