Skip to content

Conversation

@erikstmartin
Copy link
Contributor

@erikstmartin erikstmartin commented Jun 10, 2016

- What I did
Added support for two new flags --cpu-rt-period and --cpu-rt-runtime combined with either the --privileged flag or --cap-add=sys_nice this allows processes within a container to prioritize themselves as real-time when CONFIG_RT_GROUP_SCHED is enabled in the kernel. See #22380. When this option is enabled each docker container's cgroup has to have cpu.rt_runtime_us set in order for the container to be able to prioritize its threads.

- How I did it

Support for changing the cgroup value already existed in runc, I updated all the necessary container configs to pass these values down to runc. I also implemented some sanity checks in the daemon.

- How to verify it
Ensure CONFIG_RT_GROUP_SCHED is enabled in the kernel (standard CentOS 7 install has this enabled). You can verify it's enabled by running
zcat /proc/config.gz | grep CONFIG_RT_GROUP_SCHED or by checking for the existence of /sys/fs/cgroup/cpu.rt_runtime_us

You will also need to ensure that the system and all parent cgroups have values set for the period and runtime as this limits what the children can be set to. If you're familiar with these values you can set them appropriately, sane defaults would be rt_period_us = 1000000 and rt_runtime_us = 950000, they only need to be set once on the system.

sysctl kernel.sched_rt_period_us
sysctl kernel.sched_rt_runtime_us

cat /sys/fs/cgroup/cpu.rt_period_us
cat /sys/fs/cgroup/cpu.rt_runtime_us

cat /sys/fs/cgroup/docker/cpu.rt_period_us
cat /sys/fs/cgroup/docker/cpu.rt_runtime_us

Update: You can now set the values for the parent cgroups with a new set of daemon flags. The system value will still need to be set using sysctl.
docker daemon --cpu-rt-runtime=950000

Now you have a system you can test on. To validate this works start a container with the new flags (we'll add a ulimit just to test that works as expected too)

docker run --it --cpu-rt-runtime=95000 --ulimit rtprio=99 --cap-add=sys_nice debian:jessie

If you receive an error like this, it's because your parent cgroups dont have rt_runtime_us set, or it's a lower value than you're trying to set for your container. Also ensure they are set for your container /sys/fs/cgroup/docker/<container id>/cpu.rt_runtime_us (after the container has been started). if you're testing this inside a container.

docker: Error response from daemon: oci runtime error: process_linux.go:286: setting cgroup config for ready process caused "failed to write 950000 to cpu.rt_runtime_us: write /sys/fs/cgroup/cpu,cpuacct/docker/5024d2730a83e86efd9042ea99972d582c66830f4a3aa21c747d1c25c824eb75/cpu.rt_runtime_us: invalid argument".

Once inside the container you can attempt to change the priority of your shell
chrt -r -p 99 1

Then validate by calling
chrt -p 1

root@c1d2f4b7d579:/# chrt -r -p 99 1
root@c1d2f4b7d579:/# chrt -p 1
pid 1's current scheduling policy: SCHED_RR
pid 1's current scheduling priority: 99

You can also test the following error conditions.

  • Passing either flag without having the sys_nice capability or being privileged results in error (no point setting your runtime if you dont have permission to change your priority)
  • Passing either flag on a host with the kernel option disabled results in error
  • Passing a runtime greater than the period results in an error

- Description for the changelog
Support for --cpu-rt-period and --cpu-rt-runtime flags, allowing containers to run real-time threads when CONFIG_RT_GROUP_SCHED is enabled in the kernel.

- A picture of a cute animal (not mandatory but encouraged)

Picture of Gopher

@justincormack
Copy link
Contributor

Thanks for working on this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I would check this, there are potentially other ways in which we could get capabilities, so I would just check the return codes when you actually try to change stuff and try to give a helpful message if it fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove just the capabilities check? or all the param sanity checking?

Also you wont see any errors until you make the syscall to change your priority, which would likely just be a permissions error and not overly helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capabilities check, it is a bit ugly and encodes a lot of presumed information about how capabilities are assigned which could change in future, and we don't do it anywhere else. EPERM in the syscall is pretty clear and you can provide a more helpful message there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got overly ambition trying to save people from themselves. I'll get this removed.

@erikstmartin
Copy link
Contributor Author

No problem, sorry for the crazy delay, been insanely caught up on other tasks.

Let me know if you have any questions/problems trying to test this. I admit it's a lot of steps to get setup to test if you don't already have this kernel param enabled, then the Docker inception thing, having to make sure the container you're testing inside of has rt_runtime. You almost need this feature to test this feature :)

@erikstmartin
Copy link
Contributor Author

I've removed the capabilities check and associated unit tests, PTAL.

@erikstmartin
Copy link
Contributor Author

CI Failures are from the manual changes within vendor/ I was advised to do that for now here so that everything could be reviewed in one place.

Once docker-archive-public/docker.engine-api#262 is merged, this PR can be updated to pull in upstream and that failure will disappear.

@GordonTheTurtle GordonTheTurtle added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Jun 13, 2016
@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Jun 13, 2016
@thaJeztah
Copy link
Member

@erikstmartin yes, don't worry about the "failing-ci" label for now, it's just that we know it's failing (in this case due to the "vendor" CI). Should not be an issue while in design-review 👍

@erikstmartin
Copy link
Contributor Author

I've updated this PR to include a --cpu-rt-runtime flag to the daemon. The daemon needs to be able to set the cgroup value otherwise a sysadmin will need to set the value after the first container is started (after every reboot)

@erikstmartin erikstmartin force-pushed the realtime-threads branch 2 times, most recently from 3a39af7 to fe2f46e Compare July 20, 2016 19:58
@jalaziz
Copy link

jalaziz commented Jul 21, 2016

I haven't had a chance to test this, but I believe the daemon flag won't actually work correctly with newer systemd versions. Until opencontainers/runc#931 is fixed, the cpu cgroup path ends up being /sys/fs/cgroup/cpu/init.scope/system.slice. In order for everything to work correctly, rt_runtime and rt_period have to be recursively set down the path.

@erikstmartin
Copy link
Contributor Author

@jalaziz I'll double check, but i'm pretty sure I was developing this against 230.

@jalaziz
Copy link

jalaziz commented Jul 25, 2016

@erikstmartin For reference, we're running CoreOS with systemd v229. However, I believe the problem exists on other OSes with newer systemd versions if you use --exec-opt native.cgroupdriver=systemd (I'm sure you already knew this).

I'm not sure if the cgroupParent is correct for the CPU subsystem when you call initCgroupsPath. Even if it is, though, the testing I've done seems to indicate you'd have to recursively set the runtime and period settings (e.g. for init.scope, then for init.scope/system.slice).

@erikstmartin
Copy link
Contributor Author

@jalaziz You're completely correct. I dont think I tried it using the systemd cgroup driver on that machine.

@erikstmartin erikstmartin force-pushed the realtime-threads branch 3 times, most recently from 269ee59 to 1b749b4 Compare August 16, 2016 19:51
@erikstmartin
Copy link
Contributor Author

Latest push should handle climbing the cgroup tree and also account for init.scope in newer versions of systemd

@cpuguy83
Copy link
Member

🎉
Thanks @erikstmartin!

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Jan 16, 2017
Pull request moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PR's added new flags to docker 1.13 that affect the API;

- moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime`
- moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby#27702 added `--network` to `docker build`
- moby#25962 added `--attachable` to `docker network create`
- moby#27998 added `--compose-file` to `docker stack deploy`
- moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby#26061 added `--init` to `docker run` and `docker create`
- moby#26941 added `--init-path` to `docker run` and `docker create`
- moby#27958 added `--cpus` on `docker run` / `docker create`
- moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby#27596 added `--force` to `docker service update`
- moby#27857 added `--hostname` to `docker service create`
- moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@huster-hh
Copy link

@erikstmartin thank you for reading my question.
i meet the same error you said.i submit my question in #31411
you said 'If you receive an error like this, it's because your parent cgroups dont have rt_runtime_us set, or it's a lower value than you're trying to set for your container.'
however i can not change docker cgroup 's rt_runtime_us to more than 0.
can you tell me how you solve this error??
Thank you very much !!!Thank you !

@erikstmartin
Copy link
Contributor Author

@IT-huang This is going to be dependent on what the parent cgroup of docker is, is it user.slice, etc. You need to make sure that the parents also have rt_runtime allocated. This could also be at the system level of the host itself. Check sysctl kernel.sched_rt_runtime_us and sysctl kernel.sched_rt_period_us

erikstmartin added a commit to erikstmartin/docker that referenced this pull request Mar 14, 2017
cpu.rt_runtime

PR moby#23430 introduced a couple more
flags including `--cpu-rt-runtime` to the docker daemon. It appears
recent changes or merge issues may have broken this. It currently does
not take the cgroup mount point into account when determining the cgroup
files to write values to. This breaks docker setting its own
`cpu.rt_runtime` for the daemon. This also means containers aren't able
to set theirs.

Also, the cgroups.FindCgroupMountpointAndRoot returns back a mount point
that includes the cgroup of the currently running container when docker
is run inside a docker container. this breaks the `--cpu-rt-runtime`
flag when running docker in docker. A fix has been placed here, but
potentially could be pulled up into libcontainer if this is a better
place for it.

Signed-off-by: Erik St. Martin <alakriti@gmail.com>
andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request May 19, 2017
Pull request moby/moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PR's added new flags to docker 1.13 that affect the API;

- moby/moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime`
- moby/moby#27800 / moby/moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby/moby#27702 added `--network` to `docker build`
- moby/moby#25962 added `--attachable` to `docker network create`
- moby/moby#27998 added `--compose-file` to `docker stack deploy`
- moby/moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby/moby#26061 added `--init` to `docker run` and `docker create`
- moby/moby#26941 added `--init-path` to `docker run` and `docker create`
- moby/moby#27958 added `--cpus` on `docker run` / `docker create`
- moby/moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby/moby#27596 added `--force` to `docker service update`
- moby/moby#27857 added `--hostname` to `docker service create`
- moby/moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby/moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby/moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby/moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 5d2722f83db9e301c6dcbe1c562c2051a52905db
Component: cli
albers added a commit to albers/docker-cli that referenced this pull request Jul 3, 2017
This adds bash completion for moby/moby#23430.

Signed-off-by: Harald Albers <github@albersweb.de>
andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request Jul 14, 2017
This adds bash completion for moby/moby#23430.

Signed-off-by: Harald Albers <github@albersweb.de>
Upstream-commit: 74a5d1af86a3d0898dda79bb13282f1b039b015c
Component: cli
alshabib pushed a commit to alshabib/cli that referenced this pull request Aug 1, 2017
This adds bash completion for moby/moby#23430.

Signed-off-by: Harald Albers <github@albersweb.de>
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Jun 21, 2019
These options configure the parent cgroup, not the default for containers,
nor the daemon itself, so adding that information to the flag description
to make this slightly more clear.

relates to 56f77d5 (moby#23430) which implemented
these flags.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
burnMyDread pushed a commit to burnMyDread/moby that referenced this pull request Oct 21, 2019
These options configure the parent cgroup, not the default for containers,
nor the daemon itself, so adding that information to the flag description
to make this slightly more clear.

relates to 56f77d5 (moby#23430) which implemented
these flags.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: zach <Zachary.Joyner@linux.com>
}

cpuRealtimeRuntime := cgroupEnabled(mountPoint, "cpu.rt_runtime_us")
if !quiet && !cpuRealtimeRuntime {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not make sense to check these two files separately -- this is a single kernel feature, so these two are either both present or not (and if one present and the other is not this is some kind of a gross kernel error).

I am trying to fix this in #41014

s.Linux.Resources.OOMScoreAdj = &c.HostConfig.OomScoreAdj
s.Linux.Sysctl = c.HostConfig.Sysctls

p := *s.Linux.CgroupsPath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no need to do all of the below in case CPU RT parameters are not set, or the kernel does not support CPU RT sched.


daemon.initCgroupsPath(filepath.Dir(path))

_, root, err := cgroups.FindCgroupMountpointAndRoot("cpu")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call is very expensive and ideally should only be done once.

}

path = filepath.Join(root, path)
sysinfo := sysinfo.New(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very expensive way to check whether CPU RT scheduler is supported, and it's also performed way too late in the code, and also multiple times :(

@kolyshkin
Copy link
Contributor

Trying to address some of the issues in #41014 and #41016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.