Skip to content

Conversation

@zq-david-wang
Copy link

@zq-david-wang zq-david-wang commented Apr 24, 2019

When setting/updating child cpu cgroup cfs_quota_us and cfs_period_us, the order is significant because the child node should not have a higher quota/period ratio than its parent. This means if always setting period first would cause kernel error.
For example, say parent node has quota/period set to 10000/10000, and we want to change child to 1000/1000, if set period first, than after the period change and before the quota change, the ration is 10000/1000, and this setting make child node have larger quota ratio than its parent and rejected by kernel.

This patch would get current child value first, and make sure the first setting always make the quota/period ratio lower than current (valid) ratio.
EDIT:
A simpler strategy is to use a period->quota---->period sequence, if setting period failed, retry setting it after seting quota.

And there is an issue reported in kubernetes project
kubernetes/kubernetes#76704

EDIT:
To trigger the issue, we need

  1. the parent cgroup node has a non-infinite cgroup quota
  2. Before setting quota, the current child node's quota is non-infinite.
    And when using systemd cgroup driver with kubernetes/runc, both conditions are met...
    kubelet would create a parent node for pod, and set its quota according to resource limit, and when create child node via systemd cgroup driver, systemd would set peirod to 100ms and quota to a calculated value other than -1. Then further change to quota/period would be rejected by liinux kernel if child node have higher quota/period ratio after setting period and before setting quota.

}
if cgroup.Resources.CpuQuota != 0 {
// if quota is already set, don't set it again.
if (!qs) && cgroup.Resources.CpuQuota != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just do if !qs && ....

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed~

@zq-david-wang
Copy link
Author

Just realize that the code only address the situation for changing leaf node of a cgroup tree. When a parent node is changed, it should make it larger first.. And worse, if one want to change a node in the middle of the cgroup tree, it becoms a mission impossible...

Say we have three cgroup nodes a, b, c, a is b's parent, and b is c's parent
a,b and c have same quota/period values, say 10000/10000, and we want to change b to 100000/100000. Changing quota first would make b's ratio 100000/10000, and its parent node a would complain, and if chaning period first, b's ration 10000/100000 would be lower than its child.....

@zq-david-wang
Copy link
Author

My understanding is that runc is mostly used to manage leaf node of cgroup tree, right? If so, this fix should work "most of the time"...

@cyphar
Copy link
Member

cyphar commented Apr 25, 2019

Yes, runc only ever manages the leaf of tree. If the admin decides to misconfigure the cgroup tree to make runc crash there isn't much we can (or should) do about it.

@zq-david-wang
Copy link
Author

I have make changes to use a simpler strategy: set period first, if failed than retry if after setting quota.
This way, it should handle most of cases except the impossible one I mentioned before.

@zq-david-wang
Copy link
Author

c44aec9
This commit make the initial value of cpu.cfs_quota_us non-infinite via systemd,

}
// The order of setting cfs_quota_us and cfs_period_us is significant, since cgroup child node
// should not have a higher quota/period ratio than its parent.
// Use period->quota-->period sequence to make sure no matther parent or child node is changed,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matther -> matter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed~!

@crosbymichael
Copy link
Member

One question, why not use re-order the logic on what is set first instead of this re-order bool?

@zq-david-wang
Copy link
Author

zq-david-wang commented May 8, 2019

@crosbymichael Thanks for the comments.
My first version of the code is to reorder the sequence to make the ratio smaller first by reading current node's period and comparing it with new value. If the new period is larger than current period, set period first, otherwise set quota first.
But this only work when current cgroup node is a leaf of cgroup tree. Eventhough this is how runc is used mostly nowadays, I feel it would be better to address both situation incase one day runc is used to change some non-leaf node of cgroup tree..... If the current node is not a leaf, to reorder, detecting the existance of child node is needed and would make code complicated in my opinion. So I choose a simpler one: retry.

I can put the first version of code changes in another PR if necessary.

@zq-david-wang
Copy link
Author

@crosbymichael @cyphar
Just found out another solution,

  1. change node's cpu quota to -1
  2. change period
  3. change quota
    This sequence can handle all situations no matter where the position of current node is in the cgroup tree, as long as the final quota/period ratio is valid regarding current node's parent and children.

Need your opinion on this.

Signed-off-by: zq-david-wang <00107082@163.com>
@AkihiroSuda
Copy link
Member

Please add tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants