[core] (cgroups 3/n) Creating CgroupManager to setup Ray's cgroup hierarchy and clean it up#56186
to perform cgroup operations. Signed-off-by: irabbani <irabbani@anyscale.com>
instead of clone for older kernel headers < 5.7 (which is what we have in CI) Signed-off-by: irabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
…irabbani/cgroups-1
fix CI. Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Test failures look unrelated (from serve)
hdrs = [
    "cgroup_manager_interface.h",
],
target_compatible_with = [
int64_t max_cpu_weight = supported_constraints_.at(kCPUWeightConstraint).Max();
int64_t application_cgroup_cpu_weight = max_cpu_weight - system_reserved_cpu_weight;

RAY_LOG(INFO) << absl::StrFormat(
Seems a little noisy for INFO level, but it is only logged once per raylet startup, so it should be OK.
As an SRE, I found log lines like this one very useful when debugging issues. As a rule of thumb I think we should log the configuration each component starts up with (especially if it's only created once in the lifecycle of the application).
Sounds good. This one is nice that all of the info is logged in one place. We have some other startup logs that are noisy because we log each bit in a separate log line from different components.
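The weight arithmetic in the snippet above, together with the "log the whole startup configuration in one place" idea from this thread, can be sketched as follows. This is a minimal illustration using only the standard library, not the code in this PR: `CgroupStartupConfig`, `ApplicationCpuWeight`, `FormatStartupConfig`, and the message format are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical container for the values a cgroup manager starts up with.
struct CgroupStartupConfig {
  std::string base_cgroup;
  int64_t system_reserved_cpu_weight;
  int64_t max_cpu_weight;  // cgroup v2 caps cpu.weight at 10000.
};

// Same arithmetic as the snippet under review: the application cgroup gets
// whatever weight is left after the system reservation.
int64_t ApplicationCpuWeight(const CgroupStartupConfig &config) {
  assert(config.system_reserved_cpu_weight <= config.max_cpu_weight);
  return config.max_cpu_weight - config.system_reserved_cpu_weight;
}

// Builds a single line containing the full configuration, so it can be
// emitted with one log statement at startup.
std::string FormatStartupConfig(const CgroupStartupConfig &config) {
  std::ostringstream out;
  out << "Initializing CgroupManager: base_cgroup=" << config.base_cgroup
      << " system_cpu_weight=" << config.system_reserved_cpu_weight
      << " application_cpu_weight=" << ApplicationCpuWeight(config);
  return out.str();
}
```

Keeping the formatting in one function means the startup line stays in a single log entry even as more fields are added.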
#include "ray/common/cgroup2/scoped_cgroup_operation.h"
#include "ray/common/status_or.h"

namespace ray {
Maybe this should have a cgroup namespace?
I'm open to trying it. I haven't really wrapped my head around what best practices should be around namespaces. I've added an item to #54703. I'll play around with it at the end.
3. move all processes from the system cgroup into the base cgroup.
4. delete the node, system, and application cgroups respectively.

Cleanup is best-effort. If any step fails, it will log a warning.
why's this? to avoid crashing before other cleanup can happen at a higher level?
Yep. I figured we'd want to attempt the rest of the graceful shutdown process of the raylet even if cgroup cleanup didn't succeed fully.
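The best-effort behaviour discussed in this thread can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `CleanupStep` and `RunBestEffortCleanup` are hypothetical names, steps report success as a plain `bool`, and warnings go to `std::cerr` rather than Ray's logging macros.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical cleanup step: a name used for logging plus an action that
// returns true on success.
struct CleanupStep {
  std::string name;
  std::function<bool()> action;
};

// Runs every step in order. A failing step is logged as a warning and does
// not prevent later steps from running. Returns the names of failed steps.
std::vector<std::string> RunBestEffortCleanup(
    const std::vector<CleanupStep> &steps) {
  std::vector<std::string> failed;
  for (const auto &step : steps) {
    assert(step.action != nullptr);
    if (!step.action()) {
      std::cerr << "WARNING: cgroup cleanup step failed: " << step.name
                << std::endl;
      failed.push_back(step.name);
    }
  }
  return failed;
}
```

Because failures are only recorded and logged, the caller can carry on with the rest of a graceful shutdown regardless of how many steps failed.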
           enabled_controllers_ == other.enabled_controllers_;
  }
};
class FakeCgroupDriver : public CgroupDriverInterface {
minor comments; ping for merge
@edoakes ready for merge. Thanks!
…anager. (#56246)
This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on #56186. For more details about the resource isolation project see #54703.
In this PR:
* CgroupManager now bounds-checks constraints (e.g. cpu.weight must be within [1, 10000]).
* CgroupDriver no longer bounds-checks constraints.
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
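The bounds check mentioned above might look something like the sketch below. The `[1, 10000]` range for cpu.weight comes from the commit message; everything else (`ConstraintRange`, `kCpuWeightRange`, `CheckCpuWeight`, and returning an error string instead of a Ray `Status`) is a hypothetical simplification for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical inclusive range for a cgroup constraint value.
struct ConstraintRange {
  int64_t min;
  int64_t max;

  bool Contains(int64_t value) const {
    assert(min <= max);
    return value >= min && value <= max;
  }
};

// cpu.weight must be within [1, 10000], as stated in the commit message.
const ConstraintRange kCpuWeightRange{1, 10000};

// Returns an empty string when the weight is valid, otherwise a
// human-readable error describing the violated range.
std::string CheckCpuWeight(int64_t weight) {
  if (kCpuWeightRange.Contains(weight)) {
    return "";
  }
  return "cpu.weight " + std::to_string(weight) + " is outside [" +
         std::to_string(kCpuWeightRange.min) + ", " +
         std::to_string(kCpuWeightRange.max) + "]";
}
```

Validating in the manager before calling the driver keeps the driver a thin wrapper around the filesystem operations, which matches the split described in the two bullets.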
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: sampan <sampan@anyscale.com>
This PR is the first to introduce the CgroupManagerInterface. It will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs.
This PR stacks on #55063.
For more details about the resource isolation project see #54703.
The cgroup hierarchy for Ray will be:
The current implementation will only support
I've signposted design decisions with comments in the code. Here's a summary:
There are placeholders in the code for future work: