[Core] Resource Isolation Milestone 1 Implementation Tracker #54703
Open
Labels
P1 (Issue that should be fixed within a few weeks), core (Issues that should be addressed in Ray Core), k8s-proj (K8s and Ray OSS)
Description
Implement support for isolating Ray system processes from application processes using cgroup v2.
Work Items
- Create an API for passing cgroup configuration into ray ([core] Adding user facing API for resource isolation #51865).
- Implement CI support for running cgroup tests ([ci] Enable Cgroup support in CI for core #51454).
- Implement a sysfs driver for cgroup operations with tests. ([core] (cgroups 1/n) Adding a sys/fs filesystem driver to perform cgroup operations. #54898).
- Implement integration tests for the sysfs driver. ([core] (cgroups 2/n) adding integration tests for the cgroup sysfs driver. #55063).
- Implement a cgroup manager that uses the cgroup driver to check invariants, create subcgroups, move processes, enable controllers, and set resource limits.
- Implement cgroup cleanup in cgroup manager.
- Implement process migration for system processes and worker processes into the correct cgroup before or on startup.
- [core] (cgroups 7/n) cleaning up old cgroup integration code for raylet and core worker #56285
- [core] (cgroups 8/n) Wiring CgroupManager into the raylet. #56297
- [core] (cgroups 9/n) end-to-end integration of cgroups with ray start. #56352
- [core] (cgroups 10/n) Adding support in CgroupManager and CgroupDriver to move processes into system cgroup #56446
- [core] (cgroups 11/n) Raylet will move system processes into cgroup on startup #56522
- [core] (cgroups 12/n) Raylet will start worker processes in the application cgroup #56549
- Cleanup old cgroup code and associated TODOs.
- Add a ProcessIsolationFactory as described here. Clean up all public bazel targets.
- Add a user cgroup for all non-ray processes.
- Tune defaults for --system-reserved-cpu and --system-reserved-memory.
- Moving the driver and dashboard subprocesses into the system cgroup
- Bug fixes
- Add user-facing documentation for enabling resource isolation on VMs and containers. ([core] [docs] (cgroups 24/n) Adding public docs for the Resource Isolation #60183)
- Update log messages to cross-link to user-facing documentation.
- Attempt to move all cgroup related functionality into its own namespace to see how it plays with developer ergonomics.
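For readers unfamiliar with the cgroup v2 filesystem interface, the operations named in the work items above (create subcgroups, enable controllers, set resource limits, move processes) can be sketched roughly as below. This is a minimal illustration, not Ray's sysfs driver or CgroupManager: the class name `SysFsCgroupSketch`, the function `setup_isolation`, and the directory layout are hypothetical. The file names `cgroup.subtree_control`, `cgroup.procs`, `cpu.weight`, and `memory.min` are standard cgroup v2 interface files. The base path is a parameter so the sketch can be exercised against a scratch directory instead of a real `/sys/fs/cgroup` mount.

```python
from pathlib import Path
from typing import Iterable


class SysFsCgroupSketch:
    """Hypothetical helper for illustration; not Ray's actual driver."""

    def __init__(self, base: str) -> None:
        # On a real host this would be a cgroup v2 directory, typically
        # somewhere under /sys/fs/cgroup, that the caller may write to.
        self.base = Path(base)

    def create_cgroup(self, name: str) -> Path:
        # In cgroup v2, creating a directory creates a child cgroup.
        path = self.base / name
        path.mkdir(parents=True, exist_ok=True)
        return path

    def enable_controllers(self, cgroup: Path, controllers: Iterable[str]) -> None:
        # Writing "+cpu +memory" to cgroup.subtree_control makes those
        # controllers available to the children of this cgroup.
        line = " ".join(f"+{c}" for c in controllers)
        (cgroup / "cgroup.subtree_control").write_text(line)

    def set_constraint(self, cgroup: Path, filename: str, value: int) -> None:
        # Resource limits are plain values written to interface files
        # such as cpu.weight or memory.min.
        (cgroup / filename).write_text(str(value))

    def move_process(self, cgroup: Path, pid: int) -> None:
        # Writing a PID to cgroup.procs migrates that process.
        (cgroup / "cgroup.procs").write_text(str(pid))


def setup_isolation(base: str, system_pid: int, cpu_weight: int, memory_min: int) -> None:
    """Assumed layout: <base>/system and <base>/application as leaf cgroups."""
    driver = SysFsCgroupSketch(base)
    # Enable controllers on the parent before constraining the children.
    driver.enable_controllers(driver.base, ["cpu", "memory"])
    system = driver.create_cgroup("system")
    driver.create_cgroup("application")
    driver.set_constraint(system, "cpu.weight", cpu_weight)
    driver.set_constraint(system, "memory.min", memory_min)
    driver.move_process(system, system_pid)
```

On a real cgroup v2 hierarchy, processes generally have to live in leaf cgroups once controllers are enabled for a subtree, which is one reason the sketch uses separate system and application children rather than placing processes in the parent.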
Projects
Status
In Progress
Status
Todo