[core] [docs] (cgroups 24/n) Adding public docs for the Resource Isolation #60183
Signed-off-by: irabbani <irabbani@anyscale.com>
Code Review
This pull request adds comprehensive documentation for the new Resource Isolation feature using cgroup v2. The documentation is well-structured and covers requirements, usage with containers and on bare metal, API references, and troubleshooting. I've found a couple of minor inconsistencies in the code examples which I've pointed out in the review comments. Once these are addressed, this will be a great addition to the Ray documentation.
1. System critical processes internal to Ray which are critical to node health
2. User processes that are executing remote tasks and actors

Without resource isolation, user processes can starve system processes of CPU and memory, leading to node failure. Node failure can cause instability in your workload and, in extreme cases, lead to job failure.
Not as extreme as we'd hope lol
Kunchd left a comment
Thanks for adding the usage docs for resource isolation! This should be very helpful for everyone.
I left a couple of little details, but the doc looks good overall.
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
…irabbani/ri-docs-1
edoakes left a comment
Looks great! Just a few minor comments.
We should also consider discoverability. Many times, people will not be directly searching for "resource isolation" but rather trying to solve their stability/OOM problems. So we may want to drop an example of WorkerDiedError / OOM kill errors here for SEO... and probably audit other pages to see if this should be cross-linked from anywhere. For example: https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html
Signed-off-by: irabbani <israbbani@gmail.com>
I addressed some of this. I'll update all of our docs to x-link for SEO in a follow-up.
… ray.init(...) (#60726)

Follow up from #60183. When not running inside privileged containers, the user will have to specify a `cgroup_path`. It makes sense for this to be a part of the public API for `ray.init(...)`.

Things I'm changing:
1. Promoting `cgroup_path` to a public API parameter for `ray.init`
2. Updating tests to use that parameter.
3. Running all cgroup tests on CI for all C++ and python changes.

Signed-off-by: irabbani <israbbani@gmail.com>
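For readers skimming the thread, here is a minimal sketch of how a caller might validate a cgroup directory before handing it to `ray.init` as `cgroup_path`. It is not taken from the PR. The helper names `is_cgroup_v2` and `cgroup_init_kwargs` are hypothetical, and the check relies only on the fact that every cgroup v2 directory exposes a `cgroup.controllers` file. The option that actually turns resource isolation on is not named in this thread, so it is left out.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def is_cgroup_v2(path: Path) -> bool:
    """Return True if `path` looks like a cgroup v2 directory.

    cgroup v2 exposes a `cgroup.controllers` file in every cgroup;
    cgroup v1 hierarchies do not have it.
    """
    return (path / "cgroup.controllers").is_file()


def cgroup_init_kwargs(cgroup_path: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments for `ray.init` (hypothetical helper).

    Returns an empty dict when no path is given, so the caller falls
    back to whatever Ray does by default. Only `cgroup_path` comes from
    the PR discussion; everything else here is an assumption.
    """
    if cgroup_path is None:
        return {}
    path = Path(cgroup_path)
    if not is_cgroup_v2(path):
        raise ValueError(f"{path} does not look like a cgroup v2 directory")
    if not os.access(path, os.W_OK):
        raise PermissionError(f"{path} is not writable by the current user")
    return {"cgroup_path": str(path)}
```

Under these assumptions, an unprivileged container could call something like `ray.init(**cgroup_init_kwargs("/sys/fs/cgroup/ray"))`, where the path is purely illustrative, and fail early with a clear error instead of at node startup.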
@edoakes plz merge. Unrelated CI tests were failing so I had to update branch.
…ation (ray-project#60183) For more information about the Resource Isolation project see ray-project#54703. Adding public documentation for how to enable and use Resource Isolation for process isolation between system and user processes. --------- Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: irabbani <israbbani@gmail.com> Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
For more information about the Resource Isolation project see #54703.
Adding public documentation for how to enable and use Resource Isolation for process isolation between system and user processes.