Skip to content

[core] Adding user facing API for resource isolation#51865

Merged
edoakes merged 23 commits intomasterfrom
irabbani/resource-isolation
Apr 15, 2025
Merged

[core] Adding user facing API for resource isolation#51865
edoakes merged 23 commits intomasterfrom
irabbani/resource-isolation

Conversation

@israbbani
Copy link
Copy Markdown
Contributor

@israbbani israbbani commented Mar 31, 2025

[core] Adding user facing API for resource isolation for ray system processes with cgroupv2 in ray.init and ray cli. Adding parameters for memory and cpu controllers

See the Proposal section in for details about the API: https://docs.google.com/document/d/1QI9dh8i26vKXcSOrZxZUJguXJkN5iKysHNypZ_x_s94/edit?tab=t.0#heading=h.boj1wyximhs5

TODOs:

  • I still need to add support for K8S compliant suffixes (MB, MiB etc) which I'll do in a separate PR. Unfortunately, I couldn't find an off-the-shelf parser.
  • Come up with reasonable default values. The current ones are placeholders.

Rough Edges

  • I've tried to abstract away all resource isolation variables, invariant checking etc using the ResourceIsolationConfig class. There a few rough edges wrt to integrating with object_store_bytes. I'm hoping to refactor that too in a future PR. I'm open to suggestions for improvements.
  • The code is a little hard to test right not without peering into the guts of the implementation
    • especially true for ray.start and ray cli
    • also true for ray._private.utils methods for getting num cpus or memory
  • There's a lot of repetition of docstrings (especially b/w ray.init and ray start).

for ray system processes with cgroupv2 in ray.init and
ray cli. Adding parameters for memory and cpu controllers

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Mar 31, 2025
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
@israbbani israbbani removed the go add ONLY when ready to merge, run all tests label Apr 1, 2025
@israbbani israbbani changed the title [core] Adding user facing API for resource isolation [DRAFT} [core] Adding user facing API for resource isolation Apr 1, 2025
@israbbani israbbani changed the title [DRAFT} [core] Adding user facing API for resource isolation [DRAFT] [core] Adding user facing API for resource isolation Apr 1, 2025
Ibrahim Rabbani added 2 commits April 1, 2025 08:53
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
@israbbani israbbani marked this pull request as ready for review April 1, 2025 18:13
@israbbani israbbani requested review from dentiny and edoakes April 1, 2025 18:13
@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Apr 1, 2025
@israbbani israbbani changed the title [DRAFT] [core] Adding user facing API for resource isolation [core] Adding user facing API for resource isolation Apr 1, 2025
Copy link
Copy Markdown
Contributor

@dentiny dentiny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the effort!



def test_enabled_default_config_values(monkeypatch):
object_store_memory = 10 * 10**9
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI (no change requested), I found we have humanfriendly package already checked in.

[~/ray] (hjiang/compilation-options) [gke_hjiang-proj-cgroup-experiment_us-west1_hjiang-cgroup-test]
ubuntu@hjiang-devbox-pg$ git grep "humanfriendly"
doc/source/ray-overview/pip_freeze_ray-ml-py39-cpu.txt:humanfriendly==10.0
python/requirements_compiled.txt:humanfriendly==10.0

In theory we should be able to do

object_store_memory = humanfriendly.parse_size("10GB")

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Ibrahim Rabbani added 4 commits April 2, 2025 23:00
- Removing references to ray_constants in docstrings
- Refactoring the ResourceIsolationConfig into individual methods
- _ prefix for all private members
- moving test_resource_isolation_config to python/ray/tests
- fixing test_ray_init to shutdown ray after each test case

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
node.resource_isolation_config.system_reserved_memory
== system_reserved_memory + object_store_memory
)
ray.shutdown()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you need to shutdown after the test, it should be in a fixture. otherwise if the test fails the shutdown won't happen and there's state leakage

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm still learning pytest. The docs recommended "yield fixtures". I also used fixtures in test_cli.py. PTAL

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
self._resource_isolation_enabled = enable_resource_isolation
self.cgroup_path = cgroup_path
self.system_reserved_memory = system_reserved_memory
self.system_reserved_cpu_weight: float = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need documentation, since it's different from ctor argument.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. PTAL

Ibrahim Rabbani added 2 commits April 6, 2025 15:47
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
@israbbani israbbani requested review from dentiny and edoakes April 7, 2025 18:37
self.cgroup_path = cgroup_path
self.system_reserved_memory = system_reserved_memory
# cgroupv2 cpu.weight calculated from system_reserved_cpu
# assumes ray uses all available cores
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own knowledge, could you please explain a little bit more to me, what's "assumes ray uses all available cores"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that ray will use the total number of cpu cores in the system to calculate the weight. See examples in our design.

If you have 32 cores and system_reserved_cpu = 1, then `int(1/32 * 10000) will be weight.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something i have question, according to doc:

Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum. 
...
All weights are in the range [1, 10000] with the default at 100.  This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

So my expected impl would be: system reserved : app alloc = system-reserved-core : (all core - system-reserved-core)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what you're saying. Could you give a concrete example or explain what you want me to change?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly (the above comment is my understanding); the logic will be used in the C++ side as well.
So we should document the calculation in the comment; or pass the app cpu weight?

irabbani added 4 commits April 8, 2025 03:26
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. monkeypatch makes me sad but it's a reasonable incremental change :)

Ping when ready to merge.

Comment on lines +16 to +18
"""Configuration for enabling resource isolation by reserving memory
and cpu for ray system processes through cgroupv2.
This class validates configuration for resource isolation by
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: google style guide docstring format we usually follow is:

"""Single line description, cannot span multiple.

More details blah blah.
""""

@israbbani
Copy link
Copy Markdown
Contributor Author

@edoakes this is ready to merge.

@edoakes edoakes merged commit 7c98ade into master Apr 15, 2025
5 checks passed
@edoakes edoakes deleted the irabbani/resource-isolation branch April 15, 2025 17:40
edoakes added a commit that referenced this pull request Sep 16, 2025
#56352)

This PR stacks on #56297.

For more details about the resource isolation project see
#54703.

This PR wires the resource isolation config (introduced here #51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
zma2 pushed a commit to zma2/ray that referenced this pull request Sep 23, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Zhiqiang Ma <zhiqiang.ma@intel.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: zac <zac@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
#56352)

This PR stacks on #56297.

For more details about the resource isolation project see
#54703.

This PR wires the resource isolation config (introduced here #51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
#56352)

This PR stacks on #56297.

For more details about the resource isolation project see
#54703.

This PR wires the resource isolation config (introduced here #51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
ray-project#56352)

This PR stacks on ray-project#56297.

For more details about the resource isolation project see
ray-project#54703.

This PR wires the resource isolation config (introduced here ray-project#51865)
from ray cli (ray start) and the ray sdk (ray.init) into the raylet.
Notable changes include:
1. A separate python test target for running test_resource_isolation
related unit and integration tests. This unifies all cgroup related
tests under one buildkite target and removes the need for `--except-tags
cgroup` everywhere else.
2. Modification to the cgroup hierarchy. This was an oversight on my
part. The "no internal processes" constraint says that a non-root cgroup
can either have controllers enabled or have processes. Therefore, the
new hierarchy looks like:
```
  //      base_cgroup_path (e.g. /sys/fs/cgroup)
  //             |
  //     ray_node_<node_id>
  //       |           |
  //     system     application
  //       |           |
  //      leaf        leaf
  //
```
where the leaf nodes contain all processes and the system/application
cgroups apply cpu.weight and memory.min constraints.
3. CgroupManager now has a move ctor/assignment operator that allows
ownership and lifecycle to be managed by the NodeManager.
4. CgroupManager is now owned by NodeManager.
5. An end-to-end integration test in
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`.
6. Moved all previous integration tests from test_ray_init and test_cli
into test_resource_isolation_integration.py. These tests have TODOs to
finish them up once the rest of cgroup features are implemented.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants