
[core] (cgroups 19/n) Allow fractions when getting the number of CPUs to calculate weights#57800

Merged
edoakes merged 3 commits into irabbani/cgroups-18 from irabbani/cgroups-19 on Oct 16, 2025
Conversation

@israbbani (Contributor) commented Oct 16, 2025

This PR stacks on #57776.

For more details about the resource isolation project see #54703.

When Ray calculates the number of CPUs available on the machine, it checks whether it's running in a container. However, it truncates the number of CPUs to an integer, discarding any fractional allocation.

In this PR:

  • If the number of CPUs on the machine is <= DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES, raise a ValueError. Previously, this check used < DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES.
  • Return fractional CPUs from ray._private.utils.get_num_cpus if an optional parameter is set to True. This prevents rounding down the number of CPUs available on the machine when running in a container that has cpu.max set.

Signed-off-by: irabbani <israbbani@gmail.com>
@israbbani israbbani changed the base branch from master to irabbani/cgroups-18 October 16, 2025 17:37
@israbbani israbbani added labels Oct 16, 2025: core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)
@israbbani israbbani marked this pull request as ready for review October 16, 2025 18:02
@israbbani israbbani requested a review from a team as a code owner October 16, 2025 18:02
@israbbani (Contributor, Author) commented:

Tested on Anyscale with a 2-core machine. Works with the default parameters now.

lscpu | grep "CPU(s)"
CPU(s):                                  2

cat /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/cpu.weight
4445

cat /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/system/cpu.weight
5555
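The two cpu.weight values above sum to 10,000, which suggests a proportional split of cgroup v2's weight budget between the user and system cgroups. A hedged sketch of how such values could be derived (the function name, the formula, and the reserved-CPU input are assumptions, not Ray's code):

```python
# Hedged sketch: deriving complementary cpu.weight values from a proportional
# split of a 10,000 weight budget. NOT Ray's actual implementation.
TOTAL_WEIGHT = 10_000  # cgroup v2 cpu.weight upper bound


def split_cpu_weights(available_cpus: float, system_reserved_cpus: float):
    """Return (user_weight, system_weight) proportional to the CPU split."""
    system_weight = round(TOTAL_WEIGHT * system_reserved_cpus / available_cpus)
    return TOTAL_WEIGHT - system_weight, system_weight
```

Under these assumptions, a fractional reservation such as 1.111 of 2.0 CPUs yields exactly (4445, 5555), matching the observed values; if `available_cpus` were truncated to an integer first, the ratio would be skewed, which is why the fractional path matters.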

From the logs

(base) ray@ip-10-0-251-150:~/default$ grep "CgroupManager" /tmp/ray/session_latest/logs/raylet.out
{"asctime":"2025-10-16 12:49:53,974","levelname":"I","message":"Initializing CgroupManager at base cgroup at '/sys/fs/cgroup'. Ray's cgroup hierarchy will under the node cgroup at '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4' with [memory, cpu] controllers enabled. The system cgroup at '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/system' will have [memory] controllers enabled with [cpu.weight=5555, memory.min=5946149682] constraints. The user cgroup '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user' will have no controllers enabled with [cpu.weight=4445] constraints. The user cgroup will contain the [/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/workers, /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/non-ray] cgroups.","component":"raylet","filename":"cgroup_manager.cc","lineno":212}

@edoakes (Collaborator) left a comment:


add comment in followup


"""
available_system_cpus = utils.get_num_cpus()
available_system_cpus = utils.get_num_cpus(truncate=False)
@edoakes (Collaborator): should leave a comment for why we don't truncate

@edoakes edoakes merged commit 40a027e into irabbani/cgroups-18 Oct 16, 2025
5 checks passed
@edoakes edoakes deleted the irabbani/cgroups-19 branch October 16, 2025 20:07
edoakes added a commit that referenced this pull request Oct 17, 2025

For more details about the resource isolation project see
#54703.

This PR moves the driver into the workers cgroup when it registers with
the NodeManager. Also updates the tests to reflect this.

This now includes changes from #57800.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
