NUMA binding integration with elastic agent and torchrun #149334
raghavhrishi wants to merge 6 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149334
Note: Links to docs will display an error until the docs builds have completed. ✅ You can merge normally! (1 unrelated failure.) As of commit 066a805 with merge base ee72338. UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks @raghavhrishi! Can you please sign the CLA?
torch/distributed/numa_binding.py (Outdated)

```python
    return numactlargs

# ...

class CoreComplex(Numa):
```
This option seems to be specific to AMD x86_64 processors, which have the concept of a core complex whose cores share L3 cache.
On Intel x86_64 processors, the L3 cache is typically at the granularity of a socket.
L1 & L2 caches are private to each physical core.
Would it be okay to disable this option on Intel x86_64 machines (I'm guessing users would only use this option by mistake on Intel x86_64 machines), or explain the behavior with a warning if it'd be used on an Intel x86_64 machine? @jingxu10, can you please share your opinion?
Thanks!
A warning message can be added when the core-complex option is used and also in the help page (while describing the --numa_binding option) so that users are aware of it.
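As a sketch of what such a warning could look like, assuming vendor detection via `/proc/cpuinfo` (the function name and detection method are illustrative, not necessarily what the PR implements):

```python
import warnings


def warn_if_core_complex_unsupported() -> None:
    """Warn when core-complex binding is requested on a non-AMD x86_64 CPU.

    AMD CPUs expose core complexes (CCX) whose cores share an L3 cache;
    on Intel x86_64 the L3 is typically socket-wide, so the option is
    unlikely to behave as the user expects there.
    """
    vendor = ""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("vendor_id"):
                    vendor = line.split(":", 1)[1].strip()
                    break
    except FileNotFoundError:
        # Not a Linux system with /proc; nothing to check.
        pass
    if vendor and vendor != "AuthenticAMD":
        warnings.warn(
            "core-complex NUMA binding assumes AMD CCX topology; on this "
            f"CPU (vendor_id={vendor}) the L3 cache layout may differ and "
            "the option may not provide the expected locality benefit."
        )
```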
This has been updated in the recent commit.
Shall we consider the E-core/P-core case?
https://www.intel.com/content/www/us/en/gaming/resources/how-hybrid-design-works.html
> Shall we consider the E-core/P-core case?
Thanks for your inputs, @jingxu10!
Looks like some variants of new data-center-grade Xeon processors may have E-cores as well, so we should probably consider them too.
@leslie-fang-intel, please share your inputs. Thanks!
Thanks for sharing your thoughts – it's a good idea. It could potentially be a follow-up Pull Request once we’ve had the time to consider the design and how best to integrate it.
cc: @arpitsardhana
torch/distributed/numa_binding.py (Outdated)

```python
resultCpuList = []
for i in range(resultCpuLen):
    if (cpusSharedCacheVal >> i) & 1 == 1:
        resultCpuList.append(i)
```
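The loop above decodes a shared-cache CPU bitmask into a list of CPU ids. A standalone sketch of the same idea (the function name is mine, not the PR's):

```python
def bitmask_to_cpus(mask: int) -> list[int]:
    """Decode a CPU affinity bitmask into the list of set CPU ids.

    Bit i of the mask set to 1 means CPU i is in the set.
    """
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, `bitmask_to_cpus(0b1011)` returns `[0, 1, 3]`.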
@raghavhrishi First, great job pushing on this feature!
Referencing your example in #148689, for the exclusive binding option,
If Rank 0 and Rank 1 are both affined to NUMA node 0, the cores would be split as follows:
Rank 0: `numactl --physcpubind=0-3 --membind=0`
Rank 1: `numactl --physcpubind=4-7 --membind=0`
This assumes that a contiguous indexing of CPUs results in the most optimal binding. Could you please confirm whether this is indeed an assumption in this PR? In my experience, there are many node architectures where linear indexing of CPUs is not the norm; see Frontier, e.g., https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#frontier-compute-nodes
If linear indexing of CPUs is indeed assumed, would it be possible to have a user option to specify --physcpubind or pass in the CPU/GPU topology?
@ashesh2512 Thanks for your comment!
Non-linear core indexing might have edge cases in scenarios where there are only two NUMA Nodes available for binding, and multiple ranks (e.g., 4) are affined to the same NUMA Node. In such cases, linear indexing might be necessary to address the issue effectively.
The exclusive binding strategy utilizes topology information to determine the NUMA Node associated with each rank. Once identified, it ensures that ranks affined to the same NUMA Node are assigned distinct sets of cores using physcpubind, preventing overlap. This approach ensures that ranks sharing affinity with a NUMA Node do not use the same cores. The strategy uses the system's underlying topology information and avoids cross-NUMA binding.
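A minimal sketch of the contiguous split described above (the function name and signature are illustrative, not the PR's API):

```python
def cores_for_rank(node_cores: list[int], local_rank: int,
                   ranks_on_node: int) -> list[int]:
    """Give each rank sharing a NUMA node a contiguous, non-overlapping
    slice of that node's cores, so no two ranks bind the same core."""
    per_rank = len(node_cores) // ranks_on_node
    start = local_rank * per_rank
    return node_cores[start:start + per_rank]
```

With 8 cores on NUMA node 0 and two ranks affined to it, rank 0 gets cores 0-3 and rank 1 gets cores 4-7, matching the `--physcpubind=0-3` / `--physcpubind=4-7` example above.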
As a potential enhancement, we could consider adding an option for users to specify the cores they wish to use in a follow-up pull request after reviewing the design.
cc: @arpitsardhana
@raghavhrishi Thanks, I think that an option for users to specify the cores they wish to use would be ideal in a follow-up PR. I could help with that.
For context, one of the architectures I work with, a single compute node (8 GPUs per node) has the following CPU/GPU affinity. Ideally, the user would be able to bind a process to one or multiple cores, and set the GPU index in PyTorch accordingly.
NUMA 0:
hardware threads 000-007, 064-071 | GPU 4
hardware threads 008-015, 072-079 | GPU 5
NUMA 1:
hardware threads 016-023, 080-087 | GPU 2
hardware threads 024-031, 088-095 | GPU 3
NUMA 2:
hardware threads 032-039, 096-103 | GPU 6
hardware threads 040-047, 104-111 | GPU 7
NUMA 3:
hardware threads 048-055, 112-119 | GPU 0
hardware threads 056-063, 120-127 | GPU 1
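For illustration, the affinity above could be expressed as an explicit user-supplied map; the name `GPU_TO_THREADS` and the dict format here are hypothetical, sketching what a "pass in the CPU/GPU topology" option might accept:

```python
# Hypothetical explicit affinity map for the Frontier-like node above:
# GPU index -> list of hardware thread ids allowed for that GPU's rank.
GPU_TO_THREADS: dict[int, list[int]] = {
    4: [*range(0, 8), *range(64, 72)],     # NUMA 0
    5: [*range(8, 16), *range(72, 80)],    # NUMA 0
    2: [*range(16, 24), *range(80, 88)],   # NUMA 1
    3: [*range(24, 32), *range(88, 96)],   # NUMA 1
    6: [*range(32, 40), *range(96, 104)],  # NUMA 2
    7: [*range(40, 48), *range(104, 112)], # NUMA 2
    0: [*range(48, 56), *range(112, 120)], # NUMA 3
    1: [*range(56, 64), *range(120, 128)], # NUMA 3
}
```

A launcher could consume such a map directly instead of assuming linear CPU indexing.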
requirements.txt (Outdated)

```
psutil
pyyaml
requests
pynvml
```
Nope, we deliberately decided not to depend on pynvml, as one can very easily rewrite everything one needs with ctypes.
Moreover, it's a big no-go for something like ROCm or XPU.
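For example, a minimal ctypes-based GPU count that avoids the pynvml dependency might look like the following. This is a sketch, not the code that landed; it returns 0 when the NVML library is unavailable (e.g., on ROCm/XPU or CPU-only machines):

```python
import ctypes


def get_nvml_device_count() -> int:
    """Return the GPU count via NVML through ctypes, or 0 if unavailable."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        # No NVIDIA driver library on this system.
        return 0
    if nvml.nvmlInit_v2() != 0:
        return 0
    count = ctypes.c_uint(0)
    try:
        if nvml.nvmlDeviceGetCount_v2(ctypes.byref(count)) != 0:
            return 0
    finally:
        nvml.nvmlShutdown()
    return count.value
```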
torch/distributed/numa_binding.py (Outdated)

```python
def get_gpu_count(self):
    # Initialize NVML
    pynvml.nvmlInit()
    # Get the number of GPU devices
    device_count = pynvml.nvmlDeviceGetCount()
    # Shutdown NVML
    pynvml.nvmlShutdown()
```
Is there a device-generic way?
There should be some methods in torch.accelerator package now.
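A device-generic sketch using `torch.accelerator` (available in recent PyTorch releases), with fallbacks for older versions or missing torch; this is an illustration of the suggestion, not the PR's code:

```python
def accelerator_count() -> int:
    """Device-generic accelerator count: prefer torch.accelerator,
    fall back to torch.cuda, and return 0 if torch is absent."""
    try:
        import torch
    except ImportError:
        return 0
    if hasattr(torch, "accelerator"):
        # torch.accelerator covers CUDA, ROCm, XPU, etc. generically.
        return torch.accelerator.device_count()
    return torch.cuda.device_count()
```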
what do you think of this?
torch/distributed/numa_binding.py (Outdated)

```python
# returns array indexed by GPU id and mapping to value NUMA node id
def get_numa_nodes(self):
```
nit: would appreciate an example of the return.
Suppose we have 4 GPUs, and they are connected to the following NUMA nodes:
GPU 0 → NUMA Node 0
GPU 1 → NUMA Node 0
GPU 2 → NUMA Node 1
GPU 3 → NUMA Node 1
Then the function would return:
[0, 0, 1, 1]
torch/distributed/numa_binding.py
Outdated
| for busID in pciBusIDs: | ||
| pciFields = busID.split(":") | ||
| pciDir = f"{pciFields[0][-4:]}:{pciFields[1]}:{pciFields[2]}" | ||
| numaFile = NUMA_CMD.format(value=pciDir.lower()) | ||
| try: | ||
| with open(numaFile) as numa_node_text: | ||
| node = int(numa_node_text.read()) | ||
| numaNodes.append(node) | ||
| except FileNotFoundError: | ||
| print(f"The file {numaFile} does not exist.") |
There was a problem hiding this comment.
nit: can you comment on this block?
Also, is it worth for NVML to add an API to return the needed value?
For each GPU's PCI bus ID, this block constructs the sysfs path to the device's `numa_node` file and reads the NUMA node associated with it. The function returns a list of NUMA nodes, one per GPU.
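A standalone sketch of that sysfs lookup (the helper name is mine; it returns -1 when the file is missing, which is also what the kernel reports on non-NUMA systems):

```python
from pathlib import Path


def numa_node_for_pci(bus_id: str) -> int:
    """Read a PCI device's NUMA node from sysfs; -1 when unknown.

    NVML reports bus ids like "00000000:3B:00.0"; the sysfs directory
    uses a lowercase 4-digit domain, e.g. /sys/bus/pci/devices/0000:3b:00.0.
    """
    domain, bus, rest = bus_id.split(":", 2)
    device_dir = f"{domain[-4:]}:{bus}:{rest}".lower()
    sysfs = Path("/sys/bus/pci/devices") / device_dir / "numa_node"
    try:
        return int(sysfs.read_text())
    except (FileNotFoundError, ValueError):
        return -1
```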
torch/distributed/numa_binding.py (Outdated)

```python
# returns a bitmap for each core, its sibling cores
def get_thread_siblings(self, cpu):
```
What is the function for?
get_thread_siblings identifies the hardware threads (SMT siblings) that share the same physical core as the given CPU, as reported by the kernel's topology files.
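On Linux this information lives in sysfs; a minimal standalone sketch of reading a CPU's thread-sibling mask (function name and error handling are mine) might look like:

```python
from pathlib import Path


def thread_siblings(cpu: int) -> list[int]:
    """Decode /sys/devices/system/cpu/cpuN/topology/thread_siblings
    into the list of hardware threads sharing cpu's physical core.

    The file holds a comma-separated hex bitmask, e.g. "00000000,00000101".
    Returns [] on non-Linux systems or if the CPU does not exist.
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings")
    try:
        mask = int(path.read_text().strip().replace(",", ""), 16)
    except (FileNotFoundError, ValueError):
        return []
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```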
Super!!! Thank you for implementing this, @raghavhrishi. This issue could be closed as well when merged: #115305
Excited for this, @raghavhrishi. Any update on the timeline?
@kwen2501: I've addressed the comments in the PR. Please let me know if there's anything else needed to proceed with the merge.
Thanks for the improvements. In general, I am wondering if there is a way to do it in a device-agnostic way. But I understand. If we'd like to avoid a direct dependency on pynvml (as you did in requirements.txt), can we put a check in torchrun to see if it is available? I will defer to @kiukchung and @d4l3k for the final decision.
torch/distributed/run.py (Outdated)

```python
numa_cmd = None
py_executable = os.getenv("PYTHON_EXEC", sys.executable)
if args.numa_binding:
    numa_cmd = update_with_numa_binding_pytorch(args.numa_binding)
```
Can we implement this at the elastic agent level? Putting this logic here means only CLI users get NUMA control, not users of the programmatic API.
@raghavhrishi do you have bandwidth to update this PR? There's still some refactoring required to get this into a good state. The two main things are:
Primarily asking since we'd like to land this support and have someone who might be interested in pushing this over the line. We could also land this in pieces -- i.e., land the helper utilities and then follow up with a cleaner torchelastic integration.
```python
    "Can be used to override custom logging behavior.",
)

# ...

parser.add_argument(
```
cc @EikanWang didn't someone from your end already send a PR to do NUMA binding in torchrun? That vaguely rings a bell to me.
Hi @albanD, do you mean the script https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py, #133835, or a separate recent PR?
Discussed with Nikita before; the code changes in #133835 involve too many things, so I'll split it into smaller PRs later.
Ho yes, https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py is what I had in mind. @raghavhrishi any link between the two?
@albanD I think the file you have mentioned is different from this PR's implementation.
setup.py (Outdated)

```python
"networkx",
"jinja2",
"fsspec",
"pynvml>=11.4.1",
```
Adding such a hard dependency is definitely not ok without much deeper considerations.
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D78319234.
@pytorchbot merge (Initiating merge automatically since the Phabricator diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Implements #148689
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @kwen2501 @c-p-i-o