NUMA binding integration with elastic agent and torchrun #149334
raghavhrishi wants to merge 6 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149334
Note: Links to docs will display an error until the docs builds have completed. ✅ You can merge normally! (1 unrelated failure.) As of commit 066a805 with merge base ee72338. UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks @raghavhrishi! Can you please sign the CLA?
torch/distributed/numa_binding.py (Outdated)

```python
    return numactlargs

# ...

class CoreComplex(Numa):
```
This option seems to be specific to AMD x86_64 processors, which have the concept of a core complex whose cores share L3 cache.
On Intel x86_64 processors, the L3 cache is typically at the granularity of a socket.
L1 & L2 caches are private to each physical core.
Would it be okay to disable this option on Intel x86_64 machines (I'm guessing users would only use this option by mistake on Intel x86_64 machines), or explain the behavior with a warning if it'd be used on an Intel x86_64 machine? @jingxu10, can you please share your opinion?
Thanks!
A warning message can be added when the core-complex option is used and also in the help page (while describing the --numa_binding option) so that users are aware of it.
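As a sketch of what such a warning could look like, assuming vendor detection via `/proc/cpuinfo` (the function name and detection method are illustrative, not necessarily what the PR implements):

```python
import warnings


def warn_if_core_complex_unsupported() -> None:
    """Warn when core-complex binding is requested on a non-AMD x86_64 CPU.

    AMD CPUs expose core complexes (CCX) whose cores share an L3 cache;
    on Intel x86_64 the L3 is typically socket-wide, so the option is
    unlikely to behave as the user expects there.
    """
    vendor = ""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("vendor_id"):
                    vendor = line.split(":", 1)[1].strip()
                    break
    except FileNotFoundError:
        # Not a Linux system with /proc; nothing to check.
        pass
    if vendor and vendor != "AuthenticAMD":
        warnings.warn(
            "core-complex NUMA binding assumes AMD CCX topology; on this "
            f"CPU (vendor_id={vendor}) the L3 cache layout may differ and "
            "the option may not provide the expected locality benefit."
        )
```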
This has been updated in the recent commit.
Shall we consider the E-core/P-core case?
https://www.intel.com/content/www/us/en/gaming/resources/how-hybrid-design-works.html
> Shall we consider the E-core/P-core case?
Thanks for your inputs, @jingxu10!
Looks like some variants of new data-center-grade Xeon processors may have E-cores as well, so we should probably consider them too.
@leslie-fang-intel, please share your inputs. Thanks!
Thanks for sharing your thoughts – it's a good idea. It could potentially be a follow-up Pull Request once we’ve had the time to consider the design and how best to integrate it.
cc: @arpitsardhana
torch/distributed/numa_binding.py (Outdated)

```python
resultCpuList = []
for i in range(resultCpuLen):
    if (cpusSharedCacheVal >> i) & 1 == 1:
        resultCpuList.append(i)
```
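The loop above decodes a shared-cache CPU bitmask into a list of CPU ids. A standalone sketch of the same idea (the function name is mine, not the PR's):

```python
def bitmask_to_cpus(mask: int) -> list[int]:
    """Decode a CPU affinity bitmask into the list of set CPU ids.

    Bit i of the mask set to 1 means CPU i is in the set.
    """
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, `bitmask_to_cpus(0b1011)` returns `[0, 1, 3]`.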
@raghavhrishi First, great job pushing on this feature!
Referencing your example in #148689, for the exclusive binding option,
If Rank 0 and Rank 1 are both affined to NUMA node 0, the cores would be split as follows:
Rank 0: `numactl --physcpubind=0-3 --membind=0`
Rank 1: `numactl --physcpubind=4-7 --membind=0`
This assumes that a contiguous indexing of CPUs results in the most optimal binding. Could you please confirm whether this is indeed an assumption in this PR? In my experience, there are many node architectures where linear indexing of CPUs is not the norm; see Frontier, e.g., https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#frontier-compute-nodes
If linear indexing of CPUs is indeed assumed, would it be possible to have a user option to specify --physcpubind or pass in the CPU/GPU topology?
@ashesh2512 Thanks for your comment!
Non-linear core indexing might have edge cases in scenarios where there are only two NUMA Nodes available for binding, and multiple ranks (e.g., 4) are affined to the same NUMA Node. In such cases, linear indexing might be necessary to address the issue effectively.
The exclusive binding strategy utilizes topology information to determine the NUMA Node associated with each rank. Once identified, it ensures that ranks affined to the same NUMA Node are assigned distinct sets of cores using physcpubind, preventing overlap. This approach ensures that ranks sharing affinity with a NUMA Node do not use the same cores. The strategy uses the system's underlying topology information and avoids cross-NUMA binding.
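A minimal sketch of the contiguous split described above (the function name and signature are illustrative, not the PR's API):

```python
def cores_for_rank(node_cores: list[int], local_rank: int,
                   ranks_on_node: int) -> list[int]:
    """Give each rank sharing a NUMA node a contiguous, non-overlapping
    slice of that node's cores, so no two ranks bind the same core."""
    per_rank = len(node_cores) // ranks_on_node
    start = local_rank * per_rank
    return node_cores[start:start + per_rank]
```

With 8 cores on NUMA node 0 and two ranks affined to it, rank 0 gets cores 0-3 and rank 1 gets cores 4-7, matching the `--physcpubind=0-3` / `--physcpubind=4-7` example above.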
As a potential enhancement, we could consider adding an option for users to specify the cores they wish to use in a follow-up pull request after reviewing the design.
cc: @arpitsardhana
@raghavhrishi Thanks, I think that an option for users to specify the cores they wish to use would be ideal in a follow-up PR. I could help with that.
For context, one of the architectures I work with, a single compute node (8 GPUs per node) has the following CPU/GPU affinity. Ideally, the user would be able to bind a process to one or multiple cores, and set the GPU index in PyTorch accordingly.
NUMA 0:
hardware threads 000-007, 064-071 | GPU 4
hardware threads 008-015, 072-079 | GPU 5
NUMA 1:
hardware threads 016-023, 080-087 | GPU 2
hardware threads 024-031, 088-095 | GPU 3
NUMA 2:
hardware threads 032-039, 096-103 | GPU 6
hardware threads 040-047, 104-111 | GPU 7
NUMA 3:
hardware threads 048-055, 112-119 | GPU 0
hardware threads 056-063, 120-127 | GPU 1
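For illustration, the affinity above could be expressed as an explicit user-supplied map; the name `GPU_TO_THREADS` and the dict format here are hypothetical, sketching what a "pass in the CPU/GPU topology" option might accept:

```python
# Hypothetical explicit affinity map for the Frontier-like node above:
# GPU index -> list of hardware thread ids allowed for that GPU's rank.
GPU_TO_THREADS: dict[int, list[int]] = {
    4: [*range(0, 8), *range(64, 72)],     # NUMA 0
    5: [*range(8, 16), *range(72, 80)],    # NUMA 0
    2: [*range(16, 24), *range(80, 88)],   # NUMA 1
    3: [*range(24, 32), *range(88, 96)],   # NUMA 1
    6: [*range(32, 40), *range(96, 104)],  # NUMA 2
    7: [*range(40, 48), *range(104, 112)], # NUMA 2
    0: [*range(48, 56), *range(112, 120)], # NUMA 3
    1: [*range(56, 64), *range(120, 128)], # NUMA 3
}
```

A launcher could consume such a map directly instead of assuming linear CPU indexing.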
requirements.txt (Outdated)

```
psutil
pyyaml
requests
pynvml
```
Nope, we deliberately decided not to depend on pynvml, as one can very easily rewrite everything one needs with ctypes.
Moreover, it's a big no-go for something like ROCm or XPU.
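For example, a minimal ctypes-based GPU count that avoids the pynvml dependency might look like the following. This is a sketch, not the code that landed; it returns 0 when the NVML library is unavailable (e.g., on ROCm/XPU or CPU-only machines):

```python
import ctypes


def get_nvml_device_count() -> int:
    """Return the GPU count via NVML through ctypes, or 0 if unavailable."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        # No NVIDIA driver library on this system.
        return 0
    if nvml.nvmlInit_v2() != 0:
        return 0
    count = ctypes.c_uint(0)
    try:
        if nvml.nvmlDeviceGetCount_v2(ctypes.byref(count)) != 0:
            return 0
    finally:
        nvml.nvmlShutdown()
    return count.value
```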
torch/distributed/numa_binding.py (Outdated)

```python
def get_gpu_count(self):
    # Initialize NVML
    pynvml.nvmlInit()
    # Get the number of GPU devices
    device_count = pynvml.nvmlDeviceGetCount()
    # Shutdown NVML
    pynvml.nvmlShutdown()
```
Is there a device-generic way?
There should be some methods in torch.accelerator package now.
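A device-generic sketch using `torch.accelerator` (available in recent PyTorch releases), with fallbacks for older versions or missing torch; this is an illustration of the suggestion, not the PR's code:

```python
def accelerator_count() -> int:
    """Device-generic accelerator count: prefer torch.accelerator,
    fall back to torch.cuda, and return 0 if torch is absent."""
    try:
        import torch
    except ImportError:
        return 0
    if hasattr(torch, "accelerator"):
        # torch.accelerator covers CUDA, ROCm, XPU, etc. generically.
        return torch.accelerator.device_count()
    return torch.cuda.device_count()
```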
what do you think of this?
torch/distributed/numa_binding.py (Outdated)

```python
# returns array indexed by GPU id and mapping to value NUMA node id
def get_numa_nodes(self):
```
nit: would appreciate an example of the return.
Suppose we have 4 GPUs, and they are connected to the following NUMA nodes:
GPU 0 → NUMA Node 0
GPU 1 → NUMA Node 0
GPU 2 → NUMA Node 1
GPU 3 → NUMA Node 1
Then the function would return:
[0, 0, 1, 1]
torch/distributed/numa_binding.py
Outdated
| for busID in pciBusIDs: | ||
| pciFields = busID.split(":") | ||
| pciDir = f"{pciFields[0][-4:]}:{pciFields[1]}:{pciFields[2]}" | ||
| numaFile = NUMA_CMD.format(value=pciDir.lower()) | ||
| try: | ||
| with open(numaFile) as numa_node_text: | ||
| node = int(numa_node_text.read()) | ||
| numaNodes.append(node) | ||
| except FileNotFoundError: | ||
| print(f"The file {numaFile} does not exist.") |
There was a problem hiding this comment.
nit: can you comment on this block?
Also, is it worth for NVML to add an API to return the needed value?
For each GPU's PCI bus ID, this block constructs the sysfs path to the device's `numa_node` file and reads the NUMA node associated with it. The function returns a list of NUMA nodes, one per GPU.
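A standalone sketch of that sysfs lookup (the helper name is mine; it returns -1 when the file is missing, which is also what the kernel reports on non-NUMA systems):

```python
from pathlib import Path


def numa_node_for_pci(bus_id: str) -> int:
    """Read a PCI device's NUMA node from sysfs; -1 when unknown.

    NVML reports bus ids like "00000000:3B:00.0"; the sysfs directory
    uses a lowercase 4-digit domain, e.g. /sys/bus/pci/devices/0000:3b:00.0.
    """
    domain, bus, rest = bus_id.split(":", 2)
    device_dir = f"{domain[-4:]}:{bus}:{rest}".lower()
    sysfs = Path("/sys/bus/pci/devices") / device_dir / "numa_node"
    try:
        return int(sysfs.read_text())
    except (FileNotFoundError, ValueError):
        return -1
```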
torch/distributed/numa_binding.py (Outdated)

```python
# returns a bitmap for each core, its sibling cores
def get_thread_siblings(self, cpu):
```
What is the function for?
get_thread_siblings identifies the hardware threads (SMT siblings) that share the same physical core as the given CPU, as reported by the kernel's topology files.
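On Linux this information lives in sysfs; a minimal standalone sketch of reading a CPU's thread-sibling mask (function name and error handling are mine) might look like:

```python
from pathlib import Path


def thread_siblings(cpu: int) -> list[int]:
    """Decode /sys/devices/system/cpu/cpuN/topology/thread_siblings
    into the list of hardware threads sharing cpu's physical core.

    The file holds a comma-separated hex bitmask, e.g. "00000000,00000101".
    Returns [] on non-Linux systems or if the CPU does not exist.
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings")
    try:
        mask = int(path.read_text().strip().replace(",", ""), 16)
    except (FileNotFoundError, ValueError):
        return []
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```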
Super!!! Thank you for implementing this, @raghavhrishi. This issue could be closed as well when merged: #115305
Excited for this, @raghavhrishi. Any update on the timeline?
@kwen2501: I've addressed the comments in the PR. Please let me know if there's anything else needed to proceed with the merge.
Thanks for the improvements. In general, I am wondering if there is a way to do it in a device-agnostic way. But I understand. If we'd like to avoid a direct dependency on pynvml (as you did in requirements.txt), can we put a check in torchrun to see if it is available? I will defer to @kiukchung and @d4l3k for the final decision.
torch/distributed/run.py (Outdated)

```python
numa_cmd = None
py_executable = os.getenv("PYTHON_EXEC", sys.executable)
if args.numa_binding:
    numa_cmd = update_with_numa_binding_pytorch(args.numa_binding)
```
Can we implement this at the elastic agent level? Putting this logic here means only CLI users get NUMA control, not users of the programmatic API.
@raghavhrishi do you have bandwidth to update this PR? There's still some refactoring required to get this into a good state. The two main things are:
Primarily asking since we'd like to land this support and have someone who might be interested in pushing this over the line. We could also land this in pieces -- i.e., land the helper utilities and then follow up with a cleaner torchelastic integration.
```python
    "Can be used to override custom logging behavior.",
)

# ...

parser.add_argument(
```
cc @EikanWang didn't someone from your end already send a PR to do NUMA binding in torchrun? That vaguely rings a bell to me.
Hi @albanD, do you mean the script https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py, #133835, or a separate recent PR?
Discussed with Nikita before; the code changes in #133835 involve too many things, so I'll split it into smaller PRs later.
Ho yes, https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py is what I had in mind. @raghavhrishi any link between the two?
@albanD I think the file you have mentioned is different from this PR's implementation.
setup.py (Outdated)

```python
"networkx",
"jinja2",
"fsspec",
"pynvml>=11.4.1",
```
Adding such a hard dependency is definitely not ok without much deeper considerations.
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D78319234.
@pytorchbot merge (Initiating merge automatically since the Phabricator diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Implements #148689
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @kwen2501 @c-p-i-o