
Allow parallel start NUMA binding#161576

Closed
pdesupinski wants to merge 3 commits into main from
feature/parallel-numa-binding

Conversation

@pdesupinski
Contributor

@pdesupinski pdesupinski commented Aug 27, 2025

Context

In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.

However, we would raise an exception if the subprocesses were spawned in parallel via `ThreadPoolExecutor`, an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID *pid* (or the current process if zero) to a set of CPUs. *mask* is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux man page says [`sched_setaffinity` is per *thread*](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html). As it turns out, the Python doc's per-process wording is misleading.

I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
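
A minimal reproduction sketch of that check (Linux-only, assuming at least two usable CPUs) might look like the following. It pins the main thread, has a worker thread pin itself elsewhere, and confirms the main thread's mask is untouched:

```python
import os
import threading

# Sketch: despite the per-process wording in the Python docs,
# os.sched_setaffinity(0, ...) changes the affinity mask of the
# calling *thread* only (Linux sched_setaffinity(2) semantics).

def main():
    available = sorted(os.sched_getaffinity(0))
    if len(available) < 2:
        print("need at least 2 usable CPUs for this demo")
        return
    cpu_a, cpu_b = available[0], available[1]

    os.sched_setaffinity(0, {cpu_a})  # pin the main thread to cpu_a

    def worker():
        # New threads inherit the creator's mask, so we start on {cpu_a};
        # now pin only this worker thread to cpu_b.
        os.sched_setaffinity(0, {cpu_b})
        assert os.sched_getaffinity(0) == {cpu_b}

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    # If the call were truly per-process, the main thread's mask would
    # now be {cpu_b}; instead it is still {cpu_a}.
    assert os.sched_getaffinity(0) == {cpu_a}
    print("sched_setaffinity affected only the worker thread")

if __name__ == "__main__":
    main()
```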

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
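
As a hypothetical sketch of why the inheritance trick stays safe under a parallel start (Linux-only; the helper names here are illustrative, not torchelastic's actual code): each spawning thread pins itself, and the subprocess it forks inherits that thread's mask rather than racing on a process-wide one.

```python
import multiprocessing as mp
import os
from concurrent.futures import ThreadPoolExecutor

def report_affinity(q):
    # Runs in the child process: report the inherited affinity mask.
    q.put(sorted(os.sched_getaffinity(0)))

def start_pinned_child(cpu):
    # Affects only this spawning thread, so concurrent callers
    # in other threads cannot clobber each other's masks.
    os.sched_setaffinity(0, {cpu})
    ctx = mp.get_context("fork")
    q = ctx.SimpleQueue()
    p = ctx.Process(target=report_affinity, args=(q,))
    p.start()
    inherited = q.get()
    p.join()
    return inherited

if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))[:2]
    with ThreadPoolExecutor(max_workers=len(cpus)) as pool:
        results = list(pool.map(start_pinned_child, cpus))
    for cpu, inherited in zip(cpus, results):
        assert inherited == [cpu]  # each child got its spawner's mask
    print("each child inherited its own spawning thread's mask")
```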

This PR

Remove restrictions against parallel start for NUMA binding.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela

@pytorch-bot

pytorch-bot bot commented Aug 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161576

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

✅ No Failures

As of commit d1c4381 with merge base 443452c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) labels Aug 27, 2025
@pdesupinski pdesupinski added the topic: not user facing topic category label Aug 27, 2025
@pdesupinski pdesupinski requested a review from d4l3k August 27, 2025 00:23
@pdesupinski pdesupinski added the suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) label Aug 27, 2025
@pdesupinski pdesupinski marked this pull request as ready for review August 27, 2025 00:24
@pdesupinski pdesupinski requested a review from albanD as a code owner August 27, 2025 00:24
@facebook-github-bot
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D81092716.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 27, 2025
Member

@d4l3k d4l3k left a comment


LGTM LGTM LGTM! gogogo

@facebook-github-bot
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D81092716.

@pdesupinski pdesupinski force-pushed the feature/parallel-numa-binding branch from f839eab to d1c4381 Compare August 27, 2025 19:23
@facebook-github-bot
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D81092716.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#161576
Approved by: https://github.com/d4l3k
@github-actions github-actions bot deleted the feature/parallel-numa-binding branch September 27, 2025 02:07

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: distributed (torchelastic), suppress-bc-linter (Suppresses the failures of API backward-compatibility linter (Lint/bc_linter)), topic: not user facing (topic category)
