Fake process group Direct construction error by ahkush · Pull Request #163665 · pytorch/pytorch

ahkush · 2025-09-23T17:31:46Z

Fixes #162129. Added validation in _rank_not_in_group() to check if FakeProcessGroup is properly initialized before use, raising a clear error message if torch.distributed.init_process_group(backend='fake') hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for all_reduce and all_to_all_single operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

pytorch-bot · 2025-09-23T17:31:50Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163665

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e5fe478 with merge base bac0f28:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang · 2025-09-24T03:31:15Z

The test feels too late to me. Why can't you discover something bad happened earlier?

ahkush · 2025-09-25T20:02:12Z

I moved the validation to the beginning of each distributed operation instead of waiting until _rank_not_in_group. I added it to every operator because:

1. _group_or_default_group(group) is another common function used in most operators, but in many operators it's called after _rank_not_in_group,
2. I couldn't add it in the operators in FakeProcessGroup.hpp as they're also called later in the function flow, and
3. there's no other common entry point where I can place a single check that catches all cases early enough before they hit the dispatch system.

I'd appreciate any suggestions for a better approach!
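For readers following along, the "validate at the top of every operator" idea can be pictured with a small pure-Python sketch. Nothing here is PyTorch code: `ToyGroup`, `require_registered`, and the toy operators are hypothetical names, and a decorator is used only for brevity, whereas the comment above describes adding the check to each operator individually.

```python
import functools


class ToyGroup:
    """Hypothetical stand-in for a process group; not a PyTorch class."""

    def __init__(self, registered: bool):
        # True when the group came from the official init path (assumption).
        self.registered = registered


def require_registered(op):
    """Run the validation before anything else in the wrapped operation."""

    @functools.wraps(op)
    def wrapper(group, *args, **kwargs):
        if not group.registered:
            raise RuntimeError(
                f"{op.__name__}: group was constructed directly; "
                "use init_process_group(backend='fake') instead"
            )
        return op(group, *args, **kwargs)

    return wrapper


@require_registered
def all_reduce(group, values):
    # Toy semantics: return the input unchanged.
    return list(values)


@require_registered
def all_to_all_single(group, values):
    # Toy semantics: return the input unchanged.
    return list(values)
```

Every operator needs its own guard in this scheme, which is the maintenance cost the next reply objects to.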

ezyang · 2025-09-26T03:21:57Z

Yeah, this cure is worse than the disease, I think.

What if we blocked direct construction of FakeProcessGroup entirely? Instead, the "official" APIs would have to go through some private API that gets around this blockage.

ahkush · 2025-09-30T20:59:16Z

Thanks for pointing me toward the right approach. I've implemented blocking direct construction of FakeProcessGroup entirely.

Changes made:

- Made the constructor private and added a static _create_internal() method for official APIs
- Public __init__ now throws a clear error directing users to use torch.distributed.init_process_group(backend='fake')
- Updated all internal usage to use _create_internal()
- Added tests covering both the error case and proper dispatch behavior

This ensures users get proper dispatch system integration while maintaining backward compatibility for official APIs. The error message guides users toward the correct usage pattern.

Does this approach look good, or would you like any adjustments to the implementation or error message?

kwen2501

I think a solution to the issue filed by @ezyang is for dist.all_reduce to directly call into torch.ops.c10d.all_reduce_, instead of pg.all_reduce.

kwen2501 · 2025-09-30T21:51:40Z

torch/testing/_internal/distributed/fake_pg.py

    """
-    return FakeProcessGroup(
+    return FakeProcessGroup._create_internal(
        common_opts.group_rank, common_opts.group_size, backend_opts
    )


Looks like a bc break?

The _create_fake_pg() function itself has no BC break - same signature, same behavior, still returns a FakeProcessGroup. Only direct FakeProcessGroup() construction breaks (intentionally), which gets a clear error message directing users to the proper API. Internal utilities and type checking continue working unchanged.

ahkush · 2025-10-01T16:00:51Z

@kwen2501
While that would ensure dispatch integration, it wouldn't solve the core issue. The dispatch system requires process groups to be registered in the GroupRegistry (via resolve_process_group()), but directly constructed FakeProcessGroup instances are never registered. So torch.ops.c10d.all_reduce_ would fail with "Could not resolve the process group" for direct constructions.

Also, should users be able to construct FakeProcessGroup directly at all, or is it better to guide them toward the official init_process_group(backend='fake') API for proper integration?

Do you have any suggestions for a better approach that would address the dispatch integration issue while handling the registration requirement? I'd appreciate your thoughts on this.
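The registration argument in this comment can be illustrated with a toy model. It assumes only what the comment states: dispatched ops look a group up in a registry, and directly constructed groups are never registered. `ToyRegistry`, `ToyGroup`, `init_group`, and `dispatched_all_reduce` are hypothetical names, not PyTorch APIs.

```python
class ToyRegistry:
    """Hypothetical name-to-group registry; not the real GroupRegistry."""

    def __init__(self):
        self._groups = {}

    def register(self, name, group):
        self._groups[name] = group

    def resolve(self, name):
        try:
            return self._groups[name]
        except KeyError:
            raise ValueError(
                f"Could not resolve the process group {name!r}"
            ) from None


class ToyGroup:
    def __init__(self, name):
        self.name = name


def init_group(registry, name):
    """Official path in this toy: construct the group and register it."""
    group = ToyGroup(name)
    registry.register(name, group)
    return group


def dispatched_all_reduce(registry, group_name, values):
    """Dispatcher-style op: it receives only a name and must resolve it."""
    registry.resolve(group_name)
    return list(values)
```

A group built with `ToyGroup("direct")` alone never reaches the registry, so `dispatched_all_reduce(registry, "direct", ...)` fails at the lookup no matter how the Python-level wrapper is written.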

ezyang · 2025-10-02T03:42:11Z

torch/distributed/fsdp/_flat_param.py

        self.process_group = process_group
        if self._use_fake_all_gather or self._use_fake_reduce:
-            self._fake_process_group = FakeProcessGroup(
+            self._fake_process_group = FakeProcessGroup._create_internal(


I'm actually kind of skeptical about this use site, it sort of feels like potentially this is buggy LOL

ezyang · 2025-10-02T03:42:37Z

torch/csrc/distributed/c10d/FakeProcessGroup.hpp

-      c10::intrusive_ptr<Options> options = c10::make_intrusive<Options>())
-      : Backend(rank, size), options_(std::move(options)) {}
+      c10::intrusive_ptr<Options> options = c10::make_intrusive<Options>()) {
+    return c10::intrusive_ptr<FakeProcessGroup>(


nit: make_intrusive_ptr

ezyang · 2025-10-02T03:42:59Z

torch/csrc/distributed/c10d/FakeProcessGroup.hpp

 private:
+  // Private constructor used by official APIs
+  FakeProcessGroup(int rank, int size, c10::intrusive_ptr<Options> options)
+      : Backend(rank, size), options_(std::move(options)) {}


I don't think it is as important to hide the ctor on the C++ side

ezyang · 2025-10-02T03:43:46Z

This looks good. Unfortunately you need to rebase

ezyang · 2025-10-02T19:27:33Z

@pytorchbot merge

pytorchmergebot · 2025-10-02T19:30:11Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

These happen when building with CMAKE_BUILD_TYPE=RelWithAssert

This should fix two types of failures that started with #163665

Disclaimer that I used a lot of AI since I don't know how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now)

The first type is (truncated):
```
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
    backend_class = creator_fn(dist_backend_opts, backend_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
    return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0
#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0
#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0
#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR

The other type is
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
    self.assertEqual(out, "")
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
-   if is_available() and not torch._C._c10d_init():

To execute this test, run the following from the base repo dir:
    python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one?

Pull Request resolved: #165479
Approved by: https://github.com/ezyang
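The first failure is easier to picture with a toy model of an object that carries its own refcount. This is a sketch built only from the assertion text in the trace above, with two assumptions that the trace does not confirm: a freshly allocated object starts at refcount zero, and the adopting constructor leaves the count untouched. `ToyTarget`, `ToyIntrusivePtr`, `make_intrusive`, and `adopt_without_increase` are hypothetical Python names, not the c10 API.

```python
class ToyTarget:
    """Hypothetical object with an embedded refcount (assumed to start at 0)."""

    def __init__(self):
        self.refcount = 0


class ToyIntrusivePtr:
    """Hypothetical smart pointer sharing the refcount stored in its target."""

    def __init__(self, target):
        self.target = target

    def copy(self):
        # Copying a pointer takes another reference on the target.
        self.target.refcount += 1
        if self.target.refcount == 1:
            # Mirrors the "new_refcount != 1" check quoted in the trace.
            raise RuntimeError(
                "Cannot increase refcount after it reached zero."
            )
        return ToyIntrusivePtr(self.target)


def make_intrusive():
    """Factory that establishes the first reference itself."""
    target = ToyTarget()
    target.refcount = 1
    return ToyIntrusivePtr(target)


def adopt_without_increase(target):
    """Wrap a target without touching its refcount (the risky path)."""
    return ToyIntrusivePtr(target)
```

In this model a pointer from `make_intrusive()` can be copied safely, while copying one built by `adopt_without_increase(ToyTarget())` trips the check, which matches the direction of the fix described above.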


pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Sep 23, 2025

pytorchbot added the open source label Sep 23, 2025

jbschlosser requested review from d4l3k, kwen2501 and wconstab and removed request for d4l3k and kwen2501 September 24, 2025 17:07

jbschlosser added the triaged and topic: bug fixes labels Sep 24, 2025

ahkush force-pushed the fake-process-group-direct-usage-error branch from 50cfa4a to ebc113a Compare September 25, 2025 19:27

kwen2501 reviewed Sep 30, 2025

ezyang reviewed Oct 2, 2025

ezyang approved these changes Oct 2, 2025

ahkush added 5 commits October 2, 2025 14:55

162129: Add FakeProcessGroup direct usage validation

7d63b1b

Validation added at a common point

3458015

Updated Error message and inlined validation code.

5796b4c

Add early validation for FakeProcessGroup initialization

c8e4446

Block direct FakeProcessGroup construction, add internal factory method

217e516

ahkush added 2 commits October 2, 2025 15:00

Apply lintrunner formatting

876472a

Merge upstream changes

e5fe478

ahkush force-pushed the fake-process-group-direct-usage-error branch from 5606d73 to e5fe478 Compare October 2, 2025 15:03

pytorch-bot bot added the ciflow/trunk label Oct 2, 2025

pytorchmergebot added the merging label Oct 2, 2025

pytorchmergebot added the Merged label Oct 2, 2025

pytorchmergebot closed this in ece5e0f Oct 2, 2025

pytorchmergebot removed the merging label Oct 2, 2025

clee2000 mentioned this pull request Oct 14, 2025

Fix periodic debug tests failing due to FakeProcessGroup things #165479

Closed
