Change mutable symm_mem ops to return void instead of aliased tensors#179144

Open
RohitRathore1 wants to merge 3 commits into pytorch:main from RohitRathore1:symm-mem-void-schema

Conversation

@RohitRathore1
Collaborator

Custom operators are not allowed to return an alias of a mutated input (e.g. Tensor(a!) -> Tensor(a!)), as this pattern is incompatible with torch.compile's functionalization. Change all mutable symm_mem ops to return void (-> ()) instead. No callers in the codebase rely on the return values — all use these ops for their in-place side effects.

Discussed with @zou3519 and @kwen2501 on #173513.

@pytorch-bot

pytorch-bot bot commented Apr 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/179144

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit a1fd65e with merge base d386e0b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

TORCH_LIBRARY_FRAGMENT(symm_mem, m) {
  m.def(
-     "multimem_all_reduce_(Tensor(a!) input, str reduce_op, str group_name) -> Tensor(a!)");
+     "multimem_all_reduce_(Tensor(a!) input, str reduce_op, str group_name) -> ()");
Contributor


@kwen2501 an alternative to BC-breaking this is to define a new operator that has no return and have the old operator call it. So something like:

m.def(
    "multimem_all_reduce_(Tensor(a!) input, str reduce_op, str group_name) -> Tensor(a!)");
m.def(
    "multimem_all_reduce_noreturn_(Tensor(a!) input, str reduce_op, str group_name) -> ()");

TORCH_LIBRARY_IMPL(symm_mem, m, CompositeImplicitAutograd) {
  m.impl("multimem_all_reduce_", &multimem_all_reduce_);
}

Tensor multimem_all_reduce_(...) {
  multimem_all_reduce_noreturn_(...);
  return ...;
}

I'm not sure how worth it this is. I looked at the docs and the APIs are "alpha", so the BC-break seems reasonable to do.

Collaborator Author

@RohitRathore1 RohitRathore1 Apr 2, 2026


Hmm, since these APIs are marked alpha and no callers in the codebase (or downstream in vLLM) use the return values, the BC-break seems like the simpler path. Happy to add the _noreturn_ wrapper approach if you'd prefer, though.
cc: @kwen2501

@kwen2501 kwen2501 added labels release notes: distributed (symm_mem), module: symm_mem, topic: bc breaking, suppress-bc-linter, suppress-api-compatibility-check on Apr 2, 2026
@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

Thanks @RohitRathore1 @zou3519, I am checking with team members to sign off on this.

Collaborator

@kwen2501 kwen2501 left a comment


I tend to approve this change.
Looking at the impact radius:

  • Most of the changed ops are of the "_out" form. For these, users pass in the out tensor and would not use the return value anyway.
  • Four ops are of the in-place "_" form. For these, it should already be clear that the input is modified in place.

@kwen2501 kwen2501 requested review from fegin, kwen2501 and ngimel April 2, 2026 16:17
@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

@RohitRathore1
Please check the binding sites at init.cpp lines 1247 and 1259-1260. The .typed<at::Tensor(...)>() calls for stream_write_value32_ and memset32_ need to be updated to .typed<void(...)>() to match the new void return type.

stream_write_value32 (lines 1244–1248):

  auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("symm_mem::stream_write_value32_", "")
          .typed<at::Tensor(at::Tensor&, int64_t, int64_t)>();  // ← wrong return type
  return op.call(input, offset, val);  // ← returns Tensor, but op now returns void

memset32 (lines 1257–1261):

  auto op = c10::Dispatcher::singleton()
                .findSchemaOrThrow("symm_mem::memset32_", "")
                .typed<at::Tensor(                          // ← wrong return type
                    at::Tensor&, int64_t, int64_t, int64_t)>();
  return op.call(input, offset, val, count);  // ← same problem

Both lambdas also use return, which will fail to compile once the op returns void. The fix is to change at::Tensor → void in the .typed<>() calls and drop the return.

@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

Claude checked the vLLM repo; here is the verdict:

vLLM is not impacted.

@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

Claude checked the SGLang repo; here is the verdict:

SGLang does use two of the changed ops, both in /python/sglang/srt/distributed/device_communicators/torch_symm_mem.py:

  # multimem path
  torch.ops.symm_mem.multimem_all_reduce_(
      self.buffer[: inp.numel()], "sum", self.group.group_name
  )

  # two-shot fallback path
  torch.ops.symm_mem.two_shot_all_reduce_(
      self.buffer[: inp.numel()], "sum", self.group.group_name
  )

Both calls discard the return value, so the return type change from Tensor(a!) to () does not break SGLang. The ops are used purely for their in-place side effects.

@RohitRathore1
Collaborator Author

@kwen2501 thanks for providing all these verdicts!

@fegin
Contributor

fegin commented Apr 2, 2026

@RohitRathore1

Can you check if torch/_inductor/comm_lowering.py is going to be affected? More specifically, will this change break the torch.compile + symmetric memory use cases due to Inductor expecting a TensorBox? If it will, you will need to add similar wrapping in comm_lowering.py.

Also, when you search "codebase", what codebase did you mean? Kraken is a public repo and it will be broken after this change. Kraken is fine as we own it and it is just a benchmark repo. But I just want to understand the "codebase" you referred to.

@RohitRathore1
Collaborator Author

@RohitRathore1

Can you check if torch/_inductor/comm_lowering.py is going to be affected? More specifically, will this change break the torch.compile + symmetric memory use cases due to Inductor expecting a TensorBox? If it will, you will need to add similar wrapping in comm_lowering.py.

Also, when you search "codebase", what codebase did you mean? Kraken is a public repo and it will be broken after this change. Kraken is fine as we own it and it is just a benchmark repo. But I just want to understand the "codebase" you referred to.

When I said "codebase", I mainly meant vLLM. Let me check more thoroughly.

@RohitRathore1
Collaborator Author

@fegin yes, this change does affect the torch.compile path, but the follow-up PR #173513 is designed to handle it.

    std::string group_name,
    at::Tensor out) {
-   return one_shot_all_reduce_out_impl(
+   one_shot_all_reduce_out_impl(
Collaborator


Other general nit: a lot of these strings should be std::moved
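A minimal sketch of the nit (the helper name qualified_op_name is made up, not code from this PR): taking the string by value and std::move-ing it forwards the buffer instead of copying it.

```cpp
#include <string>
#include <utility>

// Illustrative sink-parameter helper: group_name is taken by value so callers
// can move a string in, and std::move hands the buffer onward to operator+
// instead of making a second copy.
std::string qualified_op_name(std::string group_name, const std::string& op) {
    return std::move(group_name) + "::" + op;
}
```

A caller that no longer needs its string would invoke it as qualified_op_name(std::move(name), "multimem_all_reduce_").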

Collaborator Author


Thanks for the suggestion! That was a pre-existing pattern; I missed it :(

@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

What Kraken uses:
symm_mem_hdl.stream_write_value32(...) in kraken/comm/copy_engine_all_gather.py:

The return value is not used.

But we'd need to fix the binding in init.cpp in torch. @RohitRathore1

@RohitRathore1
Collaborator Author

What kraken uses: symm_mem_hdl.stream_write_value32(...) in kraken/comm/copy_engine_all_gather.py:

The return value is not used.

But we'd need to fix the binding in init.cpp in torch. @RohitRathore1

@kwen2501 already fixed in an earlier commit: I updated the .typed<>() calls and removed the return statements for both stream_write_value32_ and memset32_ in init.cpp.

@zou3519
Contributor

zou3519 commented Apr 2, 2026

I chatted a bit with @ngimel on this. The current thinking is:

  1. we should fix the problem where torch.compile does not like the original symm_mem custom ops.
  2. we can have temporary variants of the symm_mem custom ops that do support torch.compile while we wait for a fix for (1). I think ideally we are able to support these with torch.compile in pytorch 2.12, unless we think there are other blockers to this, at which point we should just fix (1).

The general motivation is that (1) is something we should do and that we don't want to BC-break people now and also BC-break them later when we get (1) to work.

Fixing (1) will take me or someone ~2 weeks, but I don't have bandwidth to do this for another couple of weeks. I will get to it sometime in the medium term.

Thoughts @kwen2501 @RohitRathore1 ?

Collaborator

@albanD albanD left a comment


I'm very confused here.
Why would we ever consider this to be a good way to go?

This is an antipattern we don't want, as there is no way to get autograd to work.
And I don't see how this fixes the functionalization problem. It just avoids triggering the particular error we put there to prevent silent correctness issues; it doesn't make the pattern right.

@kwen2501
Collaborator

kwen2501 commented Apr 2, 2026

@albanD We are in the same boat of trying to figure out what's a good practice :)
My two cents re autograd:
The ops here (referring to ops.symm_mem) are the bare-minimum of a collective implementation.
They are not meant to be autograd'able.
If someone wants an autograd'able form, they can create a functional form wrapping these bare-minimum implementations, such as:

def foo(x) -> Tensor:
  y = torch.empty(...)
  ops.symm_mem.foo(x, y)
  return y

And add backward formula for it.
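The wrapper idea can be sketched in plain Python (no torch; all names are illustrative, not the real symm_mem ops): the in-place op returns None, matching the `-> ()` schema, and the functional form allocates, mutates, and returns.

```python
def all_reduce_(buf):
    # In-place "collective": every element becomes the sum of the buffer.
    # Returns None, like the "-> ()" schema in this PR.
    s = sum(buf)
    for i in range(len(buf)):
        buf[i] = s

def all_reduce(x):
    # Functional wrapper: allocate a fresh output, run the in-place op on it,
    # and return it. This is the form a backward formula would attach to.
    out = list(x)
    all_reduce_(out)
    return out
```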

Functional collectives in PyTorch do exactly this. They define functional forms that return a new tensor and register proper backward implementations via torch.library.register_autograd. The pattern in torch/distributed/_functional_collectives.py:

  # Forward: y = all_reduce(x)  →  backward: all_reduce the grad too
  torch.library.register_autograd(
      "_c10d_functional::all_reduce",
      all_reduce_backward,          # does all_reduce on grad_output
      setup_context=all_reduce_setup_context,
  )

  # Forward: y = all_gather(x)  →  backward: reduce_scatter the grad
  torch.library.register_autograd(
      "_c10d_functional::all_gather_into_tensor",
      all_gather_into_tensor_backward,  # does reduce_scatter on grad_output
      ...
  )

  # Forward: y = reduce_scatter(x)  →  backward: all_gather the grad
  torch.library.register_autograd(
      "_c10d_functional::reduce_scatter_tensor",
      reduce_scatter_tensor_backward,  # does all_gather on grad_output
      ...
  )

Internally, they just call the in-place dist.reduce_scatter(x, y) version. But that's not visible to autograd.

@ngimel
Collaborator

ngimel commented Apr 2, 2026

It doesn't matter whether these ops are meant to be autograd'able or not: in-place ops return their output, out ops return their outputs. That's the current convention, and we shouldn't be breaking it for unclear reasons.

@albanD
Collaborator

albanD commented Apr 3, 2026

I would expect that the pattern we want is just one op that does what we need it to do, not 3 different ops wrapping each other.

I would argue that the functional collectives for distributed should NOT do the weird wrapping they do and should just be one op each. The fact that there are so many layers of wrapping (including ops that are silently wrong) is a BAD thing; looking back, we should never have done that.
If you want an in-place op, make it a proper op. If it is just an implementation detail of your out-of-place op, then it doesn't need to be an op at all (to avoid confusions like the one here).
