[XPU] Add _make_xccl_premul_sum binding for XCCL backend#25

Open
Chao1Han wants to merge 2689 commits into master from
xccl-premul

Conversation

@Chao1Han Chao1Han commented Jan 9, 2026

This PR adds a Python binding for `_make_xccl_premul_sum` to support the PREMUL_SUM reduce operation in the XCCL backend (Intel GPU).
The `_make_xccl_premul_sum` binding directly reuses `makeNCCLPreMulSum` because XCCL and NCCL have identical memory layouts.
Currently, the XCCL backend supports PREMUL_SUM for:

  • ✅ allreduce
  • ✅ reduce_scatter

Not yet supported:

  • ❌ reduce (to be added in future work)

The implementation lives in intel/torch-xpu-ops#1948, and a base test case in intel/torch-xpu-ops#2690.
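To illustrate the semantics this binding exposes, here is a minimal pure-Python sketch (not the actual `torch.distributed` API, and no real collective is launched): an allreduce with PREMUL_SUM scales every rank's contribution by a factor before summing.

```python
# Hypothetical sketch of the PREMUL_SUM semantics: allreduce(PREMUL_SUM)
# computes sum_i (factor * x_i) across ranks. Here "ranks" are just lists.
def premul_sum_allreduce(rank_tensors, factor):
    """Simulate allreduce with PREMUL_SUM over a list of per-rank values."""
    scaled = [[factor * v for v in t] for t in rank_tensors]
    # Element-wise sum across ranks, like a regular allreduce.
    return [sum(col) for col in zip(*scaled)]

# Two "ranks", factor 0.5: result[j] = 0.5 * (x0[j] + x1[j])
result = premul_sum_allreduce([[2.0, 4.0], [6.0, 8.0]], 0.5)
```

In the real backend the factor is carried by the reduce-op object created by the binding; the arithmetic shown here is what the collective computes.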

@Chao1Han Chao1Han changed the title Add xccl premul sum register [XPU] Add _make_xccl_premul_sum binding for XCCL backend Jan 9, 2026
izaitsevfb and others added 28 commits March 5, 2026 01:35
…ditions (pytorch#176522)

There is a bug in github webhooks.

When fired for `pull_request_review_comment` trigger, the `author_association` is `CONTRIBUTOR` (always?).
This messes up the pre-check logic for review workflow, causing it to skip.

This PR relaxes the pre-check (it's an optimization anyway), and I also relaxed `bedrock` environment protection rules, allowing PR merge branches.

[tested in ciforge](pytorch/ciforge#145 (comment))
Pull Request resolved: pytorch#176522
Approved by: https://github.com/drisspg
See title - we add a skill to take a bug, produce a min repro, then iterate on it until fixed
Pull Request resolved: pytorch#176359
Approved by: https://github.com/aorenste
This reverts commit 41e5795.

Reverted pytorch#176320 on behalf of https://github.com/malfet due to It works now, but does not upload the log I need ([comment](pytorch#176320 (comment)))
)

## Summary
- Fix MPS `masked_scatter` to preserve scalar tensor shape `[]` instead of incorrectly returning `[1]`
- The function now records whether `self` was originally 0-dimensional and squeezes the result back after processing

## Test plan
- Added scalar tensor test case to `test_masked_scatter` in `test/test_mps.py`
- Verified fix with reproducer script

Related error: pytorch/rl#3137 (comment)
Pull Request resolved: pytorch#174381
Approved by: https://github.com/kurtamohler
…torch#176553)

Instead of always using offline mode, we need a way to periodically refresh the local cache. This job runs only once per day, so rate limits shouldn't be an issue.

I observe [a failure in trunk](https://github.com/pytorch/pytorch/actions/runs/22625578355/job/65575251092#step:15:13837) for `pytorch/gemma-3-12b-it-int4` where the local cache no longer seems to work; I'm not sure why, but it could be due to the recent transformers version update. The failure goes away when I turn off TRANSFORMERS_OFFLINE to refresh the local cache.

Pull Request resolved: pytorch#176553
Approved by: https://github.com/zou3519
Convert Union[X, Y] to X | Y and Optional[X] to X | None using ruff rules UP007 and UP045 since torch is 3.10+

Note: we skip testing here - that is the last directory remaining

Pull Request resolved: pytorch#176458
Approved by: https://github.com/aorenste
…ch#168894)

This PR ensures that ops on the same stream as events are not reordered around them by using the control deps HOP. This HOP in essence creates passthrough aliases of the ops before and after the event to ensure ordering does not change. Stream assignments are already annotated on the nodes, so those ops can still be reordered.

Pull Request resolved: pytorch#168894
Approved by: https://github.com/aorenste
This PR adds support for Symbolic events in the TorchInductor scheduler. In short, we create an EventFactory which provides event indices which monotonically increase as compile proceeds and enables reuse of events that are no longer used. It also adds codegen support for events.

Pull Request resolved: pytorch#165390
Approved by: https://github.com/eellison
This PR adds utility functions for managing streams and a stream pool.

Pull Request resolved: pytorch#165504
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#165390
…h#165391)

This PR implements stream codegen to the SubgraphWrapper Codegen. It supports codegen for enter/exiting stream contexts and calling record_stream on returned tensors.

Pull Request resolved: pytorch#165391
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#165390, pytorch#165504
…)"

This reverts commit f1da356.

Reverted pytorch#176458 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#176458 (comment)))
…orch#176515)

create_name stores its counter under base_count[base] (e.g. "sin") but
was looking it up under base_count[candidate] (e.g. "sin_1").  When the
candidate already has a numeric suffix the lookup always misses, so the
while-loop that probes for a free name starts from the regex-extracted
number instead of the stored counter — degrading to O(existing_names)
per call.

This makes repeated node_copy of the same subgraph (as done by
inline_invoke_subgraph) quadratic: for an 80-layer model the pass went
from ~100ms to ~4.6s.

Fix: look up base_count[base] to match the store.
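A minimal sketch (not the actual torch.fx implementation) of the fixed lookup: the free-suffix counter is stored under `base_count[base]`, so the read must also key on `base`; keying on the full candidate (e.g. `"sin_1"`) always missed, restarting the probe loop from the regex-extracted number.

```python
import re

def create_name(candidate, used, base_count):
    """Return a unique name for candidate, probing from a per-base counter."""
    base = re.sub(r"_\d+$", "", candidate)  # "sin_1" -> base "sin"
    if candidate not in used:
        used.add(candidate)
        return candidate
    num = base_count.get(base, 1)           # fixed: base_count[base], not [candidate]
    while f"{base}_{num}" in used:          # amortized O(1) thanks to the counter
        num += 1
    base_count[base] = num + 1              # remember where the next probe starts
    name = f"{base}_{num}"
    used.add(name)
    return name

used, counts = set(), {}
names = [create_name("sin", used, counts) for _ in range(3)]
```

With the buggy `base_count[candidate]` lookup, every call for an already-suffixed candidate would re-scan from its own suffix, which is what made repeated `node_copy` quadratic.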

Authored with Claude.
Pull Request resolved: pytorch#176515
Approved by: https://github.com/jansel, https://github.com/zou3519
…torch#176452)

is_from_source(source, target) previously only compared at the root of
the ChainedSource hierarchy. This meant that
is_from_source(AttrSource(X, 'c'), X) returned False when X was itself
a ChainedSource (e.g. UnspecializedNNModuleSource(GlobalSource(...))),
because the function walked past X all the way to the root before
comparing.

Check source == target at each level before recursing so that
intermediate sources are correctly recognized as ancestors.
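A minimal sketch of the fix with hypothetical classes (dynamo's actual `Source` types are richer): compare `source == target` at every level of the chain before recursing, rather than only at the root.

```python
class Source:
    """Toy stand-in for a (possibly chained) dynamo source."""
    def __init__(self, name, base=None):
        self.name, self.base = name, base
    def __eq__(self, other):
        return isinstance(other, Source) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

def is_from_source(source, target):
    if source == target:           # fixed: check at each level, not just the root
        return True
    if source.base is not None:    # walk one step toward the root
        return is_from_source(source.base, target)
    return False

root = Source("G")             # e.g. a GlobalSource at the root
x = Source("X", base=root)     # a chained source wrapping the root
attr = Source("X.c", base=x)   # e.g. AttrSource(X, 'c')
```

With the old root-only comparison, `is_from_source(attr, x)` walked past `x` to `root` and returned False even though `x` is an ancestor of `attr`.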

Authored with Claude.

Pull Request resolved: pytorch#176452
Approved by: https://github.com/williamwen42
ghstack dependencies: pytorch#176515
…ytorch#175946)

```
$ python benchmarks/dynamo/huggingface.py --backend inductor --performance --inference --compare-backed-unbacked
```

```
--- AlbertForMaskedLM ---
  backed... 1.349x
  unbacked... 1.345x
  => diff: -0.3%

--- AllenaiLongformerBase ---
  backed... 1.226x
  unbacked... 1.146x
  => diff: -6.5%

--- BartForCausalLM ---
  backed... 1.022x
  unbacked... 1.015x
  => diff: -0.7%

--- BertForMaskedLM ---
  backed... 1.186x
  unbacked... 1.187x
  => diff: +0.1%

--- BlenderbotForCausalLM ---
  backed... 1.058x
  unbacked... 1.061x
  => diff: +0.3%
...
```

Pull Request resolved: pytorch#175946
Approved by: https://github.com/jansel
…on to avoid DDE (pytorch#175956)

Encountered while running:
```
python benchmarks/dynamo/huggingface.py --only AllenaiLongformerBase --backend inductor --performance --inference --unbacked-batch-only
```

Pull Request resolved: pytorch#175956
Approved by: https://github.com/ColinPeppler, https://github.com/jansel
ghstack dependencies: pytorch#175946
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#176242
Approved by: https://github.com/pytorchbot
…are prepended (pytorch#175904)

## Summary

When effectful ops (e.g., `with_effects`) are present, `handle_effect_tokens_fn()` prepends effect token placeholders to the input args. However, `static_input_indices` in `ViewAndMutationMeta` is computed before this prepending and is not adjusted afterwards. This causes indices to point to the wrong inputs, leading to issues like unnecessary CUDA graph re-recording.

## Problem

In `handle_effect_tokens_fn()`, effect tokens are prepended to args:

```python
additional_fwd_token_inputs = [torch.tensor([])] * num_tokens
args = [*additional_fwd_token_inputs, *args]  # tokens prepended at index 0
```

But `meta.static_input_indices` is not offset by `num_tokens`. When these indices are later used (e.g., by CUDAGraph's `check_invariants`), they point to the wrong inputs:

Before tokens: `args=[activation, weight], static_input_indices=[1] → weight ✓`
After tokens: `args=[token, activation, weight], static_input_indices=[1] → activation ✗`
Expected: `static_input_indices=[2] (offset by num_tokens=1) → weight ✓`

## Impact

- Activations get incorrectly marked as static inputs
- CUDAGraph's `check_invariants` sees data pointer changes for "static" inputs
- This triggers unnecessary re-recording, causing performance degradation

## Fix

Offset `static_input_indices` by `num_tokens` after prepending effect tokens in the forward-only (`trace_joint=False`) path:

```python
if num_tokens > 0:
    meta.static_input_indices = [
        idx + num_tokens for idx in meta.static_input_indices
    ]
```

## Unit Test

Added `test_static_input_indices_with_effect_tokens` in `test/functorch/test_aotdispatch.py` which:
1. Registers a custom effectful op via `_register_effectful_op`
2. Compiles a function with `torch.compile` using a metadata-capturing backend
3. Verifies that all `static_input_indices` are `>= num_tokens` after effect tokens are prepended (i.e., no index incorrectly points to a token input)

cc @yanboliang

Pull Request resolved: pytorch#175904
Approved by: https://github.com/angelayi
This is a cleanup for 4 functions: `upsample_bilinear2d_aa_kernel_impl`, `upsample_bilinear2d_kernel_impl`, `upsample_bicubic2d_aa_kernel_impl`, and `upsample_bicubic2d_kernel_impl`. They all follow the same dispatch logic but were previously implementing that logic in slightly different ways, and had duplicated calls (to e.g. `separable_upsample_generic_Nd_kernel_impl`).

This PR unifies and simplifies the logic which now looks like this for all 4 functions:

```py
if dtype == uint8:
    if AVX2 and ...:
        return avx2_path(...)
    elif aarch64 and ...:
        return neon_path(...)
else:
    return generic_path(...)
```

Pull Request resolved: pytorch#176422
Approved by: https://github.com/Skylion007
# Motivation
This PR aims to fix `torch.Stream` used as a context manager in nested/reentrant scenarios. `torch.cuda.stream` and `torch.xpu.stream` already support these usages.

The following scenario would be fixed with this PR:
```python
import torch
s0 = torch.Stream()
with s0, s0:
    pass
```
```python
import torch
s0 = torch.Stream()
s1 = torch.Stream()
with s0, s1:
    with s0, s1:
        pass
```

# Additional Context
Fix pytorch#176560

Pull Request resolved: pytorch#176568
Approved by: https://github.com/albanD
)

As mentioned above. No other changes.
Pull Request resolved: pytorch#176437
Approved by: https://github.com/Skylion007
)

Summary:
Our test is failing with a `UnicodeDecodeError` during Triton template loading. The cause was a non-ASCII em-dash character (`–`, U+2013) in a comment on line 2 of `triton_depthwise_conv.py.jinja`. When the Triton template engine reads the file, it uses ASCII decoding, which cannot handle multi-byte UTF-8 characters.

The fix replaces the em-dash with a standard ASCII hyphen (`-`).

Test Plan: Ran cogwheel test

Reviewed By: chevalierNoir, kqfu

Differential Revision: D95211429

Pull Request resolved: pytorch#176484
Approved by: https://github.com/kqfu, https://github.com/Skylion007
stmcgovern and others added 28 commits March 9, 2026 19:07
…ersions (pytorch#172696)

Fixes pytorch#172684
Updated to use single_dim_strategy.
Type conversion to int/bool on a `Partial(sum)` placement incorrectly preserved the Partial placement, producing wrong results, since `trunc(a+b) != trunc(a) + trunc(b)`.
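The nonlinearity is easy to see with a two-shard example: truncation does not distribute over addition, so casting each shard's partial value before summing gives a different answer than summing first.

```python
import math

# Two shards each hold a partial value of 0.6. Casting per-shard and then
# summing (what preserving Partial through an int cast computed) loses the
# fractional carry; summing first and then casting is the correct result.
a, b = 0.6, 0.6
cast_then_sum = math.trunc(a) + math.trunc(b)  # per-shard cast, then reduce
sum_then_cast = math.trunc(a + b)              # reduce, then cast
```

Since the two disagree (0 vs 1 here), the cast cannot be applied under a pending sum, which is why the custom strategy must check linearity before keeping the Partial placement.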

This adds a custom strategy for _to_copy that checks if the dtype conversion is linear for the reduce operation before preserving Partial.

This PR is offered in support of the Partial correctness stabilization efforts.
Pull Request resolved: pytorch#172696
Approved by: https://github.com/wconstab
This fixes MPS SDPA output shape for cases where `value.size(-1) != query.size(-1)`, so output now follows `(..., L, Ev)` as expected. I also added guards in Metal kernel paths that assume equal qkv head dims.

Added the updated meta shape inference for the `sdpa_general_mps` path which seems to have been left out initially.

Added regression coverage in `test/test_transformers.py` covering the shape semantics, and a similar one in `test/test_mps.py` that also checks for numerical parity with CPU.

Fixes pytorch#176767
Pull Request resolved: pytorch#176843
Approved by: https://github.com/malfet
…twise ops (pytorch#175795)"

This reverts commit 7cafe7f.

Reverted pytorch#175795 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175795 (comment)))
Previously, `run_pre_grad_passes` was called unconditionally at the top
of `_compile_fx_main`.  This meant pre-grad transformations were not
included in cached artifacts and ran unnecessarily on cache hits.

Move pre-grad passes into `aot_module_simplified` (Path B) via a
callback so they run after the cache lookup — on cache miss only.

`_compile_fx_main` has two compilation paths that diverge at the
`V.aot_compilation` check: Path A uses `aot_export_module` (AOTInductor,
no cache) and Path B uses `aot_autograd` → `aot_module_simplified` (with
`AOTAutogradCache`).  Since Path A has no cache, run pre-grad passes
explicitly before `aot_export_module`.
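A minimal sketch with hypothetical names (not the actual `_compile_fx_main` code) of the Path B restructuring: the pre-grad passes are handed to the cached compile path as a callback, so they run only on a cache miss and their effect is baked into the cached artifact.

```python
def compile_with_cache(key, compile_fn, pre_grad_cb, cache):
    """Run pre-grad passes via callback on cache miss only."""
    if key in cache:
        return cache[key]              # cache hit: passes already baked in
    transformed = pre_grad_cb(key)     # cache miss: run pre-grad passes now
    artifact = compile_fn(transformed)
    cache[key] = artifact
    return artifact

calls = []
def pre_grad(gm):
    calls.append(gm)                   # record how often passes actually ran
    return gm + ":pregrad"

cache = {}
first = compile_with_cache("gm0", lambda g: g + ":compiled", pre_grad, cache)
second = compile_with_cache("gm0", lambda g: g + ":compiled", pre_grad, cache)
```

The second call hits the cache and skips the passes entirely, which is the behavior the PR moves to; previously `pre_grad` would have run on both calls.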

Pull Request resolved: pytorch#176340
Approved by: https://github.com/aorenste
Summary: See title

Test Plan:
```
buck test fbcode//mode/opt fbcode//caffe2/test/distributed/elastic/agent/server/test:local_agent_test -- --run-disabled
```

Differential Revision: D95802783

Pull Request resolved: pytorch#176887
Approved by: https://github.com/Skylion007
…orch#176817)

Summary:
The `CUDACachingAllocator` (a `DeviceAllocator`) and Caffe2's legacy
`DefaultCUDAAllocator` (a plain `Allocator`) both registered for
`DeviceType::CUDA` at priority 0. Since `SetAllocator` uses `>=` comparison,
whichever static initializer ran last would win. When the legacy
allocator won the race, `dynamic_cast<DeviceAllocator*>` in
`getDeviceAllocator()` would fail, crashing `torch.accelerator.empty_cache()`
and other `torch.accelerator` APIs. To be clear, this is not an issue in
pure OSS PyTorch, where the Caffe2 legacy CUDA allocator does not exist.

Fix by bumping `CUDACachingAllocator`'s registration priority to 1 so it
always takes precedence over the legacy Caffe2 allocator regardless of
static initialization order.
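A minimal sketch (not the actual c10 code) of why the `>=` comparison made registration order matter: at equal priority, whichever allocator registers last wins, so a static-init race decides the winner. Bumping the caching allocator to priority 1 makes the outcome order-independent.

```python
registry = {}

def set_allocator(device, allocator, priority=0):
    """Register allocator for device; ties go to whoever registers last."""
    cur_priority = registry.get(device, (None, -1))[1]
    if priority >= cur_priority:       # the >= is what makes ties order-dependent
        registry[device] = (allocator, priority)

# The race: legacy allocator's static initializer runs last, both at priority 0.
set_allocator("cuda", "CUDACachingAllocator", priority=0)
set_allocator("cuda", "DefaultCUDAAllocator", priority=0)
racy_winner = registry["cuda"][0]      # legacy allocator wins the tie

# The fix: priority 1 beats priority 0 regardless of registration order.
registry.clear()
set_allocator("cuda", "DefaultCUDAAllocator", priority=0)
set_allocator("cuda", "CUDACachingAllocator", priority=1)
fixed_winner = registry["cuda"][0]
```

In the buggy ordering the winner is whichever registrant happens to run last; with the priority bump the caching allocator always prevails.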

This SIOF surfaced recently in vLLM after some code was generalized to use
`torch.accelerator.empty_cache()` instead of `torch.cuda.empty_cache()` in
vllm-project/vllm#30681.

Test Plan:
```
buck test fbcode//mode/opt fbcode//vllm/omni:test_kernels_rotary_embedding -- --exact 'fbcode//vllm/omni:test_kernels_rotary_embedding - test_rotary_embedding.py::test_rotary_embedding_opcheck[False-False-1024-108-32-True-11-cuda]'
```

Previously: 1 passed, 1 error (`RuntimeError` during teardown)
Now: 2 passed, 0 errors

Errors/stack traces like the following are resolved after this change:
```
    def empty_cache() -> None:
        r"""Release all unoccupied cached memory currently held by the caching
        allocator so that those can be used in other application.

        .. note:: This function is a no-op if the memory allocator for the current
            :ref:`accelerator <accelerators>` has not been initialized.
        """
>       if not torch._C._accelerator_isAllocatorInitialized():
E       RuntimeError: device_allocator INTERNAL ASSERT FAILED at "fbcode/caffe2/c10/core/CachingDeviceAllocator.h":253, please report a bug to PyTorch. Allocator for cuda is not a DeviceAllocator.
```

Differential Revision: D95703075

Pull Request resolved: pytorch#176817
Approved by: https://github.com/albanD
Add _dijkstra_expand_single_dim_strategy_to_mesh, a priority-queue search
over input placement states that finds the lowest-cost sharding for an op
without enumerating all S^N strategy combinations (S = single-dim strategies,
N = mesh dimensions). The search uses _PreparedSingleDimStrategy (from the
previous commit) to materialize single-dim rules and try_propagate() to
test whether a candidate state matches on every mesh dimension.

Each search state is a tuple of per-input placement tuples. Neighbors are
generated by changing one placement on one mesh dimension for one input,
using _get_neighbor_placements which encodes DTensor redistribute transition
rules (Replicate <-> Shard, Partial -> Replicate/Shard). Cost computation
uses _compute_redistribute_cost which calls _compute_placement_transition_cost
directly per mesh dimension, avoiding DTensorSpec construction and
_gen_transform_infos planning overhead.

Returns None for _StridedShard inputs, signaling the caller to fall back to
full expansion.
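The search structure described above can be sketched in miniature (hypothetical helper, not the DTensor code): Dijkstra over discrete placement states, where neighbors change one position at a time and the search stops at the first state accepted by a match predicate. The real implementation uses DTensor's redistribute transition costs; here every change costs 1.

```python
import heapq

def cheapest_matching_state(start, options, matches):
    """Priority-queue search over placement tuples; returns (cost, state)."""
    heap, seen = [(0, start)], {start}
    while heap:
        cost, state = heapq.heappop(heap)
        if matches(state):                 # first match popped is the cheapest
            return cost, state
        for i in range(len(state)):        # neighbors: change one position
            for opt in options:
                nxt = state[:i] + (opt,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost + 1, nxt))
    return None

# Cheapest way to reach all-"S" placements starting from all-"R".
found = cheapest_matching_state(("R", "R"), ["R", "S"], lambda s: s == ("S", "S"))
```

Because states are popped in cost order, the first accepted state is guaranteed cheapest without enumerating all S^N combinations, which is the point of the PR.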

Authored with Claude.
Pull Request resolved: pytorch#169438
Approved by: https://github.com/anshul-si, https://github.com/zpcore

Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
Targeted filepaths that seem most likely to disrupt torchtitan, can
revisit the specific paths over time.

Fixes pytorch/torchtitan#2350
Pull Request resolved: pytorch#176774
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#175901
First of all, fewer systems now have `wget` installed by default, while almost all Linux/macOS systems come with `curl`.

If a script with that name already exists, `wget` downloads it under a `.${NUM}`-suffixed alias, which results in the reporter posting results from something else; for example, see pytorch#176829
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
python collect_env.py
--2026-03-08 23:35:37--  https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31107 (30K) [text/plain]
Saving to: ‘collect_env.py.1’

collect_env.py.1            100%[==========================================>]  30.38K  --.-KB/s    in 0.04s

2026-03-08 23:35:38 (707 KB/s) - ‘collect_env.py.1’ saved [31107/31107]

Traceback (most recent call last):
  File "/mnt/d/my/work/study/ai/kaggle_code/aimo2/collect_env.py", line 15, in <module>
```
Pull Request resolved: pytorch#176904
Approved by: https://github.com/msaroufim, https://github.com/seemethere
…ame (pytorch#176515)" (pytorch#176948)

## Summary
Reverts pytorch#176515.

This is a prerequisite for reverting the full `[fx] Move _Namespace to C++` series (pytorch#170962), which was reverted internally due to S627920 but the revert was never exported to GitHub.

The quadratic fix patches `torch/csrc/fx/graph.cpp` which was introduced by pytorch#170962. This revert must land first so that pytorch#170962 can be cleanly reverted afterwards.

## Test plan
CI — this revert removes a bugfix from C++ code that will itself be reverted in a follow-up PR.
Pull Request resolved: pytorch#176948
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
This reverts commit 5c68844.

Reverted pytorch#170962 on behalf of https://github.com/wdvr due to reverted in fbcode ([comment](pytorch#170962 (comment)))
)

Summary:
When K == 1, matrix multiplication (M, 1) @ (1, N) is an outer product.
Instead of launching a full GEMM kernel, we decompose it into a broadcasted
pointwise multiply at the ATen decomposition level, which is more efficient
for this memory-bound case.
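The identity behind the decomposition can be checked in a few lines of pure Python (no torch): for K == 1, `(M, 1) @ (1, N)` is an outer product, so every output element is just `a[i] * b[j]`, i.e. a broadcasted pointwise multiply.

```python
def mm_k1(col, row):
    """(M, 1) @ (1, N) for K == 1, computed as a broadcast multiply."""
    return [[a * b for b in row] for a in col]

def mm_naive(A, B):
    """Reference (M, K) @ (K, N) matmul for comparison."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

col, row = [1.0, 2.0, 3.0], [10.0, 20.0]
decomposed = mm_k1(col, row)
reference = mm_naive([[v] for v in col], [row])
```

Both paths produce the same matrix; the win on GPU comes from skipping the GEMM kernel launch for this memory-bound shape, not from different math.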

This is a reland of D94097622 with two fixes:
- Skip decomposition when M==1 or N==1 to avoid output strides
  from the broadcast multiply not matching mm strides.
- Remove `as_strided` stride fixup that was causing issues with Helion
  (SympifyError on symbolic shapes).

The M==1/N==1 guard also applies to the existing CPU K==1 decomposition path.

**aten.mm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.64 | 1.13x |
| (150, 150, 1) | 12.3 | 11.2 | 1.10x | 9.82 | 8.70 | 1.13x |
| (200, 200, 1) | 12.3 | 11.3 | 1.09x | 9.95 | 8.80 | 1.13x |
| (256, 256, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.70 | 1.12x |
| (512, 512, 1) | 12.3 | 11.2 | 1.10x | 9.92 | 8.80 | 1.13x |
| (1024, 1024, 1) | 14.3 | 13.2 | 1.09x | 11.39 | 9.44 | 1.21x |
| (2048, 2048, 1) | 20.5 | 15.3 | **1.34x** | 16.19 | 12.83 | **1.26x** |
| (4096, 4096, 1) | 35.8 | 26.8 | **1.33x** | 37.98 | 29.12 | **1.30x** |
| (8192, 8192, 1) | 96.3 | 68.6 | **1.40x** | 120.48 | 89.12 | **1.35x** |
| (16384, 16384, 1) | 329.8 | 234.5 | **1.41x** | 387.42 | 249.54 | **1.55x** |
| (4608, 20, 1) | 13.2 | 11.3 | 1.17x | 10.02 | 8.86 | 1.13x |
| (4608, 32, 1) | 13.2 | 11.3 | 1.17x | 9.95 | 8.86 | 1.12x |
| (4608, 128, 1) | 13.2 | 11.4 | 1.17x | 10.94 | 8.99 | 1.22x |
| (4608, 256, 1) | 14.3 | 13.2 | 1.09x | 12.22 | 9.50 | **1.29x** |
| (4608, 512, 1) | 17.4 | 13.3 | **1.31x** | 14.02 | 10.59 | **1.32x** |
| (4608, 1024, 1) | 20.5 | 15.3 | **1.34x** | 17.06 | 13.18 | **1.29x** |
| (1024, 4096, 1) | 20.5 | 15.3 | **1.34x** | 16.80 | 13.25 | **1.27x** |
| (4096, 1024, 1) | 20.5 | 15.3 | **1.34x** | 16.22 | 12.51 | **1.30x** |

Geomean speedup: B200 **1.21x**, H100 **1.22x**, 0 regressions.

**aten.addmm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 12.3 | 1.00x | 9.76 | 9.06 | 1.08x |
| (150, 150, 1) | 12.4 | 12.3 | 1.01x | 10.08 | 9.18 | 1.10x |
| (200, 200, 1) | 12.4 | 12.3 | 1.00x | 9.98 | 9.31 | 1.07x |
| (256, 256, 1) | 12.3 | 12.3 | 1.00x | 9.86 | 9.38 | 1.05x |
| (512, 512, 1) | 13.3 | 13.2 | 1.01x | 10.37 | 9.73 | 1.07x |
| (1024, 1024, 1) | 15.3 | 13.3 | 1.15x | 12.32 | 11.20 | 1.10x |
| (2048, 2048, 1) | 23.6 | 18.5 | **1.27x** | 19.01 | 16.19 | **1.17x** |
| (4096, 4096, 1) | 56.3 | 33.8 | **1.66x** | 58.72 | 45.60 | **1.29x** |
| (8192, 8192, 1) | 172.2 | 102.3 | **1.68x** | 166.75 | 148.45 | 1.12x |
| (16384, 16384, 1) | 665.8 | 359.5 | **1.85x** | 638.66 | 503.23 | **1.27x** |
| (4608, 20, 1) | 13.2 | 12.3 | 1.07x | 10.21 | 9.47 | 1.08x |
| (4608, 32, 1) | 13.2 | 12.4 | 1.06x | 10.11 | 9.47 | 1.07x |
| (4608, 128, 1) | 13.3 | 13.2 | 1.00x | 11.68 | 10.27 | 1.14x |
| (4608, 256, 1) | 15.3 | 13.4 | 1.14x | 13.28 | 11.55 | 1.15x |
| (4608, 512, 1) | 18.6 | 15.4 | 1.20x | 15.87 | 13.63 | 1.16x |
| (4608, 1024, 1) | 25.5 | 19.4 | **1.31x** | 21.02 | 17.92 | **1.17x** |
| (1024, 4096, 1) | 23.5 | 18.5 | **1.27x** | 18.94 | 16.38 | 1.16x |
| (4096, 1024, 1) | 23.5 | 18.5 | **1.27x** | 18.98 | 16.29 | 1.17x |

Geomean speedup: B200 **1.19x**, H100 **1.13x**, 0 regressions.

diff-train-skip-merge

Test Plan:
```
PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp_cuda \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 30. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 29. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:fxir_backend \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 76. Fail 0.
```

Reviewed By: PaulZhang12

Differential Revision: D94437532

Pull Request resolved: pytorch#175825
Approved by: https://github.com/PaulZhang12
…ype conversions (pytorch#172696)"

This reverts commit 46cd90c.

Reverted pytorch#172696 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#172696 (comment)))
pytorch#176723)"

This reverts commit 26dddb9.

Reverted pytorch#176723 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 492c742.

Reverted pytorch#176015 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
…#175936)"

This reverts commit 388d61e.

Reverted pytorch#175936 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 9b53dac.

Reverted pytorch#175924 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>