[XPU] Add _make_xccl_premul_sum binding for XCCL backend#25
Open
Force-pushed 7ba5e71 to b7e7e59 (compare)
…ditions (pytorch#176522) There is a bug in github webhooks. When fired for `pull_request_review_comment` trigger, the `author_association` is `CONTRIBUTOR` (always?). This messes up the pre-check logic for review workflow, causing it to skip. This PR relaxes the pre-check (it's an optimization anyway), and I also relaxed `bedrock` environment protection rules, allowing PR merge branches. [tested in ciforge](pytorch/ciforge#145 (comment)) Pull Request resolved: pytorch#176522 Approved by: https://github.com/drisspg
See title - we add a skill to take a bug, produce a min repro, then iterate on it until fixed Pull Request resolved: pytorch#176359 Approved by: https://github.com/aorenste
This reverts commit 41e5795. Reverted pytorch#176320 on behalf of https://github.com/malfet due to It works now, but does not upload the log I need ([comment](pytorch#176320 (comment)))
) ## Summary - Fix MPS `masked_scatter` to preserve scalar tensor shape `[]` instead of incorrectly returning `[1]` - The function now records whether `self` was originally 0-dimensional and squeezes the result back after processing ## Test plan - Added scalar tensor test case to `test_masked_scatter` in `test/test_mps.py` - Verified fix with reproducer script Related error: pytorch/rl#3137 (comment) Pull Request resolved: pytorch#174381 Approved by: https://github.com/kurtamohler
Co-authored by Claude Pull Request resolved: pytorch#176320 Approved by: https://github.com/izaitsevfb
…torch#176553) Instead of always using offline mode, we need to have a way to periodically refresh the local cache. This job runs only once per day, so rate limit wouldn't be an issue I observe [a failure in trunk](https://github.com/pytorch/pytorch/actions/runs/22625578355/job/65575251092#step:15:13837) for `pytorch/gemma-3-12b-it-int4` where the local cache doesn't seem to work anymore, not sure why, could be due to the recent transformers version update. The failure goes away when I turn off TRANSFORMERS_OFFLINE to refresh the local cache. Pull Request resolved: pytorch#176553 Approved by: https://github.com/zou3519
…#175497) Pull Request resolved: pytorch#175497 Approved by: https://github.com/jansel, https://github.com/mlazos
…ch#175817) Fix pytorch#175560 Pull Request resolved: pytorch#175817 Approved by: https://github.com/eqy, https://github.com/ngimel
Convert Union[X, Y] to X | Y and Optional[X] to X | None using ruff rules UP007 and UP045 since torch is 3.10+ Note: we skip testing here - that is the last directory remaining Pull Request resolved: pytorch#176458 Approved by: https://github.com/aorenste
…ch#168894) This PR ensures that ops on the same stream as an event are not reordered around it by using the control-deps HOP. This HOP in essence creates passthrough aliases of the ops before and after the event to ensure the ordering does not change. Stream assignments, by contrast, are already annotated on the node, so those can be reordered. Pull Request resolved: pytorch#168894 Approved by: https://github.com/aorenste
This PR adds support for Symbolic events in the TorchInductor scheduler. In short, we create an EventFactory which provides event indices which monotonically increase as compile proceeds and enables reuse of events that are no longer used. It also adds codegen support for events. Pull Request resolved: pytorch#165390 Approved by: https://github.com/eellison
This PR adds utility functions for managing streams and a stream pool. Pull Request resolved: pytorch#165504 Approved by: https://github.com/eellison ghstack dependencies: pytorch#165390
…h#165391) This PR implements stream codegen to the SubgraphWrapper Codegen. It supports codegen for enter/exiting stream contexts and calling record_stream on returned tensors. Pull Request resolved: pytorch#165391 Approved by: https://github.com/eellison ghstack dependencies: pytorch#165390, pytorch#165504
…)" This reverts commit f1da356. Reverted pytorch#176458 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#176458 (comment)))
…orch#176515) create_name stores its counter under base_count[base] (e.g. "sin") but was looking it up under base_count[candidate] (e.g. "sin_1"). When the candidate already has a numeric suffix the lookup always misses, so the while-loop that probes for a free name starts from the regex-extracted number instead of the stored counter — degrading to O(existing_names) per call. This makes repeated node_copy of the same subgraph (as done by inline_invoke_subgraph) quadratic: for an 80-layer model the pass went from ~100ms to ~4.6s. Fix: look up base_count[base] to match the store. Authored with Claude. Pull Request resolved: pytorch#176515 Approved by: https://github.com/jansel, https://github.com/zou3519
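The counter logic can be sketched in isolation (a hypothetical minimal namespace, not PyTorch's actual implementation): the key point is that the counter is both stored and looked up under the suffix-stripped base, so probing resumes in O(1) instead of rescanning existing names.

```python
import re

class Namespace:
    """Minimal create_name-style allocator (illustrative only)."""

    def __init__(self):
        self.used = set()
        self.base_count = {}

    def create_name(self, candidate: str) -> str:
        # Strip a numeric suffix so "sin_1" and "sin" share one counter.
        base = re.sub(r"_\d+$", "", candidate)
        if candidate not in self.used:
            self.used.add(candidate)
            return candidate
        # The fix: probe from the counter stored under the *base*
        # ("sin"), not under the full candidate ("sin_1").
        i = self.base_count.get(base, 0) + 1
        while f"{base}_{i}" in self.used:
            i += 1
        self.base_count[base] = i
        name = f"{base}_{i}"
        self.used.add(name)
        return name
```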
…torch#176452) is_from_source(source, target) previously only compared at the root of the ChainedSource hierarchy. This meant that is_from_source(AttrSource(X, 'c'), X) returned False when X was itself a ChainedSource (e.g. UnspecializedNNModuleSource(GlobalSource(...))), because the function walked past X all the way to the root before comparing. Check source == target at each level before recursing so that intermediate sources are correctly recognized as ancestors. Authored with Claude. Pull Request resolved: pytorch#176452 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#176515
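The fixed walk can be sketched with stand-in classes (hypothetical names, not Dynamo's real Source hierarchy): equality is checked at every level on the way to the root, so an intermediate chained source is recognized as an ancestor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalSource:
    name: str

@dataclass(frozen=True)
class AttrSource:  # a chained source: wraps a base source
    base: object
    attr: str

def is_from_source(source, target):
    """Sketch of the fix: compare at *each* level before recursing,
    instead of only comparing at the root."""
    if source == target:
        return True
    if isinstance(source, AttrSource):
        return is_from_source(source.base, target)
    return False
```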
…ytorch#175946)

```
[lsakka@devgpu009.cco5 ~/pytorch10/pytorch (13694ac)]$ python benchmarks/dynamo/huggingface.py --backend inductor --performance --inference --compare-backed-unbacked
--- AlbertForMaskedLM ---       backed... 1.349x  unbacked... 1.345x  => diff: -0.3%
--- AllenaiLongformerBase ---   backed... 1.226x  unbacked... 1.146x  => diff: -6.5%
--- BartForCausalLM ---         backed... 1.022x  unbacked... 1.015x  => diff: -0.7%
--- BertForMaskedLM ---         backed... 1.186x  unbacked... 1.187x  => diff: +0.1%
--- BlenderbotForCausalLM ---   backed... 1.058x  unbacked... 1.061x  => diff: +0.3%
...
```

Pull Request resolved: pytorch#175946 Approved by: https://github.com/jansel
…on to avoid DDE (pytorch#175956) encountered while running python benchmarks/dynamo/huggingface.py --only AllenaiLongformerBase --backend inductor --performance --inference --unbacked-batch-only Pull Request resolved: pytorch#175956 Approved by: https://github.com/ColinPeppler, https://github.com/jansel ghstack dependencies: pytorch#175946
…#175906) Pull Request resolved: pytorch#175906 Approved by: https://github.com/malfet
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#176242 Approved by: https://github.com/pytorchbot
…are prepended (pytorch#175904)

## Summary
When effectful ops (e.g., `with_effects`) are present, `handle_effect_tokens_fn()` prepends effect token placeholders to the input args. However, `static_input_indices` in `ViewAndMutationMeta` is computed before this prepending and is not adjusted afterwards. This causes indices to point to the wrong inputs, leading to issues like unnecessary CUDA graph re-recording.

## Problem
In `handle_effect_tokens_fn()`, effect tokens are prepended to args:

```python
additional_fwd_token_inputs = [torch.tensor([])] * num_tokens
args = [*additional_fwd_token_inputs, *args]  # tokens prepended at index 0
```

But `meta.static_input_indices` is not offset by `num_tokens`. When these indices are later used (e.g., by CUDAGraph's `check_invariants`), they point to the wrong inputs:

- Before tokens: `args=[activation, weight], static_input_indices=[1]` → weight ✓
- After tokens: `args=[token, activation, weight], static_input_indices=[1]` → activation ✗
- Expected: `static_input_indices=[2]` (offset by `num_tokens=1`) → weight ✓

## Impact
- Activations get incorrectly marked as static inputs
- CUDAGraph's `check_invariants` sees data pointer changes for "static" inputs
- This triggers unnecessary re-recording, causing performance degradation

## Fix
Offset `static_input_indices` by `num_tokens` after prepending effect tokens in the forward-only (`trace_joint=False`) path:

```python
if num_tokens > 0:
    meta.static_input_indices = [
        idx + num_tokens for idx in meta.static_input_indices
    ]
```

## Unit Test
Added `test_static_input_indices_with_effect_tokens` in `test/functorch/test_aotdispatch.py` which:
1. Registers a custom effectful op via `_register_effectful_op`
2. Compiles a function with `torch.compile` using a metadata-capturing backend
3. Verifies that all `static_input_indices` are `>= num_tokens` after effect tokens are prepended (i.e., no index incorrectly points to a token input)

cc @yanboliang Pull Request resolved: pytorch#175904 Approved by: https://github.com/angelayi
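The index bookkeeping of the fix can be illustrated standalone (a hypothetical helper, not the AOTAutograd code):

```python
def prepend_tokens(args, static_input_indices, num_tokens):
    """Prepend effect-token placeholders and shift the recorded
    static-input indices by the same amount so they keep pointing
    at the same inputs."""
    tokens = [object() for _ in range(num_tokens)]
    new_args = [*tokens, *args]
    new_indices = [i + num_tokens for i in static_input_indices]
    return new_args, new_indices

args = ["activation", "weight"]
new_args, new_idx = prepend_tokens(args, [1], num_tokens=1)
assert new_idx == [2]
assert new_args[new_idx[0]] == "weight"  # still points at the weight
```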
Trying to reland after original PR's revert pytorch#174338 Pull Request resolved: pytorch#176424 Approved by: https://github.com/Skylion007, https://github.com/huydhn
This is a cleanup for 4 functions: `upsample_bilinear2d_aa_kernel_impl`, `upsample_bilinear2d_kernel_impl`, `upsample_bicubic2d_aa_kernel_impl`, and `upsample_bicubic2d_kernel_impl`. They all follow the same dispatch logic but were previously implementing that logic in slightly different ways, and had duplicated calls (to e.g. `separable_upsample_generic_Nd_kernel_impl`).
This PR unifies and simplifies the logic which now looks like this for all 4 functions:
```py
if dtype == uint8:
    if AVX2 and ...:
        return avx2_path(...)
    elif aarch64 and ...:
        return neon_path(...)
    else:
        return generic_path(...)
```
Pull Request resolved: pytorch#176422
Approved by: https://github.com/Skylion007
# Motivation
This PR fixes the nested/reentrant scenario when `torch.Stream` is used as a context manager. `torch.cuda.stream` and `torch.xpu.stream` already support these usages.
The following scenarios are fixed by this PR:
```python
import torch
s0 = torch.Stream()
with s0, s0:
    pass
```
```python
import torch
s0 = torch.Stream()
s1 = torch.Stream()
with s0, s1:
    with s0, s1:
        pass
```
# Additional Context
Fix pytorch#176560
Pull Request resolved: pytorch#176568
Approved by: https://github.com/albanD
Pull Request resolved: pytorch#176082 Approved by: https://github.com/Lucaskabela
…ytorch#175497)" This reverts commit 9e34c7a. Reverted pytorch#175497 on behalf of https://github.com/malfet due to Broke trunk testing, see https://hud.pytorch.org/hud/pytorch/pytorch/1da0362298a56e82a5d3429fa482c48bb0144fa9/1?per_page=50&name_filter=trunk%20%2F%20linux-jammy-cuda&mergeEphemeralLF=true ([comment](pytorch#175497 (comment)))
) As mentioned above. No other changes. Pull Request resolved: pytorch#176437 Approved by: https://github.com/Skylion007
) Summary: Our test is failing with a `UnicodeDecodeError` during Triton template loading. The cause was a non-ASCII em-dash character (`–`, U+2013) in a comment on line 2 of `triton_depthwise_conv.py.jinja`. When the Triton template engine reads the file, it uses ASCII decoding, which cannot handle multi-byte UTF-8 characters. The fix replaces the em-dash with a standard ASCII hyphen (`-`). Test Plan: Ran cogwheel test Reviewed By: chevalierNoir, kqfu Differential Revision: D95211429 Pull Request resolved: pytorch#176484 Approved by: https://github.com/kqfu, https://github.com/Skylion007
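The failure mode is easy to reproduce outside Triton (a minimal sketch): a UTF-8 em-dash is a multi-byte sequence that an ASCII decoder rejects, while the replacement hyphen decodes fine.

```python
# U+2013 (en/em-style dash) encodes to multiple bytes in UTF-8.
data = "a comment with an em-dash \u2013 here".encode("utf-8")
try:
    data.decode("ascii")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised  # ASCII decoding cannot handle the multi-byte character

# The fix was the textual equivalent of replacing the dash with '-':
fixed = data.replace("\u2013".encode("utf-8"), b"-")
assert fixed.decode("ascii") == "a comment with an em-dash - here"
```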
…ersions (pytorch#172696) Fixes pytorch#172684 Updated to use single_dim_strategy. Type conversion to int/bool on Partial(sum) incorrectly preserved the Partial placement, producing wrong results. trunc(a+b) != trunc(a) + trunc(b). This adds a custom strategy for _to_copy that checks if the dtype conversion is linear for the reduce operation before preserving Partial. This PR is offered in support of the Partial correctness stabilization efforts. Pull Request resolved: pytorch#172696 Approved by: https://github.com/wconstab
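The nonlinearity claim is checkable with a two-shard example: with Partial(sum), each shard holds a partial value, and converting each shard before the reduction truncates too early.

```python
import math

# Two shards of a Partial(sum) tensor; the true value is a + b.
a, b = 0.6, 0.6
correct = math.trunc(a + b)            # reduce, then convert: trunc(1.2) = 1
wrong = math.trunc(a) + math.trunc(b)  # convert, then reduce: 0 + 0 = 0
assert correct == 1 and wrong == 0
assert correct != wrong  # trunc(a + b) != trunc(a) + trunc(b)
```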
This fixes MPS SDPA output shape for cases where `value.size(-1) != query.size(-1)`, so output now follows `(..., L, Ev)` as expected. I also added guards in Metal kernel paths that assume equal qkv head dims. Added the updated meta shape inference for the `sdpa_general_mps` path which seems to have been left out initially. Added regression coverage in `test/test_transformers.py` covering the shape semantics, and a similar one in `test/test_mps.py` that also checks for numerical parity with CPU. Fixes pytorch#176767 Pull Request resolved: pytorch#176843 Approved by: https://github.com/malfet
…twise ops (pytorch#175795)" This reverts commit 7cafe7f. Reverted pytorch#175795 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175795 (comment)))
Previously, `run_pre_grad_passes` was called unconditionally at the top of `_compile_fx_main`. This meant pre-grad transformations were not included in cached artifacts and ran unnecessarily on cache hits. Move pre-grad passes into `aot_module_simplified` (Path B) via a callback so they run after the cache lookup — on cache miss only. `_compile_fx_main` has two compilation paths that diverge at the `V.aot_compilation` check: Path A uses `aot_export_module` (AOTInductor, no cache) and Path B uses `aot_autograd` → `aot_module_simplified` (with `AOTAutogradCache`). Since Path A has no cache, run pre-grad passes explicitly before `aot_export_module`. Pull Request resolved: pytorch#176340 Approved by: https://github.com/aorenste
Summary: See title Test Plan: ``` buck test fbcode//mode/opt fbcode//caffe2/test/distributed/elastic/agent/server/test:local_agent_test -- --run-disabled ``` Differential Revision: D95802783 Pull Request resolved: pytorch#176887 Approved by: https://github.com/Skylion007
…ser error graph breaks (pytorch#176649) Fixes pytorch#166555 Pull Request resolved: pytorch#176649 Approved by: https://github.com/Lucaskabela
…orch#176817)

Summary: The `CUDACachingAllocator` (a `DeviceAllocator`) and Caffe2's legacy `DefaultCUDAAllocator` (a plain `Allocator`) both registered for `DeviceType::CUDA` at priority 0. Since `SetAllocator` uses `>=` comparison, whichever static initializer ran last would win. When the legacy allocator won the race, `dynamic_cast<DeviceAllocator*>` in `getDeviceAllocator()` would fail, crashing `torch.accelerator.empty_cache()` and other `torch.accelerator` APIs. To be clear, this is not an issue in pure OSS PyTorch, where the Caffe2 legacy CUDA allocator does not exist.

Fix by bumping `CUDACachingAllocator`'s registration priority to 1 so it always takes precedence over the legacy Caffe2 allocator regardless of static initialization order. This SIOF surfaced recently in vLLM after some code was generalized to use `torch.accelerator.empty_cache()` instead of `torch.cuda.empty_cache()` in vllm-project/vllm#30681.

Test Plan:
```
buck test fbcode//mode/opt fbcode//vllm/omni:test_kernels_rotary_embedding -- --exact 'fbcode//vllm/omni:test_kernels_rotary_embedding - test_rotary_embedding.py::test_rotary_embedding_opcheck[False-False-1024-108-32-True-11-cuda]'
```

Previously: 1 passed, 1 error (`RuntimeError` during teardown). Now: 2 passed, 0 errors.

Errors/stack traces like the following are resolved after this change:
```
def empty_cache() -> None:
    r"""Release all unoccupied cached memory currently held by the caching
    allocator so that those can be used in other application.

    .. note:: This function is a no-op if the memory allocator for the current
       :ref:`accelerator <accelerators>` has not been initialized.
    """
>   if not torch._C._accelerator_isAllocatorInitialized():
E   RuntimeError: device_allocator INTERNAL ASSERT FAILED at "fbcode/caffe2/c10/core/CachingDeviceAllocator.h":253, please report a bug to PyTorch. Allocator for cuda is not a DeviceAllocator.
```

Differential Revision: D95703075 Pull Request resolved: pytorch#176817 Approved by: https://github.com/albanD
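The race can be modeled with a toy registry (all names hypothetical) to show why a `>=` tie-break makes the winner depend on initialization order, and why bumping one priority makes the outcome deterministic:

```python
registry = {}

def set_allocator(device, allocator, priority=0):
    """Toy model of a SetAllocator-style registration with >= tie-break."""
    current_priority = registry.get(device, (None, -1))[1]
    if priority >= current_priority:  # equal priorities: last writer wins
        registry[device] = (allocator, priority)

# Equal priorities: outcome depends on which registration runs last.
set_allocator("cuda", "CachingAllocator", priority=0)
set_allocator("cuda", "LegacyAllocator", priority=0)
assert registry["cuda"][0] == "LegacyAllocator"

# The fix: a higher priority wins regardless of registration order.
registry.clear()
set_allocator("cuda", "CachingAllocator", priority=1)
set_allocator("cuda", "LegacyAllocator", priority=0)
assert registry["cuda"][0] == "CachingAllocator"
```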
Add _dijkstra_expand_single_dim_strategy_to_mesh, a priority-queue search over input placement states that finds the lowest-cost sharding for an op without enumerating all S^N strategy combinations (S = single-dim strategies, N = mesh dimensions). The search uses _PreparedSingleDimStrategy (from the previous commit) to materialize single-dim rules and try_propagate() to test whether a candidate state matches on every mesh dimension. Each search state is a tuple of per-input placement tuples. Neighbors are generated by changing one placement on one mesh dimension for one input, using _get_neighbor_placements which encodes DTensor redistribute transition rules (Replicate <-> Shard, Partial -> Replicate/Shard). Cost computation uses _compute_redistribute_cost which calls _compute_placement_transition_cost directly per mesh dimension, avoiding DTensorSpec construction and _gen_transform_infos planning overhead. Returns None for _StridedShard inputs, signaling the caller to fall back to full expansion. Authored with Claude. Pull Request resolved: pytorch#169438 Approved by: https://github.com/anshul-si, https://github.com/zpcore Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
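The search structure described follows the standard Dijkstra pattern over an implicit state graph; a generic sketch (none of these names are from the PR — states are opaque hashables, `neighbors(s)` yields `(cost, next_state)` pairs):

```python
import heapq

def dijkstra(start, neighbors, is_goal):
    """Lowest-cost search without enumerating all states up front.
    The first goal state popped from the heap is reached at minimum
    total cost."""
    heap = [(0, start)]
    best = {start: 0}
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > best.get(state, float("inf")):
            continue  # stale heap entry
        if is_goal(state):
            return cost, state
        for step_cost, nxt in neighbors(state):
            new_cost = cost + step_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return None

# Toy transition graph: reach "S" from "P" at minimum cost.
edges = {"P": [(1, "R"), (5, "S")], "R": [(1, "S")], "S": []}
assert dijkstra("P", lambda s: edges[s], lambda s: s == "S") == (2, "S")
```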
Targeted filepaths that seem most likely to disrupt torchtitan, can revisit the specific paths over time. Fixes pytorch/torchtitan#2350 Pull Request resolved: pytorch#176774 Approved by: https://github.com/tianyu-l ghstack dependencies: pytorch#175901
First of all, fewer systems now have `wget` installed by default, while almost all Linux/macOS systems come with `curl`.
Moreover, if a script with that name already exists, `wget` downloads it under a `.${NUM}`-suffixed alias, which results in the reporter posting results from something else; for example, see pytorch#176829:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
python collect_env.py
--2026-03-08 23:35:37-- https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31107 (30K) [text/plain]
Saving to: ‘collect_env.py.1’
collect_env.py.1 100%[==========================================>] 30.38K --.-KB/s in 0.04s
2026-03-08 23:35:38 (707 KB/s) - ‘collect_env.py.1’ saved [31107/31107]
Traceback (most recent call last):
File "/mnt/d/my/work/study/ai/kaggle_code/aimo2/collect_env.py", line 15, in <module>
```
Pull Request resolved: pytorch#176904
Approved by: https://github.com/msaroufim, https://github.com/seemethere
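The collision behavior in the transcript above can be simulated standalone (`wget_style_name` is a made-up helper mimicking `wget`'s default clash handling): the stale copy keeps the plain name, so `python collect_env.py` runs the wrong file, while `curl -o collect_env.py` would simply overwrite in place.

```python
import os
import tempfile

def wget_style_name(directory, filename):
    """Simulate wget's default clash handling: an existing file is kept
    and the new download is saved under a `.N` suffix."""
    path = os.path.join(directory, filename)
    n = 0
    while os.path.exists(path):
        n += 1
        path = os.path.join(directory, f"{filename}.{n}")
    return path

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "collect_env.py"), "w").close()  # stale copy
    assert wget_style_name(d, "collect_env.py").endswith("collect_env.py.1")
```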
…ame (pytorch#176515)" (pytorch#176948) ## Summary Reverts pytorch#176515. This is a prerequisite for reverting the full `[fx] Move _Namespace to C++` series (pytorch#170962), which was reverted internally due to S627920 but the revert was never exported to GitHub. The quadratic fix patches `torch/csrc/fx/graph.cpp` which was introduced by pytorch#170962. This revert must land first so that pytorch#170962 can be cleanly reverted afterwards. ## Test plan CI — this revert removes a bugfix from C++ code that will itself be reverted in a follow-up PR. Pull Request resolved: pytorch#176948 Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
This reverts commit 5c68844. Reverted pytorch#170962 on behalf of https://github.com/wdvr due to reverted in fbcode ([comment](pytorch#170962 (comment)))
) Summary: When K == 1, matrix multiplication (M, 1) @ (1, N) is an outer product. Instead of launching a full GEMM kernel, we decompose it into a broadcasted pointwise multiply at the ATen decomposition level, which is more efficient for this memory-bound case.

This is a reland of D94097622 with two fixes:
- Skip decomposition when M==1 or N==1 to avoid output strides from the broadcast multiply not matching mm strides.
- Remove `as_strided` stride fixup that was causing issues with Helion (SympifyError on symbolic shapes).

The M==1/N==1 guard also applies to the existing CPU K==1 decomposition path.

**aten.mm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.64 | 1.13x |
| (150, 150, 1) | 12.3 | 11.2 | 1.10x | 9.82 | 8.70 | 1.13x |
| (200, 200, 1) | 12.3 | 11.3 | 1.09x | 9.95 | 8.80 | 1.13x |
| (256, 256, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.70 | 1.12x |
| (512, 512, 1) | 12.3 | 11.2 | 1.10x | 9.92 | 8.80 | 1.13x |
| (1024, 1024, 1) | 14.3 | 13.2 | 1.09x | 11.39 | 9.44 | 1.21x |
| (2048, 2048, 1) | 20.5 | 15.3 | **1.34x** | 16.19 | 12.83 | **1.26x** |
| (4096, 4096, 1) | 35.8 | 26.8 | **1.33x** | 37.98 | 29.12 | **1.30x** |
| (8192, 8192, 1) | 96.3 | 68.6 | **1.40x** | 120.48 | 89.12 | **1.35x** |
| (16384, 16384, 1) | 329.8 | 234.5 | **1.41x** | 387.42 | 249.54 | **1.55x** |
| (4608, 20, 1) | 13.2 | 11.3 | 1.17x | 10.02 | 8.86 | 1.13x |
| (4608, 32, 1) | 13.2 | 11.3 | 1.17x | 9.95 | 8.86 | 1.12x |
| (4608, 128, 1) | 13.2 | 11.4 | 1.17x | 10.94 | 8.99 | 1.22x |
| (4608, 256, 1) | 14.3 | 13.2 | 1.09x | 12.22 | 9.50 | **1.29x** |
| (4608, 512, 1) | 17.4 | 13.3 | **1.31x** | 14.02 | 10.59 | **1.32x** |
| (4608, 1024, 1) | 20.5 | 15.3 | **1.34x** | 17.06 | 13.18 | **1.29x** |
| (1024, 4096, 1) | 20.5 | 15.3 | **1.34x** | 16.80 | 13.25 | **1.27x** |
| (4096, 1024, 1) | 20.5 | 15.3 | **1.34x** | 16.22 | 12.51 | **1.30x** |

Geomean speedup: B200 **1.21x**, H100 **1.22x**, 0 regressions.

**aten.addmm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 12.3 | 1.00x | 9.76 | 9.06 | 1.08x |
| (150, 150, 1) | 12.4 | 12.3 | 1.01x | 10.08 | 9.18 | 1.10x |
| (200, 200, 1) | 12.4 | 12.3 | 1.00x | 9.98 | 9.31 | 1.07x |
| (256, 256, 1) | 12.3 | 12.3 | 1.00x | 9.86 | 9.38 | 1.05x |
| (512, 512, 1) | 13.3 | 13.2 | 1.01x | 10.37 | 9.73 | 1.07x |
| (1024, 1024, 1) | 15.3 | 13.3 | 1.15x | 12.32 | 11.20 | 1.10x |
| (2048, 2048, 1) | 23.6 | 18.5 | **1.27x** | 19.01 | 16.19 | **1.17x** |
| (4096, 4096, 1) | 56.3 | 33.8 | **1.66x** | 58.72 | 45.60 | **1.29x** |
| (8192, 8192, 1) | 172.2 | 102.3 | **1.68x** | 166.75 | 148.45 | 1.12x |
| (16384, 16384, 1) | 665.8 | 359.5 | **1.85x** | 638.66 | 503.23 | **1.27x** |
| (4608, 20, 1) | 13.2 | 12.3 | 1.07x | 10.21 | 9.47 | 1.08x |
| (4608, 32, 1) | 13.2 | 12.4 | 1.06x | 10.11 | 9.47 | 1.07x |
| (4608, 128, 1) | 13.3 | 13.2 | 1.00x | 11.68 | 10.27 | 1.14x |
| (4608, 256, 1) | 15.3 | 13.4 | 1.14x | 13.28 | 11.55 | 1.15x |
| (4608, 512, 1) | 18.6 | 15.4 | 1.20x | 15.87 | 13.63 | 1.16x |
| (4608, 1024, 1) | 25.5 | 19.4 | **1.31x** | 21.02 | 17.92 | **1.17x** |
| (1024, 4096, 1) | 23.5 | 18.5 | **1.27x** | 18.94 | 16.38 | 1.16x |
| (4096, 1024, 1) | 23.5 | 18.5 | **1.27x** | 18.98 | 16.29 | 1.17x |

Geomean speedup: B200 **1.19x**, H100 **1.13x**, 0 regressions.

diff-train-skip-merge

Test Plan:
```
PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp_cuda \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 30. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 29. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:fxir_backend \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 76. Fail 0.
```

Reviewed By: PaulZhang12 Differential Revision: D94437532 Pull Request resolved: pytorch#175825 Approved by: https://github.com/PaulZhang12
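The identity behind the decomposition can be sanity-checked in plain Python: for K == 1, the matmul (M, 1) @ (1, N) equals a broadcasted elementwise multiply of the column and row (an illustrative sketch, not the ATen decomposition itself):

```python
def mm_k1(col, row):
    """(M, 1) @ (1, N) without a GEMM: out[i][j] = col[i] * row[j]."""
    return [[a * b for b in row] for a in col]

def mm(A, B):
    """Reference matmul for comparison."""
    K = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(len(B[0]))] for i in range(len(A))]

col, row = [1.0, 2.0, 3.0], [10.0, 20.0]
A = [[c] for c in col]  # shape (3, 1)
B = [row]               # shape (1, 2)
assert mm_k1(col, row) == mm(A, B)  # outer product == broadcast multiply
```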
…ype conversions (pytorch#172696)" This reverts commit 46cd90c. Reverted pytorch#172696 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#172696 (comment)))
pytorch#176723)" This reverts commit 26dddb9. Reverted pytorch#176723 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 492c742. Reverted pytorch#176015 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
…#175936)" This reverts commit 388d61e. Reverted pytorch#175936 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 9b53dac. Reverted pytorch#175924 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Force-pushed b7e7e59 to b5018b0 (compare)
This PR adds a Python binding for `_make_xccl_premul_sum` to support the `PREMUL_SUM` reduce operation in the XCCL backend (Intel GPU).

The `_make_xccl_premul_sum` binding directly reuses `makeNCCLPreMulSum` because XCCL/NCCL have identical memory layouts.

Currently, the XCCL backend supports PREMUL_SUM for:
Not yet supported:
Implemented code in intel/torch-xpu-ops#1948, and base test case in intel/torch-xpu-ops#2690.
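For readers unfamiliar with the operation, the semantics of a premul-sum reduce can be sketched in plain Python (illustrative only; `premul_sum` here is a made-up helper, not the binding): each rank's contribution is scaled by a factor before the allreduce sum.

```python
def premul_sum(rank_tensors, factor):
    """result[i] = sum over ranks of factor * x_rank[i]."""
    n = len(rank_tensors[0])
    out = [0.0] * n
    for tensor in rank_tensors:
        for i, v in enumerate(tensor):
            out[i] += factor * v
    return out

ranks = [[1.0, 2.0], [3.0, 4.0]]       # two ranks' local tensors
assert premul_sum(ranks, 0.5) == [2.0, 3.0]  # 0.5 * ([1,2] + [3,4])
```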