[XPU] Add _make_xccl_premul_sum binding for XCCL backend#25

Open
Chao1Han wants to merge 2689 commits into master from
xccl-premul

Conversation

@Chao1Han Chao1Han commented Jan 9, 2026

This PR adds a Python binding for `_make_xccl_premul_sum` to support the PREMUL_SUM reduce operation in the XCCL backend (Intel GPU).
The `_make_xccl_premul_sum` binding directly reuses `makeNCCLPreMulSum` because XCCL and NCCL have identical memory layouts.
Currently, the XCCL backend supports PREMUL_SUM for:

  • ✅ allreduce
  • ✅ reduce_scatter

Not yet supported:

  • ❌ reduce (to be added in future work)

The implementation lives in intel/torch-xpu-ops#1948, and a base test case in intel/torch-xpu-ops#2690.
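To illustrate the semantics this binding exposes, here is a minimal pure-Python sketch (not the actual `torch.distributed` API, and no real collective is launched): an allreduce with PREMUL_SUM scales every rank's contribution by a factor before summing.

```python
# Hypothetical sketch of the PREMUL_SUM semantics: allreduce(PREMUL_SUM)
# computes sum_i (factor * x_i) across ranks. Here "ranks" are just lists.
def premul_sum_allreduce(rank_tensors, factor):
    """Simulate allreduce with PREMUL_SUM over a list of per-rank values."""
    scaled = [[factor * v for v in t] for t in rank_tensors]
    # Element-wise sum across ranks, like a regular allreduce.
    return [sum(col) for col in zip(*scaled)]

# Two "ranks", factor 0.5: result[j] = 0.5 * (x0[j] + x1[j])
result = premul_sum_allreduce([[2.0, 4.0], [6.0, 8.0]], 0.5)
```

In the real backend the factor is carried by the reduce-op object created by the binding; the arithmetic shown here is what the collective computes.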

@Chao1Han Chao1Han changed the title Add xccl premul sum register [XPU] Add _make_xccl_premul_sum binding for XCCL backend Jan 9, 2026
izaitsevfb and others added 28 commits March 5, 2026 01:35
…ditions (pytorch#176522)

There is a bug in github webhooks.

When fired for `pull_request_review_comment` trigger, the `author_association` is `CONTRIBUTOR` (always?).
This messes up the pre-check logic for review workflow, causing it to skip.

This PR relaxes the pre-check (it's an optimization anyway), and I also relaxed `bedrock` environment protection rules, allowing PR merge branches.

[tested in ciforge](pytorch/ciforge#145 (comment))
Pull Request resolved: pytorch#176522
Approved by: https://github.com/drisspg
See title - we add a skill to take a bug, produce a min repro, then iterate on it until fixed
Pull Request resolved: pytorch#176359
Approved by: https://github.com/aorenste
This reverts commit 41e5795.

Reverted pytorch#176320 on behalf of https://github.com/malfet due to It works now, but does not upload the log I need ([comment](pytorch#176320 (comment)))
)

## Summary
- Fix MPS `masked_scatter` to preserve scalar tensor shape `[]` instead of incorrectly returning `[1]`
- The function now records whether `self` was originally 0-dimensional and squeezes the result back after processing

## Test plan
- Added scalar tensor test case to `test_masked_scatter` in `test/test_mps.py`
- Verified fix with reproducer script

Related error: pytorch/rl#3137 (comment)
Pull Request resolved: pytorch#174381
Approved by: https://github.com/kurtamohler
…torch#176553)

Instead of always using offline mode, we need a way to periodically refresh the local cache. This job runs only once per day, so rate limits shouldn't be an issue.

I observe [a failure in trunk](https://github.com/pytorch/pytorch/actions/runs/22625578355/job/65575251092#step:15:13837) for `pytorch/gemma-3-12b-it-int4` where the local cache no longer seems to work; I'm not sure why, but it could be due to the recent transformers version update. The failure goes away when I turn off TRANSFORMERS_OFFLINE to refresh the local cache.

Pull Request resolved: pytorch#176553
Approved by: https://github.com/zou3519
Convert Union[X, Y] to X | Y and Optional[X] to X | None using ruff rules UP007 and UP045 since torch is 3.10+

Note: we skip testing here - that is the last directory remaining

Pull Request resolved: pytorch#176458
Approved by: https://github.com/aorenste
…ch#168894)

This PR ensures that ops on the same stream as events are not reordered around them by using the control deps HOP. This HOP in essence creates passthrough aliases of the ops before and after the event to ensure ordering does not change. Stream assignments are already annotated on the nodes, so those ops can still be reordered.

Pull Request resolved: pytorch#168894
Approved by: https://github.com/aorenste
This PR adds support for Symbolic events in the TorchInductor scheduler. In short, we create an EventFactory which provides event indices which monotonically increase as compile proceeds and enables reuse of events that are no longer used. It also adds codegen support for events.

Pull Request resolved: pytorch#165390
Approved by: https://github.com/eellison
This PR adds utility functions for managing streams and a stream pool.

Pull Request resolved: pytorch#165504
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#165390
…h#165391)

This PR implements stream codegen to the SubgraphWrapper Codegen. It supports codegen for enter/exiting stream contexts and calling record_stream on returned tensors.

Pull Request resolved: pytorch#165391
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#165390, pytorch#165504
…)"

This reverts commit f1da356.

Reverted pytorch#176458 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#176458 (comment)))
…orch#176515)

create_name stores its counter under base_count[base] (e.g. "sin") but
was looking it up under base_count[candidate] (e.g. "sin_1").  When the
candidate already has a numeric suffix the lookup always misses, so the
while-loop that probes for a free name starts from the regex-extracted
number instead of the stored counter — degrading to O(existing_names)
per call.

This makes repeated node_copy of the same subgraph (as done by
inline_invoke_subgraph) quadratic: for an 80-layer model the pass went
from ~100ms to ~4.6s.

Fix: look up base_count[base] to match the store.
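A minimal sketch (not the actual torch.fx implementation) of the fixed lookup: the free-suffix counter is stored under `base_count[base]`, so the read must also key on `base`; keying on the full candidate (e.g. `"sin_1"`) always missed, restarting the probe loop from the regex-extracted number.

```python
import re

def create_name(candidate, used, base_count):
    """Return a unique name for candidate, probing from a per-base counter."""
    base = re.sub(r"_\d+$", "", candidate)  # "sin_1" -> base "sin"
    if candidate not in used:
        used.add(candidate)
        return candidate
    num = base_count.get(base, 1)           # fixed: base_count[base], not [candidate]
    while f"{base}_{num}" in used:          # amortized O(1) thanks to the counter
        num += 1
    base_count[base] = num + 1              # remember where the next probe starts
    name = f"{base}_{num}"
    used.add(name)
    return name

used, counts = set(), {}
names = [create_name("sin", used, counts) for _ in range(3)]
```

With the buggy `base_count[candidate]` lookup, every call for an already-suffixed candidate would re-scan from its own suffix, which is what made repeated `node_copy` quadratic.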

Authored with Claude.
Pull Request resolved: pytorch#176515
Approved by: https://github.com/jansel, https://github.com/zou3519
…torch#176452)

is_from_source(source, target) previously only compared at the root of
the ChainedSource hierarchy. This meant that
is_from_source(AttrSource(X, 'c'), X) returned False when X was itself
a ChainedSource (e.g. UnspecializedNNModuleSource(GlobalSource(...))),
because the function walked past X all the way to the root before
comparing.

Check source == target at each level before recursing so that
intermediate sources are correctly recognized as ancestors.
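A minimal sketch of the fix with hypothetical classes (dynamo's actual `Source` types are richer): compare `source == target` at every level of the chain before recursing, rather than only at the root.

```python
class Source:
    """Toy stand-in for a (possibly chained) dynamo source."""
    def __init__(self, name, base=None):
        self.name, self.base = name, base
    def __eq__(self, other):
        return isinstance(other, Source) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

def is_from_source(source, target):
    if source == target:           # fixed: check at each level, not just the root
        return True
    if source.base is not None:    # walk one step toward the root
        return is_from_source(source.base, target)
    return False

root = Source("G")             # e.g. a GlobalSource at the root
x = Source("X", base=root)     # a chained source wrapping the root
attr = Source("X.c", base=x)   # e.g. AttrSource(X, 'c')
```

With the old root-only comparison, `is_from_source(attr, x)` walked past `x` to `root` and returned False even though `x` is an ancestor of `attr`.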

Authored with Claude.

Pull Request resolved: pytorch#176452
Approved by: https://github.com/williamwen42
ghstack dependencies: pytorch#176515
…ytorch#175946)

```
$ python benchmarks/dynamo/huggingface.py --backend inductor --performance --inference --compare-backed-unbacked
```

```
--- AlbertForMaskedLM ---
  backed... 1.349x
  unbacked... 1.345x
  => diff: -0.3%

--- AllenaiLongformerBase ---
  backed... 1.226x
  unbacked... 1.146x
  => diff: -6.5%

--- BartForCausalLM ---
  backed... 1.022x
  unbacked... 1.015x
  => diff: -0.7%

--- BertForMaskedLM ---
  backed... 1.186x
  unbacked... 1.187x
  => diff: +0.1%

--- BlenderbotForCausalLM ---
  backed... 1.058x
  unbacked... 1.061x
  => diff: +0.3%
...
```

Pull Request resolved: pytorch#175946
Approved by: https://github.com/jansel
…on to avoid DDE (pytorch#175956)

Encountered while running:
```
python benchmarks/dynamo/huggingface.py --only AllenaiLongformerBase --backend inductor --performance --inference --unbacked-batch-only
```

Pull Request resolved: pytorch#175956
Approved by: https://github.com/ColinPeppler, https://github.com/jansel
ghstack dependencies: pytorch#175946
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#176242
Approved by: https://github.com/pytorchbot
…are prepended (pytorch#175904)

## Summary

When effectful ops (e.g., `with_effects`) are present, `handle_effect_tokens_fn()` prepends effect token placeholders to the input args. However, `static_input_indices` in `ViewAndMutationMeta` is computed before this prepending and is not adjusted afterwards. This causes indices to point to the wrong inputs, leading to issues like unnecessary CUDA graph re-recording.

## Problem

In `handle_effect_tokens_fn()`, effect tokens are prepended to args:

```python
additional_fwd_token_inputs = [torch.tensor([])] * num_tokens
args = [*additional_fwd_token_inputs, *args]  # tokens prepended at index 0
```

But `meta.static_input_indices` is not offset by `num_tokens`. When these indices are later used (e.g., by CUDAGraph's `check_invariants`), they point to the wrong inputs:

Before tokens: `args=[activation, weight], static_input_indices=[1] → weight ✓`
After tokens: `args=[token, activation, weight], static_input_indices=[1] → activation ✗`
Expected: `static_input_indices=[2] (offset by num_tokens=1) → weight ✓`

## Impact

- Activations get incorrectly marked as static inputs
- CUDAGraph's `check_invariants` sees data pointer changes for "static" inputs
- This triggers unnecessary re-recording, causing performance degradation

## Fix

Offset `static_input_indices` by `num_tokens` after prepending effect tokens in the forward-only (`trace_joint=False`) path:

```python
if num_tokens > 0:
    meta.static_input_indices = [
        idx + num_tokens for idx in meta.static_input_indices
    ]
```

## Unit Test

Added `test_static_input_indices_with_effect_tokens` in `test/functorch/test_aotdispatch.py` which:
1. Registers a custom effectful op via `_register_effectful_op`
2. Compiles a function with `torch.compile` using a metadata-capturing backend
3. Verifies that all `static_input_indices` are `>= num_tokens` after effect tokens are prepended (i.e., no index incorrectly points to a token input)

cc @yanboliang

Pull Request resolved: pytorch#175904
Approved by: https://github.com/angelayi
This is a cleanup for 4 functions: `upsample_bilinear2d_aa_kernel_impl`, `upsample_bilinear2d_kernel_impl`, `upsample_bicubic2d_aa_kernel_impl`, and `upsample_bicubic2d_kernel_impl`. They all follow the same dispatch logic but were previously implementing that logic in slightly different ways, and had duplicated calls (to e.g. `separable_upsample_generic_Nd_kernel_impl`).

This PR unifies and simplifies the logic which now looks like this for all 4 functions:

```py
if dtype == uint8:
    if AVX2 and ...:
        return avx2_path(...)
    elif aarch64 and ...:
        return neon_path(...)
else:
    return generic_path(...)
```

Pull Request resolved: pytorch#176422
Approved by: https://github.com/Skylion007
# Motivation
This PR aims to fix `torch.Stream` used as a context manager in nested/reentrant scenarios. `torch.cuda.stream` and `torch.xpu.stream` already support these usages.

The following scenario would be fixed with this PR:
```python
import torch
s0 = torch.Stream()
with s0, s0:
    pass
```
```python
import torch
s0 = torch.Stream()
s1 = torch.Stream()
with s0, s1:
    with s0, s1:
        pass
```

# Additional Context
Fix pytorch#176560

Pull Request resolved: pytorch#176568
Approved by: https://github.com/albanD
)

As mentioned above. No other changes.
Pull Request resolved: pytorch#176437
Approved by: https://github.com/Skylion007
)

Summary:
Our test is failing with a `UnicodeDecodeError` during Triton template loading. The cause was a non-ASCII em-dash character (`–`, U+2013) in a comment on line 2 of `triton_depthwise_conv.py.jinja`. When the Triton template engine reads the file, it uses ASCII decoding, which cannot handle multi-byte UTF-8 characters.

The fix replaces the em-dash with a standard ASCII hyphen (`-`).

Test Plan: Ran cogwheel test

Reviewed By: chevalierNoir, kqfu

Differential Revision: D95211429

Pull Request resolved: pytorch#176484
Approved by: https://github.com/kqfu, https://github.com/Skylion007
stmcgovern and others added 28 commits March 9, 2026 19:07
…ersions (pytorch#172696)

Fixes pytorch#172684
Updated to use single_dim_strategy.
Type conversion to int/bool on a `Partial(sum)` placement incorrectly preserved the Partial placement, producing wrong results, since `trunc(a+b) != trunc(a) + trunc(b)`.
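The nonlinearity is easy to see with a two-shard example: truncation does not distribute over addition, so casting each shard's partial value before summing gives a different answer than summing first.

```python
import math

# Two shards each hold a partial value of 0.6. Casting per-shard and then
# summing (what preserving Partial through an int cast computed) loses the
# fractional carry; summing first and then casting is the correct result.
a, b = 0.6, 0.6
cast_then_sum = math.trunc(a) + math.trunc(b)  # per-shard cast, then reduce
sum_then_cast = math.trunc(a + b)              # reduce, then cast
```

Since the two disagree (0 vs 1 here), the cast cannot be applied under a pending sum, which is why the custom strategy must check linearity before keeping the Partial placement.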

This adds a custom strategy for _to_copy that checks if the dtype conversion is linear for the reduce operation before preserving Partial.

This PR is offered in support of the Partial correctness stabilization efforts.
Pull Request resolved: pytorch#172696
Approved by: https://github.com/wconstab
This fixes MPS SDPA output shape for cases where `value.size(-1) != query.size(-1)`, so output now follows `(..., L, Ev)` as expected. I also added guards in Metal kernel paths that assume equal qkv head dims.

Added the updated meta shape inference for the `sdpa_general_mps` path which seems to have been left out initially.

Added regression coverage in `test/test_transformers.py` covering the shape semantics, and a similar one in `test/test_mps.py` that also checks for numerical parity with CPU.

Fixes pytorch#176767
Pull Request resolved: pytorch#176843
Approved by: https://github.com/malfet
…twise ops (pytorch#175795)"

This reverts commit 7cafe7f.

Reverted pytorch#175795 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175795 (comment)))
Previously, `run_pre_grad_passes` was called unconditionally at the top
of `_compile_fx_main`.  This meant pre-grad transformations were not
included in cached artifacts and ran unnecessarily on cache hits.

Move pre-grad passes into `aot_module_simplified` (Path B) via a
callback so they run after the cache lookup — on cache miss only.

`_compile_fx_main` has two compilation paths that diverge at the
`V.aot_compilation` check: Path A uses `aot_export_module` (AOTInductor,
no cache) and Path B uses `aot_autograd` → `aot_module_simplified` (with
`AOTAutogradCache`).  Since Path A has no cache, run pre-grad passes
explicitly before `aot_export_module`.
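A minimal sketch with hypothetical names (not the actual `_compile_fx_main` code) of the Path B restructuring: the pre-grad passes are handed to the cached compile path as a callback, so they run only on a cache miss and their effect is baked into the cached artifact.

```python
def compile_with_cache(key, compile_fn, pre_grad_cb, cache):
    """Run pre-grad passes via callback on cache miss only."""
    if key in cache:
        return cache[key]              # cache hit: passes already baked in
    transformed = pre_grad_cb(key)     # cache miss: run pre-grad passes now
    artifact = compile_fn(transformed)
    cache[key] = artifact
    return artifact

calls = []
def pre_grad(gm):
    calls.append(gm)                   # record how often passes actually ran
    return gm + ":pregrad"

cache = {}
first = compile_with_cache("gm0", lambda g: g + ":compiled", pre_grad, cache)
second = compile_with_cache("gm0", lambda g: g + ":compiled", pre_grad, cache)
```

The second call hits the cache and skips the passes entirely, which is the behavior the PR moves to; previously `pre_grad` would have run on both calls.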

Pull Request resolved: pytorch#176340
Approved by: https://github.com/aorenste
Summary: See title

Test Plan:
```
buck test fbcode//mode/opt fbcode//caffe2/test/distributed/elastic/agent/server/test:local_agent_test -- --run-disabled
```

Differential Revision: D95802783

Pull Request resolved: pytorch#176887
Approved by: https://github.com/Skylion007
…orch#176817)

Summary:
The `CUDACachingAllocator` (a `DeviceAllocator`) and Caffe2's legacy
`DefaultCUDAAllocator` (a plain `Allocator`) both registered for
`DeviceType::CUDA` at priority 0. Since `SetAllocator` uses `>=` comparison,
whichever static initializer ran last would win. When the legacy
allocator won the race, `dynamic_cast<DeviceAllocator*>` in
`getDeviceAllocator()` would fail, crashing `torch.accelerator.empty_cache()`
and other `torch.accelerator` APIs. To be clear, this is not an issue in
pure OSS PyTorch, where the Caffe2 legacy CUDA allocator does not exist.

Fix by bumping `CUDACachingAllocator`'s registration priority to 1 so it
always takes precedence over the legacy Caffe2 allocator regardless of
static initialization order.
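A minimal sketch (not the actual c10 code) of why the `>=` comparison made registration order matter: at equal priority, whichever allocator registers last wins, so a static-init race decides the winner. Bumping the caching allocator to priority 1 makes the outcome order-independent.

```python
registry = {}

def set_allocator(device, allocator, priority=0):
    """Register allocator for device; ties go to whoever registers last."""
    cur_priority = registry.get(device, (None, -1))[1]
    if priority >= cur_priority:       # the >= is what makes ties order-dependent
        registry[device] = (allocator, priority)

# The race: legacy allocator's static initializer runs last, both at priority 0.
set_allocator("cuda", "CUDACachingAllocator", priority=0)
set_allocator("cuda", "DefaultCUDAAllocator", priority=0)
racy_winner = registry["cuda"][0]      # legacy allocator wins the tie

# The fix: priority 1 beats priority 0 regardless of registration order.
registry.clear()
set_allocator("cuda", "DefaultCUDAAllocator", priority=0)
set_allocator("cuda", "CUDACachingAllocator", priority=1)
fixed_winner = registry["cuda"][0]
```

In the buggy ordering the winner is whichever registrant happens to run last; with the priority bump the caching allocator always prevails.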

This SIOF surfaced recently in vLLM after some code was generalized to use
`torch.accelerator.empty_cache()` instead of `torch.cuda.empty_cache()` in
vllm-project/vllm#30681.

Test Plan:
```
buck test fbcode//mode/opt fbcode//vllm/omni:test_kernels_rotary_embedding -- --exact 'fbcode//vllm/omni:test_kernels_rotary_embedding - test_rotary_embedding.py::test_rotary_embedding_opcheck[False-False-1024-108-32-True-11-cuda]'
```

Previously: 1 passed, 1 error (`RuntimeError` during teardown)
Now: 2 passed, 0 errors

Errors/stack traces like the following are resolved after this change:
```
    def empty_cache() -> None:
        r"""Release all unoccupied cached memory currently held by the caching
        allocator so that those can be used in other application.

        .. note:: This function is a no-op if the memory allocator for the current
            :ref:`accelerator <accelerators>` has not been initialized.
        """
>       if not torch._C._accelerator_isAllocatorInitialized():
E       RuntimeError: device_allocator INTERNAL ASSERT FAILED at "fbcode/caffe2/c10/core/CachingDeviceAllocator.h":253, please report a bug to PyTorch. Allocator for cuda is not a DeviceAllocator.
```

Differential Revision: D95703075

Pull Request resolved: pytorch#176817
Approved by: https://github.com/albanD
Add _dijkstra_expand_single_dim_strategy_to_mesh, a priority-queue search
over input placement states that finds the lowest-cost sharding for an op
without enumerating all S^N strategy combinations (S = single-dim strategies,
N = mesh dimensions). The search uses _PreparedSingleDimStrategy (from the
previous commit) to materialize single-dim rules and try_propagate() to
test whether a candidate state matches on every mesh dimension.

Each search state is a tuple of per-input placement tuples. Neighbors are
generated by changing one placement on one mesh dimension for one input,
using _get_neighbor_placements which encodes DTensor redistribute transition
rules (Replicate <-> Shard, Partial -> Replicate/Shard). Cost computation
uses _compute_redistribute_cost which calls _compute_placement_transition_cost
directly per mesh dimension, avoiding DTensorSpec construction and
_gen_transform_infos planning overhead.

Returns None for _StridedShard inputs, signaling the caller to fall back to
full expansion.
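The search structure described above can be sketched in miniature (hypothetical helper, not the DTensor code): Dijkstra over discrete placement states, where neighbors change one position at a time and the search stops at the first state accepted by a match predicate. The real implementation uses DTensor's redistribute transition costs; here every change costs 1.

```python
import heapq

def cheapest_matching_state(start, options, matches):
    """Priority-queue search over placement tuples; returns (cost, state)."""
    heap, seen = [(0, start)], {start}
    while heap:
        cost, state = heapq.heappop(heap)
        if matches(state):                 # first match popped is the cheapest
            return cost, state
        for i in range(len(state)):        # neighbors: change one position
            for opt in options:
                nxt = state[:i] + (opt,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost + 1, nxt))
    return None

# Cheapest way to reach all-"S" placements starting from all-"R".
found = cheapest_matching_state(("R", "R"), ["R", "S"], lambda s: s == ("S", "S"))
```

Because states are popped in cost order, the first accepted state is guaranteed cheapest without enumerating all S^N combinations, which is the point of the PR.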

Authored with Claude.
Pull Request resolved: pytorch#169438
Approved by: https://github.com/anshul-si, https://github.com/zpcore

Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
Targeted filepaths that seem most likely to disrupt torchtitan, can
revisit the specific paths over time.

Fixes pytorch/torchtitan#2350
Pull Request resolved: pytorch#176774
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#175901
First of all, fewer systems now have `wget` installed by default, while almost all Linux/macOS systems come with `curl`.

If a script with that name already exists, `wget` downloads it under a `.${NUM}`-suffixed alias, which results in the reporter posting results from something else; for example, see pytorch#176829
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
python collect_env.py
--2026-03-08 23:35:37--  https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31107 (30K) [text/plain]
Saving to: ‘collect_env.py.1’

collect_env.py.1            100%[==========================================>]  30.38K  --.-KB/s    in 0.04s

2026-03-08 23:35:38 (707 KB/s) - ‘collect_env.py.1’ saved [31107/31107]

Traceback (most recent call last):
  File "/mnt/d/my/work/study/ai/kaggle_code/aimo2/collect_env.py", line 15, in <module>
```
Pull Request resolved: pytorch#176904
Approved by: https://github.com/msaroufim, https://github.com/seemethere
…ame (pytorch#176515)" (pytorch#176948)

## Summary
Reverts pytorch#176515.

This is a prerequisite for reverting the full `[fx] Move _Namespace to C++` series (pytorch#170962), which was reverted internally due to S627920 but the revert was never exported to GitHub.

The quadratic fix patches `torch/csrc/fx/graph.cpp` which was introduced by pytorch#170962. This revert must land first so that pytorch#170962 can be cleanly reverted afterwards.

## Test plan
CI — this revert removes a bugfix from C++ code that will itself be reverted in a follow-up PR.
Pull Request resolved: pytorch#176948
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
This reverts commit 5c68844.

Reverted pytorch#170962 on behalf of https://github.com/wdvr due to reverted in fbcode ([comment](pytorch#170962 (comment)))
)

Summary:
When K == 1, matrix multiplication (M, 1) @ (1, N) is an outer product.
Instead of launching a full GEMM kernel, we decompose it into a broadcasted
pointwise multiply at the ATen decomposition level, which is more efficient
for this memory-bound case.
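The identity behind the decomposition can be checked in a few lines of pure Python (no torch): for K == 1, `(M, 1) @ (1, N)` is an outer product, so every output element is just `a[i] * b[j]`, i.e. a broadcasted pointwise multiply.

```python
def mm_k1(col, row):
    """(M, 1) @ (1, N) for K == 1, computed as a broadcast multiply."""
    return [[a * b for b in row] for a in col]

def mm_naive(A, B):
    """Reference (M, K) @ (K, N) matmul for comparison."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

col, row = [1.0, 2.0, 3.0], [10.0, 20.0]
decomposed = mm_k1(col, row)
reference = mm_naive([[v] for v in col], [row])
```

Both paths produce the same matrix; the win on GPU comes from skipping the GEMM kernel launch for this memory-bound shape, not from different math.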

This is a reland of D94097622 with two fixes:
- Skip decomposition when M==1 or N==1 to avoid output strides
  from the broadcast multiply not matching mm strides.
- Remove `as_strided` stride fixup that was causing issues with Helion
  (SympifyError on symbolic shapes).

The M==1/N==1 guard also applies to the existing CPU K==1 decomposition path.

**aten.mm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.64 | 1.13x |
| (150, 150, 1) | 12.3 | 11.2 | 1.10x | 9.82 | 8.70 | 1.13x |
| (200, 200, 1) | 12.3 | 11.3 | 1.09x | 9.95 | 8.80 | 1.13x |
| (256, 256, 1) | 12.3 | 11.3 | 1.09x | 9.76 | 8.70 | 1.12x |
| (512, 512, 1) | 12.3 | 11.2 | 1.10x | 9.92 | 8.80 | 1.13x |
| (1024, 1024, 1) | 14.3 | 13.2 | 1.09x | 11.39 | 9.44 | 1.21x |
| (2048, 2048, 1) | 20.5 | 15.3 | **1.34x** | 16.19 | 12.83 | **1.26x** |
| (4096, 4096, 1) | 35.8 | 26.8 | **1.33x** | 37.98 | 29.12 | **1.30x** |
| (8192, 8192, 1) | 96.3 | 68.6 | **1.40x** | 120.48 | 89.12 | **1.35x** |
| (16384, 16384, 1) | 329.8 | 234.5 | **1.41x** | 387.42 | 249.54 | **1.55x** |
| (4608, 20, 1) | 13.2 | 11.3 | 1.17x | 10.02 | 8.86 | 1.13x |
| (4608, 32, 1) | 13.2 | 11.3 | 1.17x | 9.95 | 8.86 | 1.12x |
| (4608, 128, 1) | 13.2 | 11.4 | 1.17x | 10.94 | 8.99 | 1.22x |
| (4608, 256, 1) | 14.3 | 13.2 | 1.09x | 12.22 | 9.50 | **1.29x** |
| (4608, 512, 1) | 17.4 | 13.3 | **1.31x** | 14.02 | 10.59 | **1.32x** |
| (4608, 1024, 1) | 20.5 | 15.3 | **1.34x** | 17.06 | 13.18 | **1.29x** |
| (1024, 4096, 1) | 20.5 | 15.3 | **1.34x** | 16.80 | 13.25 | **1.27x** |
| (4096, 1024, 1) | 20.5 | 15.3 | **1.34x** | 16.22 | 12.51 | **1.30x** |

Geomean speedup: B200 **1.21x**, H100 **1.22x**, 0 regressions.

**aten.addmm** — TritonBench, K=1 shapes, median of 3 runs:

| Shape (M, N, K) | B200 base (us) | B200 test (us) | B200 Speedup | H100 base (us) | H100 test (us) | H100 Speedup |
|---|---|---|---|---|---|---|
| (100, 100, 1) | 12.3 | 12.3 | 1.00x | 9.76 | 9.06 | 1.08x |
| (150, 150, 1) | 12.4 | 12.3 | 1.01x | 10.08 | 9.18 | 1.10x |
| (200, 200, 1) | 12.4 | 12.3 | 1.00x | 9.98 | 9.31 | 1.07x |
| (256, 256, 1) | 12.3 | 12.3 | 1.00x | 9.86 | 9.38 | 1.05x |
| (512, 512, 1) | 13.3 | 13.2 | 1.01x | 10.37 | 9.73 | 1.07x |
| (1024, 1024, 1) | 15.3 | 13.3 | 1.15x | 12.32 | 11.20 | 1.10x |
| (2048, 2048, 1) | 23.6 | 18.5 | **1.27x** | 19.01 | 16.19 | **1.17x** |
| (4096, 4096, 1) | 56.3 | 33.8 | **1.66x** | 58.72 | 45.60 | **1.29x** |
| (8192, 8192, 1) | 172.2 | 102.3 | **1.68x** | 166.75 | 148.45 | 1.12x |
| (16384, 16384, 1) | 665.8 | 359.5 | **1.85x** | 638.66 | 503.23 | **1.27x** |
| (4608, 20, 1) | 13.2 | 12.3 | 1.07x | 10.21 | 9.47 | 1.08x |
| (4608, 32, 1) | 13.2 | 12.4 | 1.06x | 10.11 | 9.47 | 1.07x |
| (4608, 128, 1) | 13.3 | 13.2 | 1.00x | 11.68 | 10.27 | 1.14x |
| (4608, 256, 1) | 15.3 | 13.4 | 1.14x | 13.28 | 11.55 | 1.15x |
| (4608, 512, 1) | 18.6 | 15.4 | 1.20x | 15.87 | 13.63 | 1.16x |
| (4608, 1024, 1) | 25.5 | 19.4 | **1.31x** | 21.02 | 17.92 | **1.17x** |
| (1024, 4096, 1) | 23.5 | 18.5 | **1.27x** | 18.94 | 16.38 | 1.16x |
| (4096, 1024, 1) | 23.5 | 18.5 | **1.27x** | 18.98 | 16.29 | 1.17x |

Geomean speedup: B200 **1.19x**, H100 **1.13x**, 0 regressions.

diff-train-skip-merge

Test Plan:
```
PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp_cuda \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 30. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:test_mmdecomp \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 29. Fail 0.

PYTORCH_TEST_REMOTE_GPU=1 buck2 test //caffe2/test/inductor:fxir_backend \
  -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 \
  -c fbcode.enable_gpu_sections=true mode/opt
Pass 76. Fail 0.
```

Reviewed By: PaulZhang12

Differential Revision: D94437532

Pull Request resolved: pytorch#175825
Approved by: https://github.com/PaulZhang12
…ype conversions (pytorch#172696)"

This reverts commit 46cd90c.

Reverted pytorch#172696 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#172696 (comment)))
pytorch#176723)"

This reverts commit 26dddb9.

Reverted pytorch#176723 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 492c742.

Reverted pytorch#176015 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
…#175936)"

This reverts commit 388d61e.

Reverted pytorch#175936 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
This reverts commit 9b53dac.

Reverted pytorch#175924 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](pytorch#175924 (comment)))
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>