Sync scratch branch to upstream main #4

Merged
bobrenjc93 merged 276 commits into bobrenjc93:rest-api-probe-branch from pytorch:main
Mar 12, 2026

Conversation

@bobrenjc93
Owner

Temporary sync PR created via REST API.

malfet and others added 30 commits March 8, 2026 13:02
Those runners are shared between builds and tests and always have a warm cache, so updating submodules is a relatively cheap operation.
test plan: create a new python file with

```
from torch._higher_order_ops.triton_kernel_wrap import get_mutated_tensors
```

and verify that this no longer fails with a circular import error

Pull Request resolved: #176836
Approved by: https://github.com/Skylion007
In FA2, `seqused_k` limits the valid K/V length, and one of its use cases is KV caching: the caller writes cached + new tokens into the K/V buffer externally, then passes the buffer with `seqused_k` marking which tokens are valid.

Pull Request resolved: #175897
Approved by: https://github.com/drisspg
`page_table` is an FA3 feature, so we need to modify the function definitions in `native_functions.yaml`. If this is used with FA2, we throw an error.

Pull Request resolved: #175924
Approved by: https://github.com/drisspg
ghstack dependencies: #175897
`aten/src/ATen/native/transformers/cuda/attention.cu`

- renamed `_flash_attention_forward` to `_flash_attention_forward_impl`. this is now the core logic and takes `optional<Tensor> out`.
- `_flash_attention_forward` is the non-out variant version and is a thin wrapper that calls `_flash_attention_forward_impl` with `out=std::nullopt`
- `_flash_attention_forward_no_dropout_inplace` is the out-variant and calls `_flash_attention_forward_impl` with `Tensor& out`

`aten/src/ATen/native/native_functions.yaml`

- registered a new op `_flash_attention_forward_no_dropout_inplace`

`torch/_meta_registrations.py`

- added meta registration that calls `meta__flash_attention_forward` but doesn't return out tensor

`torch/nn/attention/varlen.py`

- added public `varlen_attn_out` and private custom op `_varlen_attn_out` with `mutates_args={"out"}`

`test/test_varlen_attention.py`

- added out variant to existing tests

Pull Request resolved: #176015
Approved by: https://github.com/drisspg
ghstack dependencies: #175897, #175924, #175936
Added a comma for clarity in the sentence about writing new neural network modules.
Pull Request resolved: #176633
Approved by: https://github.com/albanD
…ser_defined_triton_kernel is disabled (#176832)

Summary: PR #173662 introduces a new fusion `epilogue_fusion_user_defined_triton_kernel`. It overrides the `get_read_writes` method of `UserDefinedTritonKernel` even when `epilogue_fusion_user_defined_triton_kernel` is disabled. This may cause regressions to existing Triton kernel CUDA graphs for other models. This diff fixes the regression by falling back to the original `get_read_writes` when `epilogue_fusion_user_defined_triton_kernel` is disabled.
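The fallback described above amounts to deferring to the base-class method when the feature flag is off. A minimal sketch, with illustrative class and flag names rather than the actual inductor internals:

```python
# Stand-in for the inductor config module.
class Config:
    epilogue_fusion_user_defined_triton_kernel = False

class BaseNode:
    def get_read_writes(self):
        return "original"  # pre-existing behavior

class UserDefinedTritonKernel(BaseNode):
    def get_read_writes(self):
        if not Config.epilogue_fusion_user_defined_triton_kernel:
            # feature disabled: fall back to the original implementation
            return super().get_read_writes()
        return "fused"
```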

Differential Revision: D95727036

Pull Request resolved: #176832
Approved by: https://github.com/AmesingFlank
The first return value of `__tensor_flatten__` can now contain opaque
(non-tensor) values such as DeviceMesh. Opaques flow through the flat
arg list alongside tensors: they get indices, become graph inputs/outputs,
and are pulled from all_args by position during subclass reconstruction
— the same as PlainTensorMeta. An empty OpaqueMeta marker in
SubclassCreationMeta.attrs distinguishes opaque slots from tensor slots
(needed by process_runtime_tangent, which skips non-differentiable
opaques).

All code that previously iterated `__tensor_flatten__` results and assumed
every element was a tensor now handles opaques: subclass_parametrization,
non_strict_utils, FSDP _init_utils, common_utils (get_untyped_storages),
frontend_utils, and parametrize.py (which stores opaques as plain
attributes rather than parameters).

Tests added (all in test/test_opaque_obj_v2.py):
- test_tensor_subclass_with_opaque_attr_backward
- test_tensor_subclass_opaque_backward_compiled_autograd
- test_tensor_subclass_shared_opaque_remapping
- test_shared_opaque_identity_guard
- test_shared_direct_opaque_identity_guard
- test_tensor_subclass_shared_opaque_backward
- test_deeply_nested_tensor_subclass_with_opaque
- test_subclass_parametrization_with_opaque_attrs
- test_export_non_strict_with_opaque_attrs
- test_get_untyped_storages_with_opaque_attrs
- test_subclass_opaque_attrs_cache_hit
- test_value_type_opaque_in_tensor_attrs_errors

Authored with Claude.

Pull Request resolved: #176457
Approved by: https://github.com/ezyang
…in `assert_functional_graph` (#176606)

The `_index_fill` decomposition used mutable `empty_like + copy_` to
restore strides when `index_copy` returned a contiguous tensor, which
broke the functional graph invariant. Replace with the functional
`prims.copy_strided` prim that does the same thing as a single op.

Fixes #144846

Authored with Claude.

Pull Request resolved: #176606
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
Previously, `run_pre_grad_passes` was called unconditionally at the top
of `_compile_fx_main`.  This meant pre-grad transformations were not
included in cached artifacts and ran unnecessarily on cache hits.

Move pre-grad passes into `aot_module_simplified` (Path B) via a
callback so they run after the cache lookup — on cache miss only.

`_compile_fx_main` has two compilation paths that diverge at the
`V.aot_compilation` check: Path A uses `aot_export_module` (AOTInductor,
no cache) and Path B uses `aot_autograd` → `aot_module_simplified` (with
`AOTAutogradCache`).  Since Path A has no cache, run pre-grad passes
explicitly before `aot_export_module`.
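The cache-miss-only callback arrangement can be sketched generically; names here are illustrative, not the actual `_compile_fx_main` internals:

```python
_cache = {}

def compile_with_cache(key, gm, pre_grad_passes, backend_compile):
    if key in _cache:
        return _cache[key]        # cache hit: pre-grad passes are skipped
    gm = pre_grad_passes(gm)      # cache miss: run passes via the callback
    artifact = backend_compile(gm)
    _cache[key] = artifact
    return artifact

calls = []
passes = lambda g: (calls.append("pre"), g)[1]
first = compile_with_cache("k", "gm", passes, lambda g: "compiled")
second = compile_with_cache("k", "gm", passes, lambda g: "compiled")
```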

Pull Request resolved: #176340
Approved by: https://github.com/aorenste
This reverts commit 4bdfaaa.

Reverted #176136 on behalf of https://github.com/wdvr due to reverting to put back on top of the trunk and develop a full forward fix ([comment](#176136 (comment)))
This reverts commit d49571b.

Reverted #173330 on behalf of https://github.com/wdvr due to sorry having issues with fixing build issues internally -- will revert and reland on top of master, and have a combined improved build fix instead of #176136 ([comment](#173330 (comment)))
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned torchtitan hash.
Pull Request resolved: #176850
Approved by: https://github.com/pytorchbot
Summary:
Update the default C++ standard from c++17 to c++20 in
torch.utils.cpp_extension and torch._inductor.cpp_builder.
This affects how user C++ extensions and inductor-generated
code are compiled.
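Extensions that need the old standard can still override the new default explicitly via `extra_compile_args`. A sketch, with a hypothetical extension name and source file:

```python
from torch.utils import cpp_extension

# "my_ext" / "my_ext.cpp" are placeholders; constructing the Extension
# object does not compile anything.
ext = cpp_extension.CppExtension(
    "my_ext",
    ["my_ext.cpp"],
    extra_compile_args=["-std=c++17"],  # opt back out of the c++20 default
)
```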

Test Plan: Sandcastle

Reviewed By: malfet

Pull Request resolved: #176659
Approved by: https://github.com/malfet
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: #176668
Approved by: https://github.com/pytorchbot
This reverts commit ff91f31.

Reverted #176340 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#176340 (comment)))
This reverts commit 7ec1b27.

Reverted #176668 on behalf of https://github.com/huydhn due to This is not ready to land yet ([comment](#176668 (comment)))
Pull Request resolved: #173330
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
# Motivation
Aligned to the other backend, move PeerToPeerAccess from `ATen` to `c10`.

# Additional Context
Just code paste and copy, no additional code change needed.
#174698
Pull Request resolved: #176856
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
…6423)

- When the total score is zero, it must be handled explicitly; otherwise there is a division-by-zero error
- Trying to create a repro proved difficult. Example state that produces this is:

```python
size_hints = {"y": 4096, "x": 64, "r0_": 128}
inductor_meta = {
    "tiling_scores": {"y": 134217728, "x": 2097152, "r0_": 0},
    "num_load": 1,
}
triton_meta = {
    "signature": {
        "in_ptr0": "*fp32",
        "out_ptr0": "*fp32",
        "ynumel": "i32",
        "xnumel": "i32",
        "r0_numel": "i32",
        "YBLOCK": "constexpr",
        "XBLOCK": "constexpr",
        "R0_BLOCK": "constexpr",
    }
}
persistent_reduction = False
```

Pull Request resolved: #176423
Approved by: https://github.com/jansel
This PR enables inductor/test_cpu_repro on AArch64 and marks known failures with xfail and link to issue.

Pull Request resolved: #171095
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/malfet
… kstest helpers (#176226)

Fixes #116271

Allow test/nn/test_init.py to run under TorchDynamo (e.g. with PYTORCH_TEST_WITH_DYNAMO=1) by removing the @skipIfTorchDynamo decorators from the tests that use scipy.stats.kstest and by adding @torch._dynamo.disable to the helpers _is_uniform, _is_normal, and _is_trunc_normal. Those helpers only perform the distribution checks via kstest; the actual init logic (e.g. init.uniform_, init.normal_, init.trunc_normal_) is still exercised under Dynamo.

Tracing into kstest under Dynamo is very slow (and can hit the read-only numpy array path in CI). Disabling Dynamo only for these helpers avoids that cost and those failures while keeping the init behavior under test when Dynamo is enabled.

 @peterbell10, @Lucaskabela
Pull Request resolved: #176226
Approved by: https://github.com/peterbell10
Instead of cloning this directly from the Cutlass repo via `setup.py`, we need to own it ourselves inside of Inductor to do some Tensor mode reordering due to the differences between how Inductor and this kernel need the dims ordered

Pull Request resolved: #176546
Approved by: https://github.com/mlazos
ghstack dependencies: #176543, #176544, #176545
nithinsubbiah and others added 28 commits March 12, 2026 13:28
Changing the `num_stages` value from 1 to 2 enables more efficient pipelining in the Triton backend, which improves performance. Here are some benchmark numbers for comparison, run on MI350X.

  | Attn Type      | Shape (B,Hq,M,Hkv,N,D)         | stages=1 (μs) | stages=2 (μs) | Speedup |
  |----------------|----------------------------------|----------------|----------------|---------|
  | causal         | (2, 16, 512, 16, 512, 64)       | 37.6           | 35.8           | 1.05x   |
  | causal         | (2, 16, 512, 2, 512, 128)       | 35.7           | 35.1           | 1.02x   |
  | causal         | (2, 16, 1024, 16, 1024, 64)     | 39.5           | 31.4           | 1.26x   |
  | causal         | (2, 16, 4096, 16, 4096, 128)    | 680.3          | 580.6          | 1.17x   |
  | causal         | (2, 16, 4096, 2, 4096, 64)      | 259.0          | 238.4          | 1.09x   |
  | noop           | (8, 16, 1024, 16, 1024, 128)    | 196.2          | 183.3          | 1.07x   |
  | causal         | (8, 16, 1024, 2, 1024, 64)      | 79.7           | 75.5           | 1.06x   |
  | alibi          | (8, 16, 4096, 16, 4096, 64)     | 2017.7         | 1727.3         | 1.17x   |
  | causal         | (8, 16, 4096, 16, 4096, 128)    | 2686.0         | 2258.7         | 1.19x   |
  | sliding_window | (8, 16, 4096, 2, 4096, 64)      | 610.4          | 559.3          | 1.09x   |
  | causal         | (16, 16, 512, 16, 512, 128)     | 111.6          | 99.0           | 1.13x   |
  | alibi          | (16, 16, 1024, 2, 1024, 128)    | 391.6          | 335.3          | 1.17x   |
  | causal         | (16, 16, 1024, 16, 1024, 64)    | 163.6          | 142.6          | 1.15x   |
  | noop           | (16, 16, 4096, 16, 4096, 128)   | 6260.5         | 5130.3         | 1.22x   |
  | causal         | (16, 16, 4096, 2, 4096, 64)     | 2084.5         | 1780.5         | 1.17x   |
  | causal         | (1, 32, 16384, 4, 16384, 64)    | 2687.9         | 2472.8         | 1.09x   |
  | **Geo-mean**   |                                  |                |                | **1.13x** |

  All configs: `num_warps=4`, `dtype=bfloat16`, fwd only. Benchmarked with `attention-gym` on ROCm.

Pull Request resolved: #176676
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
1. Created privateuse1_profiler.h/.cpp — A registry pattern that allows PrivateUse1 backends to register IActivityProfiler factories via REGISTER_PRIVATEUSE1_PROFILER(MyProfiler) macro, with compile-time static_assert ensuring the class inherits from libkineto::IActivityProfiler.
    * This makes the assumption that backends will take a dependency on Kineto to use IActivityProfiler interface. Right now the backends have to check in their implementation to Kineto - so this might be a step up and a safe assumption.
    * As an alternative, PyTorch could define its own abstract interface that mirrors IActivityProfiler, then internally forward to Kineto.
2. Kineto init paths — Added onKinetoInit() calls in kineto_shim.cpp (user-triggered profiling via prepareTrace()), but _not_ for kineto_client_interface.cpp (daemon mode via global_kineto_init()), with guards to ensure Kineto is initialized before forwarding.

TODO
1. [Done] Gate this behind a new ProfilerState::KINETO_PRIVATEUSE1 check
2. [Done] Check how (if at all) kineto build args need to change. Mostly it shouldn't, as for privateuse1 we won't need CUDA/ROCm/XPU etc.
3. [Done] How does this break kineto's fbcode setup? Not applicable
Pull Request resolved: #172154
Approved by: https://github.com/scotts
…171269)

# Motivation
The original goal was to generalize `CUDAGraph` and share implementations and logic across different backends, as mentioned in #158827. However, after further offline discussions, we decided to take a more incremental approach: start by defining a unified interface, while allowing each backend to maintain its own implementation. This avoids premature coupling and addresses backend-specific concerns.

This PR introduces `GraphImplInterface`, a lightweight, backend-agnostic interface that defines a unified API for graph capture and replay. Each backend (e.g., `CUDA`, `XPU`, `PrivateUse1`) provides its own implementation and registers it via `REGISTER_GRAPH_IMPL`.
On top of this interface, we provide a unified graph API, `at::accelerator::Graph`, which transparently maps to:
- `CUDAGraph` on CUDA
- `XPUGraph` on XPU
- and corresponding implementations for other backends (including `PrivateUse1`)

This design establishes a common abstraction layer while preserving backend autonomy, and lays the groundwork for future sharing of logic once the interface and use cases have stabilized.

An additional benefit is that, for `CUDA` and `XPU`, the backend-specific graph types (e.g., `cuda::CUDAGraph` and `xpu::XPUGraph`) can share the same underlying implementation as `accelerator::Graph` on each backend, avoiding code duplication and ensuring consistent behavior.

For `PrivateUse1`, `accelerator::Graph` can be supported with minimal effort by reusing the existing `PU1Graph` implementation.

Pull Request resolved: #171269
Approved by: https://github.com/EikanWang, https://github.com/eellison
As decomposition via two triangular solves

Frequently requested op in  #154052
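The two-triangular-solves technique in general: for SPD `A` with Cholesky factor `L` (so `A = L @ L.T`), `A x = b` is solved as `L y = b` followed by `L.T x = y`. A generic illustration of the technique, not the specific op in this PR:

```python
import torch

# Build a well-conditioned SPD matrix and a right-hand side.
A = torch.randn(4, 4, dtype=torch.float64)
A = A @ A.T + 4 * torch.eye(4, dtype=torch.float64)
b = torch.randn(4, 1, dtype=torch.float64)

L = torch.linalg.cholesky(A)
y = torch.linalg.solve_triangular(L, b, upper=False)   # L y = b
x = torch.linalg.solve_triangular(L.T, y, upper=True)  # L.T x = y
```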
Pull Request resolved: #176703
Approved by: https://github.com/kurtamohler, https://github.com/malfet
…177207)

----

- Remove `test_bfloat_constant`, `test_lowp_reduction`, and `test_lowp_where` as they don't test for anything beyond what existing tests cover.
- Add `test_pad_after_gelu` as a regression test for Voxtral compilation on MPS, exercising pad(gelu(x)) across fp32, fp16, and bfloat16.

Before #176436 the test would fail with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    #include <c10/metal/special_math.h>
    kernel void generated_kernel(
        device bfloat* out_ptr0,
        constant bfloat* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (17);
        int x1 = c10::metal::floor_divide(xindex, 17);
        int x2 = xindex;
        auto tmp0 = (-1) + x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        bfloat tmp4;
        if (tmp3) {
            auto tmp_scoped_0 = static_cast<float>(in_ptr0[(-1) + x0 + 16*x1]);
            auto tmp_scoped_1 = static_cast<float>(tmp_scoped_0);
            auto tmp_scoped_2 = 0.5;
            auto tmp_scoped_3 = tmp_scoped_1 * tmp_scoped_2;
            auto tmp_scoped_4 = 0.7071067811865476;
            auto tmp_scoped_5 = tmp_scoped_1 * tmp_scoped_4;
            auto tmp_scoped_6 = c10::metal::erf(tmp_scoped_5);
            auto tmp_scoped_7 = 1.0;
            auto tmp_scoped_8 = tmp_scoped_6 + tmp_scoped_7;
            auto tmp_scoped_9 = tmp_scoped_3 * tmp_scoped_8;
            auto tmp_scoped_10 = static_cast<bfloat>(tmp_scoped_9);
            tmp4 = tmp_scoped_10;
        } else tmp4 = 0.0;
        out_ptr0[x2] = static_cast<bfloat>(tmp4);
    }
 with program_source:4495:23: error: assigning to 'bfloat' from incompatible type 'float'
        } else tmp4 = 0.0;
                      ^~~
```

Authored with Claude.

Pull Request resolved: #177207
Approved by: https://github.com/atalman, https://github.com/mergennachin, https://github.com/jansel
…177202)

The codegen'd subclass wrapper only iterated over inp_metas when
building unwrapped_args, but FunctionalizedRngRuntimeWrapper appends
rng seed/offset to args after inp_metas. The old data-driven loop
in runtime_unwrap_tensor_subclasses passed these through because it
iterated over all args. Add an extend() call to forward any trailing
args not covered by inp_metas.
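The fix reduces to a positional pass-through after the meta-driven loop. A minimal sketch, with a trivial stand-in for the subclass unwrapping:

```python
def unwrap_args(inp_metas, args):
    # unwrap only the args covered by inp_metas (identity here for brevity)
    unwrapped = [args[i] for i in range(len(inp_metas))]
    # forward trailing args (e.g. rng seed/offset) positionally
    unwrapped.extend(args[len(inp_metas):])
    return unwrapped
```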

Authored with Claude.
Pull Request resolved: #177202
Approved by: https://github.com/Lucaskabela
The underlying bug (eager vs AOTDispatcher output mismatch for
as_strided_scatter) was fixed by #150543. Remove the stale skip.

Closes #85879
Pull Request resolved: #177203
Approved by: https://github.com/aorenste, https://github.com/zou3519
@sraikund16 is no longer working on the Profiler.
Pull Request resolved: #177176
Approved by: https://github.com/ryanzhang22
## Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

## Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:
- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results
- Only occurs in `max-autotune` mode

**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.

## Solution

 - Added GPU-specific compile mode selection for small M values
 - gfx120x with M < 16: use `compile_mode="default"`
 - All other cases: use `compile_mode="max-autotune"`
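The selection logic above can be sketched as a small helper; the function name and the exact threshold are illustrative of the test-side workaround, not library code:

```python
def pick_compile_mode(arch: str, m: int) -> str:
    if arch.startswith("gfx120") and m < 16:
        return "default"        # sidestep miscompiled autotuned kernels
    return "max-autotune"
```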

Pull Request resolved: #174873
Approved by: https://github.com/jeffdaily
This reverts commit 6d12a21.

Reverted #174718 on behalf of https://github.com/eellison due to inadvertently causes fill_ on inductor allocations, not just user empty_strided ([comment](#174718 (comment)))
Fixes #167636

#### Summary
- Fix `meta_fft_r2c` missing CPU/MKL code path
- Fix `_sort_dims` no-op also add `reverse=True` to match C++ descending sort
- Fix `_exec_fft` signal dim ordering
- Add PocketFFT early return paths for `meta_fft_c2c` and `meta_fft_c2r` on CPU without MKL
- Move FFT ops from `CHECK_STRIDES_SKIPS` to `CHECK_STRIDES` in `test_meta.py`
- Use `torch.backends.mkl.is_available()` per [this](#167636 (comment))

Pull Request resolved: #175731
Approved by: https://github.com/isuruf
The list-based reduce_scatter_ op was the only reduce_scatter variant
missing from LocalTensorMode's dispatch table. This is the variant used
by `dist.reduce_scatter(output, input_list)`, which is the only API that
supports uneven split sizes across ranks.

Authored with Claude.
Pull Request resolved: #175710
Approved by: https://github.com/dzmitry-huba
…176162)

test_max_autotune.py:test_max_autotune_exhaustive() was using cuda template config heuristics, but it needed to use rocm template heuristics instead

Pull Request resolved: #176162
Approved by: https://github.com/jeffdaily
…#175710)"

This reverts commit 90035af.

Reverted #175710 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175710 (comment)))
…g functionalization (#177213)

During `_functionalized_f_helper`, we call `before.copy_(after)`. If the compiled function is a custom autograd function with an inplace op during the backward, this can trigger [inplace correctness check](https://docs.pytorch.org/docs/stable/autograd.html#in-place-correctness-checks) unintentionally. This causes an exception
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

This PR instead wraps the `before.copy_(after)` with a `torch.no_grad()` context to avoid the inplace check.
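The check being sidestepped can be demonstrated standalone: `copy_` into a leaf that requires grad raises outside `no_grad` and succeeds inside it.

```python
import torch

leaf = torch.randn(3, requires_grad=True)
src = torch.randn(3)
try:
    leaf.copy_(src)  # raises: in-place op on a leaf that requires grad
except RuntimeError:
    pass
with torch.no_grad():
    leaf.copy_(src)  # no autograd tracking, so the check is not triggered
```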

Example script:
```
import torch

class MutateBufferInBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x * buf.mean()

    @staticmethod
    def backward(ctx, grad_output):
        (buf,) = ctx.saved_tensors
        buf.mul_(2.0)
        return grad_output * buf.mean(), None

@torch.compile
def f(x, buf):
    return MutateBufferInBackward.apply(x, buf)

def main():
    device = "cuda"
    x = torch.randn(16, device=device, requires_grad=True)
    buf = torch.randn(16, device=device, requires_grad=True)

    out = f(x, buf)
    out.sum().backward()
    print("Success!")

if __name__ == "__main__":
    main()
```

Pull Request resolved: #177213
Approved by: https://github.com/aorenste, https://github.com/frgossen
…6622)

## Stack

This is part of a PR stack. Merge order:
1. **#176622 — Pallas: strided access support (this PR)**
2. #176952 — Pallas: permutation detection
3. #177212 — Pallas: scalar prefetch (depends on #176952)

## Summary

- Add strided access support via reshape + static indexing for non-contiguous tensor patterns
- Fix torch_tpu keyword-only API changes in `register_custom_kernel` and `call_custom_kernel`
- Set `jax_default_device` to CPU when running in interpret mode on TPU machines

## Test plan

- Full test suite on TPU: 869 passed, 153 skipped, 69 xfailed, 0 unexpected failures
- `python -m pytest test/inductor/test_pallas.py -v -k "strided"` — all strided access tests pass
Pull Request resolved: #176622
Approved by: https://github.com/oulgen, https://github.com/norx1991
Reland of #177047 which was auto-reverted due to lint failures.

This reverts commit 4bc9d7f ("Reapply "[fx] Move the rest of torch.fx.Node to C++ (#170946)" (#177047)").

Authored with Claude.
Pull Request resolved: #177183
Approved by: https://github.com/huydhn
Add per-graph dynamo config overrides, mirroring the existing
debug_inductor_config_override. Set via TORCH_COMPILE_OVERRIDE_DYNAMO_CONFIGS
env var or config.debug_dynamo_config_override. The override is applied as a
config.patch context manager around the entire compile_frame call, so dynamo
config is patched for all tracing and backend compilation within that frame.
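The `config.patch` mechanism the override rides on works like any scoped patch context; `suppress_errors` below is just an example knob, not the config this PR targets:

```python
from torch._dynamo import config

with config.patch(suppress_errors=True):
    inside = config.suppress_errors   # patched for the duration of the context
restored = config.suppress_errors     # previous value restored on exit
```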

Warn when TORCH_COMPILE_OVERRIDE_DYNAMO_CONFIGS is set, since dynamo
config overrides are keyed by frame ID and some configs can affect graph
breaks, shifting frame IDs.

Authored with Claude.

Pull Request resolved: #176734
Approved by: https://github.com/williamwen42
Add pattern-based skip. Let me know if you prefer regular exclude list.
Pull Request resolved: #177046
Approved by: https://github.com/malfet
…ble state (#177095)

Fixes #172088

Previously, `torch_function_mode_stack_state_mgr` only cleared the C-level
mode stack during `trace_frame` (via `preserve_global_state`). This meant
compilation infrastructure running outside tracing — guard building, global
state cleanup — would trigger real `__torch_function__` dispatch, mutating
mode state (e.g. incrementing a counter) and causing the compile-time guard
verification to fail with "Guard failed on the same frame it was created".

This change moves the mode stack save/clear/restore up to `compile_inner` so
modes are off the C stack for the entire compilation pipeline. For guard
building, modes are temporarily restored so guard expressions can reference
them, but `DisableTorchFunction` prevents dispatch during construction.
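The dispatch-suppression building block can be seen in isolation: a counting torch-function mode observes torch API calls except under `DisableTorchFunction`. A minimal sketch:

```python
import torch
from torch.overrides import TorchFunctionMode

class CountingMode(TorchFunctionMode):
    def __init__(self):
        super().__init__()
        self.count = 0

    def __torch_function__(self, func, types, args=(), kwargs=None):
        self.count += 1
        return func(*args, **(kwargs or {}))

t = torch.ones(2)
with CountingMode() as m:
    torch.add(t, 1)                      # dispatches through the mode
    with torch._C.DisableTorchFunction():
        torch.add(t, 1)                  # mode not invoked here
```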

Co-authored-by: Claude <noreply@anthropic.com>

Pull Request resolved: #177095
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
… tests (#177211)

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
Pull Request resolved: #177211
Approved by: https://github.com/malfet
## PR Summary

Fixes #176067, a `torch.compile(..., fullgraph=True)` failure for `Tensor.new_tensor(...)`.

## Repro

```python
import torch

def f1(a):
    num = a.nonzero().squeeze(-1).numel()
    return torch.tensor([num])  # works

def f2(a):
    num = a.nonzero().squeeze(-1).numel()
    return a.new_tensor([num])  # fails before this change

a = torch.tensor([1, 0])
torch.compile(f1, fullgraph=True)(a)
torch.compile(f2, fullgraph=True)(a)
```

Prior to this change, `f2` could fail with:

```text
torch._dynamo.exc.UserError: Could not extract specialized integer from
data-dependent expression u0 (unhinted: u0). (Size-like symbols: u0)
```

related issue: #176067

Pull Request resolved: #176390
Approved by: https://github.com/Lucaskabela
Fixes #89630

## Summary
NLL backward was failing under compile when `self` is 1D: `target.unsqueeze(0)` produced a 2D index for the subsequent 1D scatter. The fix mirrors the C++ kernel's behavior of only using `target[0]`.

## Test
```bash
python -m pytest test/inductor/test_torchinductor.py -xvs -k "test_nll_loss_backward_1d_input"
```

Pull Request resolved: #177189
Approved by: https://github.com/frgossen
Use `torch.cpu.get_capabilities()` rather than parse `/proc/cpuinfo` and restrict the check to just Linux systems

Followup after #171095
Pull Request resolved: #177177
Approved by: https://github.com/Skylion007
Two main changes: push the reviewer to use sub-agents, and add claude-suggested and manually reviewed patterns that we want to see, trying to strike a thorough-but-not-too-verbose balance.
Pull Request resolved: #177288
Approved by: https://github.com/drisspg
@bobrenjc93 bobrenjc93 merged commit 078b581 into bobrenjc93:rest-api-probe-branch Mar 12, 2026
bobrenjc93 pushed a commit that referenced this pull request Mar 16, 2026
…nces between x86 vs aarch64 (pytorch#176085)

In the test:

```
python  test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```
it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps due to these issues:
  - pytorch#119905
  - pytorch#134387

In the current setup when getting a stack trace string:
- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`

An example of the full string from an aarch64 system:
```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe pytorch#5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe 
pytorch#6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe pytorch#7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#11: /usr/bin/python() [0x504a34]\nframe pytorch#12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe pytorch#13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe pytorch#14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe pytorch#15: /usr/bin/python() [0x52a070]\nframe pytorch#16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe pytorch#17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe pytorch#18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe pytorch#19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe pytorch#20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe pytorch#21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe pytorch#22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe pytorch#23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: pytorch#176085
Approved by: https://github.com/eqy