Sync scratch branch to upstream main #4

Merged
bobrenjc93 merged 276 commits into bobrenjc93:rest-api-probe-branch from pytorch:main
Mar 12, 2026

Conversation

@bobrenjc93
Owner

Temporary sync PR created via REST API.

malfet and others added 30 commits March 8, 2026 13:02
Those runners are shared between builds and tests and always have a warm cache, so updating submodules is a relatively cheap operation.
test plan: create a new python file with

```
from torch._higher_order_ops.triton_kernel_wrap import get_mutated_tensors
```

and verify that this no longer fails with a circular import error

Pull Request resolved: #176836
Approved by: https://github.com/Skylion007
In FA2, `seqused_k` limits the valid K/V length, and one of its use cases is KV caching: the caller writes cached + new tokens into the K/V buffer externally, then passes the buffer with `seqused_k` marking which tokens are valid.

Pull Request resolved: #175897
Approved by: https://github.com/drisspg
`page_table` is an FA3 feature, so we need to modify the function definitions in `native_functions.yaml`. If this is used with FA2, we throw an error.

Pull Request resolved: #175924
Approved by: https://github.com/drisspg
ghstack dependencies: #175897
`aten/src/ATen/native/transformers/cuda/attention.cu`

- renamed `_flash_attention_forward` to `_flash_attention_forward_impl`. this is now the core logic and takes `optional<Tensor> out`.
- `_flash_attention_forward` is the non-out variant version and is a thin wrapper that calls `_flash_attention_forward_impl` with `out=std::nullopt`
- `_flash_attention_forward_no_dropout_inplace` is the out-variant and calls `_flash_attention_forward_impl` with `Tensor& out`

`aten/src/ATen/native/native_functions.yaml`

- registered a new op `_flash_attention_forward_no_dropout_inplace`

`torch/_meta_registrations.py`

- added meta registration that calls `meta__flash_attention_forward` but doesn't return out tensor

`torch/nn/attention/varlen.py`

- added public `varlen_attn_out` and private custom op `_varlen_attn_out` with `mutates_args={"out"}`

`test/test_varlen_attention.py`

- added out variant to existing tests

Pull Request resolved: #176015
Approved by: https://github.com/drisspg
ghstack dependencies: #175897, #175924, #175936
Added a comma for clarity in the sentence about writing new neural network modules.
Pull Request resolved: #176633
Approved by: https://github.com/albanD
…ser_defined_triton_kernel is disabled (#176832)

Summary: PR #173662 introduces a new fusion `epilogue_fusion_user_defined_triton_kernel`. It overrides the `get_read_writes` method of `UserDefinedTritonKernel` even when `epilogue_fusion_user_defined_triton_kernel` is disabled. This may cause regressions to existing Triton kernel CUDA graphs for other models. This diff fixes the regression by falling back to the original `get_read_writes` when `epilogue_fusion_user_defined_triton_kernel` is disabled.
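The fallback described above amounts to deferring to the base-class method when the feature flag is off. A minimal sketch, with illustrative class and flag names rather than the actual inductor internals:

```python
# Stand-in for the inductor config module.
class Config:
    epilogue_fusion_user_defined_triton_kernel = False

class BaseNode:
    def get_read_writes(self):
        return "original"  # pre-existing behavior

class UserDefinedTritonKernel(BaseNode):
    def get_read_writes(self):
        if not Config.epilogue_fusion_user_defined_triton_kernel:
            # feature disabled: fall back to the original implementation
            return super().get_read_writes()
        return "fused"
```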

Differential Revision: D95727036

Pull Request resolved: #176832
Approved by: https://github.com/AmesingFlank
The first return value of `__tensor_flatten__` can now contain opaque
(non-tensor) values such as DeviceMesh. Opaques flow through the flat
arg list alongside tensors: they get indices, become graph inputs/outputs,
and are pulled from all_args by position during subclass reconstruction
— the same as PlainTensorMeta. An empty OpaqueMeta marker in
SubclassCreationMeta.attrs distinguishes opaque slots from tensor slots
(needed by process_runtime_tangent, which skips non-differentiable
opaques).

All code that previously iterated `__tensor_flatten__` results and assumed
every element was a tensor now handles opaques: subclass_parametrization,
non_strict_utils, FSDP _init_utils, common_utils (get_untyped_storages),
frontend_utils, and parametrize.py (which stores opaques as plain
attributes rather than parameters).

Tests added (all in test/test_opaque_obj_v2.py):
- test_tensor_subclass_with_opaque_attr_backward
- test_tensor_subclass_opaque_backward_compiled_autograd
- test_tensor_subclass_shared_opaque_remapping
- test_shared_opaque_identity_guard
- test_shared_direct_opaque_identity_guard
- test_tensor_subclass_shared_opaque_backward
- test_deeply_nested_tensor_subclass_with_opaque
- test_subclass_parametrization_with_opaque_attrs
- test_export_non_strict_with_opaque_attrs
- test_get_untyped_storages_with_opaque_attrs
- test_subclass_opaque_attrs_cache_hit
- test_value_type_opaque_in_tensor_attrs_errors

Authored with Claude.

Pull Request resolved: #176457
Approved by: https://github.com/ezyang
…in `assert_functional_graph` (#176606)

The `_index_fill` decomposition used mutable `empty_like + copy_` to
restore strides when `index_copy` returned a contiguous tensor, which
broke the functional graph invariant. Replace with the functional
`prims.copy_strided` prim that does the same thing as a single op.

Fixes #144846

Authored with Claude.

Pull Request resolved: #176606
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
Previously, `run_pre_grad_passes` was called unconditionally at the top
of `_compile_fx_main`.  This meant pre-grad transformations were not
included in cached artifacts and ran unnecessarily on cache hits.

Move pre-grad passes into `aot_module_simplified` (Path B) via a
callback so they run after the cache lookup — on cache miss only.

`_compile_fx_main` has two compilation paths that diverge at the
`V.aot_compilation` check: Path A uses `aot_export_module` (AOTInductor,
no cache) and Path B uses `aot_autograd` → `aot_module_simplified` (with
`AOTAutogradCache`).  Since Path A has no cache, run pre-grad passes
explicitly before `aot_export_module`.
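The cache-miss-only callback arrangement can be sketched generically; names here are illustrative, not the actual `_compile_fx_main` internals:

```python
_cache = {}

def compile_with_cache(key, gm, pre_grad_passes, backend_compile):
    if key in _cache:
        return _cache[key]        # cache hit: pre-grad passes are skipped
    gm = pre_grad_passes(gm)      # cache miss: run passes via the callback
    artifact = backend_compile(gm)
    _cache[key] = artifact
    return artifact

calls = []
passes = lambda g: (calls.append("pre"), g)[1]
first = compile_with_cache("k", "gm", passes, lambda g: "compiled")
second = compile_with_cache("k", "gm", passes, lambda g: "compiled")
```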

Pull Request resolved: #176340
Approved by: https://github.com/aorenste
This reverts commit 4bdfaaa.

Reverted #176136 on behalf of https://github.com/wdvr due to reverting to put back on top of the trunk and develop a full forward fix ([comment](#176136 (comment)))
This reverts commit d49571b.

Reverted #173330 on behalf of https://github.com/wdvr due to sorry having issues with fixing build issues internally -- will revert and reland on top of master, and have a combined improved build fix instead of #176136 ([comment](#173330 (comment)))
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned torchtitan hash.
Pull Request resolved: #176850
Approved by: https://github.com/pytorchbot
Summary:
Update the default C++ standard from c++17 to c++20 in
torch.utils.cpp_extension and torch._inductor.cpp_builder.
This affects how user C++ extensions and inductor-generated
code are compiled.
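Extensions that need the old standard can still override the new default explicitly via `extra_compile_args`. A sketch, with a hypothetical extension name and source file:

```python
from torch.utils import cpp_extension

# "my_ext" / "my_ext.cpp" are placeholders; constructing the Extension
# object does not compile anything.
ext = cpp_extension.CppExtension(
    "my_ext",
    ["my_ext.cpp"],
    extra_compile_args=["-std=c++17"],  # opt back out of the c++20 default
)
```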

Test Plan: Sandcastle

Reviewed By: malfet

Pull Request resolved: #176659
Approved by: https://github.com/malfet
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: #176668
Approved by: https://github.com/pytorchbot
This reverts commit ff91f31.

Reverted #176340 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#176340 (comment)))
This reverts commit 7ec1b27.

Reverted #176668 on behalf of https://github.com/huydhn due to This is not ready to land yet ([comment](#176668 (comment)))
Pull Request resolved: #173330
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
# Motivation
Aligned to the other backend, move PeerToPeerAccess from `ATen` to `c10`.

# Additional Context
Just code paste and copy, no additional code change needed.
#174698
Pull Request resolved: #176856
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
…6423)

- When the total score is zero, it must be handled explicitly; otherwise there is a division-by-zero error
- Trying to create a repro proved difficult. Example state that produces this is:

```python
size_hints = {"y": 4096, "x": 64, "r0_": 128}
inductor_meta = {
    "tiling_scores": {"y": 134217728, "x": 2097152, "r0_": 0},
    "num_load": 1,
}
triton_meta = {
    "signature": {
        "in_ptr0": "*fp32",
        "out_ptr0": "*fp32",
        "ynumel": "i32",
        "xnumel": "i32",
        "r0_numel": "i32",
        "YBLOCK": "constexpr",
        "XBLOCK": "constexpr",
        "R0_BLOCK": "constexpr",
    }
}
persistent_reduction = False
```

Pull Request resolved: #176423
Approved by: https://github.com/jansel
This PR enables inductor/test_cpu_repro on AArch64 and marks known failures with xfail and link to issue.

Pull Request resolved: #171095
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/malfet
… kstest helpers (#176226)

Fixes #116271

Allow test/nn/test_init.py to run under TorchDynamo (e.g. with PYTORCH_TEST_WITH_DYNAMO=1) by removing the @skipIfTorchDynamo decorators from the tests that use scipy.stats.kstest and by adding @torch._dynamo.disable to the helpers _is_uniform, _is_normal, and _is_trunc_normal. Those helpers only perform the distribution checks via kstest; the actual init logic (e.g. init.uniform_, init.normal_, init.trunc_normal_) is still exercised under Dynamo.

Tracing into kstest under Dynamo is very slow (and can hit the read-only numpy array path in CI). Disabling Dynamo only for these helpers avoids that cost and those failures while keeping the init behavior under test when Dynamo is enabled.

 @peterbell10, @Lucaskabela
Pull Request resolved: #176226
Approved by: https://github.com/peterbell10
Instead of cloning this directly from the Cutlass repo via `setup.py`, we need to own it ourselves inside of Inductor to do some Tensor mode reordering due to the differences between how Inductor and this kernel need the dims ordered

Pull Request resolved: #176546
Approved by: https://github.com/mlazos
ghstack dependencies: #176543, #176544, #176545
nithinsubbiah and others added 28 commits March 12, 2026 13:28
Changing the `num_stages` value from 1 to 2 enables more efficient pipelining in the Triton backend, which improves performance. Here are some benchmark numbers for comparison, run on MI350X.

  | Attn Type      | Shape (B,Hq,M,Hkv,N,D)         | stages=1 (μs) | stages=2 (μs) | Speedup |
  |----------------|----------------------------------|----------------|----------------|---------|
  | causal         | (2, 16, 512, 16, 512, 64)       | 37.6           | 35.8           | 1.05x   |
  | causal         | (2, 16, 512, 2, 512, 128)       | 35.7           | 35.1           | 1.02x   |
  | causal         | (2, 16, 1024, 16, 1024, 64)     | 39.5           | 31.4           | 1.26x   |
  | causal         | (2, 16, 4096, 16, 4096, 128)    | 680.3          | 580.6          | 1.17x   |
  | causal         | (2, 16, 4096, 2, 4096, 64)      | 259.0          | 238.4          | 1.09x   |
  | noop           | (8, 16, 1024, 16, 1024, 128)    | 196.2          | 183.3          | 1.07x   |
  | causal         | (8, 16, 1024, 2, 1024, 64)      | 79.7           | 75.5           | 1.06x   |
  | alibi          | (8, 16, 4096, 16, 4096, 64)     | 2017.7         | 1727.3         | 1.17x   |
  | causal         | (8, 16, 4096, 16, 4096, 128)    | 2686.0         | 2258.7         | 1.19x   |
  | sliding_window | (8, 16, 4096, 2, 4096, 64)      | 610.4          | 559.3          | 1.09x   |
  | causal         | (16, 16, 512, 16, 512, 128)     | 111.6          | 99.0           | 1.13x   |
  | alibi          | (16, 16, 1024, 2, 1024, 128)    | 391.6          | 335.3          | 1.17x   |
  | causal         | (16, 16, 1024, 16, 1024, 64)    | 163.6          | 142.6          | 1.15x   |
  | noop           | (16, 16, 4096, 16, 4096, 128)   | 6260.5         | 5130.3         | 1.22x   |
  | causal         | (16, 16, 4096, 2, 4096, 64)     | 2084.5         | 1780.5         | 1.17x   |
  | causal         | (1, 32, 16384, 4, 16384, 64)    | 2687.9         | 2472.8         | 1.09x   |
  | **Geo-mean**   |                                  |                |                | **1.13x** |

  All configs: `num_warps=4`, `dtype=bfloat16`, fwd only. Benchmarked with `attention-gym` on ROCm.

Pull Request resolved: #176676
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
1. Created privateuse1_profiler.h/.cpp — A registry pattern that allows PrivateUse1 backends to register IActivityProfiler factories via REGISTER_PRIVATEUSE1_PROFILER(MyProfiler) macro, with compile-time static_assert ensuring the class inherits from libkineto::IActivityProfiler.
    * This makes the assumption that backends will take a dependency on Kineto to use IActivityProfiler interface. Right now the backends have to check in their implementation to Kineto - so this might be a step up and a safe assumption.
    * As an alternative, PyTorch could define its own abstract interface that mirrors IActivityProfiler, then internally forward to Kineto.
2. Kineto init paths — Added onKinetoInit() calls in kineto_shim.cpp (user-triggered profiling via prepareTrace()), but _not_ for kineto_client_interface.cpp (daemon mode via global_kineto_init()), with guards to ensure Kineto is initialized before forwarding.

TODO
1. [Done] Gate this behind a new ProfilerState::KINETO_PRIVATEUSE1 check
2. [Done] Check how (if at all) kineto build args need to change. Mostly it shouldn't, as for privateuse1 we won't need CUDA/ROCm/XPU etc.
3. [Done] How does this break kineto's fbcode setup? Not applicable
Pull Request resolved: #172154
Approved by: https://github.com/scotts
…171269)

# Motivation
The original goal was to generalize `CUDAGraph` and share implementations and logic across different backends, as mentioned in #158827. However, after further offline discussions, we decided to take a more incremental approach: start by defining a unified interface, while allowing each backend to maintain its own implementation. This avoids premature coupling and addresses backend-specific concerns.

This PR introduces `GraphImplInterface`, a lightweight, backend-agnostic interface that defines a unified API for graph capture and replay. Each backend (e.g., `CUDA`, `XPU`, `PrivateUse1`) provides its own implementation and registers it via `REGISTER_GRAPH_IMPL`.
On top of this interface, we provide a unified graph API, `at::accelerator::Graph`, which transparently maps to:
- `CUDAGraph` on CUDA
- `XPUGraph` on XPU
- and corresponding implementations for other backends (including `PrivateUse1`)

This design establishes a common abstraction layer while preserving backend autonomy, and lays the groundwork for future sharing of logic once the interface and use cases have stabilized.

An additional benefit is that, for `CUDA` and `XPU`, the backend-specific graph types (e.g., `cuda::CUDAGraph` and `xpu::XPUGraph`) can share the same underlying implementation as `accelerator::Graph` on each backend, avoiding code duplication and ensuring consistent behavior.

For `PrivateUse1`, `accelerator::Graph` can be supported with minimal effort by reusing the existing `PU1Graph` implementation.

Pull Request resolved: #171269
Approved by: https://github.com/EikanWang, https://github.com/eellison
As decomposition via two triangular solves

Frequently requested op in  #154052
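The two-triangular-solves technique in general: for SPD `A` with Cholesky factor `L` (so `A = L @ L.T`), `A x = b` is solved as `L y = b` followed by `L.T x = y`. A generic illustration of the technique, not the specific op in this PR:

```python
import torch

# Build a well-conditioned SPD matrix and a right-hand side.
A = torch.randn(4, 4, dtype=torch.float64)
A = A @ A.T + 4 * torch.eye(4, dtype=torch.float64)
b = torch.randn(4, 1, dtype=torch.float64)

L = torch.linalg.cholesky(A)
y = torch.linalg.solve_triangular(L, b, upper=False)   # L y = b
x = torch.linalg.solve_triangular(L.T, y, upper=True)  # L.T x = y
```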
Pull Request resolved: #176703
Approved by: https://github.com/kurtamohler, https://github.com/malfet
…177207)

----

- Remove `test_bfloat_constant`, `test_lowp_reduction`, and `test_lowp_where` as they don't test for anything beyond what existing tests cover.
- Add `test_pad_after_gelu` as a regression test for Voxtral compilation on MPS, exercising pad(gelu(x)) across fp32, fp16, and bfloat16.

Before #176436 the test would fail with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    #include <c10/metal/special_math.h>
    kernel void generated_kernel(
        device bfloat* out_ptr0,
        constant bfloat* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (17);
        int x1 = c10::metal::floor_divide(xindex, 17);
        int x2 = xindex;
        auto tmp0 = (-1) + x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        bfloat tmp4;
        if (tmp3) {
            auto tmp_scoped_0 = static_cast<float>(in_ptr0[(-1) + x0 + 16*x1]);
            auto tmp_scoped_1 = static_cast<float>(tmp_scoped_0);
            auto tmp_scoped_2 = 0.5;
            auto tmp_scoped_3 = tmp_scoped_1 * tmp_scoped_2;
            auto tmp_scoped_4 = 0.7071067811865476;
            auto tmp_scoped_5 = tmp_scoped_1 * tmp_scoped_4;
            auto tmp_scoped_6 = c10::metal::erf(tmp_scoped_5);
            auto tmp_scoped_7 = 1.0;
            auto tmp_scoped_8 = tmp_scoped_6 + tmp_scoped_7;
            auto tmp_scoped_9 = tmp_scoped_3 * tmp_scoped_8;
            auto tmp_scoped_10 = static_cast<bfloat>(tmp_scoped_9);
            tmp4 = tmp_scoped_10;
        } else tmp4 = 0.0;
        out_ptr0[x2] = static_cast<bfloat>(tmp4);
    }
 with program_source:4495:23: error: assigning to 'bfloat' from incompatible type 'float'
        } else tmp4 = 0.0;
                      ^~~
```

Authored with Claude.

Pull Request resolved: #177207
Approved by: https://github.com/atalman, https://github.com/mergennachin, https://github.com/jansel
…177202)

The codegen'd subclass wrapper only iterated over inp_metas when
building unwrapped_args, but FunctionalizedRngRuntimeWrapper appends
rng seed/offset to args after inp_metas. The old data-driven loop
in runtime_unwrap_tensor_subclasses passed these through because it
iterated over all args. Add an extend() call to forward any trailing
args not covered by inp_metas.
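The fix reduces to a positional pass-through after the meta-driven loop. A minimal sketch, with a trivial stand-in for the subclass unwrapping:

```python
def unwrap_args(inp_metas, args):
    # unwrap only the args covered by inp_metas (identity here for brevity)
    unwrapped = [args[i] for i in range(len(inp_metas))]
    # forward trailing args (e.g. rng seed/offset) positionally
    unwrapped.extend(args[len(inp_metas):])
    return unwrapped
```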

Authored with Claude.
Pull Request resolved: #177202
Approved by: https://github.com/Lucaskabela
The underlying bug (eager vs AOTDispatcher output mismatch for
as_strided_scatter) was fixed by #150543. Remove the stale skip.

Closes #85879
Pull Request resolved: #177203
Approved by: https://github.com/aorenste, https://github.com/zou3519
@sraikund16 is no longer working on the Profiler.
Pull Request resolved: #177176
Approved by: https://github.com/ryanzhang22
## Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

## Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:
- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results
- Only occurs in `max-autotune` mode

**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.

## Solution

 - Added GPU-specific compile mode selection for small M values
 - gfx120x with M < 16: use `compile_mode="default"`
 - All other cases: use `compile_mode="max-autotune"`
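The selection logic above can be sketched as a small helper; the function name and the exact threshold are illustrative of the test-side workaround, not library code:

```python
def pick_compile_mode(arch: str, m: int) -> str:
    if arch.startswith("gfx120") and m < 16:
        return "default"        # sidestep miscompiled autotuned kernels
    return "max-autotune"
```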

Pull Request resolved: #174873
Approved by: https://github.com/jeffdaily
This reverts commit 6d12a21.

Reverted #174718 on behalf of https://github.com/eellison due to inadvertently causes fill_ on inductor allocations, not just user empty_strided ([comment](#174718 (comment)))
Fixes #167636

#### Summary
- Fix `meta_fft_r2c` missing CPU/MKL code path
- Fix `_sort_dims` no-op also add `reverse=True` to match C++ descending sort
- Fix `_exec_fft` signal dim ordering
- Add PocketFFT early return paths for `meta_fft_c2c` and `meta_fft_c2r` on CPU without MKL
- Move FFT ops from `CHECK_STRIDES_SKIPS` to `CHECK_STRIDES` in `test_meta.py`
- Use `torch.backends.mkl.is_available()` per [this](#167636 (comment))

Pull Request resolved: #175731
Approved by: https://github.com/isuruf
The list-based reduce_scatter_ op was the only reduce_scatter variant
missing from LocalTensorMode's dispatch table. This is the variant used
by `dist.reduce_scatter(output, input_list)`, which is the only API that
supports uneven split sizes across ranks.

Authored with Claude.
Pull Request resolved: #175710
Approved by: https://github.com/dzmitry-huba
…176162)

test_max_autotune.py:test_max_autotune_exhaustive() was using cuda template config heuristics, but it needed to use rocm template heuristics instead

Pull Request resolved: #176162
Approved by: https://github.com/jeffdaily
…#175710)"

This reverts commit 90035af.

Reverted #175710 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175710 (comment)))
…g functionalization (#177213)

During `_functionalized_f_helper`, we call `before.copy_(after)`. If the compiled function is a custom autograd function with an inplace op during the backward, this can trigger [inplace correctness check](https://docs.pytorch.org/docs/stable/autograd.html#in-place-correctness-checks) unintentionally. This causes an exception
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

This PR instead wraps the `before.copy_(after)` with a `torch.no_grad()` context to avoid the inplace check.
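The check being sidestepped can be demonstrated standalone: `copy_` into a leaf that requires grad raises outside `no_grad` and succeeds inside it.

```python
import torch

leaf = torch.randn(3, requires_grad=True)
src = torch.randn(3)
try:
    leaf.copy_(src)  # raises: in-place op on a leaf that requires grad
except RuntimeError:
    pass
with torch.no_grad():
    leaf.copy_(src)  # no autograd tracking, so the check is not triggered
```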

Example script:
```
import torch

class MutateBufferInBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x * buf.mean()

    @staticmethod
    def backward(ctx, grad_output):
        (buf,) = ctx.saved_tensors
        buf.mul_(2.0)
        return grad_output * buf.mean(), None

@torch.compile
def f(x, buf):
    return MutateBufferInBackward.apply(x, buf)

def main():
    device = "cuda"
    x = torch.randn(16, device=device, requires_grad=True)
    buf = torch.randn(16, device=device, requires_grad=True)

    out = f(x, buf)
    out.sum().backward()
    print("Success!")

if __name__ == "__main__":
    main()
```

Pull Request resolved: #177213
Approved by: https://github.com/aorenste, https://github.com/frgossen
…6622)

## Stack

This is part of a PR stack. Merge order:
1. **#176622 — Pallas: strided access support (this PR)**
2. #176952 — Pallas: permutation detection
3. #177212 — Pallas: scalar prefetch (depends on #176952)

## Summary

- Add strided access support via reshape + static indexing for non-contiguous tensor patterns
- Fix torch_tpu keyword-only API changes in `register_custom_kernel` and `call_custom_kernel`
- Set `jax_default_device` to CPU when running in interpret mode on TPU machines

## Test plan

- Full test suite on TPU: 869 passed, 153 skipped, 69 xfailed, 0 unexpected failures
- `python -m pytest test/inductor/test_pallas.py -v -k "strided"` — all strided access tests pass
Pull Request resolved: #176622
Approved by: https://github.com/oulgen, https://github.com/norx1991
Reland of #177047 which was auto-reverted due to lint failures.

This reverts commit 4bc9d7f ("Reapply "[fx] Move the rest of torch.fx.Node to C++ (#170946)" (#177047)").

Authored with Claude.
Pull Request resolved: #177183
Approved by: https://github.com/huydhn
Add per-graph dynamo config overrides, mirroring the existing
debug_inductor_config_override. Set via TORCH_COMPILE_OVERRIDE_DYNAMO_CONFIGS
env var or config.debug_dynamo_config_override. The override is applied as a
config.patch context manager around the entire compile_frame call, so dynamo
config is patched for all tracing and backend compilation within that frame.
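The `config.patch` mechanism the override rides on works like any scoped patch context; `suppress_errors` below is just an example knob, not the config this PR targets:

```python
from torch._dynamo import config

with config.patch(suppress_errors=True):
    inside = config.suppress_errors   # patched for the duration of the context
restored = config.suppress_errors     # previous value restored on exit
```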

Warn when TORCH_COMPILE_OVERRIDE_DYNAMO_CONFIGS is set, since dynamo
config overrides are keyed by frame ID and some configs can affect graph
breaks, shifting frame IDs.

Authored with Claude.

Pull Request resolved: #176734
Approved by: https://github.com/williamwen42
Add pattern-based skip. Let me know if you prefer regular exclude list.
Pull Request resolved: #177046
Approved by: https://github.com/malfet
…ble state (#177095)

Fixes #172088

Previously, `torch_function_mode_stack_state_mgr` only cleared the C-level
mode stack during `trace_frame` (via `preserve_global_state`). This meant
compilation infrastructure running outside tracing — guard building, global
state cleanup — would trigger real `__torch_function__` dispatch, mutating
mode state (e.g. incrementing a counter) and causing the compile-time guard
verification to fail with "Guard failed on the same frame it was created".

This change moves the mode stack save/clear/restore up to `compile_inner` so
modes are off the C stack for the entire compilation pipeline. For guard
building, modes are temporarily restored so guard expressions can reference
them, but `DisableTorchFunction` prevents dispatch during construction.
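The dispatch-suppression building block can be seen in isolation: a counting torch-function mode observes torch API calls except under `DisableTorchFunction`. A minimal sketch:

```python
import torch
from torch.overrides import TorchFunctionMode

class CountingMode(TorchFunctionMode):
    def __init__(self):
        super().__init__()
        self.count = 0

    def __torch_function__(self, func, types, args=(), kwargs=None):
        self.count += 1
        return func(*args, **(kwargs or {}))

t = torch.ones(2)
with CountingMode() as m:
    torch.add(t, 1)                      # dispatches through the mode
    with torch._C.DisableTorchFunction():
        torch.add(t, 1)                  # mode not invoked here
```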

Co-authored-by: Claude <noreply@anthropic.com>

Pull Request resolved: #177095
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
… tests (#177211)

Increase distributed test shards from 2 to 3 to reduce per-shard test time and improve CI latency.

Before today, those jobs already took more than 3 hours to finish, so a recent change might have pushed them over the limit.
Pull Request resolved: #177211
Approved by: https://github.com/malfet
## PR Summary

Fixes #176067, a `torch.compile(..., fullgraph=True)` failure for `Tensor.new_tensor(...)`.

## Repro

```python
import torch

def f1(a):
    num = a.nonzero().squeeze(-1).numel()
    return torch.tensor([num])  # works

def f2(a):
    num = a.nonzero().squeeze(-1).numel()
    return a.new_tensor([num])  # fails before this change

a = torch.tensor([1, 0])
torch.compile(f1, fullgraph=True)(a)
torch.compile(f2, fullgraph=True)(a)
```

Prior to this change, `f2` could fail with:

```text
torch._dynamo.exc.UserError: Could not extract specialized integer from
data-dependent expression u0 (unhinted: u0). (Size-like symbols: u0)
```

related issue: #176067

Pull Request resolved: #176390
Approved by: https://github.com/Lucaskabela
Fixes #89630

## Summary
NLL backward was failing under compile when `self` is 1D: `target.unsqueeze(0)` produced a 2D index for the subsequent 1D scatter. The fix mirrors the C++ kernel's behavior of only using `target[0]`.

## Test
```bash
python -m pytest test/inductor/test_torchinductor.py -xvs -k "test_nll_loss_backward_1d_input"
```

Pull Request resolved: #177189
Approved by: https://github.com/frgossen
Use `torch.cpu.get_capabilities()` rather than parse `/proc/cpuinfo` and restrict the check to just Linux systems

Followup after #171095
Pull Request resolved: #177177
Approved by: https://github.com/Skylion007
Two main changes: push the reviewer to use sub-agents, and add claude-suggested and manually reviewed patterns that we want to see, trying to strike a thorough-but-not-too-verbose balance.
Pull Request resolved: #177288
Approved by: https://github.com/drisspg
@bobrenjc93 bobrenjc93 merged commit 078b581 into bobrenjc93:rest-api-probe-branch Mar 12, 2026
bobrenjc93 pushed a commit that referenced this pull request Mar 16, 2026
…nces between x86 vs aarch64 (pytorch#176085)

In the test:

```
python  test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```
it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps due to these issues:
  - pytorch#119905
  - pytorch#134387

In the current setup when getting a stack trace string:
- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`

An example of the full string from an aarch64 system:
```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe pytorch#5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe 
pytorch#6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe pytorch#7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#11: /usr/bin/python() [0x504a34]\nframe pytorch#12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe pytorch#13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe pytorch#14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe pytorch#15: /usr/bin/python() [0x52a070]\nframe pytorch#16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe pytorch#17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe pytorch#18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe pytorch#19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe pytorch#20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe pytorch#21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe pytorch#22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe pytorch#23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: pytorch#176085
Approved by: https://github.com/eqy