[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08 by pragupta · Pull Request #4 · pragupta/pytorch

pragupta · 2025-09-08T17:44:09Z

Merged latest changes from upstream/main into rocm7.1_internal_testing on 2025-09-08

Adds fallback commands for the following: * python setup.py install * python setup.py develop Ideally these should just work and should provide backwards compat. Thought process here is that multiple people rely on these commands and just because setuptools wants to drop support for this I don't think a lot of our downstream users who build from source are expecting these to be gone. This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: pytorch#162009 Approved by: https://github.com/clee2000, https://github.com/atalman

…accelerator (pytorch#161845) As the tile stated. As the document grows, the content will become more and more, so in order to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator". Pull Request resolved: pytorch#161845 Approved by: https://github.com/albanD

…h#161903) As the title stated. Pull Request resolved: pytorch#161903 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161845

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#161928 Approved by: https://github.com/pytorchbot

… the hardware limit. (pytorch#161996) Summary: This is a re-land of [PR161040](pytorch#161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs. This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` pytest test/inductor/test_max_autotune.py pytest test/inductor/test_triton_heuristics.py ``` Pull Request resolved: pytorch#161996 Approved by: https://github.com/coconutruben

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#161929 Approved by: https://github.com/pytorchbot

…ble (pytorch#161950) This PR is a followup to pytorch#149764. In that PR, it only forbids illegal view due to `Flatten`; this PR also forbids illegal view caused by `Split`. This PR also updates the error message to be less about internal implementation details, which users may find confusing. Pull Request resolved: pytorch#161950 Approved by: https://github.com/ezyang

Fixes pytorch#161483 When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` can pass because other xpu cases would call SDPA so that the priority order is set by https://github.com/pytorch/pytorch/blob/eec876deb659fe667aac2d97a48d7451c3e88dee/aten/src/ATen/native/mkldnn/xpu/Attention.cpp#L98-L112 However, when the case `test_default_priority_order` is run separately, the priority order is unset so that this case would fail. This PR fix this case. Pull Request resolved: pytorch#161690 Approved by: https://github.com/guangyey, https://github.com/drisspg

…61998) This function has come up in DTensor perf work, and I had a nitpick on pytorch#160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless. Pull Request resolved: pytorch#161998 Approved by: https://github.com/ezyang

Enable cat op for sparse on MPS Pull Request resolved: pytorch#162007 Approved by: https://github.com/malfet

…torch#161947) As the title stated. Pull Request resolved: pytorch#161947 Approved by: https://github.com/albanD ghstack dependencies: pytorch#161845, pytorch#161903

Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: pytorch#161466 Approved by: https://github.com/ezyang

If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. Observed time spent in SymInt functions in computeStorageNBytes to drop (and not cost shift elsewhere in the function) after this change, profiling detach() using code similar to the benchmark from pytorch#160580 and Linux perf. Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: pytorch#161586 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466

Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci

…rch#159889) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: pytorch#159889 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#160449

Summary: we have ``` std::vector<size_t> constants_internal_offset( num_constants - num_folded_constants); ``` but the for loop does not consider it ``` for (size_t i = 0; i < num_constants; i++) { ... constants_internal_offset[i] ... ``` even in the for loop, it does ``` bool from_folded = this->constant_from_folded(i); if (from_folded) { continue; } ``` but `i` could still be wrong Rollback Plan: Differential Revision: D81425007 Pull Request resolved: pytorch#161887 Approved by: https://github.com/angelayi

Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse. Reviewed By: davidberard98, FindHao Differential Revision: D80994791 Pull Request resolved: pytorch#161953 Approved by: https://github.com/FindHao

Followup after pytorch#154012 Fixes CPU part of pytorch#160841 Pull Request resolved: pytorch#161999 Approved by: https://github.com/drisspg

Followup after pytorch#154012 Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase Pull Request resolved: pytorch#162001 Approved by: https://github.com/drisspg ghstack dependencies: pytorch#161999

…0943)" This reverts commit bbedc71. Reverted pytorch#160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](pytorch#160943 (comment)))

…h#161862)" This reverts commit 204697f. Reverted pytorch#161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](pytorch#161862 (comment)))

… nodes (pytorch#161339)" This reverts commit 90f50f7. Reverted pytorch#161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](pytorch#161339 (comment)))

…1261) In this pr, we port test/distributed/parallel 4 test files and test/distributed/debug 1 test file for Intel GPU We could enable Intel GPU with following methods and try the best to keep the original code styles: 1. Use torch.accelerator for general gpu 2. Skip the case if running on xpu which has known issues Pull Request resolved: pytorch#161261 Approved by: https://github.com/guangyey, https://github.com/d4l3k

Summary: Inductor has the following configurations: config.comprehensive_padding config.padding_alignment_bytes config.padding_stride_threshold In the case of static shape by enabling these three options Inductor will generate code for Flexible layout tensors that tries to pad up all stride dimension to be a multiple of config.padding_alignment_bytes for strides above: config.padding_stride_threshold. In the case where dynamic shapes is enabled no padding is done today. This PR introduces the following configuration which allows the user to specify they wish to generated a padded stride even in the case of dynamic shape operations. This is mainly done so we don't break the previous behaviour of not padding up dynamic shape use cases. The config.padding_stride_threshold does not apply since the values of the strides are dynamic. config.pad_dynamic_shapes In addition to this a new mode "python_slow" has been added to launch grid calculation which achieves the same ceildiv behaviour that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic. Test Plan: CI Rollback Plan: Differential Revision: D80468808 Pull Request resolved: pytorch#160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel

…2058) fix split_aot_inductor_output_path on Windows. Pull Request resolved: pytorch#162058 Approved by: https://github.com/angelayi

Pull Request resolved: pytorch#161799 Approved by: https://github.com/anijain2305

Pull Request resolved: pytorch#161800 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#161799

## Summary Adds a subgraph decomposition for addmm and mm that performs well on large `K` compared to `M` and `N`, and functions well as an alternative to `split-k` on AMD (transposed only), which does not support AMD currently. ## Background On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower. For example: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176])) )) ``` is a lot slower than: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1])) )) ``` This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels. ## Data I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups: ``` Parsed 420 unique shapes from benchmark output addmm improvements when best: addmm_16448x512x2048: +0.14% addmm_128x2048x2048: +0.01% addmm_128x768x1000: +0.75% addmm_12672x3072x768: +1.08% addmm_512x768x32000: +0.62% addmm_12608x384x384: +0.00% addmm_4160x1024x4096: +0.90% addmm_16x768x2: +0.56% addmm_12608x3072x768: +0.09% addmm_64x4096x1000: +2.77% addmm_256x1024x512: +1.99% addmm_30x256x256: +1.12% addmm_100480x128x384: +0.91% addmm_6400x2048x512: +0.25% addmm_61568x1024x256: +0.08% addmm_1x768x768: +0.93% addmm_12544x384x384: +0.19% addmm_128x512x1000: +0.77% addmm_2048x128x128: +1.32% addmm_128x3072x1000: +0.24% addmm_7936x512x2048: +0.07% addmm_8192x512x2048: +0.33% addmm_64x1024x1000: +1.43% addmm_128x2304x1000: +0.01% addmm_32768x256x2: +0.75% addmm_64x384x1152: +0.79% addmm_64x640x1000: +0.01% addmm_100480x128x128: +0.87% addmm_1152x3072x768: +1.13% addmm_8192x256x2048: +1.40% addmm_4096x128x768: +0.01% addmm_128x2560x1000: +0.01% addmm_12544x2048x512: +0.43% addmm_200704x24x96: +0.14% addmm_8448x512x2048: +0.96% addmm_50176x256x1024: +0.62% addmm_4160x4096x1024: +0.22% addmm_4096x768x768: +0.32% addmm_220x2048x512: +0.56% addmm_8x2048x1000: +1.12% addmm_256x197951x512: +26.99% addmm_401536x64x192: +0.60% addmm_2040x2048x512: +0.47% addmm_512x1024x256: +1.32% addmm_128x4096x1000: +1.67% addmm_12672x768x768: +0.34% addmm_128x368x1000: +0.77% addmm_96x1280x1000: +0.01% addmm_12544x512x2048: +0.41% addmm_6272x320x1280: +0.76% addmm_12544x3072x768: +0.09% addmm_64x384x1000: +0.39% mm improvements when best: mm_200704x128x512: +1.29% mm_663552x16x16: +0.80% mm_4096x768x768: +0.51% mm_131072x64x31: +0.24% mm_12544x1152x384: +0.11% mm_128x2048x2: +0.46% mm_262144x16x23: +0.62% mm_50176x576x192: +0.37% mm_131072x16x31: +0.26% ================================================================================ BENCHMARK ANALYSIS RESULTS ================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 247 Average Subgraph placement: 3.38 Median Subgraph placement: 2.0 Subgraph is best choice: 52/247 shapes (21.1%) Average improvement when best: 1.15% Median improvement when best: 0.58% Largest improvement when best: +26.99% Operation: bmm ---------------------------------------- Total shapes analyzed: 85 Average Subgraph placement: 24.00 Median Subgraph placement: 21.0 Subgraph is best choice: 0/85 shapes (0.0%) Average improvement when best: N/A (never best) Median improvement when best: N/A (never best) Largest improvement when best: N/A (never best) Operation: mm ---------------------------------------- Total shapes analyzed: 88 Average Subgraph placement: 15.08 Median Subgraph placement: 4.0 Subgraph is best choice: 9/88 shapes (10.2%) Average improvement when best: 0.52% Median improvement when best: 0.46% Largest improvement when best: +1.29% ``` ## Results The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune: ``` addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436 addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702 addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834 addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105 ... ``` Compared to the non-transposed autotune: ``` addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421 addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246 addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547 addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895 addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916 ``` It seems to perform really well for high values of `K` vs `N` and `M`. Testing this hypothesis with some custom shapes: ``` Parsed 64 unique shapes from benchmark output addmm improvements when best: addmm_128x16384x128: +0.18% addmm_128x262144x256: +38.24% addmm_128x200000x512: +14.76% addmm_256x800000x128: +0.06% addmm_131072x128x256: +0.27% addmm_128x256x131072: +0.25% addmm_2048x200000x64: +12.45% mm improvements when best: mm_128x16384x128: +0.18% mm_128x262144x256: +38.05% mm_128x200000x512: +9.47% mm_256x800000x128: +0.99% mm_512x6400000x256: +3.17% mm_524288x64x64: +0.29% mm_2048x200000x64: +11.19% mm_8192x1000000x256: +34.14% mm_128x4096x100000: +0.40% mm_128x3072x150000: +0.27% ================================================================================ BENCHMARK ANALYSIS RESULTS ================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 33 Average Subgraph placement: 4.39 Median Subgraph placement: 2.0 Subgraph is best choice: 7/33 shapes (21.2%) Average improvement when best: 9.46% Median improvement when best: 0.27% Largest improvement when best: +38.24% Operation: mm ---------------------------------------- Total shapes analyzed: 30 Average Subgraph placement: 7.63 Median Subgraph placement: 2.0 Subgraph is best choice: 10/30 shapes (33.3%) Average improvement when best: 9.81% Median improvement when best: 2.08% Largest improvement when best: +38.05% ``` ## Conclusion Contiguous Subgraph Decompositionseems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvment on low `M`, low `N`, and high `K` shapes. Data gathering scripts: https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866 ## Test Plan: New unit tests. Differential Revision: D80771648 Pull Request resolved: pytorch#161241 Approved by: https://github.com/eellison

Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes pytorch#161515 Pull Request resolved: pytorch#161920 Approved by: https://github.com/malfet

…ytorch#160510) Pull Request resolved: pytorch#160510 Approved by: https://github.com/ezyang

This reverts commit b9ba612. Reverted pytorch#161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert pytorch#159473, feel free to merge it back once conflicts are cleared ([comment](pytorch#161715 (comment)))

@d4l3k

…GPU (pytorch#159473)" This reverts commit 040d00a. Reverted pytorch#159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](pytorch#159473 (comment)))

…60483) Pull Request resolved: pytorch#160483 Approved by: https://github.com/zou3519

) The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module. Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly. Pull Request resolved: pytorch#162169 Approved by: https://github.com/zhxchen17

This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now. In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx. Pull Request resolved: pytorch#162170 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162169

This PR implements the Autograd feature of the associative_scan. Pull Request resolved: pytorch#139939 Approved by: https://github.com/ydwu4

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#162315 Approved by: https://github.com/pytorchbot

…lt (pytorch#159889)" This reverts commit 01edcd4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))

This reverts commit de893e9. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))

Follow-up to pytorch#161768. Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished. In pytorch#161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()`, which is not pickleable. It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case - the MultiKernel case uses the hash lock. To do this, we add `restore_after_unpickle()` which will restore fields (or if the old fields are not provided, initialize just the hash_lock) Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR) Pull Request resolved: pytorch#162244 Approved by: https://github.com/atalman

…ytorch#162309) Fixes static cuda launcher after triton-lang/triton#7866. Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains. Pull Request resolved: pytorch#162309 Approved by: https://github.com/aakhundov ghstack dependencies: pytorch#162244

@ngimel

The original implementation set beta to be 1, which cause the out (C) being added to the the output. Thus if the output is not initialized as zero beforehand, the output can be incorrect. Removing the alpha and beta fixes the issue. Thanks @ngimel to figure out the root cause. Pull Request resolved: pytorch#162040 Approved by: https://github.com/danielvegamyhre

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#162372 Approved by: https://github.com/pytorchbot

This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#161395 Approved by: https://github.com/pytorchbot

Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton) Notably: * this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR) * sam_fast is still failing, so we've disabled it temporarily pytorch#162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch. Pull Request resolved: pytorch#162278 Approved by: https://github.com/atalman ghstack dependencies: pytorch#162244, pytorch#162309

… mode (pytorch#161405) During comms reordering , sink wait iterative observed previous runtime estimations pretty off for collectives and mms. Adding optional usage of: - c10d.time_estimator for collectives, which is based on NCCL estimator Benchmark mode only for matmuls, as they are highly dependent on mm backend - The logic mostly copied from Ruisi's PRs for inductor simple_fsdp pytorch#157572 This estimations corrections are in default `BaseSchedulerNode.estimate_runtime()` Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294) Pull Request resolved: pytorch#161405 Approved by: https://github.com/eellison

)" This reverts commit c9ac8c2. Reverted pytorch#162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](pytorch#162315 (comment)))

This reverts commit 3770337. Reverted pytorch#161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](pytorch#161649 (comment)))

…#162041) The async tp result and regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error is from numerical error of low precision. Pull Request resolved: pytorch#162041 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel ghstack dependencies: pytorch#162040

…lready know (pytorch#161591) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: pytorch#161591 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590

…ytorch#161595) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: pytorch#161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591

…rch#161058) This PR is the first split PR of pytorch#156272, only contains the OneDNN code. Please help review. Pending on OneDNN v3.9 commit update. Don't merge. Pull Request resolved: pytorch#161058 Approved by: https://github.com/guangyey, https://github.com/EikanWang

LOAF previously may skip these fusion opportunities and cause some tests fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: pytorch#162311 Approved by: https://github.com/jansel

…sting_IFU_2025-09-08 # Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/requirements-ci.txt # aten/src/ATen/Context.cpp # aten/src/ATen/cuda/tunable/GemmHipblaslt.h # aten/src/ATen/native/ConvUtils.h # aten/src/ATen/native/Convolution.cpp # aten/src/ATen/native/Normalization.cpp # aten/src/ATen/native/cuda/Blas.cpp # aten/src/ATen/native/miopen/Conv_miopen.cpp # requirements.txt # test/distributed/_tools/test_fsdp2_mem_tracker.py # test/distributed/tensor/parallel/test_tp_examples.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_structured_trace.py # test/inductor/test_aot_inductor.py # test/inductor/test_combo_kernels.py # test/test_matmul_cuda.py # test/test_sparse.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/triton.py # torch/testing/_internal/common_cuda.py

…rch#165479) These happen when building with CMAKE_BUILD_TYPE=RelWithAssert This should fix two types of failures that started with pytorch#163665 Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now) The first one type is Truncated: ``` default_pg, _ = _new_process_group_helper( File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper backend_class = creator_fn(dist_backend_opts, backend_options) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg return FakeProcessGroup._create_internal( RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero. Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 #7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0 #8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0 #9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0 #10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)ROCm#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)ROCm#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 ``` and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR The other one type is ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import self.assertEqual(out, "") File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != '' - /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode. - if is_available() and not torch._C._c10d_init(): To execute this test, run the following from the base repo dir: python test/test_testing.py TestImports.test_no_warning_on_import ``` which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one? Pull Request resolved: pytorch#165479 Approved by: https://github.com/ezyang

Previously g3 = NVIDIA Tesla M60 Now g6 = NVIDIA L4 Also change cuda arch list accordingly Pros: More memory, newer GPU Cons: That was one of the few remaining tests on g3 runners, so we probably lost coverage? We can probably run more tests in parallel now but I'm not going to do that here Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with ``` Traceback (most recent call last): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper method(*args, **kwargs) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper with policy(): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__ self.beforeStreams[-1].synchronize() File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize super().synchronize() torch.AcceleratorError: CUDA error: device-side assert triggered Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0 #7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0 #8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489 #9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 #10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 #11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 #12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 ``` when run with cuda launch blocking I got a ton of stuff like ``` /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed. ``` Pull Request resolved: pytorch#165158 Approved by: https://github.com/seemethere

If another static object (like `g_device_config_parse_hook_registry_instance` created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (operator=) can fail, which leads to a runtime error. When I use a compilation optimization of ` -O1` I see this issue: ``` [src/libcxx/include/__functional/function.h:496]:14: runtime error: member access within null pointer of type 'const __policy' #0 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:496]:14 #1 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:483]:19 #2 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:727]:8 #3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) [torch/c10/core/AllocatorConfig.h:263]:32 #4 0x563224e28e9d in DeviceConfigParserHookRegistry [torch/c10/core/AllocatorConfig.h:369]:5 #5 0x563224e28e9d in __cxx_global_var_init.34 [torch/c10/cuda/CUDAAllocatorConfig.cpp:195]:1 #6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp #7 0x5632459709ac in __libc_csu_init /[usr/grte/v5/debug-src/src/csu/elf-init.c:88]:7 #8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d) #9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120 ``` Pull Request resolved: pytorch#172581 Approved by: https://github.com/guangyey, https://github.com/albanD

…nces between x86 vs aarch64 (pytorch#176085) In the test: ``` python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));` which got the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace is different between `x86` vs `aarch64` due perhaps in these issues: - pytorch#119905 - pytorch#134387 In the current setup when getting a stack trace string: - x86 contains `C++ CapturedTraceback:` - aarch64 contains `Exception raised from` + `frame #` An example of the full string from an aarch64 system when : ``` AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe ROCm#25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n' To execute this test, run the following from the base repo dir: python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` Pull Request resolved: pytorch#176085 Approved by: https://github.com/eqy

seemethere and others added 30 commits September 3, 2025 02:56

Using pip3 install instead of python setup.py develop/install (pytorc…

dac8a4b

…h#161903) As the title stated. Pull Request resolved: pytorch#161903 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161845

[MPS] enable cat op for sparse (pytorch#162007)

2c03f0a

Enable cat op for sparse on MPS Pull Request resolved: pytorch#162007 Approved by: https://github.com/malfet

Using get_paths() to get correct installation path for PYTHONPATY (py…

827f0d4

…torch#161947) As the title stated. Pull Request resolved: pytorch#161947 Approved by: https://github.com/albanD ghstack dependencies: pytorch#161845, pytorch#161903

Outline SymInt::maybe_as_int_slow_path (pytorch#161466)

fa1514a

Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: pytorch#161466 Approved by: https://github.com/ezyang

Always build USE_DISTRIBUTED. (pytorch#160449)

90b0864

Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci

[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (pytorch#161999)

02c83f1

Followup after pytorch#154012 Fixes CPU part of pytorch#160841 Pull Request resolved: pytorch#161999 Approved by: https://github.com/drisspg

Revert "[CUDAGraph] add config to error on skipping cudagraph (pytorc…

f27985b

…h#161862)" This reverts commit 204697f. Reverted pytorch#161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](pytorch#161862 (comment)))

[inductor] fix split_aot_inductor_output_path on Windows. (pytorch#16…

451ed93

…2058) fix split_aot_inductor_output_path on Windows. Pull Request resolved: pytorch#162058 Approved by: https://github.com/angelayi

Add CPython test test_range (pytorch#161799)

889f01e

Pull Request resolved: pytorch#161799 Approved by: https://github.com/anijain2305

Add range_iterator (pytorch#161800)

eb18d32

Pull Request resolved: pytorch#161800 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#161799

S390x: build nightly binaries for new pythons (pytorch#161920)

71992dd

Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes pytorch#161515 Pull Request resolved: pytorch#161920 Approved by: https://github.com/malfet

stop suggesting using guard_size_oblivious on data dependent errors (p…

3559c35

…ytorch#160510) Pull Request resolved: pytorch#160510 Approved by: https://github.com/ezyang

pytorchmergebot and others added 24 commits September 7, 2025 21:03

[while_loop][autograd] support autograd_key of while_loop (pytorch#1…

ec2e368

…60483) Pull Request resolved: pytorch#160483 Approved by: https://github.com/zou3519

[associative_scan] Autograd separated (pytorch#139939)

103f725

This PR implements the Autograd feature of the associative_scan. Pull Request resolved: pytorch#139939 Approved by: https://github.com/ydwu4

Update slow tests (pytorch#161395)

e101411

This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#161395 Approved by: https://github.com/pytorchbot

Revert "Use vectorized stores for all dtypes in cat (pytorch#161649)"

a92773e

This reverts commit 3770337. Reverted pytorch#161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](pytorch#161649 (comment)))

pragupta closed this Sep 9, 2025

pragupta deleted the rocm7.1_internal_testing_IFU_2025-09-08 branch September 9, 2025 19:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08#4

[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08#4
pragupta wants to merge 1421 commits intorocm7.1_internal_testingfrom
rocm7.1_internal_testing_IFU_2025-09-08

pragupta commented Sep 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

pragupta commented Sep 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants