Currently, conv1d converts the 3D view to 4D before calling onednn::convolution(). However, that function converts the 4D tensor to a channels-last memory format for computation, so the result comes back in the wrong layout (the correct result should be channels-first). This PR fixes the issue, ensuring that the output format is consistent with the expected format.

Pull Request resolved: pytorch#162944
Approved by: https://github.com/EikanWang
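A pure-Python sketch (no torch, all names hypothetical) of why mixing up the two layouts corrupts results: the same logical index maps to different memory offsets under channels-first versus channels-last strides, so data written in one layout and read with the other's strides is scrambled.

```python
# Illustrative only: strides (in elements) for a contiguous NCHW tensor
# versus the same logical (n, c, h, w) indexing over NHWC storage.

def strides_channels_first(n, c, h, w):
    # NCHW contiguous layout
    return (c * h * w, h * w, w, 1)

def strides_channels_last(n, c, h, w):
    # NHWC in memory, still indexed as (n, c, h, w)
    return (h * w * c, 1, w * c, c)

def offset(strides, idx):
    # linear memory offset of a logical index
    return sum(s * i for s, i in zip(strides, idx))

shape = (1, 2, 2, 2)
idx = (0, 1, 0, 1)  # one logical element
cf = offset(strides_channels_first(*shape), idx)
cl = offset(strides_channels_last(*shape), idx)
print(cf, cl)  # 5 3 -> different offsets, so mixed layouts read wrong data
```

The fix ensures the output's strides match the layout the caller expects, so offsets like these agree again.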
Summary: Adds documentation for EventList, FunctionEvent and FunctionEventAvg. Closes pytorch#165907

Test Plan: N/A (documentation only)

Differential Revision: D86913697
Pull Request resolved: pytorch#167688
Approved by: https://github.com/sanrise
## MOTIVATION
To generalize distributed test cases for non-CUDA devices.

## CHANGES
- Replaced hard-coded device/backends with `torch.accelerator.current_accelerator()` and `dist.get_default_backend_for_device`
- Use `DistributedTestBase` instead of `MultiProcessTestCase` to use common utilities
- Remove `instantiate_device_tests` and make use of `torch.accelerator.current_accelerator` for test/distributed/test_c10d_object_collectives.py
- Fix the deterministic-context issue for non-CUDA devices in test/distributed/optim/test_zero_redundancy_optimizer.py
- Use `torch.accelerator.device_count()` for the multi-GPU check in torch/testing/_internal/distributed/_tensor/common_dtensor.py

Pull Request resolved: pytorch#165067
Approved by: https://github.com/guangyey, https://github.com/albanD
…o "original_aten" node meta (pytorch#167749)

Fixes pytorch#167706
- Add `torch.fx.experimental.proxy_tensor.set_original_aten_op()` around the flex_attention HOP dispatch so `original_aten` is populated for flex_attention
- Update the usages of `original_aten` to also expect a HOP in addition to OpOverload

Pull Request resolved: pytorch#167749
Approved by: https://github.com/drisspg
Summary: Autovectorization of casting to bfloat16_t is broken in clang-[17, 20] and fixed in clang-21. We are adding workaround vectorized code, which improves conversion speed from smaller int data types. We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

before:
uint8->bfloat16_t ===> 319.433us
int8->bfloat16_t ===> 320.216us
int16->bfloat16_t ===> 326.899us
int32->bfloat16_t ===> 327.925us

after:
uint8->bfloat16_t ===> 185.189us -----> 72% higher throughput
int8->bfloat16_t ===> 169.790us -----> 89% higher throughput
int16->bfloat16_t ===> 180.744us -----> 81% higher throughput
int32->bfloat16_t ===> 185.129us -----> 77% higher throughput

Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189
Pull Request resolved: pytorch#166958
Approved by: https://github.com/mcfi
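For readers unfamiliar with the cast being vectorized here, a minimal pure-Python sketch of the scalar float32 → bfloat16 conversion (truncate to the top 16 bits with round-to-nearest-even). This is an illustration of the arithmetic, not the kernel's actual SVE2 code:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """bfloat16 bit pattern of x, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # rounding bias: 0x7FFF plus the LSB of the kept half (ties-to-even)
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    # widen back by zero-filling the dropped 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# small ints survive the cast exactly: bfloat16 has 8 significand bits,
# which is why int8/uint8 -> bfloat16 is lossless
for v in [0, 1, 127, -128, 255]:
    assert bf16_bits_to_f32(f32_to_bf16_bits(float(v))) == float(v)
```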
Fixes pytorch#150477

### Summary:
- Added frame information (function name, file, line number) to all graph break/skip messages
- Standardized message format: "torch.compile will skip tracing the frame <name> (<file> line <N>) and fall back to eager. Reason: <reason>"

### Impacts: module: dynamo

Pull Request resolved: pytorch#167067
Approved by: https://github.com/williamwen42
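A hypothetical sketch of the standardized message format described above; the helper name and its fields are illustrative, not Dynamo's actual internal API:

```python
# Illustrative formatter for the standardized skip message; the function
# name `format_skip_message` is an assumption, not Dynamo's real helper.

def format_skip_message(fn_name: str, filename: str, lineno: int, reason: str) -> str:
    return (
        f"torch.compile will skip tracing the frame {fn_name} "
        f"({filename} line {lineno}) and fall back to eager. Reason: {reason}"
    )

msg = format_skip_message("forward", "model.py", 42, "unsupported builtin")
print(msg)
```

Carrying the frame's own name/file/line in every message is what makes these skips actionable when many frames fall back to eager in a large model.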
… and add focused documentation (pytorch#165897)

## Summary
This PR enriches the OpenReg device management code and adds focused documentation.

## Key Changes
- Introduced device management documentation in `device.md`.
- Updated `OpenRegFunctions.h` and `OpenRegFunctions.cpp` to use `DeviceIndex` and added error handling.
- Implemented a `check_device_index` function for validating device indices.
- Enhanced Python bindings in `Module.cpp` for device management.
- Added tests for invalid device index handling in `test_device.py`.

Pull Request resolved: pytorch#165897
Approved by: https://github.com/fffrog
…ytorch#166573)

We need to track all symbols; we used to skip `u = item()` and fail with:
```
File "/home/lsakka/pytorch10/pytorch/torch/fx/passes/_tensorify_python_scalars.py", line 149, in _sympy_interp
    expr_to_sym_proxy[expr]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: KeyError: u0
```
Pull Request resolved: pytorch#166573
Approved by: https://github.com/bobrenjc93
To support the use case in pytorch/helion#1122, i.e.
```
@helion.kernel
def foo(x: Tensor, group_name: str):
    x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name)
    for t in x_remotes:
        ...
```
Helion uses fake tensors to trace a program, thus we cannot use the following code in a Helion function:
```
hdl = rendezvous(tensor)
remote_tensors = tuple(
    hdl.get_remote_tensor(peer, ...) for peer in range(world_size)
)
```
The reason is that when `tensor` is fake, the returned `hdl` is None, thus any subsequent call on it will fail.

This PR wraps the above functionality as an op:
```
lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]")
```
so that things like `hdl` are not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous.

Pull Request resolved: pytorch#167779
Approved by: https://github.com/yf225
Differential Revision: D86685546 Pull Request resolved: pytorch#167481 Approved by: https://github.com/eellison
Pull Request resolved: pytorch#167198 Approved by: https://github.com/bobrenjc93
This reverts commit c78e646. Reverted pytorch#167481 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167481 (comment)))
…h#165978)

This PR implements `scaled_mm` for XPU. It enables the following data types:
1. TensorWise scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise scaling to the next PR, so that each PR is easier to review. This first PR only adds `scaled_mm_xpu` and does not register it; we separate this out to reduce review effort. Secondly, there is a `scaled_mm_v2` API in pytorch#164141; we will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack:
- -> pytorch#165978: implementation of XPU scaled_mm and oneDNN kernel
- pytorch#167518: implementation of XPU scaled_mm_v2
- pytorch#166056: op registration

## Test Status:
1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in pytorch#166056.

## Credit:
This work is based on @yuchengliu1's work at pytorch#140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

## FP8 task tracker:
We will track all the scaled_mm related tasks in pytorch#167170.

Pull Request resolved: pytorch#165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang
Co-authored-by: Eikan Wang <eikan.wang@intel.com>
)" This reverts commit 50bf1f0. Reverted pytorch#167198 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167198 (comment)))
…ytorch#164729)

Fixes pytorch#163374. Here is the output from the reproducible code:
```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
aten::clamp_(dt: f32[][R], None, 2)
  redistribute_input(0, [P] -> [R])
  redistribute_input(t: f32[], [P] -> [R])
  _c10d_functional::all_reduce(t: f32[], sum, 0)
  _c10d_functional::wait_tensor(t: f32[])
  aten::clamp_(t: f32[], None, 2)
  aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```
The behavior now matches what was expected in issue pytorch#163374.

Expected behavior (from the issue):
1. Placement should change from Partial(sum) to Replicate()
2. Value should be tensor(2.) instead of tensor(144.)

Actual output from this build:
1. (Replicate(),) - placement is correct
2. tensor(2., device='cuda:0') - value is correct

So the in-place operation now properly redistributes the partial DTensor to replicate before performing the clamp and maintains the correct aliasing semantics. It also produces the expected clamped value.

Pull Request resolved: pytorch#164729
Approved by: https://github.com/ezyang
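A toy sketch (no torch, made-up shard values) of why the clamp must happen after the Partial → Replicate redistribution: reducing after a non-linear op like clamp gives the wrong answer, which is exactly the tensor(144.) vs tensor(2.) discrepancy in the issue.

```python
# Partial(sum) means each rank holds a partial addend of the true value;
# Partial -> Replicate is an all-reduce (here: a plain sum across ranks).

def all_reduce(shards):
    return sum(shards)

shards = [100.0, 44.0]          # per-rank partials; the true tensor is 144.0
clamp = lambda v: min(v, 2.0)   # models aten::clamp_(t, None, 2)

correct = clamp(all_reduce(shards))             # redistribute first, then clamp
wrong = all_reduce([clamp(s) for s in shards])  # clamp each partial shard
print(correct, wrong)  # 2.0 4.0
```

Because clamp does not commute with summation, applying it to partial shards and then reducing produces a value unrelated to clamping the true tensor; the fix forces the redistribute to run first.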
This PR adds an sm_121a flag for row-wise scaled matmuls on DGX Spark.

Pull Request resolved: pytorch#167734
Approved by: https://github.com/eqy, https://github.com/cyyever
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#167338 Approved by: https://github.com/jamesjwu
This PR adds a basic spin configuration to allow for linting. It is designed as a drop-in replacement for the current Makefile based solution, i.e. it sets up and updates lintrunner based on the hashes of certain configuration files. Lintrunner is called via Uv's `uvx` command, separating its environment from the general development environment in an effort to reduce instances of competing requirements breaking environments. Pull Request resolved: pytorch#167226 Approved by: https://github.com/atalman, https://github.com/albanD
…sLtWorkspace" (pytorch#167722)

Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details).

Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261
Pull Request resolved: pytorch#167722
Approved by: https://github.com/malfet
Fixes pytorch#161871. Behaviour on arm:
```
PyTorch version: 2.10.0a0+gitdef3b05
Architecture: arm64
Platform: Darwin
Processor: arm
Testing mvlgamma_ with integer tensor on arm64...
Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```
and on x86:
```
PyTorch version: 2.10.0a0+git1310d6a
Architecture: x86_64
Platform: Linux
Processor: x86_64
Testing mvlgamma_ with integer tensor on x86_64...
Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```
Pull Request resolved: pytorch#164230
Approved by: https://github.com/albanD
This PR enables ROCm/HIP support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. The implementation has been tested on AMD MI300 and MI200 hardware.

**Changes**

**Python (torch/_inductor/runtime/)**
- static_cuda_launcher.py: Added ROCm detection, .hsaco binary support, and ROCm-specific scratch parameter handling
- triton_heuristics.py: Updated device type checks to support both cuda and hip

**C++ (torch/csrc/)**
- Module.cpp: Enabled StaticCudaLauncher for ROCm builds
- inductor/static_cuda_launcher.cpp: Added HIP API equivalents for all CUDA driver calls
- inductor/static_cuda_launcher.h: Updated header guard

**Tests (test/inductor/)**
- test_static_cuda_launcher.py: Removed @skipIfRocm decorators and updated binary file handling

**Enabled Unit Tests**

All tests in test/inductor/test_static_cuda_launcher.py now pass on ROCm:
1. test_basic
2. test_unsigned_integers
3. test_signed_integers
4. test_basic_1arg
5. test_constexpr
6. test_implied_constant
7. test_kernel_no_args
8. test_high_shared_mem
9. test_too_high_shared_mem
10. test_kernel_empty_tensor
11. test_kernel_many_args
12. test_basic_compile
13. test_incompatible_code
14. test_static_launch_user_defined_triton_kernels
15. test_empty_tensor
16. test_any
17. test_disable_static_cuda_launcher

In addition, the following tests from test/inductor/test_codecache.py also pass:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
2. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
3. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
4. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
5. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
6. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True

The following tests are skipped since Triton bundling is necessary for StaticCudaLauncher:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
2. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True

Pull Request resolved: pytorch#166492
Approved by: https://github.com/jeffdaily
…rch#167471) Try to prevent two big tests from overlapping in their memory usage Pull Request resolved: pytorch#167471 Approved by: https://github.com/soulitzer
…ytorch#167731) Summary: as title. Test Plan: pytest test/export/test_export.py -k test_invalid_pytree_dynamo_graph_capture Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#167731 Approved by: https://github.com/tugsbayasgalan
Fix for pytorch#166653.

Two fixes:
- We were inducing a split for broadcasted loads, e.g. (x // 16). While a split of 16 here would make the load coalesced in one of the tile vars, the load is already in cache, so it's not worth splitting, and the split would make the other tile var load from memory that isn't in cache.
- Add a small term penalizing uncoalesced memory. This prevents tiling for loads which are a small % of the overall kernel.

Pull Request resolved: pytorch#167771
Approved by: https://github.com/v0i0
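A back-of-envelope sketch (cache-line size and access patterns assumed, not taken from the PR) of why a broadcasted load like `x[i // 16]` is already cache-friendly: it touches very few distinct cache lines compared with a strided load, so splitting it for coalescing buys little.

```python
# Count distinct cache lines touched by an access pattern.
LINE = 16  # elements per cache line (assumed for illustration)

def lines_touched(addresses):
    return len({a // LINE for a in addresses})

n = 256
broadcast = [i // 16 for i in range(n)]  # x[i // 16]: many threads share addresses
strided = [i * 16 for i in range(n)]     # one element per cache line: uncoalesced
print(lines_touched(broadcast), lines_touched(strided))  # 1 vs 256
```

The heuristic change mirrors this: a pattern that already hits a handful of lines gains nothing from a coalescing split, while the split can evict the cache-friendliness of the other tile variable.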
…ppable. (pytorch#167674)

Summary: Prior to this PR we would always build global and torch function guards in all cases. In this PR we made 2 changes to dynamo guards:
1. Created a new guard called "GLOBAL_STATE" which corresponds to the global state guard and can be filtered out using guard_filter_fn
2. Repurposed the existing "TORCH_FUNCTION_STATE" guard for checking the torch function mode stack

Also added a new helper `torch.compiler.skip_all_guards_unsafe`, which can be useful for use cases like vLLM.

Test Plan: CI

Fixes #ISSUE_NUMBER
Pull Request resolved: pytorch#167674
Approved by: https://github.com/anijain2305
When we know that all tensors and intermediate tensors fit in 32 bits but use unbacked dynamic shapes, we want a way to assume that we can use 32-bit indexing (we will runtime-assert on it). It is not practical to torch-check every possible intermediate tensor size ahead of time.

This is needed to enhance vLLM perf with unbacked dynamic shapes, since in vLLM all tensors and intermediates are assumed to fit in 32 bits.

Pull Request resolved: pytorch#167784
Approved by: https://github.com/jansel
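A minimal sketch of the trade-off described above: instead of proving symbolically that every intermediate fits in 32 bits, assume it and check at runtime. The helper name is hypothetical, not the PR's actual API.

```python
# Hypothetical runtime guard: assume 32-bit indexing, assert if violated.
INT32_MAX = 2**31 - 1

def assert_fits_int32(numel: int) -> int:
    """Raise if a tensor's element count needs 64-bit indexing."""
    if numel > INT32_MAX:
        raise RuntimeError(f"tensor with {numel} elements needs 64-bit indexing")
    return numel

assert_fits_int32(10_000)     # typical case: passes cheaply
try:
    assert_fits_int32(2**31)  # one past the limit
except RuntimeError as e:
    print("guard fired:", e)
```

A single cheap runtime check like this replaces an unbounded number of ahead-of-time symbolic checks over all possible intermediate sizes.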
…e2/serialize/inline_container.cc (pytorch#167612)

Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as:
```
} catch (exception&) {
```
If the code compiles, this is safe to land.

Test Plan: Sandcastle

Differential Revision: D85813824
Pull Request resolved: pytorch#167612
Approved by: https://github.com/seemethere, https://github.com/malfet
…h#166787)

Fixes pytorch#165427

## Description of Bug 🐛
As reported in pytorch#165427, when both inputs of the `atan2` function are zero, the gradient becomes `NaN`. During the forward pass, `atan2` successfully avoids the division-by-zero issue, but during backpropagation the gradients become `NaN`. This is because the backward pass calculates `(self * self + other * other).reciprocal()`, which becomes `inf` at `(0, 0)`. The subsequent multiplication by zero (`0 * inf`) results in `NaN`.

## Changes
- Added an `at::where` condition to handle zero denominators in `atan2_backward`.
- If the denominator is zero, return 0 for the reciprocal; otherwise, use the original value.

## Testing
- Added `test_atan2_zero_gradient` in `test/test_autograd.py` to verify `atan2` returns `0.0` gradients for `(0, 0)`.

Pull Request resolved: pytorch#166787
Approved by: https://github.com/soulitzer
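A pure-Python sketch of the guarded backward formula (the gradients of atan2(y, x) are x/(x²+y²) and -y/(x²+y²)); the `where`-style branch on a zero denominator is what the PR adds, though the real implementation operates on tensors via `at::where`:

```python
# Sketch of atan2's backward with the zero-denominator guard.
def atan2_backward(grad, y, x):
    denom = y * y + x * x
    # the at::where guard: reciprocal is 0 when denom is 0, avoiding 0 * inf
    recip = 0.0 if denom == 0.0 else 1.0 / denom
    return grad * x * recip, grad * -y * recip  # (d/dy, d/dx)

gy, gx = atan2_backward(1.0, 0.0, 0.0)
print(gy, gx)  # zero gradients at (0, 0) instead of NaN
```

Without the guard, `recip` would be `inf` at the origin and `grad * 0 * inf` evaluates to `NaN`, which is exactly the reported failure mode.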
Update NVTX to 3.3.0. This mostly fixes some errors in the bindings, improves C++20 support, and improves the C++ bindings to NVTX. It is a header-only library upgrade, so it should be mostly safe.

Pull Request resolved: pytorch#167751
Approved by: https://github.com/albanD, https://github.com/eqy
…torch#168102)

Summary: This adds the ability to trace through code which has a context manager enabled and returned in a user callable. Since we rely on side effects to return user-defined variables, we cannot disable side effects by default anymore in the short term. So we decided to leave the side-effect config up to the caller of dynamo_graph_capture_for_export, and still disable it for torch.export by default. In the short term we will just assume dynamo_graph_capture_for_export is a low-level API and it is the user's responsibility to control side-effect options.

Test Plan: CI

Fixes #ISSUE_NUMBER
Pull Request resolved: pytorch#168102
Approved by: https://github.com/tugsbayasgalan
Helps with reducing Dynamo tracing time. Earlier the generator object would cause more polyfills. Pull Request resolved: pytorch#168024 Approved by: https://github.com/williamwen42
Summary: Prior to #[164333](pytorch#164333), the 32-bit activation range was defined as `(int(-(2**31)), int(2**31 - 1))`. The `int` was deemed unnecessary; however, torch.jit.script interprets 2**31 as a float (example error P2044074770). Instead of reverting to the old definition (introduced by our team in #[150870](pytorch#150870), which could be "fixed" again), I replace it with the value directly.

Test Plan: N8628317 demonstrates the error without this diff. No error on this diff.

Differential Revision: D87278420
Pull Request resolved: pytorch#168046
Approved by: https://github.com/cyyever, https://github.com/yangw-dev
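A sketch of the resulting definition: spelling the 32-bit bounds as plain integer literals is mathematically identical to the `2**31` expressions, so no tool can reinterpret the exponentiation as a float (the tuple name below is illustrative, not the codebase's actual identifier):

```python
# Integer literals equal to -(2**31) and 2**31 - 1; no exponentiation
# expression left for torch.jit.script to misinterpret as a float.
ACT_RANGE_32 = (-2147483648, 2147483647)

assert ACT_RANGE_32[0] == -(2**31)
assert ACT_RANGE_32[1] == 2**31 - 1
assert all(isinstance(v, int) for v in ACT_RANGE_32)
```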
…h/csrc/Storage.cpp (pytorch#168184)

Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as:
```
} catch (exception&) {
```
If the code compiles, this is safe to land.

Test Plan: Sandcastle

Differential Revision: D87467930
Pull Request resolved: pytorch#168184
Approved by: https://github.com/malfet
Pull Request resolved: pytorch#167673 Approved by: https://github.com/zpcore, https://github.com/Skylion007
Pull Request resolved: pytorch#168113 Approved by: https://github.com/mlazos, https://github.com/zpcore ghstack dependencies: pytorch#167673
…torch#168086) See numpy/numpy#28343 Pull Request resolved: pytorch#168086 Approved by: https://github.com/williamwen42
# why
- enable configuring conv operations through the lookup table

# what
- move kwargs etc. into template_heuristics
- add conv-specific kernel inputs
- add lookup table e2e test for conv

# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -k "conv2d" -v
python3 -bb -m pytest test/inductor/test_max_autotune.py -k "conv" -v
```
Differential Revision: [D86474839](https://our.internmc.facebook.com/intern/diff/D86474839)
Pull Request resolved: pytorch#167179
Approved by: https://github.com/drisspg
A couple of changes:
* Update `xformers==0.0.33.post1`. This is the latest version for the 2.9 release.
* Remove the `flashinfer-python` build; we don't need to compile it anymore after vllm-project/vllm#26443. This is now a regular dependency for vLLM.
* I also switched the base image to 12.9.1 to match what vLLM is using nowadays.

### Testing
https://github.com/pytorch/pytorch/actions/runs/19490188972/job/55780754518

Pull Request resolved: pytorch#168141
Approved by: https://github.com/yangw-dev
Need to land: Dao-AILab/flash-attention#1985 ^^First^^ Pull Request resolved: pytorch#167040 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: pytorch#168017
…bled (pytorch#167245)" This reverts commit 789240b. Reverted pytorch#167245 on behalf of https://github.com/yangw-dev due to the base pr is broken internal tests in the stack ([comment](pytorch#167245 (comment)))
This reverts commit f49833d. Reverted pytorch#167231 on behalf of https://github.com/yangw-dev due to the diff breaks tests internally ([comment](pytorch#167231 (comment)))
This reverts commit 689d731. Reverted pytorch#167697 on behalf of https://github.com/yangw-dev due to break internal tests, need to include internal changes ([comment](pytorch#167697 (comment)))
This reverts commit ef7fa96. Reverted pytorch#167883 on behalf of https://github.com/yangw-dev due to breaking some internal tests (error: use of undeclared identifier); reached out to the author but got no response, so reverting this to keep diff-train hygiene ([comment](pytorch#167883 (comment)))
~This PR does change the semantics of the >> operator by using STD_TORCH_CHECK to throw the error instead of TORCH_CHECK. Jane (who is writing this message) thinks it is okay because it is the error case when an invalid MemoryFormat or Layout is getting passed into >>, so the UX benefits of TORCH_CHECK over STD_TORCH_CHECK there are not significant enough to warrant making a new copy of Layout and MemoryFormat's >> APIs.~ Never mind! We shouldn't change TORCH_CHECK to STD_TORCH_CHECK for core usage ever, cuz the traceback info and c10::Error is very much desired!! So the solution is to not migrate the >>s. I pushed new commits to the stack to remove the >> code, but for reference, pytorch@8a30179 has all the code that I ended up deleting. Pull Request resolved: pytorch#168034 Approved by: https://github.com/janeyx99 ghstack dependencies: pytorch#168025, pytorch#167802, pytorch#167803, pytorch#167804, pytorch#167962 Co-authored-by: Jane Xu <janeyx@meta.com>
…Variable (pytorch#167468) Continuation of work from previous PR, see link for context pytorch#161645 (comment) I think this PR is a step in that direction. There is probably some room for simplification. At a high level, the new class NamedTupleVariable handles methods that branch on structseq or the more dynamic subclasses of namedtuple, and falls back to UserDefinedTupleVariable otherwise. Please let me know what you think. @StrongerXi Pull Request resolved: pytorch#167468 Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi, https://github.com/mlazos
…rch#168063)" This reverts commit cdca10b. Reverted pytorch#168063 on behalf of https://github.com/yangw-dev due to Internal test breaks, contacted author to revert it and fix it test_codegen_int_array_var_fix_memory_leak, self.assertTrue(allocated_memory[1] == allocated_memory[2]) AssertionError: False is not true ([comment](pytorch#168063 (comment)))
Adding Lavender to the list. Pull Request resolved: pytorch#168172 Approved by: https://github.com/ramanishsingh, https://github.com/aelavender
Initial autotuning support for foreach kernels, 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. Removing num_warps from the definition to enable autotune support in generated wrapper code.

Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |

After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |

num_warps=8 was the default due to https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton_combo_kernel.py#L374

Pull Request resolved: pytorch#162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd, https://github.com/jeffdaily
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
…#168190)

`sys.getrefcount(lib)` was affected by a Python 3.13 optimization; `sys.getrefcount(lib._op_impls)` and the others remain the same.

Test plan: `python test/test_python_dispatch.py TestPythonRegistration.test_finalizer` in a local `python=3.14` env

Pull Request resolved: pytorch#168190
Approved by: https://github.com/williamwen42, https://github.com/azahed98
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d)
Fixes #ISSUE_NUMBER