Tags: pytorch/pytorch
[export] avoid RecursionError in guards-fn codegen for deeply nested guards (#186993)

Summary: `ExportedProgram.module()` builds a `_guards_fn` submodule that re-asserts the exported shape guards. For each assert's human-readable error message, `_convert_guards_code_to_fn` (in `torch/export/_unlift.py`) pretty-prints the guard via `ast.unparse(ast.parse(shadow))`. Both `ast.parse` and `ast.unparse` recurse once per AST node, so a guard whose expression is very deeply nested -- e.g. a sum over many symbolic sizes, as produced when exporting a recommendation model with `auto_dynamic_shapes` over hundreds of jagged/KJT features -- exceeds Python's recursion limit and raises `RecursionError`, aborting the entire export (including standalone publish, which reaches this code via `run_decompositions()` -> `module()`).

Root cause: the `ast.unparse(ast.parse(...))` round-trip is purely cosmetic; as the existing comment states, it "is not necessary for correctness, just deemed desirable" -- it only normalizes redundant parentheses in the assert error string. The executed runtime check uses the separate `actual` expression and does not depend on the pretty-printed `shadow`, so a deep guard should never be fatal.

Fix: wrap the normalization in `try/except RecursionError` and fall back to the un-normalized guard string. The emitted runtime assert is unchanged; only the readability of the guard-failure message degrades slightly in the rare deep-guard case.

Test Plan: Built custom aps package and publish f1096406197.

Added `test_guards_fn_recovers_from_unparse_recursion_error`, which mocks `ast.unparse` to raise `RecursionError` and asserts `_convert_guards_code_to_fn` still returns a guards fn instead of propagating the error. A mock is used rather than a genuinely deep expression because the test target is ASAN-instrumented, where deep `ast.parse`/`compile` recursion can abort the process before the pure-Python `RecursionError` is reached.

```
buck2 test fbcode//caffe2/test:test_export -- --regex 'test_guards_fn_recovers_from_unparse_recursion_error'
```

After the fix: `Pass 11. Fail 0. Fatal 0.` (the test is fanned out across export modes: strict, nonstrict, serdes, retraceability, cpp_serdes, training_ir, nativert, ...). Before the fix the same test fails with `RecursionError: maximum recursion depth exceeded` at `_unlift.py` (`Pass 0, Fail 11`).

Authored with the assistance of an AI coding assistant.

Reviewed By: jijunyan, sophielin508

Differential Revision: D108111211

Pull Request resolved: #186993
Approved by: https://github.com/jijunyan
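The fallback described under "Fix" can be illustrated with a minimal standalone sketch. This is not the code from `torch/export/_unlift.py`: the helper name `normalize_guard_text` is hypothetical, and only the standard-library `ast` and `unittest.mock` calls are real. The second half mirrors the idea of the added test by mocking `ast.unparse` instead of building a genuinely deep expression.

```python
import ast
from unittest import mock


def normalize_guard_text(shadow: str) -> str:
    """Hypothetical helper: tidy a guard expression for an error message."""
    try:
        # Cosmetic round-trip that drops redundant parentheses.
        return ast.unparse(ast.parse(shadow))
    except RecursionError:
        # Very deeply nested guard: keep the un-normalized string instead
        # of letting a purely cosmetic step abort the caller.
        return shadow


# Normal case: redundant parentheses are removed.
print(normalize_guard_text("((s0) + (s1)) <= (8)"))  # s0 + s1 <= 8

# Simulated deep-guard case: ast.unparse raises, the raw string is returned.
with mock.patch.object(ast, "unparse", side_effect=RecursionError):
    print(normalize_guard_text("((s0) + (s1)) <= (8)"))  # ((s0) + (s1)) <= (8)
```

Because the normalized string in this sketch feeds only the message text, falling back to the raw string changes how a failure reads, not whether the check fires.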
[c10] Make basic_string_view inherit from std::basic_string_view (#184152)

This PR simplifies the body of `c10::basic_string_view`, keeping only a minimal set of methods.

Pull Request resolved: #184152
Approved by: https://github.com/Skylion007
Revert "[BE] Make spmd_type a CI rather than CD dependency (#187067)"

This reverts commit d4c98cd.

Reverted #187067 on behalf of https://github.com/pytorch-auto-revert. Reason: reverted automatically by pytorch's autorevert; to avoid this behaviour, add the tag `autorevert: disable` (see the comment on #187067).
Move backend-specific c10d files into per-backend subfolders (#187083)

Summary: Reorganizes `torch/csrc/distributed/c10d` by moving non-public, backend-specific implementation files and the TCPStore backend files into per-backend subfolders, while leaving the public-facing classes at the top level (the `ProcessGroupGloo`/`NCCL`/`MPI`/`UCC` backends and the `Store`/`TCPStore`/`FileStore`/`HashStore`/`PrefixStore` classes all stay put).

The moves are:
- `store/` gets `TCPStoreBackend.{cpp,hpp}` and `TCPStoreLibUvBackend.cpp`;
- `gloo/` gets `ProcessGroupGlooCuda.cpp`, `ProcessGroupGlooDetail.hpp`, and `GlooDeviceFactory.{cpp,hpp}`;
- `ucc/` gets `UCCTracing.{cpp,hpp}` and `UCCUtils.{cpp,hpp}`;
- `nccl/` gets `NCCLXStub.hpp`.

`NCCLUtils.{cpp,hpp}` was deliberately kept at the top level even though it is backend-specific: it is included by several call sites outside `caffe2` (in `gen_ai`, `ads_mkl`, and `fbgemm_gpu`), so relocating it would be a wider, riskier change better done on its own. As a result the new `nccl/` folder currently holds only `NCCLXStub.hpp`.

All include sites were updated, covering both the canonical `torch/csrc/distributed/c10d/...` include form and the legacy short `c10d/...` form (used by `fb/GlooDeviceFactory.cpp`). Build wiring was updated in `build_variables.bzl` -- the canonical source list consumed by CMake (via `append_filelist` in `cmake/Codegen.cmake`), OSS Bazel, and OSS Buck -- and in the internal `fb/fbcode/target_definitions.bzl` for `ProcessGroupGlooCuda.cpp`. Headers are picked up by recursive globs, so no header-list edits were needed.

This is a pure file move: contents are unchanged apart from the relocated `#include` paths, so correctness is established by a clean build rather than by behavioral tests.

Authored with the assistance of an AI coding assistant (Claude Code).

Test Plan: Confirmed no references to the old paths remain anywhere in `fbcode`, then ran the fbcode lint and build tooling:

```
arc f
arc lint
arc lint --take AUTODEPS --apply-patches
buck2 build fbcode//caffe2:_libtorch fbcode//caffe2:_libtorch_cuda
```

`arc f` and `arc lint` reported no issues; AUTODEPS produced no dependency changes (the moves stayed within existing Buck targets); both the CPU (`_libtorch`) and CUDA (`_libtorch_cuda`) libraries built successfully (exit 0).

Reviewed By: kapilsh

Differential Revision: D108332288

Pull Request resolved: #187083
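For downstream code that still includes the relocated headers by their old paths, the moves listed in the summary can be written down as a small lookup table. The sketch below is illustrative only and is not tooling from the PR: `MOVED_HEADERS`, `INCLUDE_RE`, and `rewrite_includes` are hypothetical names, and the assumption that the legacy short form becomes `c10d/<folder>/...` is inferred from the description rather than confirmed by it.

```python
import re

# Headers named in the summary, mapped to their new subfolder.
# NCCLUtils.hpp is intentionally absent: it stays at the top level.
MOVED_HEADERS = {
    "TCPStoreBackend.hpp": "store",
    "ProcessGroupGlooDetail.hpp": "gloo",
    "GlooDeviceFactory.hpp": "gloo",
    "UCCTracing.hpp": "ucc",
    "UCCUtils.hpp": "ucc",
    "NCCLXStub.hpp": "nccl",
}

# Matches the canonical prefix and the legacy short `c10d/` prefix,
# followed directly by a header name (no subfolder yet).
INCLUDE_RE = re.compile(
    r'(#include\s*[<"])((?:torch/csrc/distributed/)?c10d/)'
    r'([A-Za-z0-9_]+\.hpp)([>"])'
)


def rewrite_includes(source: str) -> str:
    """Insert the per-backend subfolder into includes of moved headers."""

    def fix(match: re.Match) -> str:
        folder = MOVED_HEADERS.get(match.group(3))
        if folder is None:
            return match.group(0)  # not a moved header: leave untouched
        opener, prefix, header, closer = match.groups()
        return f"{opener}{prefix}{folder}/{header}{closer}"

    return INCLUDE_RE.sub(fix, source)


old = (
    "#include <torch/csrc/distributed/c10d/UCCUtils.hpp>\n"
    "#include <c10d/GlooDeviceFactory.hpp>\n"
    "#include <torch/csrc/distributed/c10d/NCCLUtils.hpp>\n"
)
print(rewrite_includes(old))
```

The pattern only matches a header that sits directly under `c10d/`, so running the rewrite a second time leaves already-updated includes alone.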
Update on "[dtensor] migrating tensor ops to single dim strategies"

**Summary:**

Before

Directly registered:
- rule (register_prop_rule): 2
- op_strategy (register_op_strategy): 158
- single_dim_strategy: 1013
- total: 1164

After

Directly registered:
- rule (register_prop_rule): 2
- op_strategy (register_op_strategy): 114
- single_dim_strategy: 1068
- total: 1176

Net New Ops Added: 12

**Test Cases**
1. pytest test/distributed/tensor/test_tensor_ops.py

[ghstack-poisoned]