[inductor] Skip addcdiv decomposition to enable FMA lowering #175310
mlazos wants to merge 22 commits into gh/mlazos/107/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175310
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit b342d39 with merge base 0b6476f:
NEW FAILURE - one job has failed.
UNSTABLE - one job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to decomps_to_exclude so that the FMA-based addcdiv lowering is used instead of decomposition. Also simplify the dynamo handlers for addcdiv to always skip inline decomposition.

This enables bitwise precision parity with eager CUDA for addcdiv operations used in Adam/AdamW optimizers.

ghstack-source-id: 370d088
Pull-Request: #175310
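For illustration, a rough sketch (not the actual diff) of the shape of this change: the three overloads named above get appended to inductor's `decomps_to_exclude` list so they fall through to the FMA-based lowering instead of being decomposed. The surrounding entries and the exact registration code are assumptions.

```python
# Sketch only: appending the addcdiv overloads to a decomps_to_exclude-style
# list, as described in the commit message above; the real change lives in
# inductor's decomposition registration and may differ in detail.
import torch

aten = torch.ops.aten

decomps_to_exclude = [
    # ... existing exclusions elided ...
    aten.addcdiv,                  # out-of-place addcdiv -> FMA-based lowering
    aten.addcdiv_,                 # in-place variant used by optimizer steps
    aten._foreach_addcdiv.Scalar,  # foreach variant used by fused Adam/AdamW
]
```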
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 2 checks: inductor / inductor-test / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test-cuda13 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` and then adds it to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths, which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
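As a minimal sketch of what this means for the weight_decay path (not taken from the PR's test suite), the snippet below compares the compiled and eager results of `grad.add(param, alpha=weight_decay)` bit-for-bit; the tensor sizes and values are arbitrary and a CUDA device is assumed.

```python
# Minimal sketch, assuming a CUDA device: with the FMA lowering in place,
# torch.compile should reproduce eager's fma(param, weight_decay, grad)
# bit-for-bit for the weight_decay pattern mentioned above.
import torch

def weight_decay_step(grad, param, weight_decay):
    return grad.add(param, alpha=weight_decay)

compiled_step = torch.compile(weight_decay_step)

grad = torch.randn(1 << 16, device="cuda")
param = torch.randn(1 << 16, device="cuda")

eager_out = weight_decay_step(grad, param, 1e-2)
compiled_out = compiled_step(grad, param, 1e-2)

# Compare raw bit patterns so signed zeros or NaNs cannot mask a mismatch.
assert torch.equal(eager_out.view(torch.int32), compiled_out.view(torch.int32))
```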
…#175310)

Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to decomps_to_exclude so that the FMA-based addcdiv lowering is used instead of decomposition. Also simplify the dynamo handlers for addcdiv to always skip inline decomposition.

This enables bitwise precision parity with eager CUDA for addcdiv operations used in Adam/AdamW optimizers.

Pull Request resolved: pytorch#175310
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309
Stack from ghstack (oldest at bottom):
Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to
decomps_to_exclude so that the FMA-based addcdiv lowering is used
instead of decomposition.
Also simplify the dynamo handlers for addcdiv to always skip inline
decomposition.
This enables bitwise precision parity with eager CUDA for addcdiv
operations used in Adam/AdamW optimizers.
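For a concrete picture of the parity claim, here is a minimal, hypothetical check (not the PR's actual test) that the addcdiv step at the core of the Adam update matches eager CUDA bit-for-bit under torch.compile; the shapes, step size, and eps offset are made up, and a CUDA device is assumed.

```python
# Hypothetical parity check, assuming a CUDA device: the Adam-style update
# param + (-step_size) * (exp_avg / denom) via torch.addcdiv, compiled vs eager.
import torch

def adam_addcdiv(param, exp_avg, denom, step_size):
    return torch.addcdiv(param, exp_avg, denom, value=-step_size)

compiled_addcdiv = torch.compile(adam_addcdiv)

param = torch.randn(4096, device="cuda")
exp_avg = torch.randn(4096, device="cuda")
denom = torch.rand(4096, device="cuda") + 1e-8  # stand-in for sqrt(v_hat) + eps

eager_out = adam_addcdiv(param, exp_avg, denom, 1e-3)
compiled_out = compiled_addcdiv(param, exp_avg, denom, 1e-3)

# Bitwise comparison: with the FMA-based addcdiv lowering these should agree exactly.
assert torch.equal(eager_out.view(torch.int32), compiled_out.view(torch.int32))
```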
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela