[inductor] Add FMA lowering for add-with-alpha on CUDA #175838
mlazos wants to merge 13 commits into gh/mlazos/109/base from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175838
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 330cd4f with merge base 0b6476f.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.
ghstack-source-id: bac0c83
Pull-Request: #175838
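The precision difference the commit message refers to can be demonstrated in plain Python. A small sketch, where `fma_double` is my own stand-in for a fused multiply-add (Python 3.13+ has `math.fma`; this version also runs on older interpreters), not PyTorch or Triton code:

```python
from fractions import Fraction

def fma_double(x, y, z):
    # Model of fused multiply-add for float64: compute x*y + z exactly
    # as rationals, then round once to the nearest double.
    return float(Fraction(x) * Fraction(y) + Fraction(z))

# A case where the single rounding matters for a + alpha * b:
b = 1.0 + 2.0**-52           # 1 plus one ulp
alpha = b
a = -(1.0 + 2.0**-51)

# Separate ops: b * alpha rounds to 1 + 2**-51 first, so the low bits vanish.
print(b * alpha + a)          # 0.0

# Fused: the exact product survives until the final rounding.
print(fma_double(b, alpha, a))  # 2**-104
```

The fused form keeps the low-order bits of the product that the two-step `mul` + `add` sequence discards, which is exactly the guarantee the lowering preserves.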
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 3 checks: trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 1, 5, linux.g6.4xlarge.experimental.nvidia.gpu), inductor / unit-test / inductor-test / test (inductor, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu)
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused regression in:

Please investigate and fix the issues.
Claude finished @pytorch-auto-revert[bot]'s task in 4m 1s

Diagnosing Auto-Revert for FMA Lowering PR
Root Cause

The revert is legitimate. The bug is in the FMA lowering path: the FMA optimization is applied unconditionally for both add and sub:

```python
# line 689 - always computes a + alpha * b
return _add_with_alpha_fma(inputs[0], inputs[1], alpha)
```

This is why every test failure shows 100% of elements mismatched with large absolute differences (~35), and always with sub.

Failing Tests

How to Fix

Option 2 is cleaner since it avoids leaking op-specific semantics into the generic lowering.
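One way to make the FMA path sign-aware is to fold the subtraction sign into alpha before emitting the fused op, since `a - alpha * b == fma(b, -alpha, a)`. A minimal sketch with hypothetical names (this is not the actual inductor code, and the real fix may be structured differently):

```python
def lower_alpha_binop(a, b, alpha, op):
    # Hypothetical sketch: fold the sign of a sub into alpha so a single
    # FMA form serves both ops, instead of always emitting a + alpha * b.
    if op == "sub":
        alpha = -alpha
    elif op != "add":
        raise ValueError(f"unsupported op: {op}")
    # Stand-in for the backend fma(b, alpha, a) emission.
    return b * alpha + a

print(lower_alpha_binop(10.0, 2.0, 3.0, "add"))  # 16.0
print(lower_alpha_binop(10.0, 2.0, 3.0, "sub"))  # 4.0
```

With the sign folded in, the sub path produces `a - alpha * b` as expected rather than silently computing the add form, which matches the failure signature described above.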
@pytorchbot successfully started a revert job. Check the current status here.
…)" This reverts commit 45dfce3. Reverted #175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175838 (comment)))
@mlazos your PR has been successfully reverted.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha. Authored with Claude. ghstack-source-id: c259808 Pull-Request: pytorch/pytorch#175838
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha. Authored with Claude. Pull Request resolved: pytorch#175838 Approved by: https://github.com/v0i0 ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
…ch#175838)" This reverts commit 45dfce3. Reverted pytorch#175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175838 (comment)))
Stack from ghstack (oldest at bottom):
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee.

This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.
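For orientation, the affected eager ops can be modeled in pure Python. This is an illustrative sketch of the semantics only, not torch itself:

```python
def add_alpha(a, b, alpha=1.0):
    # grad.add(param, alpha=weight_decay) computes grad + weight_decay * param;
    # on CUDA, eager mode emits this as fma(param, weight_decay, grad).
    return a + alpha * b

def foreach_add(xs, ys, alpha=1.0):
    # torch._foreach_add with alpha applies the same update across a list
    # of tensors in one fused call.
    return [x + alpha * y for x, y in zip(xs, ys)]

grads = [0.5, -1.0]
params = [2.0, 4.0]
print(foreach_add(grads, params, alpha=0.5))  # [1.5, 1.0]
```

Both entry points reduce to the same `a + alpha * b` pattern, which is why one FMA lowering covers the weight_decay paths of the optimizers.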
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo