Add deterministic path for CUDA cumsum #136224
kurtamohler wants to merge 2 commits into pytorch:main

Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136224
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures, 1 Unrelated Failure as of commit 0dd2ef2 with merge base 5516ac5.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 55e2ea4 to 9afa925
Force-pushed 5c2fa20 to eccd643
Force-pushed eccd643 to 6ef5ddc
test fail looks real
I'm having trouble figuring out how to fix that. It looks like the graph context given to
Force-pushed 6ef5ddc to e6b9d31
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@kurtamohler your PR has been successfully reverted.
This reverts commit d1bb8e8. Reverted #136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](#136224 (comment)))
Whelp. I guess we have to put the logic in C++. Maybe we should just port the Python decomp to C++ then.
Sounds good, working on it.
Force-pushed 8865cb4 to 1189a5d
Force-pushed 1189a5d to 0dd2ef2
I've ported the decomp to C++.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
    return at::sum_out(result, self.unsqueeze(0), /*dim=*/IntArrayRef{0});
  }
  self = self.unsqueeze(dim + 1);
  Tensor rg = at::arange(self.size(dim), c10::TensorOptions().device(self.device()));
this is quadratic memory usage for mask and for self, I don't think this solves the problem of large dim as it will OOM (perf is also quadratic obv)
Oh good point, thanks for bringing that up. I've reopened the issue for large inputs
yeah this is making the deterministic mode quite unusable imo, e.g.
torch.cumsum(torch.rand([1, 1, 128256], device="cuda"), dim=-1)
This is using 60GB memory.
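The quoted figure is consistent with a single quadratic intermediate: for n = 128256, one n-by-n float32 tensor alone is roughly 61 GiB (a back-of-the-envelope check, assuming float32 as in the example above):

```python
# Rough check of the ~60GB figure: one n x n float32 intermediate
# for n = 128256, as in the torch.rand([1, 1, 128256]) example.
n = 128256
bytes_fp32 = n * n * 4          # 4 bytes per float32 element
gib = bytes_fp32 / 2**30        # convert bytes to GiB
print(round(gib, 1))            # ~61.3 GiB for a single quadratic tensor
```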
The alternative is a hard error in deterministic mode, so the current state is still better? Or would you prefer this PR reverted, @xw285cornell?
There is often a funny situation where we buggily gave nondeterministic results but people prefer that over OOM/error 🤣
the difference is only in non-deterministic warn mode, where previously we would warn and take the non-deterministic path, and now we will OOM. For non-deterministic error mode, if someone preferred the non-deterministic result, that's on them; error means error.
It looks like some internal peeps are complaining, so I'm gonna yank this and we can talk about it more
So.... what should we do? Is there a torch.compile version that we can ship instead?
@pytorchbot revert -c nosignal -m "larger memory usage apparently not acceptable"
@pytorchbot successfully started a revert job. Check the current status here.
@kurtamohler your PR has been successfully reverted.
This reverts commit 383eba5. Reverted #136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable ([comment](#136224 (comment)))
Replaced by #140887
Change cumsum to call its decomposition when use_deterministic_algorithms(True) and the input is CUDA.
Fixes #89492
Fixes #75240
cc @mruberry
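For reference, the behavior described above is driven by the standard deterministic-algorithms toggle. A minimal illustration of the API (CPU here, since the PR's change only reroutes CUDA inputs; this snippet just shows the toggle and the expected cumsum result):

```python
import torch

# Request deterministic algorithms; with this PR, CUDA cumsum is routed
# through the decomposition instead of the nondeterministic kernel.
torch.use_deterministic_algorithms(True)

x = torch.tensor([1.0, 2.0, 3.0])
out = torch.cumsum(x, dim=0)
print(out)  # tensor([1., 3., 6.])

# Restore the default setting.
torch.use_deterministic_algorithms(False)
```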