
Add deterministic path for CUDA cumsum#136224

Closed
kurtamohler wants to merge 2 commits into pytorch:main from kurtamohler:cumsum-deterministic-0

Conversation

@kurtamohler
Collaborator

@kurtamohler kurtamohler commented Sep 17, 2024

Change cumsum to call its decomposition when use_deterministic_algorithms(True) is set and the input is a CUDA tensor.

Fixes #89492
Fixes #75240

cc @mruberry

@pytorch-bot

pytorch-bot bot commented Sep 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136224

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Unrelated Failure

As of commit 0dd2ef2 with merge base 5516ac5 (image):


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kurtamohler kurtamohler force-pushed the cumsum-deterministic-0 branch from 55e2ea4 to 9afa925 on September 17, 2024 21:11
@kurtamohler kurtamohler added the module: determinism and release notes: python_frontend (python frontend release notes category) labels Sep 17, 2024
@kurtamohler kurtamohler marked this pull request as ready for review September 17, 2024 21:20
@kurtamohler kurtamohler requested a review from ezyang September 18, 2024 19:24
@kurtamohler kurtamohler force-pushed the cumsum-deterministic-0 branch 2 times, most recently from 5c2fa20 to eccd643 on September 18, 2024 22:04
@kurtamohler kurtamohler force-pushed the cumsum-deterministic-0 branch from eccd643 to 6ef5ddc on September 18, 2024 23:54
@ezyang
Contributor

ezyang commented Sep 19, 2024

test fail looks real

@bdhirsh bdhirsh added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Sep 19, 2024
@kurtamohler
Collaborator Author

kurtamohler commented Sep 19, 2024

test fail looks real

I'm having trouble figuring out how to fix that.

It looks like the graph context given to cumsum in torch/onnx/symbolic_opset11.py used to have an input variable called "%0", which was changed to "%x" for some reason. I'm not sure that's the only difference, but I'd like to try changing it back to see if it fixes the test. However, I'm not sure how to do that; I don't know what determines these variable names or why my changes affected them.

@kurtamohler
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Sep 19, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable)

Details for Dev Infra team: raised by workflow job.

@pytorchmergebot
Collaborator

@kurtamohler your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Sep 27, 2024
This reverts commit d1bb8e8.

Reverted #136224 on behalf of https://github.com/atalman due to Break internal CI (comment)
@ezyang
Contributor

ezyang commented Sep 27, 2024

Whelp. I guess we have to put the logic in C++. Maybe we should just port the Python decomp to C++ then.

@kurtamohler
Collaborator Author

Whelp. I guess we have to put the logic in C++. Maybe we should just port the Python decomp to C++ then.

Sounds good, working on it

@kurtamohler kurtamohler force-pushed the cumsum-deterministic-0 branch from 1189a5d to 0dd2ef2 on October 9, 2024 22:32
@kurtamohler
Collaborator Author

I've ported the decomp to C++.

@kurtamohler
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

return at::sum_out(result, self.unsqueeze(0), /*dim=*/IntArrayRef{0});
}
self = self.unsqueeze(dim + 1);
Tensor rg = at::arange(self.size(dim), c10::TensorOptions().device(self.device()));
Collaborator

this is quadratic memory usage for mask and for self, I don't think this solves the problem of large dim as it will OOM (perf is also quadratic obv)
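To illustrate the reviewer's point, here is a hypothetical pure-Python sketch of the mask-based decomposition (not the PR's actual C++ code): computing out[i] as a masked sum over all j <= i requires an n-by-n lower-triangular mask, so materializing it makes both memory and compute quadratic in n.

```python
def cumsum_via_mask(xs):
    # Deterministic cumulative sum as a masked reduction, mirroring the
    # decomposition's shape: build an n x n lower-triangular mask, then
    # reduce each row.  Materializing the mask (and the broadcasted
    # input) is what makes memory and compute quadratic in n.
    n = len(xs)
    mask = [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]
    return [sum(m * x for m, x in zip(row, xs)) for row in mask]

print(cumsum_via_mask([1.0, 2.0, 3.0, 4.0]))  # [1.0, 3.0, 6.0, 10.0]
```

The reduction order over each row is fixed, which is what buys determinism; the price is the quadratic intermediate the reviewer objects to.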

Collaborator Author

@kurtamohler kurtamohler Oct 14, 2024

Oh, good point, thanks for bringing that up. I've reopened the issue for large inputs.

Contributor

@xw285cornell xw285cornell Oct 26, 2024

Yeah, this is making deterministic mode quite unusable imo, e.g.

torch.cumsum(torch.rand([1, 1, 128256], device="cuda"), dim=-1)

This is using 60GB memory.
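The reported figure is consistent with the quadratic decomposition: a back-of-the-envelope check, assuming one n x n float32 intermediate for n = 128256, lands right around 60 GB.

```python
# Sanity-check of the reported ~60 GB (assumes the decomposition
# materializes a single n x n float32 intermediate for dim size n).
n = 128256
bytes_needed = n * n * 4          # float32 is 4 bytes per element
print(round(bytes_needed / 2**30, 1), "GiB")  # 61.3 GiB
```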

Collaborator

The alternative is a hard error in deterministic mode, so the current state is still better? Or would you prefer this PR reverted, @xw285cornell?

Contributor

There is often a funny situation where we buggily gave nondeterministic results but people prefer that over OOM/error 🤣

Collaborator

The difference is only in non-deterministic warn mode, where previously we would warn and take the non-deterministic path, and now we will OOM. For non-deterministic error mode, if someone preferred the non-deterministic result, that's on them; error means error.
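To make the two modes concrete, here is a minimal hypothetical mimic of the deterministic-mode alert (not PyTorch's internal API): "error" corresponds to use_deterministic_algorithms(True) and raises, while "warn" corresponds to warn_only=True and only warns, after which the fast non-deterministic kernel would previously still run.

```python
import warnings

def alert_nondeterministic(op_name, mode):
    # Hypothetical sketch of the deterministic-mode check.
    # mode == "error": use_deterministic_algorithms(True) -> raise.
    # mode == "warn":  warn_only=True -> warn, then (before this PR)
    #                  fall through to the non-deterministic kernel.
    if mode == "error":
        raise RuntimeError(
            f"{op_name} does not have a deterministic implementation")
    if mode == "warn":
        warnings.warn(
            f"{op_name} does not have a deterministic implementation")
```

With this PR, CUDA cumsum no longer reaches the alert at all: both modes take the deterministic decomposition, which is where the OOM concern for warn mode comes from.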

Contributor

It looks like some internal peeps are complaining, so I'm gonna yank this and we can talk about it more

Contributor

So.... what should we do? Is there a torch.compile version that we can ship instead?

@ezyang
Contributor

ezyang commented Oct 30, 2024

@pytorchbot revert -c nosignal -m "larger memory usage apparently not acceptable"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@kurtamohler your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 30, 2024
This reverts commit 383eba5.

Reverted #136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable (comment)
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
This reverts commit 383eba5.

Reverted pytorch#136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable (comment)
@kurtamohler
Collaborator Author

Replaced by #140887

@kurtamohler kurtamohler closed this Dec 5, 2024

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: determinism, open source, release notes: python_frontend (python frontend release notes category), Reverted, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: deterministic CUDA cumsum
Large cumulative sums appear to be nondeterministic.