torch.utils.checkpoint.checkpoint + torch.cuda.amp #40221
tano297 wants to merge 1 commit into pytorch:master from tano297:autocast_grad_checkpoint
Conversation
…kpointing gradients
@mcarilli I requested review from you because of the mention of amp; please let me know if that's not right.
Any update on this? Let me know how I can help.
Has this bug been solved? I have encountered the same issue.
mcarilli left a comment
Looks good to me, thanks!
I think the usual custom autograd function decorators aren't preferable here, because CheckpointFunction.backward runs a nested forward and backward. The autocast API recommends running only forward under autocast, but globally enabling autocast for all of CheckpointFunction.backward (as @custom_bwd might do) would include the nested backward as well.
Definitely needs a test though.
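For context, the recommendation mcarilli refers to looks roughly like the following (a generic amp training-loop sketch, not code from this PR; the model, optimizer, and tensor names are made up for illustration): autocast wraps only the forward pass and loss computation, while backward runs outside the autocast region.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Illustrative module and optimizer; any would do.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

optimizer.zero_grad()
with autocast():                      # autocast covers the forward pass only
    out = model(x)
    loss = torch.nn.functional.mse_loss(out, target)
scaler.scale(loss).backward()         # backward runs outside the autocast region
scaler.step(optimizer)
scaler.update()
```

Globally enabling autocast for all of CheckpointFunction.backward would break this separation, because the nested torch.autograd.backward call inside it would then also run under autocast.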
PR appears orphaned, moving to #49757.
Summary: Adds a test to the orphaned original PR (pytorch#40221). Should fix pytorch#49738 and pytorch#47183.
Pull Request resolved: pytorch#49757
Reviewed By: mruberry
Differential Revision: D25689609
Pulled By: ngimel
fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010
Simple two-line workaround to allow gradient checkpointing to work with amp autocast.
In the same way PyTorch stores the "has_cuda" state in the context, we store a "has_autocast" flag during the first forward pass, so that autocast can be re-enabled when the forward pass runs for the second time during the backward pass.
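A minimal sketch of that idea is shown below. The class and attribute names (e.g. `had_autocast_in_fwd`) are illustrative rather than the exact diff, and the sketch ignores RNG-state restoration and non-tensor arguments that the real `torch.utils.checkpoint` handles.

```python
import torch
from torch.cuda.amp import autocast

class AutocastAwareCheckpoint(torch.autograd.Function):
    """Simplified checkpoint that preserves the autocast state across recomputation.

    Sketch only: assumes all inputs are tensors and skips RNG-state handling.
    """

    @staticmethod
    def forward(ctx, run_function, *args):
        ctx.run_function = run_function
        # The added state: remember whether autocast was active in the first forward.
        ctx.had_autocast_in_fwd = torch.is_autocast_enabled()
        ctx.save_for_backward(*args)
        with torch.no_grad():
            return run_function(*args)

    @staticmethod
    def backward(ctx, *grad_outputs):
        inputs = [x.detach().requires_grad_(x.requires_grad) for x in ctx.saved_tensors]
        # Re-run the forward with autocast re-enabled if it was on originally;
        # the nested backward below still runs outside the autocast region.
        with torch.enable_grad(), autocast(enabled=ctx.had_autocast_in_fwd):
            outputs = ctx.run_function(*inputs)
        if isinstance(outputs, torch.Tensor):
            outputs = (outputs,)
        torch.autograd.backward(outputs, grad_outputs)
        return (None,) + tuple(inp.grad for inp in inputs)
```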
For anybody hitting this before the fix is merged, a simple workaround can be found in #37730, or you can copy this version of the file into your own codebase with the two added lines.
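Either way, the end goal is that user code like the following behaves correctly, with the checkpointed segment recomputed under the same autocast state during the backward pass (module and tensor names here are made up for the example):

```python
import torch
from torch.cuda.amp import autocast
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).cuda()
x = torch.randn(4, 32, device="cuda", requires_grad=True)

with autocast():
    # With the fix (or the patched checkpoint.py), the recomputation inside
    # backward sees the same autocast state as this first forward pass.
    y = checkpoint(block, x)
    loss = y.sum()
loss.backward()
```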