
Increase the memory requirement of test_reduction_split #43257

Closed

zasdfgbnm wants to merge 1 commit into master from zasdfgbnm-patch-1

Conversation

@zasdfgbnm (Collaborator)

This test is constantly failing on my 12GB GPU
@zasdfgbnm zasdfgbnm requested a review from ngimel August 19, 2020 07:04
@dr-ci (bot) commented Aug 19, 2020

💊 CI failures summary and remediations

As of commit c20dafd (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

This comment has been revised 1 time.

@ngimel (Collaborator) commented Aug 19, 2020

Hm, that's weird: the tensor requires just over 4 GB of memory, and result and expected are also less than 1 GB each. Are you running the test in isolation or as part of the test suite? Is it that intermediates in the summation are not reused? Nope, it looks as expected:

```python
In [6]: expect = input_[0] + input_[1] + input_[2] + input_[3] + input_[4]

In [7]: torch.cuda.memory_allocated()
Out[7]: 4978638848
```

@zasdfgbnm (Collaborator, Author)

I am running it as part of the test suite. I don't know why it is failing; my card has 12 GB, and the error message says it is unable to allocate XXX MB of memory with 7.XX GB already allocated. Maybe the reason is that running the whole test suite creates too much fragmentation?
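A minimal sketch of one way to check the fragmentation hypothesis, using only the stock torch.cuda introspection calls (torch.cuda.memory_allocated and torch.cuda.memory_reserved); report_gap is a hypothetical helper, not something in the test suite:

```python
import torch

def report_gap(device=None):
    # Bytes held by live tensors vs. bytes the caching allocator has
    # reserved from the driver. A large gap means memory is cached but
    # may be split into blocks too small to serve one big contiguous
    # allocation, i.e. fragmentation.
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated: {allocated / 2**30:.2f} GiB")
    print(f"reserved:  {reserved / 2**30:.2f} GiB")
    print(f"gap:       {(reserved - allocated) / 2**30:.2f} GiB")
```

torch.cuda.memory_summary() prints a more detailed per-pool breakdown if the gap alone is not conclusive.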

@ngimel (Collaborator) commented Aug 19, 2020

That's possible. What if you do empty_cache before this test? If that helps, maybe empty_cache should be part of the @largeCUDATensorTest wrapper.
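If that helped, a minimal sketch of what folding empty_cache into such a wrapper could look like (empty_cache_first is a hypothetical name, shown as a plain decorator rather than the actual largeCUDATensorTest implementation):

```python
import functools
import torch

def empty_cache_first(fn):
    # Hypothetical wrapper: hand the allocator's cached-but-free blocks
    # back to the driver before a memory-hungry test runs, so memory
    # cached by earlier tests cannot starve this one.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        torch.cuda.empty_cache()
        return fn(*args, **kwargs)
    return wrapper
```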

@zasdfgbnm (Collaborator, Author) commented Aug 19, 2020

Tested that empty_cache does not work. One strange thing I just found: I can only reproduce this failure with my amin branch, but my branch does not change anything about this test, nor about the reduction kernel.

@mruberry mruberry added module: tests Issues related to tests (not the torch.testing module) module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Aug 26, 2020
@mruberry mruberry self-requested a review August 26, 2020 04:42
@mruberry (Collaborator) left a comment

Sounds good. Thanks @zasdfgbnm!

@facebook-github-bot (Contributor) left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator)

Looks like I missed the conversation while stamping PRs yesterday. @zasdfgbnm, what would you think of filing an issue for this, following your discussion with @ngimel, instead of us trying to paper over the underlying issue?

@zasdfgbnm (Collaborator, Author)

@mruberry An issue about what? About this specific test or about largeCUDATensorTest in general?

@mruberry (Collaborator)

> @mruberry An issue about what? About this specific test or about largeCUDATensorTest in general?

That you've been able to hit an OOM when running this test while developing unrelated features, and that the OOM happens even though the test doesn't use that much memory, and even after emptying the cache. As you mention above, this suggests the test suite is holding CUDA memory between tests, which it probably shouldn't do.

@zasdfgbnm (Collaborator, Author)

@mruberry Oh, your comment #43257 (comment) reminds me of something. Maybe I should try gc.collect(); torch.cuda.empty_cache() instead of just torch.cuda.empty_cache(). Let me try that and see the result.
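For reference, the combination being proposed is just the two calls in sequence; gc.collect() first forces unreachable Python tensors (e.g. ones kept alive by reference cycles) to actually be destroyed, so that empty_cache then has something to release:

```python
import gc
import torch

gc.collect()              # destroy unreachable tensors first
torch.cuda.empty_cache()  # return the now-free cached blocks to the driver
```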

@zasdfgbnm (Collaborator, Author)

No, adding gc.collect() does not help.

@facebook-github-bot (Contributor)

Hi @zasdfgbnm!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@zasdfgbnm (Collaborator, Author)

I don't think this is a problem any more.

@zasdfgbnm zasdfgbnm closed this Nov 9, 2020
@zasdfgbnm zasdfgbnm deleted the zasdfgbnm-patch-1 branch November 9, 2020 18:49

Labels

cla signed · module: cuda · module: tests · open source · triaged
