
Increase the memory requirement of test_reduction_split #43257

Closed

zasdfgbnm wants to merge 1 commit into master from zasdfgbnm-patch-1

Conversation

@zasdfgbnm (Collaborator)

This test is constantly failing on my 12GB GPU
@zasdfgbnm zasdfgbnm requested a review from ngimel August 19, 2020 07:04
@dr-ci (bot) commented Aug 19, 2020

💊 CI failures summary and remediations

As of commit c20dafd (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

This comment has been revised 1 time.

@ngimel (Collaborator) commented Aug 19, 2020

Hm, that's weird: the tensor requires just over 4 GB of memory, and result and expected are also less than 1 GB each. Are you running the test in isolation or as part of the test suite? Is it that intermediates in the summation are not reused? Nope, it looks as expected:

```python
In [6]: expect = input_[0] + input_[1] + input_[2] + input_[3] + input_[4]

In [7]: torch.cuda.memory_allocated()
Out[7]: 4978638848
```

@zasdfgbnm (Collaborator, Author)

I am running it as part of the test suite. I don't know why it is failing; my card has 12 GB, and the error message says it is unable to allocate XXX MB of memory with 7.XX GB already allocated. Maybe the reason is that running the whole test suite creates too much fragmentation?
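A minimal sketch of one way to check the fragmentation hypothesis, using only the stock torch.cuda introspection calls (torch.cuda.memory_allocated and torch.cuda.memory_reserved); report_gap is a hypothetical helper, not something in the test suite:

```python
import torch

def report_gap(device=None):
    # Bytes held by live tensors vs. bytes the caching allocator has
    # reserved from the driver. A large gap means memory is cached but
    # may be split into blocks too small to serve one big contiguous
    # allocation, i.e. fragmentation.
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated: {allocated / 2**30:.2f} GiB")
    print(f"reserved:  {reserved / 2**30:.2f} GiB")
    print(f"gap:       {(reserved - allocated) / 2**30:.2f} GiB")
```

torch.cuda.memory_summary() prints a more detailed per-pool breakdown if the gap alone is not conclusive.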

@ngimel (Collaborator) commented Aug 19, 2020

That's possible. What if you do empty_cache before this test? If that helps, maybe empty_cache should be part of the @largeCUDATensorTest wrapper.
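If that helped, a minimal sketch of what folding empty_cache into such a wrapper could look like (empty_cache_first is a hypothetical name, shown as a plain decorator rather than the actual largeCUDATensorTest implementation):

```python
import functools
import torch

def empty_cache_first(fn):
    # Hypothetical wrapper: hand the allocator's cached-but-free blocks
    # back to the driver before a memory-hungry test runs, so memory
    # cached by earlier tests cannot starve this one.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        torch.cuda.empty_cache()
        return fn(*args, **kwargs)
    return wrapper
```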

@zasdfgbnm (Collaborator, Author) commented Aug 19, 2020

Tested that empty_cache does not work. One strange thing I just found: I can only reproduce this failure with my amin branch, but my branch does not change anything about this test, nor about the reduction kernel.

@mruberry mruberry added module: tests Issues related to tests (not the torch.testing module) module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Aug 26, 2020
@mruberry mruberry self-requested a review August 26, 2020 04:42
@mruberry (Collaborator) left a comment

Sounds good. Thanks @zasdfgbnm!

@facebook-github-bot (Contributor) left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator)

Looks like I missed the conversation while stamping PRs yesterday. @zasdfgbnm, what would you think of filing an issue for this, following your discussion with @ngimel, instead of us trying to paper over the underlying issue?

@zasdfgbnm (Collaborator, Author)

@mruberry An issue about what? About this specific test or about largeCUDATensorTest in general?

@mruberry (Collaborator)

> @mruberry An issue about what? About this specific test or about largeCUDATensorTest in general?

That you've been able to hit an OOM when running this test while developing unrelated features, and that the OOM happens even though the test doesn't use that much memory, and even after emptying the cache. As you mention above, this suggests the test suite is holding CUDA memory between tests, which it probably shouldn't do.

@zasdfgbnm (Collaborator, Author)

@mruberry Oh, your comment #43257 (comment) reminds me of something. Maybe I should try gc.collect(); torch.cuda.empty_cache() instead of just torch.cuda.empty_cache(). Let me try that and see the result.
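For reference, the combination being proposed is just the two calls in sequence; gc.collect() first forces unreachable Python tensors (e.g. ones kept alive by reference cycles) to actually be destroyed, so that empty_cache then has something to release:

```python
import gc
import torch

gc.collect()              # destroy unreachable tensors first
torch.cuda.empty_cache()  # return the now-free cached blocks to the driver
```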

@zasdfgbnm (Collaborator, Author)

No, adding gc.collect() does not help.

@facebook-github-bot (Contributor)

Hi @zasdfgbnm!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@zasdfgbnm (Collaborator, Author)

I don't think this is a problem any more.

@zasdfgbnm zasdfgbnm closed this Nov 9, 2020
@zasdfgbnm zasdfgbnm deleted the zasdfgbnm-patch-1 branch November 9, 2020 18:49

Labels

cla signed · module: cuda · module: tests · open source · triaged
