Increase the memory requirement of test_reduction_split #43257
Conversation
This test is constantly failing on my 12GB GPU
💊 CI failures summary and remediations
As of commit c20dafd (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 ongoing upstream failure: this was probably caused by an upstream breakage that is not fixed yet.
Hm, that's weird: tensor requires just over 4 GB of memory, and result and expected are also less than 1 GB each. Are you running the test in isolation or as part of the test suite? Is it that intermediates in the summation are not reused? Nope, it looks as expected.
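(For reference, the 4 GB figure is just element count times element size; the shape and dtype in the sketch below are assumptions, not necessarily what test_reduction_split actually uses.)

```python
import torch

# Back-of-the-envelope check only; the element count and dtype here are assumptions.
numel = 2 ** 31                                                      # ~2.1 billion elements
element_size = torch.empty(0, dtype=torch.float16).element_size()   # 2 bytes per half-precision element
print(numel * element_size / 1024 ** 3, "GiB")                       # 4.0 GiB, i.e. roughly the "just over 4 GB" above
```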
I am running it as part of the test suite. I don't know why it is failing; my card has 12 GB, and the error message says it is unable to allocate XXX MB of memory while 7.XX GB is already allocated. Maybe the reason is that running the whole test suite creates too much fragmentation?
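One rough way to check for that kind of leftover/fragmented memory (a sketch only, not something this PR adds) is to compare what the caching allocator has reserved against what live tensors actually use right before the failing test:

```python
import torch

def report_cuda_memory(tag: str = "") -> None:
    # "allocated" counts memory backing live tensors; "reserved" counts memory the
    # caching allocator is holding on to. A large gap right before this test would
    # point at cached or fragmented blocks left over from earlier tests.
    allocated_gib = torch.cuda.memory_allocated() / 1024 ** 3
    reserved_gib = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"{tag}: allocated={allocated_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")
    # torch.cuda.memory_summary() prints a more detailed breakdown if needed.
```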
That's possible. What if you do empty_cache before this test? If that helps, maybe empty_cache should be part of the @largeCUDATensorTest wrapper. |
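A minimal sketch of that idea (illustration only; the real @largeCUDATensorTest decorator lives in PyTorch's test utilities and its details may differ):

```python
import functools
import torch

def empty_cache_first(test_fn):
    # Hypothetical wrapper: return cached blocks to the driver before a memory-hungry
    # test runs, so allocations cached by earlier tests don't trigger a spurious OOM.
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        torch.cuda.empty_cache()
        return test_fn(*args, **kwargs)
    return wrapper
```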
Tested that; the OOM still happens even after emptying the cache.
mruberry left a comment:
Sounds good. Thanks @zasdfgbnm!
facebook-github-bot left a comment:
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks like I missed the conversation while stamping PRs yesterday. @zasdfgbnm, what would you think of filing an issue for this, following your discussion with @ngimel, instead of us trying to paper over the underlying issue?
@mruberry An issue about what? About this specific test, or about the test suite's memory behavior in general?
That you've been able to hit OOM when running this test while developing unrelated features, and that the OOM happens despite the test not using that much memory and even after emptying the cache. As you mention above, this suggests the test suite is holding CUDA memory between tests, which it probably shouldn't do.
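In other words, something along these lines at the test-class level (a hypothetical sketch, not PyTorch's actual test harness) would keep one test's allocations from spilling into the next:

```python
import gc
import unittest
import torch

class CudaMemoryIsolatedTestCase(unittest.TestCase):
    # Hypothetical base class: drop dead references and cached allocator blocks after
    # every test so that a later large-tensor test starts from a (nearly) empty GPU.
    def tearDown(self):
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```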
@mruberry Oh, your comment #43257 (comment) reminds me of something. Maybe I should try …
No, adding …
Hi @zasdfgbnm! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
I don't think this is a problem any more. |