[perf][OSS] tensor views for bucketing #300
Conversation
@msbaines I'm not sure how that works for torch 1.5 and 1.6; I don't think that we're actually testing those nowadays (since the pip default change)
I've just tested: this part (tensor views) is fine, but a previous PR (#297) was not in fact PyTorch 1.5-compatible (cc @pritamdamania87), and CI did not catch that.
min-xu-ai
left a comment
Really like the simplification and performance/memory gain!
torch.distributed group (default: group.WORLD)
broadcast_buffer_size (int):
-    the size of the buffer used to batch the small parameter tensors (default 128k).
+    the max size of the buffer used to batch the small parameter tensors, in number of elements (default 16M).
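Since the docstring change above switches the unit from bytes to number of elements, the actual byte footprint of the buffer depends on the parameter dtype. A quick hypothetical illustration (not fairscale code; the constant name is made up):

```python
# The default max bucket size is now expressed in elements, not bytes.
DEFAULT_BUFFER_ELEMS = 16 * 2**20  # 16M elements

# Byte footprint depends on the element dtype:
bytes_fp32 = DEFAULT_BUFFER_ELEMS * 4  # fp32: 4 bytes/element -> 64 MiB
bytes_fp16 = DEFAULT_BUFFER_ELEMS * 2  # fp16: 2 bytes/element -> 32 MiB

print(bytes_fp32 // 2**20, "MiB fp32,", bytes_fp16 // 2**20, "MiB fp16")
```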
Could you talk a bit about the impact on per-GPU memory usage here?
The beauty of this PR is that the impact on GPU memory usage is now zero :) The small parameters are just views of a bigger buffer, so the bucketing takes no extra space (on top of taking no extra CPU cycles).
I just realized that I should resize it to match the last bucketed parameter though; right now the remainder is unused and does increase the GPU memory a tiny bit.
Hmm, the test jobs do show slightly larger memory consumption, looking into that.
OK, that makes sense given the initial extra memory allocation, before the parameters have been folded into the buffer. I've updated the docstring to reflect that.
Before submitting
What does this PR do?
Speeds things up by a couple of percent, removes a lot of code, and saves some memory (buckets are free).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃