Enable bucketized all-reduce for gradients#7216
Conversation
|
Something wrong with the build @JackCaoG . Maybe it is one-off? I don't know how to restart the run though. |
| import torch_xla.distributed.xla_backend | ||
| import torch.distributed as dist | ||
|
|
||
|
|
There was a problem hiding this comment.
Do you need to set the ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB envvar for this test?
There was a problem hiding this comment.
Also, do you need to add this test to run_tests.sh?
There was a problem hiding this comment.
yea this test is not being run in CI
There was a problem hiding this comment.
we don't need the env flag, the test runs the bucketized version directly
|
Hmm this is weird Let me rerun, if it still fails I will ask someone on our end to take a look. Sorry that CI gave you guys so much trouble,, |
|
cpu failures looks real seems like one of the buffer has been aliased(hence the buffer is |
|
Ci might be glictching.. You should add your test to |
|
@JackCaoG the error in the GPU run for torch_mp_op is not very clear. Do you know why it is failing? |
|
it is relevant, we can ignore. However the test in this pr is not being run. |
|
Hi @JackCaoG seems like there's some CI infra issue? |
|
yea github action is glitching, this affects all github projects. |
b1bf93c to
f332a8f
Compare
|
Hi, I saw we want to backport this PR into release 2.4, and mentioned in #7242, but looks like not backported in release 2.4, please correct me if I'm wrong now we are preparing 2.5 release, and would you mind help to describe more context about this PR's feature/use-case for or inspired from? @amithrm |
@ManfeiBai thanks for checking. Yeah we can drop the 2.4 backport request. This change adds bucketing of all-reduce so that it prevents small tensor all-reduces which are inefficient for DMAs. The bucketing aggregates/coaelesce small tensors until a specified size, and to one all-reduce on the aggregate. This feature is already part of all-gather/reduce-scatter. |
This PR adds bucketing (aka coalescing) to all-reduce, to increase DMA utilization and reduce DMA overhead associated with small data transfers.
Replaces #6417 .