implement send and recv using collective_permute #9373
Conversation
The approach implemented here works for a "pipeline"-type operation but does not work for a "permutation"-type operation. The way this is commonly done in native pytorch to avoid deadlocks is that half of the devices send while the other half receive, then they switch roles. This means the sending and receiving tensors must be different, so one half of the devices ends up with a different IR than the other half, which results in a deadlock. I'm still searching for a way around this.
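For context, a minimal sketch (not code from this PR) of the parity trick described above: even ranks send first and odd ranks receive first, so within each phase every blocking send has a matching receive. This pure-Python simulation of a ring exchange makes the ordering visible without a real process group; the rank/phase names are illustrative, and the trick assumes an even world size.

```python
# Hypothetical sketch of the parity-based send/recv ordering used with native
# torch.distributed to avoid deadlocks, simulated in pure Python. Even ranks
# send first, odd ranks receive first; assumes an even world size.

def phase_ops(rank: int) -> list:
    """Op order for this rank: even ranks send first, odd ranks receive first."""
    return ["send", "recv"] if rank % 2 == 0 else ["recv", "send"]

def simulate_ring_exchange(world_size: int) -> dict:
    """Simulate each rank passing its value to (rank + 1) % world_size.

    Runs the two phases in lockstep and checks that, within each phase,
    every blocking recv has a matching send (the deadlock-free condition).
    """
    values = {r: r * 10 for r in range(world_size)}  # each rank's payload
    received = {}
    for phase in range(2):
        in_flight = {}  # destination rank -> payload sent this phase
        for r in range(world_size):
            if phase_ops(r)[phase] == "send":
                in_flight[(r + 1) % world_size] = values[r]
        for r in range(world_size):
            if phase_ops(r)[phase] == "recv":
                # a blocking recv with no matching send would deadlock
                assert r in in_flight, f"rank {r} would deadlock in phase {phase}"
                received[r] = in_flight[r]
    return received
```

With `world_size=4`, every rank ends up with its left neighbor's payload and no phase ever blocks, because senders and receivers are perfectly paired within each phase.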
The only way I was able to make a "permutation"-type op (every device sends and every device receives) work is by inserting a sync after each send/recv pair. This is not ideal, but it's better than the status quo for TPU, where send/recv don't work at all. Since Neuron does have something working, I'll defer to you @rpsilva-aws. We can put this on ice until the Send/Recv XLA ops can be called directly.
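For reference, a pure-Python model of the CollectivePermute semantics the PR builds on (this is not the torch_xla API, just an illustration of what the underlying XLA op computes): each `(source, target)` pair routes the source device's value to the target, and any device that is not a target of some pair receives zero.

```python
# Illustrative model of XLA's CollectivePermute semantics in plain Python.
# `per_device` stands in for the per-device input values, indexed by ordinal.

def collective_permute(per_device, source_target_pairs):
    """Route per_device[src] to position dst for each (src, dst) pair.

    Devices that are not the target of any pair receive 0, mirroring
    the zero-fill behavior of the XLA op for unaddressed targets.
    """
    out = [0] * len(per_device)
    for src, dst in source_target_pairs:
        out[dst] = per_device[src]
    return out
```

A "pipeline"-style ring shift is then just `collective_permute(vals, [(0, 1), (1, 2), (2, 3), (3, 0)])`; a general "permutation"-type exchange uses an arbitrary pair list, which is where the per-device IR divergence and sync requirement discussed above come in.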
Hm, that does complicate things... I have it working on TRN, though I deviated a bit, using multi-operands to capture tokens. I'll end up creating a PR for this one, which would build upon the work you had in the prior commits. Actually, TRN has the same limitation for send/recv, requiring a graph break. Do you think we can merge this PR without the sync, since it's working for existing devices (e.g. TRN), and revisit as we figure out the underlying issues with TPU? If you want to defer until the new ops, or we re-raise the need as we bring in our work, both are OK with me.
> There are two tests in the PR,

I'd be interested in seeing that.
@bfolie I tried your changes (on top of ToT) on a neuron device and also wrote a small test specific to the neuron device which uses 2 ranks. I also tried reproducing the cpu xla_op test failures from this PR and was not able to reproduce those locally. cc: @rpsilva-aws
Good to know. I would expect that this PR as it currently stands, with the syncs on both send and receive, would work for all devices. It's not the most efficient, but if this works for you then we can go ahead and merge it.
rpsilva-aws left a comment
We need tests under neuron/ too, but we can do that separately. LGTM.
#9315