[Core][Distributed] refactor pynccl to hold multiple communicators #4591
youkaichao merged 44 commits into vllm-project:main
Conversation
WoosukKwon
left a comment
@youkaichao Thanks a lot for the PR! The refactoring makes sense.
Left some comments on code style and possible errors in the PR. Please check my review.
    # A small all_reduce for warmup.
    data = torch.zeros(1)
    if torch.cuda.is_available():
        data = data.to(device=f"cuda:{local_rank}")
    torch.distributed.all_reduce(data)
I feel the warmup should not be part of this method?
Which method do you think is better?
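To make the reviewer's point concrete, here is a minimal, hypothetical sketch (class and method names are invented, not vLLM's actual code) of splitting the warmup out of the initialization method, so callers can decide when, or whether, to run it:

```python
# Hypothetical sketch, not vLLM's actual code: separate initialization
# from warmup instead of bundling both into one method.
class DistributedState:
    def __init__(self, local_rank):
        self.local_rank = local_rank
        self.initialized = True  # stand-in for process-group setup

    def warmup(self):
        # stand-in for the small all_reduce that forces lazy backend
        # initialization; the real code would call
        # torch.distributed.all_reduce on a tiny tensor here
        assert self.initialized
        return "warmed up"

state = DistributedState(local_rank=0)
print(state.warmup())
```

With this split, the warmup becomes an explicit, separately testable step rather than a hidden side effect of initialization.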
vllm/distributed/parallel_state.py
Outdated
    _TP_DEVICE_GROUP = group
    _TP_CPU_GROUP = cpu_group

    from vllm.distributed.device_communicators.pynccl import NCCLCommunicator
Again, why do we need this lazy import?
The lazy import is required here to avoid a circular import: vllm.distributed.device_communicators.pynccl imports from vllm/distributed/parallel_state.py.
If that's the case, I think it means it's a bad design tbh. We should use lazy imports only to avoid unnecessary imports, not to avoid circular imports. Otherwise, the code becomes too complicated.
Yeah, we can have a better design. The reason we have this circular import is that we tried very hard to figure out the default argument for the group (which requires the import from vllm/distributed/parallel_state.py). We can remove this, but it might break some old code. (I can do it if you think it is a good idea.)
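As a rough illustration of the fix discussed above (all names here are invented for the sketch, not vLLM's actual API): the cycle comes from the communicator module importing the higher-level module just to fill in a default `group` argument. Requiring the caller to pass the group explicitly removes that import edge:

```python
# Hypothetical sketch: the communicator used to import parallel_state to
# obtain a default `group`, while parallel_state imported the communicator,
# creating a cycle. Making `group` a required argument breaks the cycle.
class Communicator:
    def __init__(self, group):
        # no module-level default: the caller supplies the group explicitly
        self.group = group

def create_communicators(group_names):
    # the higher-level module constructs communicators, so imports flow in
    # one direction only (parallel_state -> communicator, never back)
    return {name: Communicator(name) for name in group_names}

comms = create_communicators(["tp", "pp"])
print(sorted(comms))
```

The cost, as noted in the discussion, is that callers relying on the old implicit default would need to be updated.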
@WoosukKwon one change after our discussion: c1b1cdb
WoosukKwon
left a comment
LGTM. Thanks for addressing my review! Looking forward to the planned refactoring!
[Core][Distributed] refactor pynccl to hold multiple communicators (vllm-project#4591)
Currently pynccl is bound to a module-level instance, and we use it only for the tensor-parallel group.
After this refactor, we can create as many pynccl communicator instances as we want, e.g. a new pynccl communicator for the pipeline-parallel group.
This is an ongoing effort to support pipeline parallelism (#4412).
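As a hypothetical sketch of the "one communicator per group" idea the commit message describes (the class name and constructor signature are invented for illustration, not the PR's actual API):

```python
# Hypothetical sketch: after the refactor, each parallel group can own its
# own communicator instance instead of sharing a single module-level one.
class PyNcclCommunicator:
    def __init__(self, group_name, ranks):
        self.group_name = group_name
        self.ranks = list(ranks)

    def all_reduce(self, value):
        # stand-in for an NCCL all_reduce: with every rank contributing
        # `value`, a sum-reduce yields value * group size
        return value * len(self.ranks)

# one instance per group, e.g. tensor-parallel and pipeline-parallel
tp_comm = PyNcclCommunicator("tensor_parallel", ranks=[0, 1, 2, 3])
pp_comm = PyNcclCommunicator("pipeline_parallel", ranks=[0, 4])
print(tp_comm.all_reduce(1), pp_comm.all_reduce(1))
```

The design point is that communicator state lives in instances keyed to a group, so adding a new parallelism dimension means constructing another instance rather than touching module-level globals.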