About torch.UntypedStorage._new_shared_cuda #161481
Closed
Labels: module: cuda (Related to torch.cuda, and CUDA support in general), module: multiprocessing (Related to torch.multiprocessing), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
I am using functions from torch.multiprocessing.reductions for cross-process tensor sharing. However, I noticed that torch.UntypedStorage._new_shared_cuda can sometimes take hundreds of milliseconds to execute. In performance-critical scenarios, such as model parameter synchronization in RLHF, this latency has a noticeable impact on overall throughput. Is there a better way to address or mitigate this issue?
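One common mitigation for this kind of cost is to open each shared CUDA IPC handle only once per receiving process and reuse the mapping on subsequent rebuilds, since the first open of a handle is typically where the expensive address-space mapping happens. The sketch below illustrates only the caching idea in plain Python; `open_ipc_handle` is a hypothetical stand-in for the expensive open performed under the hood by `_new_shared_cuda`, not a real PyTorch API.

```python
import functools

def open_ipc_handle(handle: bytes) -> str:
    # Hypothetical stand-in for the expensive step inside
    # torch.UntypedStorage._new_shared_cuda: mapping the exporting
    # process's CUDA allocation into this process's address space.
    # In real use this is where the hundreds of milliseconds go.
    return f"mapped:{handle.hex()}"

# Memoize by handle bytes so each distinct handle pays the
# mapping cost only on its first use in this process.
@functools.lru_cache(maxsize=None)
def open_ipc_handle_cached(handle: bytes) -> str:
    return open_ipc_handle(handle)

# First call performs the (simulated) expensive open; repeated
# calls with the same handle return the cached mapping instantly.
first = open_ipc_handle_cached(b"\x01\x02")
again = open_ipc_handle_cached(b"\x01\x02")
```

In a real setup the cached value would be the rebuilt storage (or the tensor wrapping it), keyed by the IPC handle received from the producer process, so repeated parameter-sync rounds that reuse the same allocations skip the rebuild entirely.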
cc @VitalyFedyunin @albanD @pragupta @ptrblck @msaroufim @eqy @jerryzh168