About torch.UntypedStorage._new_shared_cuda #161481

Description

@Schnabel-8

I am using some functions from torch.multiprocessing.reductions for cross‑process tensor sharing. However, I noticed that the torch.UntypedStorage._new_shared_cuda function can sometimes take hundreds of milliseconds to execute. In performance‑critical scenarios — for example, model parameter synchronization in RLHF — this latency can have a noticeable impact on overall performance. Is there a better way to address or mitigate this issue?
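The path in question can be exercised end-to-end with `torch.multiprocessing`: sending a CUDA tensor through a queue pickles it via `reduce_tensor`, and receiving it in the child rebuilds the storage, which is where `torch.UntypedStorage._new_shared_cuda` (and the underlying `cudaIpcOpenMemHandle`) runs. A minimal sketch for reproducing the latency; the timing harness, tensor size, and printed message are illustrative, and the measured interval includes queue wait, not only the `_new_shared_cuda` call itself:

```python
import time

import torch
import torch.multiprocessing as mp


def consumer(q, out_q):
    # Receiving the tensor triggers the rebuild path, including
    # torch.UntypedStorage._new_shared_cuda, which opens the CUDA IPC
    # handle exported by the producer. Note: the interval below also
    # includes time spent blocked waiting on the queue.
    start = time.perf_counter()
    t = q.get()
    out_q.put((time.perf_counter() - start, float(t.sum())))


if __name__ == "__main__":
    if torch.cuda.is_available():
        mp.set_start_method("spawn", force=True)
        q, out_q = mp.Queue(), mp.Queue()
        p = mp.Process(target=consumer, args=(q, out_q))
        p.start()
        # Keep a live reference on the producer side while the
        # consumer holds the shared storage.
        t = torch.ones(1024, device="cuda")
        q.put(t)
        elapsed, total = out_q.get()
        p.join()
        print(f"first receive took {elapsed * 1000:.1f} ms, sum={total}")
    else:
        print("CUDA not available; this sharing path requires a GPU.")
```

The first receive on a given (producer, consumer) pair is typically the expensive one, since opening the IPC memory handle and initializing the consumer's CUDA context happen then; PyTorch caches opened handles afterwards, so one mitigation is to warm the channel up front rather than on the critical path.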

cc @VitalyFedyunin @albanD @pragupta @ptrblck @msaroufim @eqy @jerryzh168

Metadata

    Labels

    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: multiprocessing (Related to torch.multiprocessing)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
