About torch.UntypedStorage._new_shared_cuda #161481

Description

@Schnabel-8

I am using some functions from torch.multiprocessing.reductions for cross‑process tensor sharing. However, I noticed that the torch.UntypedStorage._new_shared_cuda function can sometimes take hundreds of milliseconds to execute. In performance‑critical scenarios — for example, model parameter synchronization in RLHF — this latency can have a noticeable impact on overall performance. Is there a better way to address or mitigate this issue?
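The path in question can be exercised end-to-end with `torch.multiprocessing`: sending a CUDA tensor through a queue pickles it via `reduce_tensor`, and receiving it in the child rebuilds the storage, which is where `torch.UntypedStorage._new_shared_cuda` (and the underlying `cudaIpcOpenMemHandle`) runs. A minimal sketch for reproducing the latency; the timing harness, tensor size, and printed message are illustrative, and the measured interval includes queue wait, not only the `_new_shared_cuda` call itself:

```python
import time

import torch
import torch.multiprocessing as mp


def consumer(q, out_q):
    # Receiving the tensor triggers the rebuild path, including
    # torch.UntypedStorage._new_shared_cuda, which opens the CUDA IPC
    # handle exported by the producer. Note: the interval below also
    # includes time spent blocked waiting on the queue.
    start = time.perf_counter()
    t = q.get()
    out_q.put((time.perf_counter() - start, float(t.sum())))


if __name__ == "__main__":
    if torch.cuda.is_available():
        mp.set_start_method("spawn", force=True)
        q, out_q = mp.Queue(), mp.Queue()
        p = mp.Process(target=consumer, args=(q, out_q))
        p.start()
        # Keep a live reference on the producer side while the
        # consumer holds the shared storage.
        t = torch.ones(1024, device="cuda")
        q.put(t)
        elapsed, total = out_q.get()
        p.join()
        print(f"first receive took {elapsed * 1000:.1f} ms, sum={total}")
    else:
        print("CUDA not available; this sharing path requires a GPU.")
```

The first receive on a given (producer, consumer) pair is typically the expensive one, since opening the IPC memory handle and initializing the consumer's CUDA context happen then; PyTorch caches opened handles afterwards, so one mitigation is to warm the channel up front rather than on the critical path.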

cc @VitalyFedyunin @albanD @pragupta @ptrblck @msaroufim @eqy @jerryzh168

Metadata

    Labels

    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: multiprocessing (Related to torch.multiprocessing)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
