
TCPStore.delete is 150x to 600x slower than TCPStore.set #53872

@PetrochukM

Description

🐛 Bug

For some reason, TCPStore.delete_key is very slow every once in a while.

To Reproduce

In my code, TCPStore.delete_key is 600x slower than TCPStore.set. The code below reproduces a 150x slowdown:

import time
from datetime import timedelta

import torch
import torch.cuda
import torch.distributed
import torch.multiprocessing
import torch.nn
import torch.optim
import torch.utils.data

world_size = 4
master_rank = 0


def init_distributed(
    rank: int,
    world_size: int = world_size,
    backend: str = "nccl",
    hostname: str = "127.0.0.1",
    port: int = 29500,
    timeout: timedelta = timedelta(minutes=30),
) -> torch.distributed.TCPStore:
    # Only the master rank hosts the TCPStore server; the other ranks connect to it.
    is_master = rank == master_rank
    store = torch.distributed.TCPStore(hostname, port, world_size, is_master, timeout)
    torch.distributed.init_process_group(
        backend, store=store, timeout=timeout, world_size=world_size, rank=rank
    )
    print("Worker %d started." % rank)
    return store


def run_worker(rank):
    store = init_distributed(rank)
    for _ in range(5):
        if rank == master_rank:
            # Keys written by the non-master workers.
            keys = [str(i) for i in range(world_size) if i != master_rank]

            start = time.time()
            for k in keys:
                store.wait([k])
            print("`wait` ran for", (time.time() - start) * 1000, "milliseconds")

            start = time.time()
            for k in keys:
                store.get(k)
            print("`get` ran for", (time.time() - start) * 1000, "milliseconds")

            start = time.time()
            for k in keys:
                store.delete_key(k)
            print("`delete_key` ran for", (time.time() - start) * 1000, "milliseconds")
        else:
            start = time.time()
            store.set(str(rank), str(rank))
            print("[%s] `set` ran for" % rank, (time.time() - start) * 1000, "milliseconds")
        time.sleep(1)


if __name__ == "__main__":
    torch.multiprocessing.spawn(run_worker, nprocs=world_size, args=tuple())

Sample output:

Worker 0 started.
Worker 2 started.
Worker 1 started.
Worker 3 started.
[1] `set` ran for 0.05698204040527344 milliseconds
[3] `set` ran for 0.05507469177246094 milliseconds
[2] `set` ran for 0.06079673767089844 milliseconds
`wait` ran for 0.2951622009277344 milliseconds
`get` ran for 0.26607513427734375 milliseconds
`delete_key` ran for 41.30744934082031 milliseconds <--- 41 milliseconds!
[3] `set` ran for 0.0972747802734375 milliseconds
[2] `set` ran for 0.095367431640625 milliseconds
[1] `set` ran for 0.06556510925292969 milliseconds
`wait` ran for 0.24962425231933594 milliseconds
`get` ran for 0.377655029296875 milliseconds
`delete_key` ran for 0.1647472381591797 milliseconds
[3] `set` ran for 0.10418891906738281 milliseconds
[1] `set` ran for 0.07224082946777344 milliseconds
[2] `set` ran for 0.07963180541992188 milliseconds
`wait` ran for 0.3006458282470703 milliseconds
`get` ran for 0.27680397033691406 milliseconds
`delete_key` ran for 44.26217079162598 milliseconds <--- 44 milliseconds!
[2] `set` ran for 0.09489059448242188 milliseconds
[3] `set` ran for 0.09655952453613281 milliseconds
[1] `set` ran for 0.09250640869140625 milliseconds
`wait` ran for 0.3743171691894531 milliseconds
`get` ran for 0.2734661102294922 milliseconds
`delete_key` ran for 0.13303756713867188 milliseconds
[2] `set` ran for 0.10132789611816406 milliseconds
[3] `set` ran for 0.07557868957519531 milliseconds
[1] `set` ran for 0.07748603820800781 milliseconds
`wait` ran for 0.38695335388183594 milliseconds
`get` ran for 0.29587745666503906 milliseconds
`delete_key` ran for 44.054508209228516 milliseconds <--- 44 milliseconds!
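
The headline ratio can be checked directly against the sample output above, e.g. the last iteration's `delete_key` latency against its `get` latency (numbers copied verbatim from the run):

```python
# Latencies in milliseconds, taken from the last iteration of the sample output.
get_ms = 0.29587745666503906
delete_ms = 44.054508209228516

ratio = delete_ms / get_ms
print(f"delete_key is ~{ratio:.0f}x slower than get")
```

This lands at roughly 150x, matching the slowdown claimed at the top of the report.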

Expected behavior

delete_key should be comparably fast to other write operations, such as set.
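
As an aside, single-shot `time.time()` measurements like the ones in the repro can be noisy; taking a median over repeats makes an intermittent ~40 ms stall stand out clearly against sub-millisecond jitter. A minimal sketch (the `cheap_op` stand-in below is a hypothetical placeholder, not a PyTorch API; with a real store you would pass e.g. `lambda: store.set("k", "v")`):

```python
import statistics
import time


def median_latency_ms(op, repeats=20):
    """Return the median wall-clock latency of calling `op`, in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in for a cheap store call; with a TCPStore this might be:
#   median_latency_ms(lambda: store.set("k", "v"))
#   median_latency_ms(lambda: store.delete_key("k"))
cheap_op = lambda: sum(range(100))
print(f"median latency: {median_latency_ms(cheap_op):.3f} ms")
```

A bimodal distribution (most calls fast, some pinned near 40 ms) would show up here as a large gap between the median and the maximum sample.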

Environment

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-nlp==0.5.0
[pip3] torch==1.7.1
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Labels

oncall: distributed
