
Can't pickle local object 'DistributedDataParallel._register_nccl_grad_hook.<locals>.allreduce_hook' #11683

@PetrochukM

Description

Issue description

$ python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 abcd.py
Traceback (most recent call last):
  File "abcd.py", line 18, in <module>
    torch.save(model, 'model.pt')
  File "/home/michaelp/.local/lib/python3.6/site-packages/torch/serialization.py", line 209, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/michaelp/.local/lib/python3.6/site-packages/torch/serialization.py", line 134, in _with_file_like
    return body(f)
  File "/home/michaelp/.local/lib/python3.6/site-packages/torch/serialization.py", line 209, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/michaelp/.local/lib/python3.6/site-packages/torch/serialization.py", line 282, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'DistributedDataParallel._register_nccl_grad_hook.<locals>.allreduce_hook'
(The same traceback is printed once per launched process.)
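torch.save serializes the object with pickle, and in 0.4.1 DistributedDataParallel registers allreduce_hook as a function nested inside _register_nccl_grad_hook; Python's pickle cannot serialize such local objects by reference. The same limitation can be reproduced with the stdlib alone (make_obj and local_hook below are illustrative names, not PyTorch code):

```python
import pickle

def make_obj():
    # A nested function, analogous to allreduce_hook inside
    # _register_nccl_grad_hook: pickle can only serialize functions
    # importable by their qualified name, which a local cannot be.
    def local_hook():
        pass
    return local_hook

try:
    pickle.dumps(make_obj())
except AttributeError as exc:
    # AttributeError: Can't pickle local object 'make_obj.<locals>.local_hook'
    print(exc)
```

Any object graph that reaches such a closure, like a DDP-wrapped model, fails the same way.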

Code example

import argparse
import torch

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()

    device = torch.device('cuda', args.local_rank)

    torch.distributed.init_process_group(backend='nccl')
    model = torch.nn.LSTM(10, 10).to(device)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank, dim=1)

    torch.save(model, 'model.pt')
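A common workaround (not part of the original report) is to save the parameters rather than pickling the DDP wrapper: ddp_model.module is the underlying nn.Module, and its state_dict contains only tensors, which pickle handles fine. A minimal CPU-only sketch, using a bare LSTM to stand in for ddp_model.module:

```python
import torch

# Stand-in for ddp_model.module (the LSTM wrapped by DDP above).
model = torch.nn.LSTM(10, 10)

# Save only the weights; no closures are reachable from a state_dict.
torch.save(model.state_dict(), 'model.pt')

# To restore, rebuild the architecture and load the saved weights.
restored = torch.nn.LSTM(10, 10)
restored.load_state_dict(torch.load('model.pt'))
```

In the script above this would be torch.save(model.module.state_dict(), 'model.pt'), with the DDP wrapper re-applied after loading.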

System Info

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-16ubuntu3) 7.3.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 390.30
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3

Metadata

Labels

oncall: distributed (add this issue/PR to the distributed oncall triage queue)