
[BUG] version 0.6.7 is throwing RuntimeError: trying to initialize the default process group twice! #2117

@pacman100

Description

Describe the bug
Using the Accelerate integration of DeepSpeed with the config below (a programmatic equivalent is sketched after the listing).

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}
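
For reference, a minimal sketch of how roughly the same settings could be passed programmatically instead of via `accelerate config`, assuming the `DeepSpeedPlugin` keyword arguments mirror the `deepspeed_config` keys in the dump above (an illustration, not part of the original report):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch only: keyword names are assumed to match the deepspeed_config keys above.
deepspeed_plugin = DeepSpeedPlugin(
    gradient_accumulation_steps=4,
    zero_stage=3,
    offload_optimizer_device="cpu",
    offload_param_device="none",
    zero3_init_flag=True,          # enables deepspeed.zero.Init during from_pretrained
    zero3_save_16bit_model=True,
)

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```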

To Reproduce
Steps to reproduce the behavior:

  1. Run the accelerate config command to set the above DeepSpeed config.
  2. Run accelerate launch complete_nlp_example.py (the official complete_nlp_example.py example).
  3. The error below is thrown when using version 0.6.7, whereas version 0.6.5 works fine.
File "complete_nlp_example.py", line 128, in training_function
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
  File "/home/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/sourab/transformers/src/transformers/modeling_utils.py", line 2065, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
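
For context, here is a minimal sketch (not the actual DeepSpeed code path) of why the second call fails: Accelerate initializes the default process group at launch, and when deepspeed.zero.Init later calls torch.distributed.init_process_group again without checking, PyTorch raises the RuntimeError shown above. A guard on torch.distributed.is_initialized() avoids it. The single-process "gloo" setup below is an assumption chosen just to make the example runnable:

```python
import torch.distributed as dist

# First initialization of the default process group (what Accelerate does at launch).
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

# Calling init_process_group again unconditionally would raise:
# RuntimeError: trying to initialize the default process group twice!
# Checking is_initialized() first skips the redundant call.
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=0,
        world_size=1,
    )
```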

Expected behavior
No error when using the latest version, 0.6.7.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 10.2

System info (please complete the following information):

  • OS: Ubuntu 20.04.3 LTS (Focal Fossa)
  • GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
  • Python version: Python 3.8.10

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
The Accelerate launcher, which in turn triggers the DeepSpeed launcher.

Additional context
Original issue raised in the Accelerate repo: huggingface/accelerate#536
