
[BUG] version 0.6.7 is throwing RuntimeError: trying to initialize the default process group twice! #2117

@pacman100

Description

Describe the bug
Using the Accelerate integration of DeepSpeed with the config below (a programmatic equivalent is sketched after the listing).

- `Accelerate` version: 0.12.0.dev0
- Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.0
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}
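
For reference, a minimal sketch of how roughly the same settings could be passed programmatically instead of via `accelerate config`, assuming the `DeepSpeedPlugin` keyword arguments mirror the `deepspeed_config` keys in the dump above (an illustration, not part of the original report):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Sketch only: keyword names are assumed to match the deepspeed_config keys above.
deepspeed_plugin = DeepSpeedPlugin(
    gradient_accumulation_steps=4,
    zero_stage=3,
    offload_optimizer_device="cpu",
    offload_param_device="none",
    zero3_init_flag=True,          # enables deepspeed.zero.Init during from_pretrained
    zero3_save_16bit_model=True,
)

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```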

To Reproduce
Steps to reproduce the behavior:

  1. Run the accelerate config command to set the above DeepSpeed config.
  2. Run accelerate launch complete_nlp_example.py (the official complete_nlp_example.py example).
  3. The error below is thrown when using version 0.6.7, whereas version 0.6.5 works fine.
File "complete_nlp_example.py", line 128, in training_function
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
  File "/home/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/sourab/transformers/src/transformers/modeling_utils.py", line 2065, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/home/sourab/dev/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/home/sourab/dev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!
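
For context, here is a minimal sketch (not the actual DeepSpeed code path) of why the second call fails: Accelerate initializes the default process group at launch, and when deepspeed.zero.Init later calls torch.distributed.init_process_group again without checking, PyTorch raises the RuntimeError shown above. A guard on torch.distributed.is_initialized() avoids it. The single-process "gloo" setup below is an assumption chosen just to make the example runnable:

```python
import torch.distributed as dist

# First initialization of the default process group (what Accelerate does at launch).
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

# Calling init_process_group again unconditionally would raise:
# RuntimeError: trying to initialize the default process group twice!
# Checking is_initialized() first skips the redundant call.
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=0,
        world_size=1,
    )
```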

Expected behavior
No error when using the latest version, 0.6.7.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 10.2

System info (please complete the following information):

  • OS: Ubuntu 20.04.3 LTS (Focal Fossa)
  • GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
  • Python version: Python 3.8.10

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
The Accelerate launcher, which in turn triggers the DeepSpeed launcher.

Additional context
Original issue raised in the Accelerate repo: huggingface/accelerate#536
