[BUG][0.6.7] garbage output for multi-gpu with tutorial

**Describe the bug**
When running GPU = 2 started to see garbage output generated.

```
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの (  (  "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}
```


**To Reproduce**
I am running with 2 GPU instance with V100, also reproducible using A100.

Just follow this example: https://www.deepspeed.ai/tutorials/inference-tutorial/

```python
# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)



generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
					   replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```


```
deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
```

**Expected behavior**

Should be just normal?

**ds_report output**

```
# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
```

**Screenshots**

```
f': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 29, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
[2022-07-19 23:58:00,131] [INFO] [engine.py:144:__init__] Place model to device: 1
[2022-07-19 23:58:00,153] [INFO] [engine.py:144:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの (  (  "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}]
[2022-07-19 23:58:04,674] [INFO] [launch.py:210:main] Process 811 exits successfully.
[2022-07-19 23:58:05,675] [INFO] [launch.py:210:main] Process 810 exits successfully.
```


**System info (please complete the following information):**
 - OS:Ubuntu 20.04
 - GPU count and types 2 V100
 - Python version 3.8




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG][0.6.7] garbage output for multi-gpu with tutorial #2113

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG][0.6.7] garbage output for multi-gpu with tutorial #2113

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions