Describe the bug
Similar to #2113, this bug relates to garbage output when using multi-GPU inference. In that issue @RezaYazdaniAminabadi made a fix (#2198) that resolved the problem for GPT Neo 2.7B; after building from master I can confirm it fixed multi-GPU inference for GPT Neo 2.7B. For GPT-J, however, the issue remains:
Output from 2 3090s for GPT-J:
[{'generated_text': 'DeepSpeed is,: to,,/ &.. by and.. a\n.. and- and.. the,,\n of\n [.,.\n:, &-. and a- the,\n\n). the'}]
Meanwhile, output from 1 3090 for GPT-J:
[{'generated_text': 'DeepSpeed is a leading deep learning framework designed for distributed training and inference on heterogeneous accelerators and CPUs. Our paper (https://arxiv.org/abs/1811.11540) describes an optimized deep architecture and inference engine and'}]
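Since do_sample=True makes every run stochastic, a deterministic greedy run is a cleaner way to show the divergence. A minimal sketch, reusing the generator from the repro script below (do_sample=False and max_new_tokens are standard transformers generation arguments):

# Greedy decoding is deterministic, so the 1-GPU and 2-GPU runs should
# produce identical text if tensor-parallel inference is working correctly.
output = generator("DeepSpeed is", do_sample=False, max_new_tokens=50)
print(output)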
To Reproduce
Steps to reproduce the behavior:
- Install DeepSpeed from source at master
- pip install transformers
- Run the script below with 2 GPUs to get garbage output
- Run it with 1 GPU to get good output
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = 'EleutherAI/gpt-j-6B'
# model_name = 'EleutherAI/gpt-neo-2.7B'  # works on 2 GPUs after #2198

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Load the model once in fp16 and hand it to the pipeline directly,
# so the 6B checkpoint is not loaded a second time inside pipeline().
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                     device=local_rank)

# Shard the model across world_size GPUs with DeepSpeed kernel injection.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
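A small debug variant that drops the rank-0 guard and prints from every rank shows whether both ranks emit the same garbled text or diverge:

# Print from every rank instead of only rank 0, to compare outputs per GPU.
rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
print(f"rank {rank}: {string}")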
Expected behavior
I would expect coherent output, like the single-GPU result above.
ds_report output
ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.1+7d8ad45, 7d8ad45, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 2 3090s
- Interconnects: single machine, 2 3090s
- Python version: 3.9.13
I am using a docker container with NVIDIA CUDA already set up as the base image.
Launcher context
deepspeed --num_gpus 2 infer.py
deepspeed --num_gpus 1 infer.py
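A quick experiment to rule out faulty peer-to-peer transfers between the two 3090s (NCCL_P2P_DISABLE is a standard NCCL environment variable; whether it changes the output here is untested):

NCCL_P2P_DISABLE=1 deepspeed --num_gpus 2 infer.py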
Docker context
Are you using a specific docker image that you can share?
nvidia/cuda:11.3.1-devel-ubuntu20.04
I then build the Python packages into the container.
Additional context
NA