Describe the bug
HF output:
[{'generated_text': 'DeepSpeed is the greatest force of nature. It is not only energy which, due to how well it creates energy, is very powerful. Some may say that the universe is only one such entity; others may claim that the cosmos is all a result'}]
DeepSpeed output with mp_size=1:
[{'generated_text': "DeepSpeed is the greatest speed at which your avatar is at (not how fast you can reach or fall, but how far you can fall).\n\nOn iOS and Android, if the player has reached or fallen too far they won't fly."}]
To Reproduce
import time

import torch
import deepspeed
from transformers import pipeline

model_name = "gpt2"
dtype = "fp16"

pipe = pipeline("text-generation", model=model_name, device=0, framework="pt")
if dtype == "fp16":
    pipe.model.half()

query_text = "DeepSpeed is the greatest"

# HF baseline
torch.cuda.synchronize()
start = time.time()
hf_output = pipe(query_text)
torch.cuda.synchronize()
hf_time = time.time() - start
print("hf output", hf_output)

# Same pipeline, with DeepSpeed kernel injection
pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=1,
    dtype=torch.half,
    replace_method="auto",
    replace_with_kernel_inject=True,
)

torch.cuda.synchronize()
start = time.time()
ds_output = pipe(query_text)
torch.cuda.synchronize()
ds_time = time.time() - start
print("ds output", ds_output)
Expected behavior
Outputs are expected to be the same (or at least close) between the plain HF pipeline and the DeepSpeed-injected one.
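One caveat when comparing outputs: text generation with sampling is stochastic, so identical outputs are only guaranteed under deterministic decoding (e.g. greedy search) or a fixed RNG seed, and even greedy decoding can diverge when fp16 kernels produce slightly different logits for near-tied tokens. A minimal, self-contained sketch (plain Python, hypothetical helper names, not the transformers API) illustrating both effects:

```python
import random

def sample_token(probs, seed):
    # Draw one token index from a categorical distribution.
    # A dedicated, seeded RNG makes the draw reproducible across runs.
    rng = random.Random(seed)
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

probs = [0.1, 0.6, 0.3]
# Same seed -> same sampled token, so two runs are comparable.
assert sample_token(probs, 42) == sample_token(probs, 42)

def greedy_token(logits):
    # Greedy decoding is deterministic: always pick the argmax.
    return max(range(len(logits)), key=lambda i: logits[i])

# But when two logits are nearly tied, an fp16-sized perturbation can
# flip the argmax -- one way numerically different kernels can still
# produce different greedy outputs.
assert greedy_token([1.0000, 0.9999]) != greedy_token([0.9999, 1.0000])
```

So for a fair HF-vs-DeepSpeed comparison, fixing the seed (or forcing greedy decoding) before each call narrows the diff down to numerical differences rather than sampling noise.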
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Screenshots
NA
System info (please complete the following information):
- OS: Ubuntu
- GPU count and types: 1x A100
- Interconnects (if applicable): N/A
- Python version: 3.8.3
Launcher context
inference, single process
Docker context
N/A