Describe the bug
My understanding of model parallelism is that the model is split across multiple GPUs to lower the memory usage per GPU, allowing larger models and faster inference. For GPT-Neo 2.7B on two 3090s, I would therefore expect roughly 5.5GB of VRAM per GPU, versus about 11GB in total if the model were placed on a single GPU.
The issue is that VRAM usage is roughly 11GB on each GPU. Additionally, when selecting torch.half for the dtype, VRAM usage stays just as high and does not change. Because of this, I also OOM for GPT-J on one or both GPUs.
To Reproduce
Steps to reproduce the behavior:
- Install DeepSpeed from source
- Install Transformers from pip
- Follow the inference tutorial (roughly the sketch after this list)
- Watch VRAM usage, or hit OOM with larger models
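For reference, a minimal sketch of the infer.py I run, assuming the tutorial's env-var based rank/world-size setup (the prompt and generation arguments here are placeholders):

# minimal infer.py sketch following the inference tutorial
import os

import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)

# swap in DeepSpeed inference kernels and (as I understand it) shard the
# model across world_size GPUs
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

print(generator('DeepSpeed is', do_sample=True, max_length=50))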
Expected behavior
I would expect per-GPU VRAM usage to decrease as more GPUs are used, and to decrease further when using lower-precision data types.
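For reference, the back-of-the-envelope weight memory I was basing that expectation on (weights only; activations and the KV cache would add to this):

# rough weight-memory estimate for GPT-Neo 2.7B (weights only)
params = 2.7e9
for name, bytes_per_param in [('fp32', 4), ('fp16', 2)]:
    total_gib = params * bytes_per_param / 1024**3
    print(f'{name}: ~{total_gib:.1f} GiB total, ~{total_gib / 2:.1f} GiB per GPU with mp_size=2')
# fp32: ~10.1 GiB total, ~5.0 GiB per GPU
# fp16: ~5.0 GiB total, ~2.5 GiB per GPU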
ds_report output
ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.1+7d8ad45, 7d8ad45, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 2x NVIDIA RTX 3090
- Interconnects: single machine, both 3090s in one system (no multi-node interconnect)
- Python version: 3.9.13
I am using a Docker container with NVIDIA CUDA already set up as the base image.
Launcher context
deepspeed --num_gpus 2 infer.py
deepspeed --num_gpus 1 infer.py
Both lead to the same VRAM usage per GPU.
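To compare the two launches, I print per-rank memory from inside infer.py right after deepspeed.init_inference(...). This is just a quick diagnostic sketch, reusing the local_rank variable from the script above:

# quick per-rank memory check, placed right after deepspeed.init_inference(...)
alloc_gib = torch.cuda.memory_allocated(local_rank) / 1024**3
reserved_gib = torch.cuda.memory_reserved(local_rank) / 1024**3
print(f'[rank {local_rank}] allocated: {alloc_gib:.1f} GiB, reserved: {reserved_gib:.1f} GiB')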
Docker context
Are you using a specific docker image that you can share?
nvidia/cuda:11.3.1-devel-ubuntu20.04
and then I install the Python packages into the container on top of it.
Additional context
When setting up the pipeline as in the tutorial, if I load the model first with the data type I want and then pass the model and tokenizer to the pipeline, it seems to lower the VRAM usage on one of the GPUs. Doing this and running with only one GPU rather than two let me run GPT-J, but I want to use two GPUs for inference.
# this works for one GPU
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# local_rank / world_size come from the deepspeed launcher's env vars
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model_name = 'EleutherAI/gpt-j-6B'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
# this leads to OOM on 2 3090s, with one GPU's usage looking normal and the other's not
# (same imports and local_rank/world_size setup as above)
model_name = 'EleutherAI/gpt-j-6B'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
# this also leads to OOM for 2 3090s
# (same setup as above)
model_name = 'EleutherAI/gpt-j-6B'
generator = pipeline('text-generation', model=model_name, device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
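To see whether the weights are actually being partitioned, I also count parameter elements per device and dtype after init_inference. This is a diagnostic sketch of my own; it assumes the object returned by deepspeed.init_inference still exposes .parameters() (it is a torch.nn.Module wrapper in the version listed above):

# diagnostic: where do the weights live after init_inference, and at what precision?
from collections import Counter

per_bucket = Counter()
for p in generator.model.parameters():
    per_bucket[(str(p.device), str(p.dtype))] += p.numel()
for (device, dtype), n in sorted(per_bucket.items()):
    print(f'[rank {local_rank}] {device} {dtype}: {n / 1e9:.2f}B elements')

With true model parallelism I would expect each rank to report roughly half of GPT-J's ~6B elements.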