Describe the bug
We currently want to run inference on the EleutherAI/gpt-j-6B model with tensor parallelism across multiple GPUs, similar to what is done for the BLOOM model. However, the way DeepSpeed inference saves and loads pre-sharded checkpoints does not seem consistent or general enough to cover other models.
To Reproduce
I started from the DeepSpeed inference script for BLOOM and modified lines 140-141 to
model = GPTJForCausalLM.from_pretrained(
"EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
and line 100 to
model = deepspeed.init_inference(
model,
mp_size=world_size,
base_dir=repo_root,
dtype=getattr(torch, infer_dtype),
save_mp_checkpoint_path=<some path to save mp checkpoint>,
**kwargs,
)
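For context, here is a minimal sketch of what the whole first (checkpoint-saving) run looks like after those two changes. The placeholder path, the world-size handling, and the replace_with_kernel_inject kwarg are assumptions on my side, mirroring the BLOOM script rather than anything specific to GPT-J:

# First run: load the full HF checkpoint, shard it with tensor parallelism,
# and write the sharded (MP) checkpoints plus ds_inference_config.json to disk.
# Launched with: deepspeed --num_gpus 2 <script>.py
import os
import torch
import deepspeed
from transformers import GPTJForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "2"))

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,                      # number of GPUs to shard across
    dtype=torch.float16,
    replace_with_kernel_inject=True,         # assumed, as in the BLOOM script
    save_mp_checkpoint_path="/path/to/mp_checkpoints",  # placeholder path
)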
After the first run on my 2x A6000 server, I obtained the checkpoints sharded for tensor parallelism under <some path to save mp checkpoint>, along with a configuration file ds_inference_config.json, shown below
{"type": "ds_model",
"base_dir": <some path to save mp checkpoint>,
"checkpoints": {"non-tp":["non-tp.pt"], "tp":["tp_00_00.pt", "tp_01_00.pt", "tp_00_01.pt", "tp_01_01.pt",
"tp_00_02.pt", "tp_01_02.pt", "tp_00_03.pt", "tp_01_03.pt", "tp_00_04.pt", "tp_01_04.pt",
, "tp_00_05.pt", "tp_01_05.pt", "tp_00_06.pt", "tp_01_06.pt", "tp_01_07.pt", "tp_01_07.pt"]},
"version": 1.0,
"parallelization": "tp",
"tp_size": 2,
"dtype": "float16}
For the second round, I undid the changes to lines 140-141, removed save_mp_checkpoint_path, and instead passed checkpoint=<some path to save mp checkpoint>/ds_inference_config.json to deepspeed.init_inference. This is the standard way to load the pre-sharded model for BLOOM, and it speeds up the loading process. However, this raises the following error
AssertionError: ds_model checkpoint type is not supported
This comes from the code in DeepSpeed inference that loads the JSON checkpoint description as a state_dict.
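For reference, the second-run call that triggers this looks roughly like the sketch below. It reuses the imports from the sketch above; the placeholder path, the meta-device model construction, and the kwargs are carried over from the BLOOM script and may not match it exactly:

# Second run: build the model without loading weights, then let DeepSpeed load
# the pre-sharded checkpoints listed in ds_inference_config.json.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B", revision="float16")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,         # assumed, as in the BLOOM script
    checkpoint="/path/to/mp_checkpoints/ds_inference_config.json",  # placeholder path
)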
I also tried changing the type in ds_inference_config.json to BLOOM, since the only supported formats for JSON checkpoints are BLOOM and Megatron, but this time the loading code fails with a different error
AttributeError: 'NoneType' object has no attribute 'is_meta'
Is the pre-sharded checkpoint loading feature limited to the BLOOM model only? How can I use tensor parallelism to split a single model across multiple GPUs?