Describe the bug
We currently want to run inference on the EleutherAI/gpt-j-6B model with tensor parallelism across multiple GPUs, similar to what is done for the BLOOM model. However, the way DeepSpeed inference saves and loads pre-sharded checkpoints does not seem consistent or general enough to cover other models.
To Reproduce
I started from the DeepSpeed inference script for BLOOM and modified lines 140-141 to
model = GPTJForCausalLM.from_pretrained(
"EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
and line 100 to
model = deepspeed.init_inference(
model,
mp_size=world_size,
base_dir=repo_root,
dtype=getattr(torch, infer_dtype),
save_mp_checkpoint_path=<some path to save mp checkpoint>,
**kwargs,
)
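For context, here is a minimal sketch of what the whole first (checkpoint-saving) run looks like after those two changes. The placeholder path, the world-size handling, and the replace_with_kernel_inject kwarg are assumptions on my side, mirroring the BLOOM script rather than anything specific to GPT-J:

# First run: load the full HF checkpoint, shard it with tensor parallelism,
# and write the sharded (MP) checkpoints plus ds_inference_config.json to disk.
# Launched with: deepspeed --num_gpus 2 <script>.py
import os
import torch
import deepspeed
from transformers import GPTJForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "2"))

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,                      # number of GPUs to shard across
    dtype=torch.float16,
    replace_with_kernel_inject=True,         # assumed, as in the BLOOM script
    save_mp_checkpoint_path="/path/to/mp_checkpoints",  # placeholder path
)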
After the first run on my 2x A6000 server, I obtained the checkpoints sharded for tensor parallelism under <some path to save mp checkpoint>, along with a configuration file ds_inference_config.json, shown below
{"type": "ds_model",
"base_dir": <some path to save mp checkpoint>,
"checkpoints": {"non-tp":["non-tp.pt"], "tp":["tp_00_00.pt", "tp_01_00.pt", "tp_00_01.pt", "tp_01_01.pt",
"tp_00_02.pt", "tp_01_02.pt", "tp_00_03.pt", "tp_01_03.pt", "tp_00_04.pt", "tp_01_04.pt",
, "tp_00_05.pt", "tp_01_05.pt", "tp_00_06.pt", "tp_01_06.pt", "tp_01_07.pt", "tp_01_07.pt"]},
"version": 1.0,
"parallelization": "tp",
"tp_size": 2,
"dtype": "float16}
For the second round, I undid the changes to lines 140-141, removed save_mp_checkpoint_path, and instead passed checkpoint=<some path to save mp checkpoint>/ds_inference_config.json to deepspeed.init_inference. This is the standard way to load the pre-sharded model for BLOOM, and it speeds up the loading process. However, this raises the following error
AssertionError: ds_model checkpoint type is not supported
This comes from the code in DeepSpeed inference that loads the JSON checkpoint description as a state_dict.
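For reference, the second-run call that triggers this looks roughly like the sketch below. It reuses the imports from the sketch above; the placeholder path, the meta-device model construction, and the kwargs are carried over from the BLOOM script and may not match it exactly:

# Second run: build the model without loading weights, then let DeepSpeed load
# the pre-sharded checkpoints listed in ds_inference_config.json.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B", revision="float16")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,         # assumed, as in the BLOOM script
    checkpoint="/path/to/mp_checkpoints/ds_inference_config.json",  # placeholder path
)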
I also tried changing the type in ds_inference_config.json to BLOOM, since the only supported formats for JSON checkpoints are BLOOM and Megatron, but this time the loading code fails with a different error
AttributeError: 'NoneType' object has no attribute 'is_meta'
Is the pre-sharded checkpoint loading feature limited to the BLOOM model only? How can I use tensor parallelism to split a single model across multiple GPUs?