Description
System Info
node 2 (throws the exception):
- `Accelerate` version: 0.23.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- Python version: 3.9.18
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 38.44 GB
- GPU type: Tesla V100-SXM2-32GB
`Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 1
- num_machines: 2
- main_process_ip: 10.8.0.7
- main_process_port: 29500
- rdzv_backend: static
- same_network: False
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
node 1:
- `Accelerate` version: 0.23.0
- Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
- Python version: 3.9.18
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 93.55 GB
- GPU type: NVIDIA A100-SXM4-40GB
`Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 2
- main_process_ip: 10.8.0.7
- main_process_port: 29500
- rdzv_backend: static
- same_network: False
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker @younesbelkada @sgugger @lewtun
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am following the blog post Fine-tuning Llama 2 70B using PyTorch FSDP.
I want to use two machines, each with 1 GPU card, to run the Llama 2 7B model. I am using the code provided by the blog here.
I ran `accelerate config` on each machine to set it up; see above for the detailed configurations. I am launching with the following command line on each machine:
```shell
accelerate launch train.py \
  --model_name "/home/lili/models_hf/7B-chat" \
  --dataset_name "smangrul/code-chat-assistant-v1" \
  --max_seq_len 2048 \
  --max_steps 1000 \
  --logging_steps 25 \
  --eval_steps 100 \
  --save_steps 500 \
  --bf16 False \
  --fp16 True \
  --packing True \
  --output_dir "full-finetune-llama-chat-asst" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --dataset_text_field "content" \
  --use_gradient_checkpointing \
  --learning_rate 5e-5 \
  --lr_scheduler_type "cosine" \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --use_flash_attn False
```
Error message:
```
Traceback (most recent call last):
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 238, in <module>
    main(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 178, in main
    model, peft_config, tokenizer = create_and_prepare_model(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/utils.py", line 184, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
```
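The exception itself can be reproduced in isolation (a minimal sketch, independent of transformers/accelerate): a tensor on the meta device carries only shape and dtype, no storage, so copying it to a real device cannot work.

```python
import torch

# Standalone demo of the failing call: a meta tensor has metadata only,
# so .to("cpu") has no data to copy.
t = torch.empty(2, 2, device="meta")
try:
    t.to("cpu")  # the same .to(device) call that fails inside accelerate
except NotImplementedError as e:
    print(type(e).__name__)
```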
I debugged the code and found that the two machines took different paths: the second machine (node_rank=1) went through `map_location = "meta"`, while the first went through `"cpu"`:
```python
if (
    (is_deepspeed_zero3_enabled() or is_fsdp_enabled())
    and torch.distributed.is_initialized()
    and torch.distributed.get_rank() > 0
):
    map_location = "meta"
else:
    map_location = "cpu"
```
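To make that branch concrete, here is a minimal sketch of the decision as a plain function (`map_location_for` is a made-up name for illustration, not a transformers API), applied to my 2-node, 1-GPU-per-node setup:

```python
def map_location_for(global_rank: int) -> str:
    # Mirrors the condition in modeling_utils.py: any process whose
    # *global* rank is > 0 loads checkpoint weights onto the meta
    # device, i.e. with no real data behind the tensors.
    return "meta" if global_rank > 0 else "cpu"

# My setup: two nodes with one GPU each.
# node 1 (machine_rank=0) -> global rank 0 -> loads real weights on CPU
# node 2 (machine_rank=1) -> global rank 1 -> meta tensors only
print(map_location_for(0))  # cpu
print(map_location_for(1))  # meta
```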
For FSDP initialization, only the process with local_rank=0 should load the full model into CPU memory and dispatch the sharded parameters to the other GPUs on its node. But as the code above shows, the check uses the global rank rather than the local rank. So my question is: suppose there are two nodes, each with 4 GPU cards:
|            | node 1 gpu0 | gpu1 | gpu2 | gpu3 | node 2 gpu0 | gpu1 | gpu2 | gpu3 |
|------------|-------------|------|------|------|-------------|------|------|------|
| rank       | 0           | 1    | 2    | 3    | 4           | 5    | 6    | 7    |
| local_rank | 0           | 1    | 2    | 3    | 0           | 1    | 2    | 3    |
For FSDP full sharding, node 1 will load all parameters into its CPU memory and then dispatch the sharded parameters to all 8 GPUs. Am I right?
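For illustration, the rank/local_rank layout in the table above can be reproduced with a few lines, assuming the standard torchrun numbering where global rank = node index * GPUs per node + local GPU index:

```python
# Sketch of the standard torchrun rank assignment for 2 nodes x 4 GPUs.
num_nodes, gpus_per_node = 2, 4
layout = []
for node in range(num_nodes):
    for gpu in range(gpus_per_node):
        rank = node * gpus_per_node + gpu  # global rank, unique across nodes
        local_rank = gpu                   # rank within one node
        layout.append((node, gpu, rank, local_rank))

for node, gpu, rank, local_rank in layout:
    print(f"node {node + 1} gpu{gpu}: rank={rank}, local_rank={local_rank}")
```

Under the `get_rank() > 0` check quoted above, only node 1's gpu0 (rank 0) would load the full weights to CPU; with a local-rank-based check, node 2's gpu0 (local_rank 0, global rank 4) would also load them, which is what I expected.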
Expected behavior
No exception.