Description
System Info
node 2 (throws the exception):
- `Accelerate` version: 0.23.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- Python version: 3.9.18
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 38.44 GB
- GPU type: Tesla V100-SXM2-32GB
`Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 1
- num_machines: 2
- main_process_ip: 10.8.0.7
- main_process_port: 29500
- rdzv_backend: static
- same_network: False
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
node 1:
- `Accelerate` version: 0.23.0
- Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
- Python version: 3.9.18
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 93.55 GB
- GPU type: NVIDIA A100-SXM4-40GB
`Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 2
- main_process_ip: 10.8.0.7
- main_process_port: 29500
- rdzv_backend: static
- same_network: False
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker @younesbelkada @sgugger @lewtun
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am following the blog post Fine-tuning Llama 2 70B using PyTorch FSDP.
I want to use two machines, each with 1 GPU card, to run the Llama 2 7B model. I am using the code provided by the blog here.
I ran `accelerate config` on each machine to set it up; see above for the detailed configurations. I am launching with the following command line on each machine:
```shell
accelerate launch train.py \
  --model_name "/home/lili/models_hf/7B-chat" \
  --dataset_name "smangrul/code-chat-assistant-v1" \
  --max_seq_len 2048 \
  --max_steps 1000 \
  --logging_steps 25 \
  --eval_steps 100 \
  --save_steps 500 \
  --bf16 False \
  --fp16 True \
  --packing True \
  --output_dir "full-finetune-llama-chat-asst" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --dataset_text_field "content" \
  --use_gradient_checkpointing \
  --learning_rate 5e-5 \
  --lr_scheduler_type "cosine" \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --use_flash_attn False
```
Error message:
```
Traceback (most recent call last):
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 238, in <module>
    main(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 178, in main
    model, peft_config, tokenizer = create_and_prepare_model(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/utils.py", line 184, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
```
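The exception itself can be reproduced in isolation (a minimal sketch, independent of transformers/accelerate): a tensor on the meta device carries only shape and dtype, no storage, so copying it to a real device cannot work.

```python
import torch

# Standalone demo of the failing call: a meta tensor has metadata only,
# so .to("cpu") has no data to copy.
t = torch.empty(2, 2, device="meta")
try:
    t.to("cpu")  # the same .to(device) call that fails inside accelerate
except NotImplementedError as e:
    print(type(e).__name__)
```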
I debugged the code and found that the two machines took different paths: the second machine (node_rank=1) went through `map_location = "meta"`, while the first went through `"cpu"`:
```python
if (
    (is_deepspeed_zero3_enabled() or is_fsdp_enabled())
    and torch.distributed.is_initialized()
    and torch.distributed.get_rank() > 0
):
    map_location = "meta"
else:
    map_location = "cpu"
```
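To make that branch concrete, here is a minimal sketch of the decision as a plain function (`map_location_for` is a made-up name for illustration, not a transformers API), applied to my 2-node, 1-GPU-per-node setup:

```python
def map_location_for(global_rank: int) -> str:
    # Mirrors the condition in modeling_utils.py: any process whose
    # *global* rank is > 0 loads checkpoint weights onto the meta
    # device, i.e. with no real data behind the tensors.
    return "meta" if global_rank > 0 else "cpu"

# My setup: two nodes with one GPU each.
# node 1 (machine_rank=0) -> global rank 0 -> loads real weights on CPU
# node 2 (machine_rank=1) -> global rank 1 -> meta tensors only
print(map_location_for(0))  # cpu
print(map_location_for(1))  # meta
```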
For FSDP initialization, only the process with local_rank=0 should load the full model into CPU memory and dispatch the sharded parameters to the other GPUs on its node. But as the code above shows, the check uses the global rank rather than the local rank. So my question is: suppose there are two nodes, each with 4 GPU cards:
|            | node 1 gpu0 | gpu1 | gpu2 | gpu3 | node 2 gpu0 | gpu1 | gpu2 | gpu3 |
|------------|-------------|------|------|------|-------------|------|------|------|
| rank       | 0           | 1    | 2    | 3    | 4           | 5    | 6    | 7    |
| local_rank | 0           | 1    | 2    | 3    | 0           | 1    | 2    | 3    |
For FSDP full sharding, node 1 will load all parameters into its CPU memory and then dispatch the sharded parameters to all 8 GPUs. Am I right?
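For illustration, the rank/local_rank layout in the table above can be reproduced with a few lines, assuming the standard torchrun numbering where global rank = node index * GPUs per node + local GPU index:

```python
# Sketch of the standard torchrun rank assignment for 2 nodes x 4 GPUs.
num_nodes, gpus_per_node = 2, 4
layout = []
for node in range(num_nodes):
    for gpu in range(gpus_per_node):
        rank = node * gpus_per_node + gpu  # global rank, unique across nodes
        local_rank = gpu                   # rank within one node
        layout.append((node, gpu, rank, local_rank))

for node, gpu, rank, local_rank in layout:
    print(f"node {node + 1} gpu{gpu}: rank={rank}, local_rank={local_rank}")
```

Under the `get_rank() > 0` check quoted above, only node 1's gpu0 (rank 0) would load the full weights to CPU; with a local-rank-based check, node 2's gpu0 (local_rank 0, global rank 4) would also load them, which is what I expected.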
Expected behavior
No exception.