
NotImplementedError: Cannot copy out of meta tensor; no data! #27166

@fancyerii

Description


System Info

Node 2 (throws the exception):

  • Accelerate version: 0.23.0
  • Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
  • Python version: 3.9.18
  • Numpy version: 1.24.1
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 38.44 GB
  • GPU type: Tesla V100-SXM2-32GB
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 1
    - num_machines: 2
    - main_process_ip: 10.8.0.7
    - main_process_port: 29500
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Node 1:

  • Accelerate version: 0.23.0
  • Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
  • Python version: 3.9.18
  • Numpy version: 1.24.1
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 93.55 GB
  • GPU type: NVIDIA A100-SXM4-40GB
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 2
    - main_process_ip: 10.8.0.7
    - main_process_port: 29500
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_forward_prefetch': True, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': False, 'fsdp_use_orig_params': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Who can help?

@ArthurZucker @younesbelkada @sgugger @lewtun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am following the blog "Fine-tuning Llama 2 70B using PyTorch FSDP".

I want to use two machines, each with one GPU, to run the Llama 2 7B model. I am using the code provided by the blog.

I ran `accelerate config` on each machine to set things up (see the detailed configs above), then launched with the following command on each machine:

accelerate launch  train.py \
--model_name "/home/lili/models_hf/7B-chat" \
--dataset_name "smangrul/code-chat-assistant-v1" \
--max_seq_len 2048 \
--max_steps 1000 \
--logging_steps 25 \
--eval_steps 100 \
--save_steps 500 \
--bf16 False \
--fp16 True \
--packing True \
--output_dir "full-finetune-llama-chat-asst" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--dataset_text_field "content" \
--use_gradient_checkpointing \
--learning_rate 5e-5  \
--lr_scheduler_type "cosine" \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--use_flash_attn False

error message:

Traceback (most recent call last):
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 238, in <module>
    main(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/train.py", line 178, in main
    model, peft_config, tokenizer = create_and_prepare_model(args)
  File "/nas/lili/deepspeedtest/chat_assistant/training/utils.py", line 184, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/lili/miniconda3/envs/py39_torch21/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
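For context, the failing operation can be reproduced in isolation: a tensor on the "meta" device carries only shape and dtype metadata, with no backing storage, so copying it to a real device raises exactly this error. A minimal sketch:

```python
import torch

# A "meta" tensor has shape and dtype but no data behind it.
t = torch.empty(2, 2, device="meta")

# Copying it to a real device has nothing to copy, so PyTorch raises
# NotImplementedError ("Cannot copy out of meta tensor; no data!").
try:
    t.to("cpu")
except NotImplementedError as exc:
    print(type(exc).__name__)  # prints "NotImplementedError"
```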

I debugged the code and found that the two machines take different paths: the second machine (node_rank=1) sets `map_location = "meta"`, while the first one sets it to `"cpu"`:

        if (
            (is_deepspeed_zero3_enabled() or is_fsdp_enabled())
            and torch.distributed.is_initialized()
            and torch.distributed.get_rank() > 0
        ):
            map_location = "meta"
        else:
            map_location = "cpu"
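To illustrate why node 2 hits the meta path, here is a small sketch of the branch above applied to each global rank (`map_location_for` is my own name for illustration, not a transformers function):

```python
# Hypothetical helper (not part of transformers) mirroring the branch above:
# every process whose *global* rank is > 0 loads checkpoints onto "meta".
def map_location_for(global_rank: int) -> str:
    return "meta" if global_rank > 0 else "cpu"

# 2 machines x 1 GPU each => global ranks 0 and 1.
# Node 1 (rank 0) loads real weights on CPU; node 2 (rank 1) gets meta
# tensors, matching the divergence observed while debugging.
print(map_location_for(0))  # cpu
print(map_location_for(1))  # meta
```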

For FSDP initialization, I expected only the process with local_rank=0 on each machine to load the full model to CPU and dispatch the sharded parameters to the other GPUs on that node. But as the code above shows, the check uses the global rank rather than the local rank. So my question is: suppose there are two nodes, each with 4 GPUs:

             node 1                  node 2
           gpu0 gpu1 gpu2 gpu3     gpu0 gpu1 gpu2 gpu3
rank          0    1    2    3        4    5    6    7
local_rank    0    1    2    3        0    1    2    3
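The rank layout above can be sketched programmatically, assuming ranks are assigned to nodes in contiguous blocks (the torchrun default):

```python
# Sketch of the global-rank -> (node, local_rank) mapping for 2 nodes x 4 GPUs,
# assuming contiguous block assignment of ranks to nodes.
GPUS_PER_NODE = 4
NUM_NODES = 2

for rank in range(NUM_NODES * GPUS_PER_NODE):
    node = rank // GPUS_PER_NODE + 1       # 1-based node index
    local_rank = rank % GPUS_PER_NODE      # rank within that node
    print(f"rank={rank} -> node {node}, local_rank={local_rank}")
```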

For FSDP full sharding, node 1 (rank 0) will load all parameters into its CPU memory and then dispatch the sharded parameters to all 8 GPUs. Am I right?

Expected behavior

No exception; the model should load successfully on both nodes.
