System Info
- `transformers` version: 4.44.0.dev0
- Platform: Linux-5.10.112-005.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.24.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0.dev0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: (True)
- Using GPU in script?: (True)
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
fsdp_config.json:

{
  "fsdp_transformer_layer_cls_to_wrap": [
    "Qwen2DecoderLayer"
  ],
  "xla": true,
  "xla_fsdp_settings": {
    "compute_dtype": "bfloat16",
    "buffer_dtype": "bfloat16",
    "pin_layout_in_collective_ops": false
  },
  "xla_fsdp_grad_ckpt": false
}
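As a quick sanity check before launching training, the config above can be parsed with the standard `json` module. This is a minimal sketch: the JSON is embedded as a string so the check is self-contained, and the asserted keys are just the ones used in this run, not an exhaustive validation of everything Trainer accepts.

```python
import json

# The fsdp_config.json from above, embedded so the check is self-contained.
cfg_text = """
{
  "fsdp_transformer_layer_cls_to_wrap": ["Qwen2DecoderLayer"],
  "xla": true,
  "xla_fsdp_settings": {
    "compute_dtype": "bfloat16",
    "buffer_dtype": "bfloat16",
    "pin_layout_in_collective_ops": false
  },
  "xla_fsdp_grad_ckpt": false
}
"""

cfg = json.loads(cfg_text)

# XLA FSDP is enabled and each Qwen2 decoder layer is wrapped as one FSDP unit.
assert cfg["xla"] is True
assert "Qwen2DecoderLayer" in cfg["fsdp_transformer_layer_cls_to_wrap"]
assert cfg["xla_fsdp_settings"]["compute_dtype"] == "bfloat16"
print("fsdp_config.json parses and has the expected keys")
```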
- Use FSDP to train Qwen2-7B and save checkpoints:
torchrun --nproc_per_node 8 \
examples/pytorch/language-modeling/run_clm.py \
--num_train_epochs 2 \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--use_fast_tokenizer false \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--do_train \
--output_dir /tmp/test-clm \
--model_name_or_path Qwen/Qwen2-7B-Instruct \
--tokenizer_name Qwen/Qwen2-7B-Instruct \
--trust_remote_code true \
--block_size 4096 \
--optim adamw_torch \
--save_strategy steps \
--max_steps 10 \
--logging_strategy steps \
--gradient_checkpointing no \
--logging_steps 1 \
--bf16 true \
--fsdp "full_shard" \
--fsdp_config fsdp_config.json
Expected behavior
The saved checkpoint should contain the complete (unsharded) model weights, but currently FSDP on XLA only saves rank 0's shard of the weights.
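To illustrate what is missing: with XLA FSDP each rank holds only a flat shard of every parameter, so producing a full checkpoint requires gathering each parameter's shards from all ranks and concatenating them in rank order (torch_xla ships a `consolidate_sharded_model_checkpoints` helper for this, if your version has it). The sketch below is a pure-Python toy model of that consolidation step; the parameter names and values are hypothetical, and real shards would be tensors loaded from per-rank checkpoint files.

```python
# Toy illustration (no torch required) of what XLA FSDP checkpoint
# consolidation must do: each rank holds a flat shard of every parameter,
# and the full weights are recovered by concatenating shards in rank order.

def consolidate(shards):
    """Merge per-rank shards ({param_name: flat_chunk}) into full parameters.

    `shards` must be ordered by rank (0 .. world_size - 1).
    """
    full = {}
    for rank_shard in shards:
        for name, chunk in rank_shard.items():
            full.setdefault(name, []).extend(chunk)
    return full

# Two ranks, each holding half of every (flattened) parameter.
# Names and values are made up for illustration.
rank0 = {
    "model.embed_tokens.weight": [0.1, 0.2],
    "model.layers.0.mlp.gate_proj.weight": [1.0],
}
rank1 = {
    "model.embed_tokens.weight": [0.3, 0.4],
    "model.layers.0.mlp.gate_proj.weight": [2.0],
}

full = consolidate([rank0, rank1])
print(full["model.embed_tokens.weight"])  # [0.1, 0.2, 0.3, 0.4]
```

Saving only rank 0's shard, as the current behavior does, corresponds to writing out just `rank0` above, which is why the resulting checkpoint cannot be loaded as a complete model.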