Issue
When I run the following (on CPUs) using the from-source installation of accelerate:
accelerate launch --config_file default_config.yml examples/unconditional_image_generation/train_unconditional.py \
--dataset_name hf-internal-testing/dummy_image_class_data \
--model_config_name_or_path diffusers/ddpm_dummy \
--resolution 64 \
--output_dir /temp \
--train_batch_size 2 \
--num_epochs 1 \
--gradient_accumulation_steps 1 \
--ddpm_num_inference_steps 2 \
--learning_rate 1e-3 \
--lr_warmup_steps 5
it throws:
Epoch 0: 0%| | 0/3 [00:00<?, ?it/s]Traceback (most recent call last):
File "examples/unconditional_image_generation/train_unconditional.py", line 692, in <module>
main(args)
File "examples/unconditional_image_generation/train_unconditional.py", line 594, in main
accelerator.backward(loss)
File "/opt/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 1761, in backward
loss.backward(**kwargs)
File "/opt/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected BFloat16
Epoch 0: 0%| | 0/3 [00:04<?, ?it/s]
Traceback (most recent call last):
File "/opt/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 928, in launch_command
simple_launcher(args)
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 588, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/venv/bin/python', 'examples/unconditional_image_generation/train_unconditional.py', '--dataset_name', 'hf-internal-testing/dummy_image_class_data', '--model_config_name_or_path', 'diffusers/ddpm_dummy', '--resolution', '64', '--output_dir', '/temp', '--train_batch_size', '2', '--num_epochs', '1', '--gradient_accumulation_steps', '1', '--ddpm_num_inference_steps', '2', '--learning_rate', '1e-3', '--lr_warmup_steps', '5']' returned non-zero exit status 1.
With the latest stable release of accelerate, the error does not occur.
Setup
- `diffusers` version: 0.17.0.dev0
- Platform: Linux-4.19.0-24-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 2.0.1+cpu (False)
- Huggingface_hub version: 0.14.1
- Transformers version: 4.30.0.dev0
- Accelerate version: 0.20.0.dev0
- xFormers version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
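For anyone comparing a working and a failing environment, here is a minimal sketch of how a version list like the one above could be collected using only the standard library. The helper name `collect_versions` and the package list are my own choices for illustration, not part of diffusers or accelerate.

```python
from importlib import metadata
import platform

# Distribution names to look up; this list is an assumption for illustration.
PACKAGES = ["diffusers", "torch", "huggingface_hub", "transformers", "accelerate", "xformers"]


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print(f"- Platform: {platform.platform()}")
    print(f"- Python version: {platform.python_version()}")
    for name, version in collect_versions(PACKAGES).items():
        print(f"- {name} version: {version}")
```

Running it once in each environment (from-source accelerate vs. stable accelerate) and diffing the output should show which packages differ between the two.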
To run
Clone diffusers:
git clone https://github.com/huggingface/diffusers
Then run the training script:
accelerate launch --config_file default_config.yml examples/unconditional_image_generation/train_unconditional.py \
--dataset_name hf-internal-testing/dummy_image_class_data \
--model_config_name_or_path diffusers/ddpm_dummy \
--resolution 64 \
--output_dir /temp \
--train_batch_size 2 \
--num_epochs 1 \
--gradient_accumulation_steps 1 \
--ddpm_num_inference_steps 2 \
--learning_rate 1e-3 \
--lr_warmup_steps 5
default_config.yml was obtained using:
from accelerate.utils import write_basic_config
write_basic_config(save_location="default_config.yml")
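Since the error mentions BFloat16, it may be worth confirming what the generated config actually requests. The sketch below assumes the file written by `write_basic_config` parses as JSON and contains a `mixed_precision` key; I have not verified that for every accelerate version, and the helper `read_mixed_precision` is hypothetical.

```python
import json
from pathlib import Path


def read_mixed_precision(config_path):
    """Return the `mixed_precision` entry of a JSON-formatted config, or None if absent.

    Assumption: the config file is JSON-compatible. If it is not,
    json.loads raises an error and the file should be inspected by hand.
    """
    config = json.loads(Path(config_path).read_text())
    return config.get("mixed_precision")


if __name__ == "__main__":
    path = Path("default_config.yml")
    if path.exists():
        print("mixed_precision:", read_mixed_precision(path))
    else:
        print(f"{path} not found; generate it first.")
```

If this prints `no` while the run still fails with the BFloat16 error, the bf16 dtype is presumably being introduced somewhere other than the config file.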
This issue popped up in our broken CI: https://github.com/huggingface/diffusers/actions/runs/4955138459/jobs/8864271403?pr=3397.
Cc: @patrickvonplaten