[Casting] RuntimeError: Found dtype Float but expected BFloat16 on a CPU #1420

@sayakpaul

Issue

When I run the following on a CPU-only machine, with accelerate installed from source:

accelerate launch --config_file default_config.yml examples/unconditional_image_generation/train_unconditional.py \
    --dataset_name hf-internal-testing/dummy_image_class_data \
    --model_config_name_or_path diffusers/ddpm_dummy \
    --resolution 64 \
    --output_dir /temp \
    --train_batch_size 2 \
    --num_epochs 1 \
    --gradient_accumulation_steps 1 \
    --ddpm_num_inference_steps 2 \
    --learning_rate 1e-3 \
    --lr_warmup_steps 5

it throws:

Epoch 0:   0%|                                                                                                          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "examples/unconditional_image_generation/train_unconditional.py", line 692, in <module>
    main(args)
  File "examples/unconditional_image_generation/train_unconditional.py", line 594, in main
    accelerator.backward(loss)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 1761, in backward
    loss.backward(**kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected BFloat16
Epoch 0:   0%|                                                                                                          | 0/3 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/opt/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 928, in launch_command
    simple_launcher(args)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 588, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/venv/bin/python', 'examples/unconditional_image_generation/train_unconditional.py', '--dataset_name', 'hf-internal-testing/dummy_image_class_data', '--model_config_name_or_path', 'diffusers/ddpm_dummy', '--resolution', '64', '--output_dir', '/temp', '--train_batch_size', '2', '--num_epochs', '1', '--gradient_accumulation_steps', '1', '--ddpm_num_inference_steps', '2', '--learning_rate', '1e-3', '--lr_warmup_steps', '5']' returned non-zero exit status 1.

With the latest stable release of accelerate, the error does not occur.

Setup

- `diffusers` version: 0.17.0.dev0
- Platform: Linux-4.19.0-24-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 2.0.1+cpu (False)
- Huggingface_hub version: 0.14.1
- Transformers version: 4.30.0.dev0
- Accelerate version: 0.20.0.dev0
- xFormers version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
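
To compare the from-source and stable installs side by side, a small helper along these lines could print the relevant package versions in each environment. This is only a sketch: the function name `installed_versions` and the package list are my own choices for illustration, not part of any of these libraries.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical package list; adjust to whatever is relevant.
    report = installed_versions(["accelerate", "diffusers", "torch", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in the failing environment and once in the working one should make any version difference obvious.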

To run

Clone diffusers:

git clone https://github.com/huggingface/diffusers

Then run:

accelerate launch --config_file default_config.yml examples/unconditional_image_generation/train_unconditional.py \
    --dataset_name hf-internal-testing/dummy_image_class_data \
    --model_config_name_or_path diffusers/ddpm_dummy \
    --resolution 64 \
    --output_dir /temp \
    --train_batch_size 2 \
    --num_epochs 1 \
    --gradient_accumulation_steps 1 \
    --ddpm_num_inference_steps 2 \
    --learning_rate 1e-3 \
    --lr_warmup_steps 5

default_config.yml was obtained using:

from accelerate.utils import write_basic_config

write_basic_config(save_location="default_config.yml")
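
Since the error mentions BFloat16, one thing that may be worth checking is which `mixed_precision` value ended up in the generated file. The sketch below reads it back using only the standard library. The helper name `read_mixed_precision` is hypothetical, and it assumes the file is either a JSON object or flat `key: value` YAML; it is not a general YAML parser.

```python
import json
from pathlib import Path


def read_mixed_precision(config_path):
    """Return the mixed_precision value stored in a config file, or None."""
    text = Path(config_path).read_text()
    try:
        # Try JSON first; a JSON object is also valid YAML.
        data = json.loads(text)
        if isinstance(data, dict):
            return data.get("mixed_precision")
    except json.JSONDecodeError:
        pass
    # Fallback (assumption): a flat "mixed_precision: value" YAML line.
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "mixed_precision":
            return value.strip().strip("'\"") or None
    return None


if __name__ == "__main__":
    config_file = Path("default_config.yml")
    if config_file.exists():
        print(read_mixed_precision(config_file))
```

If this prints something other than `no` in the from-source environment, that would at least narrow down where the bf16 cast is coming from.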

This issue surfaced in our CI, which is currently failing: https://github.com/huggingface/diffusers/actions/runs/4955138459/jobs/8864271403?pr=3397.

Cc: @patrickvonplaten
