NaN errors with fp16 training on Anima. #2293

@sashasubbbb

Description

On the SD3 branch I'm getting NaN errors immediately when training an Anima LoRA in fp16. Unfortunately, my GPU doesn't support bf16. Is fp16 training for Anima not supported?
The command I'm running is:

```
accelerate launch --num_cpu_threads_per_process 1 anima_train_network.py \
  --pretrained_model_name_or_path="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/anima-preview2.safetensors" \
  --qwen3="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/text_encoders/qwen_3_06b_base.safetensors" \
  --vae="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/vae/qwen_image_vae.safetensors" \
  --dataset_config="B:\AIimages\stable-diffusion-webui\models\Lora\lora\animamine\dataset.toml" \
  --output_dir="B:/AIimages/stable-diffusion-webui/models/Lora/lora/animamine/" \
  --output_name="my_anima_lora" --save_model_as=safetensors \
  --network_module=networks.lora_anima --network_dim=8 \
  --learning_rate=1e-4 --optimizer_type="AdamW8bit" --lr_scheduler="constant" \
  --timestep_sampling="sigmoid" --discrete_flow_shift=1.0 \
  --max_train_epochs=10 --save_every_n_epochs=1 \
  --mixed_precision="fp16" --gradient_checkpointing \
  --cache_latents --vae_chunk_size=64 --vae_disable_cache
```
EDIT:
Interestingly enough, with #2274 applied manually, I get normal loss, no NaN errors, and the LoRA trains successfully.
EDIT 2:
Unfortunately, even with that patch and some specific training settings I still get NaNs, just deeper into the training. The chance of a run finishing without blowing up is about 50%.
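For anyone debugging this, the usual mechanism behind fp16 NaNs is the format's narrow dynamic range: fp16 can only represent magnitudes up to 65504, so any activation or gradient above that overflows to inf, and inf arithmetic then yields NaN, which poisons the loss. A minimal sketch of that failure mode (NumPy used purely for illustration; the values are hypothetical, not taken from the trainer):

```python
import numpy as np

FP16_MAX = np.finfo(np.float16).max  # 65504.0, the largest finite fp16 value

x = np.float16(60000.0)
y = np.float16(x * x)  # ~3.6e9 is far beyond fp16 range -> overflows to inf
print(y)               # inf
print(y - y)           # nan: inf - inf is NaN, which then propagates everywhere

# A common mitigation when bf16 is unavailable: do the risky math in fp32,
# then clamp back into fp16 range before casting down.
z = np.float16(min(float(x) * float(x), float(FP16_MAX)))
print(z)               # 65504.0, still finite
```

This is also why patches that keep sensitive layers or reductions in fp32 (as #2274 apparently does in part) reduce, but don't always eliminate, the NaNs: a single overflow anywhere in the fp16 path is enough.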
