NaN errors with fp16 training on Anima. #2293
Description
On the SD3 branch I'm getting NaN errors immediately when training an Anima LoRA in fp16. Unfortunately, my GPU doesn't support bf16. Is fp16 training not supported for Anima?
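For context on why fp16 is fragile here: IEEE half precision tops out around 65504, so any activation or gradient magnitude beyond that overflows to `inf`, and subsequent arithmetic (e.g. `inf - inf`) produces NaN. bf16 keeps float32's 8-bit exponent range, which is why it usually avoids this. A minimal sketch (plain numpy, nothing Anima-specific):

```python
import numpy as np

# fp16 (IEEE half) has a max finite value of ~65504; anything larger
# overflows to inf when cast, and inf - inf yields NaN.
x = np.float16(70000.0)   # overflows to inf
y = x - x                 # inf - inf -> nan
print(np.isinf(x), np.isnan(y))

# bf16 shares float32's exponent range, so the same magnitude stays finite.
# (numpy has no native bfloat16; float32 shown for comparison.)
z = np.float32(70000.0)
print(np.isfinite(z))
```

This is why fp16 training typically needs loss scaling or clamping of intermediate values, while bf16 often works unmodified.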
The command I'm running is:

```shell
accelerate launch --num_cpu_threads_per_process 1 anima_train_network.py ^
  --pretrained_model_name_or_path="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/anima-preview2.safetensors" ^
  --qwen3="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/text_encoders/qwen_3_06b_base.safetensors" ^
  --vae="B:/AIimages/ComfyUI_windows_portable/ComfyUI/models/vae/qwen_image_vae.safetensors" ^
  --dataset_config="B:\AIimages\stable-diffusion-webui\models\Lora\lora\animamine\dataset.toml" ^
  --output_dir="B:/AIimages/stable-diffusion-webui/models/Lora/lora/animamine/" ^
  --output_name="my_anima_lora" --save_model_as=safetensors ^
  --network_module=networks.lora_anima --network_dim=8 ^
  --learning_rate=1e-4 --optimizer_type="AdamW8bit" --lr_scheduler="constant" ^
  --timestep_sampling="sigmoid" --discrete_flow_shift=1.0 ^
  --max_train_epochs=10 --save_every_n_epochs=1 ^
  --mixed_precision="fp16" --gradient_checkpointing --cache_latents ^
  --vae_chunk_size=64 --vae_disable_cache
```
EDIT:
Interestingly, with #2274 applied manually I get normal loss values, no NaN errors, and the LoRA trains successfully.
EDIT2:
Unfortunately, even with that patch and some specific training settings I still hit NaN, though only deep into the training. Roughly half of my runs finish without blowing up.
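Until the root cause is fixed, a guard that aborts as soon as the loss goes non-finite at least keeps a half-broken run from continuing and overwriting good checkpoints. A sketch with a hypothetical helper name (`check_finite` is not part of the sd-scripts API; it would be called on the scalar loss each step):

```python
import math

def check_finite(loss_value: float, step: int) -> float:
    """Raise instead of silently training on NaN/inf loss (hypothetical helper).

    Aborting early preserves the last good checkpoint; resuming from it
    with a lower LR or a different seed is often enough to get past the
    step where fp16 blew up.
    """
    if not math.isfinite(loss_value):
        raise RuntimeError(f"non-finite loss {loss_value!r} at step {step}")
    return loss_value
```

Combined with `--save_every_n_epochs=1` from the command above, this turns a 50/50 run into "retry from the last epoch that survived" rather than a total loss.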