Fix NaN in inpainting sample generation under fp16 autocast #2322
Merged
Conversation
The lpw inpaint encode path had two related dtype issues that surfaced when training-time sample generation hit the VAE inside the `accelerator.autocast()` block in `train_util.sample_image_inference`:

1. Inputs were cast to the UNet dtype (`dtype = unet.dtype`) rather than `self.vae.dtype`. With `--no_half_vae` or an fp32 VAE, this fed the wrong-precision tensor into the encode.
2. More importantly, even after fixing (1), fp16 autocast forces conv kernels to fp16 regardless of input/weight dtype, and the SDXL VAE produces NaN in fp16, so `vae.encode` inside autocast NaN'd out even with fp32 weights and fp32 inputs.

Cast inpaint inputs to `self.vae.dtype` and wrap the encode in `torch.autocast(..., enabled=False)`; this mirrors how the existing `decode_latents()` runs outside the caller's autocast block. A sketch of the resulting encode path follows below.

Verified on SDXL with `--mixed_precision fp16 --no_half_vae` (was NaN, now produces valid samples) and `--mixed_precision bf16` (no regression). The same change is applied to the SD1.5 lpw pipeline; the SD1.5 VAE is more fp16-tolerant, so the bug was latent there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
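A minimal sketch of the resulting encode path, not the exact lpw pipeline code: the method name and surrounding structure are illustrative, it assumes a diffusers-style `AutoencoderKL` as `self.vae`, and the `scaling_factor` multiply follows the diffusers convention rather than the pipeline's actual constant.

```python
import torch

def _encode_inpaint_image(self, image: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Cast to the VAE's own dtype, not the UNet's: under --no_half_vae the
    # VAE stays fp32 while the UNet runs fp16.
    image = image.to(device=device, dtype=self.vae.dtype)
    # Opt out of the caller's fp16 autocast so the conv kernels actually run
    # in the VAE's precision instead of being forced down to fp16.
    with torch.autocast(device_type=device.type, enabled=False):
        latents = self.vae.encode(image).latent_dist.sample()
    return latents * self.vae.config.scaling_factor
```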
Summary
Bug fix for inpainting sample generation following PR #2309. With `--mixed_precision fp16` (with or without `--no_half_vae`), the SDXL training-time inpainting sample generation produces NaN latents on the very first sample, leading to `RuntimeWarning: invalid value encountered in cast` and a black image. With `bf16`, sampling works.

Two related issues in the lpw inpaint encode path, both surfacing under the `accelerator.autocast()` block in `train_util.sample_image_inference`:

1. Inputs were cast to `dtype = unet.dtype` (UNet dtype), not `self.vae.dtype`. Under `--no_half_vae` (vae=fp32, unet=fp16) this fed mismatched precision into `self.vae.encode`. The fix mirrors the existing pattern used in `decode_latents()` (`self.vae.decode(latents.to(self.vae.dtype))`).
2. Even with matching dtypes, fp16 autocast forces conv kernels to fp16 regardless of input/weight dtype, and the SDXL VAE produces NaN in fp16. `decode_latents()` happens to be called outside the caller's `accelerator.autocast()` block, so it was unaffected, but the inpaint encode runs inside `pipeline()` and hit this; a standalone repro of the autocast behavior is sketched after this list.
The fix wraps the inpaint encode in `torch.autocast(device_type=device.type, enabled=False)` so the encode runs in the VAE's actual precision.
The same change is applied to `lpw_stable_diffusion.py` (SD1.5). The SD1.5 VAE is more fp16-tolerant, so the bug was latent there, but the path is structurally identical and worth keeping consistent.

Test plan
Verified by @kohya-ss:
- `--mixed_precision fp16 --no_half_vae` — was NaN, now produces valid samples
- `--mixed_precision bf16` — no regression

Notes
- No file overlap with "Inpainting cleanup: misc fixes following PR #2309 review" #2321: `library/lpw_stable_diffusion.py` and `library/sdxl_lpw_stable_diffusion.py` here vs. `train_util.py` / `sdxl_train.py` / `model_util.py` / docs / `inpainting_minimal_inference.py` there, so they can be merged in either order.
- `--no_half_vae`.

🤖 Generated with Claude Code