LoRA training QoL improvements: UI progress bar, deterministic seeding, make gradient checkpointing optional #8668
spacepxl wants to merge 1 commit into Comfy-Org:master
Conversation
@KohakuBlueleaf what do you think?
Not sure about the UI part, but the others are easy. I'll do them after I finish the refactor (which will affect seeding).
@comfyanonymous Maybe you can consider merging this first, then I will resolve the conflicts on my PR.
Hi @spacepxl |

Adding the UI progress bar lets users see the training progress in the UI (obviously), but it also makes it possible to cancel training.
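
For reference, ComfyUI exposes a `ProgressBar` helper in `comfy.utils` that reports progress to the frontend; a rough sketch of wiring it into a training loop could look like the following (the `training_step` callable is hypothetical, not from this PR):

```python
import comfy.utils
import comfy.model_management

def train_loop(num_steps, training_step):
    pbar = comfy.utils.ProgressBar(num_steps)
    for step in range(num_steps):
        # Raises if the user pressed cancel in the UI, which is what
        # makes the training loop interruptible between steps.
        comfy.model_management.throw_exception_if_processing_interrupted()
        loss = training_step(step)  # hypothetical per-step training function
        pbar.update(1)  # advance the frontend progress bar by one step
```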
Gradient checkpointing, especially with this many checkpoints, is computationally expensive and unnecessary when memory isn't a constraint. I left it enabled by default, but disabling it is a free speed boost.
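
As a minimal sketch (not the PR's exact code), making checkpointing optional can be as simple as branching on a flag around each block, using the standard `torch.utils.checkpoint` API:

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_block(block, x, use_checkpointing=True):
    # With checkpointing on, activations are recomputed during backward
    # (less memory, more compute); with it off, the block runs normally
    # and keeps its activations (more memory, faster).
    if use_checkpointing and torch.is_grad_enabled():
        return checkpoint(block, x, use_reentrant=False)
    return block(x)
```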
As for seeding, I replaced the unused generator: instead, I temporarily store the global RNG states, seed everything, and restore the states after training finishes. This seeds the weight initialization without needing to pass a generator object all over the place. The RNG of weight initialization is pretty significant: if it's allowed to be random, workflows that incorporate LoRA training directly (instead of loading a trained file) would be impossible to reproduce. It also seeds timestep sampling, which is the main factor driving the training loss at small batch sizes. A sketch of the pattern follows.
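
A minimal, illustrative version of that save/seed/restore pattern (assumed structure, not the exact PR code):

```python
import random
import numpy as np
import torch

def seeded_training(seed, train_fn):
    # Capture the current global RNG states
    py_state = random.getstate()
    np_state = np.random.get_state()
    torch_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
    try:
        # Seed every global generator so weight init and timestep
        # sampling are reproducible from the workflow's seed
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)  # also seeds all CUDA devices
        return train_fn()
    finally:
        # Restore the original states so the rest of the workflow is unaffected
        random.setstate(py_state)
        np.random.set_state(np_state)
        torch.set_rng_state(torch_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)
```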
With this change, fp32 training is now fully deterministic. bf16 training is still partially nondeterministic, though, and I wasn't able to track down the cause; I'm guessing it could be related to stochastic rounding?