
Support Anima model #2260

Merged

kohya-ss merged 10 commits into kohya-ss:sd3 from duongve13112002:sd3 on Feb 8, 2026

Conversation

@duongve13112002
Contributor

Hi @kohya-ss

Here is a description of my changes to support the Anima model in sd-scripts.

Summary
Add Anima model support for LoRA and full-finetune training

Changes
Anima Model Support

  • Added anima_train.py - Full finetune training script
  • Added anima_train_network.py - LoRA training script
  • Added library/anima_models.py - MiniTrainDIT architecture (28 DiT blocks, AdaLN-LoRA, 3D RoPE)
  • Added library/anima_utils.py - Model loading for DiT, VAE, Qwen3 text encoder, T5 tokenizer
  • Added library/anima_vae.py - Anima VAE (WanVAE-based)
  • Added library/anima_train_utils.py
  • Added library/strategy_anima.py - Tokenize, text encoding, and caching strategies
  • Added networks/lora_anima.py - LoRA network for Anima
  • Added configs/qwen3_06b/ - Qwen3 tokenizer and model config
  • Added docs/anima_train_network.md - Usage documentation

Note: No existing files are modified; all changes are new files. The implementation follows the same patterns as SD3/Lumina/Flux and has been tested with both LoRA and full finetune training. If you have any questions, feel free to ask.

@kohya-ss
Owner

kohya-ss commented Feb 7, 2026

Thank you for this! I am working on Anima support based on the ComfyUI code, but its GPL 3.0 license is a problem. This PR is based on ASL 2.0 code, which is great.

Please note that there is a possibility that some changes will be made after merging to align with other models.

@duongve13112002
Contributor Author

If you need any help, feel free to reach out. I’d be happy to help.

@kohya-ss
Owner

kohya-ss commented Feb 7, 2026

I have a few questions.

  1. To reduce repository size, the tokenizer settings are not stored in the repository but downloaded from Hugging Face. I'd like to use the official tokenizer config from https://huggingface.co/Qwen/Qwen3-0.6B. However, the official config seems to have a different value for max_position_embeddings (40960) than the value in this PR and in ComfyUI (32768): https://github.com/Comfy-Org/ComfyUI/blob/17e7df43d19bde49efa46a32b89f5153b9cb0ded/comfy/text_encoders/llama.py#L92. Can we use the official tokenizer config?
  2. Also, regarding T5XXL, is it okay to use the official tokenizer config from https://huggingface.co/google/t5-v1_1-xxl? The tokenizer config in this PR seems to differ slightly from both the official and the ComfyUI settings.
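[Editor's note] One way the discrepancy could be reconciled is to download the official config and patch only the differing field before building the text encoder. The sketch below is purely illustrative (the dicts stand in for the real Qwen3 config objects; the field name `max_position_embeddings` is the one discussed above):

```python
# Sketch: keep the official Qwen/Qwen3-0.6B config as the source of truth,
# but override the one field where this PR / ComfyUI differ from it.
# The plain dicts below are stand-ins for the real loaded config.
official_config = {
    "model_type": "qwen3",
    "max_position_embeddings": 40960,  # official Qwen/Qwen3-0.6B value
}

OVERRIDES = {
    "max_position_embeddings": 32768,  # value used by ComfyUI and this PR
}

config = {**official_config, **OVERRIDES}
print(config["max_position_embeddings"])  # 32768
```

This would let the repository reference the official Hugging Face config instead of vendoring a modified copy, while still matching the behavior of diffusion-pipe and ComfyUI.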

@duongve13112002
Contributor Author

Yes, I noticed these issues as well when I reviewed them. However, these configurations come from the diffusion-pipe repository, and the owner of this repository is also the creator of this model.

@kohya-ss
Owner

kohya-ss commented Feb 7, 2026

Yes, I noticed these issues as well when I reviewed them. However, these configurations come from the diffusion-pipe repository, and the owner of this repository is also the creator of this model.

Thank you for the explanation. I understood. So keeping the config in the repository sounds good.

@kohya-ss
Owner

kohya-ss commented Feb 7, 2026

I also noticed one other thing: ComfyUI's inference code doesn't seem to use an attention mask in its Text Encoder calls, but diffusion-pipe does seem to use one.

@duongve13112002
Contributor Author

I’m also very confused about this, but when I tested the model with and without an attention mask, I observed that using the attention mask makes the model more stable and produces better results.

@kohya-ss
Owner

kohya-ss commented Feb 7, 2026

Hmm, so the attention mask is used inside the Text Encoder, then the embeddings where mask=0 are zeroed out, and a fixed-length embedding is passed to DiT. That's interesting...

@duongve13112002
Contributor Author

I'm not entirely sure about the author's intention, but here is my understanding (I could be wrong):

  • Qwen3 needs the mask internally: tokens self-attend, so padding must be excluded there.
  • After Qwen3, padding embeddings are zeroed out. In DiT cross-attention, zeroed K rows give attention scores of exactly zero (so padding still receives some softmax weight), while zeroed V rows contribute nothing to the weighted sum. The zeros therefore act as an approximate "soft mask" automatically.
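[Editor's note] The "soft mask" behavior described above can be checked numerically. A minimal NumPy sketch (not code from this PR; shapes and the valid-token count are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8        # head dimension
L = 6        # text embedding sequence length
valid = 4    # first 4 tokens are real, last 2 are padding

q = rng.normal(size=(1, d))   # one DiT cross-attention query
k = rng.normal(size=(L, d))
v = rng.normal(size=(L, d))
k[valid:] = 0.0               # padding embeddings zeroed before DiT
v[valid:] = 0.0

scores = q @ k.T / np.sqrt(d)  # zeroed K rows -> scores of exactly 0
w = softmax(scores)            # padding still gets some softmax weight...
out = w @ v                    # ...but zeroed V rows add nothing to the sum

# The output is exactly a weighted sum over the valid tokens only
# (though the weights sum to less than 1, unlike a hard mask).
out_valid_only = w[:, :valid] @ v[:valid]
print(np.allclose(out, out_valid_only))  # True
```

Note this is only an approximation of a hard mask: the padding positions still absorb some softmax weight, so the valid-token weights are not renormalized the way they would be with an explicit -inf mask.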

@kohya-ss
Owner

kohya-ss commented Feb 8, 2026

Thank you for the detailed explanation. I think I have a general understanding of Anima's training and inference.

Please understand that I will make some changes after merging. I would also appreciate any feedback or additional pull requests after the merge.

Thank you again for this great PR!

@kohya-ss kohya-ss merged commit e21a773 into kohya-ss:sd3 Feb 8, 2026
0 of 3 checks passed
@brundlesprout

Working well. Thank you @kohya-ss, @duongve13112002!

