NucleusMoE-Image #13317
Conversation
NucleusImage - text kv caching
```python
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
```
It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
I agree it's weird. :) I used it to stabilize the gradients in case the tanh gates saturate during training. I will evaluate the model performance without it and get back to you!
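For reference, a minimal sketch of the pattern under discussion (the tensor shapes and names here are made up):

```python
import torch

# Clamp-then-tanh gating as described above. tanh(±2.0) ≈ ±0.964, so clamping the raw
# logits to [-2.0, 2.0] keeps the tanh input out of its deeply saturated region in the
# forward pass; whether that actually helps training is what is being evaluated here.
gate1_logits = torch.randn(4, 8, requires_grad=True)

gate1 = gate1_logits.clamp(min=-2.0, max=2.0)  # pre-squash clamp (the line questioned above)
gate1 = torch.tanh(gate1)                      # final gate values in (-1, 1)
```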
Hi @sippycoder, it doesn't look like my HF account (also …) has access to the model repo.
Looks like I can't give you private repo access unless you are in my org. I just made the repo public! I didn't update the model page yet.
@bot /style
Style bot fixed some files and pushed the changes.
```python
logger = logging.get_logger(__name__)
```

```python
# Copied from diffusers.models.transformers.transformer_qwenimage.apply_rotary_emb_qwen with qwen->nucleus
```
Can you run `make fix-copies` to sync the implementation here with the QwenImage implementation (assuming the implementations are intended to be the same, which I believe is the case)? If needed, the dummy objects can also be regenerated with `python utils/check_dummies.py --fix_and_overwrite`.
@bot /style
Style bot fixed some files and pushed the changes.
Hi @sippycoder, I think this PR is close to merge; the remaining items should be:
Additionally, having docs would be nice, but this is not a hard blocker (we can add them in a follow-up PR if necessary).
```python
def __init__(self, state_manager: StateManager):
    super().__init__()
    self.state_manager = state_manager
    self.kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
```
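For context, a rough sketch of the caching idea behind this hook (not the actual hook code; the projection layers and shapes below are made up). The premise is that the text-branch key/value projections for a given block can be computed once and reused on later denoising steps:

```python
import torch

# Hypothetical stand-in for a per-block text KV cache: keyed by block index,
# storing the (key, value) projections of the prompt embeddings.
kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}

def get_text_kv(block_idx: int, encoder_hidden_states: torch.Tensor,
                to_k: torch.nn.Linear, to_v: torch.nn.Linear) -> tuple[torch.Tensor, torch.Tensor]:
    if block_idx not in kv_cache:  # first denoising step: project and store
        kv_cache[block_idx] = (to_k(encoder_hidden_states), to_v(encoder_hidden_states))
    return kv_cache[block_idx]     # subsequent steps: reuse the cached tensors

to_k, to_v = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
prompt_emb = torch.randn(1, 7, 64)
k1, v1 = get_text_kv(0, prompt_emb, to_k, to_v)  # computed and cached
k2, v2 = get_text_kv(0, prompt_emb, to_k, to_v)  # returned from the cache
```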
After looking at the existing cache design more closely, I think `kv_cache` should not be owned by `TextKVCacheBlockHook` but rather refactored into its own `BaseState` subclass, which is what MagCache and TaylorSeer do (both store cached tensors in their state classes rather than on the hook).
This would mean we would have two state classes: one which holds the shared `encoder_hidden_states` tensor and one which holds the KV cache dict for each block:
```python
# Same as before
class TextKVCacheState(BaseState):
    def __init__(self):
        self.key: int | None = None
        ...


# Holds the block-level KV cache
class TextKVCacheBlockState(BaseState):
    def __init__(self):
        self.kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
        ...


# Same as before
class TextKVCacheTransformerHook(ModelHook):
    ...


class TextKVCacheBlockHook(ModelHook):
    # One state manager for shared transformer-level state, one for block-specific state
    def __init__(self, state_manager: StateManager, block_state_manager: StateManager):
        super().__init__()
        self.state_manager = state_manager
        self.block_state_manager = block_state_manager
    ...
```

This would allow us to manage each block-level KV cache with a `StateManager`, which I think more cleanly follows the current design.
Yea that makes sense! I added a commit for this.
dg845 left a comment:
Thanks again for the PR! I think the items in #13317 (comment) should be resolved. We can add docs and handle any remaining issues in follow-up PRs.
Merging as the CI is green.
@sippycoder When is this actually getting released? You said 'tomorrow' two weeks ago to get this PR merged, but your model still isn't released. I'm eager to try this. Edit: it was posted the day after my comment.
* adding NucleusMoE-Image model
* update system prompt
* Add text kv caching
* Class/function name changes
* add missing imports
* add RoPE credits
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* update defaults
* Update src/diffusers/pipelines/nucleusmoe_image/pipeline_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* review updates
* fix the tests
* clean up
* update apply_text_kv_cache
* SwiGLUExperts addition
* fuse SwiGLUExperts up and gate proj
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* _SharedCacheKey -> TextKVCacheState
* Apply style fixes
* Run python utils/check_copies.py --fix_and_overwrite
  python utils/check_dummies.py --fix_and_overwrite
* Apply style fixes
* run `make fix-copies`
* fix import
* refactor text KV cache to be managed by StateManager

---------

Co-authored-by: Murali Nandan Nagarapu <nmn@withnucleus.ai>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What does this PR do?
This PR introduces the NucleusMoE-Image series into the diffusers library.
NucleusMoE-Image is a 17B-parameter model with 2B active parameters, trained with efficiency at its core. Our architecture highlights the scalability of sparse MoE designs for image generation. The technical report will be released soon.
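To make the "active vs. total parameters" distinction concrete, here is a generic sketch of a top-k sparse-MoE feed-forward with fused SwiGLU experts. This is not the NucleusMoE-Image implementation; all names, sizes, and routing details below are illustrative assumptions. Each token is routed to only `top_k` of the experts, so the parameters exercised per token are a small fraction of the total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert MLP with a fused up/gate projection (sizes are illustrative)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.up_gate = nn.Linear(dim, 2 * hidden_dim, bias=False)  # fused up + gate proj
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        up, gate = self.up_gate(x).chunk(2, dim=-1)
        return self.down(up * F.silu(gate))

class SparseMoEFFN(nn.Module):
    """Generic top-k routed mixture of experts; only top_k experts run per token."""
    def __init__(self, dim=512, hidden_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoEFFN()(tokens).shape)  # torch.Size([16, 512])
```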