NucleusMoE-Image #13317
Conversation
NucleusImage - text kv caching
```python
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
```
It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
I agree it's weird. :) I used it to stabilize the gradients in case the tanh gates saturate during training. I will evaluate the model performance without it and get back to you!
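For reference, a minimal sketch of the pattern under discussion (the tensor shapes and names here are made up):

```python
import torch

# Clamp-then-tanh gating as described above. tanh(±2.0) ≈ ±0.964, so clamping the raw
# logits to [-2.0, 2.0] keeps the tanh input out of its deeply saturated region in the
# forward pass; whether that actually helps training is what is being evaluated here.
gate1_logits = torch.randn(4, 8, requires_grad=True)

gate1 = gate1_logits.clamp(min=-2.0, max=2.0)  # pre-squash clamp (the line questioned above)
gate1 = torch.tanh(gate1)                      # final gate values in (-1, 1)
```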
Hi @sippycoder, it doesn't look like my HF account (also …) has access to the model repo.
Looks like I can't give you private repo access unless you are in my org. I just made the repo public! I didn't update the model page yet.
@bot /style
Style bot fixed some files and pushed the changes.
```python
logger = logging.get_logger(__name__)
```

```python
# Copied from diffusers.models.transformers.transformer_qwenimage.apply_rotary_emb_qwen with qwen->nucleus
```
Can you run `make fix-copies` to sync the implementation here with the QwenImage implementation (assuming the implementations are intended to be the same, which I believe is the case)? If needed, the dummy objects can also be regenerated with `python utils/check_dummies.py --fix_and_overwrite`.
@bot /style
Style bot fixed some files and pushed the changes.
Hi @sippycoder, I think this PR is close to merge; the remaining items should be:
Additionally, having docs would be nice, but this is not a hard blocker (we can add them in a follow-up PR if necessary).
```python
def __init__(self, state_manager: StateManager):
    super().__init__()
    self.state_manager = state_manager
    self.kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
```
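For context, a rough sketch of the caching idea behind this hook (not the actual hook code; the projection layers and shapes below are made up). The premise is that the text-branch key/value projections for a given block can be computed once and reused on later denoising steps:

```python
import torch

# Hypothetical stand-in for a per-block text KV cache: keyed by block index,
# storing the (key, value) projections of the prompt embeddings.
kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}

def get_text_kv(block_idx: int, encoder_hidden_states: torch.Tensor,
                to_k: torch.nn.Linear, to_v: torch.nn.Linear) -> tuple[torch.Tensor, torch.Tensor]:
    if block_idx not in kv_cache:  # first denoising step: project and store
        kv_cache[block_idx] = (to_k(encoder_hidden_states), to_v(encoder_hidden_states))
    return kv_cache[block_idx]     # subsequent steps: reuse the cached tensors

to_k, to_v = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
prompt_emb = torch.randn(1, 7, 64)
k1, v1 = get_text_kv(0, prompt_emb, to_k, to_v)  # computed and cached
k2, v2 = get_text_kv(0, prompt_emb, to_k, to_v)  # returned from the cache
```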
After looking at the existing cache design more closely, I think `kv_cache` should not be owned by `TextKVCacheBlockHook` but rather refactored into its own `BaseState` subclass, which is what MagCache and TaylorSeer do (both store cached tensors in their state classes rather than on the hook).
This would mean we would have two state classes: one which holds the shared `encoder_hidden_states` tensor and one which holds the KV cache dict for each block:
```python
# Same as before
class TextKVCacheState(BaseState):
    def __init__(self):
        self.key: int | None = None
        ...


# Holds the block-level KV cache
class TextKVCacheBlockState(BaseState):
    def __init__(self):
        self.kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
        ...


# Same as before
class TextKVCacheTransformerHook(ModelHook):
    ...


class TextKVCacheBlockHook(ModelHook):
    # One state manager for shared transformer-level state, one for block-specific state
    def __init__(self, state_manager: StateManager, block_state_manager: StateManager):
        super().__init__()
        self.state_manager = state_manager
        self.block_state_manager = block_state_manager
    ...
```

This would allow us to manage each block-level KV cache with a `StateManager`, which I think more cleanly follows the current design.
Yea that makes sense! I added a commit for this.
dg845 left a comment:
Thanks again for the PR! I think the items in #13317 (comment) should be resolved. We can add docs and handle any remaining issues in follow-up PRs.
Merging as the CI is green.
@sippycoder When is this actually getting released? You said 'tomorrow' two weeks ago to get this PR merged, but your model still isn't released. I'm eager to try this. Edit: it was posted the day after my comment.
* adding NucleusMoE-Image model
* update system prompt
* Add text kv caching
* Class/function name changes
* add missing imports
* add RoPE credits
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* update defaults
* Update src/diffusers/pipelines/nucleusmoe_image/pipeline_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* review updates
* fix the tests
* clean up
* update apply_text_kv_cache
* SwiGLUExperts addition
* fuse SwiGLUExperts up and gate proj
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/hooks/text_kv_cache.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Update src/diffusers/models/transformers/transformer_nucleusmoe_image.py
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* _SharedCacheKey -> TextKVCacheState
* Apply style fixes
* Run python utils/check_copies.py --fix_and_overwrite
  python utils/check_dummies.py --fix_and_overwrite
* Apply style fixes
* run `make fix-copies`
* fix import
* refactor text KV cache to be managed by StateManager

---------

Co-authored-by: Murali Nandan Nagarapu <nmn@withnucleus.ai>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What does this PR do?
This PR introduces the NucleusMoE-Image series into the diffusers library.
NucleusMoE-Image is a 17B-parameter model with 2B active parameters, trained with efficiency at its core. Our architecture highlights the scalability of sparse MoE designs for image generation. The technical report will be released soon.
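To make the "active vs. total parameters" distinction concrete, here is a generic sketch of a top-k sparse-MoE feed-forward with fused SwiGLU experts. This is not the NucleusMoE-Image implementation; all names, sizes, and routing details below are illustrative assumptions. Each token is routed to only `top_k` of the experts, so the parameters exercised per token are a small fraction of the total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert MLP with a fused up/gate projection (sizes are illustrative)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.up_gate = nn.Linear(dim, 2 * hidden_dim, bias=False)  # fused up + gate proj
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        up, gate = self.up_gate(x).chunk(2, dim=-1)
        return self.down(up * F.silu(gate))

class SparseMoEFFN(nn.Module):
    """Generic top-k routed mixture of experts; only top_k experts run per token."""
    def __init__(self, dim=512, hidden_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoEFFN()(tokens).shape)  # torch.Size([16, 512])
```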