Add HyperCLOVAX SEED Think 14B #44956
Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (text_config=None), which breaks v5's @strict config validation added in huggingface/transformers#41250.
Fixes: vllm-project#38387
TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b31ff44 to ef1e73f
@zucchini-nlp, All CI checks have completed, except for one job that is still pending its status report.
bigshanedogg left a comment
This is a self-review of the key changes in this PR.
attention_multiplier: float | None = None
residual_multiplier: float | None = None
embedding_multiplier: float | None = None
logits_scaling: float | None = None
These fields also exist in Granite, but they are redeclared here because their default values differ.
Although they are present in config.json, the dynamic default values set in post_init are not applied unless the fields are explicitly declared.
This part has been removed based on the modification noted in the comment below, except for attention_multiplier.
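The point about explicit declaration can be illustrated with a small dataclass sketch. It is not the real HyperCLOVAXConfig: the class name SketchConfig, the field sizes, and the assumed default of 1/sqrt(head_dim) are all hypothetical, chosen only to show how a field declared with a None default can receive a computed value after initialization.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a model config, not the real HyperCLOVAXConfig."""

    hidden_size: int = 64
    num_attention_heads: int = 4
    # Declared explicitly so that a missing value arrives as None and can be
    # replaced by a computed default in __post_init__.
    attention_multiplier: float | None = None

    def __post_init__(self) -> None:
        if self.attention_multiplier is None:
            head_dim = self.hidden_size // self.num_attention_heads
            # Assumed default: the usual 1/sqrt(head_dim) attention scale.
            self.attention_multiplier = 1.0 / math.sqrt(head_dim)


default_cfg = SketchConfig()                          # value filled in dynamically
explicit_cfg = SketchConfig(attention_multiplier=0.5)  # explicit value is kept
```

If the field were not declared on the class, there would be nothing for the post-init hook to inspect, which is presumably why the declaration is kept for attention_multiplier.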
# Peri-Layer Normalization: additional RMSNorm after each sub-layer output
if self.use_post_norm:
    self.post_norm1 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.post_norm2 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
When self.use_post_norm is True, separate post-norms are declared for the attention and MLP sub-layers to match the Peri-LN structure.
Because this requires a branch on self.use_post_norm, the layer inherits from Granite instead of GLM4
(Granite's config fields were also a closer match).
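A minimal sketch of the residual step described in this comment, written in plain Python rather than PyTorch so it runs without dependencies. The names rms_norm, sub_layer_step, and toy_mlp are hypothetical, the learned norm weights are omitted, and applying residual_multiplier after the post-norm is an assumption about ordering that the diff excerpt does not confirm.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm without a learned weight (weight fixed to 1 for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def sub_layer_step(
    x: Vector,
    sub_layer: Callable[[Vector], Vector],
    use_post_norm: bool,
    residual_multiplier: float = 1.0,
) -> Vector:
    """One residual step: pre-norm, sub-layer, optional post-norm, scaled add."""
    out = sub_layer(rms_norm(x))  # standard pre-norm
    if use_post_norm:
        out = rms_norm(out)  # extra Peri-LN norm on the sub-layer output
    # Assumed ordering: the multiplier is applied to the (normalized) output.
    return [xi + residual_multiplier * oi for xi, oi in zip(x, out)]


def toy_mlp(x: Vector) -> Vector:
    # Placeholder sub-layer; a real one would be attention or an MLP.
    return [10.0 * v for v in x]


x = [1.0, -2.0, 3.0]
without_post = sub_layer_step(x, toy_mlp, use_post_norm=False)
with_post = sub_layer_step(x, toy_mlp, use_post_norm=True)
```

With use_post_norm=False the function reduces to a pre-norm residual step with a multiplier, which mirrors the argument that the post-norm is the only structural addition on top of the Granite-style layer.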
zucchini-nlp left a comment
Great work on applying modular! I left a few comments on what can be deleted because it's already auto-resolved by modular
Other than that we're fine. After addressing the comments, will request core maintainer review and we'll merge
hidden_states = outputs.last_hidden_state
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
# MuP: multiply logits by logits_scaling (cf. GraniteForCausalLM which divides)
logits = self.lm_head(hidden_states[:, slice_indices, :]) * self.config.logits_scaling
Can we adjust the scaling so that we can copy the forward fully? For example, in the config: self.logits_scaling = 1 / self.logits_scaling
Good idea!
However, I'm a bit concerned that storing the inverted value in Config.logits_scaling could cause confusion,
since users inspecting config.json would see a different value than what's actually used in the forward pass.
Would it be okay to keep the explicit * self.config.logits_scaling in forward for clarity, even if it means a small override?
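The trade-off in this exchange can be made concrete with a small sketch. It only illustrates the arithmetic: the function names are hypothetical, the value 0.125 is made up, and the claim that Granite divides is taken from the code comment quoted above rather than verified here.

```python
from typing import List


def scale_by_multiplying(logits: List[float], logits_scaling: float) -> List[float]:
    """Convention in the snippet above: logits * logits_scaling."""
    return [v * logits_scaling for v in logits]


def scale_by_dividing(logits: List[float], logits_scaling: float) -> List[float]:
    """Convention the code comment attributes to Granite: logits / logits_scaling."""
    return [v / logits_scaling for v in logits]


raw_logits = [2.0, -1.0, 0.5]
stored_value = 0.125  # hypothetical value as it would appear in config.json

multiplied = scale_by_multiplying(raw_logits, stored_value)

# Reusing the dividing forward pass requires holding the reciprocal instead,
# so the in-memory value no longer matches what config.json shows.
inverted_value = 1.0 / stored_value
divided = scale_by_dividing(raw_logits, inverted_value)
```

Both paths give the same logits; the difference is only whether the config attribute or the forward pass carries the inversion, which is the readability concern raised in the reply.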
run-slow: hyperclovax
This comment contains models: ["models/hyperclovax"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 6aa22bc to a0f82ba
@zucchini-nlp, Some of the failed tests appear to be outside the scope of this PR (e.g.,
Force-pushed from a0f82ba to 9c3fd14
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
a few files left wrt 2026 😄
run-slow: hyperclovax
zucchini-nlp left a comment
Oke, seeing a bad rebase with unrelated diff 😄 and a tiny change in rope doc. I will pass-over the latest diff after the bad rebase is fixed, and prob a core maintainer will pass over soon
Force-pushed from 29df799 to 331ed88
@zucchini-nlp ,
@bigshanedogg, one tiny unrelated diff left-out. And vasqu will come to review next week :)
Force-pushed from 9600edb to d5a0472
Sorry for all the delays, will be taking a look today!!
vasqu left a comment
Only some nits tbh, looks overall super good! Let's sync with main and fixup the last details 🤗
@unittest.skip(
    "In TP mode, Float8 quantization derives scales per shard rather than globally, "
    "so each TP rank observes different weight magnitudes than the full-weight non-TP "
    "baseline. HyperCLOVAX's Peri-Layer Normalization (post_norm1/post_norm2) amplifies "
    "this discrepancy past the 75% token-match threshold. Skipped pending an upstream fix."
)
@is_tensor_parallel_test
def test_tp_generation_quantized(self):
    pass
Interesting, cc @3outeille @SunMarc just for viz
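The skip reason above rests on per-shard scales differing from a single global scale. The following sketch shows that effect with a simplified symmetric integer grid standing in for Float8; the grid size, the toy weights, the two-way split, and the helper names are all assumptions for illustration, not the quantization code used by the test.

```python
from typing import List

QMAX = 127  # simplified symmetric integer grid, standing in for Float8


def absmax_scale(weights: List[float]) -> float:
    """Scale chosen so the largest magnitude maps to QMAX."""
    return max(abs(w) for w in weights) / QMAX


def quantize_dequantize(weights: List[float], scale: float) -> List[float]:
    """Round each weight to the grid defined by scale, then map it back."""
    return [round(w / scale) * scale for w in weights]


weights = [0.02, -0.03, 0.01, 4.0, -3.5, 2.5]  # toy weight row
shards = [weights[:3], weights[3:]]            # two hypothetical TP ranks

global_scale = absmax_scale(weights)
shard_scales = [absmax_scale(shard) for shard in shards]

global_result = quantize_dequantize(weights, global_scale)
sharded_result = [
    value
    for shard, scale in zip(shards, shard_scales)
    for value in quantize_dequantize(shard, scale)
]
```

In this toy case the shard holding only small weights gets a much finer scale than the global one, so the sharded and unsharded paths reconstruct different values for the same weights. Any downstream normalization then sees different inputs in the two modes.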
Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (text_config=None), which breaks v5's @strict config validation added in huggingface/transformers#41250.
With the vendored config registered, vLLM uses the local class instead of the broken remote code, so we can lift the max_transformers_version cap that was added in tests/models/registry.py to skip this model on v5.
Also fix the unreachable hidden_size n_embd fallback per gemini-code-assist review: the text_config_attribute_map remap pops n_embd before the fallback would ever be checked. Read hidden_size from the instantiated text_config object instead.
Fixes: vllm-project#38387
TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Fang Han <fhan0520@gmail.com>
Force-pushed from d5a0472 to fa3494a
bigshanedogg left a comment
@vasqu , Thank you for the detailed comments!
I've addressed the points you mentioned in your review.
Please let me know if I've missed anything or if there's anything else you'd like me to address.
hey @bigshanedogg 👋 I will review tomorrow, gh currently has issues so I can't create any reviews or new comments 😢 just wanted to keep you in the loop
Thanks for letting me know! No rush at all, take your time. 🙂
vasqu left a comment
Very nicely done!! 🫡 let me take care of CI slow tests on our side but will merge in a bit
run-slow: hyperclovax
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, hyperclovax
This comment contains models: ["models/hyperclovax"]
@bigshanedogg congrats on the merge!! 🤗
Thanks everyone, we will work on the VLM now that the lm backbone is merged! @bigshanedogg would be great if you could also update the README on the hub, I am seeing that it sets
@vasqu Thank you for review and the additional commits on the test code! @zucchini-nlp Along with the README update you mentioned, I'll also push a minor update to fix
* feat: hyperclovax
* fix: import and doc date
* updated tests

Co-authored-by: vasqu <antonprogamer@gmail.com>
What does this PR do?
Adds native Transformers support for HyperCLOVA X SEED Think 14B, a 14.74B-parameter Korean reasoning LLM developed by NAVER Cloud.
Architecture
LLaMA-style decoder-only transformer with two modifications:
1. Peri-Layer Normalization (use_post_norm): an extra RMSNorm is applied after each sub-layer output (both attention and MLP), in addition to the standard pre-norm.
2. MuP scaling:
   - attention_multiplier: replaces 1/sqrt(head_dim) in attention
   - residual_multiplier: scales each sub-layer output before adding to the residual stream
   - embedding_multiplier: scales the token embedding output
   - logits_scaling: scales final logits before softmax / sampling
Implementation approach
Following the maintainer's guidance in #44957, this PR uses the modular system (modular_hyperclovax.py) to minimise LOC and make the diff easy to review and iterate on. (Roughly 59% of lines are generated rather than manually maintained.)
The maintainer suggested inheriting the decoder layer with post-norms from GLM4. After evaluation, Granite was chosen as the decoder layer base instead, for the following reasons:
- use_post_norm is optional (False by default). GLM4's decoder layer has post-norms always on; inheriting from it would require logic to conditionally disable post_self_attn_layernorm / post_mlp_layernorm, adding complexity rather than reducing it.
- Granite already has residual_multiplier (always-active MuP). When use_post_norm=False, HyperCLOVAXDecoderLayer is identical to GraniteDecoderLayer, with zero extra code.
- Starting from GLM4 would mean both adding residual_multiplier and conditionally disabling its built-in norms: two changes in opposite directions for no net gain in code reuse.
All other modules (RMSNorm, MLP, Attention, etc.) are inherited from Granite unchanged. The modular file is a few hundred LOC as suggested.
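As a rough illustration of the attention_multiplier item in the Architecture list, the sketch below computes attention weights for one query with the conventional 1/sqrt(head_dim) scale and with a configured multiplier in its place. It is plain Python with hypothetical helper names, and the multiplier value 0.25 is made up; the real value comes from the model config.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: List[float], keys: List[List[float]], scale: float) -> List[float]:
    """Softmax over scaled dot products of one query against all keys."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


head_dim = 4
query = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

# Conventional scaling by 1/sqrt(head_dim).
standard = attention_weights(query, keys, scale=1.0 / math.sqrt(head_dim))
# Hypothetical attention_multiplier used in place of 1/sqrt(head_dim).
mup_style = attention_weights(query, keys, scale=0.25)
```

A smaller multiplier flattens the attention distribution, which is the only behavioural difference this sketch is meant to show.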
Benchmark validation
External support
Code Agent Policy
A code agent was used for mechanical tasks such as aligning docstrings and comments. The core implementation was written by the submitter directly, who has reviewed every changed line and personally run the tests including benchmark validation.