add HyperClovaX Vision#44314
Conversation
|
When I ran the test in my environment, hcx test ran normally without any failures.
I think the work will proceed in this manner? |
|
Hmm... In my environment, |
…iteration and using cache
…orConditionalGeneration
…d processor classes
…template application
…ing in HyperClovaXProcessor
|
Yep don't worry we'll fix this one on main! |
zucchini-nlp
left a comment
There was a problem hiding this comment.
Hey @jp1924, thanks a lot for the PR!
I think we need to move all the code to modular file first, since there is a lot of copied stuff from other models. I left you comments below about where exactly can copy from. Ping me for another review when the modular is ready
…deo processor classes.
… to HCXVisionConfig and remove unnecessary code.
vasqu
left a comment
There was a problem hiding this comment.
A few comments I think we can improve on especially re modular. But I don't think it's anything major
I responded on the comment re model type as well btw, it's a bit hard because not being able to use auto is kind of a big deal
| # Use processor.tokenizer.apply_chat_template for video inputs. | ||
| # processor.apply_chat_template rewrites image_url to image before the | ||
| # template runs, which breaks HCX's extension-based video detection. | ||
| text = processor.tokenizer.apply_chat_template( |
There was a problem hiding this comment.
This is weird, why are we even using image url --> shouldnt it be video url to properly distinguish between img/video? cc @zucchini-nlp
There was a problem hiding this comment.
huh, didn't see it. I believe related ti how the jinja template is saved on the hub
This indeed looks weird, lets' ping the Naver team to change model-type/jinja/etc or ask them to host a non-remote version. @jp1924 , when you are done passing over comments, could you list the things we need to change on the hub and ping @ (bigshanedogg). We need to either push directly to the existing repo or upload a new repo with {{MODELNAME}}-hf suffix
There was a problem hiding this comment.
#44314 (comment)
You can check the details via this link.
Sharing the text below as well since the link is dead.
I added this logic because video handling for HCX is broken on recent versions of transformers.
If you look at HCX's chat_template.jinja, it handles image/video differently from most models.
Models like `Qwen2_5_vl` or `VideoLLaMA3` expect multimodal items in a normalized form such as `{"type": "image"}` or `{"type": "video"}`.
HCX, however, uses `{"type": "image_url", "image_url": {"url": "video.mp4"}}` and `{"type": "image_url", "image_url": {"url": "image.png"}}`, then branches by file extension inside the template.
In newer transformers, this block rewrites image_url into image:
https://github.com/huggingface/transformers/blob/2da00a3cec88fac160d481406e7961cf59472894/src/transformers/processing_utils.py#L1792-L1799
Because of that rewrite, HCX's image_url-based branch no longer triggers correctly. As a result, even when a video is provided, it gets rendered like plain text:
```
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("naver-hyperclovax/HyperCLOVAX-SEED-Think-32B", trust_remote_code=False)
You are using a model of type `vlm` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
>>> video_messages = [
... {
... "role": "user",
... "content": [
... {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4"}},
... {"type": "text", "text": "What is shown in this video?"},
... ],
... }
... ]
>>> text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
>>> text
'<|im_start|>user\n\nWhat is shown in this video?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
>>> video_messages
[{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4'}, {'type': 'text', 'text': 'What is shown in this video?'}]}]
```
If it were working correctly, the output should include the video MIME and video tokens, e.g. mime_start / video_aux_start / VIDEO_PAD, not the plain-text-only prompt above.
bigshanedogg The current multimodal input contract in naver-hyperclovax/HyperCLOVAX-SEED-Think-32B is non-standard.
I opened a Hub discussion/PR to update chat_template.jinja so it aligns with current processor behavior:
https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B/discussions/12
zucchini-nlp Until that template update is merged, we need to keep this compatibility code in this PR.
I'll follow up again once we have a decision on the Hub-side template PR.
|
|
||
| ## Notes | ||
|
|
||
| - HyperCLOVAX Vision V2 uses a unique media input format. Both images and videos are specified using `{"type": "image_url", "image_url": {"url": "..."}}`. The processor and chat template distinguish images from videos by file extension (`.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`, `.flv`, `.wmv`, `.m4v` are treated as video; everything else is treated as image). |
There was a problem hiding this comment.
Ah ok this is the explicit change, imo would maybe make sense to have a different chat template if possible. This really is weird
| ``` | ||
|
|
||
| > [!WARNING] | ||
| > Video input via `processor.apply_chat_template` is currently broken. Recent Transformers versions rewrite `image_url` entries to `image` before the chat template runs, so the video-detection branch in HCX's template never triggers and video inputs are silently dropped. As a workaround, use `processor.tokenizer.apply_chat_template` to render the prompt text, then pass the video path separately to `processor(...)`. See [this review comment](https://github.com/huggingface/transformers/pull/44314#discussion_r3008382827) for details. |
There was a problem hiding this comment.
I think the review comment is closed, cant load via url it seems
There was a problem hiding this comment.
I've reopened the comment.
| @auto_docstring | ||
| class HCXVisionV2ForConditionalGeneration(HCXVisionV2PreTrainedModel, GenerationMixin): | ||
| accepts_loss_kwargs = False | ||
| _tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"} |
There was a problem hiding this comment.
Sorry but I don't really agree, we can inherit a lot from exaone 4.5 here
- the flags are the same
- init is the same (minus the vocab size but we can just call the forward similarly to get the vocab size)
- get img/video features are identical (only wrappers, agree base mode is different but it doesnt matter here)
The difference can be easily overwriten as is (forward/prepare for generate). And we can ignore the methods we don't need with def method(...): raise AttributeError()
Currently it seems we miss a few utilities as expand input is missing which will cause failures with beam search iirc
There was a problem hiding this comment.
Thanks for the changes — I've reviewed them.
The main thing I took away is that the model_type in config.json needs to be updated. If my understanding from the comment above is correct, that should be all that's needed for this part:
#44314 (comment)
I also have two minor suggestions.
First, would it be possible to rename the classes from HCX to HyperCLOVAX?
This would also be more consistent with the naming convention used in the related preceding PR for the HyperCLOVAX bare LM — e.g., HyperCLOVAXVisionV2Config.
Second, regarding the processor updates: in previous models, image_processing_*.py and video_processing_*.py are split into separate files, with processing_*.py acting as the unified entry point. We were planning to push a new version of the processor for hyperclovax_vision_v2 to the hub following the same structure, and I was wondering if it might make sense to include that update in this branch as well.
(ref: https://github.com/huggingface/transformers/tree/main/src/transformers/models/qwen2_vl)
Please feel free to share any feedback on the minor points.
As for the config.json change, based on my understanding only the model_type needs to be updated to "hyperclovax_vision_v2" — could you double-check this? Once confirmed, I'll go ahead and update it on the hub.
vasqu
left a comment
There was a problem hiding this comment.
Sorry most of the team was off for a few days!
Responding now:
- I think the rename is reasonable (
HCX -> HyperCLOVAXVisionV2) cc @jp1924 - Hmm, not sure I can follow the processor update but in general yes, we would likely need to update the code as well. I would not update qwen2 vl itself - if there are changes needed for this model, I'd rather have a custom version that builds upon that (that's what modular is for). It kind of relates to #44314 (comment) where we want to change the chat template to properly separate between images and videos.
- Yes, exactly the model type on the hub needs to be corrected, e.g. see https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B/blob/main/config.json#L25 where we have
"model_type": "vlm",but need to sync with here
Ideally, we try to sync with this PR so that we can avoid the remote code and directly move to the "native" version. Meaning, we keep these changes to a PR and only merge them when everything is ready here, wdyt? @bigshanedogg @jp1924
…vax_vision_v2.py Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
- Updated class names and configurations to follow the new naming convention (e.g., HCXVisionV2 to HyperCLOVAXVisionV2). - Enhanced documentation strings for clarity and consistency. - Adjusted default token IDs for image and video processing. - Removed unnecessary attributes and methods related to token type IDs and temporal RoPE. - Improved the processor's handling of multimodal inputs, including image and video tokens. - Updated test cases to reflect the new class names and configurations.
|
@vasqu @bigshanedogg @zucchini-nlp Sorry for the late reply. |
|
Did I understand it correctly that the template fixes are still missing then? @bigshanedogg @jp1924 Would it be possible to make them available (through a hub PR for example). The goal is to split the image to separate image and video tokens / placeholders. Would like to review when we have everything together if that's ok |
|
Sure — here's my proposed breakdown and ordering of the work:
I was thinking we could proceed roughly like this: once 1) and 2) are done, I'll share the hub PR here. Also, I saw your comment from yesterday but wasn't able to get to it today. |
|
Awesome, sounds good to me! Thanks a lot for all the work and quick responses, appreciate it |
|
@bigshanedogg
Is that what you’re saying? |
|
Thanks for summarizing — your understanding is largely correct! 3, 4, and 5 are exactly as you described. For 1 and 2, I'd just like to align on a couple of small details:
As for where to commit the processor changes
If it helps, my own thinking was to commit directly to this branch, since the processor additions naturally move together with the model code, and keeping things in a single branch might be a bit easier to track given everything else the reviewers are juggling. |
|
@bigshanedogg As for option 2, regardless of the exact approach, it basically means restoring the video and image processors that were previously removed as duplicates during this PR, right? Yeah, let’s proceed like that. You can either open a PR against Since this means adding more code changes around the video/image processors, I think there are probably two scenarios:
Either way, please understand that I may need to touch up some parts of the code afterward depending on review comments. Since this involves some additional implementation work, the PR will probably get delayed a bit more. I’ve also been pretty busy lately, so I couldn’t spend much time on this PR and my replies were slower than usual. |
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, hyperclovax_vision_v2, qwen2_5_vl |
|
So I see we are going for option 2? But yea fine with either way tbh, just a matter of preference for you guys to sync 🤗 |
|
Yeah, looks like I'll go with Option 2. I'll send another review request once the work is done! |
|
Sharing a quick progress update since things are running a bit behind. Status
What's blockingA quick sanity check on our side suggests the current state doesn't fully reproduce our released performance —
@jp1924, if you happen to have any prior sanity-check results or notes from your side, I'd love to cross-reference them — it might well be a small processor-side issue I'm overlooking, or a minor cause like a dtype mismatch that sometimes pops up. Path forwardGiven @jp1924's earlier concern about not dragging this PR out, one option is to merge as-is and let me submit a follow-up fix PR once we've found the root cause. Either way works for me — please let me know what would prefer 🙏 |
|
I'm fine with either way, we can also fix afterwards but best would be to have aligned outputs tbh - except we are really in a time rush. Correctness > quickness imo
|
|
@bigshanedogg I used to be against PRs getting too long, but after seeing the PRs for the “Add model” features open on the Transformers GitHub, I’ve had a change of heart—it’s totally fine now, lol. Hmm... What kind of evaluation are you currently conducting? |
|
@jp1924, Sharing a few details below for context and to keep us in sync:
Since the 32B uses the bare LM class, the KoBEST result isn't unrelated — but as it's a bare-LM evaluation, it is not sufficient on its own to validate this PR, where we'd want an end-to-end VLM benchmark.
The evaluation I'm running is essentially the
For the chat template issue, two options should work:
The second approach in particular should let you freely adjust the checkpoint, so the limitation you ran into shouldn't be a blocker. Also, a quick note on benchmark choice:
If the target VLM in |
|
@bigshanedogg By the way, if the processor changes you worked on are already done on your side, could you send over a PR for those first? We're already validating things anyway, so there's no reason not to get the processor review started in parallel. Anything we can get reviewed and out of the way early, let's just get it done and keep things moving. Oh, and could you also take a look at the PR I uploaded to the Hub and get that merged when you get a chance? At this point we've been going back and forth on this for quite a while, so it'd be pretty cool if I could become a contributor to your repo too 😂 |
What does this PR do?
Hello, Transformers team!
I submitted a PR to add naver-hyperclovax/HyperCLOVAX-SEED-Think-32B (hereafter HCX), developed by the Korean IT company Naver while executing the government's national AI model project.
The HCX code was written based on Transformer 4.52.4, leading to the following issues:
Moving to Transformer 5.0.0 significantly improved the readability and development convenience of the modeling code. We aim to leverage this to add the HCX model to transformers.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @yonigozlan @molbap