[Model] Add Ernie4.5 VL model support #15679
Conversation
Signed-off-by: CSWYF3634076 <wangyafeng@baidu.com>
Summary of Changes

Hello @CSWYF3634076, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces comprehensive support for the Baidu Ernie4.5 VL model. It includes the necessary architectural components for both the text backbone (which uses a Mixture-of-Experts approach) and the vision transformer, along with a dedicated multimodal processor. The changes enable the system to handle and process visual and textual inputs for the Ernie4.5 VL model, using a specialized 3D rotary embedding for multimodal positional encoding.
Code Review
This pull request adds support for the Ernie4.5 VL model. The changes include new model definition files, a vision processor, and modifications to existing chat templates and model configurations. Overall, the implementation looks solid, but I've identified a few areas for improvement, including potential division-by-zero errors in the processor, inefficient tensor operations, and some code clarity issues. Addressing these points will enhance the robustness and performance of the new model integration.
```python
    min_pixels: int = MIN_PIXELS,
    max_pixels: int = MAX_PIXELS,
):
    if max(height, width) / min(height, width) > MAX_RATIO:
```
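One way to address the division-by-zero concern is an explicit guard before the aspect-ratio check. A minimal sketch, assuming a standalone validation helper; the function name and the `MAX_RATIO` value are illustrative, not taken from the PR:

```python
MAX_RATIO = 200  # illustrative constant, not the PR's actual value


def check_aspect_ratio(height: int, width: int) -> None:
    """Reject degenerate or extremely elongated images.

    Checking min(height, width) before dividing avoids the
    ZeroDivisionError flagged in review for zero-sized inputs.
    """
    shorter = min(height, width)
    if shorter <= 0:
        raise ValueError(f"invalid image size: {height}x{width}")
    if max(height, width) / shorter > MAX_RATIO:
        raise ValueError(
            f"aspect ratio {max(height, width) / shorter:.1f} exceeds {MAX_RATIO}"
        )
```

The guard turns a confusing `ZeroDivisionError` deep in the processor into an immediate, descriptive `ValueError` at the input boundary.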
```python
max_frames = floor_by_factor(
    ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR
)
nframes = total_frames / video_fps * fps
```
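The `nframes` computation above divides by `video_fps`, so a zero or missing FPS from video metadata would also crash. A hedged sketch of the full frame-count mapping, with a guard added; the helper names mirror the snippet, but the constant values and the clamping details are assumptions, not the PR's exact logic:

```python
import math
from typing import Optional

FRAME_FACTOR = 2        # illustrative defaults, not confirmed from the PR
FPS_MAX_FRAMES = 768


def floor_by_factor(number: float, factor: int) -> int:
    """Round down to the nearest multiple of `factor`."""
    return int(math.floor(number / factor) * factor)


def target_nframes(total_frames: int, video_fps: float, fps: float,
                   max_frames: Optional[int] = None) -> int:
    """Map a video's native frame count to the number of sampled frames.

    Guards video_fps > 0 to avoid the division-by-zero path
    raised in review.
    """
    if video_fps <= 0:
        raise ValueError("video_fps must be positive")
    if max_frames is None:
        max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames),
                                     FRAME_FACTOR)
    nframes = total_frames / video_fps * fps  # duration * sampling fps
    nframes = min(nframes, max_frames)        # clamp to the frame budget
    return floor_by_factor(nframes, FRAME_FACTOR)
```

For example, a 300-frame clip at 30 fps sampled at 2 fps yields 20 frames, already a multiple of `FRAME_FACTOR`.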
```python
    device=input_ids.device,
)
image_index, video_index = 0, 0
for i, input_ids in enumerate(total_input_ids):
```
The loop variable input_ids shadows the function parameter with the same name defined on line 2196. This can be confusing and lead to bugs. It's a good practice to use a different name for the loop variable to improve code clarity.
```diff
-for i, input_ids in enumerate(total_input_ids):
+for i, current_input_ids in enumerate(total_input_ids):
```
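As a toy illustration of why the shadowing matters (the function below is hypothetical, not code from the PR): once the loop rebinds the name, the original argument is unreachable for the rest of the function body.

```python
def last_ids_after_loop(input_ids, total_input_ids):
    """Toy reproduction of the shadowing hazard."""
    for i, input_ids in enumerate(total_input_ids):
        pass  # the parameter `input_ids` is already shadowed here
    # After the loop, `input_ids` refers to total_input_ids[-1],
    # not to the argument the caller passed in.
    return input_ids
```

Any code after the loop that intends to use the parameter silently operates on the last element of `total_input_ids` instead.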
```python
input_type_group.append((key, start_index, end_index))

llm_pos_ids_list = []
video_frame_num = 1
```
```python
    rope_scaling: Optional[Dict[str, Any]] = None,
    rope_is_neox_style: bool = True,
```
The parameters rope_scaling and rope_is_neox_style are defined in the __init__ method's signature but are not used within the method. When creating Ernie4_5_VLRotaryEmbedding, is_neox_style is hardcoded to False and rope_scaling is not passed. This can be misleading and may cause issues if the model's configuration changes. Please either use these parameters or remove them if they are not needed.
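A minimal sketch of the suggested fix, forwarding the constructor arguments instead of hardcoding them. The classes below are stand-ins to make the pattern runnable; the real `Ernie4_5_VLRotaryEmbedding` takes more arguments than shown here:

```python
from typing import Any, Dict, Optional


class RotaryEmbeddingStub:
    """Stand-in for Ernie4_5_VLRotaryEmbedding; records what it received."""

    def __init__(self, is_neox_style: bool,
                 rope_scaling: Optional[Dict[str, Any]]):
        self.is_neox_style = is_neox_style
        self.rope_scaling = rope_scaling


class AttentionStub:
    """Stand-in for the attention module's __init__."""

    def __init__(
        self,
        rope_scaling: Optional[Dict[str, Any]] = None,
        rope_is_neox_style: bool = True,
    ):
        # Forward the signature's parameters so behavior stays in sync
        # with the configuration, instead of hardcoding is_neox_style.
        self.rotary_emb = RotaryEmbeddingStub(
            is_neox_style=rope_is_neox_style,
            rope_scaling=rope_scaling,
        )
```

The alternative, if the hardcoded values are intentional for this model, is to drop the two parameters from the signature entirely so readers are not misled.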
> I think the function `process_mm_data` should be rewritten by each Processor class, similar to `_call_hf_processor` in vLLM. I don't know why the current code doesn't do this.

I think it's because most VLMs are using `process_mm_data_async`, which has been rewritten by each Processor class.
@yuan-luo May I ask, besides resolving conflicts, is there anything else that needs to be updated in this PR?
@yuan-luo Hi, could you please review this PR?
Sure.
I guess you mean `_call_hf_processor` in vLLM. Writing a `_call_hf_processor` in each model is a huge refactor that would introduce massive duplicated code segments. The current mechanism is probably a concise and reusable implementation.
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@yuan-luo
I actually did exactly the same task on my own side, implementing this model, but ran into some correctness errors. So I can understand what obstacles you have pushed through, which is awesome. I'll review it ASAP.
@JustinTong0323 Hello, the review has been approved. Can you help trigger CI and the subsequent merge?
/tag-and-rerun-ci
Motivation
Add Baidu Ernie4.5 VL model support
Modifications
- `ernie45_moe_vl.py`: the text backbone
- `ernie45_vl.py`: the ViT
- `processors/ernie45_vl.py`: the processor
- `rotary_embedding.py::Ernie4_5_VLRotaryEmbedding`: the 3D RoPE (hwhwhw...ttt..)

Accuracy Tests
test case
Benchmarking and Profiling
Checklist
Others
I think the function `process_mm_data` should be rewritten by each Processor class, similar to `_call_hf_processor` in vLLM. I don't know why the current code doesn't do this.