model: minicpm-4 main model #6900
Conversation
Hello @mickqian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR introduces support for the MiniCPM-4 base model within the sglang framework. The primary changes involve adding a specific implementation for MiniCPM's scaled Rotary Positional Embedding (RoPE) and integrating support for Flash Attention 2 tailored for this model architecture. It also includes minor cleanup in model loading utilities.
Highlights
- MiniCPM Model Support: Adds the necessary model definition and components to support the MiniCPM-4 base model.
- Scaled Rotary Embedding: Implements `MiniCPMScaledRotaryEmbedding` to handle the LongRoPE scaling mechanism specific to MiniCPM, ensuring correct positional encoding for potentially longer sequences (a sketch of the scaling idea follows this list).
- Flash Attention 2 Integration: Introduces `MiniCPMFlashAttention2` to leverage the performance benefits of Flash Attention 2 for the MiniCPM model, including dtype casting for compatibility.
- Attention Mechanism Selection: Modifies the decoder layer to dynamically select the attention implementation (`eager` or `flash_attention_2`) based on the model configuration.
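For readers unfamiliar with LongRoPE-style scaling, the core idea is to rescale each rotary frequency by a per-dimension factor taken from the model config and to apply a magnitude correction when the context is extended beyond the original training length. The sketch below is illustrative only, not the code from this PR: the function name, the assumption that factors come from a `rope_scaling` field such as `long_factor`, and the Phi-3-style mscale formula are all assumptions.

```python
import math
import torch

def longrope_cos_sin_cache(
    head_dim: int,
    max_positions: int,           # extended context length
    original_max_positions: int,  # original training context length
    factors: list,                # per-dim rescale factors, len == head_dim // 2
    base: float = 10000.0,
):
    # Standard RoPE inverse frequencies, each divided by its LongRoPE factor.
    ext = torch.tensor(factors, dtype=torch.float32)
    inv_freq = 1.0 / (
        ext * base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    )
    t = torch.arange(max_positions, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)  # [max_positions, head_dim // 2]

    # Magnitude correction compensating for context extension; this particular
    # formula is the Phi-3-style variant and is an assumption here.
    scale = max_positions / original_max_positions
    mscale = 1.0 if scale <= 1.0 else math.sqrt(
        1.0 + math.log(scale) / math.log(original_max_positions)
    )
    return freqs.cos() * mscale, freqs.sin() * mscale
```

In LongRoPE-style schemes, the cache is typically built with the short factors for prompts within the original window and with the long factors beyond it.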
Changelog
- python/sglang/srt/layers/rotary_embedding.py
  - Added `rotate_half` function (copied from transformers) for general use; see the sketch after this changelog. See lines 26-28.
  - Introduced `MiniCPMScaledRotaryEmbedding` class to implement MiniCPM's specific LongRoPE scaling logic, including `_compute_cos_sin_cache` and `forward` methods tailored for this scaling. See lines 1067-1153.
- python/sglang/srt/model_executor/model_runner.py
  - Removed unused import `device_loading_context`. See lines 80-84.
  - Added import for `print_warning_once`. See line 102.
- python/sglang/srt/models/minicpm.py
  - Imported necessary modules: `PretrainedConfig`, `is_flash_attn_greater_or_equal_2_10`, `logger`, and the new `MiniCPMScaledRotaryEmbedding`. See lines 20, 21, 41, and 34.
  - Added `**kwargs` to `MiniCPMAttention.__init__` and `forward` methods for flexibility. See lines 94 and 160.
  - Implemented `MiniCPMFlashAttention2` class, providing a Flash Attention 2 implementation for MiniCPM attention, incorporating the scaled RoPE and dtype handling. See lines 173-323.
  - Defined `MINICPM_ATTENTION_CLASSES` dictionary to map attention implementation names to classes. See lines 320-323.
  - Modified `MiniCPMDecoderLayer` to use the attention class specified by `config._attn_implementation`. See line 340.
  - Passed the full `config` object to the attention layer initialization in `MiniCPMDecoderLayer`. See line 350.
  - Added a check and `squeeze(0)` call in `MiniCPMDecoderLayer.forward` if `hidden_states` is 3D. See lines 382-383.
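To make the changelog entries concrete, here is the standard transformers-style `rotate_half` helper and the rotary application pattern it supports, along with the attention-class dispatch the decoder layer reportedly uses. This is a sketch of the general pattern, not the PR's exact code; the broadcasting of `cos`/`sin` and the dictionary values (shown as placeholder strings) may differ in the real implementation.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Rotate half the hidden dims of the input (the transformers convention)."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # q, k: [..., seq_len, head_dim]; cos/sin must broadcast against them.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

# Attention-class dispatch as described in the changelog: the decoder layer
# picks its attention implementation from config._attn_implementation.
MINICPM_ATTENTION_CLASSES = {
    "eager": "MiniCPMAttention",                    # placeholder strings stand
    "flash_attention_2": "MiniCPMFlashAttention2",  # in for the real classes
}
```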
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A new model arrives,
MiniCPM now thrives,
With scaled RoPE,
And Flash Attention's hope,
Performance swiftly revives.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces support for the MiniCPM-4 base model. Key additions include a new `MiniCPMScaledRotaryEmbedding` for LongRoPE scaling and `MiniCPMFlashAttention2` for FlashAttention 2 compatibility. The changes are generally well-structured and integrate with the existing codebase.
However, a few areas require attention, particularly the handling of `config.sliding_window` in the new attention module, which could lead to runtime errors. Additionally, type hints and docstrings for some new utility functions would improve maintainability. Given the addition of a new model and attention mechanism, comprehensive unit tests are highly recommended to ensure correctness and prevent regressions.
Overall, good progress on extending model support. Addressing the identified issues and adding tests will make this PR robust.
Summary of Findings
- Potential TypeError with
config.sliding_window: InMiniCPMFlashAttention2.__init__, the use oflen(config.sliding_window)andconfig.sliding_window[0]might cause aTypeErrorifconfig.sliding_windowis an integer orNone, as is common in HuggingFacePretrainedConfigobjects. This needs verification based on MiniCPM's config structure. - Missing Type Hints and Docstrings: The new utility function
rotate_halfinrotary_embedding.pylacks type hints and a docstring, which affects code clarity and maintainability. - Clarity of Squeeze Operations in RoPE: In
MiniCPMScaledRotaryEmbedding.forward, thesqueezeoperations oncosandsintensors might be unnecessary if these tensors are already 2D, potentially reducing code clarity. - Need for Unit Tests: The pull request introduces significant new components (MiniCPM model support, new RoPE, FlashAttention2 integration). Adding comprehensive unit tests is crucial to ensure correctness, verify the new functionalities, and prevent future regressions.
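As a possible remedy for the first finding, the attention module could normalize `config.sliding_window` before calling `len()` on it or indexing into it. A minimal sketch, assuming the flash-attn convention that `(-1, -1)` means "no window"; the helper name is hypothetical:

```python
def resolve_window_size(config):
    # config.sliding_window may be None, an int, or a sequence depending on
    # the checkpoint, so normalize before calling len() or indexing.
    sw = getattr(config, "sliding_window", None)
    if sw is None:
        return (-1, -1)      # flash-attn convention: no sliding window
    if isinstance(sw, int):
        return (sw, sw)      # symmetric window around each token
    return (sw[0], sw[0])    # sequence case, as the PR's current code assumes
```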
Merge Readiness
This pull request makes good progress in adding support for the MiniCPM-4 model. However, there is a potentially critical issue with how `config.sliding_window` is handled that could lead to runtime errors. Additionally, some minor improvements regarding type hints and code clarity are suggested.
Most importantly, given the scale of these additions, comprehensive unit tests for the new RoPE implementation and the MiniCPM model (especially with the FlashAttention 2 path) are essential before merging to ensure stability and correctness.
I recommend that these changes, particularly the `sliding_window` issue and the addition of tests, be addressed before this PR is merged. As an AI assistant, I am not authorized to approve pull requests; please ensure further review and approval from other maintainers.
ref #6899
mmlu
Thanks to @yhyang201 and openbmb
Motivation
Modifications
Checklist