
model: minicpm-4 main model #6900

Closed

mickqian wants to merge 20 commits into sgl-project:main from mickqian:minicpm

Conversation

Collaborator

@mickqian mickqian commented Jun 5, 2025

MMLU

# start the server 
$ python -m sglang.launch_server --model-path ${path_to_main_model} --host 0.0.0.0 --port 9080  --trust-remote-code
# start mmlu
$ python3 bench_sglang.py --nsub 5 --port 9080
100%|███████████████████████████████████| 752/752 [00:03<00:00, 232.73it/s]
subject: abstract_algebra, #q:100, acc: 0.490
subject: anatomy, #q:135, acc: 0.667
subject: astronomy, #q:152, acc: 0.822
subject: business_ethics, #q:100, acc: 0.770
subject: clinical_knowledge, #q:265, acc: 0.781
Total latency: 3.236
Average accuracy: 0.729

Thanks to @yhyang201 and openbmb

Motivation

Modifications

Checklist

Contributor

@gemini-code-assist (Bot) left a comment


Hello @mickqian, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR introduces support for the MiniCPM-4 base model within the sglang framework. The primary changes involve adding a specific implementation for MiniCPM's scaled Rotary Positional Embedding (RoPE) and integrating support for Flash Attention 2 tailored for this model architecture. It also includes minor cleanup in model loading utilities.

Highlights

  • MiniCPM Model Support: Adds the necessary model definition and components to support the MiniCPM-4 base model.
  • Scaled Rotary Embedding: Implements MiniCPMScaledRotaryEmbedding to handle the LongRoPE scaling mechanism specific to MiniCPM-4, ensuring correct positional encoding for potentially longer sequences.
  • Flash Attention 2 Integration: Introduces MiniCPMFlashAttention2 to leverage the performance benefits of Flash Attention 2 for the MiniCPM model, including handling dtype casting for compatibility.
  • Attention Mechanism Selection: Modifies the decoder layer to dynamically select the attention implementation (eager or flash_attention_2) based on the model configuration.
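The LongRoPE-style scaling described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual sglang implementation: the function name and the `rescale` / `orig_max_pos` parameters are hypothetical stand-ins for the per-dimension factors and original context length a MiniCPM config would supply.

```python
import math
import numpy as np

def scaled_cos_sin_cache(head_dim: int, max_pos: int, orig_max_pos: int,
                         base: float = 10000.0, rescale=None):
    """Build a cos/sin cache with per-dimension frequency rescaling,
    in the spirit of LongRoPE-style schemes (hypothetical sketch)."""
    # standard RoPE inverse frequencies
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    if rescale is not None:
        # per-dimension factors from the model config stretch selected
        # frequencies so long contexts stay close to the training range
        inv_freq = inv_freq / np.asarray(rescale)
    t = np.arange(max_pos, dtype=np.float64)
    freqs = np.outer(t, inv_freq)  # shape: (max_pos, head_dim // 2)
    # amplitude correction applied when the context window is extended
    scale = max_pos / orig_max_pos
    mscale = 1.0 if scale <= 1.0 else math.sqrt(
        1.0 + math.log(scale) / math.log(orig_max_pos))
    return np.cos(freqs) * mscale, np.sin(freqs) * mscale

cos, sin = scaled_cos_sin_cache(head_dim=64, max_pos=8192, orig_max_pos=4096)
print(cos.shape)  # (8192, 32)
```

The key idea is that each rotary dimension can be rescaled independently, and an overall magnitude correction kicks in only when the cache extends past the original training length.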

Changelog

Click here to see the changelog
  • python/sglang/srt/layers/rotary_embedding.py
    • Added rotate_half function (copied from transformers) for general use. See lines 26-28.
    • Introduced MiniCPMScaledRotaryEmbedding class to implement MiniCPM's specific LongRoPE scaling logic, including _compute_cos_sin_cache and forward methods tailored for this scaling. See lines 1067-1153.
  • python/sglang/srt/model_executor/model_runner.py
    • Removed unused import device_loading_context. See lines 80-84.
    • Added import for print_warning_once. See line 102.
  • python/sglang/srt/models/minicpm.py
    • Imported necessary modules: PretrainedConfig, is_flash_attn_greater_or_equal_2_10, logger, and the new MiniCPMScaledRotaryEmbedding. See lines 20, 21, 41, and 34.
    • Added **kwargs to MiniCPMAttention.__init__ and forward methods for flexibility. See lines 94 and 160.
    • Implemented MiniCPMFlashAttention2 class, providing a Flash Attention 2 implementation for MiniCPM attention, incorporating the scaled RoPE and dtype handling. See lines 173-323.
    • Defined MINICPM_ATTENTION_CLASSES dictionary to map attention implementation names to classes. See lines 320-323.
    • Modified MiniCPMDecoderLayer to use the attention class specified by config._attn_implementation. See line 340.
    • Passed the full config object to the attention layer initialization in MiniCPMDecoderLayer. See line 350.
    • Added a check and squeeze(0) call in MiniCPMDecoderLayer.forward if hidden_states is 3D. See lines 382-383.
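The `rotate_half` helper mentioned in the changelog is a small, well-known utility from transformers: it splits the last dimension in half and rotates the halves. A self-contained sketch (numpy stands in for torch here so the example runs on its own) with the type hints and docstring the review asks for:

```python
import numpy as np

def rotate_half(x: np.ndarray) -> np.ndarray:
    """Rotate half the hidden dims of the input: (x1, x2) -> (-x2, x1).

    Mirrors the transformers helper this PR copies; numpy stands in
    for torch so the sketch is self-contained.
    """
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate((-x2, x1), axis=-1)

print(rotate_half(np.array([1.0, 2.0, 3.0, 4.0])))  # [-3. -4.  1.  2.]
```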


A new model arrives,
MiniCPM now thrives,
With scaled RoPE,
And Flash Attention's hope,
Performance swiftly revives.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mickqian

This comment was marked as outdated.

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for the MiniCPM-4 base model. Key additions include a new MiniCPMScaledRotaryEmbedding for LongRoPE scaling and MiniCPMFlashAttention2 for FlashAttention 2 compatibility. The changes are generally well-structured and integrate with the existing codebase.

However, there are a few areas that require attention, particularly regarding the handling of config.sliding_window in the new attention module, which could lead to runtime errors. Additionally, type hints and docstrings for some new utility functions would improve maintainability. Given the addition of a new model and attention mechanism, comprehensive unit tests are highly recommended to ensure correctness and prevent regressions.

Overall, good progress on extending model support. Addressing the identified issues and adding tests will make this PR robust.

Summary of Findings

  • Potential TypeError with config.sliding_window: In MiniCPMFlashAttention2.__init__, the use of len(config.sliding_window) and config.sliding_window[0] might cause a TypeError if config.sliding_window is an integer or None, as is common in HuggingFace PretrainedConfig objects. This needs verification based on MiniCPM's config structure.
  • Missing Type Hints and Docstrings: The new utility function rotate_half in rotary_embedding.py lacks type hints and a docstring, which affects code clarity and maintainability.
  • Clarity of Squeeze Operations in RoPE: In MiniCPMScaledRotaryEmbedding.forward, the squeeze operations on cos and sin tensors might be unnecessary if these tensors are already 2D, potentially reducing code clarity.
  • Need for Unit Tests: The pull request introduces significant new components (MiniCPM model support, new RoPE, FlashAttention2 integration). Adding comprehensive unit tests is crucial to ensure correctness, verify the new functionalities, and prevent future regressions.
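The `sliding_window` concern above can be addressed with a defensive accessor along these lines. This is a hypothetical helper (the name `resolve_sliding_window` is not from the PR), sketching how to avoid calling `len(...)` or indexing on a value that HuggingFace-style configs may expose as `None`, an int, or a per-layer list:

```python
from types import SimpleNamespace

def resolve_sliding_window(config):
    """Defensively read config.sliding_window, which HuggingFace-style
    configs may expose as None, an int, or a per-layer sequence.
    Hypothetical helper; not part of the PR itself."""
    sw = getattr(config, "sliding_window", None)
    if sw is None or isinstance(sw, int):
        return sw
    # only index when it is actually a sequence, avoiding the TypeError
    # that len(...)/[0] would raise on an int or None
    return sw[0] if len(sw) > 0 else None

print(resolve_sliding_window(SimpleNamespace(sliding_window=4096)))    # 4096
print(resolve_sliding_window(SimpleNamespace(sliding_window=None)))    # None
print(resolve_sliding_window(SimpleNamespace(sliding_window=[2048])))  # 2048
```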

Merge Readiness

This pull request makes good progress in adding support for the MiniCPM-4 model. However, there's a potentially critical issue with how config.sliding_window is handled that could lead to runtime errors. Additionally, some minor improvements regarding type hints and code clarity are suggested.

Most importantly, given the scale of these additions, comprehensive unit tests for the new RoPE implementation and the MiniCPM model (especially with the FlashAttention2 path) are essential before merging to ensure stability and correctness.

I recommend that these changes, particularly the sliding_window issue and the addition of tests, be addressed before this PR is merged. As an AI assistant, I am not authorized to approve pull requests; please ensure further review and approval from other maintainers.

Comment thread python/sglang/srt/models/minicpm.py Outdated
Comment thread python/sglang/srt/layers/rotary_embedding.py Outdated
Comment thread python/sglang/srt/layers/rotary_embedding.py Outdated
@mickqian changed the title from "model: minicpm-4 base model" to "model: minicpm-4 main model" on Jun 5, 2025
Collaborator Author

mickqian commented Jun 5, 2025

ref #6899

@mickqian mickqian closed this Nov 28, 2025