
model: minicpm-4 main model #6900

Closed

mickqian wants to merge 20 commits into sgl-project:main from mickqian:minicpm

Conversation

Collaborator

@mickqian mickqian commented Jun 5, 2025

MMLU

# start the server 
$ python -m sglang.launch_server --model-path ${path_to_main_model} --host 0.0.0.0 --port 9080  --trust-remote-code
# start mmlu
$ python3 bench_sglang.py --nsub 5 --port 9080
100%|███████████████████████████████████| 752/752 [00:03<00:00, 232.73it/s]
subject: abstract_algebra, #q:100, acc: 0.490
subject: anatomy, #q:135, acc: 0.667
subject: astronomy, #q:152, acc: 0.822
subject: business_ethics, #q:100, acc: 0.770
subject: clinical_knowledge, #q:265, acc: 0.781
Total latency: 3.236
Average accuracy: 0.729

Thanks to @yhyang201 and openbmb

Motivation

Modifications

Checklist

Contributor

@gemini-code-assist (Bot) left a comment


Hello @mickqian, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR introduces support for the MiniCPM-4 base model within the sglang framework. The primary changes involve adding a specific implementation for MiniCPM's scaled Rotary Positional Embedding (RoPE) and integrating support for Flash Attention 2 tailored for this model architecture. It also includes minor cleanup in model loading utilities.

Highlights

  • MiniCPM Model Support: Adds the necessary model definition and components to support the MiniCPM-4 base model.
  • Scaled Rotary Embedding: Implements MiniCPMScaledRotaryEmbedding to handle the LongRoPE scaling mechanism specific to MiniCPM-4, ensuring correct positional encoding for potentially longer sequences.
  • Flash Attention 2 Integration: Introduces MiniCPMFlashAttention2 to leverage the performance benefits of Flash Attention 2 for the MiniCPM model, including handling dtype casting for compatibility.
  • Attention Mechanism Selection: Modifies the decoder layer to dynamically select the attention implementation (eager or flash_attention_2) based on the model configuration.
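The LongRoPE-style scaling described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual sglang implementation: the function name and the `rescale` / `orig_max_pos` parameters are hypothetical stand-ins for the per-dimension factors and original context length a MiniCPM config would supply.

```python
import math
import numpy as np

def scaled_cos_sin_cache(head_dim: int, max_pos: int, orig_max_pos: int,
                         base: float = 10000.0, rescale=None):
    """Build a cos/sin cache with per-dimension frequency rescaling,
    in the spirit of LongRoPE-style schemes (hypothetical sketch)."""
    # standard RoPE inverse frequencies
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    if rescale is not None:
        # per-dimension factors from the model config stretch selected
        # frequencies so long contexts stay close to the training range
        inv_freq = inv_freq / np.asarray(rescale)
    t = np.arange(max_pos, dtype=np.float64)
    freqs = np.outer(t, inv_freq)  # shape: (max_pos, head_dim // 2)
    # amplitude correction applied when the context window is extended
    scale = max_pos / orig_max_pos
    mscale = 1.0 if scale <= 1.0 else math.sqrt(
        1.0 + math.log(scale) / math.log(orig_max_pos))
    return np.cos(freqs) * mscale, np.sin(freqs) * mscale

cos, sin = scaled_cos_sin_cache(head_dim=64, max_pos=8192, orig_max_pos=4096)
print(cos.shape)  # (8192, 32)
```

The key idea is that each rotary dimension can be rescaled independently, and an overall magnitude correction kicks in only when the cache extends past the original training length.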

Changelog

Click here to see the changelog
  • python/sglang/srt/layers/rotary_embedding.py
    • Added rotate_half function (copied from transformers) for general use. See lines 26-28.
    • Introduced MiniCPMScaledRotaryEmbedding class to implement MiniCPM's specific LongRoPE scaling logic, including _compute_cos_sin_cache and forward methods tailored for this scaling. See lines 1067-1153.
  • python/sglang/srt/model_executor/model_runner.py
    • Removed unused import device_loading_context. See lines 80-84.
    • Added import for print_warning_once. See line 102.
  • python/sglang/srt/models/minicpm.py
    • Imported necessary modules: PretrainedConfig, is_flash_attn_greater_or_equal_2_10, logger, and the new MiniCPMScaledRotaryEmbedding. See lines 20, 21, 41, and 34.
    • Added **kwargs to MiniCPMAttention.__init__ and forward methods for flexibility. See lines 94 and 160.
    • Implemented MiniCPMFlashAttention2 class, providing a Flash Attention 2 implementation for MiniCPM attention, incorporating the scaled RoPE and dtype handling. See lines 173-323.
    • Defined MINICPM_ATTENTION_CLASSES dictionary to map attention implementation names to classes. See lines 320-323.
    • Modified MiniCPMDecoderLayer to use the attention class specified by config._attn_implementation. See line 340.
    • Passed the full config object to the attention layer initialization in MiniCPMDecoderLayer. See line 350.
    • Added a check and squeeze(0) call in MiniCPMDecoderLayer.forward if hidden_states is 3D. See lines 382-383.
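The `rotate_half` helper mentioned in the changelog is a small, well-known utility from transformers: it splits the last dimension in half and rotates the halves. A self-contained sketch (numpy stands in for torch here so the example runs on its own) with the type hints and docstring the review asks for:

```python
import numpy as np

def rotate_half(x: np.ndarray) -> np.ndarray:
    """Rotate half the hidden dims of the input: (x1, x2) -> (-x2, x1).

    Mirrors the transformers helper this PR copies; numpy stands in
    for torch so the sketch is self-contained.
    """
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate((-x2, x1), axis=-1)

print(rotate_half(np.array([1.0, 2.0, 3.0, 4.0])))  # [-3. -4.  1.  2.]
```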


A new model arrives,
MiniCPM now thrives,
With scaled RoPE,
And Flash Attention's hope,
Performance swiftly revives.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mickqian

This comment was marked as outdated.

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for the MiniCPM-4 base model. Key additions include a new MiniCPMScaledRotaryEmbedding for LongRoPE scaling and MiniCPMFlashAttention2 for FlashAttention 2 compatibility. The changes are generally well-structured and integrate with the existing codebase.

However, there are a few areas that require attention, particularly regarding the handling of config.sliding_window in the new attention module, which could lead to runtime errors. Additionally, type hints and docstrings for some new utility functions would improve maintainability. Given the addition of a new model and attention mechanism, comprehensive unit tests are highly recommended to ensure correctness and prevent regressions.

Overall, good progress on extending model support. Addressing the identified issues and adding tests will make this PR robust.

Summary of Findings

  • Potential TypeError with config.sliding_window: In MiniCPMFlashAttention2.__init__, the use of len(config.sliding_window) and config.sliding_window[0] might cause a TypeError if config.sliding_window is an integer or None, as is common in HuggingFace PretrainedConfig objects. This needs verification based on MiniCPM's config structure.
  • Missing Type Hints and Docstrings: The new utility function rotate_half in rotary_embedding.py lacks type hints and a docstring, which affects code clarity and maintainability.
  • Clarity of Squeeze Operations in RoPE: In MiniCPMScaledRotaryEmbedding.forward, the squeeze operations on cos and sin tensors might be unnecessary if these tensors are already 2D, potentially reducing code clarity.
  • Need for Unit Tests: The pull request introduces significant new components (MiniCPM model support, new RoPE, FlashAttention2 integration). Adding comprehensive unit tests is crucial to ensure correctness, verify the new functionalities, and prevent future regressions.
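The `sliding_window` concern above can be addressed with a defensive accessor along these lines. This is a hypothetical helper (the name `resolve_sliding_window` is not from the PR), sketching how to avoid calling `len(...)` or indexing on a value that HuggingFace-style configs may expose as `None`, an int, or a per-layer list:

```python
from types import SimpleNamespace

def resolve_sliding_window(config):
    """Defensively read config.sliding_window, which HuggingFace-style
    configs may expose as None, an int, or a per-layer sequence.
    Hypothetical helper; not part of the PR itself."""
    sw = getattr(config, "sliding_window", None)
    if sw is None or isinstance(sw, int):
        return sw
    # only index when it is actually a sequence, avoiding the TypeError
    # that len(...)/[0] would raise on an int or None
    return sw[0] if len(sw) > 0 else None

print(resolve_sliding_window(SimpleNamespace(sliding_window=4096)))    # 4096
print(resolve_sliding_window(SimpleNamespace(sliding_window=None)))    # None
print(resolve_sliding_window(SimpleNamespace(sliding_window=[2048])))  # 2048
```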

Merge Readiness

This pull request makes good progress in adding support for the MiniCPM-4 model. However, there's a potentially critical issue with how config.sliding_window is handled that could lead to runtime errors. Additionally, some minor improvements regarding type hints and code clarity are suggested.

Most importantly, given the scale of these additions, comprehensive unit tests for the new RoPE implementation and the MiniCPM model (especially with the FlashAttention2 path) are essential before merging to ensure stability and correctness.

I recommend that these changes, particularly the sliding_window issue and the addition of tests, be addressed before this PR is merged. As an AI assistant, I am not authorized to approve pull requests; please ensure further review and approval from other maintainers.

Comment thread python/sglang/srt/models/minicpm.py Outdated
Comment thread python/sglang/srt/layers/rotary_embedding.py Outdated
Comment thread python/sglang/srt/layers/rotary_embedding.py Outdated
@mickqian changed the title from "model: minicpm-4 base model" to "model: minicpm-4 main model" on Jun 5, 2025
Collaborator Author

mickqian commented Jun 5, 2025

ref #6899

@mickqian mickqian closed this Nov 28, 2025