
[GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V#14720

Merged
BBuf merged 1 commit into sgl-project:main from antgroup:support_pp_for_glm_4.6v
Dec 10, 2025

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Dec 9, 2025

Motivation

This PR adds Pipeline Parallelism (PP) support for GLM-4.6V and GLM-4.1V.
On the main branch, PP does not work correctly for GLM-4.6V and GLM-4.1V. There are two reasons:

  1. With pp-size > 1, some pipeline shards do not have the lm_head weight, but the current code attempts to load it on all shards, so loading fails with an assertion on the key lm_head.weight. A few other layers have the same issue; this PR fixes them as well.
    The reason pp=1 does not hit this problem is that GLM-4.1V has tie_word_embeddings=False, so:
    a. if pp=1, lm_head is always set to ParallelLMHead()
    b. if pp>1, lm_head is only set on the last rank; intermediate stages get no lm_head.
if self.pp_group.is_last_rank:
    if self.pp_group.world_size == 1 and self.config.tie_word_embeddings:
        self.lm_head = self.model.embed_tokens
    else:
        self.lm_head = ParallelLMHead(...)  # <<<<<< the last rank enters this branch
else:
    self.lm_head = PPMissingLayer()  # <<<<<< intermediate stages enter this branch, no lm_head

Some related issues: #1508, #2935.
2. general_mm_embed_routine is missing the PPProxyTensors parameter (see the sketch below).
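
For the second point, here is a minimal, self-contained sketch of the shape of the fix, not the actual sglang code: general_mm_embed_routine_toy, the dict stand-in for PPProxyTensors, and all parameter names are illustrative assumptions. The idea is that only the first PP stage computes the (possibly multimodal) input embeddings, while later stages must consume the hidden states handed over by the previous stage, which is exactly the parameter that was being dropped.

from typing import Dict, Optional

import torch

# Stand-in for sglang's PPProxyTensors: a bag of tensors passed between
# pipeline stages.
PPProxyTensorsToy = Dict[str, torch.Tensor]

def general_mm_embed_routine_toy(
    input_ids: torch.Tensor,
    embed_tokens: torch.nn.Embedding,
    is_first_rank: bool,
    pp_proxy_tensors: Optional[PPProxyTensorsToy] = None,
) -> torch.Tensor:
    if is_first_rank:
        # First stage: build the (possibly multimodal) input embeddings.
        return embed_tokens(input_ids)
    # Later stages: if the routine never receives pp_proxy_tensors, there
    # is nothing to run the remaining layers on; this is the missing
    # parameter that broke pp > 1.
    assert pp_proxy_tensors is not None
    return pp_proxy_tensors["hidden_states"]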

For the first point, when training/saving checkpoints, some frameworks only save model.embed_tokens.weight and don’t save lm_head.weight separately.

When the inference framework implements the model structure, some model classes, such as Gemma3_mm, Gemma3n_mm, Granite, or some EAGLE draft models, actually define a self.lm_head = nn.Linear(...) and then, in tie_weights(), make lm_head.weight and embed_tokens.weight point to the same Parameter. Some models do not define an lm_head module at all and instead directly use F.linear(hidden, embed_tokens.weight) in forward to produce the output.
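
As a concrete illustration of that tying pattern, here is a self-contained toy (not code from any of the models named above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedLMHeadToy(nn.Module):
    def __init__(self, vocab_size: int = 8, hidden_size: int = 4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.tie_weights()

    def tie_weights(self) -> None:
        # lm_head.weight and embed_tokens.weight now point to the same
        # Parameter, so a checkpoint only needs embed_tokens.weight.
        self.lm_head.weight = self.embed_tokens.weight

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Equivalent variant used by some models: skip the lm_head module
        # entirely and reuse the embedding matrix directly.
        return F.linear(hidden_states, self.embed_tokens.weight)

model = TiedLMHeadToy()
assert model.lm_head.weight is model.embed_tokens.weight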

In the current setup, suppose we have pp_size = 2 with GLM-4.1V (which uses tie_word_embeddings = False):

  • On the earlier stage: there is only a PPMissingLayer(), and no lm_head.weight.
  • On the last stage: there is a real lm_head module, so lm_head.weight must be loaded.

The modification: when lm_head.weight is encountered during weight loading and the current rank is not the last rank, we skip it.
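
A minimal sketch of that skip rule, assuming hypothetical names (load_weights_toy is not the exact diff in glm4v.py): checkpoint keys whose layer lives only on the last PP stage are skipped on earlier stages instead of tripping the missing-key assertion.

from typing import Dict, Iterable, Tuple

import torch

def load_weights_toy(
    params_dict: Dict[str, torch.nn.Parameter],
    weights: Iterable[Tuple[str, torch.Tensor]],
    is_last_rank: bool,
) -> None:
    for name, loaded_weight in weights:
        if name == "lm_head.weight" and not is_last_rank:
            # Intermediate stages replaced lm_head with PPMissingLayer(),
            # so this key has no matching parameter here; skip it.
            continue
        if name not in params_dict:
            # Likewise for other layers that live on a different stage.
            continue
        params_dict[name].data.copy_(loaded_weight)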

GLM-4.6V Server:

$ SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server --model-path /home/admin/GLM-4.6V --mm-attention-backend fa3 --port 30000 --chunked-prefill-size 8192 --disable-radix-cache --disable-overlap-schedule --attention-backend fa3 --tp 4 --pp-size 2

Client:

$ bash bench_n_image.sh
{"id":"07cdbe828ef6470db12e99d17f63f64a","object":"chat.completion","created":1765271843,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中的植物是**刺儿菜**(学名:*Cirsium setosum*),属于菊科蓟属。其典型特征为:茎直立,具纵棱;叶片边缘具刺状齿,两面被蛛丝状毛;头状花序单生于茎顶或枝端,总苞片多层,具刺。  \n\n刺儿菜为多年生草本,常见于田间、路旁、荒地等处,适应性较强。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":960,"total_tokens":1058,"completion_tokens":98,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m3.419s
user    0m0.002s
sys     0m0.003s
{"id":"08d083ae213c43478342e73ea6fad238","object":"chat.completion","created":1765271844,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中的植物是**刺儿菜**(学名:*Cirsium setosum*),属于菊科蓟属。其典型特征为:茎直立,具纵棱;叶片边缘具刺状齿,两面被蛛丝状毛;头状花序单生于茎顶或枝端,总苞片多层,具刺。  \n\n刺儿菜为多年生草本,常见于田间、路旁、荒地等处,适应性较强。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151336}],"usage":{"prompt_tokens":960,"total_tokens":1058,"completion_tokens":98,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m1.254s
user    0m0.002s
sys     0m0.003s

GLM-4.1V Server:

$ SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server --model-path /home/admin/GLM-4.1V-9B-Thinking --mm-attention-backend fa3 --port 30000 --chunked-prefill-size 8192 --disable-radix-cache --disable-overlap-schedule --attention-backend fa3 --tp 1 --pp-size 2
$ bash bench_n_image.sh
{"id":"7bb67e024d4c4230a4891038318b233b","object":"chat.completion","created":1765272124,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"用户现在需要识别图中的植物,从图片看,这种植物有带刺的叶片,可能是刺儿菜(Cirsium setosum),不过也可能是其他蓟属植物。先看特征:叶片边缘有刺,茎和叶有刺,属于菊科蓟属。刺儿菜学名Cirsium setosum(有时也叫Cirsium arvense?不对,要确认)。不过更常见的可能是刺儿菜,学名Cirsium setosum(Beauv.)Bess.,别名小蓟、刺菜等。需要确认形态:叶片羽状深裂,边缘有刺,茎直立,有刺,符合蓟属特征。所以判断为刺儿菜,学名Cirsium setosum。图中的植物是刺儿菜,其学名是 **Cirsium setosum**(也常被称为 *Cirsium arvense* ,不过二者在分类和形态上有一定差异,需结合地域等判断,刺儿菜更偏向 *Cirsium setosum* 这类形态,叶片等特征符合刺儿菜常见形态)。刺儿菜属于菊科蓟属植物,常生长在草地、路边等处,叶片边缘和茎部有刺,是常见的野生草本植物,部分地区也用于药用等。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151348}],"usage":{"prompt_tokens":956,"total_tokens":1237,"completion_tokens":281,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m3.268s
user    0m0.003s
sys     0m0.001s
{"id":"0379a48690704a5ab774dd4d6e6ed131","object":"chat.completion","created":1765272127,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"用户现在需要识别图中的植物,从图片看,这种植物有带刺的叶片,可能是刺儿菜(Cirsium setosum),不过也可能是其他蓟属植物。先看特征:叶片边缘有刺,茎和叶有刺,属于菊科蓟属。刺儿菜学名Cirsium setosum(有时也叫Cirsium arvense?不对,要确认)。不过更常见的可能是刺儿菜,学名Cirsium setosum(Beauv.)Bess.,别名小蓟、刺菜等。需要确认形态:叶片羽状深裂,边缘有刺,茎直立,有刺,符合蓟属特征。所以判断为刺儿菜,学名Cirsium setosum。图中的植物是刺儿菜,其学名是 **Cirsium setosum**(也常被称为 *Cirsium arvense* ,不过二者在分类和形态上有一定差异,需结合地域等判断,刺儿菜更偏向 *Cirsium setosum* 这类形态,叶片等特征符合刺儿菜常见形态)。刺儿菜属于菊科蓟属植物,常生长在草地、路边等处,叶片边缘和茎部有刺,是常见的野生草本植物,部分地区也用于药用等。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151348}],"usage":{"prompt_tokens":956,"total_tokens":1237,"completion_tokens":281,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m3.097s
user    0m0.001s
sys     0m0.003s

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang framework by introducing robust support for Pipeline Parallelism in GLM-4.6V and GLM-4.1V vision-language models. It addresses critical issues related to the distributed loading of model weights and ensures proper parameter propagation within multimodal embedding routines, thereby enabling these models to scale efficiently across multiple devices. The changes are validated with a new accuracy test specifically designed for pipeline parallel configurations.

Highlights

  • Pipeline Parallelism Support: Enabled Pipeline Parallelism for GLM-4.6V and GLM-4.1V models, addressing previous issues that prevented correct operation.
  • Weight Loading Fixes: Corrected an issue where lm_head and other layer weights were incorrectly loaded across all pipeline shards, leading to errors. Weights are now conditionally loaded only on relevant shards, specifically skipping lm_head on non-last ranks.
  • Parameter Passing for Multimodal Embedding: Ensured the general_mm_embed_routine receives necessary PPProxyTensors parameters for proper functioning in a pipeline parallel setup, resolving a missing parameter issue.
  • New Accuracy Test: Introduced a dedicated accuracy test for GLM-4.1V with pipeline parallelism to validate the correctness and performance of the implementation.

@yuan-luo
Collaborator Author

yuan-luo commented Dec 9, 2025

cc: @zRzRzRzRzRzRzR

@yuan-luo yuan-luo added performance Multi-modal multi-modal language model vlm labels Dec 9, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces support for Pipeline Parallelism for GLM-4.6V and GLM-4.1V models. The changes address issues related to lm_head weight loading in pipeline parallel setups and missing PPProxyTensors parameters in general_mm_embed_routine. The addition of PPProxyTensors import and its usage in the forward method and general_mm_embed_routine are correctly implemented. Robustness improvements have been made in load_weights by adding checks for missing parameters. A new test case TestGLM41VPPAccuracy has been added to verify the functionality, which is a good practice.

Comment thread test/srt/test_pp_single_node.py Outdated
@yuan-luo
Collaborator Author

yuan-luo commented Dec 9, 2025

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Dec 9, 2025
@yuan-luo yuan-luo force-pushed the support_pp_for_glm_4.6v branch from a2062c5 to 79eb673 Compare December 9, 2025 09:34
Comment thread python/sglang/srt/models/glm4v.py
Comment thread python/sglang/srt/models/glm4v.py
@yuan-luo yuan-luo force-pushed the support_pp_for_glm_4.6v branch 2 times, most recently from 74f3c9f to 93fde6b Compare December 9, 2025 13:49
Collaborator

@BBuf BBuf left a comment


LGTM.

Comment thread python/sglang/srt/models/glm4v.py
@yuan-luo yuan-luo force-pushed the support_pp_for_glm_4.6v branch from 93fde6b to cbb43b2 Compare December 10, 2025 02:39
@BBuf BBuf merged commit 03836d8 into sgl-project:main Dec 10, 2025
210 of 227 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 10, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (142 commits)
  [diffusion] performance: refactor diffusion fuse qkv and apply to qwen-image (sgl-project#14793)
  Revert "fix: checking if tokenizer is in cache before downloading from HF" (sgl-project#14808)
  [diffusion] CI: Add LoRA support to diffusion server configuration and test cases (sgl-project#14697)
  [GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V (sgl-project#14720)
  [diffusion] profile: early exit when enough steps are captured to reduce the size of the trace file (sgl-project#14803)
  fix b200 fa4 ci (sgl-project#14788)
  fix: add missing logic for SGLANG_USE_MODELSCOPE variable (sgl-project#14794)
  [RL] support weight reload for low-bit rollout (sgl-project#9650)
  fix b200 ci (sgl-project#14786)
  handling incomplete rope_scaling config ci after transformers upgrade (sgl-project#14784)
  fix server args bug (sgl-project#14725)
  Fix VLM accuracy thresholds for nightly tests (sgl-project#14777)
  fix: race condition between validation and download locks (sgl-project#14761)
  [Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) (sgl-project#14759)
  [fix] Fix issues for in-flight weight updates (sgl-project#14064)
  Disable 8-gpu-b200 runner in PR tests (sgl-project#14768)
  fix: adding rate limit warning at verify token permission stage (sgl-project#14756)
  Re-add the API serving timing metrics. (sgl-project#14744)
  [ci]: Enable the new hf API (sgl-project#14687)
  [model-gateway] release gateway 0.2.4 (sgl-project#14763)
  ...
shevateng0 pushed a commit to shevateng0/sglang that referenced this pull request Dec 10, 2025
@yuan-luo yuan-luo deleted the support_pp_for_glm_4.6v branch December 12, 2025 01:57
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Jan 16, 2026
* [model-gateway] extract conversation out of oai router (sgl-project#14440)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 (sgl-project#14325)

* [model-gateway] move oai header util to router header util (sgl-project#14441)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [FIX] trtllm-moe-fp4-renorm for Qwen series models (sgl-project#14350)

* add doc for quantized kv cache (sgl-project#14348)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>

* fix: Correct environment variable syntax in docker-compose configuration (sgl-project#8287)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

* [model-gateway] move all responses api event from oai to proto (sgl-project#14446)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [model-gateway] add mistral 3 image processor (sgl-project#14445)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* [model-gateway] grpc to leverage event type (sgl-project#14450)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* ministral3 (sgl-project#14251)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>

* [Bug] fix not desired disable fused share experts caused by rocm logic (sgl-project#14432)

* Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE (sgl-project#14421)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* [diffusion] improve: further optimize model load (sgl-project#13836)

* Add CI permissions for user 'yushengsu-thu' (sgl-project#14468)

* [ez] Fix typing (sgl-project#14473)

* Add AMD stage support to /rerun-stage command and fix related bugs (sgl-project#14463)

* Add YAMY1234 to CI Permission (sgl-project#14475)

* clean up gemlite usage (sgl-project#14444)

* [diffusion] chore: further improve model searching logic (sgl-project#14484)

* [diffusion] fix: fix bug about pin memory when offloading (sgl-project#14472)

* [diffusion] cli: add argument --adjust-frames and --override-protected-fields (sgl-project#13996)

Co-authored-by: dev <devnull@example.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* dockerfile: add runtime stage + ubuntu 24.04 (sgl-project#13861)

* [diffusion] fix: fix CLIP text encoder attention mask not used (sgl-project#14364)

Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Enable RadixCache for Mamba2 models (sgl-project#13584)

* [diffusion] fix: Fix profiler trace missing Python stack in diffusion pipeline (sgl-project#14499)

* support GLM-V vision model dp  (sgl-project#14097)

* [misc] add model arch and type to server info and use it for harmony (sgl-project#14456)

* Add Mistral Large 3 Eagle Support (sgl-project#14466)

Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

* Add Mistral Large 3 to nightly CI tests (sgl-project#14459)

* [diffusion] chore: set allowing overriding protected fields of sampling params as default behavior (sgl-project#14471)

* [model-gateway] move conversation to first class routing (sgl-project#14506)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [Spec] Mamba2 support in target models (sgl-project#13434)

* [diffusion] feat: support cache-dit integration (sgl-project#14234)

Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: DefTruth <qiustudent_r@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-project#14093)

Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

* [model-gateway] Add WASM support for middleware (sgl-project#12471)

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>

* [model-gateway] reorganized conversation handler (sgl-project#14507)

Co-authored-by: key4ng <rukeyang@gmail.com>

* tiny remove deprecated endpoint call (sgl-project#13607)

* [model-gateway] fix server info comment (sgl-project#14508)

* Add Mistral Large 3 basic test to PR CI (sgl-project#14460)

* Fix removing worker will make it healthy forever in prometheus metrics (sgl-project#14420)

* [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT (sgl-project#14405)

* [model-gateway] change sgl-router to sgl-model-gateway (sgl-project#14312)

* [model-gateway] fix left over sgl-router names to sgl-model-gateway (sgl-project#14512)

* [model-gateway] fix logs in smg workflow (sgl-project#14513)

* [model-gateway] fix left over sgl-router names in wasm (sgl-project#14514)

* [model-gateway] fix code owner for wasm (sgl-project#14516)

* chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14427)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>

* Tiny use trtllm_mha as default when possible (sgl-project#14291)

* [Docs] Add /rerun-stage command to contribution guide (sgl-project#14521)

* Fix safetensors validation to catch corruption after download (sgl-project#14465)

* [CODEOWNER] update codeowner for qwen3-next related (sgl-project#14522)

* fix: fix rmsnorm -> layernorm in qwen3 omni (sgl-project#11791)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* [diffusion] chore: temporarily upgrade diffusers to make Z-image compatible with Cache-DiT (sgl-project#14530)

* [bug] fix notebook to include new keys from model_info (sgl-project#14528)

* Revise DP Multi-Modal Encoder Document (sgl-project#14290)

* [CPU] add mamba fla kernels for Qwen3-next (sgl-project#12324)

* Revert "tiny remove deprecated endpoint call" (sgl-project#14533)

* support mtp with deepseek r1 nvfp4 model (sgl-project#13115)

Co-authored-by: Trevor Morris <tmorris@nvidia.com>

* [diffusion] refactor: simplify sampling params' override logic (sgl-project#14539)

* [diffusion] perf: add QKV fusion optimization for Flux models (sgl-project#14505)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) (sgl-project#13897)

* [diffusion] lora: fix LoRA dtype handling and weight attribute access for z-image model (sgl-project#14543)

Co-authored-by: niehen6174 <nihen6174@gmail.com>

* fix "GrammarMatcher has terminated after accepting the stop token, but is trying to find the next token mask" when both reasoning and spec are enabled (sgl-project#14464)

* [1/n] Fix hanging during DeepGemm Warmup (sgl-project#14493)

* [Bug fix] Add /model_info endpoint to mini_lb (sgl-project#14535)

* [Qwen3-next] remove heuristics and add radix cache kl test (sgl-project#14520)

* [Misc]Register and refactor some environs for dpsk-fp4 and DeepEp (sgl-project#14538)

* chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14518)

* Update CI_PERMISSIONS.json (sgl-project#14552)

* Update DeepSeek V3 docs to use B200 (sgl-project#14447)

* [Doc] Add short explanation on page size (sgl-project#14557)

* [docs] Add missing word in argument description (sgl-project#14205)

* support piecewise cuda graph for Olmo models (sgl-project#14476)

* Enhance prefill PP node robustness (sgl-project#14494)

* DOC update nemo-skills in docs (sgl-project#14555)

Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* remove unecessary dual stream token threshold from the rest of models (qwen moe, kimi linear, etc.) (sgl-project#14337)

* feat(ci): add framework target to release-docker workflows (sgl-project#14559)

* Fix attention backend logic for Qwen3-Next on SM100 (sgl-project#14560)

* [FLA] Add explicit kernel arguments to kda.py for Kimi Linear support (sgl-project#14561)

* Add CUDA kernel size analysis tool for sgl-kernel optimization (sgl-project#14544)

* [DLLM] feat: Add threshold based parallel decoding support (sgl-project#14412)

Co-authored-by: Jinwei Yao <jinweiy@illinois.edu>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>

* Add unit-test-backend-8-gpu-b200 to rerun-stage command (sgl-project#14569)

* [apply][2/2] Fused qk_norm_rope for Qwen3-MoE (sgl-project#13998)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* Add Expert Parallelism (EP) support for kimi-k2-thinking (sgl-project#13725)

* Tiny remove wrong import from `python.sglang` (sgl-project#14577)

* Add small model test for spec v2 + dp + trtllm_mla (sgl-project#14576)

* [diffusion] cli: profiling utilities support (sgl-project#14185)

Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [NPU]LoRA: Adding Torch Native backend (sgl-project#14132)

* [BugFix] fix prefixcache performance and accuracy on ascend (sgl-project#13573)

* Fix FP8 KV Triton type issue and add regression test (sgl-project#14553)

* Rename TensorRT Model Optimizer to Model Optimizer (sgl-project#14455)

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

* [CI] Tiny speed up VLM CI (sgl-project#14517)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* [Minor] Temporarily skipping deepep large mtp test (sgl-project#14586)

* [model-gateway] extra accumulator and tool handler in oai router (sgl-project#14587)

* [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout (sgl-project#14588)

* [model-gateway] reorganize metrics, logging, and otel to its own module (sgl-project#14590)

* Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 (sgl-project#14141)

* [CI]Unblock and split spec v2+dp test (sgl-project#14551)

* [Tool Call] Fix DeepSeekV32Detector skipping functions with no params in streaming mode (sgl-project#14573)

* [feat] use cachebuffer to store mm feature to speedup hash (sgl-project#14386)

* [CI] Fix unit-test-backend-8-gpu-b200 running on every /rerun-stage (sgl-project#14591)

* [model-gateway] fix WASM memory limit per module (sgl-project#14600)

* Tiny fix missing policy decision recording (sgl-project#14605)

* Super tiny remove unneeded policy flag (sgl-project#14608)

* [model-gateway] refactor otel to be more efficient (sgl-project#14604)

* Super tiny remove unused select_worker_pair (sgl-project#14609)

* [model-gateway] fix WASM unbounded request/response body read vuln (sgl-project#14612)

* [2/2] Add rope kernel in sgl-kernel  (sgl-project#14452)

* [DLLM] Add initial cuda graph support (sgl-project#14203)

* Super tiny fix unused code in router (sgl-project#14618)

* [Glm46v] Bug fix for accuracy drop and unable to launch server (sgl-project#14585)

Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>

* Fix amd rope definition (sgl-project#14556)

* modify the sgl-kernel to be compatible with transformers 5.x. (sgl-project#14625)

* [Reasoning + Structured Output] make reasoning compatible with structured output (sgl-project#12551)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

* [diffusion] feat: add support for LoRA layers in transformer_2 within LoRAPipeline (sgl-project#14606)

* chore: bump sgl-kernel version to 0.3.19 (sgl-project#14632)

* [cpu] Implement all gather/reduce for arm64 cpu (sgl-project#12527)

* [diffusion] chore: further refine output resolution adjustment logic (sgl-project#14558)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix dp-aware incompatible with service-discovery (sgl-project#14629)

* update transformers package version to 5.0.0rc0 (sgl-project#14356)

* chore: bump sgl-kernel version to 0.3.19 (sgl-project#14649)

* chore: bump SGLang version to 0.5.6.post1 (sgl-project#14651)

* [AMD] change fused rms quant interface for aiter upgrade (sgl-project#14497)

* [model-gateway] reducing cpu overhead in various of places (sgl-project#14658)

* [model-gateway] reduce cpu overhead in grpc router (sgl-project#14663)

* [model-gateway] fix WASM arbitrary file read security vol (sgl-project#14664)

* vlm: Use fa3 as the default backend for qwen3 vl (sgl-project#14634)

* [model-gateway] Optimize memory usage in HTTP router (sgl-project#14667)

* fix: use .get() when accessing strict mem-check env variable (sgl-project#14657)

* improve default glm mtp setting (sgl-project#14457)

Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* Fix cache-aware router should pick min load instead of min tenant size (sgl-project#14650)

* Bump up diffusers to latest official release version (sgl-project#14670)

* [model-gateway] add OTEL integration to grpc router (sgl-project#14671)

* [CI] Increase max-parallel to 15 for high priority PRs (sgl-project#14675)

* [HiCache] fix condition check when use decode offload (sgl-project#14489)

* [RadixTree] Optimize the Time Complexity of Node Retrieval Operation from O(n*m) to O(n) (sgl-project#13334)

Signed-off-by: CLFutureX <chenyongqyl@163.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* Tiny support printing requests in bench_serving for observability (sgl-project#14652)

* Aiter fp8 kv cache (sgl-project#13147)

* [SMG]feat: implement TokenGuardBody for managing token return (sgl-project#14653)

* [NPU] chore: bump basic software version to 8.3.rc2 (sgl-project#14614)

* [CI] Unblock gb200 cutedsl test (sgl-project#14469)

* Add ffmpeg into sglang docker - required by transformers multimodal V… (sgl-project#14679)

* [Bugfix] Fix KeyError for Mistral-Large-3 rope_scaling config (sgl-project#14627)

* Tiny support sgl-router http response status code metrics (sgl-project#14689)

* [CI] Migrate Eagle 1-GPU tests to test/registered/ (sgl-project#14529)

* Revert "[Bug] fix not desired disable fused share experts caused by r… (sgl-project#14676)

* Add per-request decode tp size (sgl-project#14678)

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* [ci][smg] fix docker release ci and add it to pr test (sgl-project#14683)

* Tiny extract select_worker_min_load (sgl-project#14648)

* Fix dp-aware incompatible with completions and chat completions APIs (sgl-project#14647)

* [CI] Fix Llama 3.1 8B FP4 CI (sgl-project#14699)

* fix: make override DeepseekV2Model work (sgl-project#14707)

* chore: add code owners for deepseek_v2.py (sgl-project#14714)

* [CI] Move mistral large 3 basic to nightly (sgl-project#14622)

* fix the deepep 8 gpu unit test (sgl-project#14601)

* Add fuse_marlin_moe test to ci and add new ep test (sgl-project#14686)

* [Bugfix] Fix environ error in scheduler_runtime_checker_mixin.py (sgl-project#14461)

Signed-off-by: Kun(llfl) <i@imux.top>

* [Feat] Add received_time in serving_base (sgl-project#13432)

Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>

* fix: prevent HugginqFace access when SGLANG_USE_MODELSCOPE is enabled (sgl-project#12039)

* [Test] Skip STANDALONE speculative decoding tests for different hidden sizes (sgl-project#14733)

* [diffusion] feat: support comparing batch perf (sgl-project#14738)

Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Revert "[Feat] Add received_time in serving_base" (sgl-project#14743)

* [Model] Add PaddleOCR-VL Model Support (sgl-project#12953)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* fix rope parameter initialization error caused by transformers v5.0 update (sgl-project#14745)

* [model-gateway] optimize core modules (sgl-project#14751)

* [SMG] perf: optimize tokenizer for reduced CPU and memory overhead (sgl-project#14752)

* Add FP8 Blockwise GEMM Backend Flag `--fp8-gemm-backend` (sgl-project#14379)

* fix: checking if tokenizer is in cache before downloading from HF (sgl-project#14698)

* fix: making rate limit a warning instead of error  (sgl-project#14753)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* move multi-item scoring functions in tokenizer manager into a separate file (sgl-project#14740)

* Improve CI by trying a warmup before unit tests (sgl-project#14669)

* [Perf] Optimize radix tree for cache-aware load balancin (sgl-project#14758)

* [Feature] Add LoRA support for embedding layers (sgl-project#14177)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Beichen-Ma <bm685@cornell.edu>

* [model-gateway] release gateway 0.2.4 (sgl-project#14763)

* [ci]: Enable the new hf API (sgl-project#14687)

* Re-add the API serving timing metrics. (sgl-project#14744)

Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com>

* fix: adding rate limit warning at verify token permission stage (sgl-project#14756)

* Disable 8-gpu-b200 runner in PR tests (sgl-project#14768)

* [fix] Fix issues for in-flight weight updates (sgl-project#14064)

Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>

* [Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) (sgl-project#14759)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix: race condition between validation and download locks (sgl-project#14761)

* Fix VLM accuracy thresholds for nightly tests (sgl-project#14777)

* fix server args bug (sgl-project#14725)

* handling incomplete rope_scaling config ci after transformers upgrade (sgl-project#14784)

* fix b200 ci (sgl-project#14786)

* [RL] support weight reload for low-bit rollout (sgl-project#9650)

Co-authored-by: Hecate0821 <hec4te0821@gmail.com>
Co-authored-by: eternally-z <zzywzj@gmail.com>
Co-authored-by: Wilboludriver <wilbolu@outlook.com>
Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>

* fix: add missing logic for SGLANG_USE_MODELSCOPE variable (sgl-project#14794)

* fix b200 fa4 ci (sgl-project#14788)

* [diffusion] profile: early exit when enough steps are captured to reduce the size of the trace file (sgl-project#14803)

* [GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V (sgl-project#14720)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [diffusion] CI: Add LoRA support to diffusion server configuration and test cases (sgl-project#14697)

* Revert "fix: checking if tokenizer is in cache before downloading from HF" (sgl-project#14808)

* [diffusion] performance: refactor diffusion fuse qkv and apply to qwen-image (sgl-project#14793)

* [SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API Server (sgl-project#14770)

* [model-gateway] Dynamically Populate Tool Call Parser Choices (sgl-project#14807)

* Support HTTP response status code prometheus metrics (sgl-project#14710)

* Fix router keep nonzero metrics after worker is deleted (sgl-project#14819)

* Tiny fix incorrect worker removal command (sgl-project#14822)

* [NPU] bug fix for mtp and w4a8 (sgl-project#14806)

* [CI] fix UT success check in `test_eagle_infer_beta_dp_attention.py` (sgl-project#14831)

* Fix CI registry scan to only check test/registered directory (sgl-project#14812)

* [model-gateway] add anthropic message api spec (sgl-project#14834)

* [diffusion] doc: fix tiny typo in multimodal_gen/README.md (sgl-project#14830)

* [model-gateway] support customizing Prometheus duration buckets (sgl-project#14716)

* [model-gateway] support engine response http status statistics in router (sgl-project#14712)

* [CI] Reduce stage-b auto-partition from 4 to 2 (sgl-project#14769)

Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>

* Apply back moe_sum_reduce for fused_marlin_moe (sgl-project#14829)

* [diffusion] parallel: pad tokens for video models under sp (sgl-project#14833)

* [diffusion] CI: use unified sampling_params for CI (sgl-project#14045)

* [Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) (sgl-project#14837)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jue Wang <zjuwangjue@gmail.com>

* Revert transformers to 4.57.1 (sgl-project#14801)

* [model-gateway] Fix incompatible metric comparison in` PowerOfTwo` policy (sgl-project#14823)

* [bugfix] qwen25-VL support lora (sgl-project#14638)

* fix lora target all + csgmv backend (sgl-project#14796)

* [model-gateway] adds default implementations to RouterTrait in mod.rs (sgl-project#14841)

* [AMD] Add model to AMD nightly test (sgl-project#14442)

* Treat unittest SkipTest exception as pass instead of as failure (sgl-project#14847)

* [model-gateway] code clean up on oai router (sgl-project#14850)

* [model-gateway] fix import order in oai conversation (sgl-project#14851)

* fix fp8 gemm nightly CI (sgl-project#14844)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* fix: restrict cache validation behaviors to CI only (sgl-project#14849)

* Fix CUDA version handling in ci_install_deepep.sh (sgl-project#14854)

* Fix TestGLM41VPPAccuracy test flakiness (sgl-project#14848)

* Minor code style fix for dllm (sgl-project#14836)

* Enable TP for Mamba-based models (sgl-project#14811)

Signed-off-by: Roi Koren <roik@nvidia.com>

* [CI] Temp disable gb200 test (sgl-project#14865)

* Refactor Marlin MoeRunner (sgl-project#14554)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* [6/n] Fix `num_token_non_padded` computation in prefill (sgl-project#14313)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Runkai Tao <rt572@physics.rutger.edu>

* Remove myself to test CI gate issue (sgl-project#14871)

* fix: creating blobs only once for publish trace retries (sgl-project#14845)

* Move and update MindSpore docs, make it appear on the online documentation (sgl-project#14861)

Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix nightly vlm ci : restore original eval for requests without regex (sgl-project#14875)

* Only count limitations for previous runs  that reaches the test stages (sgl-project#14856)

* [CI][BUG] fix ib setup for disaggregation hicache test (sgl-project#14877)

Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com>

* [Fix] Remove unused import from test_disaggregation_hicache.py (sgl-project#14880)

* fix: adding temporary bypass for nightly tests (sgl-project#14876)

* Avoid deleting entire cache for missing shards (sgl-project#14754 follow-up) (sgl-project#14853)

* Tiny add more error info for bench_serving (sgl-project#14827)

* Tiny support range ratio in GSP in bench serving (sgl-project#14828)

* [diffusion] feat: enable torch compile to eliminate GPU bubble (sgl-project#13641)

Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [NPU] adapt dsv3.2 nsa prefill context parallel (sgl-project#14541)

* [diffusion] feat: support sageattn & sageattn3 backend (sgl-project#14878)

* dsv32 multistream opt

* clean code

* delete renormalize in topk

* dsv32 use batch_matmul_transpose in MTP

* modify comment

* Support dynamic w8a8

* dsv3 support ascend_fuseep

* rebase modify

---------

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Signed-off-by: CLFutureX <chenyongqyl@163.com>
Signed-off-by: Kun(llfl) <i@imux.top>
Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Signed-off-by: Roi Koren <roik@nvidia.com>
Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: key4ng <rukeyang@gmail.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Sam <lsam@nvidia.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Co-authored-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: zyksir <zhuyikai.zyk@gmail.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
Co-authored-by: dev <devnull@example.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: WenhaoZhang <42087078+niehen6174@users.noreply.github.com>
Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: roikoren755 <26850796+roikoren755@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: blahblah <28567807+Brain97@users.noreply.github.com>
Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: DefTruth <qiustudent_r@163.com>
Co-authored-by: Hudson Xing <77495133+harvenstar@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Tony Lu <tonylu@linux.alibaba.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Wenyi Xu <wenyixu101@gmail.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: blzheng <beilei.zheng@intel.com>
Co-authored-by: Rain Jiang <96632942+rainj-me@users.noreply.github.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Feng Su <sufeng@linux.alibaba.com>
Co-authored-by: niehen6174 <nihen6174@gmail.com>
Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com>
Co-authored-by: harrisonlimh <97203667+harrisonlimh@users.noreply.github.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: almaslof <187766901+almaslof@users.noreply.github.com>
Co-authored-by: Rain H <2510421000@qq.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Chen1022 <jincong.cjc@ant-intl.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: Jinwei Yao <jinweiy@illinois.edu>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: Vladimir Serov <serov.vladimir.zser@gmail.com>
Co-authored-by: khalilzhk <khalilzhk@gmail.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: wentx <3843588+momaek@users.noreply.github.com>
Co-authored-by: Nicholas <45984215+liusy58@users.noreply.github.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Muqi Li <muqi1029@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Prozac614 <dwt614707404@163.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: Jimmy <29097382+jimmy-evo@users.noreply.github.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Co-authored-by: kun-llfl <i@imux.top>
Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Co-authored-by: yrk111222 <2493404415@qq.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Douglas Yang <dyang@college.harvard.edu>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: Beichen-Ma <bm685@cornell.edu>
Co-authored-by: MingxuZh <109504044+MingxuZh@users.noreply.github.com>
Co-authored-by: ShawnY112358 <61113840+ShawnY112358@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>
Co-authored-by: Hecate0821 <hec4te0821@gmail.com>
Co-authored-by: eternally-z <zzywzj@gmail.com>
Co-authored-by: Wilboludriver <wilbolu@outlook.com>
Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: liupeng374 <liupeng374@huawei.com>
Co-authored-by: Li Jinliang <975761915@qq.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Jue Wang <zjuwangjue@gmail.com>
Co-authored-by: Praneth Paruchuri <pranethparuchuri@gmail.com>
Co-authored-by: Siyuan Chen <41201609+SYChen123@users.noreply.github.com>
Co-authored-by: michael-amd <Michael.Zhang@amd.com>
Co-authored-by: Trang Do <200224632+trangdough@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: yuchengz816-bot <yuchengz816@gmail.com>
Co-authored-by: Runkai Tao <rt572@physics.rutger.edu>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Tiance Wang <wangtiance@gmail.com>
Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com>
Co-authored-by: liupeng374 <782420244@qq.com>
ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Jan 16, 2026
* [model-gateway] extract conversation out of oai router (sgl-project#14440)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 (sgl-project#14325)

* [model-gateway] move oai header util to router header util (sgl-project#14441)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [FIX] trtllm-moe-fp4-renorm for Qwen series models (sgl-project#14350)

* add doc for quantized kv cache (sgl-project#14348)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>

* fix: Correct environment variable syntax in docker-compose configuration (sgl-project#8287)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

* [model-gateway] move all responses api event from oai to proto (sgl-project#14446)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [model-gateway] add mistral 3 image processor (sgl-project#14445)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* [model-gateway] grpc to leverage event type (sgl-project#14450)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* ministral3 (sgl-project#14251)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>

* [Bug] fix not desired disable fused share experts caused by rocm logic (sgl-project#14432)

* Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE (sgl-project#14421)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* [diffusion] improve: further optimize model load (sgl-project#13836)

* Add CI permissions for user 'yushengsu-thu' (sgl-project#14468)

* [ez] Fix typing (sgl-project#14473)

* Add AMD stage support to /rerun-stage command and fix related bugs (sgl-project#14463)

* Add YAMY1234 to CI Permission (sgl-project#14475)

* clean up gemlite usage (sgl-project#14444)

* [diffusion] chore: further improve model searching logic (sgl-project#14484)

* [diffusion] fix: fix bug about pin memory when offloading (sgl-project#14472)

* [diffusion] cli: add argument --adjust-frames and --override-protected-fields (sgl-project#13996)

Co-authored-by: dev <devnull@example.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* dockerfile: add runtime stage + ubuntu 24.04 (sgl-project#13861)

* [diffusion] fix: fix CLIP text encoder attention mask not used (sgl-project#14364)

Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Enable RadixCache for Mamba2 models (sgl-project#13584)

* [diffusion] fix: Fix profiler trace missing Python stack in diffusion pipeline (sgl-project#14499)

* support GLM-V vision model dp  (sgl-project#14097)

* [misc] add model arch and type to server info and use it for harmony (sgl-project#14456)

* Add Mistral Large 3 Eagle Support (sgl-project#14466)

Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

* Add Mistral Large 3 to nightly CI tests (sgl-project#14459)

* [diffusion] chore: set allowing overriding protected fields of sampling params as default behavior (sgl-project#14471)

* [model-gateway] move conversation to first class routing (sgl-project#14506)

Co-authored-by: key4ng <rukeyang@gmail.com>

* [Spec] Mamba2 support in target models (sgl-project#13434)

* [diffusion] feat: support cache-dit integration (sgl-project#14234)

Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: DefTruth <qiustudent_r@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-project#14093)

Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

* [model-gateway] Add WASM support for middleware (sgl-project#12471)

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>

* [model-gateway] reorganized conversation handler (sgl-project#14507)

Co-authored-by: key4ng <rukeyang@gmail.com>

* tiny remove deprecated endpoint call (sgl-project#13607)

* [model-gateway] fix server info comment (sgl-project#14508)

* Add Mistral Large 3 basic test to PR CI (sgl-project#14460)

* Fix removing worker will make it healthy forever in prometheus metrics (sgl-project#14420)

* [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT (sgl-project#14405)

* [model-gateway] change sgl-router to sgl-model-gateway (sgl-project#14312)

* [model-gateway] fix left over sgl-router names to sgl-model-gateway (sgl-project#14512)

* [model-gateway] fix logs in smg workflow (sgl-project#14513)

* [model-gateway] fix left over sgl-router names in wasm (sgl-project#14514)

* [model-gateway] fix code owner for wasm (sgl-project#14516)

* chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14427)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>

* Tiny use trtllm_mha as default when possible (sgl-project#14291)

* [Docs] Add /rerun-stage command to contribution guide (sgl-project#14521)

* Fix safetensors validation to catch corruption after download (sgl-project#14465)

* [CODEOWNER] update codeowner for qwen3-next related (sgl-project#14522)

* fix: fix rmsnorm -> layernorm in qwen3 omni (sgl-project#11791)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* [diffusion] chore: temporarily upgrade diffusers to make Z-image compatible with Cache-DiT (sgl-project#14530)

* [bug] fix notebook to include new keys from model_info (sgl-project#14528)

* Revise DP Multi-Modal Encoder Document (sgl-project#14290)

* [CPU] add mamba fla kernels for Qwen3-next (sgl-project#12324)

* Revert "tiny remove deprecated endpoint call" (sgl-project#14533)

* support mtp with deepseek r1 nvfp4 model (sgl-project#13115)

Co-authored-by: Trevor Morris <tmorris@nvidia.com>

* [diffusion] refactor: simplify sampling params' override logic (sgl-project#14539)

* [diffusion] perf: add QKV fusion optimization for Flux models (sgl-project#14505)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) (sgl-project#13897)

* [diffusion] lora: fix LoRA dtype handling and weight attribute access for z-image model (sgl-project#14543)

Co-authored-by: niehen6174 <nihen6174@gmail.com>

* fix "GrammarMatcher has terminated after accepting the stop token, but is trying to find the next token mask" when both reasoning and spec are enabled (sgl-project#14464)

* [1/n] Fix hanging during DeepGemm Warmup (sgl-project#14493)

* [Bug fix] Add /model_info endpoint to mini_lb (sgl-project#14535)

* [Qwen3-next] remove heuristics and add radix cache kl test (sgl-project#14520)

* [Misc]Register and refactor some environs for dpsk-fp4 and DeepEp (sgl-project#14538)

* chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14518)

* Update CI_PERMISSIONS.json (sgl-project#14552)

* Update DeepSeek V3 docs to use B200 (sgl-project#14447)

* [Doc] Add short explanation on page size (sgl-project#14557)

* [docs] Add missing word in argument description (sgl-project#14205)

* support piecewise cuda graph for Olmo models (sgl-project#14476)

* Enhance prefill PP node robustness (sgl-project#14494)

* DOC update nemo-skills in docs (sgl-project#14555)

Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* remove unecessary dual stream token threshold from the rest of models (qwen moe, kimi linear, etc.) (sgl-project#14337)

* feat(ci): add framework target to release-docker workflows (sgl-project#14559)

* Fix attention backend logic for Qwen3-Next on SM100 (sgl-project#14560)

* [FLA] Add explicit kernel arguments to kda.py for Kimi Linear support (sgl-project#14561)

* Add CUDA kernel size analysis tool for sgl-kernel optimization (sgl-project#14544)

* [DLLM] feat: Add threshold based parallel decoding support (sgl-project#14412)

Co-authored-by: Jinwei Yao <jinweiy@illinois.edu>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>

* Add unit-test-backend-8-gpu-b200 to rerun-stage command (sgl-project#14569)

* [apply][2/2] Fused qk_norm_rope for Qwen3-MoE (sgl-project#13998)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* Add Expert Parallelism (EP) support for kimi-k2-thinking (sgl-project#13725)

* Tiny remove wrong import from `python.sglang` (sgl-project#14577)

* Add small model test for spec v2 + dp + trtllm_mla (sgl-project#14576)

* [diffusion] cli: profiling utilities support (sgl-project#14185)

Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [NPU]LoRA: Adding Torch Native backend (sgl-project#14132)

* [BugFix] fix prefixcache performance and accuracy on ascend (sgl-project#13573)

* Fix FP8 KV Triton type issue and add regression test (sgl-project#14553)

* Rename TensorRT Model Optimizer to Model Optimizer (sgl-project#14455)

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

* [CI] Tiny speed up VLM CI (sgl-project#14517)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* [Minor] Temporarily skipping deepep large mtp test (sgl-project#14586)

* [model-gateway] extra accumulator and tool handler in oai router (sgl-project#14587)

* [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout (sgl-project#14588)

* [model-gateway] reorganize metrics, logging, and otel to its own module (sgl-project#14590)

* Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 (sgl-project#14141)

* [CI]Unblock and split spec v2+dp test (sgl-project#14551)

* [Tool Call] Fix DeepSeekV32Detector skipping functions with no params in streaming mode (sgl-project#14573)

* [feat] use cachebuffer to store mm feature to speedup hash (sgl-project#14386)

* [CI] Fix unit-test-backend-8-gpu-b200 running on every /rerun-stage (sgl-project#14591)

* [model-gateway] fix WASM memory limit per module (sgl-project#14600)

* Tiny fix missing policy decision recording (sgl-project#14605)

* Super tiny remove unneeded policy flag (sgl-project#14608)

* [model-gateway] refactor otel to be more efficient (sgl-project#14604)

* Super tiny remove unused select_worker_pair (sgl-project#14609)

* [model-gateway] fix WASM unbounded request/response body read vuln (sgl-project#14612)

* [2/2] Add rope kernel in sgl-kernel  (sgl-project#14452)

* [DLLM] Add initial cuda graph support (sgl-project#14203)

* Super tiny fix unused code in router (sgl-project#14618)

* [Glm46v] Bug fix for accuracy drop and unable to launch server (sgl-project#14585)

Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>

* Fix amd rope definition (sgl-project#14556)

* modify the sgl-kernel to be compatible with transformers 5.x. (sgl-project#14625)

* [Reasoning + Structured Output] make reasoning compatible with structured output (sgl-project#12551)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

* [diffusion] feat: add support for LoRA layers in transformer_2 within LoRAPipeline (sgl-project#14606)

* chore: bump sgl-kernel version to 0.3.19 (sgl-project#14632)

* [cpu] Implement all gather/reduce for arm64 cpu (sgl-project#12527)

* [diffusion] chore: further refine output resolution adjustment logic (sgl-project#14558)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix dp-aware incompatible with service-discovery (sgl-project#14629)

* update transformers package version to 5.0.0rc0 (sgl-project#14356)

* chore: bump sgl-kernel version to 0.3.19 (sgl-project#14649)

* chore: bump SGLang version to 0.5.6.post1 (sgl-project#14651)

* [AMD] change fused rms quant interface for aiter upgrade (sgl-project#14497)

* [model-gateway] reducing cpu overhead in various of places (sgl-project#14658)

* [model-gateway] reduce cpu overhead in grpc router (sgl-project#14663)

* [model-gateway] fix WASM arbitrary file read security vol (sgl-project#14664)

* vlm: Use fa3 as the default backend for qwen3 vl (sgl-project#14634)

* [model-gateway] Optimize memory usage in HTTP router (sgl-project#14667)

* fix: use .get() when accessing strict mem-check env variable (sgl-project#14657)

* improve default glm mtp setting (sgl-project#14457)

Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* Fix cache-aware router should pick min load instead of min tenant size (sgl-project#14650)

* Bump up diffusers to latest official release version (sgl-project#14670)

* [model-gateway] add OTEL integration to grpc router (sgl-project#14671)

* [CI] Increase max-parallel to 15 for high priority PRs (sgl-project#14675)

* [HiCache] fix condition check when use decode offload (sgl-project#14489)

* [RadixTree] Optimize the Time Complexity of Node Retrieval Operation from O(n*m) to O(n) (sgl-project#13334)

Signed-off-by: CLFutureX <chenyongqyl@163.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* Tiny support printing requests in bench_serving for observability (sgl-project#14652)

* Aiter fp8 kv cache (sgl-project#13147)

* [SMG] feat: implement TokenGuardBody for managing token return (sgl-project#14653)

* [NPU] chore: bump basic software version to 8.3.rc2 (sgl-project#14614)

* [CI] Unblock gb200 cutedsl test (sgl-project#14469)

* Add ffmpeg into sglang docker - required by transformers multimodal V… (sgl-project#14679)

* [Bugfix] Fix KeyError for Mistral-Large-3 rope_scaling config (sgl-project#14627)

* Tiny support sgl-router http response status code metrics (sgl-project#14689)

* [CI] Migrate Eagle 1-GPU tests to test/registered/ (sgl-project#14529)

* Revert "[Bug] fix not desired disable fused share experts caused by r… (sgl-project#14676)

* Add per-request decode tp size (sgl-project#14678)

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* [ci][smg] fix docker release ci and add it to pr test (sgl-project#14683)

* Tiny extract select_worker_min_load (sgl-project#14648)

* Fix dp-aware incompatible with completions and chat completions APIs (sgl-project#14647)

* [CI] Fix Llama 3.1 8B FP4 CI (sgl-project#14699)

* fix: make override DeepseekV2Model work (sgl-project#14707)

* chore: add code owners for deepseek_v2.py (sgl-project#14714)

* [CI] Move mistral large 3 basic to nightly (sgl-project#14622)

* fix the deepep 8 gpu unit test (sgl-project#14601)

* Add fuse_marlin_moe test to ci and add new ep test (sgl-project#14686)

* [Bugfix] Fix environ error in scheduler_runtime_checker_mixin.py (sgl-project#14461)

Signed-off-by: Kun(llfl) <i@imux.top>

* [Feat] Add received_time in serving_base (sgl-project#13432)

Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>

* fix: prevent HuggingFace access when SGLANG_USE_MODELSCOPE is enabled (sgl-project#12039)

* [Test] Skip STANDALONE speculative decoding tests for different hidden sizes (sgl-project#14733)

* [diffusion] feat: support comparing batch perf (sgl-project#14738)

Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* Revert "[Feat] Add received_time in serving_base" (sgl-project#14743)

* [Model] Add PaddleOCR-VL Model Support (sgl-project#12953)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* fix rope parameter initialization error caused by transformers v5.0 update (sgl-project#14745)

* [model-gateway] optimize core modules (sgl-project#14751)

* [SMG] perf: optimize tokenizer for reduced CPU and memory overhead (sgl-project#14752)

* Add FP8 Blockwise GEMM Backend Flag `--fp8-gemm-backend` (sgl-project#14379)

* fix: checking if tokenizer is in cache before downloading from HF (sgl-project#14698)

* fix: making rate limit a warning instead of error (sgl-project#14753)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* move multi-item scoring functions in tokenizer manager into a separate file (sgl-project#14740)

* Improve CI by trying a warmup before unit tests (sgl-project#14669)

* [Perf] Optimize radix tree for cache-aware load balancing (sgl-project#14758)

* [Feature] Add LoRA support for embedding layers (sgl-project#14177)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Beichen-Ma <bm685@cornell.edu>

* [model-gateway] release gateway 0.2.4 (sgl-project#14763)

* [ci]: Enable the new hf API (sgl-project#14687)

* Re-add the API serving timing metrics. (sgl-project#14744)

Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com>

* fix: adding rate limit warning at verify token permission stage (sgl-project#14756)

* Disable 8-gpu-b200 runner in PR tests (sgl-project#14768)

* [fix] Fix issues for in-flight weight updates (sgl-project#14064)

Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>

* [Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) (sgl-project#14759)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix: race condition between validation and download locks (sgl-project#14761)

* Fix VLM accuracy thresholds for nightly tests (sgl-project#14777)

* fix server args bug (sgl-project#14725)

* handling incomplete rope_scaling config ci after transformers upgrade (sgl-project#14784)

* fix b200 ci (sgl-project#14786)

* [RL] support weight reload for low-bit rollout (sgl-project#9650)

Co-authored-by: Hecate0821 <hec4te0821@gmail.com>
Co-authored-by: eternally-z <zzywzj@gmail.com>
Co-authored-by: Wilboludriver <wilbolu@outlook.com>
Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>

* fix: add missing logic for SGLANG_USE_MODELSCOPE variable (sgl-project#14794)

* fix b200 fa4 ci (sgl-project#14788)

* [diffusion] profile: early exit when enough steps are captured to reduce the size of the trace file (sgl-project#14803)

* [GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V (sgl-project#14720)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [diffusion] CI: Add LoRA support to diffusion server configuration and test cases (sgl-project#14697)

* Revert "fix: checking if tokenizer is in cache before downloading from HF" (sgl-project#14808)

* [diffusion] performance: refactor diffusion fuse qkv and apply to qwen-image (sgl-project#14793)

* [SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API Server (sgl-project#14770)

* [model-gateway] Dynamically Populate Tool Call Parser Choices (sgl-project#14807)

* Support HTTP response status code prometheus metrics (sgl-project#14710)

* Fix router keep nonzero metrics after worker is deleted (sgl-project#14819)

* Tiny fix incorrect worker removal command (sgl-project#14822)

* [NPU] bug fix for mtp and w4a8 (sgl-project#14806)

* [CI] fix UT success check in `test_eagle_infer_beta_dp_attention.py` (sgl-project#14831)

* Fix CI registry scan to only check test/registered directory (sgl-project#14812)

* [model-gateway] add anthropic message api spec (sgl-project#14834)

* [diffusion] doc: fix tiny typo in multimodal_gen/README.md (sgl-project#14830)

* [model-gateway] support customizing Prometheus duration buckets (sgl-project#14716)

* [model-gateway] support engine response http status statistics in router (sgl-project#14712)

* [CI] Reduce stage-b auto-partition from 4 to 2 (sgl-project#14769)

Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>

* Apply back moe_sum_reduce for fused_marlin_moe (sgl-project#14829)

* [diffusion] parallel: pad tokens for video models under sp (sgl-project#14833)

* [diffusion] CI: use unified sampling_params for CI (sgl-project#14045)

* [Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) (sgl-project#14837)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jue Wang <zjuwangjue@gmail.com>

* Revert transformers to 4.57.1 (sgl-project#14801)

* [model-gateway] Fix incompatible metric comparison in `PowerOfTwo` policy (sgl-project#14823)

* [bugfix] qwen25-VL support lora (sgl-project#14638)

* fix lora target all + csgmv backend (sgl-project#14796)

* [model-gateway] adds default implementations to RouterTrait in mod.rs (sgl-project#14841)

* [AMD] Add model to AMD nightly test (sgl-project#14442)

* Treat unittest SkipTest exception as pass instead of as failure (sgl-project#14847)

* [model-gateway] code clean up on oai router (sgl-project#14850)

* [model-gateway] fix import order in oai conversation (sgl-project#14851)

* fix fp8 gemm nightly CI (sgl-project#14844)

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* fix: restrict cache validation behaviors to CI only (sgl-project#14849)

* Fix CUDA version handling in ci_install_deepep.sh (sgl-project#14854)

* Fix TestGLM41VPPAccuracy test flakiness (sgl-project#14848)

* Minor code style fix for dllm (sgl-project#14836)

* Enable TP for Mamba-based models (sgl-project#14811)

Signed-off-by: Roi Koren <roik@nvidia.com>

* [CI] Temp disable gb200 test (sgl-project#14865)

* Refactor Marlin MoeRunner (sgl-project#14554)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* [6/n] Fix `num_token_non_padded` computation in prefill (sgl-project#14313)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Runkai Tao <rt572@physics.rutger.edu>

* Remove myself to test CI gate issue (sgl-project#14871)

* fix: creating blobs only once for publish trace retries (sgl-project#14845)

* Move and update MindSpore docs, make it appear on the online documentation (sgl-project#14861)

Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix nightly vlm ci: restore original eval for requests without regex (sgl-project#14875)

* Only count limitations for previous runs that reach the test stages (sgl-project#14856)

* [CI][BUG] fix ib setup for disaggregation hicache test (sgl-project#14877)

Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com>

* [Fix] Remove unused import from test_disaggregation_hicache.py (sgl-project#14880)

* fix: adding temporary bypass for nightly tests (sgl-project#14876)

* Avoid deleting entire cache for missing shards (sgl-project#14754 follow-up) (sgl-project#14853)

* Tiny add more error info for bench_serving (sgl-project#14827)

* Tiny support range ratio in GSP in bench serving (sgl-project#14828)

* [diffusion] feat: enable torch compile to eliminate GPU bubble (sgl-project#13641)

Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [NPU] adapt dsv3.2 nsa prefill context parallel (sgl-project#14541)

* [diffusion] feat: support sageattn & sageattn3 backend (sgl-project#14878)

* dsv32 multistream opt

* clean code

* delete renormalize in topk

* dsv32 use batch_matmul_transpose in MTP

* modify comment

* Support dynamic w8a8

* dsv3 support ascend_fuseep

* rebase modify

---------

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Signed-off-by: CLFutureX <chenyongqyl@163.com>
Signed-off-by: Kun(llfl) <i@imux.top>
Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Signed-off-by: Roi Koren <roik@nvidia.com>
Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: key4ng <rukeyang@gmail.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Sam <lsam@nvidia.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Co-authored-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: zyksir <zhuyikai.zyk@gmail.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
Co-authored-by: dev <devnull@example.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: WenhaoZhang <42087078+niehen6174@users.noreply.github.com>
Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: roikoren755 <26850796+roikoren755@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: blahblah <28567807+Brain97@users.noreply.github.com>
Co-authored-by: shuxiguo <shuxiguo@meituan.com>
Co-authored-by: DefTruth <qiustudent_r@163.com>
Co-authored-by: Hudson Xing <77495133+harvenstar@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Tony Lu <tonylu@linux.alibaba.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Wenyi Xu <wenyixu101@gmail.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: blzheng <beilei.zheng@intel.com>
Co-authored-by: Rain Jiang <96632942+rainj-me@users.noreply.github.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Feng Su <sufeng@linux.alibaba.com>
Co-authored-by: niehen6174 <nihen6174@gmail.com>
Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com>
Co-authored-by: harrisonlimh <97203667+harrisonlimh@users.noreply.github.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: almaslof <187766901+almaslof@users.noreply.github.com>
Co-authored-by: Rain H <2510421000@qq.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Chen1022 <jincong.cjc@ant-intl.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Co-authored-by: Jinwei Yao <jinweiy@illinois.edu>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: Vladimir Serov <serov.vladimir.zser@gmail.com>
Co-authored-by: khalilzhk <khalilzhk@gmail.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: wentx <3843588+momaek@users.noreply.github.com>
Co-authored-by: Nicholas <45984215+liusy58@users.noreply.github.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Muqi Li <muqi1029@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Prozac614 <dwt614707404@163.com>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: Jimmy <29097382+jimmy-evo@users.noreply.github.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Co-authored-by: kun-llfl <i@imux.top>
Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com>
Co-authored-by: yrk111222 <2493404415@qq.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Douglas Yang <dyang@college.harvard.edu>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: Beichen-Ma <bm685@cornell.edu>
Co-authored-by: MingxuZh <109504044+MingxuZh@users.noreply.github.com>
Co-authored-by: ShawnY112358 <61113840+ShawnY112358@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>
Co-authored-by: Hecate0821 <hec4te0821@gmail.com>
Co-authored-by: eternally-z <zzywzj@gmail.com>
Co-authored-by: Wilboludriver <wilbolu@outlook.com>
Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: liupeng374 <liupeng374@huawei.com>
Co-authored-by: Li Jinliang <975761915@qq.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Jue Wang <zjuwangjue@gmail.com>
Co-authored-by: Praneth Paruchuri <pranethparuchuri@gmail.com>
Co-authored-by: Siyuan Chen <41201609+SYChen123@users.noreply.github.com>
Co-authored-by: michael-amd <Michael.Zhang@amd.com>
Co-authored-by: Trang Do <200224632+trangdough@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: yuchengz816-bot <yuchengz816@gmail.com>
Co-authored-by: Runkai Tao <rt572@physics.rutger.edu>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Tiance Wang <wangtiance@gmail.com>
Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com>
Co-authored-by: liupeng374 <782420244@qq.com>