
[router] Refactor OpenAI router: split monolithic file and move location#11359

Merged
slin1237 merged 11 commits into sgl-project:main from key4ng:oai-router-refactor
Oct 9, 2025

Conversation

@key4ng
Collaborator

@key4ng key4ng commented Oct 9, 2025

Motivation

The OpenAI router implementation had grown to 4,547 lines in a single file (http/openai_router.rs), making it difficult to maintain and navigate.

Modifications

  • Split src/routers/http/openai_router.rs (4,547 lines) into modular structure:
    • src/routers/openai/router.rs - Main coordinator (337 lines)
    • src/routers/openai/conversations.rs - Conversation CRUD and persistence (614 lines)
    • src/routers/openai/responses.rs - Response handling utilities (301 lines)
    • src/routers/openai/streaming.rs - Streaming response handler (1,435 lines)
    • src/routers/openai/mcp.rs - MCP tool integration (1,860 lines)
    • src/routers/openai/mod.rs - Module declarations
  • Updated imports across factory.rs, mod.rs, and test files
  • Deleted old http/openai_router.rs file
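The split above can be sketched in miniature. This is a hypothetical, self-contained example (inline modules stand in for the separate `router.rs` and `conversations.rs` files, and the names inside are illustrative, not the PR's actual API): the key move is re-exporting the main type from the parent module so callers such as `factory.rs` keep one stable import path after the refactor.

```rust
// Hypothetical sketch of a monolith-to-modules split with a stable public path.
mod openai {
    pub mod router {
        /// Main coordinator; delegates specialized work to sibling modules.
        pub struct OpenAIRouter {
            pub base_url: String,
        }

        impl OpenAIRouter {
            pub fn new(base_url: &str) -> Self {
                Self { base_url: base_url.to_string() }
            }
        }
    }

    pub mod conversations {
        /// Conversation CRUD and persistence live in their own module.
        pub fn conversation_key(id: &str) -> String {
            format!("conv:{id}")
        }
    }

    // Re-export so `use openai::OpenAIRouter` keeps working after the split;
    // callers never need to know which file the type moved into.
    pub use self::router::OpenAIRouter;
}

fn main() {
    let r = openai::OpenAIRouter::new("https://api.example.com");
    assert_eq!(r.base_url, "https://api.example.com");
    assert_eq!(openai::conversations::conversation_key("abc"), "conv:abc");
    println!("re-export path works");
}
```

The re-export is what keeps a refactor like this mechanical: only `mod` declarations and `use` paths change, not call sites.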

Accuracy Tests

Re-tested all features exercised in previous PRs.

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @key4ng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the OpenAI router by breaking down a large, monolithic 4,547-line file into a modular structure across several new Rust files within src/routers/openai/. The changes aim to improve maintainability and navigability by separating concerns such as conversation management, response handling, streaming logic, and MCP tool integration into dedicated modules. Corresponding import paths and module declarations were updated throughout the router's codebase.

Highlights

  • Intent of the PR: This pull request refactors the OpenAI router implementation to improve maintainability and navigability. The previous implementation was a monolithic file (http/openai_router.rs) spanning 4,547 lines, which made it difficult to manage.
  • Key Changes: The primary change involves splitting the large http/openai_router.rs file into a modular structure within a new src/routers/openai/ directory. This involved creating several new Rust modules, updating import paths across the codebase, and deleting the original file.
  • Detailed Breakdown of New Modules: The original openai_router.rs file was broken down into the following specialized modules:
  • src/routers/openai/router.rs (337 lines): Serves as the main coordinator for the OpenAI router.
  • src/routers/openai/conversations.rs (614 lines): Handles Conversation CRUD (Create, Read, Update, Delete) operations and persistence.
  • src/routers/openai/responses.rs (301 lines): Contains utilities for response handling, storage, and patching.
  • src/routers/openai/streaming.rs (1,435 lines): Manages streaming response logic, including SSE parsing, accumulation, and event transformation.
  • src/routers/openai/mcp.rs (1,860 lines): Integrates MCP (Model Context Protocol) tools, managing tool loops, execution, and related event generation.
  • src/routers/openai/mod.rs: Declares the new openai module and its sub-modules.
  • src/routers/openai/utils.rs: Provides utility types and constants for the OpenAI router.
  • Impact: This refactoring significantly improves the code organization, making the OpenAI router easier to understand, maintain, and extend. It isolates concerns into dedicated modules, enhancing code clarity and reducing complexity.
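The streaming module's core job described above (SSE parsing and event handling) reduces to a small pure function at its simplest. This is a hedged sketch, not the PR's actual code: it only shows the standard SSE framing that any such handler must deal with, namely `data:`-prefixed payload lines and the `[DONE]` stream terminator used by OpenAI-style APIs.

```rust
// Minimal SSE line parser sketch (illustrative, not the router's real code):
// returns the JSON payload of a `data:` line, or None for comments,
// other fields, and the `[DONE]` end-of-stream marker.
fn parse_sse_data(line: &str) -> Option<&str> {
    // SSE field lines look like `data: <payload>`; anything else
    // (comments starting with `:`, `event:` lines, blanks) is skipped here.
    let payload = line.strip_prefix("data:")?.trim_start();
    if payload == "[DONE]" {
        None // stream terminator, not a payload
    } else {
        Some(payload)
    }
}

fn main() {
    assert_eq!(parse_sse_data("data: {\"id\":1}"), Some("{\"id\":1}"));
    assert_eq!(parse_sse_data("data: [DONE]"), None);
    assert_eq!(parse_sse_data(": keep-alive comment"), None);
    println!("sse parsing ok");
}
```

A real streaming handler layers accumulation and event transformation on top of this framing step, which is why it earns a dedicated module.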

@key4ng key4ng marked this pull request as draft October 9, 2025 02:56
@key4ng key4ng changed the title [router] Refactor OpenAI router: split monolithic file and restructure [router] Refactor OpenAI router: split monolithic file and restructure [wip] Oct 9, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This is a great refactoring effort that significantly improves the structure and maintainability of the OpenAI router. Splitting the monolithic file into logical modules makes the codebase much easier to navigate and understand. The new modules for conversations, MCP, responses, and streaming are well-defined.

I've identified a few areas for improvement, mainly around code duplication and minor correctness issues. Addressing these will further enhance the quality of this refactoring. Overall, excellent work!

Comment thread sgl-router/src/routers/openai/mcp.rs Outdated
Comment thread sgl-router/src/routers/openai/responses.rs Outdated
Comment thread sgl-router/src/routers/openai/conversations.rs
Comment thread sgl-router/src/routers/openai/mcp.rs
Comment thread sgl-router/src/routers/openai/router.rs Outdated
Comment thread sgl-router/src/routers/openai/router.rs Outdated
Comment thread sgl-router/src/routers/openai/router.rs Outdated
Comment thread sgl-router/src/routers/openai/streaming.rs
@key4ng key4ng marked this pull request as ready for review October 9, 2025 03:31
@key4ng key4ng changed the title [router] Refactor OpenAI router: split monolithic file and restructure [wip] [router] Refactor OpenAI router: split monolithic file and move it under routers Oct 9, 2025
@key4ng key4ng changed the title [router] Refactor OpenAI router: split monolithic file and move it under routers [router] Refactor OpenAI router: split monolithic file and move location Oct 9, 2025
@key4ng
Collaborator Author

key4ng commented Oct 9, 2025

Hi @slin1237, it's ready for review.

@slin1237 slin1237 merged commit 84768d1 into sgl-project:main Oct 9, 2025
35 checks passed
ch-tiger1 pushed a commit to ch-tiger1/sglang that referenced this pull request Oct 9, 2025
BraveY pushed a commit to openanolis/sglang that referenced this pull request Oct 22, 2025
Merge branch sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main
https://code.alipay.com/Theta/SGLang/pull_requests/342?tab=diff

Reviewed-by: 苏墨 <xuyongfei.xyf@antgroup.com>


* [router] minor code clean up in server startup (sgl-project#10470)
* [bugfix] fix typo (sgl-project#10471)
* [PD metrics] Add latency Histogram metrics of each stage for generate requests (sgl-project#8710)
* [CI] Fix runner for sgl-kernel (sgl-project#9887)
* fix(internvl): fix accuracy issue of normalization (sgl-project#10375)
* fix: gpt-oss streaming dropping normal content when tools are provided but not used (sgl-project#9657)
* model: support solar (sgl-project#8189)
* fix: resolve sgl-kernel ut (sgl-project#10476)
* [1/2] Speed up trtllm_mla attention backend (>10% e2e) (sgl-project#10473)
* Fix `--dataset-path` in `bench_one_batch_server` (sgl-project#10475)
* [Env] minimal version for organizing envs (sgl-project#10479)
* chore: bump v0.3.10 sgl-kernel (sgl-project#10478)
* [router] multi model registration fix (sgl-project#10481)
* [2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (sgl-project#10286)
* [Auto Sync] Update registry.py (20250915) (sgl-project#10484)
* [router] fix worker registration in multi model mode (sgl-project#10486)
* fix crash of DeepSeek-V3 update_weights_from_disk (sgl-project#8863)
* Temporay work-around for rocm 7.0.0 alpha with enabling data-parallel issue (sgl-project#10434)
* [Hicache] Evaluate Per-Round Metrics in Multiturn Bench (sgl-project#10203)
* [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints (sgl-project#10336)
* Add Logprobs unit test with a loose threshold (sgl-project#10230)
* [router] add router db connector for responses api (sgl-project#10487)
* Remove wrong imports `from sglang.python` (sgl-project#10493)
* [router] fix router manager and router init in server (sgl-project#10499)
* Cache the result of `is_blackwell` platform check (sgl-project#10498)
* feat: update support for qwen3next model (sgl-project#10466)
* Minor fix lint introduced by sgl-project#10466 (sgl-project#10507)
* chore: upgrade sgl-kernel 0.3.10 (sgl-project#10500)
* Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (sgl-project#10491)
* Fix CI when sgl-kernel is changed but srt is not changed (sgl-project#10515)
* Support sgl-router parallel_batch in bench_one_batch_server (sgl-project#10506)
* [CPU] fix CPU backend sel. issue for Llama4 (sgl-project#10511)
* adjust import setuptools_rust (sgl-project#10524)
* Fix formatting in long code blocks (sgl-project#10528)
* skip vision_model for lora (sgl-project#10530)
* [2/2] Speed up trtllm_mla attention backend (sgl-project#10474)
* support using fa4 on deepseek on blackwell (sgl-project#9928)
* [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) (sgl-project#10494)
* [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) (sgl-project#10538)
* feat: add priority based scheduling with priority based request acceptance and preemption (sgl-project#8746)
* Fix decord dependency for aarch64 docker build (sgl-project#10529)
* enable prefix cache with dp (sgl-project#10459)
* [bugfix]hicache bench_long_context.py run failed (sgl-project#10523)
* Remove duplicated code (sgl-project#10545)
* CUDA Arch Independent (sgl-project#8813)
* [bench] Fix random seed in `bench_one_batch_server` (sgl-project#10548)
* [HiCache] Add tests for hicache storage mooncake backend (sgl-project#10171)
* [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle (sgl-project#9976)
* fix: update dsv3 fp4 ut (sgl-project#10584)
* vlm: remove redundant d2h movement of mm feature tensors (sgl-project#9987)
* Enable trtllm mla prefix extend (sgl-project#10526)
* [ROCm] Fix fp8 quantization accuracy issue. (sgl-project#10558)
* [HICache] introduce evict policy (sgl-project#10190)
* PullRequest: 303 Revert "PullRequest: 291 for fa3 kvcache: revert github "convert mla kvcache to bfloat16""
* aiter v0.1.5.post2 (sgl-project#10563)
* [PD] Improve disaggregation common backend and refactor mooncake backend (sgl-project#10273)
* chore: upgrade mooncake 0.3.6 (sgl-project#10596)
* [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
* Scale kkt after reduction (sgl-project#10604)
* fix deepep assert when PD disaggregation == null (sgl-project#8274)
* [RL] Add destroy process group api (sgl-project#9979)
* Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
* update deepep version for qwen3-next deepep moe (sgl-project#10624)
* support qwen3-next-fp8 deepep (sgl-project#10622)
* Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
* [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
* Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
* feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
* Garbage collector regression in the online server (sgl-project#10621)
* [router] refactor worker to builder pattern 1/n (sgl-project#10628)
* refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
* [Feature] Speculative decoding support lookahead (sgl-project#9873)
* [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
* [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
* model support: Sarashina2VisionForCausalLM (sgl-project#10632)
* feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
* chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
* Hicache L3 backend mooncake optimization configuration reading method (sgl-project#10319)
* [router] refactor worker to builder pattern 2/n (sgl-project#10633)
* [Feature]feat(get_ip): unify get_ip_xxx (sgl-project#10081)
* [router] refactor worker to builder pattern 3/n (sgl-project#10647)
* [sgl-kernel] Support moe_sum_reduce cuda kernel (sgl-project#10321)
* [router] refactor worker to builder pattern 4/n (sgl-project#10650)
* Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 (sgl-project#10634)
* [router] refactor worker to builder pattern 5/n (sgl-project#10653)
* [HiCacheStorage]support page_first_direct layout for generic set&get (sgl-project#10522)
* [router] preserve order of json params using preserve_order feature (sgl-project#10661)
* [router] refactor router and worker management 1/n (sgl-project#10664)
* fix: resolve sync issue (sgl-project#10668)
* [Auto Sync] Update .clang-format (20250919) (sgl-project#10670)
* [router] refactor router and worker management 2/n (sgl-project#10666)
* router-spec: Reorder `ChatCompletionRequest` and fix validation logic (sgl-project#10675)
* chore: cleanup docker image (sgl-project#10671)
* limit sgl-kernel causal conv1d to cuda only (sgl-project#10648)
* [Auto Sync] Update model_runner.py (20250920) (sgl-project#10679)
* [router] refactor router and worker management 2.5/n (sgl-project#10677)
* [1/2] Support deterministic inference with flashinfer attention backend (sgl-project#10645)
* [Auto Sync] Update deepseek_v2.py (20250920) (sgl-project#10683)
* chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile (sgl-project#10681)
* [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster (sgl-project#10680)
* Replace os.environ in layernorm.py (sgl-project#10684)
* fix(disagg): fix sending KV cache in case of MLA for NIXL backend (sgl-project#10673)
* fix: update run_suite (sgl-project#10685)
* fix: remove awq_dequantize deps (sgl-project#10686)
* [Auto Sync] Update modelopt_quant.py (20250920) (sgl-project#10688)
* [Feature] Support deterministic inference with FA3 backend (sgl-project#10651)
* feat: update server args  (sgl-project#10696)
* Super tiny fix extra logs (sgl-project#10697)
* [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization  (sgl-project#10592)
* Update release-docs.yml (sgl-project#10706)
* Refactors radix cache for extra key support (sgl-project#10317)
* [Router]fix: fix get_load missing api_key (sgl-project#10385)
* fix: disable gpt-oss b200 ut (sgl-project#10716)
* Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (sgl-project#10714)
* [Auto Sync] Update deepseek_v2.py (20250922) (sgl-project#10717)
* Support deterministic inference with triton backend (sgl-project#10694)
* [deterministic inference] Move batch invariant pkg to sglang (sgl-project#10695)
* [2/2] Support deterministic inference for temperature > 0 (sgl-project#10678)
* [Ascend] codeowner updates for ascend related files (sgl-project#10699)
* [theta] 支持qwen-vl的多模自定义采样
* revert e61d08c [theta] 支持qwen-vl的多模...
* PullRequest: 306 [theta] 支持qwen-vl的多模自定义采样
* [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% (sgl-project#10709)
* Convert FLASHINFER_WORKSPACE_SIZE to integer (sgl-project#10731)
* EPLB: prefer to use physical experts in the same node (sgl-project#9849)
* fix capture_bs when speculative decoding enabled (sgl-project#10730)
* Fix flaky logprobs test (sgl-project#10728)
* Fix CI TestChunkedSGMV (sgl-project#10737)
* [Docs, minor] Fix LLM doc matrix (sgl-project#10753)
* Add warnings and remove dependency for deterministic inference (sgl-project#10724)
* bugfix: Fix `get_worker_urls_for_model` in http/router.rs (sgl-project#10754)
* [router] refactor router and worker management 3/n (sgl-project#10727)
* [router] update ci so only execute benchmarks when labels are added (sgl-project#10757)
* Fix MTP MoE weight loading with NVFP4 target model. (sgl-project#10758)
* chore: bump sgl-kernel v0.3.12 (sgl-project#10732)
* [Generative Score API] Added test_scores_api.py to github CICD to run per commit (sgl-project#10755)
* refactor zero copy (sgl-project#10300)
* Fix multimodal registry and code sync scripts (sgl-project#10759)
* Enables TRT-LLM backend to be used for target_verify (sgl-project#10281)
* fix: kv events with tp > 1 (sgl-project#10541)
* [Auto Sync] Update flashattention_backend.py (20250922) (sgl-project#10762)
* [Feature] Add MLAProcess for DeepSeek MLA on NPU (sgl-project#10130)
* [Ascend] optimize Qwen-vl on Ascend (sgl-project#10556)
* [Ascend]optimize Qwen3 on Ascend (sgl-project#10574)
* [Auto Sync] Update configurer.py (20250923) (sgl-project#10765)
* [router] refactor router and worker management 4/n (sgl-project#10756)
* PullRequest: 310 新增 BailingMoEV3 模型及其 MLA 支持
* [router] remove pd router draining channel (sgl-project#10767)
* [router] fix logger type mismatch (sgl-project#10774)
* Use simulate acc len from `sglang.environ` (sgl-project#10771)
* Fix trtllm_mla slow concat kernel in MTP (sgl-project#10777)
* Move cached kernel to srt.utils (sgl-project#10776)
* feat: unify dockerfiles (sgl-project#10705)
* Introduce `FutureMap` (sgl-project#10715)
* chore: upgrade sgl-kernel 0.3.12 (sgl-project#10782)
* followup: clean up dockerfiles and release yamls  (sgl-project#10783)
* Clean up server args (sgl-project#10770)
* move `environ` into `sglang.srt` to avoid break SRT auto sync. (sgl-project#10791)
* Fix hicache mooncake backend CI (sgl-project#10792)
* [router] fix cache aware routing strategy and lock contention (sgl-project#10773)
* [router] responses api POST and GET with local storage (sgl-project#10581)
* model: support qwen3-vl series (sgl-project#10323)
* [fix][pd-disag]no need set next batch sampling info done in prefill (sgl-project#10259)
* [ROCm] Update aiter to v0.1.5.post3 (sgl-project#10812)
* [router] use dashmap for radix tree instead of hash for multi model (sgl-project#10814)
* router(grpc): Implement route for chat_cmpl endpoint (sgl-project#10761)
* fix ceval (sgl-project#10504)
* Remove duplicate code in qwen2 model (sgl-project#10540)
* [router] fix axum default body limit (sgl-project#10818)
* Fix latest main ci (sgl-project#10799)
* add tunning files for QWEN-3-NEXT (sgl-project#10794)
* [Auto Sync] Update protocol.py (20250923) (sgl-project#10820)
* fix: draft model IMA by overide max_positional_embeddings (sgl-project#10787)
* [Auto Sync] Update elementwise.py (20250923) (sgl-project#10823)
* [Auto Sync] Update simple_eval_common.py (20250923) (sgl-project#10824)
* [router] Support streaming for Openai Router Response api  (sgl-project#10822)
* [router] add auth middleware for api key auth (sgl-project#10826)
* [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (sgl-project#10825)
* Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" (sgl-project#10828)
* Add CI timeout guidelines (sgl-project#10829)
* [theta] fix serving_tokenization.py
* feat: add cache_salt support to request (sgl-project#10718)
* fix bailing_moe with enable_dp_attention (sgl-project#10860)
* ci: free space on workers for build (sgl-project#10786)
* router-grpc: Support jinja chat template content format detection (sgl-project#10832)
* [router] select first healthy worker on proxied get requests (sgl-project#10827)
* chore: Initial support for input config files (sgl-project#10534)
* router-grpc: Add tools processing and other paramters for apply_chat_template (sgl-project#10877)
* [router] consolidate health endpoints and flush cache (sgl-project#10876)
* Restruct sgl-kernel benchmark (sgl-project#10861)
* [Bug] Fix Issue#10215 (sgl-project#10572)
* [router] consolidate worker get loads (sgl-project#10880)
* [router] Support Oracle DB(ATP) Data Connector (sgl-project#10845)
* [router] simplify tokenizer dev doc (sgl-project#10895)
* [Auto Sync] Update model_config.py (20250925) (sgl-project#10885)
* [ci feature] add ci monitor (sgl-project#10872)
* [HiCache] Cleaning the deprecated host memory state (sgl-project#10778)
* integrate AIBrix KVcache (sgl-project#10376)
* Add fuse_moe per-channel tune (sgl-project#10915)
* [router] consolidate worker load monitoring (sgl-project#10894)
* router: Fix constraint proto and `build_constraint` in grpc router (sgl-project#10881)
* Refactor kv_cache_scheme handling for quantization (sgl-project#10132)
* refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` (sgl-project#10924)
* fix env flashinfer (sgl-project#10910)
* [minor] Remove deprecated function `get_ip` (sgl-project#10883)
* Rename customer label -> custom label (sgl-project#10899)
* [router] change log level to warning (sgl-project#10926)
* [router][refactor] Clean up protobuf fields (sgl-project#10923)
* Replace the Kimi-K2 generated tool call idx with history tool call count (sgl-project#10612)
* [ci] add ci-monitor workflow (sgl-project#10898)
* Remove pull_request trigger from CI monitor workflow (sgl-project#10932)
* router: Support parallel sampling num > 1 in grpc_server and non-stream handling (sgl-project#10929)
* Revert "Refactor kv_cache_scheme handling for quantization (sgl-project#10132)" (sgl-project#10935)
* Update CODEOWNERS to include JustinTong0323 in FC (sgl-project#10939)
* [PD-HiCache]: Support Async Offloading KVCache In Decode Side (sgl-project#10192)
* CI: Fix docker manifest build (sgl-project#10936)
* [router] update owners for router components (sgl-project#10927)
* Fuse write kv buffer into rope for qwen3 moe & bailing moe (sgl-project#10749)
* [router] add grpc client get and set (sgl-project#10955)
* [router]fix code owner syntax error (sgl-project#10956)
* [router] move grpc client from router to worker and builder (sgl-project#10958)
* [router] add move grpc worker management from router to worker manager (sgl-project#10960)
* [router] grpc router regular mode import cleanup (sgl-project#10963)
* [router] remove old/oudated/useless comments (sgl-project#10967)
* [router] remove old/oudated/useless comments across code base (sgl-project#10968)
* ci: fix rate-limit of huggingface with hf auth login (sgl-project#10947)
* Update label field comment to indicate deprecation (sgl-project#10970)
* Restruct gpu_memory_settings in a unify function and relax max_cuda_graph_bs (sgl-project#10372)
* ci: refactor nightly test (sgl-project#10495)
* refactor loading weights from remote instance coding format (sgl-project#10941)
* [router][grpc] Add helpfer functions for decoder in router.rs and fix specs (sgl-project#10971)
* Add simple docker file for B300 (sgl-project#10944)
* Ci monitor support performance (sgl-project#10965)
* [HiCache]: Support dynamic loading backends for hicache (sgl-project#10551)
* [Bugfix][Minor][Benchmark] Fix some bugs due to PR sgl-project#10495 (sgl-project#10982)
* [router][grpc] Support E2E non-stream chat completions (sgl-project#10980)
* fix: fp8 quantization failure of qwen 2.5 VL 7B model (sgl-project#10112)
* [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (sgl-project#10981)
* fix: make inference deterministic for large TP (sgl-project#10930)
* Add auth to get server info (sgl-project#10751)
* PullRequest: 315 bailingMoE: Fix deepep_mode keyerror
* Add support for topk metadata transferring for PD (sgl-project#10616)
* [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend (sgl-project#10565)
* Use jsonschema to constrain required or specific tool choice (sgl-project#10550)
* Fix profiler (sgl-project#10997)
* [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (sgl-project#10995)
* [router] basic mcp support for openai router response api (sgl-project#10978)
* [router] fix chat template loading and tokenizer path (sgl-project#10999)
* Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' (sgl-project#11009)
* [bugfix]Add empty_context import to two_batch_overlap.py (sgl-project#10964)
* prepare for sglang+verl (sgl-project#10555)
* [sgl-kernel] Optimize concat_mla_k kernel (sgl-project#10543)
* [HiCache] bug: fix mooncake store batch set v1 (sgl-project#11013)
* Fix FusedSetKVBufferArg  in RotaryEmbedding (sgl-project#11003)
* Update GLM-4.5 Model Doc (sgl-project#11017)
* [router] migrate to rust python module for pythonic parser (sgl-project#11033)
* fix: show failed models in nightly ci (sgl-project#10986)
* [router][tool call] Support normal content extraction before tool call (streaming) (sgl-project#11038)
* [router] add harmony tool parser base structure and interface (sgl-project#11036)
* Unify SGL Kernel Releases (sgl-project#10701)
* [1/2] Support FA4 for MHA Prefill in sgl-kernel (sgl-project#10940)
* fix: check if weights are already local before downloading (sgl-project#11015)
* [HiCacheStorage] mooncake store support page_first_direct layout (sgl-project#10591)
* [speculative decoding] rename lookahead to ngram (sgl-project#11010)
* Fix gemma 3 launch with `transformers:` the error: `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` (sgl-project#9614)
* Fix sgl-kernel benchmark dead code  (sgl-project#11022)
* [router][tool call] Improve normal content extraction and error handling (non-stream) (sgl-project#11050)
* chore: upgrade cutedsl 4.2.1 (sgl-project#11054)
* [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo (sgl-project#10976)
* chore: upgrade sgl-kernel 0.3.13 (sgl-project#11056)
* [router] add n to generate sampling params (sgl-project#11069)
* Use more general heuristics to set the default value of --mem-fraction-static (sgl-project#10975)
* [router][tool call] Separate `JsonParser` and `LlamaParser` (sgl-project#11073)
* Fix mem fraction static for nightly tests (sgl-project#11076)
* fix: fp8 mllama4 without vision modules being quantized (sgl-project#10611)
* [router] Use `get_pooled` in `process_single_choice` (sgl-project#11079)
* [router][grpc] Add logprobs support to router (sgl-project#11082)
* feat(reasoning): improve enable thinking from request (sgl-project#10875)
* [Profile] dump memory trace when cuda graph profile is enabled (sgl-project#11083)
* Remove hybrid_linear_attn attention backend and refactor attention registry (sgl-project#10816)
* [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (sgl-project#9642)
* Enable optional FP32 compute for LM Head (sgl-project#10729)
* Update CODEOWNERS for attention/ascend_backend.py (sgl-project#11092)
* [router] grpc router generate endpoint support (sgl-project#11070)
* [router][tool call] Full support for ToolChoice (sgl-project#11085)
* Fix spec filter batch when target extend  (sgl-project#10991)
* [Fix] Resolve performance drop in speculative decoding aiter backend (sgl-project#11087)
* [Auto Sync] Update fused_moe_triton_config.py (20250930) (sgl-project#11099)
* chore: bump sgl-kernel v0.3.14 (sgl-project#11067)
* [router][grpc-server] Fix gRPC server shutdown (sgl-project#11094)
* Fix eagle radix cache (sgl-project#10846)
* [Eval] Add `--repeat` in `run_eval`  (sgl-project#11101)
* [CPU] Adding Memory Capacity Acquisition Functionality (sgl-project#11102)
* Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization (sgl-project#11081)
* Support Dots.ocr model (sgl-project#11071)
* [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` (sgl-project#11113)
* Feature/make PEFT adapter module format compatibile (sgl-project#11080)
* fix: KimiK2Detector Improve tool call ID parsing with regex (sgl-project#10972)
* [router] add mcp list and mcp call in output array (sgl-project#11112)
* Organize spec-related data structures (sgl-project#10735)
* [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm (sgl-project#11114)
* [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) (sgl-project#11115)
* [Doc] Update multimodal language models documentation (sgl-project#11111)
* Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg (sgl-project#10985)
* docker: x86 dev builds for hopper and blackwell (sgl-project#11075)
* Refactor AMD CI. (sgl-project#11128)
* feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 (sgl-project#10760)
* [HiCache]bug fix: fixed blank item in host_mem_release_queue (sgl-project#11005)
* [Feature] Add EIC as sglang HiCache Storage backend (sgl-project#10271)
* [HiCache] Configurable and Dynamic Prefetch Timeout (sgl-project#10512)
* [router] add pd service in grpc router for pd (sgl-project#11120)
* [router] Add multi-turn tool calling loop support for MCP integration (sgl-project#11143)
* Fix metrics and request tracing (TimeStats) (sgl-project#11123)
* Remove debug print statement from scheduler output (sgl-project#11145)
* Intoduce cpu tensor as metadata to avoid blocking gpu kernel launch (sgl-project#10720)
* Fix ngram spec with page size > 1 (sgl-project#11135)
* [ROCm] To reduce the compiling time when using torch compile. (sgl-project#10559)
* Fix DeepSeek chunked prefill memory issue (sgl-project#11149)
* Clean up parallel_state.py (sgl-project#11148)
* Tiny improve dumper (sgl-project#11132)
* Tiny fix missing alt stream in nextn layer (sgl-project#10768)
* Fuse quantize and rope in trtllm_mla MTP (sgl-project#10779)
* Tiny detect slow ranks (sgl-project#10508)
* Remove unused pack `.item()` in paged allocator. (sgl-project#11156)
* Support dispatch low latency (sgl-project#10263)
* Support single batch overlap (sgl-project#10422)
* [router][grpc] Support tool call parser in streaming (sgl-project#11160)
* [model] Add mamba2 and Falcon-H1 support. (sgl-project#10988)
* Clean up ascend allocator (sgl-project#11152)
* fix cpp JIT compilation issue of ngram speculative decoding (sgl-project#10837)
* Tiny cleanup deepseek_v2.py (sgl-project#11163)
* Tiny fix ep_gather behavior different in CI (sgl-project#11130)
* Tiny remove duplicated code (sgl-project#11164)
* [proto] Add script to compile python protos (sgl-project#11171)
* Unify forward output datastructure (sgl-project#11124)
* [grpc] style fix for grpc compilation. (sgl-project#11175)
* Remove dp balance metadata and minimal token balance. (sgl-project#11170)
* Minor fixes for server_args, parallel_state, and test_deterministic.py (sgl-project#11159)
* fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 (sgl-project#11176)
* [router][grpc] Support streaming for v1/chat/completions (sgl-project#11179)
* Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (sgl-project#11138)
* Introduce naming convention in `io_struct` and base sglang io classes. (sgl-project#10133)
* [Generative Scores API] add performance tests to CICD (sgl-project#10830)
* [1/n] Enable DCA CUDA graph capture (sgl-project#9537)
* [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (sgl-project#11161)
* [CI] Tee server logs to both file and stdout/stderr using PIPE (sgl-project#11185)
* fix: radix cache memory accounting (sgl-project#10637)
* Tiny add PD disaggregation + DP attention test (sgl-project#11167)
* [router] Streaming support for MCP Tool Calls in OpenAI Router (sgl-project#11173)
* [Feature] Option to save model weights to CPU when memory saver mode is enabled (sgl-project#10873)
* Add --thinking-mode to run_eval (sgl-project#11189)
* [hot-fix] Fix CI break caused by adding `thinking_mode` in eval (sgl-project#11192)
* Tiny move files to utils folder (sgl-project#11166)
* Fix CUDA illegal memory access issues in speculative decoding (sgl-project#10892)
* Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. (sgl-project#10780)
* Optimize debug log position of PD abort request (sgl-project#11090)
* fix 3fs indices (sgl-project#10855)
* model: support starcoder2 (sgl-project#10609)
* [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. (sgl-project#10859)
* fix xeon ci check (sgl-project#10838)
* fix qwen2 eagle3 runtime error (sgl-project#10517)
* [minor] fix the lint (sgl-project#11198)
* [Fix] Fix the bug of the calculation of base_gpu_id (dp offset) in data_parallel_controller.py (sgl-project#10741)
* [fix] missing prefix_lens_cpu init when p/d disaggregation (sgl-project#11196)
* fix self.enable_kv_cache_events (sgl-project#11178)
* [HICache]: Refactor HiCache CI (sgl-project#11011)
* fix sampling_seed handling when deterministic is enabled (sgl-project#11096)
* [fix] enable flashmla when using draft model P/D attention select (sgl-project#11012)
* [router] fix get load response parsing (sgl-project#11213)
* [router] add grpc router pd mode for chat and generate (sgl-project#11140)
* EAGLE cache fix for HiCache (sgl-project#11215)
* Add --max-new-tokens CLI flag for MMMU evaluation (sgl-project#11217)
* Add DeepSeek-V3.2 Tool Call Template (sgl-project#11063)
* Tiny `skip_sample` adjust (sgl-project#11225)
* [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (sgl-project#11194)
* Update `v1/responses` to be more OpenAI-compatible. (sgl-project#9624)
* chore: bump sgl-kernel v0.3.14.post1 (sgl-project#11137)
* Update DeepGEMM repository tag to specific commit (sgl-project#11229)
* [Feat] Support Torch Symm Mem AllReduce (sgl-project#10571)
* Refactor and optimize mooncake CI (sgl-project#11162)
* [Fix AMD CI] VRAM cleanup (sgl-project#11174)
* Update transformers package version to 4.57.0 (sgl-project#11222)
* Remove gdrcopy check in ci_install_deepep.sh (sgl-project#11237)
* Rename runner labels (sgl-project#11228)
* [Auto Sync] Update io_struct.py (20251004) (sgl-project#11206)
* Create two new GH workflows to automatically bump SGLang and Kernel version (sgl-project#10996)
* Fix spec_utils.py (sgl-project#11247)
* ci: make find_local_hf_snapshot_dir more robust (sgl-project#11248)
* [quantization] Fix scale remapping for mllama4 (sgl-project#10042)
* [quantization] Enable aiter mxfp4 fused_moe for Quark (sgl-project#10048)
* Use cu128 for torch audio to fix some CI tests (sgl-project#11251)
* Bump torch_memory_saver 0.0.9rc2 (sgl-project#11252)
* update sgl kernel version to 0.3.14.post1 (sgl-project#11242)
* Update condition for sgl-kernel-benchmark-test (sgl-project#11254)
* feat: add shortcut detection for multimodal templates in Jinja format (sgl-project#11209)
* Improve bot release workflow (sgl-project#11240)
* Add flashmla and fast hadamard transform to Dockerfile (sgl-project#11235)
* Support DeepSeek V3.2 Exp (sgl-project#11061)
* chore: bump SGLang version to 0.5.3rc2 (sgl-project#11259)
* chore: bump SGLang version to 0.5.3 (sgl-project#11263)
* [theta] fix bailing v3
* [router] add ipv6 support across all components (sgl-project#11219)
* Remove env var warnings for release (sgl-project#11262)
* Enable native ModelOpt quantization support (1/3)  (sgl-project#7149)
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (sgl-project#11270)
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (sgl-project#11274)
* docker: add manifest to versioned docker releases (sgl-project#11268)
* [Bug] Fix incorrect assertion in FA4 and add UT. (sgl-project#11182)
* [router][grpc] Refine streaming processes (sgl-project#11277)
* Fix code sync scripts (sgl-project#11276)
* [Auto Sync] Update test_utils.py (20251006) (sgl-project#11280)
* Rename max_micro_batch_size -> pp_max_micro_batch_size (sgl-project#11279)
* Revert the amd ci test timeout back to 1200s and split the 8-gpu deepseek job into two. (sgl-project#11238)
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (sgl-project#11261)
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (sgl-project#11282)
* docs: update sgl-kernel README (sgl-project#11286)
* chore: bump sgl-kernel version to 0.3.15 (sgl-project#11281)
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (sgl-project#11283)
* convert test_deterministic into unit tests (sgl-project#11095)
* Feature/longbench v2 evaluation utils (sgl-project#10949)
* [ci] fix pp test (sgl-project#11294)
* EAGLE cache fix for SWARadixCache (sgl-project#11231)
* Remove overlap thread (sgl-project#11210)
* [router] add reasoning and tool parser argument in router (sgl-project#11290)
* Remove sampling info events and overlap thread file (sgl-project#11300)
* Introduce future indices (sgl-project#11301)
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (sgl-project#11068)
* [Docs] [Router] Update Observability and Common Issues Section (sgl-project#11302)
* [router] add get server info and get model info in grpc server (sgl-project#11303)
* [router][grpc] Refactor chat template content format detection (sgl-project#11288)
* [Doc] HiCache Design Documents (sgl-project#11027)
* [Doc]: Best Practice for HICache (sgl-project#11001)
* [router] fix grpc connection conversion and add optimization (sgl-project#11305)
* [router][grpc] Fix sampling_params.stop_strs is None (sgl-project#11306)
* Update tool parser and related documentation (sgl-project#11223)
* [router][grpc] Fix error message format in grpc chat handler (sgl-project#11307)
* [quantization] Properly ignore quantization for layers excluded in quant_config (sgl-project#11205)
* [router] support Openai router conversation API CRUD (sgl-project#11297)
* [router][grpc] Fix request_id extraction when n > 1 (sgl-project#11311)
* [router] cleanup worker health check to return early (sgl-project#11310)
* [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (sgl-project#11304)
* Clean match_prefix and prepare_for_extend for mem cache V2 (sgl-project#11200)
* ci: unify the model launch method of nightly ci (sgl-project#11230)
* [Chore] Update xgrammar 0.1.24 -> 0.1.25 (sgl-project#10710)
* update sampling_params documentation with defaults (sgl-project#11315)
* Optimize copy_kv_cache for spec decoding (sgl-project#11126)
* Rename `ngram_utils` -> `ngram_info` (sgl-project#11316)
* [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (sgl-project#11314)
* [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (sgl-project#9545)
* [8/N] MoE Refactor: deprecate `EPMoE` (sgl-project#11211)
* Skip weight loading in deepgemm compilation (sgl-project#11312)
* [2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)
* [Doc] Update mooncake nvlink transport doc for PD disaggregation (sgl-project#11321)
* fix(decode): adjust ServerArgs import to explicit module path (sgl-project#11007)
* Support LoRA in bench_serving oai interface (sgl-project#11318)
* benchmark: enhance configurable multimodal benchmarking in bench_serving (sgl-project#9812)
* [CI] improve disaggregation CI. (sgl-project#11264)
* [theta] fix tokenization
* model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (sgl-project#10909)
* [router] refactor generate to use new pipeline arch (sgl-project#11323)
* [router] improve reasoning parser lock and reduce req cloning (sgl-project#11336)
* [router][grpc] Cleanup debug logs in grpc_server and grpc_router (sgl-project#11340)
* [router] Fix all unused_qualifications (sgl-project#11341)
* [router] Support history management using conversation (sgl-project#11339)
* [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (sgl-project#11342)
* fix: fix revision for sgl-flash-attn in sgl-kernel (sgl-project#11327)
* [Auto Sync] Update scheduler.py (20251009) (sgl-project#11350)
* [Generative Score API] Multi-Item scoring with custom attention mask. (sgl-project#10979)
* [router][grpc] disable health check generation and increase timeout (sgl-project#11353)
* [router] Refactor OpenAI router: split monolithic file and move location (sgl-project#11359)
* [router][lint] Add unused_qualifications to cargo lint warnings (sgl-project#11366)
* [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (sgl-project#11309)
* PullRequest: 323 [theta] Standardize error codes: 1) unify preprocessing errors for chat and completions requests to 400; 2) return standard HTTP error codes for multimodal load data requests
* [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (sgl-project#11373)
* add code pp support for nixl (sgl-project#11375)
* fix bench_serving mishandling of internal states (sgl-project#11376)
* PullRequest: 322 Support MTP and subclass BailingMoEV3AttentionMLA from DeepseekV2AttentionMLA
* [router][grpc] Replace fake health check with correct ones (sgl-project#11387)
* [router] change grpc client from mutable to clone (sgl-project#11394)
* chore: upgrade flashinfer 0.4.0 (sgl-project#11364)
* [router] conversation item API: create, retrieve and delete (sgl-project#11369)
* chore: bump SGLang version to 0.5.3.post1 (sgl-project#11324)
* move more files under srt/utils (sgl-project#11285)
* [grammar] Avoid server crash when grammar backend is None (sgl-project#11401)
* fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (sgl-project#11389)
* [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (sgl-project#11365)
* [CI] Refactor PD disaggregation test suite (sgl-project#11363)
* Replace pad with cat for better performance (sgl-project#11388)
* fix: reinstall torch in deps install (sgl-project#11414)
* feat(hicache): Support passing prefix keys for l3 store. (sgl-project#9045)
* fix file and object naming scheme in HiCacheNixl to avoid data corruption (sgl-project#10969)
* Dedicated toml files for CPU/XPU (sgl-project#10734)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11144)
* chore: update pyproject (sgl-project#11420)
* PullRequest: 330 [theta] qwen-vl supports passing video frames as base64 images, e.g. data:video/jpeg;base64,frame1_base64,frame2_base64,...,frameN_base64
* fix: fix video input for qwen3-vl (sgl-project#11361)
* perf: optimize qwen-vl with symm mem allreduce (sgl-project#11381)
* [HiCache] feat: add multi tenant with prefix tag (sgl-project#9256)
* [CI] Merge build-dev into workflow matrix (sgl-project#11345)
* Revert "perf: optimize qwen-vl with symm mem allreduce" (sgl-project#11436)
* Revert "fix: fix video input for qwen3-vl" (sgl-project#11437)
* Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (sgl-project#11433)
* [router] Fix ci nvcc not found error (sgl-project#11411)
* feat(mooncake): support GB suffix for global_segment_size  (sgl-project#10745)
* Separate allocation logic from scheduler (sgl-project#11313)
* [router] disable rate limiter by default (sgl-project#11435)
* [router] leverage RAII to actively cancel request during client disconnect (sgl-project#11399)
* [router][grpc] Consolidate parser checks for chat completions (sgl-project#11439)
* Reorder PD disagg CI tests (#11438)
* fix: Change dsv32 hack temporary path to use system temp directory (#11445)
* Fix batch invariant ops (#11368)
* [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
* [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
* Remove tilelang dependency in Dockerfile (#11455)
* Enable native ModelOpt quantization support (2/3) (#9991)
* Reland [1/2] Optimizations and refactors about quant kernel (#10312)
* Super tiny delete unused openai router in sgl-router (#11448)
* Adjust logits metada init for target verify (#11467)
* [Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
* Fix enable_v2 in int8 quant (#11470)
* [Fix] Fix split prefill with fa3. (#11428)
* fix stop when stream (#11462)
* Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. (#8919)
* PullRequest: 334 [theta] Fix various qwen3-vl bugs
* [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019)
* fix Xeon CI (#11454)
* [CI] Add nightly builds to dockerhub (#9804)
* [Feature] support regex strings as a stopping condition (#10635)
* Beta spec-overlap for EAGLE (#11398)
* Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
* [Router]: Small Typo in a comment within tree.rs (#11489)
* chore: bump sgl-kernel version to 0.3.16 (#11476)
* [smol] [perf] Qwen3-VL in place op. (#11481)
* [chore][1/N] Avoid using default mutable parameters (#11478)
* [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
* [perf] Replace json -> orjson in hot path (#11221)
* [chore][2/N] Avoid using default mutable parameters (#11479)
* Fix the GPT function calling regex to allow dash in the name (#10577)
* bailingMoE: Fix Key error of deepep_mode (#11465)
* Fix CI break caused by express-laned PRs. (#11499)
* Move args from `global_config` to `environ` (#11332)
* move fla env check position (#11500)
* Temporarily remove b200 tests (#11501)
* Fix port conflicts in CI (#11497)
* temporarily remove b200 tests (#11502)
* Fix unit tests (#11503)
* Bugfix: Fix Type consistency for KV indices in SWARadixCache (#11452)
* doc: add doc for adding new models into nightly-ci (#11443)
* [CI] fix lint (#11509)
* Deprecate `global_server_args_dict` (#11331)
* chore: remove flashinfer cleanup cache (#11514)
* fix: revert temporarily remove b200 tests (#11515)
* [Fix] Improve longbench prompt and other logics (#11474)
* Sync changes on io_struct.py and deterministic ops (#11498)
* [lint] Fix the lint issue (#11516)
* Revert "Deprecate `global_server_args_dict`" (#11520)
* Improve dp attention port assignment scheme (#5889)
* [theta] rebase public/main 1013-2
* [router] openai router: support grok model (#11511)
* docs(router): add token-bucket rate limiting to the docs (#11485)
* [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
* Update DeepSeek-R1-FP4 default config on blackwell (#11512)
* [Fix]: add missing device attribute to ChunkCache (#11493)
* [Feature] Support mamba radix cache v0 (#11214)
* ci: improve nightly-ci (#11385)
* [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering (#11505)
* [HICache]: Support 3FS-Store with page_first_direct layout (#11460)
* Tiny fix test run estimated time (#11544)
* [Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
* [theta] rebase public/main 1013-5
* Deprecate `global_server_args_dict` (#11528)
* [theta] rebase public/main 1013-6
* [Fix] Add per_channel_quant parameter to MoE config functions (#11201)
* [router][ci] Add Nightly Release Workflow for SGLang Router (#11527)
* [router] allow tokenizer path to be dir (#11530)
* Remove `tp_worker.worker` (#11548)
* fix: fix video input for qwen3-vl (#11442)
* [NVIDIA] BUMP FA3 (#11444)
* [router][Fix] Include grpc reflection runtime dependency (#11419)
* Adjust overlap event loop (#11507)
* Move deep gemm related arguments to `sglang.srt.environ` (#11547)
* [router][grpc] Further delegate non-stream processing to `processing.rs`  (#11553)
* [router] allow user to specify chat template path (#11549)
* Minor: improve sampler & remove unused fields from model_config.py (#11531)
* [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
* Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) (#11557)
* [CI] Add Basic Test for DeepSeek V3.2 (#11308)
* [router][grpc] Add error handling to `generate_tool_constraints` (#11562)
* [NVIDIA] update pyproject.toml to support cu130 option (#11521)
* [CI Monitor] Ci monitor only deal with main branch in default (#11538)
* Tiny cleanup fp4 gemm calls (#11537)
* [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564)
* [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds (#11571)
* [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534)
* chore: bump sgl-kernel version to 0.3.16.post1 (#11573)
* Fix accept rate in speculative decoding metrics (#11572)
* Compilation Folder Reset (#11539)
* [FEATURE] Add Profile Trace Merger for Distributed Traces (#11413)
* [DSv32] Use torch.compile for _get_logits_head_gate (#11565)
* Make DeepEP combine recv do not overlap (#11535)
* bench_serving support PD Disaggregation (#11542)
* Implement LRU eviction policy for LoRA adapters (#11041)
* PullRequest: 337 Support multimodal requests via the completions protocol
* Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582)
* chore: bump sgl-kernel version to 0.3.16.post2 (#11583)
* [Auto Sync] Update model_config.py (20251014) (#11580)
* Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
* [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588)
* [router] update generate spec to align with sgl io struct (#11591)
* [router] change worker api to async instead of sync (#11566)
* Update news section in README.md (#11598)
* [router] delete useless table content comment in spec (#11597)
* [router] allow router launch server to use grpc mode (#11600)
* [Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
* [router] when given both local tokenizer and chat template, log all (#11601)
* [AMD CI] Add image and weights caching. (#11593)
* Update release-docker-dev.yml (#11603)
* Optimize Triton Draft Backend (#11556)
* Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function (#11586)
* make radix cache deterministic (#10721)
* move eagle draft post process to cuda graph (#11434)
* Reduce one step decode for draft model. (#11561)
* [router] add py binding and readme for openai router and history backend (#11453)
* [theta] print load mm cost
* [theta] Bailing 4-head supports tp8
* [router] cleanup app context and move to startup (#11617)
* [router] add chang and keyang to sgl router authors (#11620)
* use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. (#11605)
* [router] update router readme to latest features (#11619)
* Fix log for chunked prefix cache (#11624)
* [Auto Sync] Update scheduler.py, server_args.py (20251014) (#11623)
* [Auto Sync] Update collector.py (20251014) (#11625)
* [Minor] Update xgrammar dependency (#11622)
* Update install.md (#11631)
* fix: Update SGL_KERNEL_VERSION to 0.3.15 (#11633)
* [router][grpc] add warm up to grpc server (#11627)
* Refactor kv cache free (#11351)
* [router] update router doc to latest features (#11639)
* fix: upgrade transformers to 4.57.1 (#11628)
* [router] add worker self discovery for metadata (#11638)
* [router] upgrade to 0.2.0 (#11642)
* [theta] print qwen-vl latency
* [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
* [theta] print qwen-vl latency
* [1/N]Support  DeepSeek-R1 w4a8 normal deepep (#8247)
* [Fix] Fix accuracy bug in CSGMV kernel caching key. (#11579)
* feat: add add_chunked_prefix_cache_attention_backend (#11636)
* Super tiny improve FA3 import error message (#11590)
* [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl  (#11458)
* [Doc] Update support matrix for attn and hybrid attn (#11293)
* Clean up some Qwen3-Next and deterministic code (#11585)
* docs: update sglang installation guide (#11659)
* [theta] update aci image and dependencies
* Tiny cleanup some eagle unused codes (#11660)
* Fix 1-step draft model forward (#11653)
* [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
* [router] Fix response api related spec (#11621)
* Fix missing json imports in serving_responses.py (#11681)
* [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674)
* [sgl-kernel] Optimize gguf test (#11667)
* [router][grpc] Simplify model_id determination (#11684)
* [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676)
* chore: bump SGLang version to 0.5.3.post2 (#11680)
* [CI][XPU]enable sglang CI on Intel XPU (#9493)
* enable rmsnorm on XPU (#10248)
* Sync code and test CI; rename some env vars (#11686)
* docs: Add Contributor Covenant Code of Conduct (#11689)
* [theta] Dockerfile: add deepgemm compilation cache (needs periodic updates 😂)
* [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679)
* [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
* [sgl-kernel] support hadamard (#11663)
* Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
* Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)