
[router][grpc-server] Fix gRPC server shutdown #11094

Merged
slin1237 merged 2 commits into main from grpc-server-exit on Sep 30, 2025

Conversation

@slin1237
Collaborator

Motivation

When shutting down the gRPC server with Ctrl+C, the system exhibited multiple issues:

  1. KeyboardInterrupt tracebacks from scheduler processes in their ZMQ recv loops
  2. Main process hanging - required multiple Ctrl+C presses to exit
  3. Leaked instances and improper cleanup warnings
  4. asyncio event loop not terminating after the shutdown signal

Root causes:

  • Scheduler processes received SIGINT: pressing Ctrl+C sends SIGINT to the entire process group, so scheduler subprocesses printed KeyboardInterrupt tracebacks even though they should terminate only when the parent dies (via kill_itself_when_parent_died()).
  • ZMQ asyncio context not terminated: the zmq.asyncio.Context was never explicitly terminated during shutdown, keeping the asyncio event loop alive with background tasks and preventing a clean exit.
  • Atexit handler waiting for child processes: Python's multiprocessing atexit handler tried to join() scheduler processes that were still running, causing the main process to hang.

Benchmarking and Profiling

Checklist


@slin1237 slin1237 added the bug and router labels on Sep 30, 2025
@slin1237 slin1237 merged commit 5a290a5 into main Sep 30, 2025
68 of 80 checks passed
@slin1237 slin1237 deleted the grpc-server-exit branch September 30, 2025 11:12
ch-tiger1 pushed a commit to ch-tiger1/sglang that referenced this pull request Oct 9, 2025
BraveY pushed a commit to openanolis/sglang that referenced this pull request Oct 22, 2025
Merge branch sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main
https://code.alipay.com/Theta/SGLang/pull_requests/342?tab=diff

Reviewed-by: 苏墨 <xuyongfei.xyf@antgroup.com>


* [router] minor code clean up in server startup (sgl-project#10470)
* [bugfix] fix typo (sgl-project#10471)
* [PD metrics] Add latency Histogram metrics of each stage for generate requests (sgl-project#8710)
* [CI] Fix runner for sgl-kernel (sgl-project#9887)
* fix(internvl): fix accuracy issue of normalization (sgl-project#10375)
* fix: gpt-oss streaming dropping normal content when tools are provided but not used (sgl-project#9657)
* model: support solar (sgl-project#8189)
* fix: resolve sgl-kernel ut (sgl-project#10476)
* [1/2] Speed up trtllm_mla attention backend (>10% e2e) (sgl-project#10473)
* Fix `--dataset-path` in `bench_one_batch_server` (sgl-project#10475)
* [Env] minimal version for organizing envs (sgl-project#10479)
* chore: bump v0.3.10 sgl-kernel (sgl-project#10478)
* [router] multi model registration fix (sgl-project#10481)
* [2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (sgl-project#10286)
* [Auto Sync] Update registry.py (20250915) (sgl-project#10484)
* [router] fix worker registration in multi model mode (sgl-project#10486)
* fix crash of DeepSeek-V3 update_weights_from_disk (sgl-project#8863)
* Temporay work-around for rocm 7.0.0 alpha with enabling data-parallel issue (sgl-project#10434)
* [Hicache] Evaluate Per-Round Metrics in Multiturn Bench (sgl-project#10203)
* [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints (sgl-project#10336)
* Add Logprobs unit test with a loose threshold (sgl-project#10230)
* [router] add router db connector for responses api (sgl-project#10487)
* Remove wrong imports `from sglang.python` (sgl-project#10493)
* [router] fix router manager and router init in server (sgl-project#10499)
* Cache the result of `is_blackwell` platform check (sgl-project#10498)
* feat: update support for qwen3next model (sgl-project#10466)
* Minor fix lint introduced by sgl-project#10466 (sgl-project#10507)
* chore: upgrade sgl-kernel 0.3.10 (sgl-project#10500)
* Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (sgl-project#10491)
* Fix CI when sgl-kernel is changed but srt is not changed (sgl-project#10515)
* Support sgl-router parallel_batch in bench_one_batch_server (sgl-project#10506)
* [CPU] fix CPU backend sel. issue for Llama4 (sgl-project#10511)
* adjust import setuptools_rust (sgl-project#10524)
* Fix formatting in long code blocks (sgl-project#10528)
* skip vision_model for lora (sgl-project#10530)
* [2/2] Speed up trtllm_mla attention backend (sgl-project#10474)
* support using fa4 on deepseek on blackwell (sgl-project#9928)
* [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) (sgl-project#10494)
* [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) (sgl-project#10538)
* feat: add priority based scheduling with priority based request acceptance and preemption (sgl-project#8746)
* Fix decord dependency for aarch64 docker build (sgl-project#10529)
* enable prefix cache with dp (sgl-project#10459)
* [bugfix]hicache bench_long_context.py run failed (sgl-project#10523)
* Remove duplicated code (sgl-project#10545)
* CUDA Arch Independent (sgl-project#8813)
* [bench] Fix random seed in `bench_one_batch_server` (sgl-project#10548)
* [HiCache] Add tests for hicache storage mooncake backend (sgl-project#10171)
* [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle (sgl-project#9976)
* fix: update dsv3 fp4 ut (sgl-project#10584)
* vlm: remove redundant d2h movement of mm feature tensors (sgl-project#9987)
* Enable trtllm mla prefix extend (sgl-project#10526)
* [ROCm] Fix fp8 quantization accuracy issue. (sgl-project#10558)
* [HICache] introduce evict policy (sgl-project#10190)
* PullRequest: 303 Revert "PullRequest: 291 for fa3 kvcache: revert github "convert mla kvcache to bfloat16""
* aiter v0.1.5.post2 (sgl-project#10563)
* [PD] Improve disaggregation common backend and refactor mooncake backend (sgl-project#10273)
* chore: upgrade mooncake 0.3.6 (sgl-project#10596)
* [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
* Scale kkt after reduction (sgl-project#10604)
* fix deepep assert when PD disaggregation == null (sgl-project#8274)
* [RL] Add destroy process group api (sgl-project#9979)
* Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
* update deepep version for qwen3-next deepep moe (sgl-project#10624)
* support qwen3-next-fp8 deepep (sgl-project#10622)
* Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
* [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
* Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
* feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
* Garbage collector regression in the online server (sgl-project#10621)
* [router] refactor worker to builder pattern 1/n (sgl-project#10628)
* refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
* [Feature] Speculative decoding support lookahead (sgl-project#9873)
* [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
* [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
* model support: Sarashina2VisionForCausalLM (sgl-project#10632)
* feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
* chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
* Hicache L3 backend mooncake optimization configuration reading method (sgl-project#10319)
* [router] refactor worker to builder pattern 2/n (sgl-project#10633)
* [Feature]feat(get_ip): unify get_ip_xxx (sgl-project#10081)
* [router] refactor worker to builder pattern 3/n (sgl-project#10647)
* [sgl-kernel] Support moe_sum_reduce cuda kernel (sgl-project#10321)
* [router] refactor worker to builder pattern 4/n (sgl-project#10650)
* Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 (sgl-project#10634)
* [router] refactor worker to builder pattern 5/n (sgl-project#10653)
* [HiCacheStorage]support page_first_direct layout for generic set&get (sgl-project#10522)
* [router] preserve order of json params using preserve_order feature (sgl-project#10661)
* [router] refactor router and worker management 1/n (sgl-project#10664)
* fix: resolve sync issue (sgl-project#10668)
* [Auto Sync] Update .clang-format (20250919) (sgl-project#10670)
* [router] refactor router and worker management 2/n (sgl-project#10666)
* router-spec: Reorder `ChatCompletionRequest` and fix validation logic (sgl-project#10675)
* chore: cleanup docker image (sgl-project#10671)
* limit sgl-kernel causal conv1d to cuda only (sgl-project#10648)
* [Auto Sync] Update model_runner.py (20250920) (sgl-project#10679)
* [router] refactor router and worker management 2.5/n (sgl-project#10677)
* [1/2] Support deterministic inference with flashinfer attention backend (sgl-project#10645)
* [Auto Sync] Update deepseek_v2.py (20250920) (sgl-project#10683)
* chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile (sgl-project#10681)
* [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster (sgl-project#10680)
* Replace os.environ in layernorm.py (sgl-project#10684)
* fix(disagg): fix sending KV cache in case of MLA for NIXL backend (sgl-project#10673)
* fix: update run_suite (sgl-project#10685)
* fix: remove awq_dequantize deps (sgl-project#10686)
* [Auto Sync] Update modelopt_quant.py (20250920) (sgl-project#10688)
* [Feature] Support deterministic inference with FA3 backend (sgl-project#10651)
* feat: update server args  (sgl-project#10696)
* Super tiny fix extra logs (sgl-project#10697)
* [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization  (sgl-project#10592)
* Update release-docs.yml (sgl-project#10706)
* Refactors radix cache for extra key support (sgl-project#10317)
* [Router]fix: fix get_load missing api_key (sgl-project#10385)
* fix: disable gpt-oss b200 ut (sgl-project#10716)
* Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (sgl-project#10714)
* [Auto Sync] Update deepseek_v2.py (20250922) (sgl-project#10717)
* Support deterministic inference with triton backend (sgl-project#10694)
* [deterministic inference] Move batch invariant pkg to sglang (sgl-project#10695)
* [2/2] Support deterministic inference for temperature > 0 (sgl-project#10678)
* [Ascend] codeowner updates for ascend related files (sgl-project#10699)
* [theta] Support qwen-vl multimodal custom sampling
* revert e61d08c [theta] Support qwen-vl multimodal custom…
* PullRequest: 306 [theta] Support qwen-vl multimodal custom sampling
* [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% (sgl-project#10709)
* Convert FLASHINFER_WORKSPACE_SIZE to integer (sgl-project#10731)
* EPLB: prefer to use physical experts in the same node (sgl-project#9849)
* fix capture_bs when speculative decoding enabled (sgl-project#10730)
* Fix flaky logprobs test (sgl-project#10728)
* Fix CI TestChunkedSGMV (sgl-project#10737)
* [Docs, minor] Fix LLM doc matrix (sgl-project#10753)
* Add warnings and remove dependency for deterministic inference (sgl-project#10724)
* bugfix: Fix `get_worker_urls_for_model` in http/router.rs (sgl-project#10754)
* [router] refactor router and worker management 3/n (sgl-project#10727)
* [router] update ci so only execute benchmarks when labels are added (sgl-project#10757)
* Fix MTP MoE weight loading with NVFP4 target model. (sgl-project#10758)
* chore: bump sgl-kernel v0.3.12 (sgl-project#10732)
* [Generative Score API] Added test_scores_api.py to github CICD to run per commit (sgl-project#10755)
* refactor zero copy (sgl-project#10300)
* Fix multimodal registry and code sync scripts (sgl-project#10759)
* Enables TRT-LLM backend to be used for target_verify (sgl-project#10281)
* fix: kv events with tp > 1 (sgl-project#10541)
* [Auto Sync] Update flashattention_backend.py (20250922) (sgl-project#10762)
* [Feature] Add MLAProcess for DeepSeek MLA on NPU (sgl-project#10130)
* [Ascend] optimize Qwen-vl on Ascend (sgl-project#10556)
* [Ascend]optimize Qwen3 on Ascend (sgl-project#10574)
* [Auto Sync] Update configurer.py (20250923) (sgl-project#10765)
* [router] refactor router and worker management 4/n (sgl-project#10756)
* PullRequest: 310 Add the BailingMoEV3 model and its MLA support
* [router] remove pd router draining channel (sgl-project#10767)
* [router] fix logger type mismatch (sgl-project#10774)
* Use simulate acc len from `sglang.environ` (sgl-project#10771)
* Fix trtllm_mla slow concat kernel in MTP (sgl-project#10777)
* Move cached kernel to srt.utils (sgl-project#10776)
* feat: unify dockerfiles (sgl-project#10705)
* Introduce `FutureMap` (sgl-project#10715)
* chore: upgrade sgl-kernel 0.3.12 (sgl-project#10782)
* followup: clean up dockerfiles and release yamls  (sgl-project#10783)
* Clean up server args (sgl-project#10770)
* move `environ` into `sglang.srt` to avoid break SRT auto sync. (sgl-project#10791)
* Fix hicache mooncake backend CI (sgl-project#10792)
* [router] fix cache aware routing strategy and lock contention (sgl-project#10773)
* [router] responses api POST and GET with local storage (sgl-project#10581)
* model: support qwen3-vl series (sgl-project#10323)
* [fix][pd-disag]no need set next batch sampling info done in prefill (sgl-project#10259)
* [ROCm] Update aiter to v0.1.5.post3 (sgl-project#10812)
* [router] use dashmap for radix tree instead of hash for multi model (sgl-project#10814)
* router(grpc): Implement route for chat_cmpl endpoint (sgl-project#10761)
* fix ceval (sgl-project#10504)
* Remove duplicate code in qwen2 model (sgl-project#10540)
* [router] fix axum default body limit (sgl-project#10818)
* Fix latest main ci (sgl-project#10799)
* add tunning files for QWEN-3-NEXT (sgl-project#10794)
* [Auto Sync] Update protocol.py (20250923) (sgl-project#10820)
* fix: draft model IMA by overide max_positional_embeddings (sgl-project#10787)
* [Auto Sync] Update elementwise.py (20250923) (sgl-project#10823)
* [Auto Sync] Update simple_eval_common.py (20250923) (sgl-project#10824)
* [router] Support streaming for Openai Router Response api  (sgl-project#10822)
* [router] add auth middleware for api key auth (sgl-project#10826)
* [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (sgl-project#10825)
* Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" (sgl-project#10828)
* Add CI timeout guidelines (sgl-project#10829)
* [theta] fix serving_tokenization.py
* feat: add cache_salt support to request (sgl-project#10718)
* fix bailing_moe with enable_dp_attention (sgl-project#10860)
* ci: free space on workers for build (sgl-project#10786)
* router-grpc: Support jinja chat template content format detection (sgl-project#10832)
* [router] select first healthy worker on proxied get requests (sgl-project#10827)
* chore: Initial support for input config files (sgl-project#10534)
* router-grpc: Add tools processing and other paramters for apply_chat_template (sgl-project#10877)
* [router] consolidate health endpoints and flush cache (sgl-project#10876)
* Restruct sgl-kernel benchmark (sgl-project#10861)
* [Bug] Fix Issue#10215 (sgl-project#10572)
* [router] consolidate worker get loads (sgl-project#10880)
* [router] Support Oracle DB(ATP) Data Connector (sgl-project#10845)
* [router] simplify tokenizer dev doc (sgl-project#10895)
* [Auto Sync] Update model_config.py (20250925) (sgl-project#10885)
* [ci feature] add ci monitor (sgl-project#10872)
* [HiCache] Cleaning the deprecated host memory state (sgl-project#10778)
* integrate AIBrix KVcache (sgl-project#10376)
* Add fuse_moe per-channel tune (sgl-project#10915)
* [router] consolidate worker load monitoring (sgl-project#10894)
* router: Fix constraint proto and `build_constraint` in grpc router (sgl-project#10881)
* Refactor kv_cache_scheme handling for quantization (sgl-project#10132)
* refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` (sgl-project#10924)
* fix env flashinfer (sgl-project#10910)
* [minor] Remove deprecated function `get_ip` (sgl-project#10883)
* Rename customer label -> custom label (sgl-project#10899)
* [router] change log level to warning (sgl-project#10926)
* [router][refactor] Clean up protobuf fields (sgl-project#10923)
* Replace the Kimi-K2 generated tool call idx with history tool call count (sgl-project#10612)
* [ci] add ci-monitor workflow (sgl-project#10898)
* Remove pull_request trigger from CI monitor workflow (sgl-project#10932)
* router: Support parallel sampling num > 1 in grpc_server and non-stream handling (sgl-project#10929)
* Revert "Refactor kv_cache_scheme handling for quantization (sgl-project#10132)" (sgl-project#10935)
* Update CODEOWNERS to include JustinTong0323 in FC (sgl-project#10939)
* [PD-HiCache]: Support Async Offloading KVCache In Decode Side (sgl-project#10192)
* CI: Fix docker manifest build (sgl-project#10936)
* [router] update owners for router components (sgl-project#10927)
* Fuse write kv buffer into rope for qwen3 moe & bailing moe (sgl-project#10749)
* [router] add grpc client get and set (sgl-project#10955)
* [router]fix code owner syntax error (sgl-project#10956)
* [router] move grpc client from router to worker and builder (sgl-project#10958)
* [router] add move grpc worker management from router to worker manager (sgl-project#10960)
* [router] grpc router regular mode import cleanup (sgl-project#10963)
* [router] remove old/oudated/useless comments (sgl-project#10967)
* [router] remove old/oudated/useless comments across code base (sgl-project#10968)
* ci: fix rate-limit of huggingface with hf auth login (sgl-project#10947)
* Update label field comment to indicate deprecation (sgl-project#10970)
* Restruct gpu_memory_settings in a unify function and relax max_cuda_graph_bs (sgl-project#10372)
* ci: refactor nightly test (sgl-project#10495)
* refactor loading weights from remote instance coding format (sgl-project#10941)
* [router][grpc] Add helpfer functions for decoder in router.rs and fix specs (sgl-project#10971)
* Add simple docker file for B300 (sgl-project#10944)
* Ci monitor support performance (sgl-project#10965)
* [HiCache]: Support dynamic loading backends for hicache (sgl-project#10551)
* [Bugfix][Minor][Benchmark] Fix some bugs due to PR sgl-project#10495 (sgl-project#10982)
* [router][grpc] Support E2E non-stream chat completions (sgl-project#10980)
* fix: fp8 quantization failure of qwen 2.5 VL 7B model (sgl-project#10112)
* [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (sgl-project#10981)
* fix: make inference deterministic for large TP (sgl-project#10930)
* Add auth to get server info (sgl-project#10751)
* PullRequest: 315 bailingMoE: Fix deepep_mode keyerror
* Add support for topk metadata transferring for PD (sgl-project#10616)
* [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend (sgl-project#10565)
* Use jsonschema to constrain required or specific tool choice (sgl-project#10550)
* Fix profiler (sgl-project#10997)
* [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (sgl-project#10995)
* [router] basic mcp support for openai router response api (sgl-project#10978)
* [router] fix chat template loading and tokenizer path (sgl-project#10999)
* Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' (sgl-project#11009)
* [bugfix]Add empty_context import to two_batch_overlap.py (sgl-project#10964)
* prepare for sglang+verl (sgl-project#10555)
* [sgl-kernel] Optimize concat_mla_k kernel (sgl-project#10543)
* [HiCache] bug: fix mooncake store batch set v1 (sgl-project#11013)
* Fix FusedSetKVBufferArg  in RotaryEmbedding (sgl-project#11003)
* Update GLM-4.5 Model Doc (sgl-project#11017)
* [router] migrate to rust python module for pythonic parser (sgl-project#11033)
* fix: show failed models in nightly ci (sgl-project#10986)
* [router][tool call] Support normal content extraction before tool call (streaming) (sgl-project#11038)
* [router] add harmony tool parser base structure and interface (sgl-project#11036)
* Unify SGL Kernel Releases (sgl-project#10701)
* [1/2] Support FA4 for MHA Prefill in sgl-kernel (sgl-project#10940)
* fix: check if weights are already local before downloading (sgl-project#11015)
* [HiCacheStorage] mooncake store support page_first_direct layout (sgl-project#10591)
* [speculative decoding] rename lookahead to ngram (sgl-project#11010)
* Fix gemma 3 launch with `transformers:` the error: `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` (sgl-project#9614)
* Fix sgl-kernel benchmark dead code  (sgl-project#11022)
* [router][tool call] Improve normal content extraction and error handling (non-stream) (sgl-project#11050)
* chore: upgrade cutedsl 4.2.1 (sgl-project#11054)
* [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo (sgl-project#10976)
* chore: upgrade sgl-kernel 0.3.13 (sgl-project#11056)
* [router] add n to generate sampling params (sgl-project#11069)
* Use more general heuristics to set the default value of --mem-fraction-static (sgl-project#10975)
* [router][tool call] Separate `JsonParser` and `LlamaParser` (sgl-project#11073)
* Fix mem fraction static for nightly tests (sgl-project#11076)
* fix: fp8 mllama4 without vision modules being quantized (sgl-project#10611)
* [router] Use `get_pooled` in `process_single_choice` (sgl-project#11079)
* [router][grpc] Add logprobs support to router (sgl-project#11082)
* feat(reasoning): improve enable thinking from request (sgl-project#10875)
* [Profile] dump memory trace when cuda graph profile is enabled (sgl-project#11083)
* Remove hybrid_linear_attn attention backend and refactor attention registry (sgl-project#10816)
* [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (sgl-project#9642)
* Enable optional FP32 compute for LM Head (sgl-project#10729)
* Update CODEOWNERS for attention/ascend_backend.py (sgl-project#11092)
* [router] grpc router generate endpoint support (sgl-project#11070)
* [router][tool call] Full support for ToolChoice (sgl-project#11085)
* Fix spec filter batch when target extend  (sgl-project#10991)
* [Fix] Resolve performance drop in speculative decoding aiter backend (sgl-project#11087)
* [Auto Sync] Update fused_moe_triton_config.py (20250930) (sgl-project#11099)
* chore: bump sgl-kernel v0.3.14 (sgl-project#11067)
* [router][grpc-server] Fix gRPC server shutdown (sgl-project#11094)
* Fix eagle radix cache (sgl-project#10846)
* [Eval] Add `--repeat` in `run_eval`  (sgl-project#11101)
* [CPU] Adding Memory Capacity Acquisition Functionality (sgl-project#11102)
* Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization (sgl-project#11081)
* Support Dots.ocr model (sgl-project#11071)
* [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` (sgl-project#11113)
* Feature/make PEFT adapter module format compatibile (sgl-project#11080)
* fix: KimiK2Detector Improve tool call ID parsing with regex (sgl-project#10972)
* [router] add mcp list and mcp call in output array (sgl-project#11112)
* Organize spec-related data structures (sgl-project#10735)
* [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm (sgl-project#11114)
* [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) (sgl-project#11115)
* [Doc] Update multimodal language models documentation (sgl-project#11111)
* Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg (sgl-project#10985)
* docker: x86 dev builds for hopper and blackwell (sgl-project#11075)
* Refactor AMD CI. (sgl-project#11128)
* feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 (sgl-project#10760)
* [HiCache]bug fix: fixed blank item in host_mem_release_queue (sgl-project#11005)
* [Feature] Add EIC as sglang HiCache Storage backend (sgl-project#10271)
* [HiCache] Configurable and Dynamic Prefetch Timeout (sgl-project#10512)
* [router] add pd service in grpc router for pd (sgl-project#11120)
* [router] Add multi-turn tool calling loop support for MCP integration (sgl-project#11143)
* Fix metrics and request tracing (TimeStats) (sgl-project#11123)
* Remove debug print statement from scheduler output (sgl-project#11145)
* Intoduce cpu tensor as metadata to avoid blocking gpu kernel launch (sgl-project#10720)
* Fix ngram spec with page size > 1 (sgl-project#11135)
* [ROCm] To reduce the compiling time when using torch compile. (sgl-project#10559)
* Fix DeepSeek chunked prefill memory issue (sgl-project#11149)
* Clean up parallel_state.py (sgl-project#11148)
* Tiny improve dumper (sgl-project#11132)
* Tiny fix missing alt stream in nextn layer (sgl-project#10768)
* Fuse quantize and rope in trtllm_mla MTP (sgl-project#10779)
* Tiny detect slow ranks (sgl-project#10508)
* Remove unused pack `.item()` in paged allocator. (sgl-project#11156)
* Support dispatch low latency (sgl-project#10263)
* Support single batch overlap (sgl-project#10422)
* [router][grpc] Support tool call parser in streaming (sgl-project#11160)
* [model] Add mamba2 and Falcon-H1 support. (sgl-project#10988)
* Clean up ascend allocator (sgl-project#11152)
* fix cpp JIT compilation issue of ngram speculative decoding (sgl-project#10837)
* Tiny cleanup deepseek_v2.py (sgl-project#11163)
* Tiny fix ep_gather behavior different in CI (sgl-project#11130)
* Tiny remove duplicated code (sgl-project#11164)
* [proto] Add script to compile python protos (sgl-project#11171)
* Unify forward output datastructure (sgl-project#11124)
* [grpc] style fix for grpc compilation. (sgl-project#11175)
* Remove dp balance metadata and minimul token balance. (sgl-project#11170)
* Minor fixes for server_args, parallel_state, and test_deterministic.py (sgl-project#11159)
* fix: shoudn't include CUDA_ARCH 100 and 120 for cuda12.6.1 (sgl-project#11176)
* [router][grpc] Support streaming for v1/chat/completions (sgl-project#11179)
* Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (sgl-project#11138)
* Introduce naming convention in `io_struct` and base sglang io classes. (sgl-project#10133)
* [Generative Scores API] add performance tests to CICD  (sgl-project#10830)
* [1/n] Enable DCA CUDA graph capture (sgl-project#9537)
* [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (sgl-project#11161)
* [CI]] Tee server logs to both file and stdout/stderr using PIPE (sgl-project#11185)
* fix: radix cache memory accounting (sgl-project#10637)
* Tiny add PD disaggregation + DP attention test (sgl-project#11167)
* [router] Steaming support for MCP Tool Calls in OpenAI Router (sgl-project#11173)
* [Feature] Option to save model weights to CPU when memory saver mode is enabled (sgl-project#10873)
* Add --thinking-mode to run_eval (sgl-project#11189)
* [hot-fix] Fix CI break which caused by adding `thinking_mode` in eval (sgl-project#11192)
* Tiny move files to utils folder (sgl-project#11166)
* Fix CUDA illegal memory access issues in speculative decoding (sgl-project#10892)
* Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. (sgl-project#10780)
* Optimize debug log position of PD abort request (sgl-project#11090)
* fix 3fs indices (sgl-project#10855)
* model: support starcoder2 (sgl-project#10609)
* [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. (sgl-project#10859)
* fix xeon ci check (sgl-project#10838)
* fix qwen2 eagle3 runtime error (sgl-project#10517)
* [minor] fix the lint (sgl-project#11198)
* [Fix] Fix the bug of the calculation of base_gpu_id (dp offset) in data_parallel_controller.py (sgl-project#10741)
* [fix]missing prefix_lens_cpu init when p/d disaggregation (sgl-project#11196)
* fix self.enable_kv_cache_events (sgl-project#11178)
* [HICache]: Refactor HiCache CI (sgl-project#11011)
* fix sampling_seed handling when deterministic is enabled (sgl-project#11096)
* [fix]enable flashmla when using draft model P/D attention select (sgl-project#11012)
* [router] fix get load response parsing (sgl-project#11213)
* [router] add grpc router pd mode for chat and generate (sgl-project#11140)
* EAGLE cache fix for HiCache (sgl-project#11215)
* Add --max-new-tokens CLI flag for MMMU evaluation (sgl-project#11217)
* Add DeepSeek-V3.2 Tool Call Template (sgl-project#11063)
* Tiny `skip_sample` adjust (sgl-project#11225)
* [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (sgl-project#11194)
* Update `v1/responses` to be more OpenAI-compatible. (sgl-project#9624)
* chore: bump sgl-kernel v0.3.14.post1 (sgl-project#11137)
* Update DeepGEMM repository tag to specific commit (sgl-project#11229)
* [Feat] Support Torch Symm Mem AllReduce (sgl-project#10571)
* Refactor and optimize mooncake CI (sgl-project#11162)
* [Fix AMD CI] VRAM cleanup  (sgl-project#11174)
* Update transformers package version to 4.57.0 (sgl-project#11222)
* Remove gdrcopy check in ci_install_deepep.sh (sgl-project#11237)
* Rename runner labels (sgl-project#11228)
* [Auto Sync] Update io_struct.py (20251004) (sgl-project#11206)
* Create two new GH workflows to automatically bump SGLang and Kernel version (sgl-project#10996)
* Fix spec_utils.py (sgl-project#11247)
* ci: make find_local_hf_snapshot_dir more robust (sgl-project#11248)
* [quantization] Fix scale remapping for mllama4 (sgl-project#10042)
* [quantization] Enable aiter mxfp4 fused_moe for Quark (sgl-project#10048)
* Use cu128 for torch audio to fix some CI tests (sgl-project#11251)
* Bump torch_memory_saver 0.0.9rc2 (sgl-project#11252)
* update sgl kernel version to 0.3.14.post1 (sgl-project#11242)
* Update condition for sgl-kernel-benchmark-test (sgl-project#11254)
* feat: add shortcut detection for multimodal templates in Jinja format (sgl-project#11209)
* Improve bot release workflow (sgl-project#11240)
* Add flashmla and fast hadamard transform to Dockerfile (sgl-project#11235)
* Support DeepSeek V3.2 Exp (sgl-project#11061)
* chore: bump SGLang version to 0.5.3rc2 (sgl-project#11259)
* chore: bump SGLang version to 0.5.3 (sgl-project#11263)
* [theta] fix bailing v3
* [router] add ipv6 support across all components (sgl-project#11219)
* Remove env var warnings for release (sgl-project#11262)
* Enable native ModelOpt quantization support (1/3)  (sgl-project#7149)
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (sgl-project#11270)
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (sgl-project#11274)
* docker: add manifest to versioned docker releases (sgl-project#11268)
* [Bug] Fix incorrect assertion in FA4 and add UT. (sgl-project#11182)
* [router][grpc] Refine streaming processes (sgl-project#11277)
* Fix code sync scripts (sgl-project#11276)
* [Auto Sync] Update test_utils.py (20251006) (sgl-project#11280)
* Rename max_micro_batch_size -> pp_max_micro_batch_size (sgl-project#11279)
* reverse the amd ci test back to 1200s and split the 8-gpu deepseek job into two. (sgl-project#11238)
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (sgl-project#11261)
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (sgl-project#11282)
* docs: update sgl-kernel README (sgl-project#11286)
* chore: bump sgl-kernel version to 0.3.15 (sgl-project#11281)
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (sgl-project#11283)
* convert test_deterministic into unit tests (sgl-project#11095)
* Feature/longbench v2 evaluation utils (sgl-project#10949)
* [ci] fix pp test (sgl-project#11294)
* EAGLE cache fix for SWARadixCache (sgl-project#11231)
* Remove overlap thread (sgl-project#11210)
* [router] add reasoning and tool parser argument in router (sgl-project#11290)
* Remove sampling info events and overlap thread file (sgl-project#11300)
* Introduce future indices (sgl-project#11301)
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (sgl-project#11068)
* [Docs] [Router] Update Observability and Common Issues Section (sgl-project#11302)
* [router] add get server info and get model info in grpc server (sgl-project#11303)
* [router][grpc] Refactor chat template content format detection (sgl-project#11288)
* [Doc] HiCache Design Documents (sgl-project#11027)
* [Doc]: Best Practice for HICache (sgl-project#11001)
* [router] fix grpc connection conversion and add optimization (sgl-project#11305)
* [router][grpc] Fix sampling_params.stop_strs is None (sgl-project#11306)
* Update tool parser and related documentation (sgl-project#11223)
* [router][grpc] Fix error message format in grpc chat handler (sgl-project#11307)
* [quantization] Properly ignore quantization for layers excluded in quant_config (sgl-project#11205)
* [router] support Openai router conversation API CRUD (sgl-project#11297)
* [router][grpc] Fix request_id extraction when n > 1 (sgl-project#11311)
* [router] cleanup worker health check to return early (sgl-project#11310)
* [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (sgl-project#11304)
* Clean match_prefix and prepare_for_extend for mem cache V2 (sgl-project#11200)
* ci: unify the model launch method of nightly ci (sgl-project#11230)
* [Chore] Update xgrammar 0.1.24 -> 0.1.25 (sgl-project#10710)
* update sampling_params documentation with defaults (sgl-project#11315)
* Optimize copy_kv_cache for spec decoding (sgl-project#11126)
* Rename `ngram_utils` -> `ngram_info` (sgl-project#11316)
* [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (sgl-project#11314)
* [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (sgl-project#9545)
* [8/N] MoE Refactor: deprecate `EPMoE` (sgl-project#11211)
* Skip weight loading in deepgemm compilation (sgl-project#11312)
* [2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)
* [Doc] Update mooncake nvlink transport doc for PD disaggregation (sgl-project#11321)
* fix(decode): adjust ServerArgs import to explicit module path (sgl-project#11007)
* Support LoRA in bench_serving oai interface (sgl-project#11318)
* benchmark: enhance configurable multimodal benchmarking in bench_serving (sgl-project#9812)
* [CI] improve disaggregation CI. (sgl-project#11264)
* [theta] fix tokenization
* model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (sgl-project#10909)
* [router] refactor generate to use new pipeline arch (sgl-project#11323)
* [router] improve reasoning parser lock and reduce req cloning (sgl-project#11336)
* [router][grpc] Cleanup debug logs in grpc_server and grpc_router (sgl-project#11340)
* [router] Fix all unused_qualifications (sgl-project#11341)
* [router] Support history management using conversation (sgl-project#11339)
* [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (sgl-project#11342)
* fix: fix revision for sgl-flash-attn in sgl-kernel (sgl-project#11327)
* [Auto Sync] Update scheduler.py (20251009) (sgl-project#11350)
* [Generative Score API] Multi-Item scoring with custom attention mask. (sgl-project#10979)
* [router][grpc] disable health check generation and increase timeout (sgl-project#11353)
* [router] Refactor OpenAI router: split monolithic file and move location (sgl-project#11359)
* [router][lint] Add unused_qualifications to cargo lint warnings (sgl-project#11366)
* [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (sgl-project#11309)
* PullRequest: 323 [theta] Standardize error codes: 1) unify preprocessing errors for chat and completions requests to 400; 2) return standard HTTP error codes for multimodal load-data requests
* [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (sgl-project#11373)
* add code pp support for nixl (sgl-project#11375)
* fix bench_serving mishandling of internal states (sgl-project#11376)
* PullRequest: 322 Support MTP and subclass BailingMoEV3AttentionMLA from DeepseekV2AttentionMLA
* [router][grpc] Replace fake health check with correct ones (sgl-project#11387)
* [router] change grpc client from mutable to clone (sgl-project#11394)
* chore: upgrade flashinfer 0.4.0 (sgl-project#11364)
* [router] conversation item API: create, retrieve and delete (sgl-project#11369)
* chore: bump SGLang version to 0.5.3.post1 (sgl-project#11324)
* move more files under srt/utils (sgl-project#11285)
* [grammar] Avoid server crash when grammar backend is None (sgl-project#11401)
* fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (sgl-project#11389)
* [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (sgl-project#11365)
* [CI] Refactor PD disaggregation test suite (sgl-project#11363)
* Replace pad with cat for better performance (sgl-project#11388)
* fix: reinstall torch in deps install (sgl-project#11414)
* feat(hicache): Support passing prefix keys for l3 store. (sgl-project#9045)
* fix file and object naming scheme in HiCacheNixl to avoid data corruption (sgl-project#10969)
* Dedicated toml files for CPU/XPU (sgl-project#10734)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11144)
* chore: update pyproject (sgl-project#11420)
* PullRequest: 330 [theta] qwen-vl: support passing video frames as base64 images, e.g. data:video/jpeg;base64,frame1_base64,frame2_base64,...,frameN_base64
* fix: fix video input for qwen3-vl (sgl-project#11361)
* perf: optimize qwen-vl with symm mem allreduce (sgl-project#11381)
* [HiCache] feat: add multi tenant with prefix tag (sgl-project#9256)
* [CI] Merge build-dev into workflow matrix (sgl-project#11345)
* Revert "perf: optimize qwen-vl with symm mem allreduce" (sgl-project#11436)
* Revert "fix: fix video input for qwen3-vl" (sgl-project#11437)
* Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (sgl-project#11433)
* [router] Fix ci nvcc not found error (sgl-project#11411)
* feat(mooncake): support GB suffix for global_segment_size  (sgl-project#10745)
* Separate allocation logic from scheduler (sgl-project#11313)
* [router] disable rate limiter by default (sgl-project#11435)
* [router] leverage RAII to actively cancel request during client disconnect (sgl-project#11399)
* [router][grpc] Consolidate parser checks for chat completions (sgl-project#11439)
* Reorder PD disagg CI tests (#11438)
* fix: Change dsv32 hack temporary path to use system temp directory (#11445)
* Fix batch invariant ops (#11368)
* [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
* [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
* Remove tilelang dependency in Dockerfile (#11455)
* Enable native ModelOpt quantization support (2/3) (#9991)
* Reland [1/2] Optimizations and refactors about quant kernel (#10312)
* Super tiny delete unused openai router in sgl-router (#11448)
* Adjust logits metada init for target verify (#11467)
* [Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
* Fix enable_v2 in int8 quant (#11470)
* [Fix] Fix split prefill with fa3. (#11428)
* fix stop when stream  (#11462)
* Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. (#8919)
* PullRequest: 334 [theta] Fix various qwen3-vl bugs
* [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019)
* fix Xeon CI (#11454)
* [CI] Add nightly builds to dockerhub (#9804)
* [Feature] support regex strings as a stopping condition (#10635)
* Beta spec-overlap for EAGLE (#11398)
* Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
* [Router]: Small Typo in a comment within tree.rs (#11489)
* chore: bump sgl-kernel version to 0.3.16 (#11476)
* [smol] [perf] Qwen3-VL in place op. (#11481)
* [chore][1/N] Avoid using default mutable parameters (#11478)
* [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
* [ perf ] Replace json-> orjson in hot path (#11221)
* [chore][2/N] Avoid using default mutable parameters (#11479)
* Fix the GPT function calling regex to allow dash in the name (#10577)
* bailingMoE: Fix Key error of deepep_mode (#11465)
* Fix CI break by express-laned PRs. (#11499)
* Move args from `global_config` to `environ` (#11332)
* move fla env check position (#11500)
* Temporarily remove b200 tests (#11501)
* Fix port conflicts in CI (#11497)
* temporarily remove b200 tests (#11502)
* Fix unit tests (#11503)
* Bugfix: Fix Type consistency for KV indices in SWARadixCache (#11452)
* doc: add doc for adding new models into nightly-ci (#11443)
* [CI] fix lint (#11509)
* Deprecate `global_server_args_dict` (#11331)
* chore: remove flashinfer cleanup cache (#11514)
* fix: revert temporarily remove b200 tests (#11515)
* [Fix] Improve longbench prompt and other logics (#11474)
* Sync changes on io_struct.py and deterministic ops (#11498)
* [lint] Fix the lint issue (#11516)
* Revert "Deprecate `global_server_args_dict`" (#11520)
* Improve dp attention port assignment scheme (#5889)
* [theta] rebase public/main 1013-2
* [router] openai router: support grok model (#11511)
* docs(router): add token-bucket rate limiting to the docs (#11485)
* [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
* Update DeepSeek-R1-FP4 default config on blackwell (#11512)
* [Fix]: add missing device attribute to ChunkCache (#11493)
* [Feature] Support mamba radix cache v0 (#11214)
* ci: improve nightly-ci (#11385)
* [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering (#11505)
* [HICache]: Support 3FS-Store with page_first_direct layout (#11460)
* Tiny fix test run estimated time (#11544)
* [Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
* [theta] rebase public/main 1013-5
* Deprecate `global_server_args_dict` (#11528)
* [theta] rebase public/main 1013-6
* [Fix] Add per_channel_quant parameter to MoE config functions (#11201)
* [router][ci] Add Nightly Release Workflow for SGLang Router (#11527)
* [router] allow tokenizer path to be dir (#11530)
* Remove `tp_worker.worker` (#11548)
* fix: fix video input for qwen3-vl (#11442)
* [NVIDIA] BUMP FA3 (#11444)
* [router][Fix] Include grpc reflection runtime dependency (#11419)
* Adjust overlap event loop (#11507)
* Move deep gemm related arguments to `sglang.srt.environ` (#11547)
* [router][grpc] Further delegate non-stream processing to `processing.rs`  (#11553)
* [router] allow user to specify chat template path (#11549)
* Minor: improve sampler & remove unused fields from model_config.py (#11531)
* [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
* Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) (#11557)
* [CI] Add Basic Test for DeepSeek V3.2 (#11308)
* [router][grpc] Add error handling to `generate_tool_constraints` (#11562)
* [NVIDIA] update pyproject.toml to support cu130 option (#11521)
* [CI Monitor] Ci monitor only deal with main branch in default (#11538)
* Tiny cleanup fp4 gemm calls (#11537)
* [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564)
* [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds (#11571)
* [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534)
* chore: bump sgl-kernel version to 0.3.16.post1 (#11573)
* Fix accept rate in speculative decoding metrics (#11572)
* Compilation Folder Reset (#11539)
* [FEATURE] Add Profile Trace Merger for Distributed Traces (#11413)
* [DSv32] Use torch.compile for _get_logits_head_gate (#11565)
* Make DeepEP combine recv do not overlap (#11535)
* bench_serving support PD Disaggregation (#11542)
* Implement LRU eviction policy for LoRA adapters (#11041)
* PullRequest: 337 Support multimodal requests via the completions protocol
* Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582)
* chore: bump sgl-kernel version to 0.3.16.post2 (#11583)
* [Auto Sync] Update model_config.py (20251014) (#11580)
* Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
* [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588)
* [router] update generate spec to align with sgl io struct (#11591)
* [router] change worker api to async instead of sync (#11566)
* Update news section in README.md (#11598)
* [router] delete useless table content comment in spec (#11597)
* [router] allow router launch server to use grpc mode (#11600)
* [Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
* [router] when given both local tokenizer and chat template, log all (#11601)
* [AMD CI] Add image and weights caching. (#11593)
* Update release-docker-dev.yml (#11603)
* Optimize Triton Draft Backend (#11556)
* Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function (#11586)
* make radix cache deterministic (#10721)
* move eagle draft post process to cuda graph (#11434)
* Reduce one step decode for draft model. (#11561)
* [router] add py binding and readme for openai router and history backend (#11453)
* [theta] print load mm cost
* [theta] Support tp8 for 4-head Bailing
* [router] cleanup app context and move to startup (#11617)
* [router] add chang and keyang to sgl router author (#11620)
* use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. (#11605)
* [router] update router readme to latest features (#11619)
* Fix log for chunked prefix cache (#11624)
* [Auto Sync] Update scheduler.py, server_args.py (20251014) (#11623)
* [Auto Sync] Update collector.py (20251014) (#11625)
* [Minor] Update xgrammar dependency (#11622)
* Update install.md (#11631)
* fix: Update SGL_KERNEL_VERSION to 0.3.15 (#11633)
* [router][grpc] add warm up to grpc server (#11627)
* Refactor kv cache free (#11351)
* [router] update router doc to latest features (#11639)
* fix: upgrade transformers to 4.57.1 (#11628)
* [router] add worker self discovery for metadata (#11638)
* [router] upgrade to 0.2.0 (#11642)
* [theta] Print qwen vl timing
* [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
* [theta] Print qwen vl timing
* [1/N]Support  DeepSeek-R1 w4a8 normal deepep (#8247)
* [Fix] Fix accuracy bug in CSGMV kernel caching key. (#11579)
* feat: add add_chunked_prefix_cache_attention_backend (#11636)
* Super tiny improve FA3 import error message (#11590)
* [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl  (#11458)
* [Doc] Update support matrix for attn and hybrid attn (#11293)
* Clean up some Qwen3-Next and deterministic code (#11585)
* docs: update sglang installation guide (#11659)
* [theta] Update aci image and dependencies
* Tiny cleanup some eagle unused codes (#11660)
* Fix 1-step draft model forward (#11653)
* [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
* [router] Fix response api related spec (#11621)
* Fix missing json imports in serving_responses.py (#11681)
* [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674)
* [sgl-kernel] Optimize gguf test (#11667)
* [router][grpc] Simplify model_id determination (#11684)
* [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676)
* chore: bump SGLang version to 0.5.3.post2 (#11680)
* [CI][XPU]enable sglang CI on Intel XPU (#9493)
* enable rmsnorm on XPU (#10248)
* Sync code and test CI; rename some env vars (#11686)
* docs: Add Contributor Covenant Code of Conduct (#11689)
* [theta] Add DeepGemm compile cache to the dockerfile (needs periodic refresh 😂)
* [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679)
* [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
* [sgl-kernel] support hadamard (#11663)
* Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
* Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)