
Refactors radix cache for extra key support #10317

Merged
hnyls2002 merged 18 commits into sgl-project:main from JustinTong0323:feat-refactor-radix-cache-combine-lora-cache on Sep 21, 2025

Conversation

Collaborator

@JustinTong0323 JustinTong0323 commented Sep 11, 2025

Motivation

In certain scenarios it is necessary to isolate the KV cache for specific requests: for instance, requests using different LoRA adapters, or requests whose KV cache should not be shared with others (ref #9163).

This PR refactors the key used by the radix cache by adding an extra_key field. Only requests/tokens in the cache that have the same extra_key can share the prefix KV cache.

The logic for cache salting will be implemented in a follow-up PR.

Modifications

This commit refactors the radix cache to support an extra key (e.g., lora_id, cache_salt) for classifying requests.

The changes include:

  • Adds extra_key to the Req class
  • Replaces the raw token_ids key with a BaseKey class to facilitate request classification
  • Modifies the match_prefix method in RadixCache and SchedulePolicy to incorporate the BaseKey when matching prefixes
  • Removes the LoRARadixCache, as the base RadixCache can now handle LoRA IDs through the extra_key mechanism
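The core idea can be sketched in a few lines (a minimal illustration only, assuming the names from this PR; the actual class in radix_cache.py carries additional logic such as slicing and child-key helpers):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BaseKey:
    """Radix cache key: the token ids plus an optional isolation key."""
    token_ids: List[int]
    extra_key: Optional[str] = None  # e.g. a LoRA id or a cache salt

    def __len__(self) -> int:
        return len(self.token_ids)


def can_share_prefix(a: BaseKey, b: BaseKey) -> bool:
    """Prefix KV cache may only be shared between keys with identical extra_key."""
    return a.extra_key == b.extra_key
```

With this gate in place, a request with extra_key="lora-A" can never reuse cached tokens inserted under extra_key="lora-B", even if the token ids match.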

TODO List:

  • Add support for HiRadixCache
  • Support the SWA radix cache
  • Compatibility with cpp/lmc
  • Support the C++ radix cache for extra keys (needs C++ changes; in a follow-up PR)
  • Other cases not yet considered

Accuracy Tests

Benchmarking and Profiling

Checklist

This commit refactors the radix cache to support an extra key
(e.g., lora_id, cache_salt) for classifying requests.

The changes include:

- Adds `extra_key` to the `Req` class and `BaseKey` class to
  facilitate request classification.
- Modifies the `match_prefix` method in `RadixCache` and
  `SchedulePolicy` to incorporate the `extra_key` when matching
  prefixes.
- Removes the LoRARadixCache, as the base RadixCache can now
  handle LoRA IDs through the `extra_key` mechanism.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @JustinTong0323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural improvement to the KV cache management system. By generalizing the key structure within the Radix cache to include an 'extra key' field, the system can now intelligently isolate cache entries based on specific request characteristics, such as the LoRA adapter used or a custom cache salt. This enhancement provides greater flexibility and control over cache sharing behavior, leading to more efficient and tailored memory utilization for diverse request types.

Highlights

  • Generalized Radix Cache Key: The Radix cache has been refactored to support an extra_key field within its BaseKey structure. This allows for more granular control over KV cache sharing, enabling isolation of cache entries for requests with different extra_key values (e.g., LoRA adapters, cache salts).
  • Integration of extra_key: The Req class now includes an extra_key attribute, and the match_prefix methods in RadixCache and SchedulePolicy have been updated to utilize this new key for prefix matching, ensuring that only requests with matching extra_key values can share cached prefixes.
  • Removal of LoRA-specific Cache: The specialized LoRARadixCache has been removed. Its functionality is now subsumed by the generalized RadixCache through the use of the extra_key mechanism, simplifying the caching architecture.

gemini-code-assist[bot]

This comment was marked as outdated.

JustinTong0323 and others added 5 commits September 11, 2025 11:01
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
gemini-code-assist[bot]

This comment was marked as outdated.

@JustinTong0323
Collaborator Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly refactors the KV cache system by introducing a generalized extra_key concept within the RadixCache. This change allows for flexible isolation of KV caches for different types of requests, such as those using distinct LoRA adapters or requiring specific cache salting, without needing separate cache implementations. The refactoring streamlines the codebase by deprecating the specialized LoRARadixCache and unifying key management under a new BaseKey class, improving maintainability and extensibility for future caching needs.

Highlights

  • Centralized Radix Cache: The LoRARadixCache has been removed, and its functionality is now integrated into the main RadixCache via a new extra_key mechanism.
  • Flexible Key Management: A new BaseKey class is introduced to encapsulate both token_ids and an optional extra_key (e.g., lora_id, cache_salt), allowing for more flexible request classification and KV cache isolation.
  • Enhanced Prefix Matching: The match_prefix methods across various cache implementations (RadixCache, HiRadixCache, SWARadixCache, LMCRadixCache) and scheduling policies (SchedulePolicy) are updated to leverage the BaseKey for more precise prefix matching.
  • Streamlined Request Handling: The Req class now includes the extra_key field, simplifying how requests are classified and managed within the cache system.
Changelog
  • python/sglang/srt/managers/schedule_batch.py
    • Updated Req class to include extra_key and adapted prefix matching logic to use BaseKey, removing LoRARadixCache specific imports and code paths.
  • python/sglang/srt/managers/schedule_policy.py
    • Modified prefix matching and insertion logic to utilize the new BaseKey for improved request classification within the scheduling policy.
  • python/sglang/srt/managers/scheduler.py
    • Removed the now redundant LoRARadixCache initialization logic, reflecting its integration into the main RadixCache.
  • python/sglang/srt/mem_cache/base_prefix_cache.py
    • Generalized the match_prefix method's key type hint to Any to accommodate the new BaseKey object.
  • python/sglang/srt/mem_cache/hiradix_cache.py
    • Adapted various methods (_insert_helper_host, match_prefix, _split_node, insert, _insert_helper) to correctly handle the new BaseKey object and its token_ids and extra_key attributes.
  • python/sglang/srt/mem_cache/lora_radix_cache.py
    • This file was completely removed as its functionality has been absorbed by the refactored RadixCache.
  • python/sglang/srt/mem_cache/radix_cache.py
    • Introduced the BaseKey class, updated TreeNode to use it, and refactored key matching, insertion, and utility functions to operate with BaseKey, including new _check_extra_key and get_child_key functions.
  • python/sglang/srt/mem_cache/radix_cache_cpp.py
    • Modified match_prefix and _insert to accept BaseKey objects, extracting token_ids for the C++ backend, and updated request caching methods accordingly.
  • python/sglang/srt/mem_cache/storage/lmcache/lmc_radix_cache.py
    • Updated match_prefix and request caching to use BaseKey and access its token_ids and extra_key attributes for LM cache operations.
  • python/sglang/srt/mem_cache/swa_radix_cache.py
    • Integrated BaseKey into TreeNode and updated key matching, insertion, and splitting logic to use BaseKey, importing common key utility functions from radix_cache.py.
  • test/srt/test_swa_unittest.py
    • Updated test cases to reflect the changes in SWAKVPool and SWATokenToKVPoolAllocator initialization parameters, and modified tree.insert and tree.match_prefix calls to use the new BaseKey object.
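As a rough illustration of how the extra key gates prefix matching across these call sites (a toy linear scan standing in for the real radix-tree walk; the function shape here is illustrative, not the actual API):

```python
from typing import List, Optional, Tuple

# A stored entry is (token_ids, extra_key); the real cache uses a radix tree.
Entry = Tuple[List[int], Optional[str]]


def match_prefix(store: List[Entry], token_ids: List[int],
                 extra_key: Optional[str]) -> int:
    """Return the longest cached prefix length usable by this request.

    Entries whose extra_key differs are skipped entirely, so requests with
    different LoRA adapters (or cache salts) never share KV cache.
    """
    best = 0
    for stored_ids, stored_extra in store:
        if stored_extra != extra_key:
            continue  # different extra_key => no sharing at all
        matched = 0
        for a, b in zip(stored_ids, token_ids):
            if a != b:
                break
            matched += 1
        best = max(best, matched)
    return best
```

For example, a request carrying extra_key=None matches only entries inserted without an extra key, even when a longer token-id prefix exists under some LoRA-specific entry.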
Activity
  • JustinTong0323 initiated a summary request.
  • gemini-code-assist[bot] provided a high priority review comment regarding hashing empty parent_parent_tokens in radix_cache.py.
  • gemini-code-assist[bot] provided a medium priority review comment suggesting to use os.path.join for extra_key concatenation in schedule_batch.py.
  • gemini-code-assist[bot] provided a medium priority review comment recommending to include actual extra_key values in ValueError message in radix_cache.py.
  • gemini-code-assist[bot] provided a high priority review comment identifying a collision risk with simple string concatenation for extra_key and lora_id in schedule_batch.py, suggesting a tuple-like string representation.
  • gemini-code-assist[bot] provided a medium priority review comment noting a malformed docstring for match_prefix in radix_cache.py.
  • gemini-code-assist[bot] provided a medium priority review comment pointing out incorrect usage of raw strings instead of BaseKey objects in radix_cache.py test block, and type mismatch.
  • gemini-code-assist[bot] provided a medium priority review comment suggesting simplifying BaseKey object creation in radix_cache_cpp.py by passing token_ids directly.
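The collision risk flagged in the review comes from naive string concatenation: different (cache_salt, lora_id) pairs can map to the same combined key. A small sketch of the problem and the suggested tuple-like fix (function names here are hypothetical):

```python
from typing import Optional


def combine_naive(cache_salt: Optional[str], lora_id: Optional[str]) -> str:
    # Naive concatenation: ambiguous where one field ends and the next begins.
    return f"{cache_salt or ''}{lora_id or ''}"


def combine_safe(cache_salt: Optional[str], lora_id: Optional[str]) -> str:
    # Tuple-like string representation keeps the field boundaries unambiguous.
    return repr((cache_salt, lora_id))


# ("ab", "c") and ("a", "bc") collide under naive concatenation...
assert combine_naive("ab", "c") == combine_naive("a", "bc")
# ...but stay distinct under the tuple-like representation.
assert combine_safe("ab", "c") != combine_safe("a", "bc")
```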

Comment thread test/srt/test_swa_unittest.py Outdated
from sglang.srt.managers.schedule_batch import Req


class BaseKey:
Collaborator

qq: Why call it base key? How about putting it to base_prefix_cache?

Collaborator Author

I think there could be additional key schemas inheriting from this base class in the future. These schemas could contain more information; for hicache, for example, we could attach data such as location (I'm not familiar with hicache, this is just a random thought).

Collaborator Author

@JustinTong0323 JustinTong0323 Sep 12, 2025

Also, I noticed BaseKey is not used in the chunked cache, so I suppose it should be placed in the radix cache.

Collaborator

@xiezhq-hermann what do you think?

Collaborator

I think calling it radix_key might be more straightforward; it is specific to radix trees and can contain different kinds of information, including the token ids and the extra information.

Collaborator

@JustinTong0323 Do not call this class BaseKey. Name it RadixKey or others.

Fixes an issue where indexing a BaseKey with a single token was returning an integer instead of a BaseKey object.

This change ensures that even when indexing with a single token, a BaseKey object is returned, maintaining consistency and preventing potential errors.

Adds a unit test for radix cache to improve code coverage.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
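The indexing fix described in the commit above can be sketched as follows (a simplified stand-in class; the real implementation lives in radix_cache.py):

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Key:  # stand-in for the PR's BaseKey
    token_ids: List[int]
    extra_key: Optional[str] = None

    def __getitem__(self, idx: Union[int, slice]) -> "Key":
        # Before the fix, key[i] with an int index returned a bare integer;
        # normalizing to a one-element Key keeps single-token indexing and
        # slicing uniform, so callers never have to special-case the result.
        if isinstance(idx, int):
            return Key([self.token_ids[idx]], self.extra_key)
        return Key(self.token_ids[idx], self.extra_key)
```

Note that the extra_key is carried through on every index or slice, so a sub-key remains subject to the same isolation rule as its parent.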
@JustinTong0323 JustinTong0323 force-pushed the feat-refactor-radix-cache-combine-lora-cache branch from c9f8f1f to 0db5a8b on September 17, 2025 09:17
JustinTong0323 and others added 2 commits September 17, 2025 02:18
Updates radix cache implementations to directly use the key object for slicing,
rather than accessing the underlying token_ids and extra_key attributes.
This simplifies the code and improves readability.

Also, initializes the value tensor with the correct data type.
Adds an advanced prefix match test.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Comment thread python/sglang/srt/mem_cache/radix_cache.py Outdated
@Fridge003
Collaborator

@JustinTong0323 Please fix the conflict

@Fridge003
Collaborator

Fridge003 commented Sep 18, 2025

Seems really clean and tidy to me.
I feel the key to this PR is making sure it doesn't break any existing features (hi-cache, swa-radix, ...), so please make sure this PR isn't merged until all the related tests pass.


  if value is None:
-     value = [x for x in key]
+     value = torch.tensor(key.token_ids, dtype=torch.int64)
Collaborator Author

@JustinTong0323 JustinTong0323 Sep 18, 2025


Note: this line fixes a minor error; if value is a list, an error would be raised when executing torch.cat(value).
But it seems unused... not sure if we could just delete this if value is None: branch.

Collaborator

This value will not be None except in the __main__ function in this file. Only used for testing, I think.

Collaborator

@Fridge003 Fridge003 left a comment

LGTM

Collaborator

@xiezhq-hermann xiezhq-hermann left a comment

Looks good to me; would suggest renaming BaseKey to RadixKey though.


@hnyls2002 hnyls2002 enabled auto-merge (squash) September 19, 2025 12:52
@xiezhq-hermann xiezhq-hermann added the ready-to-merge label (The PR is ready to merge after the CI is green.) and removed the ready-for-review label Sep 20, 2025
@hnyls2002 hnyls2002 merged commit 12d6cf1 into sgl-project:main Sep 21, 2025
368 of 408 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@JustinTong0323 JustinTong0323 deleted the feat-refactor-radix-cache-combine-lora-cache branch October 20, 2025 18:55
BraveY pushed a commit to openanolis/sglang that referenced this pull request Oct 22, 2025
Merge branch sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main
https://code.alipay.com/Theta/SGLang/pull_requests/342?tab=diff

Reviewed-by: 苏墨 <xuyongfei.xyf@antgroup.com>


* [router] minor code clean up in server startup (sgl-project#10470)
* [bugfix] fix typo (sgl-project#10471)
* [PD metrics] Add latency Histogram metrics of each stage for generate requests (sgl-project#8710)
* [CI] Fix runner for sgl-kernel (sgl-project#9887)
* fix(internvl): fix accuracy issue of normalization (sgl-project#10375)
* fix: gpt-oss streaming dropping normal content when tools are provided but not used (sgl-project#9657)
* model: support solar (sgl-project#8189)
* fix: resolve sgl-kernel ut (sgl-project#10476)
* [1/2] Speed up trtllm_mla attention backend (>10% e2e) (sgl-project#10473)
* Fix `--dataset-path` in `bench_one_batch_server` (sgl-project#10475)
* [Env] minimal version for organizing envs (sgl-project#10479)
* chore: bump v0.3.10 sgl-kernel (sgl-project#10478)
* [router] multi model registration fix (sgl-project#10481)
* [2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (sgl-project#10286)
* [Auto Sync] Update registry.py (20250915) (sgl-project#10484)
* [router] fix worker registration in multi model mode (sgl-project#10486)
* fix crash of DeepSeek-V3 update_weights_from_disk (sgl-project#8863)
* Temporay work-around for rocm 7.0.0 alpha with enabling data-parallel issue (sgl-project#10434)
* [Hicache] Evaluate Per-Round Metrics in Multiturn Bench (sgl-project#10203)
* [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints (sgl-project#10336)
* Add Logprobs unit test with a loose threshold (sgl-project#10230)
* [router] add router db connector for responses api (sgl-project#10487)
* Remove wrong imports `from sglang.python` (sgl-project#10493)
* [router] fix router manager and router init in server (sgl-project#10499)
* Cache the result of `is_blackwell` platform check (sgl-project#10498)
* feat: update support for qwen3next model (sgl-project#10466)
* Minor fix lint introduced by sgl-project#10466 (sgl-project#10507)
* chore: upgrade sgl-kernel 0.3.10 (sgl-project#10500)
* Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (sgl-project#10491)
* Fix CI when sgl-kernel is changed but srt is not changed (sgl-project#10515)
* Support sgl-router parallel_batch in bench_one_batch_server (sgl-project#10506)
* [CPU] fix CPU backend sel. issue for Llama4 (sgl-project#10511)
* adjust import setuptools_rust (sgl-project#10524)
* Fix formatting in long code blocks (sgl-project#10528)
* skip vision_model for lora (sgl-project#10530)
* [2/2] Speed up trtllm_mla attention backend (sgl-project#10474)
* support using fa4 on deepseek on blackwell (sgl-project#9928)
* [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) (sgl-project#10494)
* [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) (sgl-project#10538)
* feat: add priority based scheduling with priority based request acceptance and preemption (sgl-project#8746)
* Fix decord dependency for aarch64 docker build (sgl-project#10529)
* enable prefix cache with dp (sgl-project#10459)
* [bugfix]hicache bench_long_context.py run failed (sgl-project#10523)
* Remove duplicated code (sgl-project#10545)
* CUDA Arch Independent (sgl-project#8813)
* [bench] Fix random seed in `bench_one_batch_server` (sgl-project#10548)
* [HiCache] Add tests for hicache storage mooncake backend (sgl-project#10171)
* [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle (sgl-project#9976)
* fix: update dsv3 fp4 ut (sgl-project#10584)
* vlm: remove redundant d2h movement of mm feature tensors (sgl-project#9987)
* Enable trtllm mla prefix extend (sgl-project#10526)
* [ROCm] Fix fp8 quantization accuracy issue. (sgl-project#10558)
* [HICache] introduce evict policy (sgl-project#10190)
* PullRequest: 303 Revert "PullRequest: 291 for fa3 kvcache: revert github "convert mla kvcache to bfloat16""
* aiter v0.1.5.post2 (sgl-project#10563)
* [PD] Improve disaggregation common backend and refactor mooncake backend (sgl-project#10273)
* chore: upgrade mooncake 0.3.6 (sgl-project#10596)
* [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
* Scale kkt after reduction (sgl-project#10604)
* fix deepep assert when PD disaggregation == null (sgl-project#8274)
* [RL] Add destroy process group api (sgl-project#9979)
* Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
* update deepep version for qwen3-next deepep moe (sgl-project#10624)
* support qwen3-next-fp8 deepep (sgl-project#10622)
* Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
* [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
* Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
* feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
* Garbage collector regression in the online server (sgl-project#10621)
* [router] refactor worker to builder pattern 1/n (sgl-project#10628)
* refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
* [Feature] Speculative decoding support lookahead (sgl-project#9873)
* [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
* [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
* model support: Sarashina2VisionForCausalLM (sgl-project#10632)
* feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
* chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
* Hicache L3 backend mooncake optimization configuration reading method (sgl-project#10319)
* [router] refactor worker to builder pattern 2/n (sgl-project#10633)
* [Feature]feat(get_ip): unify get_ip_xxx (sgl-project#10081)
* [router] refactor worker to builder pattern 3/n (sgl-project#10647)
* [sgl-kernel] Support moe_sum_reduce cuda kernel (sgl-project#10321)
* [router] refactor worker to builder pattern 4/n (sgl-project#10650)
* Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 (sgl-project#10634)
* [router] refactor worker to builder pattern 5/n (sgl-project#10653)
* [HiCacheStorage]support page_first_direct layout for generic set&get (sgl-project#10522)
* [router] preserve order of json params using preserve_order feature (sgl-project#10661)
* [router] refactor router and worker management 1/n (sgl-project#10664)
* fix: resolve sync issue (sgl-project#10668)
* [Auto Sync] Update .clang-format (20250919) (sgl-project#10670)
* [router] refactor router and worker management 2/n (sgl-project#10666)
* router-spec: Reorder `ChatCompletionRequest` and fix validation logic (sgl-project#10675)
* chore: cleanup docker image (sgl-project#10671)
* limit sgl-kernel causal conv1d to cuda only (sgl-project#10648)
* [Auto Sync] Update model_runner.py (20250920) (sgl-project#10679)
* [router] refactor router and worker management 2.5/n (sgl-project#10677)
* [1/2] Support deterministic inference with flashinfer attention backend (sgl-project#10645)
* [Auto Sync] Update deepseek_v2.py (20250920) (sgl-project#10683)
* chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile (sgl-project#10681)
* [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster (sgl-project#10680)
* Replace os.environ in layernorm.py (sgl-project#10684)
* fix(disagg): fix sending KV cache in case of MLA for NIXL backend (sgl-project#10673)
* fix: update run_suite (sgl-project#10685)
* fix: remove awq_dequantize deps (sgl-project#10686)
* [Auto Sync] Update modelopt_quant.py (20250920) (sgl-project#10688)
* [Feature] Support deterministic inference with FA3 backend (sgl-project#10651)
* feat: update server args  (sgl-project#10696)
* Super tiny fix extra logs (sgl-project#10697)
* [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization  (sgl-project#10592)
* Update release-docs.yml (sgl-project#10706)
* Refactors radix cache for extra key support (sgl-project#10317)
* [Router]fix: fix get_load missing api_key (sgl-project#10385)
* fix: disable gpt-oss b200 ut (sgl-project#10716)
* Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (sgl-project#10714)
* [Auto Sync] Update deepseek_v2.py (20250922) (sgl-project#10717)
* Support deterministic inference with triton backend (sgl-project#10694)
* [deterministic inference] Move batch invariant pkg to sglang (sgl-project#10695)
* [2/2] Support deterministic inference for temperature > 0 (sgl-project#10678)
* [Ascend] codeowner updates for ascend related files (sgl-project#10699)
* [theta] 支持qwen-vl的多模自定义采样
* revert e61d08c [theta] 支持qwen-vl的多模...
* PullRequest: 306 [theta] 支持qwen-vl的多模自定义采样
* [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% (sgl-project#10709)
* Convert FLASHINFER_WORKSPACE_SIZE to integer (sgl-project#10731)
* EPLB: prefer to use physical experts in the same node (sgl-project#9849)
* fix capture_bs when speculative decoding enabled (sgl-project#10730)
* Fix flaky logprobs test (sgl-project#10728)
* Fix CI TestChunkedSGMV (sgl-project#10737)
* [Docs, minor] Fix LLM doc matrix (sgl-project#10753)
* Add warnings and remove dependency for deterministic inference (sgl-project#10724)
* bugfix: Fix `get_worker_urls_for_model` in http/router.rs (sgl-project#10754)
* [router] refactor router and worker management 3/n (sgl-project#10727)
* [router] update ci so only execute benchmarks when labels are added (sgl-project#10757)
* Fix MTP MoE weight loading with NVFP4 target model. (sgl-project#10758)
* chore: bump sgl-kernel v0.3.12 (sgl-project#10732)
* [Generative Score API] Added test_scores_api.py to github CICD to run per commit (sgl-project#10755)
* refactor zero copy (sgl-project#10300)
* Fix multimodal registry and code sync scripts (sgl-project#10759)
* Enables TRT-LLM backend to be used for target_verify (sgl-project#10281)
* fix: kv events with tp > 1 (sgl-project#10541)
* [Auto Sync] Update flashattention_backend.py (20250922) (sgl-project#10762)
* [Feature] Add MLAProcess for DeepSeek MLA on NPU (sgl-project#10130)
* [Ascend] optimize Qwen-vl on Ascend (sgl-project#10556)
* [Ascend]optimize Qwen3 on Ascend (sgl-project#10574)
* [Auto Sync] Update configurer.py (20250923) (sgl-project#10765)
* [router] refactor router and worker management 4/n (sgl-project#10756)
* PullRequest: 310 新增 BailingMoEV3 模型及其 MLA 支持
* [router] remove pd router draining channel (sgl-project#10767)
* [router] fix logger type mismatch (sgl-project#10774)
* Use simulate acc len from `sglang.environ` (sgl-project#10771)
* Fix trtllm_mla slow concat kernel in MTP (sgl-project#10777)
* Move cached kernel to srt.utils (sgl-project#10776)
* feat: unify dockerfiles (sgl-project#10705)
* Introduce `FutureMap` (sgl-project#10715)
* chore: upgrade sgl-kernel 0.3.12 (sgl-project#10782)
* followup: clean up dockerfiles and release yamls  (sgl-project#10783)
* Clean up server args (sgl-project#10770)
* move `environ` into `sglang.srt` to avoid break SRT auto sync. (sgl-project#10791)
* Fix hicache mooncake backend CI (sgl-project#10792)
* [router] fix cache aware routing strategy and lock contention (sgl-project#10773)
* [router] responses api POST and GET with local storage (sgl-project#10581)
* model: support qwen3-vl series (sgl-project#10323)
* [fix][pd-disag]no need set next batch sampling info done in prefill (sgl-project#10259)
* [ROCm] Update aiter to v0.1.5.post3 (sgl-project#10812)
* [router] use dashmap for radix tree instead of hash for multi model (sgl-project#10814)
* router(grpc): Implement route for chat_cmpl endpoint (sgl-project#10761)
* fix ceval (sgl-project#10504)
* Remove duplicate code in qwen2 model (sgl-project#10540)
* [router] fix axum default body limit (sgl-project#10818)
* Fix latest main ci (sgl-project#10799)
* add tunning files for QWEN-3-NEXT (sgl-project#10794)
* [Auto Sync] Update protocol.py (20250923) (sgl-project#10820)
* fix: draft model IMA by overide max_positional_embeddings (sgl-project#10787)
* [Auto Sync] Update elementwise.py (20250923) (sgl-project#10823)
* [Auto Sync] Update simple_eval_common.py (20250923) (sgl-project#10824)
* [router] Support streaming for Openai Router Response api  (sgl-project#10822)
* [router] add auth middleware for api key auth (sgl-project#10826)
* [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (sgl-project#10825)
* Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" (sgl-project#10828)
* Add CI timeout guidelines (sgl-project#10829)
* [theta] fix serving_tokenization.py
* feat: add cache_salt support to request (sgl-project#10718)
* fix bailing_moe with enable_dp_attention (sgl-project#10860)
* ci: free space on workers for build (sgl-project#10786)
* router-grpc: Support jinja chat template content format detection (sgl-project#10832)
* [router] select first healthy worker on proxied get requests (sgl-project#10827)
* chore: Initial support for input config files (sgl-project#10534)
* router-grpc: Add tools processing and other paramters for apply_chat_template (sgl-project#10877)
* [router] consolidate health endpoints and flush cache (sgl-project#10876)
* Restruct sgl-kernel benchmark (sgl-project#10861)
* [Bug] Fix Issue#10215 (sgl-project#10572)
* [router] consolidate worker get loads (sgl-project#10880)
* [router] Support Oracle DB(ATP) Data Connector (sgl-project#10845)
* [router] simplify tokenizer dev doc (sgl-project#10895)
* [Auto Sync] Update model_config.py (20250925) (sgl-project#10885)
* [ci feature] add ci monitor (sgl-project#10872)
* [HiCache] Cleaning the deprecated host memory state (sgl-project#10778)
* integrate AIBrix KVcache (sgl-project#10376)
* Add fuse_moe per-channel tune (sgl-project#10915)
* [router] consolidate worker load monitoring (sgl-project#10894)
* router: Fix constraint proto and `build_constraint` in grpc router (sgl-project#10881)
* Refactor kv_cache_scheme handling for quantization (sgl-project#10132)
* refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` (sgl-project#10924)
* fix env flashinfer (sgl-project#10910)
* [minor] Remove deprecated function `get_ip` (sgl-project#10883)
* Rename customer label -> custom label (sgl-project#10899)
* [router] change log level to warning (sgl-project#10926)
* [router][refactor] Clean up protobuf fields (sgl-project#10923)
* Replace the Kimi-K2 generated tool call idx with history tool call count (sgl-project#10612)
* [ci] add ci-monitor workflow (sgl-project#10898)
* Remove pull_request trigger from CI monitor workflow (sgl-project#10932)
* router: Support parallel sampling num > 1 in grpc_server and non-stream handling (sgl-project#10929)
* Revert "Refactor kv_cache_scheme handling for quantization (sgl-project#10132)" (sgl-project#10935)
* Update CODEOWNERS to include JustinTong0323 in FC (sgl-project#10939)
* [PD-HiCache]: Support Async Offloading KVCache In Decode Side (sgl-project#10192)
* CI: Fix docker manifest build (sgl-project#10936)
* [router] update owners for router components (sgl-project#10927)
* Fuse write kv buffer into rope for qwen3 moe & bailing moe (sgl-project#10749)
* [router] add grpc client get and set (sgl-project#10955)
* [router]fix code owner syntax error (sgl-project#10956)
* [router] move grpc client from router to worker and builder (sgl-project#10958)
* [router] add move grpc worker management from router to worker manager (sgl-project#10960)
* [router] grpc router regular mode import cleanup (sgl-project#10963)
* [router] remove old/outdated/useless comments (sgl-project#10967)
* [router] remove old/outdated/useless comments across code base (sgl-project#10968)
* ci: fix rate-limit of huggingface with hf auth login (sgl-project#10947)
* Update label field comment to indicate deprecation (sgl-project#10970)
* Restruct gpu_memory_settings in a unify function and relax max_cuda_graph_bs (sgl-project#10372)
* ci: refactor nightly test (sgl-project#10495)
* refactor loading weights from remote instance coding format (sgl-project#10941)
* [router][grpc] Add helper functions for decoder in router.rs and fix specs (sgl-project#10971)
* Add simple docker file for B300 (sgl-project#10944)
* Ci monitor support performance (sgl-project#10965)
* [HiCache]: Support dynamic loading backends for hicache (sgl-project#10551)
* [Bugfix][Minor][Benchmark] Fix some bugs due to PR sgl-project#10495 (sgl-project#10982)
* [router][grpc] Support E2E non-stream chat completions (sgl-project#10980)
* fix: fp8 quantization failure of qwen 2.5 VL 7B model (sgl-project#10112)
* [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (sgl-project#10981)
* fix: make inference deterministic for large TP (sgl-project#10930)
* Add auth to get server info (sgl-project#10751)
* PullRequest: 315 bailingMoE: Fix deepep_mode keyerror
* Add support for topk metadata transferring for PD (sgl-project#10616)
* [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend (sgl-project#10565)
* Use jsonschema to constrain required or specific tool choice (sgl-project#10550)
* Fix profiler (sgl-project#10997)
* [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (sgl-project#10995)
* [router] basic mcp support for openai router response api (sgl-project#10978)
* [router] fix chat template loading and tokenizer path (sgl-project#10999)
* Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' (sgl-project#11009)
* [bugfix]Add empty_context import to two_batch_overlap.py (sgl-project#10964)
* prepare for sglang+verl (sgl-project#10555)
* [sgl-kernel] Optimize concat_mla_k kernel (sgl-project#10543)
* [HiCache] bug: fix mooncake store batch set v1 (sgl-project#11013)
* Fix FusedSetKVBufferArg  in RotaryEmbedding (sgl-project#11003)
* Update GLM-4.5 Model Doc (sgl-project#11017)
* [router] migrate to rust python module for pythonic parser (sgl-project#11033)
* fix: show failed models in nightly ci (sgl-project#10986)
* [router][tool call] Support normal content extraction before tool call (streaming) (sgl-project#11038)
* [router] add harmony tool parser base structure and interface (sgl-project#11036)
* Unify SGL Kernel Releases (sgl-project#10701)
* [1/2] Support FA4 for MHA Prefill in sgl-kernel (sgl-project#10940)
* fix: check if weights are already local before downloading (sgl-project#11015)
* [HiCacheStorage] mooncake store support page_first_direct layout (sgl-project#10591)
* [speculative decoding] rename lookahead to ngram (sgl-project#11010)
* Fix gemma 3 launch with `transformers:` the error: `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` (sgl-project#9614)
* Fix sgl-kernel benchmark dead code  (sgl-project#11022)
* [router][tool call] Improve normal content extraction and error handling (non-stream) (sgl-project#11050)
* chore: upgrade cutedsl 4.2.1 (sgl-project#11054)
* [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo (sgl-project#10976)
* chore: upgrade sgl-kernel 0.3.13 (sgl-project#11056)
* [router] add n to generate sampling params (sgl-project#11069)
* Use more general heuristics to set the default value of --mem-fraction-static (sgl-project#10975)
* [router][tool call] Separate `JsonParser` and `LlamaParser` (sgl-project#11073)
* Fix mem fraction static for nightly tests (sgl-project#11076)
* fix: fp8 mllama4 without vision modules being quantized (sgl-project#10611)
* [router] Use `get_pooled` in `process_single_choice` (sgl-project#11079)
* [router][grpc] Add logprobs support to router (sgl-project#11082)
* feat(reasoning): improve enable thinking from request (sgl-project#10875)
* [Profile] dump memory trace when cuda graph profile is enabled (sgl-project#11083)
* Remove hybrid_linear_attn attention backend and refactor attention registry (sgl-project#10816)
* [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (sgl-project#9642)
* Enable optional FP32 compute for LM Head (sgl-project#10729)
* Update CODEOWNERS for attention/ascend_backend.py (sgl-project#11092)
* [router] grpc router generate endpoint support (sgl-project#11070)
* [router][tool call] Full support for ToolChoice (sgl-project#11085)
* Fix spec filter batch when target extend  (sgl-project#10991)
* [Fix] Resolve performance drop in speculative decoding aiter backend (sgl-project#11087)
* [Auto Sync] Update fused_moe_triton_config.py (20250930) (sgl-project#11099)
* chore: bump sgl-kernel v0.3.14 (sgl-project#11067)
* [router][grpc-server] Fix gRPC server shutdown (sgl-project#11094)
* Fix eagle radix cache (sgl-project#10846)
* [Eval] Add `--repeat` in `run_eval`  (sgl-project#11101)
* [CPU] Adding Memory Capacity Acquisition Functionality (sgl-project#11102)
* Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization (sgl-project#11081)
* Support Dots.ocr model (sgl-project#11071)
* [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` (sgl-project#11113)
* Feature/make PEFT adapter module format compatible (sgl-project#11080)
* fix: KimiK2Detector Improve tool call ID parsing with regex (sgl-project#10972)
* [router] add mcp list and mcp call in output array (sgl-project#11112)
* Organize spec-related data structures (sgl-project#10735)
* [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm (sgl-project#11114)
* [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) (sgl-project#11115)
* [Doc] Update multimodal language models documentation (sgl-project#11111)
* Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg (sgl-project#10985)
* docker: x86 dev builds for hopper and blackwell (sgl-project#11075)
* Refactor AMD CI. (sgl-project#11128)
* feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 (sgl-project#10760)
* [HiCache]bug fix: fixed blank item in host_mem_release_queue (sgl-project#11005)
* [Feature] Add EIC as sglang HiCache Storage backend (sgl-project#10271)
* [HiCache] Configurable and Dynamic Prefetch Timeout (sgl-project#10512)
* [router] add pd service in grpc router for pd (sgl-project#11120)
* [router] Add multi-turn tool calling loop support for MCP integration (sgl-project#11143)
* Fix metrics and request tracing (TimeStats) (sgl-project#11123)
* Remove debug print statement from scheduler output (sgl-project#11145)
* Introduce cpu tensor as metadata to avoid blocking gpu kernel launch (sgl-project#10720)
* Fix ngram spec with page size > 1 (sgl-project#11135)
* [ROCm] To reduce the compiling time when using torch compile. (sgl-project#10559)
* Fix DeepSeek chunked prefill memory issue (sgl-project#11149)
* Clean up parallel_state.py (sgl-project#11148)
* Tiny improve dumper (sgl-project#11132)
* Tiny fix missing alt stream in nextn layer (sgl-project#10768)
* Fuse quantize and rope in trtllm_mla MTP (sgl-project#10779)
* Tiny detect slow ranks (sgl-project#10508)
* Remove unused pack `.item()` in paged allocator. (sgl-project#11156)
* Support dispatch low latency (sgl-project#10263)
* Support single batch overlap (sgl-project#10422)
* [router][grpc] Support tool call parser in streaming (sgl-project#11160)
* [model] Add mamba2 and Falcon-H1 support. (sgl-project#10988)
* Clean up ascend allocator (sgl-project#11152)
* fix cpp JIT compilation issue of ngram speculative decoding (sgl-project#10837)
* Tiny cleanup deepseek_v2.py (sgl-project#11163)
* Tiny fix ep_gather behavior different in CI (sgl-project#11130)
* Tiny remove duplicated code (sgl-project#11164)
* [proto] Add script to compile python protos (sgl-project#11171)
* Unify forward output datastructure (sgl-project#11124)
* [grpc] style fix for grpc compilation. (sgl-project#11175)
* Remove dp balance metadata and minimal token balance. (sgl-project#11170)
* Minor fixes for server_args, parallel_state, and test_deterministic.py (sgl-project#11159)
* fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 (sgl-project#11176)
* [router][grpc] Support streaming for v1/chat/completions (sgl-project#11179)
* Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (sgl-project#11138)
* Introduce naming convention in `io_struct` and base sglang io classes. (sgl-project#10133)
* [Generative Scores API] add performance tests to CICD  (sgl-project#10830)
* [1/n] Enable DCA CUDA graph capture (sgl-project#9537)
* [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (sgl-project#11161)
* [CI] Tee server logs to both file and stdout/stderr using PIPE (sgl-project#11185)
* fix: radix cache memory accounting (sgl-project#10637)
* Tiny add PD disaggregation + DP attention test (sgl-project#11167)
* [router] Streaming support for MCP Tool Calls in OpenAI Router (sgl-project#11173)
* [Feature] Option to save model weights to CPU when memory saver mode is enabled (sgl-project#10873)
* Add --thinking-mode to run_eval (sgl-project#11189)
* [hot-fix] Fix CI break which caused by adding `thinking_mode` in eval (sgl-project#11192)
* Tiny move files to utils folder (sgl-project#11166)
* Fix CUDA illegal memory access issues in speculative decoding (sgl-project#10892)
* Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. (sgl-project#10780)
* Optimize debug log position of PD abort request (sgl-project#11090)
* fix 3fs indices (sgl-project#10855)
* model: support starcoder2 (sgl-project#10609)
* [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. (sgl-project#10859)
* fix xeon ci check (sgl-project#10838)
* fix qwen2 eagle3 runtime error (sgl-project#10517)
* [minor] fix the lint (sgl-project#11198)
* [Fix] Fix the bug of the calculation of base_gpu_id (dp offset) in data_parallel_controller.py (sgl-project#10741)
* [fix]missing prefix_lens_cpu init when p/d disaggregation (sgl-project#11196)
* fix self.enable_kv_cache_events (sgl-project#11178)
* [HICache]: Refactor HiCache CI (sgl-project#11011)
* fix sampling_seed handling when deterministic is enabled (sgl-project#11096)
* [fix]enable flashmla when using draft model P/D attention select (sgl-project#11012)
* [router] fix get load response parsing (sgl-project#11213)
* [router] add grpc router pd mode for chat and generate (sgl-project#11140)
* EAGLE cache fix for HiCache (sgl-project#11215)
* Add --max-new-tokens CLI flag for MMMU evaluation (sgl-project#11217)
* Add DeepSeek-V3.2 Tool Call Template (sgl-project#11063)
* Tiny `skip_sample` adjust (sgl-project#11225)
* [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (sgl-project#11194)
* Update `v1/responses` to be more OpenAI-compatible. (sgl-project#9624)
* chore: bump sgl-kernel v0.3.14.post1 (sgl-project#11137)
* Update DeepGEMM repository tag to specific commit (sgl-project#11229)
* [Feat] Support Torch Symm Mem AllReduce (sgl-project#10571)
* Refactor and optimize mooncake CI (sgl-project#11162)
* [Fix AMD CI] VRAM cleanup  (sgl-project#11174)
* Update transformers package version to 4.57.0 (sgl-project#11222)
* Remove gdrcopy check in ci_install_deepep.sh (sgl-project#11237)
* Rename runner labels (sgl-project#11228)
* [Auto Sync] Update io_struct.py (20251004) (sgl-project#11206)
* Create two new GH workflows to automatically bump SGLang and Kernel version (sgl-project#10996)
* Fix spec_utils.py (sgl-project#11247)
* ci: make find_local_hf_snapshot_dir more robust (sgl-project#11248)
* [quantization] Fix scale remapping for mllama4 (sgl-project#10042)
* [quantization] Enable aiter mxfp4 fused_moe for Quark (sgl-project#10048)
* Use cu128 for torch audio to fix some CI tests (sgl-project#11251)
* Bump torch_memory_saver 0.0.9rc2 (sgl-project#11252)
* update sgl kernel version to 0.3.14.post1 (sgl-project#11242)
* Update condition for sgl-kernel-benchmark-test (sgl-project#11254)
* feat: add shortcut detection for multimodal templates in Jinja format (sgl-project#11209)
* Improve bot release workflow (sgl-project#11240)
* Add flashmla and fast hadamard transform to Dockerfile (sgl-project#11235)
* Support DeepSeek V3.2 Exp (sgl-project#11061)
* chore: bump SGLang version to 0.5.3rc2 (sgl-project#11259)
* chore: bump SGLang version to 0.5.3 (sgl-project#11263)
* [theta] fix bailing v3
* [router] add ipv6 support across all components (sgl-project#11219)
* Remove env var warnings for release (sgl-project#11262)
* Enable native ModelOpt quantization support (1/3)  (sgl-project#7149)
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (sgl-project#11270)
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (sgl-project#11274)
* docker: add manifest to versioned docker releases (sgl-project#11268)
* [Bug] Fix incorrect assertion in FA4 and add UT. (sgl-project#11182)
* [router][grpc] Refine streaming processes (sgl-project#11277)
* Fix code sync scripts (sgl-project#11276)
* [Auto Sync] Update test_utils.py (20251006) (sgl-project#11280)
* Rename max_micro_batch_size -> pp_max_micro_batch_size (sgl-project#11279)
* reverse the amd ci test back to 1200s and split the 8-gpu deepseek job into two. (sgl-project#11238)
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (sgl-project#11261)
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (sgl-project#11282)
* docs: update sgl-kernel README (sgl-project#11286)
* chore: bump sgl-kernel version to 0.3.15 (sgl-project#11281)
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (sgl-project#11283)
* convert test_deterministic into unit tests (sgl-project#11095)
* Feature/longbench v2 evaluation utils (sgl-project#10949)
* [ci] fix pp test (sgl-project#11294)
* EAGLE cache fix for SWARadixCache (sgl-project#11231)
* Remove overlap thread (sgl-project#11210)
* [router] add reasoning and tool parser argument in router (sgl-project#11290)
* Remove sampling info events and overlap thread file (sgl-project#11300)
* Introduce future indices (sgl-project#11301)
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (sgl-project#11068)
* [Docs] [Router] Update Observability and Common Issues Section (sgl-project#11302)
* [router] add get server info and get model info in grpc server (sgl-project#11303)
* [router][grpc] Refactor chat template content format detection (sgl-project#11288)
* [Doc] HiCache Design Documents (sgl-project#11027)
* [Doc]: Best Practice for HICache (sgl-project#11001)
* [router] fix grpc connection conversion and add optimization (sgl-project#11305)
* [router][grpc] Fix sampling_params.stop_strs is None (sgl-project#11306)
* Update tool parser and related documentation (sgl-project#11223)
* [router][grpc] Fix error message format in grpc chat handler (sgl-project#11307)
* [quantization] Properly ignore quantization for layers excluded in quant_config (sgl-project#11205)
* [router] support Openai router conversation API CRUD (sgl-project#11297)
* [router][grpc] Fix request_id extraction when n > 1 (sgl-project#11311)
* [router] cleanup worker health check to return early (sgl-project#11310)
* [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (sgl-project#11304)
* Clean match_prefix and prepare_for_extend for mem cache V2 (sgl-project#11200)
* ci: unify the model launch method of nightly ci (sgl-project#11230)
* [Chore] Update xgrammar 0.1.24 -> 0.1.25 (sgl-project#10710)
* update sampling_params documentation with defaults (sgl-project#11315)
* Optimize copy_kv_cache for spec decoding (sgl-project#11126)
* Rename `ngram_utils` -> `ngram_info` (sgl-project#11316)
* [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (sgl-project#11314)
* [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (sgl-project#9545)
* [8/N] MoE Refactor: deprecate `EPMoE` (sgl-project#11211)
* Skip weight loading in deepgemm compilation (sgl-project#11312)
* [2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)
* [Doc] Update mooncake nvlink transport doc for PD disaggregation (sgl-project#11321)
* fix(decode): adjust ServerArgs import to explicit module path (sgl-project#11007)
* Support LoRA in bench_serving oai interface (sgl-project#11318)
* benchmark: enhance configurable multimodal benchmarking in bench_serving (sgl-project#9812)
* [CI] improve disaggregation CI. (sgl-project#11264)
* [theta] fix tokenization
* model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (sgl-project#10909)
* [router] refactor generate to use new pipeline arch (sgl-project#11323)
* [router] improve reasoning parser lock and reduce req cloning (sgl-project#11336)
* [router][grpc] Cleanup debug logs in grpc_server and grpc_router (sgl-project#11340)
* [router] Fix all unused_qualifications (sgl-project#11341)
* [router] Support history management using conversation (sgl-project#11339)
* [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (sgl-project#11342)
* fix: fix revision for sgl-flash-attn in sgl-kernel (sgl-project#11327)
* [Auto Sync] Update scheduler.py (20251009) (sgl-project#11350)
* [Generative Score API] Multi-Item scoring with custom attention mask. (sgl-project#10979)
* [router][grpc] disable health check generation and increase timeout (sgl-project#11353)
* [router] Refactor OpenAI router: split monolithic file and move location (sgl-project#11359)
* [router][lint] Add unused_qualifications to cargo lint warnings (sgl-project#11366)
* [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (sgl-project#11309)
* PullRequest: 323 [theta] Normalize error codes: 1) pre-processing failures of chat and completions requests uniformly return 400; 2) multimodal load-data requests return standard HTTP error codes
* [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (sgl-project#11373)
* add code pp support for nixl (sgl-project#11375)
* fix bench_serving mishandling of internal states (sgl-project#11376)
* PullRequest: 322 Support MTP and subclass BailingMoEV3AttentionMLA from DeepseekV2AttentionMLA
* [router][grpc] Replace fake health check with correct ones (sgl-project#11387)
* [router] change grpc client from mutable to clone (sgl-project#11394)
* chore: upgrade flashinfer 0.4.0 (sgl-project#11364)
* [router] conversation item API: create, retrieve and delete (sgl-project#11369)
* chore: bump SGLang version to 0.5.3.post1 (sgl-project#11324)
* move more files under srt/utils (sgl-project#11285)
* [grammar] Avoid server crash when grammar backend is None (sgl-project#11401)
* fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (sgl-project#11389)
* [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (sgl-project#11365)
* [CI] Refactor PD disaggregation test suite (sgl-project#11363)
* Replace pad with cat for better performance (sgl-project#11388)
* fix: reinstall torch in deps install (sgl-project#11414)
* feat(hicache): Support passing prefix keys for l3 store. (sgl-project#9045)
* fix file and object naming scheme in HiCacheNixl to avoid data corruption (sgl-project#10969)
* Dedicated toml files for CPU/XPU (sgl-project#10734)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11144)
* chore: update pyproject (sgl-project#11420)
* PullRequest: 330 [theta] qwen-vl supports passing video frames as base64-encoded images, e.g. data:video/jpeg;base64,frame1_base64,frame2_base64,...,frameN_base64
* fix: fix video input for qwen3-vl (sgl-project#11361)
* perf: optimize qwen-vl with symm mem allreduce (sgl-project#11381)
* [HiCache] feat: add multi tenant with prefix tag (sgl-project#9256)
* [CI] Merge build-dev into workflow matrix (sgl-project#11345)
* Revert "perf: optimize qwen-vl with symm mem allreduce" (sgl-project#11436)
* Revert "fix: fix video input for qwen3-vl" (sgl-project#11437)
* Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (sgl-project#11433)
* [router] Fix ci nvcc not found error (sgl-project#11411)
* feat(mooncake): support GB suffix for global_segment_size  (sgl-project#10745)
* Separate allocation logic from scheduler (sgl-project#11313)
* [router] disable rate limiter by default (sgl-project#11435)
* [router] leverage RAII to actively cancel request during client disconnect (sgl-project#11399)
* [router][grpc] Consolidate parser checks for chat completions (sgl-project#11439)
* Reorder PD disagg CI tests (#11438)
* fix: Change dsv32 hack temporary path to use system temp directory (#11445)
* Fix batch invariant ops (#11368)
* [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
* [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
* Remove tilelang dependency in Dockerfile (#11455)
* Enable native ModelOpt quantization support (2/3) (#9991)
* Reland [1/2] Optimizations and refactors about quant kernel (#10312)
* Super tiny delete unused openai router in sgl-router (#11448)
* Adjust logits metada init for target verify (#11467)
* [Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
* Fix enable_v2 in int8 quant (#11470)
* [Fix] Fix split prefill with fa3. (#11428)
* fix stop when stream  (#11462)
* Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. (#8919)
* PullRequest: 334 [theta] Fix various qwen3-vl bugs
* [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019)
* fix Xeon CI (#11454)
* [CI] Add nightly builds to dockerhub (#9804)
* [Feature] support regex strings as a stopping condition (#10635)
* Beta spec-overlap for EAGLE (#11398)
* Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
* [Router]: Small Typo in a comment within tree.rs (#11489)
* chore: bump sgl-kernel version to 0.3.16 (#11476)
* [smol] [perf] Qwen3-VL in place op. (#11481)
* [chore][1/N] Avoid using default mutable parameters (#11478)
* [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
* [ perf ] Replace json-> orjson in hot path (#11221)
* [chore][2/N] Avoid using default mutable parameters (#11479)
* Fix the GPT function calling regex to allow dash in the name (#10577)
* bailingMoE: Fix Key error of deepep_mode (#11465)
* Fix CI break by express-laned PRs. (#11499)
* Move args from `global_config` to `environ` (#11332)
* move fla env check position (#11500)
* Temporarily remove b200 tests (#11501)
* Fix port conflicts in CI (#11497)
* temporarily remove b200 tests (#11502)
* Fix unit tests (#11503)
* Bugfix: Fix Type consistency for KV indices in SWARadixCache (#11452)
* doc: add doc for adding new models into nightly-ci (#11443)
* [CI] fix lint (#11509)
* Deprecate `global_server_args_dict` (#11331)
* chore: remove flashinfer cleanup cache (#11514)
* fix: revert temporarily remove b200 tests (#11515)
* [Fix] Improve longbench prompt and other logics (#11474)
* Sync changes on io_struct.py and deterministic ops (#11498)
* [lint] Fix the lint issue (#11516)
* Revert "Deprecate `global_server_args_dict`" (#11520)
* Improve dp attention port assignment scheme (#5889)
* [theta] rebase public/main 1013-2
* [router] openai router: support grok model (#11511)
* docs(router): add token-bucket rate limiting to the docs (#11485)
* [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
* Update DeepSeek-R1-FP4 default config on blackwell (#11512)
* [Fix]: add missing device attribute to ChunkCache (#11493)
* [Feature] Support mamba radix cache v0 (#11214)
* ci: improve nightly-ci (#11385)
* [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering (#11505)
* [HICache]: Support 3FS-Store with page_first_direct layout (#11460)
* Tiny fix test run estimated time (#11544)
* [Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
* [theta] rebase public/main 1013-5
* Deprecate `global_server_args_dict` (#11528)
* [theta] rebase public/main 1013-6
* [Fix] Add per_channel_quant parameter to MoE config functions (#11201)
* [router][ci] Add Nightly Release Workflow for SGLang Router (#11527)
* [router] allow tokenizer path to be dir (#11530)
* Remove `tp_worker.worker` (#11548)
* fix: fix video input for qwen3-vl (#11442)
* [NVIDIA] BUMP FA3 (#11444)
* [router][Fix] Include grpc reflection runtime dependency (#11419)
* Adjust overlap event loop (#11507)
* Move deep gemm related arguments to `sglang.srt.environ` (#11547)
* [router][grpc] Further delegate non-stream processing to `processing.rs`  (#11553)
* [router] allow user to specify chat template path (#11549)
* Minor: improve sampler & remove unused fields from model_config.py (#11531)
* [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
* Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) (#11557)
* [CI] Add Basic Test for DeepSeek V3.2 (#11308)
* [router][grpc] Add error handling to `generate_tool_constraints` (#11562)
* [NVIDIA] update pyproject.toml to support cu130 option (#11521)
* [CI Monitor] Ci monitor only deal with main branch in default (#11538)
* Tiny cleanup fp4 gemm calls (#11537)
* [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564)
* [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds (#11571)
* [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534)
* chore: bump sgl-kernel version to 0.3.16.post1 (#11573)
* Fix accept rate in speculative decoding metrics (#11572)
* Compilation Folder Reset (#11539)
* [FEATURE] Add Profile Trace Merger for Distributed Traces (#11413)
* [DSv32] Use torch.compile for _get_logits_head_gate (#11565)
* Make DeepEP combine recv do not overlap (#11535)
* bench_serving support PD Disaggregation (#11542)
* Implement LRU eviction policy for LoRA adapters (#11041)
* PullRequest: 337 Support multimodal requests via the completions protocol
* Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582)
* chore: bump sgl-kernel version to 0.3.16.post2 (#11583)
* [Auto Sync] Update model_config.py (20251014) (#11580)
* Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
* [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588)
* [router] update generate spec to align with sgl io struct (#11591)
* [router] change worker api to async instead of sync (#11566)
* Update news section in README.md (#11598)
* [router] delete useless table content comment in spec (#11597)
* [router] allow router launch server to use grpc mode (#11600)
* [Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
* [router] when given both local tokenizer and chat template, log all (#11601)
* [AMD CI] Add image and weights caching. (#11593)
* Update release-docker-dev.yml (#11603)
* Optimize Triton Draft Backend (#11556)
* Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function (#11586)
* make radix cache deterministic (#10721)
* move eagle draft post process to cuda graph (#11434)
* Reduce one step decode for draft model. (#11561)
* [router] add py binding and readme for openai router and history backend (#11453)
* [theta] print load mm cost
* [theta] Bailing 4-head supports tp8
* [router] cleanup app context and move to startup (#11617)
* [router] add chang and keyang to sgl router author (#11620)
* use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. (#11605)
* [router] update router readme to latest features (#11619)
* Fix log for chunked prefix cache (#11624)
* [Auto Sync] Update scheduler.py, server_args.py (20251014) (#11623)
* [Auto Sync] Update collector.py (20251014) (#11625)
* [Minor] Update xgrammar dependency (#11622)
* Update install.md (#11631)
* fix: Update SGL_KERNEL_VERSION to 0.3.15 (#11633)
* [router][grpc] add warm up to grpc server (#11627)
* Refactor kv cache free (#11351)
* [router] update router doc to latest features (#11639)
* fix: upgrade transformers to 4.57.1 (#11628)
* [router] add worker self discovery for metadata (#11638)
* [router] upgrade to 0.2.0 (#11642)
* [theta] log qwen-vl elapsed time
* [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
* [theta] log qwen-vl elapsed time
* [1/N]Support  DeepSeek-R1 w4a8 normal deepep (#8247)
* [Fix] Fix accuracy bug in CSGMV kernel caching key. (#11579)
* feat: add add_chunked_prefix_cache_attention_backend (#11636)
* Super tiny improve FA3 import error message (#11590)
* [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl  (#11458)
* [Doc] Update support matrix for attn and hybrid attn (#11293)
* Clean up some Qwen3-Next and deterministic code (#11585)
* docs: update sglang installation guide (#11659)
* [theta] Update aci image and dependencies
* Tiny cleanup some eagle unused codes (#11660)
* Fix 1-step draft model forward (#11653)
* [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
* [router] Fix response api related spec (#11621)
* Fix missing json imports in serving_responses.py (#11681)
* [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674)
* [sgl-kernel] Optimize gguf test (#11667)
* [router][grpc] Simplify model_id determination (#11684)
* [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676)
* chore: bump SGLang version to 0.5.3.post2 (#11680)
* [CI][XPU]enable sglang CI on Intel XPU (#9493)
* enable rmsnorm on XPU (#10248)
* Sync code and test CI; rename some env vars (#11686)
* docs: Add Contributor Covenant Code of Conduct (#11689)
* [theta] Add DeepGEMM build cache to dockerfile (needs periodic updates)
* [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679)
* [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
* [sgl-kernel] support hadamard (#11663)
* Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
* Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)

Labels

high priority lora ready-to-merge The PR is ready to merge after the CI is green. run-ci


7 participants