Separate allocation logic from scheduler #11313
Conversation
```python
    backup_state: bool = False,
):
    allocator = tree_cache.token_to_kv_pool_allocator
    evict_from_tree_cache(tree_cache, num_tokens)
```
why are we evicting proactively here?
this is actually evict_if_needed:

```python
if self.token_to_kv_pool_allocator.available_size() < num_tokens:
    if self.tree_cache is not None:
        self.tree_cache.evict(num_tokens)
```
maybe we can rename it a bit; it actually checks for availability and evicts if needed
what would be the case where we want to evict nodes regardless of availability?
i also feel the eviction policy is non-trivial, so `evict_from_tree_cache` will encapsulate that complexity away from the upper-level code
the current `self.tree_cache.evict` does evict nodes to meet the requested amount regardless of availability, and I think we can probably keep the eviction policy under the hood too, like what we have now
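A minimal sketch of the consensus in this thread, reusing the allocator and tree-cache interfaces quoted above (the body is illustrative, not the PR's exact implementation):

```python
def evict_from_tree_cache(tree_cache, num_tokens: int) -> None:
    # Effectively "evict if needed": only evict when the allocator cannot
    # satisfy the request. How much and what to evict stays encapsulated
    # in the tree cache's own eviction policy.
    if tree_cache is None:
        return
    allocator = tree_cache.token_to_kv_pool_allocator
    if allocator.available_size() < num_tokens:
        tree_cache.evict(num_tokens)
```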
```python
)

# Allocate memory
if self.token_to_kv_pool_allocator.page_size == 1:
```
Hi, I find that `batch.token_to_kv_pool_allocator.page_size` does not always equal `batch.tree_cache.page_size`, so the paged config goes down the wrong allocation path, which breaks a lot of cases
could you let me know when the page sizes are different? we can add tests to capture this in the future
#11313 (comment) suggests the page sizes should be the same and this is just a bug.
But the question is why we need to access the same page size from different places? This echoes the suggestion here: #11645 (comment)
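One way to make the single-source-of-truth concern concrete (a sketch only; the attribute paths follow the snippets in this thread, and the helper name is hypothetical):

```python
def page_size_of(batch) -> int:
    # Read the page size from one place and fail loudly if the two copies
    # ever disagree, instead of silently taking the wrong allocation path.
    allocator_ps = batch.token_to_kv_pool_allocator.page_size
    tree_cache_ps = batch.tree_cache.page_size
    assert allocator_ps == tree_cache_ps, (
        f"page_size mismatch: allocator={allocator_ps}, "
        f"tree_cache={tree_cache_ps}"
    )
    return allocator_ps
```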
```python
def extend(reqs, model_runner):
    # Create dummy tree_cache for benchmarks (no prefix caching, just allocation)
    dummy_tree_cache = SimpleNamespace(
        page_size=1,
```
why hard-code this to 1 instead of `page_size=model_runner.server_args.page_size`?
Thanks for pointing this out. Will fix it.
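The agreed fix, sketched out (assuming `SimpleNamespace` from `types`; any fields beyond `page_size` are whatever the allocation helpers actually read):

```python
from types import SimpleNamespace

def extend(reqs, model_runner):
    # Dummy tree_cache for benchmarks (no prefix caching, just allocation).
    # page_size comes from the server args rather than being hard-coded to 1,
    # so paged configurations exercise the correct allocation path.
    dummy_tree_cache = SimpleNamespace(
        page_size=model_runner.server_args.page_size,
        # ...plus whatever other fields the allocation helpers read
    )
```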
Motivation
Preparation for mem_cache V2. The allocation logic is moved to `mem_cache/` from `schedule_batch.py`. Ideally, scheduling code should only interact with `tree_cache` in V1 and `memory_manager` in V2.

Modifications

- Add `mem_cache/common.py` for the allocation functions operating on the allocator and tree cache: `alloc_for_extend` and `alloc_for_decode` (see the sketch after this list)
- In `prepare_for_decode`, the increment of `seqlen` is moved after `alloc_for_decode` for clarity
- In `prepare_for_extend`, some allocation-needed fields are set before `alloc_for_extend`
- In `bench_one_batch.py`, create a dummy `tree_cache` as a placeholder
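A minimal sketch of the shape these helpers could take, reusing the `evict_from_tree_cache` and allocator interfaces quoted in the review threads above; the exact signatures and error handling in the PR may differ:

```python
# mem_cache/common.py -- illustrative sketch, not the PR's exact code

def alloc_for_decode(batch, num_tokens: int):
    """Allocate one new KV slot per running request for the next decode step."""
    tree_cache = batch.tree_cache
    allocator = tree_cache.token_to_kv_pool_allocator
    # Evict-if-needed before allocating (see the review thread above).
    evict_from_tree_cache(tree_cache, num_tokens)
    out_cache_loc = allocator.alloc(num_tokens)
    assert out_cache_loc is not None, "KV cache is full during decode"
    return out_cache_loc
```

Accuracy Tests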
Benchmarking and Profiling
Checklist