Separate allocation logic from scheduler #11313
Conversation
```python
    backup_state: bool = False,
):
    allocator = tree_cache.token_to_kv_pool_allocator
    evict_from_tree_cache(tree_cache, num_tokens)
```
why are we evicting proactively here?
this is actually evict_if_needed:

```python
if self.token_to_kv_pool_allocator.available_size() < num_tokens:
    if self.tree_cache is not None:
        self.tree_cache.evict(num_tokens)
```
maybe we can rename it a bit; it actually checks for availability and evicts if needed
what would be the case where we want to evict nodes regardless of availability?
i also feel the eviction policy is non-trivial, so `evict_from_tree_cache` will encapsulate that complexity away from the upper-level code
the current `self.tree_cache.evict` does evict nodes to meet the requested amount regardless of availability, and I think we can probably keep the eviction policy under the hood too, like what we have now
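A minimal sketch of the consensus in this thread, reusing the allocator and tree-cache interfaces quoted above (the body is illustrative, not the PR's exact implementation):

```python
def evict_from_tree_cache(tree_cache, num_tokens: int) -> None:
    # Effectively "evict if needed": only evict when the allocator cannot
    # satisfy the request. How much and what to evict stays encapsulated
    # in the tree cache's own eviction policy.
    if tree_cache is None:
        return
    allocator = tree_cache.token_to_kv_pool_allocator
    if allocator.available_size() < num_tokens:
        tree_cache.evict(num_tokens)
```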
```python
)

# Allocate memory
if self.token_to_kv_pool_allocator.page_size == 1:
```
Hi, I find that `batch.token_to_kv_pool_allocator.page_size` does not always equal `batch.tree_cache.page_size`, so the paged config goes down the wrong allocation path, which breaks a lot of cases
could you let me know when the page sizes are different? we can add tests to capture this in the future
#11313 (comment) suggests the page sizes should be the same and this is just a bug.
But the question is why we need to access the same page size from different places? This echoes the suggestion here: #11645 (comment)
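One way to make the single-source-of-truth concern concrete (a sketch only; the attribute paths follow the snippets in this thread, and the helper name is hypothetical):

```python
def page_size_of(batch) -> int:
    # Read the page size from one place and fail loudly if the two copies
    # ever disagree, instead of silently taking the wrong allocation path.
    allocator_ps = batch.token_to_kv_pool_allocator.page_size
    tree_cache_ps = batch.tree_cache.page_size
    assert allocator_ps == tree_cache_ps, (
        f"page_size mismatch: allocator={allocator_ps}, "
        f"tree_cache={tree_cache_ps}"
    )
    return allocator_ps
```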
```python
def extend(reqs, model_runner):
    # Create dummy tree_cache for benchmarks (no prefix caching, just allocation)
    dummy_tree_cache = SimpleNamespace(
        page_size=1,
```
why hard-code this to 1 instead of `page_size=model_runner.server_args.page_size`?
Thanks for pointing this out. Will fix it.
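The agreed fix, sketched out (assuming `SimpleNamespace` from `types`; any fields beyond `page_size` are whatever the allocation helpers actually read):

```python
from types import SimpleNamespace

def extend(reqs, model_runner):
    # Dummy tree_cache for benchmarks (no prefix caching, just allocation).
    # page_size comes from the server args rather than being hard-coded to 1,
    # so paged configurations exercise the correct allocation path.
    dummy_tree_cache = SimpleNamespace(
        page_size=model_runner.server_args.page_size,
        # ...plus whatever other fields the allocation helpers read
    )
```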
Motivation
Preparation for mem_cache V2. The allocation logic is moved to `mem_cache/` from `schedule_batch.py`. Ideally, scheduling code should only interact with `tree_cache` in V1 and `memory_manager` in V2.

Modifications

- Add `mem_cache/common.py` for the allocation functions operating on the allocator and tree cache: `alloc_for_extend` and `alloc_for_decode` (see the sketch after this list)
- In `prepare_for_decode`, the increment of `seqlen` is moved after `alloc_for_decode` for clarity
- In `prepare_for_extend`, some allocation-needed fields are set before `alloc_for_extend`
- In `bench_one_batch.py`, create a dummy `tree_cache` as a placeholder
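A minimal sketch of the shape these helpers could take, reusing the `evict_from_tree_cache` and allocator interfaces quoted in the review threads above; the exact signatures and error handling in the PR may differ:

```python
# mem_cache/common.py -- illustrative sketch, not the PR's exact code

def alloc_for_decode(batch, num_tokens: int):
    """Allocate one new KV slot per running request for the next decode step."""
    tree_cache = batch.tree_cache
    allocator = tree_cache.token_to_kv_pool_allocator
    # Evict-if-needed before allocating (see the review thread above).
    evict_from_tree_cache(tree_cache, num_tokens)
    out_cache_loc = allocator.alloc(num_tokens)
    assert out_cache_loc is not None, "KV cache is full during decode"
    return out_cache_loc
```

Accuracy Tests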
Benchmarking and Profiling
Checklist