Add flashinfer && Outlines #1
Merged
merrymercy merged 2 commits into main on Jan 8, 2024
Ying1123
pushed a commit
that referenced
this pull request
Sep 13, 2024
Sequence Parallel system setup
ispobock
pushed a commit
to ispobock/sglang
that referenced
this pull request
Oct 15, 2024
use triton attention
hubertlu-tw
referenced
this pull request
in hubertlu-tw/sglang
Feb 5, 2025
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Achazwl
pushed a commit
to Achazwl/sglang
that referenced
this pull request
Mar 2, 2025
zcnrex
referenced
this pull request
in zcnrex/sglang
Mar 5, 2025
[WIP] Add Benchmark for DeepGEMM Group GEMM
timethink
pushed a commit
to timethink/sglang
that referenced
this pull request
Mar 9, 2025
pi314ever
pushed a commit
to pi314ever/sglang
that referenced
this pull request
Apr 23, 2025
Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
pi314ever
pushed a commit
to pi314ever/sglang
that referenced
this pull request
Apr 23, 2025
* Fix ut mla-test-1-gpu-amd (sgl-project#4813) Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> * Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (sgl-project#4638) * [k8s] Clarified the usage of shared memory. (sgl-project#4341) * gemma3: impl `get_attention_sliding_window_size` for attn init (sgl-project#4823) * add partial_json_parser and einops (sgl-project#4827) * fix the release doc dependency issue (sgl-project#4828) * Update doc for DeepSeek-V3-0324 (sgl-project#4825) * deps: lazy import optional dependencies `gguf` and `torchvision` (sgl-project#4826) * Update MMMU Benchmark instructions (sgl-project#4694) * Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` (sgl-project#4830) * Basic Cleanup (sgl-project#4833) * Support (1 <= dp < tp) in the dp attention in DeepEP (sgl-project#4770) Co-authored-by: Cheng Wan <cwan39@gatech.edu> * [Fix] Add compressed_tensors as deps (sgl-project#4819) * Fix error due to CustomAllreduce setup failure (sgl-project#4815) Signed-off-by: Kebe <mail@kebe7jun.com> * use default for torch.ops (sgl-project#4835) * [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (sgl-project#3969) * [Misc] Fix issues reported by torchfix (sgl-project#4837) * Include context length in /v1/models response. 
(sgl-project#4809) * [Fix] `self.worker` assignment in `TpModelWorker` and refactor references (sgl-project#4788) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * Fix the lora adapter when lora path is none (sgl-project#4799) Co-authored-by: Beichen Ma <mabeichen12@gmail.com> * fix: fix typo of comments in w8a8_fp8.py (sgl-project#4843) * Remove retry in nightly tests (sgl-project#4846) * Fix CI of test_patch_torch (sgl-project#4844) * IPv6 support (sgl-project#3949) Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> * ci: add condition for daily docker build (sgl-project#4487) * [Fix] fix output_top_logprobs is not exist (sgl-project#4597) * fix: when use SGLANG_PORT this env,port is str (sgl-project#4528) Signed-off-by: rongfu.leng <lenronfu@gmail.com> * Support Page Size > 1 for FA3 (sgl-project#4832) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * Fix Engine error when enabling DP attention (sgl-project#4648) * fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest (sgl-project#4681) * Support controlling nsys start and end range programmatically (sgl-project#4688) * Remove empty tool function name (sgl-project#4704) Signed-off-by: Kebe <mail@kebe7jun.com> * Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. 
(sgl-project#4712) * get the python version from env (sgl-project#4729) * Fix torch.cuda.MemPool() internal assertion failure (sgl-project#4687) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * Super tiny remove unused code (sgl-project#4750) * Support with_stack and record_shapes in profiler (sgl-project#4740) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840) * Fix CI tests (sgl-project#4853) * Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855) * Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863) * [Feature] add multi-rank support for Lora (sgl-project#4492) Co-authored-by: rudy152 <czh1137892874@gmail.com> * Clean up `import vllm` in quantization/__init__.py (sgl-project#4834) * Fix wrong variable name when stopping memory profile (sgl-project#4772) * [Feat] support deepgemm for cmake (sgl-project#4864) * Make torch compile configurable for biased_grouped_topk (sgl-project#4749) * update sgl-kernel test ci (sgl-project#4866) * fix sampling issue (sgl-project#4871) * bump sgl-kernel 0.0.5.post4 (sgl-project#4768) * fix sgl-kernel cu118 build (sgl-project#4872) * [Feature] Support FA3 backend for MLA (sgl-project#4831) * upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873) * update torch compile doc (sgl-project#4874) * bump v0.4.4.post3 (sgl-project#4878) * Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882) * Improve stack trace of retry errors (sgl-project#4845) * Tiny fix doc error (sgl-project#4795) * [Docs] Update DeepGEMM at README.md (sgl-project#4886) * Update CODEOWNERS (sgl-project#4889) * Delete test_deep_gemm.py (sgl-project#4891) * Add deepseek style fused moe group gate selection kernel (sgl-project#4530) * quick fix: add default for new kernel (sgl-project#4898) * remove setup for sgl-kernel (sgl-project#4899) * [Misc] Clean m.def and add Development Tips 
(sgl-project#4890) * fix allreduce test (sgl-project#4909) * Support page size > 1 + eagle (sgl-project#4908) * Fix retract for page size > 1 (sgl-project#4914) * [Feature] use pytest for sgl-kernel (sgl-project#4896) * fix bmm fp8 (sgl-project#4926) * Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927) * Fix 2-gpu CI test and suppress some warnings (sgl-project#4930) * [feat] add fa3 in sgl-kernel (sgl-project#4902) Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> * Fix sglang frontend's incorrect dependency on torch (sgl-project#4931) * [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932) * cleanup sgl-kernel (sgl-project#4933) * [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925) * Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883) Co-authored-by: ch-wan <cwan39@gatech.edu> * [Fix] Add torch compile for torch.clamp back (sgl-project#4936) * Fix oom error for large page size (sgl-project#4913) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * [feat] interface for platforms abstraction (sgl-project#4928) * [Fix] revert clean m.def for cudagraph (sgl-project#4944) * refactor: multimodal data (sgl-project#4754) * bump sgl-kernel v0.0.6 (sgl-project#4950) * [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953) * use fa3 in sgl-kernel (sgl-project#4954) * Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959) * [Feature] Support DeepEP Low Latency (sgl-project#4767) Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: ch-wan <cwan39@gatech.edu> * update bench_serving (sgl-project#4958) * Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977) * [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: GeLee <leege233@gmail.com> * Large page size aligned hierarchical caching 
(sgl-project#4581) * bug fix for hicache host eviction (sgl-project#4989) * sgl scaled_fp8_quant support output padding (sgl-project#4861) * Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com> * Update tokenizer_manager.py (sgl-project#5008) * [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817) * update cutlass tag (sgl-project#5011) * Feature/revise docs ci (sgl-project#5009) * fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727) Co-authored-by: yuethe <yuethe@tencent.com> * [Build] Support build sgl-kernel with ccache (sgl-project#5020) * fix deepgemm as well (sgl-project#5030) * try to fix ci oserror (sgl-project#5024) * Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005) * Small refactor DeepEPMode to clean up code a bit (sgl-project#4992) * [Fix] fix fa3 build at cu118 (sgl-project#5036) * Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048) * bump sgl-kernel v0.0.7 (sgl-project#5046) * update eagle-3 docs (sgl-project#4796) Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> * Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> * Update the retry count (sgl-project#5051) * upgrade sgl-kernel v0.0.7 (sgl-project#5049) * [2/3] fix dsv3 awq issue (sgl-project#4625) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> * Feature/revise docs ci (sgl-project#5056) * Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057) * [fix] remove `cuda_device_count_stateless` (sgl-project#5060) * Small refactor DeepEPDispatcher into subclasses (sgl-project#4994) * Support async DeepEP by splitting into two stages (sgl-project#4995) * Cleanup 
unused resources after DeepEP operation (sgl-project#4996) * Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918) * [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072) * fix dummy-load deepseekv2 (sgl-project#4535) * support sgl-kernel on blackwell (sgl-project#5074) * FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Chunan Zeng <zcnrex@gmail.com> * [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052) * upgrade transformers 4.51.0 (sgl-project#5088) * sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079) * bump sgl-kernel 0.0.8 (sgl-project#5089) * python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080) * bump v0.4.4.post4 (sgl-project#5091) * Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097) Co-authored-by: shuaills <shishuaiuoe@gmail.com> * Add Llama4 support (sgl-project#5092) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@163.com> * Fix refactor error - fp8.py (sgl-project#5106) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * bump v0.4.5 (sgl-project#5117) * Workaround for async copy issue in HPU eager mode (sgl-project#1) Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> * [SW-223847]: Fix sgl_kernel module not available (sgl-project#2) Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> * [Base] Enable torch compile (sgl-project#4) * [SW-226331] disable dynamic shape in torch compile mode Signed-off-by: Mohit Sinha <msinha@habana.ai> --------- Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Brayden Zhong 
<b8zhong@uwaterloo.ca> Signed-off-by: rongfu.leng <lenronfu@gmail.com> Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: AinL <gmlwns5176@gmail.com> Co-authored-by: Jiří Suchomel <jiri.suchomel@statsperform.com> Co-authored-by: Juwan Yoo <ryan@tmfi.us> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Daniel Holanda <holand.daniel@gmail.com> Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com> Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Jon Durbin <jon@jondurbin.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Qiaolin Yu <qy254@cornell.edu> Co-authored-by: Beichen Ma <mabeichen12@gmail.com> Co-authored-by: Jiaqi <57028284+ZhuJiaqi9905@users.noreply.github.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com> Co-authored-by: warjiang <1096409085@qq.com> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: rongfu.leng <lenronfu@gmail.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: BroadbentJim <BroadbentJim@users.noreply.github.com> Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> Co-authored-by: DavidChan <chengwei0519@163.com> Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com> Co-authored-by: rudy152 
<czh1137892874@gmail.com> Co-authored-by: Fr4nk1in <sh.fu@outlook.com> Co-authored-by: yinfan98 <1106310035@qq.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> Co-authored-by: SEPLOS <seplos@aliyun.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: GeLee <leege233@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: zcnrex <zcnrex@gmail.com> Co-authored-by: Kaiyu Yang <yangky@umich.edu> Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com> Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com> Co-authored-by: yuethe <yuethe@tencent.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: Tommy Yang <tommyyang0524@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Rahul Vijayaraghavan <rahul.vijayaraghavan@intel.com> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Jay Thakur <jthakur@habana.ai> Co-authored-by: 
Anshuman Tripathy <atripathy@habana.ai>
shenxiul
added a commit
to shenxiul/sglang
that referenced
this pull request
Apr 17, 2026
The fused-autodetect path sets skip_special_tokens=False so the language prefix survives for parsing, but that also lets the detokenizer preserve downstream special tokens into the user-visible text:
* trailing <|endoftext|> after the last word
* embedded <|X.XX|> segment-boundary tokens in the timestamps variant
Both were flagged by the reviewer in the original comment sgl-project#1 along with the prefix leak. Prior commits fixed the prefix; this one closes the tail.
* parse_fused_output now does a defensive re.sub("<\\|[^|]+\\|>") on the post-sentinel transcription. verbose_json segment timing is unaffected: it's built from output_ids via _parse_segments, a separate path that already uses skip_special_tokens=True during per-token decoding.
* New WhisperAdapter.strip_special_tokens helper exposed via the base adapter (identity default) so the streaming handler stays model-agnostic. _generate_transcription_stream calls it per-delta in fused mode.
* 4 new unit tests cover trailing <|endoftext|>, embedded <|X.XX|>, the strip helper in isolation, and the streaming per-delta scrub. Updated the FSM-abort streaming test: a tail that's nothing but forced-prefix specials now collapses to no emitted delta (warning still logged, language stays unset).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
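The defensive scrub described in this commit message can be illustrated with a minimal sketch. The function name mirrors the helper the message mentions, but the body here is an assumption built only from the quoted pattern; it is not the actual sglang implementation.

```python
import re

# Pattern quoted in the commit message: any <|...|> token with no '|' inside.
_SPECIAL_TOKEN_RE = re.compile(r"<\|[^|]+\|>")


def strip_special_tokens(text: str) -> str:
    """Remove Whisper-style special tokens such as <|endoftext|> or <|1.20|>.

    Illustrative stand-in for the helper named in the commit message.
    """
    return _SPECIAL_TOKEN_RE.sub("", text)


if __name__ == "__main__":
    print(strip_special_tokens("hello world<|endoftext|>"))
    print(strip_special_tokens("<|0.00|>hi<|1.20|> there<|2.40|>"))
```

Because the pattern excludes `|` inside the token body, ordinary text containing a lone `<` or `|` is left untouched.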
Talantan1102
pushed a commit
to Talantan1102/sglang
that referenced
this pull request
Apr 20, 2026
adapt prefix + mtp on ascend gdn backend
hhhh1252023
referenced
this pull request
in hhhh1252023/sglang_public
Apr 28, 2026
lujangus
added a commit
to tails-mpt/sglang
that referenced
this pull request
May 1, 2026
Direct port of V4 reference Transformer (inference/model.py lines 769-809) with Eagle3 aux-hidden-state capture grafted on per the qwen2.py:652 post-loop trap pattern documented in wiki/pipeline.md known-pitfalls.
__init__:
- Embed, layers list (43 DeepseekV4DecoderLayer), final_norm RMSNorm parameters allocated.
- Top-level mHC head reduction parameters (hc_head_fn, hc_head_base, hc_head_scale per V4 ref lines 794-799). These fold the [b, s, hc_mult, d] hidden state into [b, s, d] before lm_head.
- LM head (nn.Linear hidden_size -> vocab_size, no bias).
- MTP block list. Per CLAUDE.md rule sgl-project#12 + architecture-notes.md "Eagle3 vs native MTP", MTP is loaded for weight-key compat but NOT executed at Eagle3 inference. Allocated as ModuleList of nn.Identity placeholders. The Eagle3 draft head replaces MTP.
forward:
- Embed -> hc_mult-copy expand -> layer loop -> post-loop trap -> hc-head fold -> final norm -> lm_head[..., -1] (last-token logits matching V4 reference get_logits).
- Aux capture grafted IN the loop: when layer index i is in self._eagle3_layers_to_capture, append h.mean(dim=-2) (fold over the hc_mult copies to get [B, T, d]) BEFORE the layer runs. This captures the layer's INPUT, matching the V3.2/qwen2.py convention.
- Post-loop trap: when end_layer (= num_hidden_layers) is in the configured capture set, capture the final hidden state. This handles the qwen2.py:652 pattern where set_eagle3_layers_to_capture maps layer_ids [0, 14, 27] -> [1, 15, 28] (+1 offset). The last layer's mapped index (=num_hidden_layers) is unreachable inside the for-loop; the post-loop trap captures it before normalization, matching the pre-norm convention used for the other aux states.
- Aux feature dim per captured layer: hidden_size (mean over hc_mult copies). For V4-Flash slot triple [1, 21, 41]: 3 layers x 4096 dim = 12288 aux feature dim, matching SpecForge draft expectation.
load_weights: still raises NotImplementedError, with a detailed TODO spelling out the 5 sub-tasks (key map, FP4 expert dequant, FP8 non-expert handling, tid2eid table, MTP weights). Pattern follows deepseek_common/deepseek_weight_loader.py with FP4 additions.
What this enables (when load_weights lands):
- Random-init forward pass for shape testing (works today, modulo sparse_attn_v4 stub raising)
- Real-checkpoint forward pass for numerical agreement testing vs HF reference
- Eagle3 training data generation via sglang as the target backend (per CLAUDE.md rule sgl-project#1: --target-model-backend sglang)
Open TODOs (still on Phase 1 in flight):
- TODO(phase1-loader): load_weights body
- TODO(phase1-kernel): sparse_attn_v4 wiring to NSA tilelang
- TODO(phase1-nsa): plug NSAIndexer into V4Attention.indexer (CSA path)
- TODO(phase1-tp): swap nn.Linear -> ColumnParallelLinear / RowParallelLinear
- TODO(phase1-fp4): V4-aware FP4 Linear that keeps experts packed
- TODO(phase1-quant): act_quant calls for FP8 simulation in V4Attention
Phase 1 file at this commit: ~1300 lines. End-to-end forward path plumbed; runs out of stubs only at sparse_attn_v4 and load_weights.
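The capture-before-layer loop and the post-loop trap from the forward description above can be sketched in plain Python. This is a toy under stated assumptions: layers are ordinary callables, hidden states are plain numbers, the mean over hc_mult copies is omitted, and `run_with_aux_capture` and `map_layer_ids` are hypothetical names rather than sglang functions.

```python
from typing import Callable, List, Sequence, Set, Tuple


def map_layer_ids(layer_ids: Sequence[int]) -> Set[int]:
    """Apply the +1 offset described above: capturing the INPUT of layer i+1
    is the same as capturing the OUTPUT of layer i."""
    return {i + 1 for i in layer_ids}


def run_with_aux_capture(
    h: float,
    layers: List[Callable[[float], float]],
    capture_ids: Set[int],
) -> Tuple[float, List[float]]:
    aux: List[float] = []
    for i, layer in enumerate(layers):
        if i in capture_ids:
            aux.append(h)  # capture the layer's input, before it runs
        h = layer(h)
    # Post-loop trap: index == len(layers) is never reached inside the loop,
    # so the final (pre-norm) hidden state is captured here instead.
    if len(layers) in capture_ids:
        aux.append(h)
    return h, aux


if __name__ == "__main__":
    layers = [lambda x: x + 1 for _ in range(28)]  # toy 28-layer stack
    final, aux = run_with_aux_capture(0, layers, map_layer_ids([0, 14, 27]))
    print(final, aux)
    print(3 * 4096)  # aux feature dim for three captured layers of width 4096
```

Without the post-loop branch, the mapped index 28 would silently yield only two captured states instead of three.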
Jiminator
added a commit
to Jiminator/sglang
that referenced
this pull request
May 2, 2026
…very
Two cross-call data-corruption bugs surfaced during review where the streaming FSM diverged from the regex non-streaming path on malformed inputs. Both fixes add explicit recovery branches in READING_VALUE and require a small regex-side tightening to keep the two paths in sync.
P2 sgl-project#1: malformed `<tool_call>fn\n<arg_key>K</arg_key></tool_call>` (max-tokens cutoff after `</arg_key>` before `<arg_value>`) left the streaming FSM stuck in READING_VALUE. The bare-`<` discard ate `</tool_call>` byte-by-byte instead of recognizing it, and a *subsequent* tool call's `<arg_value>` was mis-attributed to the orphan `current_pending_key`, silently swallowing the second call's name. Recovery: handle `</tool_call>` in READING_VALUE by closing the active call (orphan key dropped) via the existing `_close_current_call` helper.
P2 sgl-project#2: malformed `<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V` (model emitted a key, then re-emitted a new key without a value for the first) bound V to the stale K1, a wrong-argument corruption. Recovery: handle `<arg_key>` in READING_VALUE by replacing the orphan `current_pending_key` with the new one and staying in READING_VALUE.
Regex tightening: `arg_pair_regex` key portion changed from `(.*?)` to `([^<]*?)`. The non-greedy `.*?` was backtracking across `</arg_key>` boundaries on the malformed-key shape above, producing a junk key spanning both `<arg_key>` tags (e.g. `"K1</arg_key><arg_key>K2": V`). The `[^<]` constraint blocks the backtrack. Param names never contain `<` in practice; the value side keeps `.*?` because legitimate values can contain `<` (HTML, paths, etc.).
Both paths now produce the intuitive `{"K2": V}` for the malformed input. Locked in by 4 regression tests (non-streaming + streaming for each P2 shape). Helper docstrings extended to document the dual call sites.
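The effect of the regex tightening described above can be reproduced with a small sketch. The two patterns below are hypothetical reconstructions that differ only in the key group, as the commit message describes; the real `arg_pair_regex` may include more context than shown here.

```python
import re

# Hypothetical patterns: only the key group differs, (.*?) versus ([^<]*?).
LOOSE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)
TIGHT = re.compile(
    r"<arg_key>([^<]*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_args(pattern: "re.Pattern[str]", text: str) -> dict:
    """Collect key/value pairs found by the given pattern."""
    return {key.strip(): value for key, value in pattern.findall(text)}


# Model emitted K1 without a value, then a complete K2/V pair.
malformed = "<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V</arg_value>"

if __name__ == "__main__":
    print(parse_args(LOOSE, malformed))  # junk key spanning both tags
    print(parse_args(TIGHT, malformed))  # only the K2 pair
```

The value group stays `.*?` in both patterns, so a value such as `<b>x</b>` still parses correctly with the tightened key.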
JohnQinAMD
added a commit
to JohnQinAMD/sglang-amd
that referenced
this pull request
May 3, 2026
…g is the cause
Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).
Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:
Relative diff (kernel vs torch ref):
min: 0.0000%
median: 0.2232%
mean: 0.2129%
p99: 0.4464%
max: 0.4464%
Calls with rel > 0.5%: 0/560
Calls with rel > 1.0%: 0/560
Calls with rel > 5.0%: 0/560
All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.
DEFINITIVE CONCLUSION on Layer-3:
The residual e2e regression is hypothesis sgl-project#1 from the user's list:
cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
= 1800 calls per generated sequence per worker. Each call is within
bf16 floor; the cumulative drift exceeds the model's training-time
robustness envelope.
Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
calls in the 560-sample distribution.
Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
in tier0 (no cuda graph) and still showed garbage e2e despite per-call
diffs being uniform and small.
Path forward (unchanged from previous commit):
(b) Model finetune with kernel's specific bit pattern — out of kernel
team scope.
(d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.
Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.
This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
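As a rough illustration of the per-call comparison this commit message describes (cosine similarity plus maximum difference, logged to a CSV), here is a dependency-free sketch. The helper names and CSV columns are assumptions for illustration; the actual hook operates on the kernel's tensors and calls ref_sparse_attn_decode, which is not reproduced here.

```python
import csv
import io
import math
from typing import Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else float("nan")


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def log_call(writer, call_id: int, kernel_out, ref_out) -> None:
    """Append one row per call: id, cos_sim, max_diff (hypothetical columns)."""
    writer.writerow(
        [call_id, cos_sim(kernel_out, ref_out), max_abs_diff(kernel_out, ref_out)]
    )


if __name__ == "__main__":
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["call_id", "cos_sim", "max_diff"])
    log_call(writer, 0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    log_call(writer, 1, [1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
    print(buf.getvalue())
    print(60 * 30)  # dependent calls per generated sequence quoted above
```

A per-call log of this kind can rule out outlier calls, but it cannot by itself measure how small per-call differences accumulate across dependent calls.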
JohnQinAMD
added a commit
to JohnQinAMD/sglang-amd
that referenced
this pull request
May 3, 2026
…crobench
The fused-rmsnorm-quant kernel (`fused_rmsnorm_per1x128_quant`, `fused_add_rmsnorm_per1x128_quant` + dual-output variants) replaces the unfused 2-launch chain (aiter add_rmsnorm + dynamic_per_group_scaled_quant) at the q_norm/kv_norm fp8-input sites.
Key correctness invariant: `MATCH_BF16_PRODUCTION: tl.constexpr` (default True via wrapper kwarg) round-trips `normed` through bf16 in registers before the per-block fp8 quant. Without it the fp8 codepoints drift by 1 ULP at the saturation boundary in ~2.2% of elements (vs the unfused production path which materializes bf16 in HBM between rmsnorm and quant); this compounds across 43 layers into garbage-token logit drift. This was the root cause of the prior A2-sgl-project#1 +22 ms TPOT regression with degenerate sampling.
Microbench (v2 framework, 5 production shapes, chi2811):
EAGER: torch 116.58 us → triton 19.75 us = 5.90x
GRAPH-REPLAY: torch 63.02 us → triton 9.62 us = 6.55x
Correctness PASS at all 5 shapes (q_max_diff = 1 fp8 ULP at sat)
PMC: VGPR=20, AGPR=0, LDS=0, scratch=0
The Mode A wire (`SGLANG_F4_MODE_A=1` env-gate at MQALayer._forward_prepare, already committed in 0ffd49e) calls the q_norm site only; Mode B (input_layernorm fan-out at L2519) is NOT wired and remains the A2-sgl-project#1 failure surface; do not enable without a separate validation campaign.
Live A/B on chi2811 (TPOT 31.21 OFF vs 31.33 ON, +0.12 ms within noise): the Mode A site is launch-count-neutral vs the existing `_aiter_fused_qk_rmsnorm` baseline (replaces qk-norm fusion with kv_norm + fused_q_quant; ~0.15 ms/step microbench prediction in noise). Default OFF; primitive remains available for Mode B follow-up.
Files:
python/sglang/srt/layers/quantization/fused_rmsnorm_quant.py (kernel)
microbench/triton_port_v2/bench_f4_rmsnorm_quant.py (v2 microbench)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
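To make the bf16 round-trip in the correctness invariant above concrete, the following sketch rounds a float32 value to bf16 precision with round-to-nearest-even and returns it as a Python float. It is a CPU illustration only, assumes finite inputs, and uses a hypothetical helper name; the Triton kernel performs the equivalent cast in registers.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a finite value to bf16 precision (8 exponent bits, 7 mantissa bits).

    bf16 keeps the top 16 bits of the float32 encoding, so rounding is done
    on the raw bits with round-to-nearest, ties-to-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF, plus 1 more when the retained LSB is odd, then truncate.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 1.00390625, 1.005, 1.01171875):
        print(value, "->", round_to_bf16(value))
```

Near 1.0 the bf16 spacing is 2^-7, so a value that has or has not passed through this rounding can land on opposite sides of a quantization boundary in a later fp8 step, which is the kind of 1-ULP codepoint drift the commit message attributes to skipping the round-trip.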
SammLSH
added a commit
to SammLSH/sglang
that referenced
this pull request
May 4, 2026
Drops the sglang-native session.start/.end + binary-PCM-frame protocol that landed in M1 and replaces it with the OpenAI Realtime transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription). Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.
Wire protocol (JSON only, no binary frames):
client -> session.update {session.type=transcription, audio.input.{format, sample_rate, transcription.{model,language}, noise_reduction, turn_detection}}
       -> input_audio_buffer.append {audio: base64-PCM16-LE}
       -> input_audio_buffer.commit
       -> input_audio_buffer.clear
server -> session.created / session.updated
       -> input_audio_buffer.committed {item_id, previous_item_id}
       -> input_audio_buffer.cleared
       -> conversation.item.created {previous_item_id, item}
       -> conversation.item.input_audio_transcription.delta
       -> conversation.item.input_audio_transcription.completed
       -> conversation.item.input_audio_transcription.failed
       -> error {error: {type, code, message, param}}
sglang-specific deltas vs the spec, all documented in the module docstring:
* audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz internally via librosa before feeding the model.
* Server-side VAD is not implemented; turn_detection != null is rejected with vad_not_supported. Clients must commit explicitly.
* noise_reduction != null is rejected; include[] is silently dropped.
* Deltas stream continuously as audio is appended (one inference per chunk_size_sec of new audio, anchored by the previously emitted prefix). Clients do not need to commit to start receiving deltas; commit only finalizes the turn and emits the committed/item.created/completed triplet, then resets state for the next turn within the same session.
* audio.input.transcription.model stays echo-only per the existing sglang single-model design; multi-model routing belongs upstream.
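The client half of the wire protocol listed above might be built as follows. This is a sketch under assumptions: the nesting follows the field paths in the commit message, while the concrete `format` value, the model name, and the helper names are placeholders that are not confirmed by the source.

```python
import base64
import json
import struct
from typing import Optional, Sequence


def session_update(model: str, language: Optional[str], sample_rate: int = 16000) -> dict:
    """session.update event; field paths follow the commit message."""
    return {
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {
                "input": {
                    "format": "audio/pcm",  # assumed value and shape
                    "sample_rate": sample_rate,  # described as a sglang extension
                    "transcription": {"model": model, "language": language},
                    "noise_reduction": None,  # non-null is rejected per the message
                    "turn_detection": None,  # no server-side VAD
                }
            },
        },
    }


def append_event(samples: Sequence[int]) -> dict:
    """input_audio_buffer.append carrying base64-encoded little-endian PCM16."""
    pcm = struct.pack("<%dh" % len(samples), *samples)
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    }


def commit_event() -> dict:
    return {"type": "input_audio_buffer.commit"}


if __name__ == "__main__":
    events = [
        session_update("example-asr-model", "en"),  # placeholder model name
        append_event([0, 1, -1, 32767]),
        commit_event(),
    ]
    for event in events:
        print(json.dumps(event))
```

Each dict would be sent as one JSON text frame over the WebSocket, since the protocol is described as JSON only with no binary frames.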
Reviewer-requested changes also bundled in:
* sgl-project#1 (encapsulation): handle_realtime_transcription now takes tokenizer_manager, adapter, server_args, and session_semaphore as explicit kwargs; the WS module never reaches into OpenAIServingTranscription privates.
* sgl-project#4 (type hints): all new functions and dataclasses are fully annotated.
* sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default 32). Excess connections are accepted, sent error{code: too_many_sessions}, and closed.
Out-of-scope follow-ups (TODO in module docstring):
* sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept pre-decoded ndarrays; punted to a separate PR.
Test refresh in test/manual/models/test_qwen3_asr.py:
* _stream_websocket_async rewritten to drive the new protocol (session.update -> append events with base64 -> commit -> drain delta + committed + item.created + completed).
* 19/19 tests pass, ~52.7s, stable across 5 consecutive runs (/tmp/asr_openai_run1..5.log).
SammLSH
added a commit
to SammLSH/sglang
that referenced
this pull request
May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from the M1 session.start/binary-PCM protocol to OpenAI's Realtime transcription wire format. The shared inference driver is untouched, so HTTP SSE and WS still produce byte-identical transcripts; this is purely a transport rewrite. sglang deviations from the spec live in the module docstring: sample_rate is a sglang extension accepting 16/24/48 kHz with internal resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and noise_reduction must be null (no server-side VAD), include[] is dropped, model is echo-only. Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription), sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32). sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's input contract.
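The deviations listed above amount to a small validation step on `session.update`. The sketch below shows one way that check might look; it is not the module's actual code. The function names, the `invalid_request_error` type string, the 24 kHz fallback when `sample_rate` is absent, and every error code other than `vad_not_supported` are hypothetical.

```python
from typing import Any, Dict, Optional

# Rates the commit message says are accepted before resampling to 16 kHz.
SUPPORTED_SAMPLE_RATES = (16000, 24000, 48000)


def error_event(code: str, message: str, param: Optional[str] = None) -> Dict[str, Any]:
    """Build an error event in the {error: {type, code, message, param}} shape."""
    return {
        "type": "error",
        "error": {
            "type": "invalid_request_error",  # hypothetical type string
            "code": code,
            "message": message,
            "param": param,
        },
    }


def validate_audio_input(audio_input: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an error event if the audio.input config is unsupported, else None."""
    if audio_input.get("turn_detection") is not None:
        # Code name taken from the commit message.
        return error_event("vad_not_supported",
                           "Server-side VAD is not implemented; commit explicitly.",
                           "audio.input.turn_detection")
    if audio_input.get("noise_reduction") is not None:
        # Hypothetical code name; the source only says the value is rejected.
        return error_event("noise_reduction_not_supported",
                           "noise_reduction must be null.",
                           "audio.input.noise_reduction")
    # Assumed fallback: the OpenAI audio/pcm default of 24 kHz.
    rate = audio_input.get("sample_rate", 24000)
    if rate not in SUPPORTED_SAMPLE_RATES:
        # Hypothetical code name.
        return error_event("unsupported_sample_rate",
                           f"sample_rate must be one of {SUPPORTED_SAMPLE_RATES}.",
                           "audio.input.sample_rate")
    return None
```

Returning an event dict rather than raising keeps the handler free to send the error and leave the session open, which is one possible design; the source does not say whether a rejected `session.update` closes the connection.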
This was referenced May 4, 2026
JohnQinAMD
added a commit
to JohnQinAMD/sglang-amd
that referenced
this pull request
May 4, 2026
…ional descriptions

Comment-only cleanup. Replaces 14 internal-nickname references (Phase 24, A2-sgl-project#1, A2-sgl-project#2, MEGA-3', Phase A1, Phase 13, etc.) with descriptive functional explanations of the surrounding code. No code semantics or behavior change. Equivalent via diff filtered to non-comment lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnQinAMD
added a commit
to JohnQinAMD/sglang-amd
that referenced
this pull request
May 4, 2026
Comment-only cleanup. Replaces 14 internal-nickname references (Phase 24, A2-sgl-project#1, A2-sgl-project#2, MEGA-3', Phase A1, Phase 13, etc.) with descriptive functional explanations of the surrounding code. No code semantics or behavior change. Equivalent via diff filtered to non-comment lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangzhilin-hzl
pushed a commit
to huangzhilin-hzl/sglang
that referenced
this pull request
May 8, 2026
6 tasks
No description provided.