Add flashinfer && Oultines by hnyls2002 · Pull Request #1 · sgl-project/sglang

hnyls2002 · 2024-01-08T05:29:51Z

No description provided.

Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>

* Fix ut mla-test-1-gpu-amd (sgl-project#4813) Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> * Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (sgl-project#4638) * [k8s] Clarified the usage of shared memory. (sgl-project#4341) * gemma3: impl `get_attention_sliding_window_size` for attn init (sgl-project#4823) * add partial_json_parser and einops (sgl-project#4827) * fix the release doc dependency issue (sgl-project#4828) * Update doc for DeepSeek-V3-0324 (sgl-project#4825) * deps: lazy import optional dependencies `gguf` and `torchvision` (sgl-project#4826) * Update MMMU Benchmark instructions (sgl-project#4694) * Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` (sgl-project#4830) * Basic Cleanup (sgl-project#4833) * Support (1 <= dp < tp) in the dp attention in DeepEP (sgl-project#4770) Co-authored-by: Cheng Wan <cwan39@gatech.edu> * [Fix] Add compressed_tensors as deps (sgl-project#4819) * Fix error due to CustomAllreduce setup failure (sgl-project#4815) Signed-off-by: Kebe <mail@kebe7jun.com> * use default for torch.ops (sgl-project#4835) * [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (sgl-project#3969) * [Misc] Fix issues reported by torchfix (sgl-project#4837) * Include context length in /v1/models response. 
(sgl-project#4809) * [Fix] `self.worker` assignment in `TpModelWorker` and refactor references (sgl-project#4788) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * Fix the lora adapter when lora path is none (sgl-project#4799) Co-authored-by: Beichen Ma <mabeichen12@gmail.com> * fix: fix typo of comments in w8a8_fp8.py (sgl-project#4843) * Remove retry in nightly tests (sgl-project#4846) * Fix CI of test_patch_torch (sgl-project#4844) * IPv6 support (sgl-project#3949) Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> * ci: add condition for daily docker build (sgl-project#4487) * [Fix] fix output_top_logprobs is not exist (sgl-project#4597) * fix: when use SGLANG_PORT this env,port is str (sgl-project#4528) Signed-off-by: rongfu.leng <lenronfu@gmail.com> * Support Page Size > 1 for FA3 (sgl-project#4832) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * Fix Engine error when enabling DP attention (sgl-project#4648) * fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest (sgl-project#4681) * Support controlling nsys start and end range programmatically (sgl-project#4688) * Remove empty tool function name (sgl-project#4704) Signed-off-by: Kebe <mail@kebe7jun.com> * Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. 
(sgl-project#4712) * get the python version from env (sgl-project#4729) * Fix torch.cuda.MemPool() internal assertion failure (sgl-project#4687) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * Super tiny remove unused code (sgl-project#4750) * Support with_stack and record_shapes in profiler (sgl-project#4740) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840) * Fix CI tests (sgl-project#4853) * Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855) * Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863) * [Feature] add multi-rank support for Lora (sgl-project#4492) Co-authored-by: rudy152 <czh1137892874@gmail.com> * Clean up `import vllm` in quantization/__init__.py (sgl-project#4834) * Fix wrong variable name when stopping memory profile (sgl-project#4772) * [Feat] support deepgemm for cmake (sgl-project#4864) * Make torch compile configurable for biased_grouped_topk (sgl-project#4749) * update sgl-kernel test ci (sgl-project#4866) * fix sampling issue (sgl-project#4871) * bump sgl-kernel 0.0.5.post4 (sgl-project#4768) * fix sgl-kernel cu118 build (sgl-project#4872) * [Feature] Support FA3 backend for MLA (sgl-project#4831) * upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873) * update torch compile doc (sgl-project#4874) * bump v0.4.4.post3 (sgl-project#4878) * Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882) * Improve stack trace of retry errors (sgl-project#4845) * Tiny fix doc error (sgl-project#4795) * [Docs] Update DeepGEMM at README.md (sgl-project#4886) * Update CODEOWNERS (sgl-project#4889) * Delete test_deep_gemm.py (sgl-project#4891) * Add deepseek style fused moe group gate selection kernel (sgl-project#4530) * quick fix: add default for new kernel (sgl-project#4898) * remove setup for sgl-kernel (sgl-project#4899) * [Misc] Clean m.def and add Development Tips 
(sgl-project#4890) * fix allreduce test (sgl-project#4909) * Support page size > 1 + eagle (sgl-project#4908) * Fix retract for page size > 1 (sgl-project#4914) * [Feature] use pytest for sgl-kernel (sgl-project#4896) * fix bmm fp8 (sgl-project#4926) * Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927) * Fix 2-gpu CI test and suppress some warnings (sgl-project#4930) * [feat] add fa3 in sgl-kernel (sgl-project#4902) Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> * Fix sglang frontend's incorrect dependency on torch (sgl-project#4931) * [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932) * cleanup sgl-kernel (sgl-project#4933) * [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925) * Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883) Co-authored-by: ch-wan <cwan39@gatech.edu> * [Fix] Add torch compile for torch.clamp back (sgl-project#4936) * Fix oom error for large page size (sgl-project#4913) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * [feat] interface for platforms abstraction (sgl-project#4928) * [Fix] revert clean m.def for cudagraph (sgl-project#4944) * refactor: multimodal data (sgl-project#4754) * bump sgl-kernel v0.0.6 (sgl-project#4950) * [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953) * use fa3 in sgl-kernel (sgl-project#4954) * Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959) * [Feature] Support DeepEP Low Latency (sgl-project#4767) Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: ch-wan <cwan39@gatech.edu> * update bench_serving (sgl-project#4958) * Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977) * [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: GeLee <leege233@gmail.com> * Large page size aligned hierarchical caching 
(sgl-project#4581) * bug fix for hicache host eviction (sgl-project#4989) * sgl scaled_fp8_quant support output padding (sgl-project#4861) * Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com> * Update tokenizer_manager.py (sgl-project#5008) * [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817) * update cutlass tag (sgl-project#5011) * Feature/revise docs ci (sgl-project#5009) * fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727) Co-authored-by: yuethe <yuethe@tencent.com> * [Build] Support build sgl-kernel with ccache (sgl-project#5020) * fix deepgemm as well (sgl-project#5030) * try to fix ci oserror (sgl-project#5024) * Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005) * Small refactor DeepEPMode to clean up code a bit (sgl-project#4992) * [Fix] fix fa3 build at cu118 (sgl-project#5036) * Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048) * bump sgl-kernel v0.0.7 (sgl-project#5046) * update eagle-3 docs (sgl-project#4796) Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> * Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> * Update the retry count (sgl-project#5051) * upgrade sgl-kernel v0.0.7 (sgl-project#5049) * [2/3] fix dsv3 awq issue (sgl-project#4625) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> * Feature/revise docs ci (sgl-project#5056) * Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057) * [fix] remove `cuda_device_count_stateless` (sgl-project#5060) * Small refactor DeepEPDispatcher into subclasses (sgl-project#4994) * Support async DeepEP by splitting into two stages (sgl-project#4995) * Cleanup 
unused resources after DeepEP operation (sgl-project#4996) * Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918) * [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072) * fix dummy-load deepseekv2 (sgl-project#4535) * support sgl-kernel on blackwell (sgl-project#5074) * FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Chunan Zeng <zcnrex@gmail.com> * [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052) * upgrade transformers 4.51.0 (sgl-project#5088) * sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079) * bump sgl-kernel 0.0.8 (sgl-project#5089) * python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080) * bump v0.4.4.post4 (sgl-project#5091) * Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097) Co-authored-by: shuaills <shishuaiuoe@gmail.com> * Add Llama4 support (sgl-project#5092) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@163.com> * Fix refactor error - fp8.py (sgl-project#5106) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * bump v0.4.5 (sgl-project#5117) * Workaround for async copy issue in HPU eager mode (sgl-project#1) Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> * [SW-223847]: Fix sgl_kernel module not available (sgl-project#2) Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> * [Base] Enable torch compile (sgl-project#4) * [SW-226331] disable dynamic shape in torch compile mode Signed-off-by: Mohit Sinha <msinha@habana.ai> --------- Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Brayden Zhong 
<b8zhong@uwaterloo.ca> Signed-off-by: rongfu.leng <lenronfu@gmail.com> Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: AinL <gmlwns5176@gmail.com> Co-authored-by: Jiří Suchomel <jiri.suchomel@statsperform.com> Co-authored-by: Juwan Yoo <ryan@tmfi.us> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Daniel Holanda <holand.daniel@gmail.com> Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com> Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Jon Durbin <jon@jondurbin.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Qiaolin Yu <qy254@cornell.edu> Co-authored-by: Beichen Ma <mabeichen12@gmail.com> Co-authored-by: Jiaqi <57028284+ZhuJiaqi9905@users.noreply.github.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com> Co-authored-by: warjiang <1096409085@qq.com> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: rongfu.leng <lenronfu@gmail.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: BroadbentJim <BroadbentJim@users.noreply.github.com> Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> Co-authored-by: DavidChan <chengwei0519@163.com> Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com> Co-authored-by: rudy152 
<czh1137892874@gmail.com> Co-authored-by: Fr4nk1in <sh.fu@outlook.com> Co-authored-by: yinfan98 <1106310035@qq.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> Co-authored-by: SEPLOS <seplos@aliyun.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: GeLee <leege233@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: zcnrex <zcnrex@gmail.com> Co-authored-by: Kaiyu Yang <yangky@umich.edu> Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com> Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com> Co-authored-by: yuethe <yuethe@tencent.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: Tommy Yang <tommyyang0524@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Rahul Vijayaraghavan <rahul.vijayaraghavan@intel.com> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Jay Thakur <jthakur@habana.ai> Co-authored-by: 
Anshuman Tripathy <atripathy@habana.ai>

The fused-autodetect path sets skip_special_tokens=False so the language prefix survives for parsing, but that also lets the detokenizer preserve downstream special tokens into the user-visible text:

* trailing <|endoftext|> after the last word
* embedded <|X.XX|> segment-boundary tokens in the timestamps variant

Both were flagged by the reviewer in the original comment sgl-project#1 along with the prefix leak. Prior commits fixed the prefix; this one closes the tail.

* parse_fused_output now does a defensive re.sub("<\\|[^|]+\\|>") on the post-sentinel transcription. verbose_json segment timing is unaffected: it's built from output_ids via _parse_segments, a separate path that already uses skip_special_tokens=True during per-token decoding.
* New WhisperAdapter.strip_special_tokens helper exposed via the base adapter (identity default) so the streaming handler stays model-agnostic. _generate_transcription_stream calls it per-delta in fused mode.
* 4 new unit tests cover trailing <|endoftext|>, embedded <|X.XX|>, the strip helper in isolation, and the streaming per-delta scrub. Updated the FSM-abort streaming test: a tail that's nothing but forced-prefix specials now collapses to no emitted delta (warning still logged, language stays unset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
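The scrub described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the regex is the one quoted in the message, while `scrub_deltas` is a hypothetical stand-in for the per-delta streaming call, and the sketch assumes each special token arrives whole within a single delta.

```python
import re
from typing import Iterable, Iterator

# Pattern quoted in the commit message: "<|", one or more non-"|" chars, "|>".
SPECIAL_TOKEN_RE = re.compile(r"<\|[^|]+\|>")


def strip_special_tokens(text: str) -> str:
    """Remove Whisper-style special tokens such as <|endoftext|> or <|1.20|>."""
    return SPECIAL_TOKEN_RE.sub("", text)


def scrub_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Hypothetical streaming helper: scrub each delta and skip empty results."""
    for delta in deltas:
        cleaned = strip_special_tokens(delta)
        if cleaned:  # a delta made only of special tokens emits nothing
            yield cleaned


print(strip_special_tokens("hello world<|endoftext|>"))
print(strip_special_tokens("<|0.00|>hello<|1.20|> world<|2.50|>"))
print(list(scrub_deltas(["<|en|><|transcribe|>", "hello", " world<|endoftext|>"])))
```

A token split across two deltas would slip past a per-delta scrub like this one; whether the real handler buffers for that case is not stated in the message.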

adapt prefix + mtp on ascend gdn backend

…32p)

Direct port of V4 reference Transformer (inference/model.py lines 769-809) with Eagle3 aux-hidden-state capture grafted on per the qwen2.py:652 post-loop trap pattern documented in wiki/pipeline.md known-pitfalls.

__init__:

- Embed, layers list (43 DeepseekV4DecoderLayer), final_norm RMSNorm parameters allocated.
- Top-level mHC head reduction parameters (hc_head_fn, hc_head_base, hc_head_scale per V4 ref lines 794-799). These fold the [b, s, hc_mult, d] hidden state into [b, s, d] before lm_head.
- LM head (nn.Linear hidden_size -> vocab_size, no bias).
- MTP block list. Per CLAUDE.md rule sgl-project#12 + architecture-notes.md "Eagle3 vs native MTP", MTP is loaded for weight-key compat but NOT executed at Eagle3 inference. Allocated as ModuleList of nn.Identity placeholders. The Eagle3 draft head replaces MTP.

forward:

- Embed -> hc_mult-copy expand -> layer loop -> post-loop trap -> hc-head fold -> final norm -> lm_head[..., -1] (last-token logits matching V4 reference get_logits).
- Aux capture grafted IN the loop: when layer index i is in self._eagle3_layers_to_capture, append h.mean(dim=-2) (fold over the hc_mult copies to get [B, T, d]) BEFORE the layer runs. This captures the layer's INPUT, matching the V3.2/qwen2.py convention.
- Post-loop trap: when end_layer (= num_hidden_layers) is in the configured capture set, capture the final hidden state. This handles the qwen2.py:652 pattern where set_eagle3_layers_to_capture maps layer_ids [0, 14, 27] -> [1, 15, 28] (+1 offset). The last layer's mapped index (=num_hidden_layers) is unreachable inside the for-loop; the post-loop trap captures it before normalization, matching the pre-norm convention used for the other aux states.
- Aux feature dim per captured layer: hidden_size (mean over hc_mult copies). For V4-Flash slot triple [1, 21, 41]: 3 layers x 4096 dim = 12288 aux feature dim, matching SpecForge draft expectation.

load_weights: still raises NotImplementedError, with a detailed TODO spelling out the 5 sub-tasks (key map, FP4 expert dequant, FP8 non-expert handling, tid2eid table, MTP weights). Pattern follows deepseek_common/deepseek_weight_loader.py with FP4 additions.

What this enables (when load_weights lands):

- Random-init forward pass for shape testing (works today, modulo sparse_attn_v4 stub raising)
- Real-checkpoint forward pass for numerical agreement testing vs HF reference
- Eagle3 training data generation via sglang as the target backend (per CLAUDE.md rule sgl-project#1: --target-model-backend sglang)

Open TODOs (still on Phase 1 in flight):

- TODO(phase1-loader): load_weights body
- TODO(phase1-kernel): sparse_attn_v4 wiring to NSA tilelang
- TODO(phase1-nsa): plug NSAIndexer into V4Attention.indexer (CSA path)
- TODO(phase1-tp): swap nn.Linear -> ColumnParallelLinear / RowParallelLinear
- TODO(phase1-fp4): V4-aware FP4 Linear that keeps experts packed
- TODO(phase1-quant): act_quant calls for FP8 simulation in V4Attention

Phase 1 file at this commit: ~1300 lines. End-to-end forward path plumbed; runs out of stubs only at sparse_attn_v4 and load_weights.
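The capture-index logic described above (the +1 offset plus the post-loop trap) can be illustrated with plain Python. This is a shape-free sketch under stated assumptions: `map_capture_layers` and `capture_points` are hypothetical names, no tensors are involved, and the layer count of 28 is only what the [0, 14, 27] -> [1, 15, 28] example implies.

```python
from typing import List, Sequence, Tuple


def map_capture_layers(layer_ids: Sequence[int]) -> List[int]:
    """Apply the +1 offset described for set_eagle3_layers_to_capture."""
    return [i + 1 for i in layer_ids]


def capture_points(num_hidden_layers: int, capture_set: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (index, what was captured) in the order captures would occur."""
    wanted = set(capture_set)
    captured: List[Tuple[int, str]] = []
    for i in range(num_hidden_layers):
        if i in wanted:
            # Captured BEFORE layer i runs, i.e. the layer's input.
            captured.append((i, "input of layer"))
        # ... layer i would run here ...
    # Post-loop trap: index == num_hidden_layers never occurs inside the loop.
    if num_hidden_layers in wanted:
        captured.append((num_hidden_layers, "final hidden state, pre-norm"))
    return captured


mapped = map_capture_layers([0, 14, 27])
print(mapped)
print(capture_points(28, mapped))
print(3 * 4096)  # aux feature dim for three captured layers of width 4096
```

Without the post-loop branch, the last mapped index would silently produce no capture and the aux feature dimension would fall short by one hidden_size.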

…very

Two cross-call data-corruption bugs surfaced during review where the streaming FSM diverged from the regex non-streaming path on malformed inputs. Both fixes add explicit recovery branches in READING_VALUE and require a small regex-side tightening to keep the two paths in sync.

P2 sgl-project#1: malformed `<tool_call>fn\n<arg_key>K</arg_key></tool_call>` (max-tokens cutoff after `</arg_key>` before `<arg_value>`) left the streaming FSM stuck in READING_VALUE. The bare-`<` discard ate `</tool_call>` byte-by-byte instead of recognizing it, and a *subsequent* tool call's `<arg_value>` was mis-attributed to the orphan `current_pending_key`, silently swallowing the second call's name. Recovery: handle `</tool_call>` in READING_VALUE by closing the active call (orphan key dropped) via the existing `_close_current_call` helper.

P2 sgl-project#2: malformed `<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V` (model emitted a key, then re-emitted a new key without a value for the first) bound V to the stale K1, a wrong-argument corruption. Recovery: handle `<arg_key>` in READING_VALUE by replacing the orphan `current_pending_key` with the new one and staying in READING_VALUE.

Regex tightening: `arg_pair_regex` key portion changed from `(.*?)` to `([^<]*?)`. The non-greedy `.*?` was backtracking across `</arg_key>` boundaries on the malformed-key shape above, producing a junk key spanning both `<arg_key>` tags (e.g. `"K1</arg_key><arg_key>K2": V`). The `[^<]` constraint blocks the backtrack. Param names never contain `<` in practice; the value side keeps `.*?` because legitimate values can contain `<` (HTML, paths, etc.).

Both paths now produce the intuitive `{"K2": V}` for the malformed input. Locked in by 4 regression tests (non-streaming + streaming for each P2 shape). Helper docstrings extended to document the dual call sites.
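The effect of the key-group tightening can be reproduced in isolation. In the sketch below only the key group (`(.*?)` versus `([^<]*?)`) comes from the commit message; the rest of the pattern, including the closing `</arg_value>` and the `\s*` between the pair, is an assumed shape for illustration.

```python
import re

# Assumed overall shape of the pair pattern; only the key group differs.
LOOSE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)
TIGHT = re.compile(r"<arg_key>([^<]*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)

# Malformed shape from the message: a key re-emitted before any value arrives.
malformed = "<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V</arg_value>"

# The lazy ".*?" key expands across the first </arg_key> to make the match succeed.
print(LOOSE.findall(malformed))

# "[^<]" cannot cross a tag, so the match starts at the second <arg_key>.
print(TIGHT.findall(malformed))

# Values may still contain "<" because the value group keeps ".*?".
html_value = "<arg_key>body</arg_key><arg_value><b>hi</b></arg_value>"
print(TIGHT.findall(html_value))
```

The tightened pattern drops the orphan K1 instead of gluing it to K2, which is the same outcome the message describes for the streaming recovery branch.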

…g is the cause

Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode on the SAME tensors the kernel just consumed and logs per-call cos_sim + max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run (prior 8-capture saved-tensor test was 0.7% of production call space).

Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1 + shadow-ref enabled. Result across 560 production single_shot calls:

Relative diff (kernel vs torch ref):

* min: 0.0000%
* median: 0.2232%
* mean: 0.2129%
* p99: 0.4464%
* max: 0.4464%

* Calls with rel > 0.5%: 0/560
* Calls with rel > 1.0%: 0/560
* Calls with rel > 5.0%: 0/560

All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP relative diff. NO single call has a catastrophic delta. Yet e2e produces garbage tokens.

DEFINITIVE CONCLUSION on Layer-3: The residual e2e regression is hypothesis sgl-project#1 from the user's list: cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens = 1800 calls per generated sequence per worker. Each call is within bf16 floor; the cumulative drift exceeds the model's training-time robustness envelope.

Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT; no outlier calls in the 560-sample distribution.

Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT; shadow-ref ran in tier0 (no cuda graph) and still showed garbage e2e despite per-call diffs being uniform and small.

Path forward (unchanged from previous commit):

(b) Model finetune with kernel's specific bit pattern; out of kernel team scope.
(d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.

Layer-3 stopgap (ca6f419) remains the production correctness fix. The kernel + diagnostic infrastructure (now including shadow-ref) is at its best-attainable bf16-precision-equivalent state.

This commit closes the Layer-3 root-cause investigation. The "why is it still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT a cache bug, NOT a graph bug. It IS cumulative ULP compounding across 1800 dependent calls, fundamentally a model-tolerance issue against the kernel's bit-equivalent-but-bit-different output bit pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
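The per-call metrics named in this message (cos_sim, max_diff, and counts over a relative-diff threshold) might look roughly like the sketch below. It is illustrative only: it uses plain Python lists rather than tensors, the function names are hypothetical, and how the commit actually normalizes its "relative diff" is not stated, so only the thresholding step is shown.

```python
import math
from typing import Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def count_over(rel_diffs_percent: Sequence[float], threshold_percent: float) -> int:
    """How many calls exceed a relative-diff threshold (both in percent)."""
    return sum(1 for r in rel_diffs_percent if r > threshold_percent)


kernel_out = [1.0, 2.0, 3.0]
ref_out = [1.0, 2.0, 3.0078125]  # differs by one bf16 step near 3.0
print(cos_sim(kernel_out, ref_out), max_diff(kernel_out, ref_out))

# Summary values quoted in the message, checked against the 0.5% threshold.
print(count_over([0.0, 0.2232, 0.2129, 0.4464], 0.5))

# Dependent calls per generated sequence per worker, as stated in the message.
print(60 * 30)
```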

…crobench

The fused-rmsnorm-quant kernel (`fused_rmsnorm_per1x128_quant`, `fused_add_rmsnorm_per1x128_quant` + dual-output variants) replaces the unfused 2-launch chain (aiter add_rmsnorm + dynamic_per_group_scaled_quant) at the q_norm/kv_norm fp8-input sites.

Key correctness invariant: `MATCH_BF16_PRODUCTION: tl.constexpr` (default True via wrapper kwarg) round-trips `normed` through bf16 in registers before the per-block fp8 quant. Without it the fp8 codepoints drift by 1 ULP at the saturation boundary in ~2.2% of elements (vs the unfused production path which materializes bf16 in HBM between rmsnorm and quant); compounds across 43 layers into garbage-token logit drift. This was the root cause of the prior A2-sgl-project#1 +22 ms TPOT regression with degenerate sampling.

Microbench (v2 framework, 5 production shapes, chi2811):

* EAGER: torch 116.58 us → triton 19.75 us = 5.90x
* GRAPH-REPLAY: torch 63.02 us → triton 9.62 us = 6.55x
* Correctness PASS at all 5 shapes (q_max_diff = 1 fp8 ULP at sat)
* PMC: VGPR=20, AGPR=0, LDS=0, scratch=0

The Mode A wire (`SGLANG_F4_MODE_A=1` env-gate at MQALayer._forward_prepare, already committed in 0ffd49e) calls the q_norm site only; Mode B (input_layernorm fan-out at L2519) is NOT wired and remains the A2-sgl-project#1 failure surface; do not enable without a separate validation campaign.

Live A/B on chi2811 (TPOT 31.21 OFF vs 31.33 ON, +0.12 ms within noise): the Mode A site is launch-count-neutral vs the existing `_aiter_fused_qk_rmsnorm` baseline (replaces qk-norm fusion with kv_norm + fused_q_quant; ~0.15 ms/step microbench prediction in noise). Default OFF; primitive remains available for Mode B follow-up.

Files:

* python/sglang/srt/layers/quantization/fused_rmsnorm_quant.py: kernel
* microbench/triton_port_v2/bench_f4_rmsnorm_quant.py: v2 microbench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
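The bf16 round-trip invariant can be illustrated on the CPU without Triton. The sketch below is an assumption-laden stand-in, not the kernel: `round_to_bf16` emulates float32-to-bf16 rounding (round-to-nearest-even, finite inputs only) with bit arithmetic, `quant_input` is a hypothetical mirror of the `MATCH_BF16_PRODUCTION` switch, and `group_amax` only shows the per-128-element grouping that a 1x128 quantizer would scale by.

```python
import struct
from typing import List, Sequence


def round_to_bf16(x: float) -> float:
    """Emulate float32 -> bf16 -> float32 with round-to-nearest-even (finite x only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so exact ties round to an even mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def quant_input(normed: Sequence[float], match_bf16_production: bool = True) -> List[float]:
    """Hypothetical mirror of the flag: values the quantizer would actually see."""
    if match_bf16_production:
        return [round_to_bf16(v) for v in normed]
    return list(normed)


def group_amax(values: Sequence[float], group_size: int = 128) -> List[float]:
    """Per-group absolute maximum, the quantity a per-1x128 scale is derived from."""
    return [
        max(abs(v) for v in values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


normed = [1.00390625, 1.01171875]  # both sit exactly between two bf16 values
print(quant_input(normed, match_bf16_production=False))
print(quant_input(normed, match_bf16_production=True))
print(group_amax([1.0] * 128 + [-2.0] * 128))
```

Feeding the quantizer the unrounded values changes its input slightly, which is the mechanism the message gives for fp8 codepoints landing one step away from the unfused path near the saturation boundary.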

Drops the sglang-native session.start/.end + binary-PCM-frame protocol that landed in M1 and replaces it with the OpenAI Realtime transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription). Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):

client

-> session.update {session.type=transcription, audio.input.{format, sample_rate, transcription.{model,language}, noise_reduction, turn_detection}}
-> input_audio_buffer.append {audio: base64-PCM16-LE}
-> input_audio_buffer.commit
-> input_audio_buffer.clear

server

-> session.created / session.updated
-> input_audio_buffer.committed {item_id, previous_item_id}
-> input_audio_buffer.cleared
-> conversation.item.created {previous_item_id, item}
-> conversation.item.input_audio_transcription.delta
-> conversation.item.input_audio_transcription.completed
-> conversation.item.input_audio_transcription.failed
-> error {error: {type, code, message, param}}

sglang-specific deltas vs the spec, all documented in the module docstring:

* audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz internally via librosa before feeding the model.
* Server-side VAD is not implemented; turn_detection != null is rejected with vad_not_supported. Clients must commit explicitly.
* noise_reduction != null is rejected; include[] is silently dropped.
* Deltas stream continuously as audio is appended (one inference per chunk_size_sec of new audio, anchored by the previously emitted prefix). Clients do not need to commit to start receiving deltas; commit only finalizes the turn and emits the committed/item.created/completed triplet, then resets state for the next turn within the same session.
* audio.input.transcription.model stays echo-only per the existing sglang single-model design; multi-model routing belongs upstream.

Reviewer-requested changes also bundled in:

* sgl-project#1 (encapsulation): handle_realtime_transcription now takes tokenizer_manager, adapter, server_args, and session_semaphore as explicit kwargs; the WS module never reaches into OpenAIServingTranscription privates.
* sgl-project#4 (type hints): all new functions and dataclasses are fully annotated.
* sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default 32). Excess connections are accepted, sent error{code: too_many_sessions}, and closed.

Out-of-scope follow-ups (TODO in module docstring):

* sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:

* _stream_websocket_async rewritten to drive the new protocol (session.update -> append events with base64 -> commit -> drain delta + committed + item.created + completed).
* 19/19 tests pass, ~52.7s, stable across 5 consecutive runs (/tmp/asr_openai_run1..5.log).
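The client-side events listed above could be built along these lines. This is a hedged sketch, not the test client from the commit: the exact nesting of `audio.input` under `session`, the `{"type": "audio/pcm"}` format value, and the model name are assumptions made for illustration; only the event names and the base64 PCM16 little-endian payload come from the message.

```python
import base64
import json
import struct
from typing import Optional, Sequence


def session_update(model: str, language: Optional[str], sample_rate: int = 16000) -> str:
    """Build a session.update event; the field nesting is an assumed layout."""
    event = {
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {
                "input": {
                    "format": {"type": "audio/pcm"},  # assumed format value
                    "sample_rate": sample_rate,  # sglang extension per the message
                    "transcription": {"model": model, "language": language},
                    "noise_reduction": None,  # must stay null
                    "turn_detection": None,  # must stay null (no server-side VAD)
                }
            },
        },
    }
    return json.dumps(event)


def append_audio(samples: Sequence[int]) -> str:
    """Encode int16 samples as little-endian PCM16, then base64, in an append event."""
    pcm = struct.pack("<%dh" % len(samples), *samples)
    audio = base64.b64encode(pcm).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio})


def commit() -> str:
    """Finalize the current turn."""
    return json.dumps({"type": "input_audio_buffer.commit"})


# "example-asr-model" is a placeholder, not a name taken from the PR.
for message in (session_update("example-asr-model", "en"), append_audio([0, 1, -1]), commit()):
    print(message)
```

Each string would be sent as one WebSocket text frame; since turn detection must be null, the explicit commit is what ends a turn.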

Move /v1/audio/transcriptions/stream to /v1/realtime and switch from the M1 session.start/binary-PCM protocol to OpenAI's Realtime transcription wire format. The shared inference driver is untouched, so HTTP SSE and WS still produce byte-identical transcripts; this is purely a transport rewrite. sglang deviations from the spec live in the module docstring: sample_rate is a sglang extension accepting 16/24/48 kHz with internal resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and noise_reduction must be null (no server-side VAD), include[] is dropped, model is echo-only. Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription), sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32). sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's input contract.
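The session cap behaviour (accept the connection, send an error event, then close) can be sketched with a small counter. Everything here is hypothetical apart from the default of 32 and the `too_many_sessions` code: the class name, the error `type` string, and the message text are illustrative, and the actual change passes a `session_semaphore` rather than a class like this.

```python
from typing import Optional


class SessionGate:
    """Hypothetical admission gate; single event loop assumed, not thread-safe."""

    def __init__(self, max_sessions: int = 32) -> None:
        self.max_sessions = max_sessions
        self.active = 0

    def try_enter(self) -> Optional[dict]:
        """Return None when admitted, else the error event to send before closing."""
        if self.active >= self.max_sessions:
            return {
                "type": "error",
                "error": {
                    "type": "invalid_request_error",  # assumed value
                    "code": "too_many_sessions",
                    "message": "concurrent session limit reached",
                    "param": None,
                },
            }
        self.active += 1
        return None

    def leave(self) -> None:
        """Release a slot when an admitted session ends."""
        self.active = max(0, self.active - 1)


gate = SessionGate(max_sessions=2)
print(gate.try_enter(), gate.try_enter())  # both admitted
print(gate.try_enter())  # rejected with an error event
gate.leave()
print(gate.try_enter())  # a slot is free again
```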

…ional descriptions Comment-only cleanup. Replaces 14 internal-nickname references (Phase 24, A2-sgl-project#1, A2-sgl-project#2, MEGA-3', Phase A1, Phase 13, etc.) with descriptive functional explanations of the surrounding code. No code semantics or behavior change. Equivalent via diff filtered to non-comment lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hnyls2002 added 2 commits January 8, 2024 05:07

add 3rdparty flashinfer

c98aa03

add outlines reference

3d999ad

merrymercy merged commit ead5b39 into main Jan 8, 2024

merrymercy deleted the 3rd-party branch January 8, 2024 16:26

Rookie-Kai mentioned this pull request Aug 14, 2024

[Bug] Always Watch Dog TimeOut #1093

Closed

4 tasks

Ying1123 pushed a commit that referenced this pull request Sep 13, 2024

Merge pull request #1 from ivanium/pr-sp-rope

a11bc61

Sequence Parallel system setup

wonderisland mentioned this pull request Sep 19, 2024

[Bug] illegal memory access encountered #1467

Closed

5 tasks

learninmou mentioned this pull request Sep 25, 2024

[Bug] sglang run for few hours, it will stop returning valid response #1270

Closed

5 tasks

ispobock pushed a commit to ispobock/sglang that referenced this pull request Oct 15, 2024

Merge pull request sgl-project#1 from ispobock/support-qwen2-vl

ef9c116

use triton attention

stbaione referenced this pull request in nod-ai/sglang Nov 13, 2024

Merge pull request #1 from stbaione/shortfin-backend

9f65287

Shortfin backend

jischein mentioned this pull request Jan 7, 2025

[Bug] Long output and issues when running benchmark_serving.py on DeepSeek-V3 #2746

Closed

5 tasks

CSEEduanyu mentioned this pull request Jan 26, 2025

[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3 #2803

Closed

5 tasks

hubertlu-tw referenced this pull request in hubertlu-tw/sglang Feb 5, 2025

[Wave] Add Wave Prefill Attention Op and MLIR prints (#1)

16d2bcb

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

zhaotyer mentioned this pull request Feb 14, 2025

[Bug] DeepSeek-R1-BF16 can't output with /v1/chat/completions on 4 node*8*A100 #3572

Closed

5 tasks

lambert0312 mentioned this pull request Feb 18, 2025

Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 #3582

Merged

ToughK mentioned this pull request Feb 18, 2025

[Bug] sglang crashed when use enable_dp_attention running DeepSeekV3 on 2x8xH100 #3658

Closed

5 tasks

mahaocong90 mentioned this pull request Feb 26, 2025

[Bug] H20 8 gpu x 2 with --enable-dp-attention occurred CUDA error: an illegal memory access #3892

Closed

5 tasks

verigle mentioned this pull request Feb 27, 2025

[Bug] Model Stuck at Prefill and then throw "Watchdog Timeout" Error After Idle Period (Deepseek-r1:671b on two H100*8) #3836

Closed

5 tasks

Achazwl pushed a commit to Achazwl/sglang that referenced this pull request Mar 2, 2025

Merge pull request sgl-project#1 from Achazwl/simplify_lm_head

06354e5

update doc

zcnrex referenced this pull request in zcnrex/sglang Mar 5, 2025

Merge pull request #1 from hebiao064/ds_group_gemm

5c57a85

[WIP] Add Benchmark for DeepGEMM Group GEMM

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Add flashinfer && Oultines (sgl-project#1)

aa4a46e

ch-wan referenced this pull request in ch-wan/sglang Mar 14, 2025

Merge pull request #1 from ch-wan/pr/4232

e83268b

avoid iteration in python

riou-chen mentioned this pull request Apr 17, 2025

[Bug] run eagle3 failed #5448

Closed

dongyibo mentioned this pull request May 19, 2025

[Bug] eagle2【CUDA error: an illegal memory access was encountered】 #6309

Closed

5 tasks

yuleiqin mentioned this pull request May 26, 2025

[Bug] main pd version Exception: Failed to encode tensor map: 700 #6590

Closed

5 tasks

Talantan1102 pushed a commit to Talantan1102/sglang that referenced this pull request Apr 20, 2026

Merge pull request sgl-project#1 from silencejade/br_adpat_prefix

39e744f

adapt prefix + mtp on ascend gdn backend

thanhhao98 mentioned this pull request Apr 21, 2026

[Bug Fix] Sync FlashInfer autotune across TP ranks to unblock --enable-symm-mem #23317

Draft

5 tasks

shenxiul mentioned this pull request Apr 23, 2026

Skip torch.cuda.empty_cache() in weight update flush path #22998

Merged

2 tasks

jhinpan mentioned this pull request Apr 24, 2026

Add AMD support for DeepSeek V4 #23608

Open

silencejade mentioned this pull request Apr 25, 2026

[NPU] Fix mrope_position computation in Eagle Worker v2 with PlanStream #23423

Open

5 tasks

Johnsonms mentioned this pull request Apr 25, 2026

Flux2 nvfp4 quantization correctness on Blackwell (B200) #23625

Merged

5 tasks

Yunzez mentioned this pull request Apr 28, 2026

[Bug] Scheduler crash in LoRA, loss of availability #23141

Open

5 tasks

hhhh1252023 referenced this pull request in hhhh1252023/sglang_public Apr 28, 2026

Remove extra 16p script not in Issue #1

c4ad4bc

hhhh1252023 referenced this pull request in hhhh1252023/sglang_public Apr 28, 2026

Update Kimi test configs to match Issue #1 specs (8p, 1p1d_16p, 2p1d_32p)

b9d27c3

Jiminator mentioned this pull request Apr 30, 2026

[Model] Laguna-XS.2 Model Support #24184

Closed

5 tasks

Jiminator mentioned this pull request May 1, 2026

[Model] Laguna-XS.2 Model Support #24204

Merged

5 tasks

Wen-xuan-Xu mentioned this pull request May 2, 2026

[Bug] Empty micro-batch produced after KV-pool retraction crashes rotary_embedding and store_cache on Gemma-4 31B with --swa-full-tokens-ratio 0.1 #24252

Closed

5 tasks

This was referenced May 4, 2026

[CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models #24279

Merged

[DIAGNOSTIC, DO NOT MERGE] Pin test_prefill_logits to marco/mcdse-2b-v1 to reproduce the 5090 failure #24327

Closed

kenzhangwangshu mentioned this pull request May 6, 2026

[Bug] flashinfer_trtllm MoE runner has no DeepEP fused func registered — blocks EP+NVFP4 on Blackwell #24502

Open

huangzhilin-hzl pushed a commit to huangzhilin-hzl/sglang that referenced this pull request May 8, 2026

Fix missing moe_permute_prepare source (sgl-project#1)

8415eb8

Gs1997XX mentioned this pull request May 8, 2026

DeepSeek-V4 Day 0 Support on NPUs #23598

Open

6 tasks

JackLeeHal mentioned this pull request May 9, 2026

[Question] running DeepSeek-V4-Pro on B300 #24776

Open

Add flashinfer && Oultines #1

merrymercy merged 2 commits into main from 3rd-party

hnyls2002 commented Jan 8, 2024
