Add SRT json decode example #2

Merged: merrymercy merged 4 commits into main from srt-json on Jan 9, 2024

Conversation

@hnyls2002 (Collaborator)

Only suitable for the SRT backend: each field value is constrained by a regex chosen from the field's dtype, and the fields are separated by literal commas.

@hnyls2002 (Collaborator, Author)

Result of SRT JSON generation:

Generate a JSON object to describe the basic information of a city.
{
  "name": "New York",
  "population": 8500000,
  "area": 3026000,
  "latitude": 40.712786,
  "country": "United States",
  "timezone": "Eastern Standard Time"
}
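A minimal sketch of the dtype-regex-plus-comma pattern described above, written against the sglang frontend (`sgl.gen`'s `regex` argument). The field list and regexes are illustrative, not necessarily the exact ones in the merged example:

```python
import sglang as sgl

# Illustrative per-dtype regexes: quoted string, integer, float.
# The merged example may use different patterns.
REGEX_STRING = r"\"[\w\d\s]*\""
REGEX_INT = r"[0-9]+"
REGEX_FLOAT = r"[0-9]+\.[0-9]+"

@sgl.function
def city_info(s):
    s += "Generate a JSON object to describe the basic information of a city.\n"
    # The braces, keys, and commas are emitted as literal text; only the
    # values are sampled, each under its dtype's regex.
    s += "{\n"
    s += '  "name": ' + sgl.gen("name", regex=REGEX_STRING) + ",\n"
    s += '  "population": ' + sgl.gen("population", regex=REGEX_INT) + ",\n"
    s += '  "latitude": ' + sgl.gen("latitude", regex=REGEX_FLOAT) + ",\n"
    s += '  "country": ' + sgl.gen("country", regex=REGEX_STRING) + "\n"
    s += "}"

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
print(city_info.run().text())
```

Because the JSON skeleton is literal prompt text, the output is well-formed by construction; the regexes only have to police individual values.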

@merrymercy merrymercy merged commit 331848d into main Jan 9, 2024
@merrymercy merrymercy deleted the srt-json branch January 9, 2024 20:35
@Rookie-Kai Rookie-Kai mentioned this pull request Aug 14, 2024
Ying1123 pushed a commit that referenced this pull request Sep 13, 2024
* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers

* fix: simplify flashinfer kernel initialization (begin_forward() and end_forward())

* test: add logic for sp worker 1 which is basically the same but with different orders of kernel calls

* chore: format tweak

* feat: a general seq parallel attention kernel that achieves workload balance

* fix: minor tweak loop iteration within ring attention

* feat [radix_attention]: seq_parallel kernel with sync communication.

TODO: turn communication into async fashion and overlap it with computation

* test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet

* chore [radix_attention]: format tweak

* feat: async communication within ring attention (see the sketch after this log)

* fix [parallel_utils]: add missing files

* fix [infer_batch]: set default values for newly added sp-related metadata

* fix [bench_latency]: minor fixes to input args

* feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled

* feat [linear]: add QKVParallelLinear

* feat [llama2]: update llama model to use our QKVParallelLinear

* feat [model_runner]: initialize model parallel with sequence parallel

* fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer initialization args

* fix [bench_latency]: load model with sp_rank

* feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1

* debug: stash current debug changes

* fix [radix_attention]: reshape q tensor before running the kernel

* bug fix for sp layout types

* fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values

* fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication hang.

* chore [bench_latency]: disable decode for now since it isn't supported yet

* upstream with correct prefill sp layout

* fix early exit on decode SP

* chore: tweak format

* update layout

* bug fix

* fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting.

* fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case

* chore: tweak format

* fix [radix_attention]: revert commented-out kv cache store operations in normal attention

* fix: adjust k, v tensor shape to align with both TP and SP setting

* chore [llama2]: minor adjustment

* fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue

* test: update test cases to align with the current kernel args

* fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention

* chore [radix_attention]: clean up comments

* fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM

* fix [infer_batch]: adopt SP KV cache allocation

* feat [linear]: correctly partition q proj along the num_heads dimension with GQA

* chore [llama2]: clean up stale variables

* feat [infer_batch]: adjust positions to SP layout when preparing input_metadata

* feat [infer_batch]: use a dedicated paged attn kernel for cross-SP-shard attn

* feat [parallel_state]: create sequence parallel comm groups

* test [sp_comm_group]: simple test case with sp_size = 2

* doc [parallel_state]: doc string for our SP group organization

* fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store

* feat [radix_attn, infer_batch]: create masks for padded sequences so attn now works for unevenly-distributed sequences too

* chore [bench_latency]: revert original prompts

* fix [parallel_state]: rename "actual" to "kv"

* refactor [radix_attention]: unify two cases with different comm-comp tradeoffs

* chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]"

* fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args

* fix [infer_batch]: only pad positions and out_cache_loc for prefill

* chore [linear]: clean up and revise comments

* chore [parallel_state]: revise comments

* chore [linear]: revise comments and class names

* chore [radix_attention]: add defensive checks

---------

Co-authored-by: ZYHowell <yhzhuang@cmu.edu>
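A minimal sketch (not this PR's actual kernel) of the async ring-attention pattern the log above describes: each sequence-parallel worker computes attention against the KV shard it currently holds while the next shard is already in flight from its ring neighbor. Shapes are simplified, and a real implementation merges partial results with log-sum-exp rescaling rather than plain summation:

```python
import torch
import torch.distributed as dist

def ring_attention(q, k, v, sp_rank, sp_size):
    send_rank = (sp_rank + 1) % sp_size
    recv_rank = (sp_rank - 1) % sp_size
    out = torch.zeros_like(q)
    k_cur, v_cur = k, v
    for step in range(sp_size):
        reqs = []
        if step < sp_size - 1:
            # Post async P2P ops for the next shard before computing,
            # so communication overlaps with the attention below.
            k_next, v_next = torch.empty_like(k_cur), torch.empty_like(v_cur)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k_cur, send_rank),
                dist.P2POp(dist.isend, v_cur, send_rank),
                dist.P2POp(dist.irecv, k_next, recv_rank),
                dist.P2POp(dist.irecv, v_next, recv_rank),
            ])
        # Attend to the shard we already hold. NOTE: summing softmax outputs
        # is only a placeholder; the real kernel tracks per-shard LSE stats.
        attn = torch.softmax(q @ k_cur.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        out += attn @ v_cur
        if step < sp_size - 1:
            for req in reqs:
                req.wait()
            k_cur, v_cur = k_next, v_next
    return out
```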
stbaione referenced this pull request in nod-ai/sglang Nov 13, 2024
Enable bench_serving benchmark for SGLang + Add `fork` and `batch` to Example Script
kbumsik referenced this pull request in DeepAuto-AI/sglang Jan 23, 2025
zcnrex pushed a commit to zcnrex/sglang that referenced this pull request Mar 5, 2025
Remove duplicate for fp8 groupgemm and remove CN docs
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* [SW-223847]: import awq_dequantize if cuda available

* fix

* fix

* fix

---------

Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* Fix ut mla-test-1-gpu-amd (sgl-project#4813)

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

* Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (sgl-project#4638)

* [k8s] Clarified the usage of shared memory. (sgl-project#4341)

* gemma3: impl `get_attention_sliding_window_size` for attn init (sgl-project#4823)

* add partial_json_parser and einops (sgl-project#4827)

* fix the release doc dependency issue (sgl-project#4828)

* Update doc for DeepSeek-V3-0324 (sgl-project#4825)

* deps: lazy import optional dependencies `gguf` and `torchvision` (sgl-project#4826)

* Update MMMU Benchmark instructions (sgl-project#4694)

* Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` (sgl-project#4830)

* Basic Cleanup (sgl-project#4833)

* Support (1 <= dp < tp) in the dp attention in DeepEP (sgl-project#4770)

Co-authored-by: Cheng Wan <cwan39@gatech.edu>

* [Fix] Add compressed_tensors as deps (sgl-project#4819)

* Fix error due to CustomAllreduce setup failure (sgl-project#4815)

Signed-off-by: Kebe <mail@kebe7jun.com>

* use default for torch.ops (sgl-project#4835)

* [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (sgl-project#3969)

* [Misc] Fix issues reported by torchfix (sgl-project#4837)

* Include context length in /v1/models response. (sgl-project#4809)

* [Fix] `self.worker` assignment in `TpModelWorker` and refactor references (sgl-project#4788)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix the lora adapter when lora path is none (sgl-project#4799)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>

* fix: fix typo of comments in w8a8_fp8.py (sgl-project#4843)

* Remove retry in nightly tests (sgl-project#4846)

* Fix CI of test_patch_torch (sgl-project#4844)

* IPv6 support (sgl-project#3949)

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* ci: add condition for daily docker build (sgl-project#4487)

* [Fix] fix output_top_logprobs is not exist (sgl-project#4597)

* fix: when use SGLANG_PORT this env,port is str (sgl-project#4528)

Signed-off-by: rongfu.leng <lenronfu@gmail.com>

* Support Page Size > 1 for FA3 (sgl-project#4832)

Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* Fix Engine error when enabling DP attention (sgl-project#4648)

* fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest (sgl-project#4681)

* Support controlling nsys start and end range programmatically (sgl-project#4688)

* Remove empty tool function name (sgl-project#4704)

Signed-off-by: Kebe <mail@kebe7jun.com>

* Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. (sgl-project#4712)

* get the python version from env (sgl-project#4729)

* Fix torch.cuda.MemPool() internal assertion failure (sgl-project#4687)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Super tiny remove unused code (sgl-project#4750)

* Support with_stack and record_shapes in profiler (sgl-project#4740)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840)

* Fix CI tests (sgl-project#4853)

* Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855)

* Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863)

* [Feature] add multi-rank support for Lora (sgl-project#4492)

Co-authored-by: rudy152 <czh1137892874@gmail.com>

* Clean up `import vllm` in quantization/__init__.py (sgl-project#4834)

* Fix wrong variable name when stopping memory profile (sgl-project#4772)

* [Feat] support deepgemm for cmake (sgl-project#4864)

* Make torch compile configurable for biased_grouped_topk (sgl-project#4749)

* update sgl-kernel test ci (sgl-project#4866)

* fix sampling issue (sgl-project#4871)

* bump sgl-kernel 0.0.5.post4 (sgl-project#4768)

* fix sgl-kernel cu118 build (sgl-project#4872)

* [Feature] Support FA3 backend for MLA (sgl-project#4831)

* upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873)

* update torch compile doc (sgl-project#4874)

* bump v0.4.4.post3 (sgl-project#4878)

* Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882)

* Improve stack trace of retry errors (sgl-project#4845)

* Tiny fix doc error (sgl-project#4795)

* [Docs] Update DeepGEMM at README.md (sgl-project#4886)

* Update CODEOWNERS (sgl-project#4889)

* Delete test_deep_gemm.py (sgl-project#4891)

* Add deepseek style fused moe group gate selection kernel (sgl-project#4530)

* quick fix: add default for new kernel (sgl-project#4898)

* remove setup for sgl-kernel (sgl-project#4899)

* [Misc] Clean m.def and add Development Tips (sgl-project#4890)

* fix allreduce test (sgl-project#4909)

* Support page size > 1 + eagle (sgl-project#4908)

* Fix retract for page size > 1 (sgl-project#4914)

* [Feature] use pytest for sgl-kernel (sgl-project#4896)

* fix bmm fp8 (sgl-project#4926)

* Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927)

* Fix 2-gpu CI test and suppress some warnings (sgl-project#4930)

* [feat] add fa3 in sgl-kernel (sgl-project#4902)

Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>

* Fix sglang frontend's incorrect dependency on torch (sgl-project#4931)

* [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932)

* cleanup sgl-kernel (sgl-project#4933)

* [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925)

* Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883)

Co-authored-by: ch-wan <cwan39@gatech.edu>

* [Fix] Add torch compile for torch.clamp back (sgl-project#4936)

* Fix oom error for large page size (sgl-project#4913)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* [feat] interface for platforms abstraction (sgl-project#4928)

* [Fix] revert clean m.def for cudagraph (sgl-project#4944)

* refactor: multimodal data (sgl-project#4754)

* bump sgl-kernel v0.0.6 (sgl-project#4950)

* [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953)

* use fa3 in sgl-kernel (sgl-project#4954)

* Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959)

* [Feature] Support DeepEP Low Latency (sgl-project#4767)

Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* update bench_serving (sgl-project#4958)

* Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977)

* [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: GeLee <leege233@gmail.com>

* Large page size aligned hierarchical caching (sgl-project#4581)

* bug fix for hicache host eviction (sgl-project#4989)

* sgl scaled_fp8_quant support output padding (sgl-project#4861)

* Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951)

Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>

* Update tokenizer_manager.py (sgl-project#5008)

* [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817)

* update cutlass tag (sgl-project#5011)

* Feature/revise docs ci (sgl-project#5009)

* fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727)

Co-authored-by: yuethe <yuethe@tencent.com>

* [Build] Support build sgl-kernel with ccache (sgl-project#5020)

* fix deepgemm as well (sgl-project#5030)

* try to fix ci oserror (sgl-project#5024)

* Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005)

* Small refactor DeepEPMode to clean up code a bit (sgl-project#4992)

* [Fix] fix fa3 build at cu118 (sgl-project#5036)

* Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048)

* bump sgl-kernel v0.0.7 (sgl-project#5046)

* update eagle-3 docs (sgl-project#4796)

Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn>

* Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* Update the retry count (sgl-project#5051)

* upgrade sgl-kernel v0.0.7 (sgl-project#5049)

* [2/3] fix dsv3 awq issue  (sgl-project#4625)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>

* Feature/revise docs ci (sgl-project#5056)

* Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057)

* [fix] remove `cuda_device_count_stateless` (sgl-project#5060)

* Small refactor DeepEPDispatcher into subclasses (sgl-project#4994)

* Support async DeepEP by splitting into two stages (sgl-project#4995)

* Cleanup unused resources after DeepEP operation (sgl-project#4996)

* Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918)

* [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072)

* fix dummy-load deepseekv2 (sgl-project#4535)

* support sgl-kernel on blackwell (sgl-project#5074)

* FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050)

Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>

* [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052)

* upgrade transformers 4.51.0 (sgl-project#5088)

* sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079)

* bump sgl-kernel 0.0.8 (sgl-project#5089)

* python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080)

* bump v0.4.4.post4 (sgl-project#5091)

* Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>

* Add Llama4 support (sgl-project#5092)

Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>

* Fix refactor error - fp8.py (sgl-project#5106)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* bump v0.4.5 (sgl-project#5117)

* Workaround for async copy issue in HPU eager mode (sgl-project#1)

Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>

* [SW-223847]: Fix sgl_kernel module not available (sgl-project#2)

Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>

* [Base] Enable torch compile (sgl-project#4)

* [SW-226331] disable dynamic shape in torch compile mode

Signed-off-by: Mohit Sinha <msinha@habana.ai>

---------

Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: AinL <gmlwns5176@gmail.com>
Co-authored-by: Jiří Suchomel <jiri.suchomel@statsperform.com>
Co-authored-by: Juwan Yoo <ryan@tmfi.us>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Daniel Holanda <holand.daniel@gmail.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Jon Durbin <jon@jondurbin.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <qy254@cornell.edu>
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
Co-authored-by: Jiaqi <57028284+ZhuJiaqi9905@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: warjiang <1096409085@qq.com>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: BroadbentJim <BroadbentJim@users.noreply.github.com>
Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
Co-authored-by: DavidChan <chengwei0519@163.com>
Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com>
Co-authored-by: rudy152 <czh1137892874@gmail.com>
Co-authored-by: Fr4nk1in <sh.fu@outlook.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>
Co-authored-by: SEPLOS <seplos@aliyun.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: GeLee <leege233@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
Co-authored-by: Kaiyu Yang <yangky@umich.edu>
Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: yuethe <yuethe@tencent.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: Tommy Yang <tommyyang0524@gmail.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Rahul Vijayaraghavan <rahul.vijayaraghavan@intel.com>
Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Co-authored-by: Jay Thakur <jthakur@habana.ai>
Co-authored-by: Anshuman Tripathy <atripathy@habana.ai>
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 Q3.A 4-arm: added host-tier-on row to RESULTS.md table, paper §6.3
tab:q3a updated. Default + HiMambaRadixCache costs 7-11% latency vs
default, reproducing the paper's offload-fetch tax claim.

sgl-project#2 Setting 4 saturation-blind fix:
- cross_pool_planner.py: new SGLANG_XPOOL_QDEPTH_TRIGGER env var (default
  0 = legacy behavior preserved). When >0, the planner ALSO fires a
  transfer when one pool is saturated (above its high watermark) AND
  queue_depth >= trigger — even if the other pool is above its low
  watermark. Recovers gradient information at saturation.
- agent.py: passes num_queue_reqs to planner.decide(); logs
  xpool_plan_queue_depth in the JSONL stream.
- 35_planner_qdepth_unit.py: 5/5 unit tests pass — qdepth=0 preserves
  legacy, qdepth>0 fires saturation+queue, queue_depth field
  populated.

The fix is gated so existing runs see no behavior change. Sweep 1
multi-seed re-run with the new mode pending (will compare proxy V_kv'
+ V_mamba' decisions across ratios with vs without queue signal).
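A sketch of the gating this commit describes; the env var, watermark, and queue-depth names follow the commit text and belong to that fork, not upstream sglang:

```python
import os

QDEPTH_TRIGGER = int(os.environ.get("SGLANG_XPOOL_QDEPTH_TRIGGER", "0"))

def should_transfer(src_util, dst_util, high_wm, low_wm, queue_depth):
    # Legacy rule (QDEPTH_TRIGGER == 0 keeps exactly this behavior):
    # transfer only when the source pool is hot AND the destination
    # pool is below its low watermark.
    if src_util >= high_wm and dst_util < low_wm:
        return True
    # Saturation-aware rule: when enabled, ALSO fire if the source pool is
    # above its high watermark and requests are queueing, even though the
    # destination sits above its low watermark.
    if QDEPTH_TRIGGER > 0 and src_util >= high_wm and queue_depth >= QDEPTH_TRIGGER:
        return True
    return False
```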
@Jiminator Jiminator mentioned this pull request May 2, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 2, 2026
…very

Two cross-call data-corruption bugs surfaced during review where the
streaming FSM diverged from the regex non-streaming path on malformed
inputs. Both fixes add explicit recovery branches in READING_VALUE and
require a small regex-side tightening to keep the two paths in sync.

P2 sgl-project#1: malformed `<tool_call>fn\n<arg_key>K</arg_key></tool_call>` (max-
tokens cutoff after `</arg_key>` before `<arg_value>`) left the streaming
FSM stuck in READING_VALUE. The bare-`<` discard ate `</tool_call>` byte-
by-byte instead of recognizing it, and a *subsequent* tool call's
`<arg_value>` mis-attributed to the orphan `current_pending_key` —
silently swallowing the second call's name. Recovery: handle
`</tool_call>` in READING_VALUE by closing the active call (orphan key
dropped) via the existing `_close_current_call` helper.

P2 sgl-project#2: malformed `<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V`
(model emitted a key, then re-emitted a new key without a value for the
first) bound V to the stale K1 — wrong-argument corruption. Recovery:
handle `<arg_key>` in READING_VALUE by replacing the orphan
`current_pending_key` with the new one and staying in READING_VALUE.

Regex tightening: `arg_pair_regex` key portion changed from `(.*?)` to
`([^<]*?)`. The non-greedy `.*?` was backtracking across `</arg_key>`
boundaries on the malformed-key shape above, producing a junk key
spanning both `<arg_key>` tags (e.g. `"K1</arg_key><arg_key>K2": V`).
The `[^<]` constraint blocks the backtrack. Param names never contain
`<` in practice; the value side keeps `.*?` because legitimate values
can contain `<` (HTML, paths, etc.).

Both paths now produce the intuitive `{"K2": V}` for the malformed input.
Locked in by 4 regression tests (non-streaming + streaming for each P2
shape). Helper docstrings extended to document the dual call sites.
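The regex tightening is easy to reproduce standalone. In this illustrative snippet (not the parser's actual surrounding code), the lazy `(.*?)` key group expands across the closing tag on the malformed input, while `([^<]*?)` recovers the intended result:

```python
import re

malformed = "<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V</arg_value>"

loose = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>")
tight = re.compile(r"<arg_key>([^<]*?)</arg_key><arg_value>(.*?)</arg_value>")

print(dict(loose.findall(malformed)))
# {'K1</arg_key><arg_key>K2': 'V'}  -- the lazy group crosses the tag boundary
print(dict(tight.findall(malformed)))
# {'K2': 'V'}                       -- [^<] blocks the cross-tag expansion
```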
JustinTong0323 added a commit to Seven-Streams/sglang that referenced this pull request May 2, 2026
traverse_tree's inner dfs recurses with retrieve_next_token[curr] and reads
draft_tokens[curr], both of which return 0-d tensors. xgrammar 0.1.32 silently
coerced these to int via the FFI binding; 0.2.0 enforces the int signature on
GrammarMatcher.accept_token / fill_next_token_bitmask and raises:

  TypeError: Mismatched type on argument sgl-project#2 ... Expected `int` but got `ffi.Tensor`

Cast at the recursion sites (so curr stays an int per its annotation) and at
the accept_token call site (since draft_tokens stays a tensor). Add a unit
test that runs traverse_tree on a recording grammar and rejects any tensor
argument to accept_token / fill_vocab_mask.
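A sketch of the casts described above; traverse_tree's real signature and surrounding logic live in sglang's speculative-decoding code, so the tensors and recursion here are only illustrative:

```python
import torch

retrieve_next_token = torch.tensor([1, 2, -1])  # stand-in tree structure
draft_tokens = torch.tensor([101, 102, 103])    # stand-in draft tokens

def dfs(curr: int, matcher):
    # `matcher` stands in for xgrammar's GrammarMatcher. Indexing a 1-D
    # tensor yields a 0-d tensor, not an int; xgrammar 0.2.0 rejects that
    # at the FFI boundary, so cast explicitly.
    matcher.accept_token(int(draft_tokens[curr]))
    nxt = int(retrieve_next_token[curr])  # curr stays an int per its annotation
    if nxt >= 0:
        dfs(nxt, matcher)
```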
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 3, 2026
…g is the cause

Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).

Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:

  Relative diff (kernel vs torch ref):
    min:    0.0000%
    median: 0.2232%
    mean:   0.2129%
    p99:    0.4464%
    max:    0.4464%
  Calls with rel > 0.5%: 0/560
  Calls with rel > 1.0%: 0/560
  Calls with rel > 5.0%: 0/560

All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.

DEFINITIVE CONCLUSION on Layer-3:
  The residual e2e regression is hypothesis sgl-project#1 from the user's list:
  cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
  = 1800 calls per generated sequence per worker. Each call is within
  bf16 floor; the cumulative drift exceeds the model's training-time
  robustness envelope.

  Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
  calls in the 560-sample distribution.
  Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
  in tier0 (no cuda graph) and still showed garbage e2e despite per-call
  diffs being uniform and small.

Path forward (unchanged from previous commit):
  (b) Model finetune with kernel's specific bit pattern — out of kernel
      team scope.
  (d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.

Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.

This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
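For context, the shadow-reference technique described above amounts to re-running a trusted reference on the exact tensors each production call consumed and logging per-call agreement. A generic sketch, with ref_sparse_attn_decode and the env var taken from the commit text and the harness shape assumed:

```python
import csv
import os
import torch
import torch.nn.functional as F

SHADOW_DIR = os.environ.get("SGLANG_FLASH_MLA_SHADOW_REF")

def shadow_check(kernel_out, ref_fn, *inputs, log_name="shadow_ref.csv"):
    """After a production kernel call, run ref_fn (e.g. a torch reference
    such as ref_sparse_attn_decode) on the SAME inputs and log agreement."""
    if not SHADOW_DIR:
        return  # hook disabled: no overhead in normal runs
    ref_out = ref_fn(*inputs)
    cos = F.cosine_similarity(
        kernel_out.float().flatten(), ref_out.float().flatten(), dim=0
    ).item()
    max_diff = (kernel_out.float() - ref_out.float()).abs().max().item()
    with open(os.path.join(SHADOW_DIR, log_name), "a", newline="") as f:
        csv.writer(f).writerow([cos, max_diff])
```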
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Drops the sglang-native session.start/.end + binary-PCM-frame protocol
that landed in M1 and replaces it with the OpenAI Realtime
transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription).
Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):
  client -> session.update {session.type=transcription, audio.input.{format,
            sample_rate, transcription.{model,language}, noise_reduction,
            turn_detection}}
         -> input_audio_buffer.append {audio: base64-PCM16-LE}
         -> input_audio_buffer.commit
         -> input_audio_buffer.clear
  server -> session.created / session.updated
         -> input_audio_buffer.committed {item_id, previous_item_id}
         -> input_audio_buffer.cleared
         -> conversation.item.created {previous_item_id, item}
         -> conversation.item.input_audio_transcription.delta
         -> conversation.item.input_audio_transcription.completed
         -> conversation.item.input_audio_transcription.failed
         -> error {error: {type, code, message, param}}

sglang-specific deltas vs the spec, all documented in the module docstring:
  * audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm
    default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz
    internally via librosa before feeding the model.
  * Server-side VAD is not implemented; turn_detection != null is
    rejected with vad_not_supported. Clients must commit explicitly.
  * noise_reduction != null is rejected; include[] is silently dropped.
  * Deltas stream continuously as audio is appended (one inference per
    chunk_size_sec of new audio, anchored by the previously emitted
    prefix). Clients do not need to commit to start receiving deltas;
    commit only finalizes the turn and emits the committed/item.created/
    completed triplet, then resets state for the next turn within the
    same session.
  * audio.input.transcription.model stays echo-only per the existing
    sglang single-model design; multi-model routing belongs upstream.

Reviewer-requested changes also bundled in:
  * sgl-project#1 (encapsulation): handle_realtime_transcription now takes
    tokenizer_manager, adapter, server_args, and session_semaphore as
    explicit kwargs; the WS module never reaches into
    OpenAIServingTranscription privates.
  * sgl-project#4 (type hints): all new functions and dataclasses are fully
    annotated.
  * sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default
    32). Excess connections are accepted, sent error{code:
    too_many_sessions}, and closed.

Out-of-scope follow-ups (TODO in module docstring):
  * sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept
    pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:
  * _stream_websocket_async rewritten to drive the new protocol
    (session.update -> append events with base64 -> commit -> drain
    delta + committed + item.created + completed).
  * 19/19 tests pass, ~52.7s, stable across 5 consecutive runs
    (/tmp/asr_openai_run1..5.log).
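A sketch of a client driving the wire protocol above, roughly what the rewritten _stream_websocket_async would do. The endpoint and event names follow the commit text; the websockets-based plumbing is an assumption, not code from this PR:

```python
import asyncio
import base64
import json
import websockets  # pip install websockets

async def transcribe(pcm16_bytes, url="ws://localhost:30000/v1/realtime"):
    async with websockets.connect(url) as ws:
        # Configure a transcription session; sample_rate is the sglang
        # extension noted above, and turn_detection must stay null.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "transcription",
                "audio": {"input": {
                    "format": "audio/pcm",
                    "sample_rate": 16000,
                    "transcription": {"model": "default"},
                    "turn_detection": None,
                }},
            },
        }))
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_bytes).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event["type"] == "conversation.item.input_audio_transcription.completed":
                break

# Example: asyncio.run(transcribe(open("audio.pcm", "rb").read()))
```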
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. The shared inference driver is untouched,
so HTTP SSE and WS still produce byte-identical transcripts; this is
purely a transport rewrite.

sglang deviations from the spec live in the module docstring:
sample_rate is a sglang extension accepting 16/24/48 kHz with internal
resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and
noise_reduction must be null (no server-side VAD), include[] is
dropped, model is echo-only.

Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription),
sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32).
sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's
input contract.
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 4, 2026
…ional descriptions

Comment-only cleanup. Replaces 14 internal-nickname references
(Phase 24, A2-sgl-project#1, A2-sgl-project#2, MEGA-3', Phase A1, Phase 13, etc.) with
descriptive functional explanations of the surrounding code.

No code semantics or behavior change. Equivalent via diff filtered
to non-comment lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangzhilin-hzl added a commit to huangzhilin-hzl/sglang that referenced this pull request May 8, 2026