Add SRT json decode example #2

Merged: merrymercy merged 4 commits into main from srt-json on Jan 9, 2024

Conversation

@hnyls2002 (Collaborator)

Only suitable for the SRT backend: each field value is constrained by a regex chosen from the field's dtype, and the fields are separated by literal commas.

@hnyls2002 (Collaborator, Author)

Result of SRT JSON generation:

Generate a JSON object to describe the basic information of a city.
{
  "name": "New York",
  "population": 8500000,
  "area": 3026000,
  "latitude": 40.712786,
  "country": "United States",
  "timezone": "Eastern Standard Time"
}
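A minimal sketch of the dtype-regex-plus-comma pattern described above, written against the sglang frontend (`sgl.gen`'s `regex` argument). The field list and regexes are illustrative, not necessarily the exact ones in the merged example:

```python
import sglang as sgl

# Illustrative per-dtype regexes: quoted string, integer, float.
# The merged example may use different patterns.
REGEX_STRING = r"\"[\w\d\s]*\""
REGEX_INT = r"[0-9]+"
REGEX_FLOAT = r"[0-9]+\.[0-9]+"

@sgl.function
def city_info(s):
    s += "Generate a JSON object to describe the basic information of a city.\n"
    # The braces, keys, and commas are emitted as literal text; only the
    # values are sampled, each under its dtype's regex.
    s += "{\n"
    s += '  "name": ' + sgl.gen("name", regex=REGEX_STRING) + ",\n"
    s += '  "population": ' + sgl.gen("population", regex=REGEX_INT) + ",\n"
    s += '  "latitude": ' + sgl.gen("latitude", regex=REGEX_FLOAT) + ",\n"
    s += '  "country": ' + sgl.gen("country", regex=REGEX_STRING) + "\n"
    s += "}"

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
print(city_info.run().text())
```

Because the JSON skeleton is literal prompt text, the output is well-formed by construction; the regexes only have to police individual values.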

@merrymercy merrymercy merged commit 331848d into main Jan 9, 2024
@merrymercy merrymercy deleted the srt-json branch January 9, 2024 20:35
@Rookie-Kai Rookie-Kai mentioned this pull request Aug 14, 2024
Ying1123 pushed a commit that referenced this pull request Sep 13, 2024
* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers

* fix: simplify flashinfer kernel initialization (begin_forward() and end_forward())

* test: add logic for sp worker 1 which is basically the same but with different orders of kernel calls

* chore: format tweak

* feat: a general seq parallel attention kernel that achieves workload balance

* fix: minor tweak loop iteration within ring attention

* feat [radix_attention]: seq_parallel kernel with sync communication.

TODO: turn communication into async fashion and overlap it with computation

* test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet

* chore [radix_attention]: format tweak

* feat: async communication within ring attention (see the sketch after this log)

* fix [parallel_utils]: add missing files

* fix [infer_batch]: set default values for newly added sp-related metadata

* fix [bench_latency]: minor fixes to input args

* feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled

* feat [linear]: add QKVParallelLinear

* feat [llama2]: update llama model to use our QKVParallelLinear

* feat [model_runner]: initialize model parallel with sequence parallel

* fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer initialization args

* fix [bench_latency]: load model with sp_rank

* feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1

* debug: stash current debug changes

* fix [radix_attention]: reshape q tensor before running the kernel

* bug fix for sp layout types

* fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values

* fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication hang.

* chore [bench_latency]: disable decode for now since it isn't supported yet

* upstream with correct prefill sp layout

* fix early exit on decode SP

* chore: tweak format

* update layout

* bug fix

* fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting.

* fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case

* chore: tweak format

* fix [radix_attention]: revert commented-out kv cache store operations in normal attention

* fix: adjust k, v tensor shape to align with both TP and SP setting

* chore [llama2]: minor adjustment

* fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue

* test: update test cases to align with the current kernel args

* fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention

* chore [radix_attention]: clean up comments

* fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM

* fix [infer_batch]: adopt SP KV cache allocation

* feat [linear]: correctly partition q proj along the num_heads dimension with GQA

* chore [llama2]: clean up stale variables

* feat [infer_batch]: adjust positions to SP layout when preparing input_metadata

* feat [infer_batch]: use a dedicated paged attn kernel for cross-SP-shard attn

* feat [parallel_state]: create sequence parallel comm groups

* test [sp_comm_group]: simple test case with sp_size = 2

* doc [parallel_state]: doc string for our SP group organization

* fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store

* feat [radix_attn, infer_batch]: create masks for padded sequences so attn now works for unevenly-distributed sequences too

* chore [bench_latency]: revert original prompts

* fix [parallel_state]: rename "actual" to "kv"

* refactor [radix_attention]: unify two cases with different comm-comp tradeoffs

* chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]"

* fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args

* fix [infer_batch]: only pad positions and out_cache_loc for prefill

* chore [linear]: clean up and revise comments

* chore [parallel_state]: revise comments

* chore [linear]: revise comments and class names

* chore [radix_attention]: add defensive checks

---------

Co-authored-by: ZYHowell <yhzhuang@cmu.edu>
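A minimal sketch (not this PR's actual kernel) of the async ring-attention pattern the log above describes: each sequence-parallel worker computes attention against the KV shard it currently holds while the next shard is already in flight from its ring neighbor. Shapes are simplified, and a real implementation merges partial results with log-sum-exp rescaling rather than plain summation:

```python
import torch
import torch.distributed as dist

def ring_attention(q, k, v, sp_rank, sp_size):
    send_rank = (sp_rank + 1) % sp_size
    recv_rank = (sp_rank - 1) % sp_size
    out = torch.zeros_like(q)
    k_cur, v_cur = k, v
    for step in range(sp_size):
        reqs = []
        if step < sp_size - 1:
            # Post async P2P ops for the next shard before computing,
            # so communication overlaps with the attention below.
            k_next, v_next = torch.empty_like(k_cur), torch.empty_like(v_cur)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k_cur, send_rank),
                dist.P2POp(dist.isend, v_cur, send_rank),
                dist.P2POp(dist.irecv, k_next, recv_rank),
                dist.P2POp(dist.irecv, v_next, recv_rank),
            ])
        # Attend to the shard we already hold. NOTE: summing softmax outputs
        # is only a placeholder; the real kernel tracks per-shard LSE stats.
        attn = torch.softmax(q @ k_cur.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        out += attn @ v_cur
        if step < sp_size - 1:
            for req in reqs:
                req.wait()
            k_cur, v_cur = k_next, v_next
    return out
```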
stbaione referenced this pull request in nod-ai/sglang Nov 13, 2024
Enable bench_serving benchmark for SGLang + Add `fork` and `batch` to Example Script
kbumsik referenced this pull request in DeepAuto-AI/sglang Jan 23, 2025
zcnrex pushed a commit to zcnrex/sglang that referenced this pull request Mar 5, 2025
Remove duplicate for fp8 groupgemm and remove CN docs
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* [SW-223847]: import awq_dequantize if cuda available

* fix

* fix

* fix

---------

Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* Fix ut mla-test-1-gpu-amd (sgl-project#4813)

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

* Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (sgl-project#4638)

* [k8s] Clarified the usage of shared memory. (sgl-project#4341)

* gemma3: impl `get_attention_sliding_window_size` for attn init (sgl-project#4823)

* add partial_json_parser and einops (sgl-project#4827)

* fix the release doc dependency issue (sgl-project#4828)

* Update doc for DeepSeek-V3-0324 (sgl-project#4825)

* deps: lazy import optional dependencies `gguf` and `torchvision` (sgl-project#4826)

* Update MMMU Benchmark instructions (sgl-project#4694)

* Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` (sgl-project#4830)

* Basic Cleanup (sgl-project#4833)

* Support (1 <= dp < tp) in the dp attention in DeepEP (sgl-project#4770)

Co-authored-by: Cheng Wan <cwan39@gatech.edu>

* [Fix] Add compressed_tensors as deps (sgl-project#4819)

* Fix error due to CustomAllreduce setup failure (sgl-project#4815)

Signed-off-by: Kebe <mail@kebe7jun.com>

* use default for torch.ops (sgl-project#4835)

* [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (sgl-project#3969)

* [Misc] Fix issues reported by torchfix (sgl-project#4837)

* Include context length in /v1/models response. (sgl-project#4809)

* [Fix] `self.worker` assignment in `TpModelWorker` and refactor references (sgl-project#4788)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix the lora adapter when lora path is none (sgl-project#4799)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>

* fix: fix typo of comments in w8a8_fp8.py (sgl-project#4843)

* Remove retry in nightly tests (sgl-project#4846)

* Fix CI of test_patch_torch (sgl-project#4844)

* IPv6 support (sgl-project#3949)

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* ci: add condition for daily docker build (sgl-project#4487)

* [Fix] fix output_top_logprobs is not exist (sgl-project#4597)

* fix: when use SGLANG_PORT this env,port is str (sgl-project#4528)

Signed-off-by: rongfu.leng <lenronfu@gmail.com>

* Support Page Size > 1 for FA3 (sgl-project#4832)

Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* Fix Engine error when enabling DP attention (sgl-project#4648)

* fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest (sgl-project#4681)

* Support controlling nsys start and end range programmatically (sgl-project#4688)

* Remove empty tool function name (sgl-project#4704)

Signed-off-by: Kebe <mail@kebe7jun.com>

* Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. (sgl-project#4712)

* get the python version from env (sgl-project#4729)

* Fix torch.cuda.MemPool() internal assertion failure (sgl-project#4687)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Super tiny remove unused code (sgl-project#4750)

* Support with_stack and record_shapes in profiler (sgl-project#4740)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840)

* Fix CI tests (sgl-project#4853)

* Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855)

* Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863)

* [Feature] add multi-rank support for Lora (sgl-project#4492)

Co-authored-by: rudy152 <czh1137892874@gmail.com>

* Clean up `import vllm` in quantization/__init__.py (sgl-project#4834)

* Fix wrong variable name when stopping memory profile (sgl-project#4772)

* [Feat] support deepgemm for cmake (sgl-project#4864)

* Make torch compile configurable for biased_grouped_topk (sgl-project#4749)

* update sgl-kernel test ci (sgl-project#4866)

* fix sampling issue (sgl-project#4871)

* bump sgl-kernel 0.0.5.post4 (sgl-project#4768)

* fix sgl-kernel cu118 build (sgl-project#4872)

* [Feature] Support FA3 backend for MLA (sgl-project#4831)

* upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873)

* update torch compile doc (sgl-project#4874)

* bump v0.4.4.post3 (sgl-project#4878)

* Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882)

* Improve stack trace of retry errors (sgl-project#4845)

* Tiny fix doc error (sgl-project#4795)

* [Docs] Update DeepGEMM at README.md (sgl-project#4886)

* Update CODEOWNERS (sgl-project#4889)

* Delete test_deep_gemm.py (sgl-project#4891)

* Add deepseek style fused moe group gate selection kernel (sgl-project#4530)

* quick fix: add default for new kernel (sgl-project#4898)

* remove setup for sgl-kernel (sgl-project#4899)

* [Misc] Clean m.def and add Development Tips (sgl-project#4890)

* fix allreduce test (sgl-project#4909)

* Support page size > 1 + eagle (sgl-project#4908)

* Fix retract for page size > 1 (sgl-project#4914)

* [Feature] use pytest for sgl-kernel (sgl-project#4896)

* fix bmm fp8 (sgl-project#4926)

* Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927)

* Fix 2-gpu CI test and suppress some warnings (sgl-project#4930)

* [feat] add fa3 in sgl-kernel (sgl-project#4902)

Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>

* Fix sglang frontend's incorrect dependency on torch (sgl-project#4931)

* [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932)

* cleanup sgl-kernel (sgl-project#4933)

* [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925)

* Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883)

Co-authored-by: ch-wan <cwan39@gatech.edu>

* [Fix] Add torch compile for torch.clamp back (sgl-project#4936)

* Fix oom error for large page size (sgl-project#4913)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* [feat] interface for platforms abstraction (sgl-project#4928)

* [Fix] revert clean m.def for cudagraph (sgl-project#4944)

* refactor: multimodal data (sgl-project#4754)

* bump sgl-kernel v0.0.6 (sgl-project#4950)

* [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953)

* use fa3 in sgl-kernel (sgl-project#4954)

* Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959)

* [Feature] Support DeepEP Low Latency (sgl-project#4767)

Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* update bench_serving (sgl-project#4958)

* Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977)

* [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: GeLee <leege233@gmail.com>

* Large page size aligned hierarchical caching (sgl-project#4581)

* bug fix for hicache host eviction (sgl-project#4989)

* sgl scaled_fp8_quant support output padding (sgl-project#4861)

* Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951)

Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>

* Update tokenizer_manager.py (sgl-project#5008)

* [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817)

* update cutlass tag (sgl-project#5011)

* Feature/revise docs ci (sgl-project#5009)

* fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727)

Co-authored-by: yuethe <yuethe@tencent.com>

* [Build] Support build sgl-kernel with ccache (sgl-project#5020)

* fix deepgemm as well (sgl-project#5030)

* try to fix ci oserror (sgl-project#5024)

* Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005)

* Small refactor DeepEPMode to clean up code a bit (sgl-project#4992)

* [Fix] fix fa3 build at cu118 (sgl-project#5036)

* Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048)

* bump sgl-kernel v0.0.7 (sgl-project#5046)

* update eagle-3 docs (sgl-project#4796)

Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn>

* Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* Update the retry count (sgl-project#5051)

* upgrade sgl-kernel v0.0.7 (sgl-project#5049)

* [2/3] fix dsv3 awq issue  (sgl-project#4625)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>

* Feature/revise docs ci (sgl-project#5056)

* Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057)

* [fix] remove `cuda_device_count_stateless` (sgl-project#5060)

* Small refactor DeepEPDispatcher into subclasses (sgl-project#4994)

* Support async DeepEP by splitting into two stages (sgl-project#4995)

* Cleanup unused resources after DeepEP operation (sgl-project#4996)

* Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918)

* [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072)

* fix dummy-load deepseekv2 (sgl-project#4535)

* support sgl-kernel on blackwell (sgl-project#5074)

* FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050)

Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>

* [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052)

* upgrade transformers 4.51.0 (sgl-project#5088)

* sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079)

* bump sgl-kernel 0.0.8 (sgl-project#5089)

* python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080)

* bump v0.4.4.post4 (sgl-project#5091)

* Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>

* Add Llama4 support (sgl-project#5092)

Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>

* Fix refactor error - fp8.py (sgl-project#5106)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* bump v0.4.5 (sgl-project#5117)

* Workaround for async copy issue in HPU eager mode (sgl-project#1)

Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>

* [SW-223847]: Fix sgl_kernel module not available (sgl-project#2)

Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>

* [Base] Enable torch compile (sgl-project#4)

* [SW-226331] disable dynamic shape in torch compile mode

Signed-off-by: Mohit Sinha <msinha@habana.ai>

---------

Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: AinL <gmlwns5176@gmail.com>
Co-authored-by: Jiří Suchomel <jiri.suchomel@statsperform.com>
Co-authored-by: Juwan Yoo <ryan@tmfi.us>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Daniel Holanda <holand.daniel@gmail.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Jon Durbin <jon@jondurbin.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <qy254@cornell.edu>
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
Co-authored-by: Jiaqi <57028284+ZhuJiaqi9905@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: warjiang <1096409085@qq.com>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: BroadbentJim <BroadbentJim@users.noreply.github.com>
Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
Co-authored-by: DavidChan <chengwei0519@163.com>
Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com>
Co-authored-by: rudy152 <czh1137892874@gmail.com>
Co-authored-by: Fr4nk1in <sh.fu@outlook.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>
Co-authored-by: SEPLOS <seplos@aliyun.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: GeLee <leege233@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
Co-authored-by: Kaiyu Yang <yangky@umich.edu>
Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: yuethe <yuethe@tencent.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: Tommy Yang <tommyyang0524@gmail.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Rahul Vijayaraghavan <rahul.vijayaraghavan@intel.com>
Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai>
Co-authored-by: Jay Thakur <jthakur@habana.ai>
Co-authored-by: Anshuman Tripathy <atripathy@habana.ai>
rucnyz added a commit to rucnyz/sglang that referenced this pull request Apr 30, 2026
sgl-project#4 Q3.A 4-arm: added host-tier-on row to RESULTS.md table, paper §6.3
tab:q3a updated. Default + HiMambaRadixCache costs 7-11% latency vs
default, reproducing the paper's offload-fetch tax claim.

sgl-project#2 Setting 4 saturation-blind fix:
- cross_pool_planner.py: new SGLANG_XPOOL_QDEPTH_TRIGGER env var (default
  0 = legacy behavior preserved). When >0, the planner ALSO fires a
  transfer when one pool is saturated (above its high watermark) AND
  queue_depth >= trigger — even if the other pool is above its low
  watermark. Recovers gradient information at saturation.
- agent.py: passes num_queue_reqs to planner.decide(); logs
  xpool_plan_queue_depth in the JSONL stream.
- 35_planner_qdepth_unit.py: 5/5 unit tests pass — qdepth=0 preserves
  legacy, qdepth>0 fires saturation+queue, queue_depth field
  populated.

The fix is gated so existing runs see no behavior change. Sweep 1
multi-seed re-run with the new mode pending (will compare proxy V_kv'
+ V_mamba' decisions across ratios with vs without queue signal).
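A sketch of the gating this commit describes; the env var, watermark, and queue-depth names follow the commit text and belong to that fork, not upstream sglang:

```python
import os

QDEPTH_TRIGGER = int(os.environ.get("SGLANG_XPOOL_QDEPTH_TRIGGER", "0"))

def should_transfer(src_util, dst_util, high_wm, low_wm, queue_depth):
    # Legacy rule (QDEPTH_TRIGGER == 0 keeps exactly this behavior):
    # transfer only when the source pool is hot AND the destination
    # pool is below its low watermark.
    if src_util >= high_wm and dst_util < low_wm:
        return True
    # Saturation-aware rule: when enabled, ALSO fire if the source pool is
    # above its high watermark and requests are queueing, even though the
    # destination sits above its low watermark.
    if QDEPTH_TRIGGER > 0 and src_util >= high_wm and queue_depth >= QDEPTH_TRIGGER:
        return True
    return False
```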
@Jiminator Jiminator mentioned this pull request May 2, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 2, 2026
…very

Two cross-call data-corruption bugs surfaced during review where the
streaming FSM diverged from the regex non-streaming path on malformed
inputs. Both fixes add explicit recovery branches in READING_VALUE and
require a small regex-side tightening to keep the two paths in sync.

P2 sgl-project#1: malformed `<tool_call>fn\n<arg_key>K</arg_key></tool_call>` (max-
tokens cutoff after `</arg_key>` before `<arg_value>`) left the streaming
FSM stuck in READING_VALUE. The bare-`<` discard ate `</tool_call>` byte-
by-byte instead of recognizing it, and a *subsequent* tool call's
`<arg_value>` mis-attributed to the orphan `current_pending_key` —
silently swallowing the second call's name. Recovery: handle
`</tool_call>` in READING_VALUE by closing the active call (orphan key
dropped) via the existing `_close_current_call` helper.

P2 sgl-project#2: malformed `<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V`
(model emitted a key, then re-emitted a new key without a value for the
first) bound V to the stale K1 — wrong-argument corruption. Recovery:
handle `<arg_key>` in READING_VALUE by replacing the orphan
`current_pending_key` with the new one and staying in READING_VALUE.

Regex tightening: `arg_pair_regex` key portion changed from `(.*?)` to
`([^<]*?)`. The non-greedy `.*?` was backtracking across `</arg_key>`
boundaries on the malformed-key shape above, producing a junk key
spanning both `<arg_key>` tags (e.g. `"K1</arg_key><arg_key>K2": V`).
The `[^<]` constraint blocks the backtrack. Param names never contain
`<` in practice; the value side keeps `.*?` because legitimate values
can contain `<` (HTML, paths, etc.).

Both paths now produce the intuitive `{"K2": V}` for the malformed input.
Locked in by 4 regression tests (non-streaming + streaming for each P2
shape). Helper docstrings extended to document the dual call sites.
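The regex tightening is easy to reproduce standalone. In this illustrative snippet (not the parser's actual surrounding code), the lazy `(.*?)` key group expands across the closing tag on the malformed input, while `([^<]*?)` recovers the intended result:

```python
import re

malformed = "<arg_key>K1</arg_key><arg_key>K2</arg_key><arg_value>V</arg_value>"

loose = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>")
tight = re.compile(r"<arg_key>([^<]*?)</arg_key><arg_value>(.*?)</arg_value>")

print(dict(loose.findall(malformed)))
# {'K1</arg_key><arg_key>K2': 'V'}  -- the lazy group crosses the tag boundary
print(dict(tight.findall(malformed)))
# {'K2': 'V'}                       -- [^<] blocks the cross-tag expansion
```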
JustinTong0323 added a commit to Seven-Streams/sglang that referenced this pull request May 2, 2026
traverse_tree's inner dfs recurses with retrieve_next_token[curr] and reads
draft_tokens[curr], both of which return 0-d tensors. xgrammar 0.1.32 silently
coerced these to int via the FFI binding; 0.2.0 enforces the int signature on
GrammarMatcher.accept_token / fill_next_token_bitmask and raises:

  TypeError: Mismatched type on argument sgl-project#2 ... Expected `int` but got `ffi.Tensor`

Cast at the recursion sites (so curr stays an int per its annotation) and at
the accept_token call site (since draft_tokens stays a tensor). Add a unit
test that runs traverse_tree on a recording grammar and rejects any tensor
argument to accept_token / fill_vocab_mask.
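A sketch of the casts described above; traverse_tree's real signature and surrounding logic live in sglang's speculative-decoding code, so the tensors and recursion here are only illustrative:

```python
import torch

retrieve_next_token = torch.tensor([1, 2, -1])  # stand-in tree structure
draft_tokens = torch.tensor([101, 102, 103])    # stand-in draft tokens

def dfs(curr: int, matcher):
    # `matcher` stands in for xgrammar's GrammarMatcher. Indexing a 1-D
    # tensor yields a 0-d tensor, not an int; xgrammar 0.2.0 rejects that
    # at the FFI boundary, so cast explicitly.
    matcher.accept_token(int(draft_tokens[curr]))
    nxt = int(retrieve_next_token[curr])  # curr stays an int per its annotation
    if nxt >= 0:
        dfs(nxt, matcher)
```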
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 3, 2026
…g is the cause

Adds SGLANG_FLASH_MLA_SHADOW_REF=<dir> hook that runs ref_sparse_attn_decode
on the SAME tensors the kernel just consumed and logs per-call cos_sim +
max_diff to a CSV. Samples 100% of single_shot calls during a live e2e run
(prior 8-capture saved-tensor test was 0.7% of production call space).

Live e2e run on chi2774 with tier0 (no cuda graph) + SGLANG_HIP_CK_V32_SINGLESHOT=1
+ shadow-ref enabled. Result across 560 production single_shot calls:

  Relative diff (kernel vs torch ref):
    min:    0.0000%
    median: 0.2232%
    mean:   0.2129%
    p99:    0.4464%
    max:    0.4464%
  Calls with rel > 0.5%: 0/560
  Calls with rel > 1.0%: 0/560
  Calls with rel > 5.0%: 0/560

All 560 calls match torch's ref_sparse_attn_decode at sub-bf16-ULP
relative diff. NO single call has a catastrophic delta. Yet e2e
produces garbage tokens.

DEFINITIVE CONCLUSION on Layer-3:
  The residual e2e regression is hypothesis sgl-project#1 from the user's list:
  cumulative sub-bf16-ULP noise compounded across 60 layers × 30 tokens
  = 1800 calls per generated sequence per worker. Each call is within
  bf16 floor; the cumulative drift exceeds the model's training-time
  robustness envelope.

  Hypothesis sgl-project#2 (wrapper cache state corruption): RULED OUT — no outlier
  calls in the 560-sample distribution.
  Hypothesis sgl-project#3 (cuda graph stream ordering): RULED OUT — shadow-ref ran
  in tier0 (no cuda graph) and still showed garbage e2e despite per-call
  diffs being uniform and small.

Path forward (unchanged from previous commit):
  (b) Model finetune with kernel's specific bit pattern — out of kernel
      team scope.
  (d) Accept the Layer-3 stopgap and pursue other path-to-1x-B200 levers.

Layer-3 stopgap (ca6f419) remains the production correctness fix.
The kernel + diagnostic infrastructure (now including shadow-ref) is at
its best-attainable bf16-precision-equivalent state.

This commit closes the Layer-3 root-cause investigation. The "why is it
still garbage" answer is now data-grounded: it's NOT a kernel bug, NOT
a cache bug, NOT a graph bug. It IS cumulative ULP compounding across
1800 dependent calls — fundamentally a model-tolerance issue against
the kernel's bit-equivalent-but-bit-different output bit pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
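For context, the shadow-reference technique described above amounts to re-running a trusted reference on the exact tensors each production call consumed and logging per-call agreement. A generic sketch, with ref_sparse_attn_decode and the env var taken from the commit text and the harness shape assumed:

```python
import csv
import os
import torch
import torch.nn.functional as F

SHADOW_DIR = os.environ.get("SGLANG_FLASH_MLA_SHADOW_REF")

def shadow_check(kernel_out, ref_fn, *inputs, log_name="shadow_ref.csv"):
    """After a production kernel call, run ref_fn (e.g. a torch reference
    such as ref_sparse_attn_decode) on the SAME inputs and log agreement."""
    if not SHADOW_DIR:
        return  # hook disabled: no overhead in normal runs
    ref_out = ref_fn(*inputs)
    cos = F.cosine_similarity(
        kernel_out.float().flatten(), ref_out.float().flatten(), dim=0
    ).item()
    max_diff = (kernel_out.float() - ref_out.float()).abs().max().item()
    with open(os.path.join(SHADOW_DIR, log_name), "a", newline="") as f:
        csv.writer(f).writerow([cos, max_diff])
```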
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Drops the sglang-native session.start/.end + binary-PCM-frame protocol
that landed in M1 and replaces it with the OpenAI Realtime
transcription-only spec (https://platform.openai.com/docs/guides/realtime-transcription).
Endpoint moves from /v1/audio/transcriptions/stream to /v1/realtime.

Wire protocol (JSON only, no binary frames):
  client -> session.update {session.type=transcription, audio.input.{format,
            sample_rate, transcription.{model,language}, noise_reduction,
            turn_detection}}
         -> input_audio_buffer.append {audio: base64-PCM16-LE}
         -> input_audio_buffer.commit
         -> input_audio_buffer.clear
  server -> session.created / session.updated
         -> input_audio_buffer.committed {item_id, previous_item_id}
         -> input_audio_buffer.cleared
         -> conversation.item.created {previous_item_id, item}
         -> conversation.item.input_audio_transcription.delta
         -> conversation.item.input_audio_transcription.completed
         -> conversation.item.input_audio_transcription.failed
         -> error {error: {type, code, message, param}}

sglang-specific deltas vs the spec, all documented in the module docstring:
  * audio.input.sample_rate is a sglang extension; OpenAI's audio/pcm
    default is 24 kHz. We accept 16k/24k/48k and resample to 16 kHz
    internally via librosa before feeding the model.
  * Server-side VAD is not implemented; turn_detection != null is
    rejected with vad_not_supported. Clients must commit explicitly.
  * noise_reduction != null is rejected; include[] is silently dropped.
  * Deltas stream continuously as audio is appended (one inference per
    chunk_size_sec of new audio, anchored by the previously emitted
    prefix). Clients do not need to commit to start receiving deltas;
    commit only finalizes the turn and emits the committed/item.created/
    completed triplet, then resets state for the next turn within the
    same session.
  * audio.input.transcription.model stays echo-only per the existing
    sglang single-model design; multi-model routing belongs upstream.

Reviewer-requested changes also bundled in:
  * sgl-project#1 (encapsulation): handle_realtime_transcription now takes
    tokenizer_manager, adapter, server_args, and session_semaphore as
    explicit kwargs; the WS module never reaches into
    OpenAIServingTranscription privates.
  * sgl-project#4 (type hints): all new functions and dataclasses are fully
    annotated.
  * sgl-project#5 (concurrency cap): adds --asr-max-concurrent-sessions (default
    32). Excess connections are accepted, sent error{code:
    too_many_sessions}, and closed.

Out-of-scope follow-ups (TODO in module docstring):
  * sgl-project#2 (PCM round-trip): would require process_asr_chunk to accept
    pre-decoded ndarrays; punted to a separate PR.

Test refresh in test/manual/models/test_qwen3_asr.py:
  * _stream_websocket_async rewritten to drive the new protocol
    (session.update -> append events with base64 -> commit -> drain
    delta + committed + item.created + completed).
  * 19/19 tests pass, ~52.7s, stable across 5 consecutive runs
    (/tmp/asr_openai_run1..5.log).
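A sketch of a client driving the wire protocol above, roughly what the rewritten _stream_websocket_async would do. The endpoint and event names follow the commit text; the websockets-based plumbing is an assumption, not code from this PR:

```python
import asyncio
import base64
import json
import websockets  # pip install websockets

async def transcribe(pcm16_bytes, url="ws://localhost:30000/v1/realtime"):
    async with websockets.connect(url) as ws:
        # Configure a transcription session; sample_rate is the sglang
        # extension noted above, and turn_detection must stay null.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "transcription",
                "audio": {"input": {
                    "format": "audio/pcm",
                    "sample_rate": 16000,
                    "transcription": {"model": "default"},
                    "turn_detection": None,
                }},
            },
        }))
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_bytes).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event["type"] == "conversation.item.input_audio_transcription.completed":
                break

# Example: asyncio.run(transcribe(open("audio.pcm", "rb").read()))
```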
SammLSH added a commit to SammLSH/sglang that referenced this pull request May 4, 2026
Move /v1/audio/transcriptions/stream to /v1/realtime and switch from
the M1 session.start/binary-PCM protocol to OpenAI's Realtime
transcription wire format. The shared inference driver is untouched,
so HTTP SSE and WS still produce byte-identical transcripts; this is
purely a transport rewrite.

sglang deviations from the spec live in the module docstring:
sample_rate is a sglang extension accepting 16/24/48 kHz with internal
resample (OpenAI fixes audio/pcm at 24 kHz), turn_detection and
noise_reduction must be null (no server-side VAD), include[] is
dropped, model is echo-only.

Addresses sgl-project#22848 review sgl-project#1 (decouple from OpenAIServingTranscription),
sgl-project#4 (type hints), sgl-project#5 (--asr-max-concurrent-sessions, default 32).
sgl-project#2 (skip PCM round trip) is deferred since it changes process_asr_chunk's
input contract.
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request May 4, 2026
…ional descriptions

Comment-only cleanup. Replaces 14 internal-nickname references
(Phase 24, A2-sgl-project#1, A2-sgl-project#2, MEGA-3', Phase A1, Phase 13, etc.) with
descriptive functional explanations of the surrounding code.

No code semantics or behavior change. Equivalent via diff filtered
to non-comment lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangzhilin-hzl added a commit to huangzhilin-hzl/sglang that referenced this pull request May 8, 2026