FP4 weight loading and inference (2/2) #3972

Merged
zhyncs merged 6 commits into sgl-project:main from trevor-m:fp4-upstream-checkpoints
Apr 9, 2025

@trevor-m
Collaborator

Motivation

Requires #3899 for new kernels in sgl-kernel.

This PR is part 2 of 2 adding support for ModelOpt FP4-quantized models.
Tested using an FP4-quantized Llama 3.1 model.

This work was adapted from the following - thanks @pavanimajety @kaixih @kushanam!
vllm-project/vllm#12784
vllm-project/vllm#13571
vllm-project/vllm#12520

Modifications

Adds the modelopt_fp4 quantization method.
Adds ModelOptFp4Config and ModelOptFp4LinearMethod, which use the new FP4 kernels for linear layers.
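
For readers unfamiliar with block-scaled FP4, the idea behind this kind of linear-layer quantization can be sketched in a few lines of plain Python. This is only an illustration, not the code in this PR or in sgl-kernel. It assumes a 4-bit E2M1 value grid and one scale per block of 16 weights; the names `E2M1_GRID`, `BLOCK_SIZE`, `quantize_block` and `dequantize_block` are hypothetical. A real checkpoint would also pack two 4-bit codes per byte and store scales in a low-precision format, which this sketch skips by keeping plain floats.

```python
# Illustrative sketch only: NOT the sgl-kernel / ModelOpt implementation.
# Assumption: values are snapped to the E2M1 magnitude grid, one scale per block.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes of a 4-bit E2M1 float
BLOCK_SIZE = 16  # assumed number of weights sharing one scale


def quantize_block(block):
    """Return (scale, codes); each code is a (is_negative, grid_index) pair."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest grid point; ties go to the smaller magnitude in this sketch.
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append((x < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Rebuild approximate float weights from a scale and its codes."""
    return [(-1.0 if neg else 1.0) * E2M1_GRID[idx] * scale for neg, idx in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each one."""
    return [
        quantize_block(weights[i : i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [0.01 * (i - 16) for i in range(32)]
    restored = dequantize(quantize(weights))
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs round-trip error: {max_err:.4f}")
```

The per-block scale is what keeps the error bounded: the widest gap on the grid is between 4 and 6, so the round-trip error for any weight is at most one sixth of the largest magnitude in its block.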

@zhyncs
Collaborator

zhyncs commented Mar 22, 2025

Hi @trevor-m Can you help resolve the conflicts? Thanks. cc @kushanam @elfiegg

@pavanimajety
Collaborator

@zhyncs The conflicts should resolve automatically once part 1 of the PR is merged and this PR is rebased on top of it.

@trevor-m trevor-m force-pushed the fp4-upstream-checkpoints branch from f6d5855 to 21c070e Compare March 24, 2025 22:55
@zhyncs
Collaborator

zhyncs commented Mar 25, 2025

Hi @trevor-m @kushanam @elfiegg May you help fix this PR's conflicts?

@zhyncs
Collaborator

zhyncs commented Mar 25, 2025


@pavanimajety

@trevor-m trevor-m force-pushed the fp4-upstream-checkpoints branch from 21c070e to f63cd4d Compare March 25, 2025 02:58
@trevor-m
Collaborator Author

trevor-m commented Mar 25, 2025

@pavanimajety

Sorry about that, just fixed! @zhyncs

@pavanimajety
Collaborator

Thanks @trevor-m and @zhyncs! I missed that there was a code conflict.

@trevor-m trevor-m force-pushed the fp4-upstream-checkpoints branch from f63cd4d to ef2222d Compare March 25, 2025 16:46
@Edwardf0t1
Collaborator

Hi @trevor-m @pavanimajety @zhyncs Great work! could we resolve conflicts and get this merged? Thanks.

@trevor-m trevor-m force-pushed the fp4-upstream-checkpoints branch from e078cfa to 165e613 Compare April 2, 2025 23:57
@trevor-m
Collaborator Author

trevor-m commented Apr 3, 2025

> Hi @trevor-m @pavanimajety @zhyncs Great work! could we resolve conflicts and get this merged? Thanks.

Thanks @Edwardf0t1, fixed.

@trevor-m trevor-m force-pushed the fp4-upstream-checkpoints branch from 165e613 to ccada8a Compare April 7, 2025 16:34
@Edwardf0t1 Edwardf0t1 self-requested a review April 7, 2025 21:45
Comment thread python/sglang/srt/layers/quantization/modelopt_quant.py Outdated
remove this

Collaborator

@zhyncs zhyncs left a comment

May you provide the e2e gsm8k result for nvidia/DeepSeek-R1-FP4

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

@Edwardf0t1
Collaborator

> May you provide the e2e gsm8k result for nvidia/DeepSeek-R1-FP4
>
> python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

@zhyncs I don't think this PR would work for DS R1 FP4 since that requires FP4 MoE kernels.

@trevor-m could you run a quick test and show the outputs for the following:

import sglang as sgl

def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path="nvidia/Llama-3.3-70B-Instruct-FP4", quantization="modelopt_fp4")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()

trevor-m added 3 commits April 8, 2025 22:16
Fixes

trying to fix

cleanup

Run pre-commit

Remove unused is_cutlass_fp4 supported

remove sgl-kernel changes
@trevor-m
Collaborator Author

trevor-m commented Apr 8, 2025

> May you provide the e2e gsm8k result for nvidia/DeepSeek-R1-FP4
>
> python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

As @Edwardf0t1 mentioned, we can't run DS R1 FP4 yet, but I ran your benchmark command for nvidia/Llama-3.3-70B-Instruct-FP4:

Accuracy: 0.939
Invalid: 0.000
Latency: 65.557 s
Output throughput: 2229.165 token/s
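
As a rough sanity check on these numbers, the reported latency and throughput can be combined to estimate how many tokens were generated. The sketch below assumes the benchmark script reports output throughput as total output tokens divided by wall-clock latency; that definition is an assumption here, not something stated in this thread, and the variable names are only illustrative.

```python
# Back-of-the-envelope check on the reported gsm8k run.
# Assumption: output throughput = total output tokens / latency.
latency_s = 65.557               # "Latency" from the results above
throughput_tok_per_s = 2229.165  # "Output throughput" from the results above
num_questions = 1319             # --num-questions in the benchmark command

total_output_tokens = latency_s * throughput_tok_per_s
tokens_per_question = total_output_tokens / num_questions

print(f"estimated output tokens: {total_output_tokens:,.0f}")
print(f"estimated tokens per question: {tokens_per_question:.1f}")
```

Under that assumption the run produced on the order of a hundred output tokens per question, which is plausible for 8-shot GSM8K answers.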

@trevor-m
Collaborator Author

trevor-m commented Apr 9, 2025

> @trevor-m could you run a quick test and show the outputs for the following:
>
> import sglang as sgl
>
> def main():
>
>     prompts = [
>         "Hello, my name is",
>         "The president of the United States is",
>         "The capital of France is",
>         "The future of AI is",
>     ]
>     sampling_params = {"temperature": 0.8, "top_p": 0.95}
>     llm = sgl.Engine(model_path="nvidia/Llama-3.3-70B-Instruct-FP4", quantization="modelopt_fp4")
>
>     outputs = llm.generate(prompts, sampling_params)
>     for prompt, output in zip(prompts, outputs):
>         print("===============================")
>         print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
>
> if __name__ == "__main__":
>     main()

Results

===============================
Prompt: Hello, my name is
Generated text:  Angelique. I am a German Shorthaired Pointer. I am a very happy and energetic dog. I love playing with my toys and going for walks with my human family. I also love swimming and running around in the backyard. My favorite thing to do is play fetch with my favorite ball. I will do just about anything for a belly rub. I am a very loyal and loving companion and I promise to always be by your side. I hope you like my picture and that you will come to visit me soon. I get along great with other dogs and I love people. I am ready to go to my forever home and be
===============================
Prompt: The president of the United States is
Generated text:  just one of many important leaders in the country. The U.S. government has three branches: the executive, legislative, and judicial. The legislative branch is further divided into the Senate and the House of Representatives. The president is the head of the executive branch, but they don't have absolute power. They must work with Congress to pass laws and make decisions for the country.
A. the executive branch of the U.S. government
B. the legislative branch of the U.S. government
C. the judicial branch of the U.S. government
D. the head of the Senate
E. the head of the House of Representatives
===============================
Prompt: The capital of France is
Generated text:  famous for its stunning architecture, art museums, and romantic atmosphere. But there's more to Paris than just the Eiffel Tower and the Louvre. This City Guide will take you on a journey through the city's charming neighborhoods, historic landmarks, and cultural attractions, helping you to discover the authentic Paris that locals love.
We'll start with the basics: when to go, how to get around, and where to stay. Then, we'll dive into the city's must-see attractions, from the iconic Notre-Dame Cathedral to the trendy Marais neighborhood. You'll learn about the city's rich history, from the French Revolution
===============================
Prompt: The future of AI is
Generated text:  not just about the technology, but about the people who will shape it. At Mosaic Smart Data, we believe that the best results come from collaboration between humans and AI systems. Our people are experts in data science, technology, and business, and we work together to deliver cutting-edge AI solutions that drive real value for our clients.
Our team has a passion for innovation and a drive to make a real difference in the financial services industry. We believe in empowering our people to take ownership and to continuously learn and develop their skills. If you are excited about the potential of AI to transform the financial services industry and want to be part of a

@zhyncs zhyncs merged commit 11d760d into sgl-project:main Apr 9, 2025
20 of 26 checks passed
@zhyncs
Collaborator

zhyncs commented Apr 9, 2025

Great work! @trevor-m @Edwardf0t1

finger92 pushed a commit to protagolabs/sglang that referenced this pull request Apr 10, 2025
thyecust pushed a commit to thyecust/sglang that referenced this pull request Apr 11, 2025
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Apr 13, 2025
DiweiSun pushed a commit to DiweiSun/sglang that referenced this pull request Apr 16, 2025
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: Yangcheng Li <bluebluelitchi@hotmail.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: mRSun15 <3150105645@zju.edu.cn>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
@merrymercy merrymercy mentioned this pull request Apr 26, 2025
67 tasks
@Graham1025

Graham1025 commented Jun 12, 2025

@trevor-m could you run a quick test and show the outputs for the following:

import sglang as sgl

def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path="nvidia/Llama-3.3-70B-Instruct-FP4", quantization="modelopt_fp4")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()

Results

===============================
Prompt: Hello, my name is
Generated text:  Angelique. I am a German Shorthaired Pointer. I am a very happy and energetic dog. I love playing with my toys and going for walks with my human family. I also love swimming and running around in the backyard. My favorite thing to do is play fetch with my favorite ball. I will do just about anything for a belly rub. I am a very loyal and loving companion and I promise to always be by your side. I hope you like my picture and that you will come to visit me soon. I get along great with other dogs and I love people. I am ready to go to my forever home and be
===============================
Prompt: The president of the United States is
Generated text:  just one of many important leaders in the country. The U.S. government has three branches: the executive, legislative, and judicial. The legislative branch is further divided into the Senate and the House of Representatives. The president is the head of the executive branch, but they don't have absolute power. They must work with Congress to pass laws and make decisions for the country.
A. the executive branch of the U.S. government
B. the legislative branch of the U.S. government
C. the judicial branch of the U.S. government
D. the head of the Senate
E. the head of the House of Representatives
===============================
Prompt: The capital of France is
Generated text:  famous for its stunning architecture, art museums, and romantic atmosphere. But there's more to Paris than just the Eiffel Tower and the Louvre. This City Guide will take you on a journey through the city's charming neighborhoods, historic landmarks, and cultural attractions, helping you to discover the authentic Paris that locals love.
We'll start with the basics: when to go, how to get around, and where to stay. Then, we'll dive into the city's must-see attractions, from the iconic Notre-Dame Cathedral to the trendy Marais neighborhood. You'll learn about the city's rich history, from the French Revolution
===============================
Prompt: The future of AI is
Generated text:  not just about the technology, but about the people who will shape it. At Mosaic Smart Data, we believe that the best results come from collaboration between humans and AI systems. Our people are experts in data science, technology, and business, and we work together to deliver cutting-edge AI solutions that drive real value for our clients.
Our team has a passion for innovation and a drive to make a real difference in the financial services industry. We believe in empowering our people to take ownership and to continuously learn and develop their skills. If you are excited about the potential of AI to transform the financial services industry and want to be part of a

@trevor-m Thanks for your great work! But on my 5090 device, I got the following error:
Traceback (most recent call last):
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 307, in __init__
self.capture()
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 386, in capture
) = self.capture_one_batch_size(bs, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 512, in capture_one_batch_size
with torch.cuda.graph(graph, pool=global_graph_memory_pool, stream=stream):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lixiang/lixinlei/envir/yes/envs/guihao-fp4/lib/python3.12/site-packages/torch/cuda/graphs.py", line 186, in __exit__
self.cuda_graph.capture_end()
File "/home/lixiang/lixinlei/envir/yes/envs/guihao-fp4/lib/python3.12/site-packages/torch/cuda/graphs.py", line 84, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/managers/scheduler.py", line 2297, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/managers/scheduler.py", line 277, in __init__
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 64, in __init__
self.worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/model_runner.py", line 231, in __init__
self.initialize(min_per_gpu_memory)
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/model_runner.py", line 305, in initialize
self.init_cuda_graphs()
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/model_runner.py", line 1098, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lixiang/guihao/lpex-sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 309, in __init__
raise Exception(
Exception: Capture CUDA graph failed: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Possible solutions:

  1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
  3. disable torch compile by not using --enable-torch-compile
  4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
    Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

How can I fix this error? Thanks.
Disabling CUDA graph fixes it, but performance is much lower.
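
For anyone hitting the same failure, the options listed in the error message could be tried one at a time from the Python API. The following is only a minimal sketch, not a confirmed fix for the 5090 case. It assumes that `sgl.Engine` accepts keyword arguments named after the CLI flags in snake_case (`mem_fraction_static`, `cuda_graph_max_bs`, `disable_cuda_graph`). The helper names `MITIGATIONS`, `engine_kwargs`, and `launch` are hypothetical and exist only for this illustration.

```python
# Model and quantization method taken from the test script earlier in the thread.
BASE = {
    "model_path": "nvidia/Llama-3.3-70B-Instruct-FP4",
    "quantization": "modelopt_fp4",
}

# Hypothetical labels for the options printed in the error message.
# The keyword names are ASSUMED to mirror the CLI flags in snake_case.
MITIGATIONS = {
    "less_static_memory": {"mem_fraction_static": 0.8},   # --mem-fraction-static
    "smaller_graph_batch": {"cuda_graph_max_bs": 16},     # --cuda-graph-max-bs
    "no_cuda_graph": {"disable_cuda_graph": True},        # --disable-cuda-graph
}


def engine_kwargs(*names):
    """Merge the base settings with the selected mitigations, in order."""
    kwargs = dict(BASE)
    for name in names:
        kwargs.update(MITIGATIONS[name])
    return kwargs


def launch(*names):
    """Start an engine with the selected mitigations (needs sglang and a GPU)."""
    import sglang as sgl  # imported lazily so the helpers above work without it

    return sgl.Engine(**engine_kwargs(*names))
```

For example, `launch("smaller_graph_batch")` keeps CUDA graph enabled while capturing fewer batch sizes, and `launch("no_cuda_graph")` corresponds to the workaround already reported above. Running each attempt in a fresh process is probably the safer choice, since a failed engine start may take the parent process down with it.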
