Checklist

Motivation

Support `silu_and_mul` and `gelu_and_mul` in AMD; remove the current dependencies on `vllm ops.silu_and_mul` and `ops.gelu_and_mul`. Used in `fused_moe_triton.py`. [ROCm] Enable silu_and_mul, gelu_and_mul, gelu_tanh_and_mul in amd platform #4150 @yiakwy-xpu-ml-framework-team
Remove `from vllm.model_executor.layers.activation import GeluAndMul, SiluAndMul` in `sglang/python/sglang/srt/layers/activation.py`.
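For context, these two ops are the fused gated activations used by SwiGLU/GeGLU MLP blocks: the input packs `[gate, up]` along the last dimension. A minimal PyTorch sketch of the expected semantics (the `_ref` names are illustrative, not the sgl-kernel API):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x packs [gate, up] along the last dim; the output has half the width
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def gelu_and_mul_ref(x: torch.Tensor, approximate: str = "none") -> torch.Tensor:
    # same layout; approximate="tanh" gives the gelu_tanh_and_mul variant
    d = x.shape[-1] // 2
    return F.gelu(x[..., :d], approximate=approximate) * x[..., d:]
```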
Support `GemmaRMSNorm` and `RMSNorm` in AMD.
Remove `from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm` in `sglang/python/sglang/srt/layers/layernorm.py`.
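Both norms share the same RMS statistic; the Gemma variant stores its weight zero-centered and applies `1 + weight`. A sketch of the math to validate an AMD port against (illustrative `_ref` names; the kernels' exact dtype handling may differ):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # normalize by the root-mean-square over the last dimension
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps) * weight.float()).to(x.dtype)

def gemma_rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Gemma folds a +1 into the scale, i.e. the stored weight is zero-centered
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps) * (1.0 + weight.float())).to(x.dtype)
```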
Support the `rotary_embedding` kernel in AMD.
Support `ops.moe_sum` in AMD; remove the dependency on `vllm ops.moe_sum`. Used in `fused_moe_triton.py`.
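`moe_sum` just reduces the top-k per-expert partial outputs back to one hidden state per token. A one-line reference for checking an AMD implementation (the out-parameter signature here is an assumption modeled on the vLLM op):

```python
import torch

def moe_sum_ref(x: torch.Tensor, out: torch.Tensor) -> None:
    # x: [num_tokens, topk, hidden]; out: [num_tokens, hidden]
    torch.sum(x, dim=1, out=out)
```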
Benchmark `vllm ops.moe_align_block_size`, `moe_align_block_size_triton`, and `sgl_moe_align_block_size`, and remove the `num_experts=256` limitation in `sgl_moe_align_block_size`. After that, select the kernel directly from `moe_align_block_size_triton` and `sgl_moe_align_block_size`, and remove the dependency on `vllm ops.moe_align_block_size`. Used in `fused_moe_triton.py`. remove moe_align vllm dep #4249 & refine sgl_moe_align_block_size_benchmark #4327
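What the benchmark compares: all three kernels produce the same alignment metadata for the fused MoE GEMM. A pure-PyTorch reference of that contract (a sketch for validation, not a fast path; names are illustrative):

```python
import torch

def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int, num_experts: int):
    """Group the flattened (token, top-k slot) indices by expert and pad each
    expert's segment to a multiple of block_size, so every block maps to a
    single expert."""
    flat = topk_ids.flatten()
    pad_val = flat.numel()  # out-of-range sentinel marking padded slots
    counts = torch.bincount(flat, minlength=num_experts)
    padded = (counts + block_size - 1) // block_size * block_size
    total = int(padded.sum())
    sorted_token_ids = torch.full((total,), pad_val, dtype=torch.int32)
    expert_ids = torch.empty(total // block_size, dtype=torch.int32)
    offset = 0
    for e in range(num_experts):
        idx = (flat == e).nonzero(as_tuple=True)[0].to(torch.int32)
        sorted_token_ids[offset : offset + idx.numel()] = idx
        n_blocks = int(padded[e]) // block_size
        expert_ids[offset // block_size : offset // block_size + n_blocks] = e
        offset += int(padded[e])
    num_tokens_post_pad = torch.tensor([total], dtype=torch.int32)
    return sorted_token_ids, expert_ids, num_tokens_post_pad
```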
Implement `scaled_int8_quant` in sgl-kernel and remove the current dependency on `vllm ops.scaled_int8_quant`. Used in `fused_moe_triton.py`. @zcnrex
Implement `per_token_group_quant_int8` in CUDA, replacing the current `per_token_group_quant_int8` Triton implementation. Used in `fused_moe_triton.py`. @zcnrex
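Both quant ops are symmetric int8 rounding with a scale derived from the max magnitude, per token row or per contiguous group. A reference sketch under that assumption (the real kernels' clamping and rounding details may differ):

```python
import torch

def scaled_int8_quant_ref(x: torch.Tensor):
    # symmetric per-token quantization: one scale per row
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10).float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -128, 127).to(torch.int8)
    return q, scale

def per_token_group_quant_int8_ref(x: torch.Tensor, group_size: int):
    # same idea, but one scale per group along the last dimension
    assert x.shape[-1] % group_size == 0
    g = x.reshape(-1, group_size).float()
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10) / 127.0
    q = torch.clamp(torch.round(g / scale), -128, 127).to(torch.int8)
    return q.reshape(x.shape), scale.reshape(*x.shape[:-1], -1)
```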
Support the `apply_rope_with_cos_sin_cache_inplace` kernel in AMD; remove the current dependency on `vllm ops.rotary_embedding`. Used in `rotary_embedding.py`.
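For validating an AMD port, the rotation itself is the standard NeoX-style pairing of the two halves of the rotary dimensions. A sketch assuming cos/sin have already been gathered per position from the cache (the real kernel updates q and k in place; names are illustrative):

```python
import torch

def apply_rope_ref(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # q: [..., rot_dim]; cos/sin: [..., rot_dim // 2], gathered per position
    half = q.shape[-1] // 2
    q1, q2 = q[..., :half], q[..., half:]
    return torch.cat((q1 * cos - q2 * sin, q2 * cos + q1 * sin), dim=-1)
```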
Support `sglang_per_token_group_quant_fp8` in AMD. Used in `fused_moe_triton.py`. [tools] add fp8 max/min constant in utils #3959 [ROCm] Enable per token group quant fp8 in amd #3702 @yiakwy-xpu-ml-framework-team
Implement the `scaled_fp8_quant` kernel and remove the current dependency on `vllm ops.scaled_fp8_quant`. (This is in progress, 50% complete; see [quant kernel] sgl-kernel support per_tensor_quant fp8 #3786 for per-tensor support, and @hebiao064 is working on per-token support. `vllm ops.scaled_fp8_quant` will support both per-tensor and per-token.) Used in `fused_moe_triton.py`, `layer.py` and `fp8.py`. @BBuf @hebiao064 Add sgl_per_token_quant_fp8 #4089, [Refactor] Reducing code duplication across FP8 CUDA quantization kernels #4163, https://github.com/sgl-project/sglang/pull/4231, https://github.com/sgl-project/sglang/pull/4215
Implement the `topk_softmax` kernel and remove the current dependency on `vllm.ops.topk_softmax`. Used in `topk.py`. Add moe topk softmax templated from vllm #4302
Support `topk_softmax` in AMD. [ROCm] enable moe topk softmax in amd #4448
Remove `ops.topk_softmax` in `python/sglang/srt/layers/moe/topk.py`. remove vllm ops.topk_softmax dependency #4498
Implement the `awq_dequantize` kernel and remove the current dependency on `vllm ops.awq_dequantize`. Used in `deepseek_nextn.py` and `deepseek_v2.py`. Add awq dequantize kernel to sgl with 1x to 3x speedup #4104 @zcnrex
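For the `topk_softmax` and `scaled_fp8_quant` items above, the contracts are small enough to pin down with references. A sketch assuming per-tensor scaling to the e4m3fn max (448), with illustrative `_ref` names:

```python
import torch

def topk_softmax_ref(gating_logits: torch.Tensor, topk: int):
    # router helper: softmax over expert logits, then top-k weights and ids per token
    probs = torch.softmax(gating_logits.float(), dim=-1)
    topk_weights, topk_ids = torch.topk(probs, topk, dim=-1)
    return topk_weights, topk_ids

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def scaled_fp8_quant_ref(x: torch.Tensor):
    # per-tensor fp8 quantization: scale so the max magnitude maps to FP8_MAX
    scale = (x.abs().amax().float() / FP8_MAX).clamp(min=1e-12)
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale
```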
Related resources

No response