apply wna16marlin kernel in moe weight only quantization #5639
AniZpZ wants to merge 43 commits into sgl-project:main from
Conversation
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove part of vllm dependency and import error
* fix: modify the include package of gptq_marlin_repack
* fix: remove vllm dependency of fused_moe
* fix: delay the try_get_optimal_moe_config import
* fix: replace namespace of gptq_marlin_repack
* fix: update the head file of gptq_marlin_repack.cu
* update cmakelists
* modify the repack.cu
* modify namespace
* update head files
* update namespace marlin->marlin_moe_wna16
* update namespace
* add namespace
* update headfile
* remove invalid parentheses
* update CMakeLists.txt
* update headfiles
* update headsfile
* remove nested define
* add parentheses
* update
* move cuh to cu
* move identifier from namespace to outside
* update
* add namespace scope
* remove condition of compile define
* add null implementation for host
* remove namespace scope
* remove sm75
* remove define conditions & add gptq_marlin_repack_meta impl
* remove repack_meta
* add register namespace
* add namespace in sgl_kernel_ops
* add register namespace
* delay the moe_align_block_size import
* modify the import of moe_align_block_size
* add scalar_type.py & modify import of fused_moe.py
* add compilation condition & remove VLLM_AVAILABLE
* remove VLLM_AVAILABLE
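For reference, a minimal sketch of how the repack op added here is meant to be used, assuming the sgl-kernel port keeps the vLLM-style `gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)` signature; the import path and tensor shapes below are illustrative, not the literal code in this PR:

```python
import torch
from sgl_kernel import gptq_marlin_repack  # import path assumed

# Illustrative sizes; num_bits=4 is the wna16 (4-bit weight) case
size_k, size_n, num_bits = 4096, 4096, 4
pack_factor = 32 // num_bits  # quantized values packed per int32 word

# GPTQ-format weight: values packed along K into int32 words
b_q_weight = torch.randint(
    0, 2**31 - 1, (size_k // pack_factor, size_n),
    dtype=torch.int32, device="cuda",
)
# Empty perm = no act-order (desc_act=False) reordering
perm = torch.empty(0, dtype=torch.int32, device="cuda")

# Repack into the tiled marlin layout consumed by the wna16 marlin GEMM
marlin_qweight = gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)
```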
update accuracy (subject: abstract_algebra, #q: 100, acc: 0.730)
We are working to completely remove the vLLM dependency and will let you know once it is completed.
Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin
https://code.alipay.com/Theta/SGLang/pull_requests/90
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test
Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin
https://code.alipay.com/Theta/SGLang/pull_requests/91
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test
* chore: fix PytestWarning & update quant_utils
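The unit tests mentioned above would look roughly like the following sketch; the 16-row marlin tile size, the expected output shape, and the import path are assumptions based on the standard marlin layout, not the literal test code:

```python
import pytest
import torch
from sgl_kernel import gptq_marlin_repack  # import path assumed

MARLIN_TILE = 16  # standard marlin tile size (assumption)

@pytest.mark.parametrize("num_bits", [4, 8])
@pytest.mark.parametrize("size_k,size_n", [(256, 512), (4096, 4096)])
def test_gptq_marlin_repack_layout(num_bits, size_k, size_n):
    pack_factor = 32 // num_bits
    b_q_weight = torch.randint(
        0, 2**31 - 1, (size_k // pack_factor, size_n),
        dtype=torch.int32, device="cuda",
    )
    perm = torch.empty(0, dtype=torch.int32, device="cuda")

    out = gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)

    # marlin layout collapses K by the tile size and widens N accordingly
    assert out.dtype == torch.int32
    assert out.shape == (size_k // MARLIN_TILE,
                         size_n * MARLIN_TILE // pack_factor)
```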
Motivation
Apply the wna16 marlin kernel to speed up the weight-only quantized dsv3 (DeepSeek-V3) model.

We observe a decoding speed of over 100 TPS at bs=1 on an 8×H20 platform.
Performance (based on 0.4.6):
Modifications
Update the routing logic of compressed_tensors_moe and fix some bugs.
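As a rough illustration of the new routing (not the literal diff), the compressed-tensors MoE method now dispatches the expert GEMMs to the wna16 marlin kernel; the `fused_marlin_moe` entry point, its argument order, and the layer attribute names below are assumptions modeled on the vLLM-style API:

```python
import torch

def apply_marlin_moe(layer, x, router_logits, topk_weights, topk_ids):
    """Hypothetical sketch of the wna16 marlin MoE forward path."""
    from sgl_kernel import fused_marlin_moe  # entry point name assumed

    # w13/w2 were repacked into the marlin layout at weight-load time;
    # attribute names follow the vLLM-style compressed-tensors convention
    return fused_marlin_moe(
        x,
        layer.w13_weight_packed,
        layer.w2_weight_packed,
        layer.w13_weight_scale,
        layer.w2_weight_scale,
        router_logits,
        topk_weights,
        topk_ids,
        num_bits=4,  # wna16: 4-bit weights, 16-bit activations
    )
```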
Co-authored-by: @huangtingwei9988
Note: PR553 by @yych0745, @HandH1998, and @sleepcoo adds the wna16 marlin MoE kernel to sglang, which removes the dependency on vLLM.
Checklist