Closed as not planned
Labels
new-model (requests to new models), performance (performance-related issues), stale (over 90 days of inactivity)
Description
This issue tracks follow-up enhancements after initial support for the DeepSeek-V3 model. Please feel free to chime in and contribute!
- Follow up on [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization #11523: enhance testing with the shapes of production models and run it regularly on H100.
- Solve via CUTLASS block-wise quantization kernels.
- Follow up on Deepseek v3 #11502:
- Test and enable torch.compile
- Refactor MoEMethodBase to unify and clean up the extra arguments of `scoring_func` and `e_correction_bias`
- Kernel tuning for 8xH200, MI300X, H100 (TP16 and TP8PP2 cases)
- Use https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py, but adapt it for the w8a8 fused MoE kernel.
- CUDA Graph support
- MLA: [WIP] Deepseek V2 MLA #10927 (@simon-mo)
- Support nextn prediction heads (EAGLE style prediction heads)
- Support expert parallelism for MoE.
- Support data parallelism for MLA.
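The w8a8 block-wise quantization item above can be illustrated with a small sketch. This is not the vLLM kernel: int8 stands in for fp8 (numpy has no fp8 dtype), and `blockwise_quantize` / `blockwise_dequantize` are hypothetical helper names showing only the per-block scaling math, i.e. one scale per 128x128 weight tile.

```python
import numpy as np

def blockwise_quantize(w, block=128, qmax=127.0):
    """Quantize a 2-D weight with one scale per (block x block) tile.

    Illustrative sketch: int8 stands in for the fp8 dtype used by the
    real kernel; only the block-wise scaling scheme is shown."""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / qmax  # per-block scale
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.clip(np.round(tile / s), -qmax, qmax)
    return q, scales

def blockwise_dequantize(q, scales, block=128):
    # Broadcast each tile's scale back over the tile.
    s = np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
    return q.astype(np.float32) * s

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scales = blockwise_quantize(w)
w_hat = blockwise_dequantize(q, scales)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step per block
```

Per-block (rather than per-tensor) scales keep outliers in one tile from destroying precision everywhere else, which is the motivation for block-wise fp8 in the first place.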
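The kernel-tuning bullets (8xH200, MI300X, H100; adapting benchmark_moe.py) boil down to a sweep-and-time loop. A minimal stand-in follows; `bench` and `tune` are illustrative names, not the benchmark_moe.py API, and the toy workload replaces the fused MoE kernel launch that the real script drives on GPU.

```python
import time

def bench(fn, warmup=3, iters=10):
    """Time a callable: warm up, then report the best of `iters` runs."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

def tune(make_fn, configs):
    """Sweep candidate kernel configs (e.g. block sizes) and keep the fastest."""
    results = {cfg: bench(make_fn(cfg)) for cfg in configs}
    return min(results, key=results.get), results

# Toy workload whose cost scales with a hypothetical "block size" config.
make = lambda cfg: (lambda: sum(i * i for i in range(cfg * 1000)))
best, results = tune(make, configs=(1, 2, 4))
```

Taking the minimum over repeats (rather than the mean) filters out scheduler noise, which matters when the per-config differences are small.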
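The nextn prediction heads item refers to draft-and-verify speculative decoding. A toy greedy sketch, assuming stand-in `target_next` / `draft_next` callables that map a context to the next token: real EAGLE-style heads predict from the target model's hidden states, and acceptance can be sampled rather than greedy.

```python
def speculative_step(target_next, draft_next, ctx, k=4):
    """One draft-and-verify round, greedy variant (illustrative only)."""
    # Draft proposes k tokens autoregressively.
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        proposal.append(t)
        c.append(t)
    # Target verifies: accept the longest agreeing prefix, then emit its
    # own token at the first mismatch (or a bonus token if all match).
    accepted, c = [], list(ctx)
    for t in proposal:
        want = target_next(c)
        if want == t:
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(want)
            break
    else:
        accepted.append(target_next(c))
    return accepted

# Toy models: target emits last token + 1; draft disagrees on every 3rd step.
target = lambda c: c[-1] + 1
draft = lambda c: c[-1] + 1 if len(c) % 3 else c[-1] + 2
out = speculative_step(target, draft, [0], k=4)  # -> [1, 2, 3]
```

Every accepted draft token saves one full target forward pass, which is the whole payoff of the prediction heads.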
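Expert parallelism for MoE means routing each token's top-k experts to whichever rank owns them. A hypothetical numpy sketch, assuming round-robin expert sharding (expert `e` lives on rank `e % world_size`); none of these names are vLLM APIs, and the real path uses an all-to-all collective instead of Python lists.

```python
import numpy as np

def topk_route(logits, k=2):
    """Pick top-k experts per token and softmax-normalize their scores."""
    idx = np.argsort(logits, axis=-1)[:, -k:]       # (tokens, k) expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

def dispatch(token_ids, expert_ids, world_size):
    """Group (token, expert) pairs by owning rank under round-robin sharding."""
    per_rank = [[] for _ in range(world_size)]
    for t, e in zip(token_ids, expert_ids):
        per_rank[e % world_size].append((t, e))
    return per_rank

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 16))               # 8 tokens, 16 experts
idx, w = topk_route(logits, k=2)
flat_tokens = np.repeat(np.arange(8), 2)
flat_experts = idx.reshape(-1)
buckets = dispatch(flat_tokens, flat_experts, world_size=4)
```

Because routing is data-dependent, the per-rank bucket sizes are uneven; balancing them is what makes expert parallelism harder than plain tensor parallelism.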