Update QMoE kernel with optimizations #26091
base: main
Conversation
426e1ac to 509e17c
```cpp
size_t expected_size = MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
return expected_size > 0;
// TEMPORARY: Disable direct Q4 GEMM
return false;
```
Any reason to disable this optimized path?
It was giving gibberish output after some tokens; that needs a deeper dive.
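For illustration, here is a minimal sketch of how the temporary disable could be kept switchable while the incorrect-output issue is investigated. The `ORT_QMOE_ENABLE_Q4GEMM` environment variable and the free-standing helper are hypothetical and not part of this PR; the `MlasQ4GemmPackBSize` call mirrors the check quoted above and assumes the MLAS Q4 header is available.

```cpp
// Sketch only: gate the direct Q4 GEMM path behind an opt-in toggle instead of a
// hard-coded `return false`, so the optimized path can still be exercised while
// the gibberish-output issue is being debugged.
#include <cstddef>
#include <cstdint>
#include <cstdlib>

#include "mlas_q4.h"  // assumed header providing MLAS_BLK_QUANT_TYPE and MlasQ4GemmPackBSize

static bool CanUseMlasQ4Gemm(MLAS_BLK_QUANT_TYPE out_qtype, int64_t rows, int64_t cols) {
  // Hypothetical opt-in switch; by default the path stays disabled, matching the PR.
  const char* enable = std::getenv("ORT_QMOE_ENABLE_Q4GEMM");
  if (enable == nullptr || enable[0] != '1') {
    return false;
  }
  // Original check: MLAS reports a non-zero packed-B size only when it supports
  // this quantization type and shape combination.
  size_t expected_size =
      MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
  return expected_size > 0;
}
```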
```cpp
} else if (num_tokens == 1) {
  // Single token decoding: use 1 thread (routing overhead not worth parallelizing)
  optimal_routing_threads = 1;
} else if (num_tokens < min_work_per_thread) {
```
Are there any benchmark results showing that this change helps performance?
There was not much performance difference; the improvement was no more than 10%.
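For reference, a sketch of the kind of adaptive routing-thread heuristic being discussed: `num_tokens` and `min_work_per_thread` mirror the quoted diff, while the helper name, the `max_threads` parameter, and the concrete thresholds are illustrative rather than the values used in the PR.

```cpp
// Sketch of an adaptive routing-thread heuristic: a single decoded token runs on one
// thread, small batches are capped so each thread keeps meaningful work, and larger
// batches scale with the available thread pool.
#include <algorithm>
#include <cstdint>

static int ChooseRoutingThreads(int64_t num_tokens, int max_threads,
                                int64_t min_work_per_thread = 16) {
  if (max_threads <= 1 || num_tokens <= 0) {
    return 1;
  }
  if (num_tokens == 1) {
    // Single-token decoding: routing overhead is not worth parallelizing.
    return 1;
  }
  if (num_tokens < min_work_per_thread) {
    // Small batch: keep per-thread work above the threshold (illustrative cap).
    return 2;
  }
  // Batch case: roughly one thread per `min_work_per_thread` tokens, bounded by the pool.
  const int64_t wanted = num_tokens / min_work_per_thread;
  return static_cast<int>(std::min<int64_t>(wanted, max_threads));
}
```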
```cpp
  num_expert_threads = std::min(num_expert_threads, 8);
} else if (total_work < 384) {
  // Very large workload - use more threads
  num_expert_threads = std::min(num_expert_threads, 12);
```
Did we consider the number of CPU cores in the num_expert_threads computation? If I have a CPU with 8 cores, using more than 8 threads will slow things down.
I have not; I will try it.
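A minimal sketch of the reviewer's suggestion, clamping the expert thread count by the core count the standard library reports; the helper name is hypothetical, and in practice the limit would more likely come from the session's intra-op thread pool, but the clamping idea is the same.

```cpp
// Sketch: never request more expert threads than the machine has hardware threads,
// regardless of what the workload-size heuristic (8, 12, ...) asked for.
#include <algorithm>
#include <thread>

static int ClampExpertThreads(int num_expert_threads) {
  // hardware_concurrency() may return 0 when the value cannot be determined.
  const unsigned int hw = std::thread::hardware_concurrency();
  const int core_limit = hw == 0 ? 1 : static_cast<int>(hw);
  return std::max(1, std::min(num_expert_threads, core_limit));
}
```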
509e17c to 41133b7
41133b7 to 8289fcb
This pull request introduces several improvements and fixes to the Mixture-of-Experts (MoE) quantization kernel in ONNX Runtime, focusing on threading efficiency, correctness of block-wise quantization, and code maintainability. Notably, it disables direct Q4 GEMM for debugging, adds stricter validation for quantization parameters, and refines parallelization heuristics for both routing and expert computation to better support both batch and decoding scenarios.
Threading and Parallelization Improvements:
- Reworked the threading heuristics in `QMoECPU<T>::Compute` to better balance parallelism for both small (decoding) and large (batch) workloads, including new adaptive thresholds and more granular control over thread allocation for different workload sizes.

Block-wise Quantization and Dequantization Fixes:
- Extended `ValidateBlockwiseQuantization` to enforce that `hidden_size` and `inter_size` are divisible by `block_size` when block-wise quantization is enabled, preventing misconfiguration (see the sketch at the end of this description).
- Corrected the `blocks_per_row` computation for block-wise scales in both FC1 and FC2 layers, ensuring proper indexing and memory access during dequantization.
- Updated the dequantization path to use the corrected `blocks_per_row` calculation and to avoid unnecessary thread pool usage for small workloads.

Debugging and Temporary Changes:
- Temporarily returns `false` in `CanUseMlasQ4Gemm` to facilitate debugging of output issues.

These changes collectively improve the robustness, efficiency, and maintainability of the MoE quantization implementation.
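As referenced above, a minimal sketch of the block-size divisibility check. Error handling is simplified to an exception here, whereas the actual kernel would presumably report a Status, and the free function is illustrative rather than the PR's exact signature.

```cpp
// Sketch: block-wise quantization requires both weight dimensions to be an exact
// multiple of the block size, otherwise scale/zero-point indexing breaks down.
#include <cstdint>
#include <stdexcept>
#include <string>

static void ValidateBlockwiseQuantization(int64_t hidden_size, int64_t inter_size,
                                          int64_t block_size) {
  if (block_size <= 0) {
    throw std::invalid_argument("block_size must be positive");
  }
  if (hidden_size % block_size != 0 || inter_size % block_size != 0) {
    throw std::invalid_argument(
        "hidden_size (" + std::to_string(hidden_size) + ") and inter_size (" +
        std::to_string(inter_size) + ") must both be divisible by block_size (" +
        std::to_string(block_size) + ")");
  }
}
```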