
Conversation

@apsonawane
Contributor

This pull request introduces several improvements and fixes to the Mixture-of-Experts (MoE) quantization kernel in ONNX Runtime, focusing on threading efficiency, correctness of block-wise quantization, and code maintainability. Notably, it disables direct Q4 GEMM for debugging, adds stricter validation for quantization parameters, and refines parallelization heuristics for both routing and expert computation to better support both batch and decoding scenarios.

Threading and Parallelization Improvements:

  • Refined the logic for determining the number of routing and expert threads in QMoECPU<T>::Compute to better balance parallelism for both small (decoding) and large (batch) workloads. This includes new adaptive thresholds and more granular control over thread allocation for different workload sizes; a rough sketch of this kind of heuristic follows this list.
  • Improved the heuristics for when to parallelize activation and dequantization, scaling thresholds based on model configuration to ensure efficient thread utilization.
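
For illustration, here is a minimal sketch of the adaptive thread-count heuristic described in the first bullet above. The names num_tokens and min_work_per_thread mirror the snippets quoted later in this thread, but the thresholds and the helper ChooseRoutingThreads are placeholders for illustration, not the exact code merged in this PR.

#include <algorithm>

// Hypothetical helper; thresholds are illustrative, not the PR's values.
int ChooseRoutingThreads(int num_tokens, int max_threads) {
  constexpr int min_work_per_thread = 16;  // assumed tuning constant
  if (num_tokens == 1) {
    // Single-token decoding: threading overhead outweighs any gain.
    return 1;
  }
  if (num_tokens < min_work_per_thread) {
    // Small batch: cap the thread count so each thread keeps enough work.
    return std::min(max_threads, std::max(1, num_tokens / 4));
  }
  // Batch case: scale with the workload, never exceeding the pool size.
  return std::min(max_threads, num_tokens / min_work_per_thread);
}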

Block-wise Quantization and Dequantization Fixes:

  • Added a new helper function ValidateBlockwiseQuantization to enforce that hidden_size and inter_size are divisible by block_size when block-wise quantization is enabled, preventing misconfiguration (see the sketch after this list).
  • Corrected the calculation of blocks_per_row for block-wise scales in both FC1 and FC2 layers, ensuring proper indexing and memory access during dequantization.
  • Updated the parallel dequantization logic to use the corrected blocks_per_row calculation and to avoid unnecessary thread pool usage for small workloads.
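
A rough sketch of what the validation helper and the blocks_per_row computation could look like is below. The signatures and error handling are assumptions for illustration; the real kernel would use ONNX Runtime's Status machinery rather than exceptions.

#include <cstdint>
#include <stdexcept>

// Illustrative only: reject configurations where block-wise quantization
// cannot tile the weight matrices evenly.
void ValidateBlockwiseQuantization(int64_t hidden_size, int64_t inter_size, int64_t block_size) {
  if (block_size <= 0) {
    return;  // block-wise quantization not enabled
  }
  if (hidden_size % block_size != 0 || inter_size % block_size != 0) {
    throw std::invalid_argument(
        "hidden_size and inter_size must be divisible by block_size for block-wise quantization");
  }
}

// One scale per block of columns in each row of the packed weight matrix.
inline int64_t BlocksPerRow(int64_t row_length, int64_t block_size) {
  return (row_length + block_size - 1) / block_size;  // ceiling division
}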

Debugging and Temporary Changes:

  • Temporarily disabled the use of direct Q4 GEMM by always returning false in CanUseMlasQ4Gemm to facilitate debugging of output issues.

These changes collectively improve the robustness, efficiency, and maintainability of the MoE quantization implementation.

size_t expected_size = MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
return expected_size > 0;
// TEMPORARY: Disable direct Q4 GEMM
return false;
Contributor

Any reason to disable this optimized path?

Contributor Author

It was giving gibberish output after some tokens; that side needs a deeper dive.

} else if (num_tokens == 1) {
// Single token decoding: use 1 thread (routing overhead not worth parallelizing)
optimal_routing_threads = 1;
} else if (num_tokens < min_work_per_thread) {
Contributor

Do any benchmark results show that this change helps performance?

Contributor Author

There was not much performance difference; the improvement was no more than 10%.

num_expert_threads = std::min(num_expert_threads, 8);
} else if (total_work < 384) {
// Very large workload - use more threads
num_expert_threads = std::min(num_expert_threads, 12);
Contributor

Did we consider the number of CPU cores in the num_expert_threads computation? If I have a CPU with 8 cores, using more than 8 threads will slow things down.

Contributor Author

I have not, but I will try it.
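
For reference, one way to fold the core count into the heuristic is to clamp the requested thread count against the parallelism actually available, either from the ONNX Runtime thread pool or, as in this standalone sketch, from std::thread::hardware_concurrency(). This is an assumption about how it could be done, not code from the PR.

#include <algorithm>
#include <thread>

int ClampToAvailableCores(int requested_threads) {
  // hardware_concurrency() may return 0 when the value cannot be determined.
  int cores = static_cast<int>(std::thread::hardware_concurrency());
  if (cores <= 0) {
    cores = 1;
  }
  // Oversubscribing cores tends to slow this kernel down, which is the
  // concern raised in the comment above.
  return std::max(1, std::min(requested_threads, cores));
}

In practice the kernel would more likely bound the count by the degree of parallelism of the shared ORT thread pool rather than querying the hardware directly.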
