[Graph][Quantization] Multi-stage software pipelining and update parallel k rule #364
Aalanli merged 17 commits into hidet-org:main
|
Hi @Aalanli, could you post performance numbers showing the effect of the casting trick and the improvement from parallel k? |
|
Yes, I can get the performance numbers after the previous PR is merged. |
|
The packed conversion seems to offer minor performance improvements, while higher pipeline stages improve performance only for certain shapes and cause regressions on others. I think this is because the higher shared memory usage reduces occupancy: the second implementation uses more shared memory than the first (indeed, according to ncu, the first implementation is limited by registers while the second is limited by shared memory). It may be helpful to benchmark this on an A100 to see whether higher pipelining stages show performance improvements there.
# k-parts=1
#                       bench_ref   bench_packed_quant
# 1024                  0.062464    0.060416
# 2048                  0.062464    0.061664
# 4096                  0.062464    0.065536
# 8192                  0.201728    0.190464
# 16384                 0.780288    0.726208
# k-parts=4
#                       bench_ref   bench_packed_quant
# 1024                  0.059392    0.060208
# 2048                  0.062464    0.061440
# 4096                  0.061440    0.061440
# 8192                  0.176128    0.192512
# 16384                 0.632832    0.688416
# k-parts=4
#                       bench_ref   bench_packed_quant
# (1, 4096, 11008)      0.063488    0.065536
# (1, 11008, 4096)      0.067584    0.064512
# (1, 4096, 4096)       0.065536    0.064512
# (1, 4096, 32000)      0.155648    0.157696
#                       bench_ref   bench_packed_quant
# (128, 4096, 11008)    0.137216    0.142336
# (128, 11008, 4096)    0.129024    0.130048
# (128, 4096, 4096)     0.064512    0.063488
# (128, 4096, 32000)    0.337920    0.352256 |
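The occupancy argument above can be made concrete with a rough back-of-the-envelope model. The sketch below is not taken from the PR: the tile sizes, thread count, register count, and per-SM limits are all hypothetical values chosen for illustration, and the model ignores allocation granularity and the hardware caps on blocks per SM. It only shows how adding pipeline stages can move the limiting resource from registers to shared memory.

```python
def smem_bytes_per_block(block_m, block_n, block_k, num_stages,
                         a_bytes=2, b_bytes=1):
    """Shared memory needed to keep num_stages tiles of A and B in flight.

    Assumes A is fp16 (2 bytes) and B is stored as int8 (1 byte); both
    defaults are illustrative, not taken from the PR.
    """
    per_stage = block_m * block_k * a_bytes + block_k * block_n * b_bytes
    return num_stages * per_stage


def blocks_per_sm(smem_per_block, regs_per_thread, threads_per_block,
                  smem_per_sm, regs_per_sm=65536):
    """Return (resident blocks per SM, limiting resource).

    Simplified: ignores allocation granularity and the hardware caps on
    blocks and threads per SM.
    """
    by_smem = smem_per_sm // smem_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    limiter = "shared memory" if by_smem < by_regs else "registers"
    return min(by_smem, by_regs), limiter


# Hypothetical configuration: 128x128x32 tiles, 128 threads per block,
# 168 registers per thread, 100 KiB of shared memory per SM.
for stages in (2, 4):
    smem = smem_bytes_per_block(128, 128, 32, stages)
    print(stages, smem, blocks_per_sm(smem, 168, 128, 100 * 1024))
```

Under these assumed numbers, two stages leave the kernel register-limited at three blocks per SM, while four stages make it shared-memory-limited at two blocks per SM. A GPU with more shared memory per SM would push that crossover to a higher stage count, which is the motivation for trying an A100.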
Sure, go ahead. |
In the future, we might put num_stages in our search space (like Triton). |
|
I put num_stages in the search space. I forgot to mention that this PR also adds support for parallel-k searching for the quantized kernel. |
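As a rough illustration of what a joint search over stage count and k-parts could look like, here is a minimal sketch. It is not hidet's tuning API: the function name `candidate_configs`, the candidate values, and the two pruning rules are all assumptions made for this example.

```python
from itertools import product


def candidate_configs(k_size, block_k=32, stages=(2, 3, 4), k_parts=(1, 2, 4, 8)):
    """Enumerate hypothetical (num_stages, k_parts) candidates for one K size.

    Pruning rules (assumptions for this sketch):
      * K must split evenly into k_parts chunks of whole block_k tiles.
      * Each chunk needs at least num_stages tiles, otherwise the
        pipeline can never be filled.
    """
    configs = []
    for num_stages, parts in product(stages, k_parts):
        if k_size % (parts * block_k) != 0:
            continue
        tiles_per_part = k_size // (parts * block_k)
        if tiles_per_part < num_stages:
            continue
        configs.append({"num_stages": num_stages, "k_parts": parts})
    return configs


print(len(candidate_configs(4096)))
print(candidate_configs(128))
```

Each surviving candidate would then be compiled and benchmarked, and the fastest one kept for that shape.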
|
It looks good to me. Feel free to merge by yourself when you think it is ready. |
|
@Aalanli, is this PR ready to be merged? |
|
Hi @Aalanli, could you rebase this PR onto the main branch? Thanks! |




Update quantization implementation to support multi-stage pipelining and the vectorized upcasting trick.
Update operator resolution rules to support parallel-k searching.
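For readers unfamiliar with parallel-k (often called split-K), the idea can be sketched in plain Python: split the reduction dimension into `k_parts` chunks, compute a partial product per chunk, then sum the partials. This is a reference sketch of the general technique, not the kernel or resolution rule from this PR, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain reference matmul on nested lists: (m x k) @ (k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_parallel_k(a, b, k_parts):
    """Split-K matmul: partial products over K chunks, then a reduction."""
    k = len(b)
    assert k % k_parts == 0, "this sketch requires K to divide evenly"
    chunk = k // k_parts

    # Each partial product is independent, so on a GPU the chunks can be
    # assigned to different thread blocks.
    partials = []
    for part in range(k_parts):
        lo, hi = part * chunk, (part + 1) * chunk
        a_slice = [row[lo:hi] for row in a]
        b_slice = b[lo:hi]
        partials.append(matmul(a_slice, b_slice))

    # Final reduction over the k_parts partial results.
    m, n = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
print(matmul_parallel_k(a, b, k_parts=2))
```

Splitting K trades an extra reduction pass for more independent work, which tends to help when the M x N output grid alone yields too few blocks to fill the GPU, as with small batch sizes.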