[ROCm] Enable per token group quant fp8 in amd #3702
yiakwy-xpu-ml-framework-team wants to merge 12 commits into sgl-project:main
Conversation
@HaiShaw Hi, can you have a look? Thanks.
HaiShaw left a comment:
Preferably, a code refactor is needed.
There are also some correctness issues to resolve.
Can you make FP8_E4M3_MAX a global (outside of the functions) and refer to it later?
Sorry for the late reply; I have been working on MLA-related functions since yesterday.
Sure, I can put it inside "sglang.srt.utils" so that it comes with "_is_hip". Does that sound good?
Also, can I do it in a later PR, since this modification may be out of scope for this PR? I will fix it as you suggested.
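For illustration, a minimal sketch of what that constant might look like on the kernel side (the clamp helper and exact placement are assumptions, not this PR's actual code; the Python-side home in sglang.srt.utils would be the torch.finfo equivalent):

```cpp
// Hypothetical sketch: hoist the FP8 saturation bound out of individual
// functions into one file-scope constant.
#ifdef __HIP_PLATFORM_AMD__
// ROCm uses the e4m3fnuz encoding, whose largest finite value is 240.0f.
constexpr float FP8_E4M3_MAX = 240.0f;
#else
// CUDA uses e4m3fn, whose largest finite value is 448.0f.
constexpr float FP8_E4M3_MAX = 448.0f;
#endif

// Example use inside a quantization kernel: clamp before casting to FP8.
__device__ __forceinline__ float clamp_to_fp8(float x) {
  return fminf(fmaxf(x, -FP8_E4M3_MAX), FP8_E4M3_MAX);
}
```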
We should not duplicate flashinfer's code as boilerplate in sgl-kernel.
It is better to make the changes in flashinfer and then use it from there.
Yes, I agree. I have marked it as a temporary solution, since flashinfer-rocm is not fully supported and ready to use.
As far as I know, SGLang will continue to use flashinfer::vec_t for vectorized 128-bit data loading. With this temporary support, we don't need to modify the related CUDA code.
Does that sound reasonable?
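For context, the 128-bit vectorized-loading pattern looks roughly like this (an illustrative kernel, not this PR's code; with half inputs, VEC_SIZE = 8 makes 8 × 16 bits = one 128-bit transaction per thread):

```cpp
#include <flashinfer/vec_dtypes.cuh>

// Illustrative only: scale a buffer using flashinfer::vec_t so each
// thread issues one wide global load and one wide global store.
template <typename T, uint32_t VEC_SIZE>  // e.g. T = half, VEC_SIZE = 8
__global__ void scale_kernel(const T* __restrict__ in, T* __restrict__ out,
                             float scale, uint32_t n) {
  uint32_t i = (blockIdx.x * blockDim.x + threadIdx.x) * VEC_SIZE;
  if (i + VEC_SIZE > n) return;
  flashinfer::vec_t<T, VEC_SIZE> v;
  v.load(in + i);  // single 128-bit global load
#pragma unroll
  for (uint32_t j = 0; j < VEC_SIZE; ++j) {
    v[j] = static_cast<T>(static_cast<float>(v[j]) * scale);
  }
  v.store(out + i);  // single 128-bit global store
}
```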
No need to keep using FLASHINFER_INLINE here; it is a very common macro.
Yes, it comes with the temporary flashinfer::vec_t device-function support.
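For reference, FLASHINFER_INLINE is just a conventional force-inline qualifier, so a local equivalent removes the dependency (the macro name below is hypothetical and the exact upstream definition may differ slightly):

```cpp
// Assumed-equivalent local spelling of the macro:
#define SGL_INLINE inline __attribute__((always_inline)) __device__

// Usage is identical to FLASHINFER_INLINE:
SGL_INLINE float add_one(float x) { return x + 1.0f; }
```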
Force-pushed from c777940 to e1ec0e8.
Please fix the conflicts.
Motivation
This is a follow-up to PR #3664.
Modifications
ROCm test
Checklist