[ROCm] Enable per token group quant fp8 in amd#3702

Closed
yiakwy-xpu-ml-framework-team wants to merge 12 commits intosgl-project:mainfrom
yiakwy-xpu-ml-framework-team:enable_per_token_group_quant_fp8_in_amd

Conversation

@yiakwy-xpu-ml-framework-team
Contributor

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Feb 19, 2025

Motivation

This is a follow-up of PR #3664.

Modifications

  • enable shfl_xor_sync on the AMD platform
  • enable flashinfer vec_t on the AMD platform (to be removed once flashinfer-rocm [POC] is ready)
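
For background, the kernel this PR enables computes per-token-group FP8 quantization: each contiguous group of values shares one scale so the group's max magnitude maps onto the FP8 range. The following is a minimal pure-Python reference sketch of that math, not the PR's actual CUDA kernel; 448.0 is the max finite value of the OCP float8_e4m3fn format.

```python
FP8_E4M3_MAX = 448.0  # max finite value of the OCP float8_e4m3fn format


def per_token_group_quant(row, group_size, eps=1e-10):
    """Quantize one row group-by-group; returns (quantized, scales).

    Reference sketch only: a real kernel would also round each scaled
    value to the nearest representable e4m3 number.
    """
    assert len(row) % group_size == 0
    quantized, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = max(amax, eps) / FP8_E4M3_MAX
        scales.append(scale)
        # Divide by the per-group scale and clamp to the FP8 range.
        quantized.extend(
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in group
        )
    return quantized, scales
```

With group_size=2, the input `[1.0, -2.0, 3.0, -4.0]` produces two scales (2/448 and 4/448), and the max-magnitude element of each group lands at ±448.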

ROCm test

[Screenshot: ROCm test results, 2025-03-06 15:57:47]

Checklist

@BBuf
Collaborator

BBuf commented Feb 20, 2025

@HaiShaw Hi, can you have a look? Thanks.

@HaiShaw HaiShaw self-requested a review February 23, 2025 03:54
Collaborator

@HaiShaw HaiShaw left a comment


Preferably, a code refactor is needed.
There are also some correctness issues to resolve.

Collaborator


Can you make an FP8_E4M3_MAX global (outside of functions), and refer to it later?

Contributor Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Feb 25, 2025


Sorry for the late reply. I have been working on MLA-related functions since yesterday.

Sure, I can put it inside "sglang.srt.utils" so that it comes with "_is_hip". Does that sound good?

Also, can I do it in a later PR, since this modification may be out of scope for this PR? I will fix it as you suggested.

Contributor Author


Done.

FYI, #3959

Comment thread sgl-kernel/benchmark/bench_per_token_group_quant_fp8.py Outdated
Comment thread sgl-kernel/src/sgl-kernel/csrc/per_token_group_quant_fp8.cu Outdated
Comment thread sgl-kernel/src/sgl-kernel/csrc/per_token_group_quant_fp8.cu Outdated
Collaborator


We should not duplicate flashinfer's code in sgl-kernel.
It is better to make the changes in flashinfer and then use it from there.

Contributor Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Feb 25, 2025


Yes, I agree. I have marked it as a temporary solution, since flashinfer-rocm is not fully supported and ready to use.

As far as I know, SGLang will continue to use flashinfer::vec_t for vectorized 128-bit data loading. With this temporary support, we don't need to modify the related CUDA code.

Does that sound reasonable?
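
As background on why vec_t matters here: a 128-bit vectorized load moves 16 bytes per instruction, so the number of elements carried per load depends only on the element size. A tiny back-of-envelope sketch (illustrative only, not repo code):

```python
# Bytes per element for the dtypes the quant kernel typically handles.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "float8_e4m3": 1}


def elems_per_128bit_load(dtype: str) -> int:
    """Elements carried by a single 128-bit (16-byte) vectorized load."""
    return 16 // DTYPE_BYTES[dtype]
```

For example, one 128-bit load carries 8 bfloat16 inputs or 16 FP8 outputs, which is why keeping vec_t-style loads avoids touching the surrounding CUDA code.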

Collaborator


There is no need to keep using FLASHINFER_INLINE here; it is a very common macro.

Contributor Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Feb 25, 2025


Yes, it comes with the temporary flashinfer::vec_t device function support.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team force-pushed the enable_per_token_group_quant_fp8_in_amd branch from c777940 to e1ec0e8 Compare March 6, 2025 08:15
Comment thread sgl-kernel/src/sgl-kernel/csrc/gemm/per_token_group_quant_fp8.cu
@BBuf BBuf mentioned this pull request Mar 7, 2025
18 tasks
@yiakwy-xpu-ml-framework-team
Contributor Author

Checklist:

[ ] rebase and adapt after #3959 is merged
[ ] make sure the test_mla issue is fixed and verify test_mla (#4214)

@merrymercy
Contributor

Please fix the conflicts.

