fix CompressedTensorsW8A8Int8 min_capability #13914
Conversation
Summary of Changes (Gemini Code Assist): This pull request expands the compatibility of the CompressedTensorsW8A8Int8 quantization method to a wider range of NVIDIA GPUs. By lowering the minimum required compute capability, the change allows this quantization scheme to be used on Ampere-architecture GPUs (SM80) in addition to the previously supported Lovelace architecture (SM89) and newer.
Code Review
This pull request correctly lowers the minimum compute capability for CompressedTensorsW8A8Int8 from 8.9 (Lovelace) to 8.0 (Ampere), as the underlying CUDA kernel supports the Ampere architecture. This change enables the feature on a broader range of GPUs. My review includes a suggestion to update a code comment that became misleading after this change.
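For context, a minimum-capability gate of this kind typically encodes the GPU's compute capability as `major * 10 + minor` and compares it against the scheme's threshold. A minimal sketch, assuming that convention — the names `get_device_capability`, `is_supported`, and the `Sketch` class below are illustrative, not the project's actual API; on real hardware the capability would come from something like `torch.cuda.get_device_capability()`:

```python
def get_device_capability():
    # Stand-in for querying the GPU; returns (major, minor).
    # (8, 0) corresponds to Ampere (e.g. A100), (8, 9) to Lovelace (e.g. L40).
    return (8, 0)

class CompressedTensorsW8A8Int8Sketch:
    @classmethod
    def get_min_capability(cls) -> int:
        # ampere and up (was 89, i.e. Lovelace and up, before this PR)
        return 80

def is_supported(scheme) -> bool:
    # Encode capability as major*10 + minor and compare to the threshold.
    major, minor = get_device_capability()
    return major * 10 + minor >= scheme.get_min_capability()

print(is_supported(CompressedTensorsW8A8Int8Sketch))  # True, since 80 >= 80
```

With the old threshold of 89, the same SM80 device would have been rejected (80 < 89), which is exactly what this PR fixes.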
```diff
   # lovelace and up
-  return 89
+  return 80
```
The comment on line 38 is now misleading. Compute capability 8.0 corresponds to the Ampere architecture, while Lovelace is 8.9 and higher. To avoid confusion, please update the comment to reflect the new minimum requirement.
```diff
-  # lovelace and up
-  return 89
+  # ampere and up
   return 80
```
Fix the comment at the same time.
done
Others LGTM
@AniZpZ @Edwardf0t1 @BBuf @ch-wan I would appreciate it if you could review this PR. It's a small modification.
@FlamingoPg I would appreciate it if you could review this PR. It's a small modification.
Co-authored-by: Fan Yin <1106310035@qq.com> Co-authored-by: Peng Zhang <aniz1905@gmail.com>
Motivation
SM80 (Ampere) GPUs support CompressedTensorsW8A8Int8, but the minimum compute capability was previously set to 89 (Lovelace), which blocked them.
Modifications
Change CompressedTensorsW8A8Int8.get_min_capability() to return 80 instead of 89.
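The change itself is a one-liner; a sketch of the resulting method, with the class context and `@classmethod` signature assumed rather than copied from the repository:

```python
class CompressedTensorsW8A8Int8:
    # Sketch of the relevant method only; the real class contains more.
    @classmethod
    def get_min_capability(cls) -> int:
        # ampere and up (compute capability 8.0); previously 89 (lovelace and up)
        return 80
```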
Accuracy Tests
Benchmarking and Profiling
Checklist