
fix CompressedTensorsW8A8Int8 min_capability#13914

Merged
AniZpZ merged 5 commits into sgl-project:main from
mmdbhs:fix/CompressedTensorsW8A8Int8-min_capability
Dec 15, 2025

Conversation

@mmdbhs
Contributor

@mmdbhs mmdbhs commented Nov 25, 2025

Motivation

SM80 (Ampere) GPUs support CompressedTensorsW8A8Int8.

Modifications

Change the return value of CompressedTensorsW8A8Int8.get_min_capability() from 89 to 80.
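The change is a one-line edit. A minimal sketch of the affected method (illustrative shape only, not the exact sglang source layout):

```python
class CompressedTensorsW8A8Int8:
    """Sketch of the W8A8 int8 quantization scheme; only the capability gate is shown."""

    @classmethod
    def get_min_capability(cls) -> int:
        # ampere and up (SM 8.0); previously 89, which restricted the
        # scheme to Lovelace (SM 8.9) and newer
        return 80
```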

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @mmdbhs, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the compatibility of the CompressedTensorsW8A8Int8 quantization method to a wider range of NVIDIA GPUs. By adjusting the minimum required compute capability, the change allows this specific quantization scheme to be utilized on Ampere architecture GPUs (SM80) in addition to the previously supported Lovelace architecture (SM89) and newer. This modification directly addresses the motivation that SM80 GPUs support this feature, making the quantization more accessible.

Highlights

  • GPU Capability Support: The minimum GPU compute capability required for the CompressedTensorsW8A8Int8 quantization scheme has been lowered from 8.9 (Lovelace) to 8.0 (Ampere), enabling broader hardware support for this optimization.
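To see why the threshold matters: compute capability is commonly packed as major*10 + minor, so an A100 (SM 8.0) reports 80 and was excluded by the old minimum of 89. A hedged sketch of such a gate (helper names here are hypothetical, not sglang's API):

```python
def capability_to_int(major: int, minor: int) -> int:
    # NVIDIA compute capability packed as a two-digit int, e.g. 8.0 -> 80
    return major * 10 + minor

def scheme_usable(min_capability: int, major: int, minor: int) -> bool:
    # Hypothetical check: a quantization scheme is usable when the
    # device capability meets the scheme's declared minimum.
    return capability_to_int(major, minor) >= min_capability

# Old minimum (89): A100 (SM 8.0) was excluded, RTX 4090 (SM 8.9) allowed.
# New minimum (80): both qualify.
print(scheme_usable(89, 8, 0))  # False
print(scheme_usable(80, 8, 0))  # True
```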
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly lowers the minimum compute capability for CompressedTensorsW8A8Int8 from 8.9 (Lovelace) to 8.0 (Ampere), as the underlying CUDA kernel supports the Ampere architecture. This change enables the feature on a broader range of GPUs. My review includes a suggestion to update a code comment that became misleading after this change.

Comment on lines 38 to +39

```diff
  # lovelace and up
- return 89
+ return 80
```
Contributor


Severity: medium

The comment on line 38 is now misleading. Compute capability 8.0 corresponds to the Ampere architecture, while Lovelace is 8.9 and higher. To avoid confusion, please update the comment to reflect the new minimum requirement.

Suggested change

```diff
- # lovelace and up
+ # ampere and up
  return 80
```

Collaborator


Fix the comment at the same time.

Contributor Author


> Fix the comment at the same time.

done

@FlamingoPg
Collaborator

Others LGTM

@mmdbhs
Contributor Author

mmdbhs commented Nov 27, 2025

@AniZpZ @Edwardf0t1 @BBuf @ch-wan I would appreciate it if you could review this PR. It's a small modification.

@mmdbhs
Contributor Author

mmdbhs commented Dec 2, 2025

@FlamingoPg I would appreciate it if you could review this PR. It's a small modification.

@AniZpZ AniZpZ added the ready-to-merge The PR is ready to merge after the CI is green. label Dec 14, 2025
@AniZpZ AniZpZ self-assigned this Dec 15, 2025
@AniZpZ AniZpZ merged commit 16e6bc2 into sgl-project:main Dec 15, 2025
141 of 150 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 15, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (89 commits)
  [model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels (sgl-project#15160)
  [diffusion] fix: fix video model sp when resolution is not specified (sgl-project#15047)
  [diffusion] fix: fix pytorch non-writable array warning (sgl-project#15017)
  [diffusion] fix: cache dit with parallel (sgl-project#15163)
  chore: change npu pr-test a2 runner (sgl-project#15152)
  [Feature] Fuse mrope all in 1 kernel (sgl-project#14906)
  Fix num running requests (load) wrong cleared for ongoing requests (sgl-project#15116)
  Fused two elementwise kernels for k_nope and k_pe concat (sgl-project#14862)
  fix: adding date and fixing release name issue (sgl-project#15174)
  [CPU] Add Gemma3RMSNorm kernel in sgl-kernel and add ut (sgl-project#9324)
  feature: PR wheel (sgl-project#15170)
  [diffusion] model: support mutli-image input and qwen-image-edit-2509 (sgl-project#15005)
  fix CompressedTensorsW8A8Int8 min_capability (sgl-project#13914)
  Tiny improve summary text in `bench_one_batch_server.py` (sgl-project#15158)
  [model-gateway] add mcp and discovery metrics (sgl-project#15156)
  fix: move ci-bot (sgl-project#15154)
  Fix import warnings (sgl-project#15144)
  ci: adding errors to Github summary (sgl-project#14778)
  [model-gateway] Add streaming metrics for harmony gRPC router (sgl-project#15147)
  [model-gateway] upgrade axum and axum server (sgl-project#15146)
  ...

# Conflicts:
#	python/sglang/srt/server_args.py
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 17, 2025
Co-authored-by: Fan Yin <1106310035@qq.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Co-authored-by: Fan Yin <1106310035@qq.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>

Labels

ready-to-merge The PR is ready to merge after the CI is green. run-ci


3 participants