[NPU] TP Communications compression For Qwen3 models for NPU #20520
sglang-npu-bot merged 44 commits into sgl-project:main
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant optimization by enabling INT8 quantization for tensor parallel communications during the prefill stage of Qwen3 models when running on NPU devices. This enhancement aims to reduce communication bandwidth and latency, leading to improved performance without compromising model accuracy, as validated by extensive benchmarking and accuracy tests.
Code Review
This pull request introduces INT8 communication compression for Tensor Parallelism on NPU devices, specifically for Qwen3 models. The implementation adds a new server argument --quantize-tp-communications and a quantized all_reduce operation for NPUs. This feature is designed to be active only during the prefill phase to optimize performance. The changes are well-contained within the distributed communication and model-specific layers. I've identified a potential issue in python/sglang/srt/layers/linear.py where quantization might be unsafely applied when the forward context is unavailable, and I have provided a suggested fix.
@egvenediktov please resolve the lint issue

/tag-and-rerun-ci

/rerun-failed-ci

I merged it since only one GPU CI job failed, due to an environment issue
@egvenediktov could you please create another PR for the documentation in docs_new, the new directory created by the community
(sgl-project#20520) Co-authored-by: ronnie_zheng <zl19940307@163.com>
Motivation
Implemented INT8 TP communications compression on prefill for Qwen3 models.
Compression achieves an average 5% performance improvement on prefill-intensive benchmarks (see Benchmarking and Profiling). Accuracy tests show no degradation on average (BoolQ, C-Eval, HellaSwag; see Accuracy Tests).
Description
TP introduces communication between devices after o_proj in attention and after down_proj in the MLP. To reduce the latency of these communications, we can quantize the activations before sending them to the other devices and dequantize them right after the communication completes.
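To make the flow concrete, here is a minimal sketch of the quantize, all_reduce, dequantize sequence, assuming the per-token symmetric INT8 scheme described in Modifications. The function name and the all_gather-based exchange are illustrative assumptions, not the PR's actual npu_communicator implementation:

```python
import torch
import torch.distributed as dist


def quantized_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Sketch of an all_reduce over INT8-compressed activations.

    Each rank compresses its partial result to INT8 with a per-token,
    symmetric scale, exchanges the compressed tensors plus scales, then
    dequantizes and sums locally. INT8 payloads cannot be summed
    in-flight without overflow, so the exchange here is an all_gather.
    """
    world_size = dist.get_world_size()

    # Per-token symmetric scale: max |x| over the hidden dim maps to 127.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_int8 = torch.round(x / scale).clamp(-127, 127).to(torch.int8)

    # Exchange compressed activations and per-token scales across TP ranks.
    int8_parts = [torch.empty_like(x_int8) for _ in range(world_size)]
    scale_parts = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(int8_parts, x_int8)
    dist.all_gather(scale_parts, scale)

    # Dequantize each rank's shard and reduce (sum) locally.
    out = torch.zeros_like(x)
    for q, s in zip(int8_parts, scale_parts):
        out = out + q.to(x.dtype) * s
    return out
```

Relative to sending BF16/FP16 activations, the INT8 payload roughly halves the bytes on the wire (plus a small per-token scale), at the cost of two extra elementwise passes per rank.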
Below is the profiling of TP communications for the FP and compressed cases.
Modifications
9 files changed:

- python/sglang/srt/server_args.py: added a server argument for communications compression, plus checks for the argument
- python/sglang/srt/models/qwen2.py: added a ForwardBatch passing interface from the MLP layer to the down_proj linear
- python/sglang/srt/models/qwen3.py: added a ForwardBatch passing interface to the MLP layer
- python/sglang/srt/layers/linear.py: added logic for enabling quantization of communications (see the gating sketch after this list)
- python/sglang/srt/distributed/communication_op.py: added an interface to the all_reduce operation to enable communications quantization
- python/sglang/srt/distributed/parallel_state.py: added the interface and logic of communications quantization for the all_reduce operation
- python/sglang/srt/distributed/device_communicators/npu_communicator.py: added the implementation of all_reduce with quantized communications (quantization scheme: per-token, symmetric)
- python/sglang/srt/layers/communicator.py: added communications quantization logic for the residual add operation with all_reduce
- benchmark/boolq/bench_sglang.py: import fix
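Since the feature is active only during prefill, the gating presumably checks the forward mode before taking the quantized path. Below is a hedged sketch of that check; maybe_quantized_all_reduce and the quantize keyword are hypothetical names, while ForwardBatch, ForwardMode.is_extend(), and the server flag exist in sglang:

```python
# Hypothetical sketch of the prefill-only gating; in the PR the actual
# wiring is spread across linear.py, communication_op.py, and
# communicator.py.
from sglang.srt.distributed import tensor_model_parallel_all_reduce
from sglang.srt.model_executor.forward_batch_info import ForwardBatch


def maybe_quantized_all_reduce(
    x, forward_batch: ForwardBatch | None, quantize_tp_communications: bool
):
    # Compress only on prefill (extend) batches: they move large
    # activations, so bandwidth savings dominate. Decode batches are
    # small and would mostly pay quantize/dequantize overhead. Falling
    # back to FP when forward_batch is None avoids quantizing without a
    # known forward context (the reviewer's concern about linear.py).
    use_quant = (
        quantize_tp_communications
        and forward_batch is not None
        and forward_batch.forward_mode.is_extend()
    )
    # "quantize" stands in for whatever keyword the PR adds to the
    # all_reduce interface in communication_op.py.
    return tensor_model_parallel_all_reduce(x, quantize=use_quant)
```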
Accuracy Tests

Server launch:
Baseline:

```bash
ASCEND_RT_VISIBLE_DEVICES=3,6 python -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path /home/ckpt/Qwen3-32B/ --port 30088 --cuda-graph-max-bs 64 --max-prefill-tokens 32768 --chunked-prefill-size -1 --mem-fraction-static 0.8
```

With communications quantization:

```bash
ASCEND_RT_VISIBLE_DEVICES=3,6 python -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path /home/ckpt/Qwen3-32B/ --port 30088 --cuda-graph-max-bs 64 --max-prefill-tokens 32768 --chunked-prefill-size -1 --mem-fraction-static 0.8 --quantize-tp-communications
```

BoolQ:
Client launch:

```bash
python ./benchmark/boolq/bench_sglang.py --port 30088 --train-data-path "path/to/train/data/file" --test-data-path "path/to/test/data/file" --parallel 64 --num-questions 3270
```

C-Eval:
Client launch:

```bash
python ./benchmark/ceval/bench_sglang.py --port 30088 --data-path "path/to/data" --num-questions 1346
```

HellaSwag:
Client launch:

```bash
python ./benchmark/hellaswag/bench_sglang.py --port 30088 --data-path "path/to/data" --num-questions 50000
```

Benchmarking and Profiling
Acceleration of prefill-intensive benchmarks for Qwen3 32B on an A2 server (acceleration = FP_time / INT8_time for latency metrics, or acceleration = INT8_throughput / FP_throughput for throughput metrics).
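As a quick illustration of the formula (the numbers below are made up, not taken from the benchmark results):

```python
def acceleration(fp: float, int8: float, lower_is_better: bool) -> float:
    """Speedup from INT8 TP-communication compression; >1.0 means INT8 wins.

    Latency metrics:    acceleration = FP_time / INT8_time
    Throughput metrics: acceleration = INT8_throughput / FP_throughput
    """
    return fp / int8 if lower_is_better else int8 / fp


# Illustrative only: a prefill taking 1.05 s in FP and 1.00 s with INT8
# communications corresponds to the ~5% average speedup reported above.
print(acceleration(fp=1.05, int8=1.00, lower_is_better=True))  # 1.05
```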
Server args:

```bash
ASCEND_RT_VISIBLE_DEVICES=3,6 python -m sglang.launch_server --device npu --attention-backend ascend --trust-remote-code --tp-size 2 --model-path /home/ckpt/Qwen3-32B/ --port 30088 --cuda-graph-max-bs 64 --max-prefill-tokens 32768 --chunked-prefill-size -1 --mem-fraction-static 0.8 --disable-radix-cache --quantize-tp-communications
```

Checklist