[Reland] perf: optimize qwen-vl with symm mem allreduce #11457
hnyls2002 merged 2 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request aims to re-integrate critical performance optimizations for Qwen-VL, which were previously reverted due to an environmental test timeout. It ensures the stability of the testing suite by adjusting timeout configurations, and it enhances the model's distributed communication and rotary embedding mechanisms. Specifically, the changes improve the handling of all-reduce operations and introduce support for interleaved multimodal rotary embeddings, contributing to more efficient and accurate model execution.
Code Review
This pull request relands performance optimizations for qwen-vl using symmetric memory all-reduce and also fixes a flaky test by increasing its timeout. The changes involve enabling symm_mem_comm in the parallel state, adding support for interleaved MRoPE, and integrating LayerCommunicator for all-reduce fusion in the Qwen2 model. My review has identified a critical correctness bug in the all-reduce implementation and a potential performance improvement. Please see the detailed comments below.
The vllm dependency CI test includes this case: ./test/srt/test_gptqmodel_dynamic.py
Tracking the backtrace in question:
The reason is that Qwen2 doesn't support dp_attention.
aaff096 to 4fc2cb5 (Compare)
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Updates include adjusted symmetric-memory all-reduce size limits, a new symmetric-memory all-reduce path in parallel_state, added interleaved multimodal RoPE support with a constructor parameter and helper, robust seq_lens_sum computation for non-tensor inputs, and a test metadata time change.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Caller
participant ParallelState as parallel_state._all_reduce_in_place
participant SymmComm as symm_mem_comm
participant NCCL as pynccl_comm
participant Torch as torch.distributed
Caller->>ParallelState: _all_reduce_in_place(input_)
alt pynccl available and enabled
ParallelState->>NCCL: all_reduce(input_)
NCCL-->>ParallelState: done
else symm_mem_comm available and enabled
ParallelState->>SymmComm: all_reduce(input_)
SymmComm-->>ParallelState: done
else
ParallelState->>Torch: all_reduce(input_, group=device_group)
Torch-->>ParallelState: done
end
    ParallelState-->>Caller: input_ reduced in-place
```
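For readers who prefer code to diagrams, here is a minimal sketch of the same fallback chain. The attribute and method names follow the diagram's participants; treat the exact signatures as assumptions rather than the repository's actual code:

```python
import torch.distributed as dist

def _all_reduce_in_place(self, input_):
    # Sketch of the dispatch order shown above; pynccl_comm, symm_mem_comm,
    # and device_group are assumed attribute names.
    pynccl_comm = self.pynccl_comm
    symm_mem_comm = self.symm_mem_comm
    if pynccl_comm is not None and not pynccl_comm.disabled:
        # Preferred path: custom NCCL wrapper reduces in place.
        pynccl_comm.all_reduce(input_)
    elif symm_mem_comm is not None and not symm_mem_comm.disabled:
        # Symmetric-memory path; pass out=input_ so the result lands in place
        # (see the review comment on this point below).
        symm_mem_comm.all_reduce(input_, out=input_)
    else:
        # Generic fallback through torch.distributed.
        dist.all_reduce(input_, group=self.device_group)
```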
```mermaid
sequenceDiagram
autonumber
participant Caller
participant MRoPE as MRotaryEmbedding.forward
participant Helper as apply_interleaved_rope
Caller->>MRoPE: forward(x, positions)
alt positions.ndim == 2 and mrope_interleaved
MRoPE->>Helper: transform cos, sin with mrope_section
Helper-->>MRoPE: interleaved cos/sin
MRoPE->>MRoPE: apply RoPE with interleaved layout
else
MRoPE->>MRoPE: standard per-section concat and apply
end
    MRoPE-->>Caller: rotated embeddings
```
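The branch in this second diagram can be summarized by a standalone sketch. `apply_interleaved_rope` is the helper named in the diagram; the tensor shape and the chunked-path formula are assumptions based on the standard MRoPE per-section concat:

```python
import torch

def select_mrope_cos_sin(
    positions: torch.Tensor,
    freqs: torch.Tensor,            # assumed shape [3, seq_len, rotary_dim // 2]
    mrope_section: list[int],
    mrope_interleaved: bool,
) -> torch.Tensor:
    if positions.ndim == 2 and mrope_interleaved:
        # Interleaved path: reorder the chunked [T...H...W...] frequency
        # layout into [T,H,W,T,H,W,...] via the PR's helper.
        return apply_interleaved_rope(freqs, mrope_section)
    # Standard path: split the last dim into T/H/W sections, take section i
    # from axis i, then concatenate.
    return torch.cat(
        [chunk[i] for i, chunk in enumerate(freqs.split(mrope_section, dim=-1))],
        dim=-1,
    )
```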
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ 2 failed (warnings), ✅ 1 passed.
With the fix, the test case passed.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/sglang/srt/distributed/parallel_state.py (1)
588-601: Assertion misses the symm_mem path in the out-of-place reducer. If only symm_mem is enabled, assert any([qr, ca, pymscclpp]) can trip. Include symm_mem_comm in the assertion.
Apply this diff:
```diff
-        assert any([qr_comm, ca_comm, pymscclpp_comm])
+        assert any([qr_comm, ca_comm, pymscclpp_comm, symm_mem_comm])
```
♻️ Duplicate comments (2)
python/sglang/srt/distributed/parallel_state.py (1)
606-611: Bug: the symm_mem all-reduce is not in-place (the result is discarded). symm_mem_comm.all_reduce returns a new tensor unless out is provided, so the current call does nothing to input_. Use the in-place output buffer.
Apply this diff:
```diff
-    elif symm_mem_comm is not None and not symm_mem_comm.disabled:
-        symm_mem_comm.all_reduce(input_)
+    elif symm_mem_comm is not None and not symm_mem_comm.disabled:
+        # Perform the reduction into the same buffer
+        symm_mem_comm.all_reduce(input_, out=input_)
```
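A toy illustration of the pitfall, assuming all_reduce follows the return-unless-out convention described above:

```python
# Hypothetical calls on the communicator described in this comment.
reduced = symm_mem_comm.all_reduce(input_)    # returns a new tensor; input_ unchanged
symm_mem_comm.all_reduce(input_, out=input_)  # reduction is written back into input_
```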
python/sglang/srt/layers/rotary_embedding.py (1)
1011-1020: Interleaving logic and an extra clone; proposing a safer, allocation-efficient mapping. Issues:
- clone is unnecessary and adds allocation.
- Slices like 1 : mrope_section[i] * 3 : 3 assume a 3x-stride layout in the source x[i]; with chunked input [TT..|HH..|WW..], the H/W sources should be taken from [:mrope_section[i]], not from stepped positions. Bounds are also needed so destination indices don't exceed D.
Use an explicit, bounded interleave into a fresh buffer to avoid aliasing:
```diff
-def apply_interleaved_rope(x: torch.Tensor, mrope_section: list[int]) -> torch.Tensor:
-    """Apply interleaved MRoPE to 3D rotary embeddings.
-    Reorganizes frequency layout from chunked [TTT...HHH...WWW] to
-    interleaved [THTHWHTHW...TT], preserving frequency continuity.
-    """
-    x_t = x[0].clone()
-    x_t[..., 1 : mrope_section[1] * 3 : 3] = x[1, ..., 1 : mrope_section[1] * 3 : 3]
-    x_t[..., 2 : mrope_section[2] * 3 : 3] = x[2, ..., 2 : mrope_section[2] * 3 : 3]
-    return x_t
+def apply_interleaved_rope(x: torch.Tensor, mrope_section: list[int]) -> torch.Tensor:
+    """
+    x: [3, ..., D] with chunked layout per axis:
+        x[0][..., :t] (T), x[1][..., :h] (H), x[2][..., :w] (W)
+    Returns: [..., D] with interleaved T,H,W at strides of 3; leftover T fills tail.
+    """
+    t, h, w = mrope_section
+    D = x.shape[-1]
+    out = torch.empty_like(x[0])
+    # Fill from T/H/W sources into 0/1/2 mod-3 slots respectively.
+    kt = min(t, (D + 2) // 3)
+    kh = min(h, (D - 1 + 2) // 3)  # ceil((D-1)/3)
+    kw = min(w, (D - 2 + 2) // 3)  # ceil((D-2)/3)
+    out[..., 0 : 3 * kt : 3] = x[0, ..., :kt]
+    out[..., 1 : 1 + 3 * kh : 3] = x[1, ..., :kh]
+    out[..., 2 : 2 + 3 * kw : 3] = x[2, ..., :kw]
+    # Fill remaining slots (if any) from the tail of T.
+    filled = max(3 * kt, 1 + 3 * kh, 2 + 3 * kw)
+    if filled < D and t > kt:
+        out[..., filled:D] = x[0, ..., kt : kt + (D - filled)]
+    return out
```
This avoids mutation of shared storage, removes clone, and matches the chunked→interleaved intent.
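To make the intended mapping concrete, here is a small self-contained check of the replacement proposed above (the section values and the 9.0 padding constant are purely illustrative):

```python
import torch

# With mrope_section = [2, 1, 1] and D = 4, the chunked layout
# [T0, T1 | H0 | W0] should interleave to [T0, H0, W0, T1].
t_freqs = torch.tensor([0.0, 1.0, 9.0, 9.0])   # T values live in [:2]; rest is padding
h_freqs = torch.tensor([10.0, 9.0, 9.0, 9.0])  # H value lives in [:1]
w_freqs = torch.tensor([20.0, 9.0, 9.0, 9.0])  # W value lives in [:1]
x = torch.stack([t_freqs, h_freqs, w_freqs])   # shape [3, 4]

out = apply_interleaved_rope(x, [2, 1, 1])
assert torch.equal(out, torch.tensor([0.0, 10.0, 20.0, 1.0]))
```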
🧹 Nitpick comments (1)
python/sglang/srt/layers/rotary_embedding.py (1)
1041-1073: Replace print with logging; avoid stdout in library code. These prints will spam logs and aren't controllable. Use logger.warning/info and gate behind rank checks if needed.
Apply this diff:
```diff
-            print(
-                f"MRoPE section sum mismatch: expected {expected_sum}, got {actual_sum}. "
-                f"Adjusting mrope_section to match rotary_dim // 2 = {expected_sum}"
-            )
+            logger.warning(
+                "MRoPE section sum mismatch: expected %d, got %d. Adjusting to %d",
+                expected_sum, actual_sum, expected_sum,
+            )
 ...
-            print(
-                f"Corrected mrope_section: {self.mrope_section} (sum={sum(self.mrope_section)})"
-            )
+            logger.info(
+                "Corrected mrope_section: %s (sum=%d)",
+                self.mrope_section, sum(self.mrope_section),
+            )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between b5dcfd4 and 4fc2cb5e0d74e594ab4eeb961f9b8360002d41dd.
📒 Files selected for processing (5)
- python/sglang/srt/distributed/device_communicators/all_reduce_utils.py (1 hunks)
- python/sglang/srt/distributed/parallel_state.py (1 hunks)
- python/sglang/srt/layers/rotary_embedding.py (4 hunks)
- python/sglang/srt/managers/schedule_batch.py (1 hunks)
- scripts/sort_testcases_alphabetically.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/sort_testcases_alphabetically.py (1)
test/srt/run_suite.py (1)
TestFile(9-11)
python/sglang/srt/distributed/parallel_state.py (2)
python/sglang/srt/distributed/device_communicators/symm_mem.py (1)
all_reduce (134-164)
python/sglang/srt/distributed/device_communicators/pynccl.py (1)
all_reduce(126-148)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
- GitHub Check: vllm-dependency-test
- GitHub Check: run-all-notebooks
- GitHub Check: build-test (all)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 11)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 8)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 9)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 10)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 5)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 7)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 0)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 2)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 3)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 1)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 4)
- GitHub Check: unit-test-backend-1-gpu-amd (linux-mi325-gpu-1, 6)
- GitHub Check: unit-test-backend-8-gpu-amd (linux-mi300-gpu-8, 0)
- GitHub Check: unit-test-sgl-kernel-amd (linux-mi325-gpu-1)
- GitHub Check: unit-test-backend-2-gpu-amd (linux-mi325-gpu-2)
- GitHub Check: unit-test-backend-8-gpu-amd (linux-mi300-gpu-8, 1)
- GitHub Check: bench-test-2-gpu-amd (linux-mi325-gpu-2)
- GitHub Check: performance-test-1-gpu-part-1-amd (linux-mi325-gpu-1)
- GitHub Check: accuracy-test-1-gpu-amd (linux-mi325-gpu-1)
- GitHub Check: performance-test-1-gpu-part-2-amd (linux-mi325-gpu-1)
- GitHub Check: accuracy-test-2-gpu-amd (linux-mi325-gpu-2)
- GitHub Check: mla-test-1-gpu-amd (linux-mi325-gpu-1)
🔇 Additional comments (3)
scripts/sort_testcases_alphabetically.py (1)
176-176: Bump to 800s looks fine; verify CI scheduling impact. The large increase may reshuffle buckets or exceed per-job timeouts. Confirm this won't cause queue starvation on the vllm_dependency_test shard.
python/sglang/srt/distributed/device_communicators/all_reduce_utils.py (1)
6-8: Increased SYMM_MEM max sizes; please validate against the device memory budget. The larger caps should help throughput but can raise pressure on symmetric buffers. Sanity-check:
- Typical per-rank tensor sizes that hit these buckets
- Multi-model co-location and fragmentation behavior
If available, share a quick microbenchmark delta before/after.
Also applies to: 12-12
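A minimal microbenchmark sketch for collecting that delta, to be run once before and once after the size bump on an initialized distributed group (the tensor_model_parallel_all_reduce entry point and the size sweep are assumptions; substitute the actual all-reduce call):

```python
import time
import torch

def bench_all_reduce(all_reduce_fn, numel, iters=100, warmup=10):
    """Return mean milliseconds per all-reduce over a bf16 tensor of `numel` elements."""
    x = torch.randn(numel, dtype=torch.bfloat16, device="cuda")
    for _ in range(warmup):
        all_reduce_fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        all_reduce_fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

# Sweep sizes around the SYMM_MEM buckets of interest.
for numel in (1 << 16, 1 << 20, 1 << 24):
    print(numel, bench_all_reduce(tensor_model_parallel_all_reduce, numel))
```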
python/sglang/srt/managers/schedule_batch.py (1)
1576-1581: Robust seq_lens_sum computation LGTM. It covers both tensor and list paths and keeps an int. Please sanity-check that callers (e.g., get_model_worker_batch) won't assume seq_lens_cpu is always a Tensor.
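A sketch of the tensor/list-robust computation this comment describes (the function name is hypothetical; the actual change lives inline in schedule_batch.py):

```python
import torch

def compute_seq_lens_sum(seq_lens) -> int:
    # Keep the result a plain Python int regardless of the input container.
    if isinstance(seq_lens, torch.Tensor):
        return int(seq_lens.sum().item())
    return int(sum(seq_lens))
```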
JustinTong0323 left a comment
LGTM. But why would it affect the vllm test?
424525e to f61a6ce (Compare)
f61a6ce to a94c8fe (Compare)
…11457) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
The previous PR #11381 was reverted because it broke the vllm dependency test; the following case failed: ./test/srt/test_gptqmodel_dynamic.py.
The root cause is that the previous PR included logic to fuse all-reduce to improve performance, and that logic depends on dp_attention being enabled. Qwen2 doesn't support dp_attention, so the prerequisite check asserted; see the sketch below for the failure mode. This PR removes the all-reduce fusion part and relands #11381.
For more details please refer to #11381 .
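For illustration, the failure mode is the kind of guard sketched below. The function and flag names here are hypothetical, not the repository's actual symbols:

```python
def check_all_reduce_fusion_prerequisites(server_args) -> None:
    # Hypothetical prerequisite check: all-reduce fusion assumed dp_attention
    # was enabled. On Qwen2, which doesn't support dp_attention, an assertion
    # like this fires, which is what broke test_gptqmodel_dynamic.py in the
    # vllm dependency CI.
    assert server_args.enable_dp_attention, (
        "all-reduce fusion requires dp_attention to be enabled"
    )
```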
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Summary by CodeRabbit