
Precompute swa cache location #20449

Merged
ispobock merged 3 commits into main from use-swa-loc on Mar 16, 2026

Conversation

@ispobock (Collaborator)

Motivation

Precompute out_cache_loc_swa once to avoid repeating the translation in every SWA layer. Applied to normal extend and the piecewise CUDA graph.
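
A minimal sketch of the idea (the translation table and function names below are illustrative assumptions, not the exact SGLang API): instead of every sliding-window layer re-translating the full-attention cache locations, the translation happens once when the batch is built, and all SWA layers reuse the result.

```python
import torch

# Sketch only, not the actual SGLang code. In a hybrid SWA setup the
# sliding-window layers use a separate KV pool, so full-attention cache
# slots must be mapped to SWA-pool slots before KV is written.
full_to_swa = torch.arange(1024)          # hypothetical translation table
out_cache_loc = torch.tensor([3, 7, 42])  # slots in the full-attention pool

# Before: each SWA layer repeats the same lookup.
def swa_layer_loc_old(out_cache_loc: torch.Tensor) -> torch.Tensor:
    return full_to_swa[out_cache_loc]     # redundant per-layer translation

# After: translate once when the ForwardBatch is built ...
out_cache_loc_swa = full_to_swa[out_cache_loc]

# ... and every SWA layer just reads the precomputed indices.
def swa_layer_loc_new(out_cache_loc_swa: torch.Tensor) -> torch.Tensor:
    return out_cache_loc_swa
```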

Modifications

Accuracy Tests

python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 4 --trust-remote-code --mem-fraction-static 0.7 --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' --attention-backend fa3
python3 benchmark/gsm8k/bench_sglang.py --parallel 1400 --num-questions 1400

Accuracy: 0.853
Invalid: 0.000
Latency: 25.939 s
Output throughput: 6821.148 token/s

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance optimization by precomputing the SWA (Sliding Window Attention) cache location. By calculating out_cache_loc_swa upfront, the system avoids repeated translation costs across multiple SWA layers, leading to more efficient execution, particularly benefiting operations involving normal extensions and piecewise CUDA graphs. This change streamlines the memory access pattern for SWA, contributing to overall system speed.

Highlights

  • Performance Optimization: The out_cache_loc_swa is now precomputed once, eliminating redundant translation operations within each SWA layer. This optimization applies to both normal extend and piecewise CUDA graph operations.
  • Code Clean-up: A TODO comment related to out_cache_loc_swa integration into multiple attention backends was removed from forward_batch_info.py.
  • CUDA Graph Integration: The precomputed SWA cache location is now correctly utilized within the piecewise_cuda_graph_runner for improved efficiency during CUDA graph replay.


Changelog
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Removed a TODO comment regarding out_cache_loc_swa integration.
    • Added logic within the init_new method to precompute out_cache_loc_swa for hybrid SWA configurations.
  • python/sglang/srt/model_executor/model_runner.py
    • Integrated the use of the precomputed out_cache_loc_swa within the _forward_raw method.
  • python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
    • Corrected a variable reference from forward_batch.out_cache_loc_swa to buffers.out_cache_loc_swa in replay_prepare.
    • Applied the precomputed out_cache_loc_swa within the replay_prepare method for CUDA graph operations (see the sketch after this changelog).
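
A rough sketch of the replay path (class and field names are simplified from the changelog above, and the copy-into-static-buffer pattern is an assumption about how the piecewise runner works): a captured CUDA graph always reads the same device addresses, so replay_prepare must copy the precomputed SWA locations into the pre-allocated buffer rather than hand the graph a fresh tensor.

```python
import torch

class Buffers:
    """Pre-allocated ("static") input buffers that the captured graph reads."""
    def __init__(self, max_tokens: int, device: str = "cpu"):
        self.out_cache_loc = torch.zeros(max_tokens, dtype=torch.int64, device=device)
        self.out_cache_loc_swa = torch.zeros(max_tokens, dtype=torch.int64, device=device)

def replay_prepare(buffers: Buffers, forward_batch) -> None:
    # Copy per-batch values into the static buffers before graph replay.
    n = forward_batch.out_cache_loc.numel()
    buffers.out_cache_loc[:n].copy_(forward_batch.out_cache_loc)
    if forward_batch.out_cache_loc_swa is not None:
        # Reuse the locations precomputed in ForwardBatch.init_new instead of
        # re-translating here; SWA layers then read buffers.out_cache_loc_swa.
        buffers.out_cache_loc_swa[:n].copy_(forward_batch.out_cache_loc_swa)
```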

@ispobock (Collaborator, Author)

/tag-and-rerun-ci

@gemini-code-assist (Contributor) Bot left a comment


Code Review

This pull request introduces a performance optimization by precomputing the Sliding Window Attention (SWA) cache location. The changes are well-implemented and logically sound. The precomputation is correctly handled in ForwardBatch.init_new for the standard execution path and within replay_prepare for the piecewise CUDA graph path. This avoids redundant translations in each SWA layer, which should improve performance as intended. The related modifications in model_runner.py and piecewise_cuda_graph_runner.py correctly utilize these precomputed values. Additionally, a minor bug fix in piecewise_cuda_graph_runner.py improves the robustness of the SWA check. Overall, the changes are correct and beneficial.

ispobock merged commit 39336f5 into main on Mar 16, 2026
222 of 252 checks passed
ispobock deleted the use-swa-loc branch on March 16, 2026, 06:38
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026