
Precompute swa cache location #20449

Merged
ispobock merged 3 commits into main from use-swa-loc on Mar 16, 2026

Conversation

@ispobock (Collaborator)

Motivation

Precompute out_cache_loc_swa once to avoid repeating the translation in every SWA layer. Applied to normal extend and the piecewise CUDA graph.
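
A minimal sketch of the idea (the translation table and function names below are illustrative assumptions, not the exact SGLang API): instead of every sliding-window layer re-translating the full-attention cache locations, the translation happens once when the batch is built, and all SWA layers reuse the result.

```python
import torch

# Sketch only, not the actual SGLang code. In a hybrid SWA setup the
# sliding-window layers use a separate KV pool, so full-attention cache
# slots must be mapped to SWA-pool slots before KV is written.
full_to_swa = torch.arange(1024)          # hypothetical translation table
out_cache_loc = torch.tensor([3, 7, 42])  # slots in the full-attention pool

# Before: each SWA layer repeats the same lookup.
def swa_layer_loc_old(out_cache_loc: torch.Tensor) -> torch.Tensor:
    return full_to_swa[out_cache_loc]     # redundant per-layer translation

# After: translate once when the ForwardBatch is built ...
out_cache_loc_swa = full_to_swa[out_cache_loc]

# ... and every SWA layer just reads the precomputed indices.
def swa_layer_loc_new(out_cache_loc_swa: torch.Tensor) -> torch.Tensor:
    return out_cache_loc_swa
```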

Modifications

Accuracy Tests

python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 4 --trust-remote-code --mem-fraction-static 0.7 --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' --attention-backend fa3
python3 benchmark/gsm8k/bench_sglang.py --parallel 1400 --num-questions 1400

Accuracy: 0.853
Invalid: 0.000
Latency: 25.939 s
Output throughput: 6821.148 token/s

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance optimization by precomputing the SWA (Sliding Window Attention) cache location. By calculating out_cache_loc_swa upfront, the system avoids repeated translation costs across multiple SWA layers, leading to more efficient execution, particularly benefiting operations involving normal extensions and piecewise CUDA graphs. This change streamlines the memory access pattern for SWA, contributing to overall system speed.

Highlights

  • Performance Optimization: The out_cache_loc_swa is now precomputed once, eliminating redundant translation operations within each SWA layer. This optimization applies to both normal extend and piecewise CUDA graph operations.
  • Code Clean-up: A TODO comment related to out_cache_loc_swa integration into multiple attention backends was removed from forward_batch_info.py.
  • CUDA Graph Integration: The precomputed SWA cache location is now correctly utilized within the piecewise_cuda_graph_runner for improved efficiency during CUDA graph replay.


Changelog
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Removed a TODO comment regarding out_cache_loc_swa integration.
    • Added logic within the init_new method to precompute out_cache_loc_swa for hybrid SWA configurations.
  • python/sglang/srt/model_executor/model_runner.py
    • Integrated the use of the precomputed out_cache_loc_swa within the _forward_raw method.
  • python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
    • Corrected a variable reference from forward_batch.out_cache_loc_swa to buffers.out_cache_loc_swa in replay_prepare.
    • Applied the precomputed out_cache_loc_swa within the replay_prepare method for CUDA graph operations (see the sketch after this changelog).
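
A rough sketch of the replay path (class and field names are simplified from the changelog above, and the copy-into-static-buffer pattern is an assumption about how the piecewise runner works): a captured CUDA graph always reads the same device addresses, so replay_prepare must copy the precomputed SWA locations into the pre-allocated buffer rather than hand the graph a fresh tensor.

```python
import torch

class Buffers:
    """Pre-allocated ("static") input buffers that the captured graph reads."""
    def __init__(self, max_tokens: int, device: str = "cpu"):
        self.out_cache_loc = torch.zeros(max_tokens, dtype=torch.int64, device=device)
        self.out_cache_loc_swa = torch.zeros(max_tokens, dtype=torch.int64, device=device)

def replay_prepare(buffers: Buffers, forward_batch) -> None:
    # Copy per-batch values into the static buffers before graph replay.
    n = forward_batch.out_cache_loc.numel()
    buffers.out_cache_loc[:n].copy_(forward_batch.out_cache_loc)
    if forward_batch.out_cache_loc_swa is not None:
        # Reuse the locations precomputed in ForwardBatch.init_new instead of
        # re-translating here; SWA layers then read buffers.out_cache_loc_swa.
        buffers.out_cache_loc_swa[:n].copy_(forward_batch.out_cache_loc_swa)
```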

@ispobock (Collaborator, Author)

/tag-and-rerun-ci

@gemini-code-assist (Contributor) Bot left a comment


Code Review

This pull request introduces a performance optimization by precomputing the Sliding Window Attention (SWA) cache location. The changes are well-implemented and logically sound. The precomputation is correctly handled in ForwardBatch.init_new for the standard execution path and within replay_prepare for the piecewise CUDA graph path. This avoids redundant translations in each SWA layer, which should improve performance as intended. The related modifications in model_runner.py and piecewise_cuda_graph_runner.py correctly utilize these precomputed values. Additionally, a minor bug fix in piecewise_cuda_graph_runner.py improves the robustness of the SWA check. Overall, the changes are correct and beneficial.

ispobock merged commit 39336f5 into main on Mar 16, 2026
222 of 252 checks passed
ispobock deleted the use-swa-loc branch on March 16, 2026, 06:38
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026