
Enable memory saver for hybrid model #11974

Merged
ispobock merged 12 commits into sgl-project:main from ocss884:enable_hybrid_mem_saver
Nov 4, 2025

Conversation

@ocss884
Collaborator

ocss884 commented Oct 22, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @ocss884, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a memory saving mechanism for the hybrid model, specifically targeting the Mamba state and KV cache allocations. By introducing an enable_memory_saver flag and integrating a TorchMemorySaverAdapter, the changes allow for more efficient GPU memory utilization. This optimization aims to reduce the overall memory footprint of the model during inference, which can be critical for deploying larger models or handling increased concurrency.

Highlights

  • Memory Saver Integration: Introduced an enable_memory_saver boolean parameter across various memory pool constructors and methods to control memory optimization.
  • Mamba Memory Pool Optimization: The MambaMemoryPool now utilizes a TorchMemorySaverAdapter to wrap the allocation of conv_state and temporal_state, allowing for conditional memory saving.
  • Hybrid Memory Pool Configuration: The HybridMemoryPool constructor and its internal _init_mamba_pool method have been updated to accept and propagate the enable_memory_saver flag to underlying memory components, including the KVCache.
  • Model Runner Update: The init_memory_pool function within the model_runner now passes the server_args.enable_memory_saver setting to the HybridMemoryPool during initialization, enabling server-wide control over this feature.
  • Unit Test Enhancement: A unit test (test_hybrid_linear_kv_pool) has been modified to explicitly enable enable_memory_saver when initializing the HybridMemoryPool, ensuring test coverage for the new functionality.
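The allocation pattern the highlights describe can be sketched as follows. TorchMemorySaverAdapter, GPU_MEMORY_TYPE_KV_CACHE, conv_state, and temporal_state are names from the diff; the adapter class below is a hypothetical no-op stand-in (the real adapter comes from the torch_memory_saver package and marks GPU allocations as pausable), and plain Python lists stand in for GPU tensors so the sketch runs anywhere:

```python
from contextlib import contextmanager

GPU_MEMORY_TYPE_KV_CACHE = "kv_cache"


class MemorySaverAdapter:
    """Hypothetical stand-in for TorchMemorySaverAdapter.

    When memory saving is disabled, region() is effectively a no-op, so
    allocation code can always use `with adapter.region(...)` unconditionally.
    """

    def __init__(self, enable: bool):
        self.enable = enable
        self.regions = []  # record entered regions, for illustration only

    @contextmanager
    def region(self, tag: str):
        if self.enable:
            # The real adapter would tag allocations made inside this
            # block so they can later be released and re-materialized.
            self.regions.append(tag)
        yield


class MambaPool:
    def __init__(self, size: int, enable_memory_saver: bool):
        self.adapter = MemorySaverAdapter(enable_memory_saver)
        # Buffers allocated inside the region become subject to the
        # memory saver when the flag is enabled.
        with self.adapter.region(GPU_MEMORY_TYPE_KV_CACHE):
            self.conv_state = [0.0] * size      # stand-in for a GPU tensor
            self.temporal_state = [0.0] * size  # stand-in for a GPU tensor


pool = MambaPool(4, enable_memory_saver=True)
print(pool.adapter.regions)  # ['kv_cache']
```

The key design point is that the flag changes what the context manager does, not whether the allocation code calls it, which keeps the pool constructors free of branching.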

@ocss884
Collaborator (Author)

ocss884 commented Oct 22, 2025

@yizhang2077 @fzyzcjy

@gemini-code-assist
Contributor

gemini-code-assist (bot) left a comment
Code Review

This pull request enables the memory saver feature for hybrid models. The changes primarily involve propagating the enable_memory_saver flag through various components, including MambaPool, HybridReqToTokenPool, and HybridLinearKVPool. The flag is then used to wrap memory-intensive buffer allocations with the TorchMemorySaverAdapter, which is the intended behavior. The implementation appears correct and consistent. I've found one minor issue regarding some leftover commented-out code that should be cleaned up.
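The propagation the review describes can be sketched minimally. The class and argument names (HybridLinearKVPool, KVCache, init_memory_pool, server_args.enable_memory_saver) follow the PR, but the bodies are simplified placeholders, not the actual sglang implementation:

```python
class KVCache:
    def __init__(self, enable_memory_saver: bool = False):
        # In the PR, this flag decides whether buffer allocation is
        # wrapped in a TorchMemorySaverAdapter region.
        self.enable_memory_saver = enable_memory_saver


class HybridLinearKVPool:
    def __init__(self, enable_memory_saver: bool = False):
        # Propagate the flag to every underlying memory component.
        self.enable_memory_saver = enable_memory_saver
        self.kv_cache = KVCache(enable_memory_saver=enable_memory_saver)


class ServerArgs:
    def __init__(self, enable_memory_saver: bool):
        self.enable_memory_saver = enable_memory_saver


def init_memory_pool(server_args: ServerArgs) -> HybridLinearKVPool:
    # model_runner passes the server-wide setting into the pool, so one
    # server flag controls memory saving for all hybrid-model buffers.
    return HybridLinearKVPool(
        enable_memory_saver=server_args.enable_memory_saver
    )


pool = init_memory_pool(ServerArgs(enable_memory_saver=True))
print(pool.kv_cache.enable_memory_saver)  # True
```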

Comment on lines +239 to +241
# def _create_buffers(self):
# with self.memory_saver_adapter.region(GPU_MEMORY_TYPE_KV_CACHE):

Contributor

Severity: medium

This commented-out code appears to be a leftover from a refactoring. To improve code clarity and maintainability, it should be removed.

Comment thread: python/sglang/srt/mem_cache/memory_pool.py (Outdated)

@fzyzcjy
Collaborator

fzyzcjy left a comment

LGTM if it is just adding with blocks and test passes

Comment thread: python/sglang/srt/mem_cache/memory_pool.py (Outdated)

@yizhang2077
Collaborator

yizhang2077 left a comment

LGTM overall

@fzyzcjy fzyzcjy added the run-ci label Oct 23, 2025
@fzyzcjy
Collaborator

fzyzcjy left a comment

LGTM reading the new diff

@ispobock ispobock merged commit 173e0f7 into sgl-project:main Nov 4, 2025
17 of 48 checks passed
@Fridge003
Collaborator

Fridge003 commented Nov 4, 2025

