
Emem qwen3 next (After https://github.com/sgl-project/sglang/pull/13581) #14597

Open

pansicheng wants to merge 4 commits into sgl-project:main from pansicheng:emem-qwen3-next

Conversation

@pansicheng (Collaborator)

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Use oversubscribe instead of expand

Implement elastic memory pool for KV cache

Implement elastic memory pool allocator

ElasticMempoolOrchestrator

Fix resizing timing of elastic mempool during prefill batch creation

Fix can_unmap

Simplify reduction

Enhance elastic memory management with free_all, improved token tracking, and optimized orchestration

Add CUDA synchronization in orchestrator resize operations

Clean code
@gemini-code-assist (Contributor)

Summary of Changes

Hello @pansicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements an elastic memory management system within SGLang, allowing for dynamic resizing of KV cache and Mamba state memory pools on the GPU. This enhancement is crucial for optimizing memory utilization and potentially increasing the number of concurrent requests or context lengths, particularly for demanding models like Qwen3-Next. The changes involve core memory allocation logic, scheduler integration, and model-specific adaptations to leverage this new dynamic capability.

Highlights

  • Elastic Memory Pool Introduction: This pull request introduces a new 'Elastic Memory Pool' feature, designed to dynamically manage GPU memory for KV cache and Mamba states, aiming to improve memory efficiency and throughput for large language models.
  • Dynamic Memory Orchestration: A new ElasticMempoolOrchestrator is added to coordinate between different memory allocators, enabling real-time expansion and reduction of memory allocations based on demand.
  • Refactored Memory Management: Existing memory pool and allocator classes have been refactored and extended with elastic counterparts (ElasticMHATokenToKVPool, ElasticSWAKVPool, ElasticTokenToKVPoolAllocator, etc.) to support dynamic resizing using ETensor.
  • Qwen3-Next Integration and Benchmarking: The elastic memory system is integrated into the model runner, with specific support for GptOssForCausalLM and Qwen3-Next models. New development scripts and plotting tools are included for benchmarking and analyzing token usage and performance.
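For readers new to the pattern, below is a minimal, self-contained sketch of the general idea behind an elastic pool: a slot pool whose usable capacity can be expanded or reduced at runtime, with reduction only permitted when the tail slots being dropped are free. The names used here (SimpleElasticPool, expand, reduce) are illustrative only and are not the classes this PR adds (ElasticMHATokenToKVPool, ElasticMempoolOrchestrator, ETensor).

```python
from typing import List


class SimpleElasticPool:
    """Toy elastic slot pool (illustrative only; not this PR's implementation).

    Capacity can grow (expand) or shrink (reduce); shrinking is only allowed
    when the tail slots being dropped are currently free.
    """

    def __init__(self, initial_size: int) -> None:
        self.size = initial_size
        # Free slot indices; kept sorted so the tail of the pool is easy to inspect.
        self.free_slots: List[int] = list(range(initial_size))

    def alloc(self, need: int) -> List[int]:
        if need > len(self.free_slots):
            raise MemoryError("not enough free slots")
        out, self.free_slots = self.free_slots[:need], self.free_slots[need:]
        return out

    def free(self, slots: List[int]) -> None:
        self.free_slots = sorted(self.free_slots + slots)

    def expand(self, new_size: int) -> None:
        # Newly added slots become free immediately.
        self.free_slots.extend(range(self.size, new_size))
        self.size = new_size

    def reduce(self, new_size: int) -> None:
        # Only drop tail slots that are free; refuse to shrink otherwise.
        dropped = [s for s in self.free_slots if s >= new_size]
        if len(dropped) != self.size - new_size:
            raise RuntimeError("tail slots still in use; cannot shrink")
        self.free_slots = [s for s in self.free_slots if s < new_size]
        self.size = new_size


if __name__ == "__main__":
    pool = SimpleElasticPool(8)
    req = pool.alloc(3)   # take 3 slots
    pool.expand(12)       # grow the pool when spare GPU memory is available
    pool.free(req)        # return the slots
    pool.reduce(6)        # shrink once the tail is entirely free again
```

In the PR itself the backing storage is GPU memory handled through ETensor, and ElasticMempoolOrchestrator coordinates when each allocator expands or reduces; the toy above only illustrates the slot bookkeeping.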

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a significant new feature: Elastic Memory Management (emem). The changes are extensive, adding new modules for elastic allocators, memory pools, and an orchestrator. The core idea is to allow dynamic resizing of memory pools (like KV cache) based on usage, which can improve memory utilization, especially in scenarios with varying workloads.

The implementation uses kvcached.etensor for the underlying elastic tensors, which is a solid foundation. The integration into the existing scheduler and memory management components seems well-thought-out, with conditional logic to enable/disable the feature via an environment variable. The refactoring in memory_pool.py and allocator.py to support extensibility is a good software engineering practice.

I have a few comments on potential improvements and possible bugs, mainly concerning allocator efficiency, a potentially incorrect assertion, and a logic change in weight loading that looks unintentional. I've also pointed out some minor code redundancies.

Overall, this is a great contribution that adds a powerful capability to SGLang.

mamba_value
)
# Do we need an assertion here? Can we just skip the cache when mamba_value_forked is None?
assert mamba_value_forked is not None, "Can not alloc mamba cache"

Severity: high

The assert statement will crash the server if mamba cache allocation fails even after an eviction attempt. As the comment on the preceding line suggests, it would be more robust to handle this case gracefully by skipping the caching of this request instead of asserting.

if mamba_value_forked is None:
    logger.warning("Could not allocate mamba cache for an unfinished request. Caching is skipped.")
    return

self.free_pages = self.free_pages[:-reduce_size]
unmap_num, cur_size = self._kvcache.reduce(new_size)
logger.debug(f"{(unmap_num, cur_size)=}")
assert cur_size == self.free_pages[-1]

Severity: high

This assertion seems too strong. It assumes that after truncation, the largest free page index is exactly equal to the new size of the pool. This holds only if there are no "holes" in the free page list right before the consecutive block at the tail that is being partially reclaimed. If there are holes, self.free_pages[-1] could be smaller than cur_size, causing a false assertion failure. A more robust check would be assert self.size == cur_size (which is done on the next line) and perhaps assert self.free_pages[-1] <= cur_size. The ElasticMambaPoolAllocator uses assert cur_size == self._kvcache.free_slots[-1] + 1, which seems more correct for its 0-indexed slots.
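A minimal sketch of the relaxed check, reusing only the names visible in the snippet above (self.free_pages, self._kvcache.reduce, self.size, logger); it illustrates the suggestion rather than a verified patch:

```python
self.free_pages = self.free_pages[:-reduce_size]
unmap_num, cur_size = self._kvcache.reduce(new_size)
logger.debug(f"{(unmap_num, cur_size)=}")
# Holes earlier in the free list mean the largest remaining free page can sit
# strictly below the new boundary, so only an upper bound is guaranteed here.
assert self.free_pages[-1] <= cur_size
# The exact-size invariant is checked against the pool size instead.
assert self.size == cur_size
```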

Comment on lines 1012 to +1013

      name.endswith(".bias") or name.endswith("_bias")
- ) and name not in params_dict:
+ ) or name not in params_dict:

Severity: high

The logic in this if condition seems to have been unintentionally changed from and to or. The original logic was (A or B) and C, but the new logic is (A or B) or C. This will cause all parameters ending with .bias or _bias to be skipped during weight loading, which is likely not the intended behavior. The goal is probably to skip biases that are not present in the checkpoint (params_dict), not all biases. The second hunk correctly adds a check for name not in params_dict, which makes this change even more suspicious. I recommend reverting this part of the condition.

Suggested change

      name.endswith(".bias") or name.endswith("_bias")
- ) or name not in params_dict:
+ ) and name not in params_dict:

Comment on lines +55 to +57
def alloc(self, need_size: int):
    self.merge_and_sort_free()
    return super().alloc(need_size)

Severity: medium

The alloc method calls self.merge_and_sort_free() on every invocation. This can be inefficient as it sorts the free pages list every time an allocation is requested, especially for frequent small allocations. Consider optimizing this by sorting only when necessary, for example, when an allocation would fail without merging and sorting, or by using a more efficient data structure for free pages that doesn't require frequent sorting (like a min-heap or a balanced tree if contiguous blocks are needed).
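One possible shape for a lazier variant, reusing only the names from the snippet above (alloc, merge_and_sort_free, the parent class's alloc); whether the parent alloc returns None on failure is an assumption here and would need to match the actual allocator's contract:

```python
def alloc(self, need_size: int):
    # Fast path: try the free list as-is.
    out = super().alloc(need_size)
    if out is not None:
        return out
    # Slow path: coalesce and sort the free pages only when the fast path
    # fails, then retry once.
    self.merge_and_sort_free()
    return super().alloc(need_size)
```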

assert False

reduce_size = tail_consecutive_size // 2
reduce_size = reduce_size

Severity: medium

This line is redundant and can be removed.

)
cu_mem_per_state_layer = cu_page_num_per_state_layer * cu_page_size
token_num = cu_mem_per_state_layer // self.state_memsize
token_num = token_num

Severity: medium

This line is redundant and can be removed.

@pansicheng mentioned this pull request on Apr 17, 2026