Emem qwen3 next (After https://github.com/sgl-project/sglang/pull/13581) #14597
pansicheng wants to merge 4 commits into sgl-project:main
Conversation
- Use oversubscribe instead of expand
- Implement elastic memory pool for KV cache
- Implement elastic memory pool allocator
- ElasticMempoolOrchestrator
- Fix resizing timing of elastic mempool during prefill batch creation
- Fix can_unmap
- Simplify reduction
- Enhance elastic memory management with free_all, improved token tracking, and optimized orchestration
- Add CUDA synchronization in orchestrator resize operations
- Clean code
Summary of Changes

Hello @pansicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements an elastic memory management system within SGLang, allowing for dynamic resizing of KV cache and Mamba state memory pools on the GPU. This enhancement is crucial for optimizing memory utilization and potentially increasing the number of concurrent requests or context lengths, particularly for demanding models like Qwen3-Next. The changes involve core memory allocation logic, scheduler integration, and model-specific adaptations to leverage this new dynamic capability.
Code Review
This pull request introduces a significant new feature: Elastic Memory Management (emem). The changes are extensive, adding new modules for elastic allocators, memory pools, and an orchestrator. The core idea is to allow dynamic resizing of memory pools (like KV cache) based on usage, which can improve memory utilization, especially in scenarios with varying workloads.
The implementation uses kvcached.etensor for the underlying elastic tensors, which is a solid foundation. The integration into the existing scheduler and memory management components seems well-thought-out, with conditional logic to enable/disable the feature via an environment variable. The refactoring in memory_pool.py and allocator.py to support extensibility is a good software engineering practice.
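For illustration, here is a minimal sketch of what environment-variable gating between a static and an elastic pool could look like. The flag name `SGLANG_ENABLE_ELASTIC_MEMORY` and both pool classes are hypothetical placeholders, not identifiers from this PR.

```python
import os


class StaticTokenPool:
    """Placeholder for a conventional fixed-size KV-cache pool."""

    def __init__(self, num_tokens: int):
        self.capacity = num_tokens  # reserved up front, never resized


class ElasticTokenPool:
    """Placeholder for a pool whose backing memory can grow and shrink."""

    def __init__(self, max_tokens: int, initial_tokens: int):
        self.max_capacity = max_tokens  # upper bound of the virtual reservation
        self.capacity = initial_tokens  # physically mapped right now

    def resize(self, new_tokens: int) -> None:
        # A real implementation would map/unmap physical pages here.
        self.capacity = min(new_tokens, self.max_capacity)


# Hypothetical flag name: the PR gates the feature behind an environment
# variable, but this exact spelling is illustrative only.
USE_ELASTIC = os.environ.get("SGLANG_ENABLE_ELASTIC_MEMORY", "0") == "1"

pool = (
    ElasticTokenPool(max_tokens=1_000_000, initial_tokens=100_000)
    if USE_ELASTIC
    else StaticTokenPool(num_tokens=1_000_000)
)
```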
I have a few comments on potential improvements and possible bugs, mainly concerning efficiency in the allocator, a potentially incorrect assertion, and a logic change in weight loading that might be a bug. I've also pointed out some minor code redundancies.
Overall, this is a great contribution that adds a powerful capability to SGLang.
```python
        mamba_value
    )
    # Do we need an assertion here? Can we just skip the cache when mamba_value_forked is None?
    assert mamba_value_forked is not None, "Can not alloc mamba cache"
```
The assert statement will crash the server if mamba cache allocation fails even after an eviction attempt. As the comment on the preceding line suggests, it would be more robust to handle this case gracefully by skipping the caching of this request instead of asserting.
```python
if mamba_value_forked is None:
    logger.warning("Could not allocate mamba cache for an unfinished request. Caching is skipped.")
    return
```

```python
self.free_pages = self.free_pages[:-reduce_size]
unmap_num, cur_size = self._kvcache.reduce(new_size)
logger.debug(f"{(unmap_num, cur_size)=}")
assert cur_size == self.free_pages[-1]
```
This assertion seems too strong. It assumes that after truncation, the largest free page index is exactly equal to the new size of the pool. This holds only if there are no "holes" in the free page list right before the consecutive block at the tail that is being partially reclaimed. If there are holes, self.free_pages[-1] could be smaller than cur_size, causing a false assertion failure. A more robust check would be assert self.size == cur_size (which is done on the next line) and perhaps assert self.free_pages[-1] <= cur_size. The ElasticMambaPoolAllocator uses assert cur_size == self._kvcache.free_slots[-1] + 1, which seems more correct for its 0-indexed slots.
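A toy, runnable illustration of the "holes" scenario described above; the page numbers and sizes are invented for this example:

```python
# Pool of 10 pages; pages 3..6 are still in use, pages 2 and 7..9 are free.
free_pages = [2, 7, 8, 9]
tail_consecutive = [7, 8, 9]          # consecutive free block at the tail
reduce_size = len(tail_consecutive)   # reclaim the whole tail block
new_size = 10 - reduce_size           # pool shrinks to 7 pages

free_pages = free_pages[:-reduce_size]
cur_size = new_size                   # what the elastic backend would report

# The original check `cur_size == free_pages[-1]` fails: 7 != 2, even though
# the shrink itself was perfectly valid.
assert free_pages[-1] <= cur_size     # the weaker invariant suggested above holds
assert free_pages[-1] != cur_size     # ...while strict equality does not
```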
```diff
     name.endswith(".bias") or name.endswith("_bias")
-) and name not in params_dict:
+) or name not in params_dict:
```
The logic in this if condition seems to have been unintentionally changed from and to or. The original logic was (A or B) and C, but the new logic is (A or B) or C. This will cause all parameters ending with .bias or _bias to be skipped during weight loading, which is likely not the intended behavior. The goal is probably to skip biases that are not present in the checkpoint (params_dict), not all biases. The second hunk correctly adds a check for name not in params_dict, which makes this change even more suspicious. I recommend reverting this part of the condition.
Suggested change:

```diff
     name.endswith(".bias") or name.endswith("_bias")
-) or name not in params_dict:
+) and name not in params_dict:
```
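To make the difference concrete, a small hedged example; the parameter name and dictionary contents are invented for illustration:

```python
# Hypothetical parameter name; assume the corresponding bias is present in params_dict.
name = "layers.0.self_attn.q_proj.bias"
params_dict = {name: object()}

is_bias = name.endswith(".bias") or name.endswith("_bias")

# Original condition: skip a bias only when it is missing from params_dict.
skip_with_and = is_bias and name not in params_dict   # False -> the bias is loaded
# Changed condition: every bias is skipped, even ones that are expected.
skip_with_or = is_bias or name not in params_dict     # True  -> the bias is skipped

assert skip_with_and is False and skip_with_or is True
```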
```python
def alloc(self, need_size: int):
    self.merge_and_sort_free()
    return super().alloc(need_size)
```
The alloc method calls self.merge_and_sort_free() on every invocation. This can be inefficient as it sorts the free pages list every time an allocation is requested, especially for frequent small allocations. Consider optimizing this by sorting only when necessary, for example, when an allocation would fail without merging and sorting, or by using a more efficient data structure for free pages that doesn't require frequent sorting (like a min-heap or a balanced tree if contiguous blocks are needed).
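One possible shape of the lazy variant, as a hedged sketch: the split between an already-sorted `free_pages` list and a pending `release_pages` buffer is an assumption about how `merge_and_sort_free` behaves, and the class below is illustrative rather than the PR's allocator.

```python
import torch


class LazyMergeAllocatorSketch:
    """Toy paged allocator that defers merge-and-sort until an allocation
    would otherwise fail; illustrative only, not the PR's implementation."""

    def __init__(self, size: int):
        self.free_pages = torch.arange(size, dtype=torch.int64)
        self.release_pages = torch.empty(0, dtype=torch.int64)  # freed, not yet merged

    def free(self, pages: torch.Tensor) -> None:
        # Cheap append on the hot free path; no sorting here.
        self.release_pages = torch.cat([self.release_pages, pages])

    def merge_and_sort_free(self) -> None:
        merged = torch.cat([self.free_pages, self.release_pages])
        self.free_pages = torch.sort(merged).values
        self.release_pages = torch.empty(0, dtype=torch.int64)

    def alloc(self, need_size: int):
        # Slow path only when the fast path cannot serve the request.
        if need_size > self.free_pages.numel():
            self.merge_and_sort_free()
        if need_size > self.free_pages.numel():
            return None
        out = self.free_pages[:need_size]
        self.free_pages = self.free_pages[need_size:]
        return out
```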
```python
assert False
```

```python
reduce_size = tail_consecutive_size // 2
reduce_size = reduce_size
```
```python
)
cu_mem_per_state_layer = cu_page_num_per_state_layer * cu_page_size
token_num = cu_mem_per_state_layer // self.state_memsize
token_num = token_num
```
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist