Emem qwen3 next (After https://github.com/sgl-project/sglang/pull/13581) #14597
pansicheng wants to merge 4 commits into sgl-project:main
Conversation
- Use oversubscribe instead of expand
- Implement elastic memory pool for KV cache
- Implement elastic memory pool allocator
- ElasticMempoolOrchestrator
- Fix resizing timing of elastic mempool during prefill batch creation
- Fix can_unmap
- Simplify reduction
- Enhance elastic memory management with free_all, improved token tracking, and optimized orchestration
- Add CUDA synchronization in orchestrator resize operations
- Clean code
Summary of Changes

Hello @pansicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements an elastic memory management system within SGLang, allowing for dynamic resizing of KV cache and Mamba state memory pools on the GPU. This enhancement is crucial for optimizing memory utilization and potentially increasing the number of concurrent requests or context lengths, particularly for demanding models like Qwen3-Next. The changes involve core memory allocation logic, scheduler integration, and model-specific adaptations to leverage this new dynamic capability.
Code Review
This pull request introduces a significant new feature: Elastic Memory Management (emem). The changes are extensive, adding new modules for elastic allocators, memory pools, and an orchestrator. The core idea is to allow dynamic resizing of memory pools (like KV cache) based on usage, which can improve memory utilization, especially in scenarios with varying workloads.
The implementation uses kvcached.etensor for the underlying elastic tensors, which is a solid foundation. The integration into the existing scheduler and memory management components seems well-thought-out, with conditional logic to enable/disable the feature via an environment variable. The refactoring in memory_pool.py and allocator.py to support extensibility is a good software engineering practice.
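For illustration, here is a minimal sketch of what environment-variable gating between a static and an elastic pool could look like. The flag name `SGLANG_ENABLE_ELASTIC_MEMORY` and both pool classes are hypothetical placeholders, not identifiers from this PR.

```python
import os


class StaticTokenPool:
    """Placeholder for a conventional fixed-size KV-cache pool."""

    def __init__(self, num_tokens: int):
        self.capacity = num_tokens  # reserved up front, never resized


class ElasticTokenPool:
    """Placeholder for a pool whose backing memory can grow and shrink."""

    def __init__(self, max_tokens: int, initial_tokens: int):
        self.max_capacity = max_tokens  # upper bound of the virtual reservation
        self.capacity = initial_tokens  # physically mapped right now

    def resize(self, new_tokens: int) -> None:
        # A real implementation would map/unmap physical pages here.
        self.capacity = min(new_tokens, self.max_capacity)


# Hypothetical flag name: the PR gates the feature behind an environment
# variable, but this exact spelling is illustrative only.
USE_ELASTIC = os.environ.get("SGLANG_ENABLE_ELASTIC_MEMORY", "0") == "1"

pool = (
    ElasticTokenPool(max_tokens=1_000_000, initial_tokens=100_000)
    if USE_ELASTIC
    else StaticTokenPool(num_tokens=1_000_000)
)
```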
I have a few comments on potential improvements and possible bugs, mainly concerning efficiency in the allocator, a potentially incorrect assertion, and a logic change in weight loading that might be a bug. I've also pointed out some minor code redundancies.
Overall, this is a great contribution that adds a powerful capability to SGLang.
```python
        mamba_value
    )
    # Do we need an assertion here? Can we just skip the cache when mamba_value_forked is None?
    assert mamba_value_forked is not None, "Can not alloc mamba cache"
```
The assert statement will crash the server if mamba cache allocation fails even after an eviction attempt. As the comment on the preceding line suggests, it would be more robust to handle this case gracefully by skipping the caching of this request instead of asserting.
```python
if mamba_value_forked is None:
    logger.warning("Could not allocate mamba cache for an unfinished request. Caching is skipped.")
    return
```

```python
self.free_pages = self.free_pages[:-reduce_size]
unmap_num, cur_size = self._kvcache.reduce(new_size)
logger.debug(f"{(unmap_num, cur_size)=}")
assert cur_size == self.free_pages[-1]
```
This assertion seems too strong. It assumes that after truncation, the largest free page index is exactly equal to the new size of the pool. This holds only if there are no "holes" in the free page list right before the consecutive block at the tail that is being partially reclaimed. If there are holes, self.free_pages[-1] could be smaller than cur_size, causing a false assertion failure. A more robust check would be assert self.size == cur_size (which is done on the next line) and perhaps assert self.free_pages[-1] <= cur_size. The ElasticMambaPoolAllocator uses assert cur_size == self._kvcache.free_slots[-1] + 1, which seems more correct for its 0-indexed slots.
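A toy, runnable illustration of the "holes" scenario described above; the page numbers and sizes are invented for this example:

```python
# Pool of 10 pages; pages 3..6 are still in use, pages 2 and 7..9 are free.
free_pages = [2, 7, 8, 9]
tail_consecutive = [7, 8, 9]          # consecutive free block at the tail
reduce_size = len(tail_consecutive)   # reclaim the whole tail block
new_size = 10 - reduce_size           # pool shrinks to 7 pages

free_pages = free_pages[:-reduce_size]
cur_size = new_size                   # what the elastic backend would report

# The original check `cur_size == free_pages[-1]` fails: 7 != 2, even though
# the shrink itself was perfectly valid.
assert free_pages[-1] <= cur_size     # the weaker invariant suggested above holds
assert free_pages[-1] != cur_size     # ...while strict equality does not
```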
```diff
     name.endswith(".bias") or name.endswith("_bias")
-) and name not in params_dict:
+) or name not in params_dict:
```
The logic in this if condition seems to have been unintentionally changed from and to or. The original logic was (A or B) and C, but the new logic is (A or B) or C. This will cause all parameters ending with .bias or _bias to be skipped during weight loading, which is likely not the intended behavior. The goal is probably to skip biases that are not present in the checkpoint (params_dict), not all biases. The second hunk correctly adds a check for name not in params_dict, which makes this change even more suspicious. I recommend reverting this part of the condition.
Suggested change:

```diff
     name.endswith(".bias") or name.endswith("_bias")
-) or name not in params_dict:
+) and name not in params_dict:
```
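To make the difference concrete, a small hedged example; the parameter name and dictionary contents are invented for illustration:

```python
# Hypothetical parameter name; assume the corresponding bias is present in params_dict.
name = "layers.0.self_attn.q_proj.bias"
params_dict = {name: object()}

is_bias = name.endswith(".bias") or name.endswith("_bias")

# Original condition: skip a bias only when it is missing from params_dict.
skip_with_and = is_bias and name not in params_dict   # False -> the bias is loaded
# Changed condition: every bias is skipped, even ones that are expected.
skip_with_or = is_bias or name not in params_dict     # True  -> the bias is skipped

assert skip_with_and is False and skip_with_or is True
```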
```python
def alloc(self, need_size: int):
    self.merge_and_sort_free()
    return super().alloc(need_size)
```
The alloc method calls self.merge_and_sort_free() on every invocation. This can be inefficient as it sorts the free pages list every time an allocation is requested, especially for frequent small allocations. Consider optimizing this by sorting only when necessary, for example, when an allocation would fail without merging and sorting, or by using a more efficient data structure for free pages that doesn't require frequent sorting (like a min-heap or a balanced tree if contiguous blocks are needed).
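One possible shape of the lazy variant, as a hedged sketch: the split between an already-sorted `free_pages` list and a pending `release_pages` buffer is an assumption about how `merge_and_sort_free` behaves, and the class below is illustrative rather than the PR's allocator.

```python
import torch


class LazyMergeAllocatorSketch:
    """Toy paged allocator that defers merge-and-sort until an allocation
    would otherwise fail; illustrative only, not the PR's implementation."""

    def __init__(self, size: int):
        self.free_pages = torch.arange(size, dtype=torch.int64)
        self.release_pages = torch.empty(0, dtype=torch.int64)  # freed, not yet merged

    def free(self, pages: torch.Tensor) -> None:
        # Cheap append on the hot free path; no sorting here.
        self.release_pages = torch.cat([self.release_pages, pages])

    def merge_and_sort_free(self) -> None:
        merged = torch.cat([self.free_pages, self.release_pages])
        self.free_pages = torch.sort(merged).values
        self.release_pages = torch.empty(0, dtype=torch.int64)

    def alloc(self, need_size: int):
        # Slow path only when the fast path cannot serve the request.
        if need_size > self.free_pages.numel():
            self.merge_and_sort_free()
        if need_size > self.free_pages.numel():
            return None
        out = self.free_pages[:need_size]
        self.free_pages = self.free_pages[need_size:]
        return out
```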
```python
assert False
```

```python
reduce_size = tail_consecutive_size // 2
reduce_size = reduce_size
```
```python
)
cu_mem_per_state_layer = cu_page_num_per_state_layer * cu_page_size
token_num = cu_mem_per_state_layer // self.state_memsize
token_num = token_num
```
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist