Integration with elasticmem #13581
    # TODO: a more efficient way
    @override
    def alloc(self, need_size: int):
        self.merge_and_sort_free()
Please make this more efficient; merge_and_sort_free() currently runs on every alloc call.
Now we sort only during defragmentation.
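For readers following the thread, here is a minimal sketch of the "sort only during defragmentation" idea. It is not the PR's code: `LazyFreeList`, its methods, and the use of a plain Python list instead of a tensor are all assumptions made for illustration.

```python
class LazyFreeList:
    """Hypothetical free-page list that defers sorting until defragmentation."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # starts out sorted
        self.is_sorted = True

    def alloc(self, need: int):
        """Take pages from the front of the list without sorting first."""
        if need > len(self.free_pages):
            return None
        taken = self.free_pages[:need]
        self.free_pages = self.free_pages[need:]
        return taken

    def free(self, pages):
        """Return pages in O(len(pages)); the list may now be out of order."""
        self.free_pages.extend(pages)
        self.is_sorted = False

    def defragment(self):
        """Pay the sorting cost only here, and only if something changed."""
        if not self.is_sorted:
            self.free_pages.sort()
            self.is_sorted = True


pool = LazyFreeList(4)
first = pool.alloc(3)   # takes pages 0, 1, 2
pool.free([2, 0])       # free list is now unsorted
pool.defragment()       # sorted again, once
```

The trade-off in this sketch is that allocations between defragmentation passes may hand out non-contiguous pages, which is acceptable only if contiguity matters solely at defragmentation time.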
    if self.token_usage() > 0.9:
        return False

    self.evict(self.evictable_size())
Does can_unmap need to evict and call merge_and_sort? Both seem to be time-consuming.
Now can_unmap skips eviction and merge_and_sort; it uses an unused_pages tensor to track consecutive tail pages.
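A rough sketch of how tracking consecutive unused tail pages can answer can_unmap without evicting or sorting. This is an illustration under assumptions, not the PR's implementation: `TailPageTracker` and its method names are hypothetical, and a Python list of booleans stands in for the unused_pages tensor.

```python
class TailPageTracker:
    """Hypothetical tracker for unused pages at the tail of a mapped region."""

    def __init__(self, num_pages: int):
        # unused[i] is True when page i holds no live tokens.
        # Assumption: a list of bools stands in for the unused_pages tensor.
        self.unused = [True] * num_pages
        self.mapped_pages = num_pages

    def mark_used(self, page: int):
        self.unused[page] = False

    def mark_unused(self, page: int):
        self.unused[page] = True

    def unused_tail_pages(self) -> int:
        """Count consecutive unused pages, scanning back from the last mapped page."""
        count = 0
        for page in range(self.mapped_pages - 1, -1, -1):
            if not self.unused[page]:
                break
            count += 1
        return count

    def can_unmap(self, num_pages: int) -> bool:
        """Only the tail can be unmapped, so compare against the unused tail run."""
        return num_pages <= self.unused_tail_pages()


tracker = TailPageTracker(8)
tracker.mark_used(0)
tracker.mark_used(3)
tail = tracker.unused_tail_pages()  # pages 4..7 are unused
```

The check reads bookkeeping only and never touches the cache, which is the property the reply above describes.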
- Use oversubscribe instead of expand
- Implement elastic memory pool for KV cache
- Implement elastic memory pool allocator
- ElasticMempoolOrchestrator
- Fix resizing timing of elastic mempool during prefill batch creation
- Fix can_unmap
- Simplify reduction
- Enhance elastic memory management with free_all, improved token tracking, and optimized orchestration
- Add CUDA synchronization in orchestrator resize operations
- Clean code
Nice. I see how this aims to improve the max running batch size. In parallel, do we also aim to improve the prefix cache hit rate, by analyzing whether the SWA pool or the full pool is causing the cache misses?
@hanming-lu No problem. The current PR focuses on balancing pool usage to maximize batch size when some pools are near capacity. Next, we’ll monitor cache hit rates per pool and optimize scaling strategies to boost hit rates under balanced loads. Metrics and adaptive scaling will need further design; let’s tackle this next!
After merging upstream main into the PR sgl-project#13581 branch, several compatibility issues arose due to SWA code being refactored from memory_pool.py to swa_memory_pool.py:
- Add page_size parameter to SWATokenToKVPoolAllocator in allocator.py
- Fix elastic_allocator.py to import SWATokenToKVPoolAllocator from swa_memory_pool instead of allocator (fixes isinstance check in SWARadixCache)
- Rewrite ElasticSWATokenToKVPoolAllocator to replace parent allocators post-init instead of overriding _create_allocator (which parent no longer calls)
- Rewrite ElasticSWAKVPool to pass ElasticMHATokenToKVPool as pool class and recreate pools with pool_name parameter
- Fix isinstance check in model_runner_kv_cache_mixin.py (use isinstance instead of __class__ ==)
- Add missing get_float_env_var utility function to utils/common.py

Made-with: Cursor
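To illustrate why the `__class__ ==` comparison mentioned in this commit breaks once an elastic subclass is involved, here is a small standalone sketch. The two class names come from the commit message, but the empty bodies are stand-ins, and the assumption that the elastic allocator subclasses SWATokenToKVPoolAllocator is inferred from the message rather than confirmed.

```python
class SWATokenToKVPoolAllocator:
    """Stand-in for the real allocator; the body is intentionally empty."""


class ElasticSWATokenToKVPoolAllocator(SWATokenToKVPoolAllocator):
    """Stand-in subclass (assumed inheritance, see the note above)."""


allocator = ElasticSWATokenToKVPoolAllocator()

# An exact class comparison rejects subclasses.
exact_match = allocator.__class__ == SWATokenToKVPoolAllocator

# isinstance accepts the class and any of its subclasses.
subclass_match = isinstance(allocator, SWATokenToKVPoolAllocator)

print(exact_match, subclass_match)
```

The same reasoning applies to the import fix: isinstance only succeeds when both sides refer to the same class object, so importing a class of the same name from a different module would make the check fail.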
Hi, will this feature support GDN and Mamba models?
There’s a PR for Qwen3-Next support here: #14597. I’ll try to move it forward as soon as possible.
Thanks for your reply. I'm very interested in supporting a dynamic memory pool for Mamba/GDN models. Is there anything I can help with?
Motivation
This PR implements dynamic scaling between the pools for different attention types within hybrid models in sglang, based on elasticmem.
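As a rough illustration of scaling between pools, the sketch below moves pages from the least-loaded pool to the most-loaded one when the latter is near capacity. Everything here is an assumption for illustration: `Pool`, `rebalance`, the `step` size, and the idea that pages are interchangeable between pools are hypothetical. The 0.9 threshold mirrors the token_usage() check in the reviewed diff, but the PR's actual policy may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical view of one attention-type pool, measured in pages."""
    name: str
    capacity: int  # pages currently mapped
    used: int      # pages holding live tokens

    def usage(self) -> float:
        return self.used / self.capacity if self.capacity else 1.0


def rebalance(pools, high: float = 0.9, step: int = 16) -> int:
    """Move up to `step` pages from the least-loaded to the most-loaded pool.

    Returns the number of pages moved (0 when no move is warranted).
    """
    receiver = max(pools, key=Pool.usage)
    donor = min(pools, key=Pool.usage)
    if receiver is donor or receiver.usage() <= high:
        return 0
    # The donor must stay at or below the threshold after shrinking.
    max_give = donor.capacity - math.ceil(donor.used / high)
    moved = min(step, max_give)
    if moved <= 0:
        return 0
    donor.capacity -= moved
    receiver.capacity += moved
    return moved


full = Pool("full", capacity=100, used=95)
swa = Pool("swa", capacity=100, used=20)
moved = rebalance([full, swa])  # full grows, swa shrinks
```

In a real system the shrink would also have to respect which pages can actually be unmapped, which this sketch ignores.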
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist