
Integration with elasticmem #13581

Open
pansicheng wants to merge 5 commits into sgl-project:main from pansicheng:emem

Conversation

@pansicheng
Collaborator

Motivation

This PR implements dynamic scaling between different attention-type pools within the hybrid model in sglang, based on elasticmem.

[Flowchart image: a-Flowchart (10)]

Modifications

Accuracy Tests

Benchmarking and Profiling

export SGLANG_ELASTIC_MEM_POOL=true
export SGLANG_RATIO=1.0
nohup python3 -m sglang.launch_server \
  --log-level debug \
  --model /home/t4/models/lvm-data/Llama-4-Scout-17B-16E-Instruct \
  --tp 2 \
  --attention-backend fa3 \
  --hybrid-kvcache-ratio ${SGLANG_RATIO} \
  --context-length 200000 \
  > nohup.emem.${SGLANG_ELASTIC_MEM_POOL}.ratio.${SGLANG_RATIO}.out 2>&1 \
  &

export SGLANG_ELASTIC_MEM_POOL=true
export SGLANG_RATIO=1.0
nohup python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --dataset-path /home/t4/models/lvm-data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 --random-input 1024 --random-output 1024 --random-range-ratio 1 \
  --max-concurrency 128 \
  > nohup.bench.${SGLANG_ELASTIC_MEM_POOL}.ratio.${SGLANG_RATIO}.out 2>&1 \
  &
[Figure: token usage, running requests, and generation throughput per Prefill/Decode time step, for three pool configurations]
  • Horizontal axis: Each time step represents a log entry for Prefill/Decode.
  • Vertical axis: as labeled in the figure, one of:
    • Token usage for different pools,
    • Real-time running requests,
    • Generation throughput during decoding (set to 0 during prefill).
  • First row: Current static pool configuration with --hybrid-kvcache-ratio=0.6 (relatively balanced allocation between the full and swa pools).
  • Second row: Current static pool configuration with --hybrid-kvcache-ratio=1.0 (most GPU memory allocated to the full pool).
  • Third row: Elastic pool configuration with --hybrid-kvcache-ratio=1.0 (initial allocation favors the full pool but supports dynamic runtime adjustments).

Checklist

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread scripts/emem/plot/main.py
# TODO: a more efficient way
@override
def alloc(self, need_size: int):
    self.merge_and_sort_free()
Collaborator
make it more efficient

Collaborator Author
now we sort only during defragmentation
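The approach described above — allocating without sorting, and merging/sorting the free list only during an explicit defragmentation pass — can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual allocator; the class and method names here are hypothetical:

```python
# Sketch: a page free-list that defers merge/sort to defragmentation,
# so the alloc hot path never pays the sorting cost. Illustrative only.

class FreePageList:
    def __init__(self, num_pages: int):
        # (start, length) runs of free pages; the pool starts fully free.
        self.free_runs = [(0, num_pages)]

    def alloc(self, need_size: int):
        # Fast path: first-fit over existing runs, no sorting per call.
        for i, (start, length) in enumerate(self.free_runs):
            if length >= need_size:
                if length == need_size:
                    self.free_runs.pop(i)
                else:
                    self.free_runs[i] = (start + need_size, length - need_size)
                return start
        return None  # caller may trigger defragment() and retry

    def free(self, start: int, length: int):
        # O(1): just append; adjacency is resolved lazily.
        self.free_runs.append((start, length))

    def defragment(self):
        # Sort and merge adjacent runs only here, amortizing the cost
        # across many alloc/free calls.
        self.free_runs.sort()
        merged = []
        for start, length in self.free_runs:
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1][1] += length
            else:
                merged.append([start, length])
        self.free_runs = [(s, l) for s, l in merged]
```

For example, after `alloc(4)` followed by `free(0, 4)`, a request for all 10 pages fails until `defragment()` coalesces the two runs back into one.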

if self.token_usage() > 0.9:
    return False

self.evict(self.evictable_size())
Collaborator

does can_unmap need to evict and merge_and_sort, since both seem to be time-consuming?

Collaborator Author

Now can_unmap skips eviction and merge_and_sort, using an unused_pages tensor to track consecutive tail pages
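The cheap `can_unmap` check can be illustrated as follows, under the assumed semantics that the pool only unmaps physical pages from its tail, so it suffices to maintain a count of consecutive unused tail pages. This is a sketch with hypothetical names (`TailTracker`), not the PR's actual tensor-based bookkeeping:

```python
# Sketch: O(1) can_unmap by tracking consecutive unused pages at the
# pool tail, instead of evicting and re-sorting the free list on every
# check. Illustrative only; the PR tracks this with an unused_pages tensor.

class TailTracker:
    def __init__(self, num_pages: int):
        self.used = [False] * num_pages   # per-page usage flags
        self.free_tail = num_pages        # unused pages at the tail

    def mark_used(self, page: int):
        self.used[page] = True
        if page >= len(self.used) - self.free_tail:
            # The used page falls inside the tracked tail run: recount.
            self._recount()

    def mark_free(self, page: int):
        self.used[page] = False
        if page == len(self.used) - self.free_tail - 1:
            # Freed page is adjacent to the tail run: extend it.
            self._recount()

    def _recount(self):
        # Walk backwards from the end until the first used page.
        n = 0
        for u in reversed(self.used):
            if u:
                break
            n += 1
        self.free_tail = n

    def can_unmap(self, pages: int) -> bool:
        # Constant-time check against the maintained counter.
        return pages <= self.free_tail
```

The recount is only triggered when the tail run itself changes; a plain `can_unmap` call touches no page metadata at all.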

Use oversubscribe instead of expand

Implement elastic memory pool for KV cache

Implement elastic memory pool allocator

ElasticMempoolOrchestrator

Fix resizing timing of elastic mempool during prefill batch creation

Fix can_unmap

Simplify reduction

Enhance elastic memory management with free_all, improved token tracking, and optimized orchestration

Add CUDA synchronization in orchestrator resize operations

Clean code
Comment thread python/sglang/srt/mem_cache/elastic/elasticmem_orchestrator.py
Comment thread python/sglang/srt/mem_cache/elastic/elasticmem_orchestrator.py Outdated
Comment thread python/sglang/srt/mem_cache/elastic/elasticmem_orchestrator.py Outdated
@hanming-lu
Collaborator

hanming-lu commented Dec 13, 2025

Nice. I see how we try to improve max running batch size with this. In parallel, do we also target improving the prefix cache hit rate, by analyzing which of swa or full is causing cache misses?

@pansicheng
Collaborator Author

Nice. I see how we try to improve max running batch size with this. In parallel, do we also target improving the prefix cache hit rate, by analyzing which of swa or full is causing cache misses?

@hanming-lu No problem. The current PR focuses on balancing pool usage to maximize batch size when some pools are near capacity. Next, we’ll monitor cache hit rates per pool and optimize scaling strategies to boost hit rates under balanced loads. Metrics and adaptive scaling will need further design — let’s tackle this next!

YAMY1234 added a commit to YAMY1234/sglang that referenced this pull request Mar 30, 2026
After merging upstream main into the PR sgl-project#13581 branch, several
compatibility issues arose due to SWA code being refactored from
memory_pool.py to swa_memory_pool.py:

- Add page_size parameter to SWATokenToKVPoolAllocator in allocator.py
- Fix elastic_allocator.py to import SWATokenToKVPoolAllocator from
  swa_memory_pool instead of allocator (fixes isinstance check in
  SWARadixCache)
- Rewrite ElasticSWATokenToKVPoolAllocator to replace parent allocators
  post-init instead of overriding _create_allocator (which parent no
  longer calls)
- Rewrite ElasticSWAKVPool to pass ElasticMHATokenToKVPool as pool
  class and recreate pools with pool_name parameter
- Fix isinstance check in model_runner_kv_cache_mixin.py (use isinstance
  instead of __class__ ==)
- Add missing get_float_env_var utility function to utils/common.py

Made-with: Cursor
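The `isinstance` fix in `model_runner_kv_cache_mixin.py` mentioned above matters because an exact `__class__ ==` comparison rejects subclasses, which breaks once elastic pools subclass the SWA allocator. A minimal illustration (the classes below are placeholders, not the real sglang definitions):

```python
# Why isinstance is preferred over `__class__ ==` for type checks:
# an exact-class comparison fails for subclasses, such as an elastic
# allocator deriving from the base SWA allocator. Placeholder classes.

class SWATokenToKVPoolAllocator:
    pass

class ElasticSWATokenToKVPoolAllocator(SWATokenToKVPoolAllocator):
    pass

alloc = ElasticSWATokenToKVPoolAllocator()

# Exact-class check: False for the elastic subclass.
print(alloc.__class__ == SWATokenToKVPoolAllocator)   # False

# isinstance: True for the base class and any subclass of it.
print(isinstance(alloc, SWATokenToKVPoolAllocator))   # True
```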
@ZelinMa557

hi, will this feature support GDN models and mamba models?

@pansicheng
Collaborator Author

hi, will this feature support GDN models and mamba models?

There’s a PR for Qwen3-Next support here: #14597. I’ll try to move it forward as soon as possible.

@ZelinMa557

hi, will this feature support GDN models and mamba models?

There’s a PR for Qwen3-Next support here: #14597. I’ll try to move it forward as soon as possible.

Thanks for your reply. I'm very interested in supporting a dynamic memory pool for mamba/GDN models — is there anything I can help with?
