[Roadmap] Distributed Serving Enhancement on 2025 H2

Roadmap for Distributed Serving Enhancements in 2025 H2
1. P/D Disaggregated Serving @ShangmingCai
* [x] Implement P/D disaggregation without GPUDirect RDMA by leveraging CPU buffer copy. @alogfans https://github.com/kvcache-ai/Mooncake/pull/702 (usage: `export MC_FORCE_TCP=True`)
* [x] Support Mooncake P/D Disaggregation with PP @ShangmingCai @ssssnow #8846 https://github.com/sgl-project/sglang/issues/11857
* [ ] Minimize reconfiguration overhead during P/D scaling, role transitions, and topology changes. @hzh0425  @LLLL114 https://github.com/sgl-project/sglang/pull/9325
* [x] OME integration (auto config) @slin1237 
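
For the TCP fallback item above, a minimal sketch of setting `MC_FORCE_TCP` from a Python launcher instead of a shell `export`. Only the variable name and value come from the usage note; the helper name `mooncake_tcp_env` is hypothetical, and how the server process is actually started is left to the caller.

```python
import os


def mooncake_tcp_env(base_env=None):
    """Return a copy of the environment with MC_FORCE_TCP=True set.

    Hypothetical helper: mirrors `export MC_FORCE_TCP=True` without
    mutating the caller's environment mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MC_FORCE_TCP"] = "True"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen` or `subprocess.run` when launching the prefill and decode processes.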

2. Global KVCache Pool @ykwd  @huangtingwei9988

* [x] Enable global KVCache sharing via Mooncake store with a pluggable backend interface. (Based on https://github.com/sgl-project/sglang/pull/7704)
	- [x] Initial integration @huangtingwei9988  @zhangzuo21 https://github.com/sgl-project/sglang/pull/7211
	- [x] Fine-grained prefetch
	- [x] Advanced prefetch policy
* [x] Integrate global KV cache sharing with P/D disaggregation. @ShangmingCai @hzh0425 
	- [x] KV Cache sharing among Prefill instances https://github.com/sgl-project/sglang/pull/8516
	- [x] Decode instances asynchronously contribute KV cache to reduce TTFT in multi-turn conversations https://github.com/sgl-project/sglang/pull/10192
* [x] Support KVCache-aware request routing with global cache integration. @slin1237 @ShangmingCai 
* [x] 3FS Mini Manager: enabling cross-machine reuse capability for 3FS @hzh0425 @pansicheng 
* [ ] 3FS Operator: providing one-click deployment for 3FS. (OME Operator repository @pansicheng @hzh0425 @WANNA959)

3. RAS (Reliability, Availability, Serviceability)

* [x] Implement P/D health monitoring and fast reconfiguration leveraging disaggregated architecture. @ShangmingCai @whybeyoung 
	- [x] Abort requests gracefully in P/D mode when the client actively kills or disconnects the HTTP request #8177 #8352
	- [x] Support health check based on /health_generate https://github.com/sgl-project/sglang/issues/8444  @whybeyoung 
	- [x] Address Tokenizer Manager bottleneck issue #8964 @whybeyoung @LLLL114 
	- [ ] Address Detokenizer Manager bottleneck issue https://github.com/sgl-project/sglang/pull/9970/ @whybeyoung @LLLL114 
	- [x] Fast recovery policy support (Done in SGLang Model Gateway @slin1237 )
* [x] Introduce Elastic EP and cooperate with EPLB to tolerate partial GPU failures during inference. @UNIDY2002 @HanHan009527 https://github.com/sgl-project/sglang/pull/10423 https://github.com/sgl-project/sglang/issues/10606
* [x] Implement fine-grained profiling for PD with EP/DP/PP. #8965 #9962 #10804  @sufeng-buaa
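
For the `/health_generate` health-check item above, a minimal sketch of how an external monitor might probe prefill and decode instances. It assumes the endpoint answers a plain GET with HTTP 200 when the instance is healthy; the function names `is_healthy` and `partition_instances` are hypothetical and are not part of SGLang.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=5.0):
    """Return True if GET <base_url>/health_generate answers HTTP 200.

    Assumption: a non-200 status, a refused connection, or a timeout
    all count as unhealthy.
    """
    url = base_url.rstrip("/") + "/health_generate"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def partition_instances(base_urls, timeout=5.0):
    """Split instance URLs into (healthy, unhealthy) lists, preserving order."""
    healthy, unhealthy = [], []
    for url in base_urls:
        (healthy if is_healthy(url, timeout) else unhealthy).append(url)
    return healthy, unhealthy
```

A router or gateway could call `partition_instances` periodically and stop sending traffic to the unhealthy list; retry and recovery policy are deliberately left out of this sketch.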

[Roadmap] Distributed Serving Enhancement on 2025 H2 #8210

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Roadmap] Distributed Serving Enhancement on 2025 H2 #8210

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions