Roadmap of Distributed Serving Enhancement on 2025 H2

1. P/D Disaggregated Serving @ShangmingCai

* [x] Implement P/D disaggregation without GPUDirect RDMA by leveraging CPU buffer copy. @alogfans https://github.com/kvcache-ai/Mooncake/pull/702 (usage: `export MC_FORCE_TCP=True`)
* [x] Support Mooncake P/D disaggregation with PP. @ShangmingCai @ssssnow #8846 https://github.com/sgl-project/sglang/issues/11857
* [ ] Minimize reconfiguration overhead during P/D scaling, role transitions, and topology changes. @hzh0425 @LLLL114 https://github.com/sgl-project/sglang/pull/9325
* [x] OME integration (auto config). @slin1237

2. Global KVCache Pool @ykwd @huangtingwei9988

* [x] Enable global KVCache sharing via Mooncake store with a pluggable backend interface. (Based on https://github.com/sgl-project/sglang/pull/7704)
  - [x] Initial integration @huangtingwei9988 @zhangzuo21 https://github.com/sgl-project/sglang/pull/7211
  - [x] Fine-grained prefetch
  - [x] Advanced prefetch policy
* [x] Integrate global KV cache sharing with P/D disaggregation. @ShangmingCai @hzh0425
  - [x] KV cache sharing among Prefill instances https://github.com/sgl-project/sglang/pull/8516
  - [x] Decode instances asynchronously contribute KV cache to reduce TTFT in multi-turn conversations https://github.com/sgl-project/sglang/pull/10192
* [x] Support KVCache-aware request routing with global cache integration. @slin1237 @ShangmingCai
* [x] 3FS Mini Manager: enabling cross-machine reuse capability for 3FS. @hzh0425 @pansicheng
* [ ] 3FS Operator: providing one-click deployment for 3FS. (OME Operator repository @pansicheng @hzh0425 @WANNA959)

3. RAS (Reliability, Availability, Serviceability)

* [x] Implement P/D health monitoring and fast reconfiguration leveraging disaggregated architecture. @ShangmingCai @whybeyoung
  - [x] Abort requests gracefully in P/D mode when the client actively kills or disconnects the HTTP request #8177 #8352
  - [x] Support health check based on /health_generate https://github.com/sgl-project/sglang/issues/8444 @whybeyoung
  - [x] Address Tokenizer Manager bottleneck issue #8964 @whybeyoung @LLLL114
  - [ ] Address Detokenizer Manager bottleneck issue https://github.com/sgl-project/sglang/pull/9970/ @whybeyoung @LLLL114
  - [x] Fast recovery policy support (done in SGLang Model Gateway @slin1237)
* [x] Introduce Elastic EP and cooperate with EPLB to tolerate partial GPU failures during inference. @UNIDY2002 @HanHan009527 https://github.com/sgl-project/sglang/pull/10423 https://github.com/sgl-project/sglang/issues/10606
* [x] Implement fine-grained profiling for PD with EP/DP/PP. #8965 #9962 #10804 @sufeng-buaa