Motivation
Overview
Merge Prefill-Decode Load Balancer (PDLB) functionality into SGLang Router to support both traditional load balancing and prefill-decode disaggregated routing.
Key Insight: Since PDLB has few or no users, we can implement the optimal design directly, without a migration path.
System Architecture
graph TB
subgraph "Unified SGLang Router"
A[Router Core] --> B{Policy Detection}
B --> C[Regular Router]
B --> D[PD Router]
C --> C1[RoundRobin]
C --> C2[Random]
C --> C3[CacheAware]
D --> D1[PD Random]
D --> D2[PD PowerOfTwo]
D --> D3[PD CacheAware]
D3 --> E[Tree-Based Selection]
E --> E1[Prefill Tree]
E --> E2[Load Tracking]
end
subgraph "Worker Infrastructure"
F[Regular Workers]
G[Prefill Workers]
H[Decode Workers]
end
C --> F
D --> G
D --> H
style D3 fill:#2E8B57,color:#fff
style E fill:#DAA520,color:#fff
style A fill:#1E90FF,color:#fff
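The "PD PowerOfTwo" node in the diagram above can be illustrated with a minimal sketch of power-of-two-choices selection. It assumes each worker exposes an in-flight request count as its load signal; the `Worker` class, `pick_power_of_two`, and `select_pd_pair` are hypothetical names for illustration, not the router's actual types.

```python
import random
from dataclasses import dataclass


@dataclass
class Worker:
    url: str
    load: int = 0  # assumed load signal: number of in-flight requests


def pick_power_of_two(workers, rng=random):
    """Sample two distinct workers and return the less loaded one."""
    if not workers:
        raise ValueError("no workers available")
    if len(workers) == 1:
        return workers[0]
    a, b = rng.sample(workers, 2)
    return a if a.load <= b.load else b


def select_pd_pair(prefill, decode, rng=random):
    """Pick one prefill and one decode worker, each chosen independently."""
    return pick_power_of_two(prefill, rng), pick_power_of_two(decode, rng)
```

Sampling only two candidates keeps selection cheap while still steering traffic away from the busiest workers, which is why it is a common middle ground between random and least-loaded selection.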
Text extraction from PD requests for cache matching
Load balancing fallback when system is imbalanced
PD-specific metrics and monitoring
graph TD
A[PD Request] --> B[Extract Text]
B --> C{Load Imbalanced?}
C -->|Yes - Imbalanced| D[Use Load Balancing]
D --> D1[Select Least Loaded Prefill]
D --> D2[Select PowerOfTwo Decode]
C -->|No - Balanced| E[Use Cache-Aware]
E --> F[Tree Prefix Match]
F --> G{Match Rate > Threshold?}
G -->|Yes - Cache Hit| H[Route to Matched Worker]
G -->|No - Cache Miss| I[Route to Smallest Tree Worker]
H --> J[Update Tree & Load Tracking]
I --> J
D1 --> J
D2 --> K[Send Requests]
J --> K
style H fill:#2E8B57,color:#fff
style I fill:#B22222,color:#fff
style D1 fill:#1E90FF,color:#fff
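The prefill side of the flow above can be sketched roughly as follows. This is a simplified illustration, not the proposed Rust implementation: a plain list of previously seen prompts stands in for the prefix tree, and the imbalance test (`abs_gap`, `rel_ratio`) and all class and function names are assumptions made for the example. Only the 0.6 cache threshold comes from the proposal's CLI example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class PrefillWorker:
    url: str
    load: int = 0
    seen: list = field(default_factory=list)  # stand-in for a prefix tree

    def match_rate(self, text):
        """Longest prefix shared with any cached prompt, as a fraction of the text."""
        if not text:
            return 0.0
        best = max((len(os.path.commonprefix([text, s])) for s in self.seen), default=0)
        return best / len(text)

    def cache_size(self):
        """Total cached characters; approximates the size of the worker's tree."""
        return sum(len(s) for s in self.seen)


def is_imbalanced(workers, abs_gap=4, rel_ratio=1.5):
    """Hypothetical imbalance test: large absolute and relative load gap."""
    loads = [w.load for w in workers]
    hi, lo = max(loads), min(loads)
    return (hi - lo) > abs_gap and hi > rel_ratio * lo


def route_prefill(workers, text, cache_threshold=0.6):
    """Return (worker, reason) following the load-balance / cache-aware split."""
    if is_imbalanced(workers):
        chosen, reason = min(workers, key=lambda w: w.load), "load_balance"
    else:
        best = max(workers, key=lambda w: w.match_rate(text))
        if best.match_rate(text) > cache_threshold:
            chosen, reason = best, "cache_hit"
        else:
            # Cache miss: send to the worker with the smallest cache.
            chosen, reason = min(workers, key=lambda w: w.cache_size()), "cache_miss"
    # Every path updates the cache record and load tracking.
    chosen.seen.append(text)
    chosen.load += 1
    return chosen, reason
```

Decode selection is omitted here; per the diagram it would use a PowerOfTwo choice on the imbalanced path.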
Implementation Phases
Phase 1A: Extract PDLB Components
pd_types.rs with essential PDLB types
EngineInfo, Bootstrap trait, SingleOrBatch<T>
PDSelectionPolicy enum (Random, PowerOfTwo, CacheAware)
Phase 1B: Core PD Router Extension
PrefillDecode variant
PrefillDecodeConfig to PolicyConfig
Router.new_pd() constructor
Phase 2: Bootstrap & Dual Dispatch
(Bytes → Box<dyn Bootstrap>)
Phase 3: Cache-Aware PD Implementation
Phase 4: Testing & Polish
Status:
Python API Design
Desired Implementation
Command Line Interface
# PD mode with cache-aware routing
python -m sglang_router.launch_router \
  --policy prefill_decode \
  --prefill-urls http://prefill1:8080:9000 http://prefill2:8080 \
  --decode-urls http://decode1:8081 http://decode2:8081 \
  --pd-policy cache_aware \
  --cache-threshold 0.6 \
  --host 0.0.0.0 \
  --port 8080
Key Challenges
1. Request Processing Paradigm Shift
Bytes → Single worker selection
2. Bootstrap Injection Complexity
SingleOrBatch<T> for batch requests
3. Cache-Aware PD Routing
Success Metrics