This proposal outlines an architectural improvement plan for the SGLang Router, a high-performance load balancer that supports both traditional and disaggregated (Prefill-Decode) routing modes. The improvements focus on enhancing maintainability and extensibility without disrupting existing functionality. These changes lay the foundation for a long-term transformation where sgl-router evolves from a simple proxy into a full-featured OpenAI API server with native tool calling, session management, and direct gRPC communication with SGLang's backend services.
Current Architecture Overview
The SGLang Router currently operates as an HTTP proxy that distributes requests across multiple SGLang server instances. It supports both regular routing mode and prefill-decode (PD) disaggregated routing mode, with multiple load balancing policies including random, round-robin, cache-aware, and power-of-two selection. The implementation consists of several large monolithic files that mix concerns and make maintenance challenging. (See Appendix for detailed architecture diagrams)
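Of the policies listed above, power-of-two selection is the easiest to show in isolation. The following is a minimal sketch, not the router's actual code: the function name `power_of_two_pick` and the `loads` slice (a per-worker count of in-flight requests) are assumptions made for illustration, and the two candidate indices are expected to be drawn at random by the caller.

```rust
/// Power-of-two-choices: compare the in-flight load of two candidate workers
/// and return the index of the less loaded one (ties go to the first candidate).
/// `loads[i]` is a hypothetical per-worker counter of in-flight requests;
/// `a` and `b` are candidate indices drawn at random by the caller.
pub fn power_of_two_pick(loads: &[usize], a: usize, b: usize) -> Option<usize> {
    // `get` returns None for an out-of-range index, which `?` propagates.
    let load_a = *loads.get(a)?;
    let load_b = *loads.get(b)?;
    Some(if load_b < load_a { b } else { a })
}
```

Sampling only two workers keeps the selection cost constant regardless of pool size while still steering traffic away from the busiest instances.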
System Components
1. Entry Point (lib.rs)
The main entry point provides Python bindings through PyO3:
```rust
#[pyclass]
struct Router {
    // Configuration
    host: String,
    port: u16,
    worker_urls: Vec<String>,
    policy: PolicyType,

    // PD Mode specific
    pd_disaggregation: bool,
    prefill_urls: Option<Vec<(String, Option<u16>)>>,
    decode_urls: Option<Vec<String>>,

    // Policy parameters
    cache_threshold: f32,
    balance_abs_threshold: usize,
    balance_rel_threshold: f32,
    // ... more fields
}
```
2. HTTP Server (server.rs)
Actix-web based server exposing multiple endpoints:
graph LR
subgraph "API Endpoints"
subgraph "OpenAI API"
CC["/v1/chat/completions"]
CO["/v1/completions"]
GE["/generate"]
end
subgraph "Management"
AW["/add_worker"]
RW["/remove_worker"]
LW["/list_workers"]
end
subgraph "Monitoring"
HE["/health"]
GL["/get_loads"]
SI["/get_server_info"]
end
end
subgraph "Request Processing"
RP["Request Parser"]
RA["Request Adapter"]
RO["Router Selection"]
end
subgraph "Response Handling"
ST["Streaming\n(SSE)"]
JS["JSON\nResponse"]
ER["Error\nHandler"]
end
%% Flow connections
CC --> RP
CO --> RP
GE --> RP
RP --> RA
RA --> RO
RO --> ST
RO --> JS
RO --> ER
3. Router Implementation (router.rs)
The router is implemented as an enum with four variants:
classDiagram
class Router {
<<enumeration>>
Random
RoundRobin
CacheAware
PrefillDecode
}
class Random {
-worker_urls: Arc~RwLock~Vec~String~~~
-timeout_secs: u64
-interval_secs: u64
+route(request) HttpResponse
+add_worker(url) Result
+remove_worker(url) Result
}
class RoundRobin {
-worker_urls: Arc~RwLock~Vec~String~~~
-current_index: AtomicUsize
-timeout_secs: u64
+route(request) HttpResponse
+get_next_worker() String
}
class CacheAware {
-worker_urls: Arc~RwLock~Vec~String~~~
-tree_map: Arc~DashMap~String, Tree~~
-running_queue: Arc~Mutex~HashMap~String, usize~~~
-config: CacheAwareConfig
+route(request) HttpResponse
+select_by_cache(text) String
+is_load_balanced() bool
}
class PrefillDecode {
-pd_router: Arc~PDRouter~
+route(request) HttpResponse
+forward_to_pd() HttpResponse
}
Router <|-- Random
Router <|-- RoundRobin
Router <|-- CacheAware
Router <|-- PrefillDecode
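The `is_load_balanced()` method on `CacheAware`, together with the `balance_abs_threshold` and `balance_rel_threshold` parameters in the `Router` struct, suggests a check along the following lines. This is a sketch under a stated assumption: the pool is treated as imbalanced only when the gap between the busiest and idlest worker exceeds the absolute threshold and the busiest worker also exceeds the idlest by the relative factor. The `BalanceConfig` struct and the exact comparison are illustrative, not taken from the router source.

```rust
/// Hypothetical container for the two balance thresholds.
pub struct BalanceConfig {
    pub balance_abs_threshold: usize,
    pub balance_rel_threshold: f32,
}

/// Returns true when the load spread is small enough that cache-aware
/// routing can be used instead of falling back to least-loaded selection.
/// ASSUMPTION: imbalance requires both thresholds to be exceeded.
pub fn is_load_balanced(loads: &[usize], cfg: &BalanceConfig) -> bool {
    let (min, max) = match (loads.iter().min(), loads.iter().max()) {
        (Some(&min), Some(&max)) => (min, max),
        // An empty pool has nothing to balance.
        _ => return true,
    };
    let abs_exceeded = max - min > cfg.balance_abs_threshold;
    let rel_exceeded = (max as f32) > (min as f32) * cfg.balance_rel_threshold;
    !(abs_exceeded && rel_exceeded)
}
```

Requiring both conditions avoids flagging imbalance when loads are uniformly high (large absolute gap, small ratio) or uniformly tiny (large ratio, small absolute gap).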
Example: URL validation in Python code, mode compatibility checks in server startup, policy parameter validation in individual routers
Proposed Improvements
The following improvements are designed to address immediate pain points while laying the groundwork for our long-term vision of transforming sgl-router into a full OpenAI API server. Each phase builds capabilities that serve both current needs and future evolution.
Proposed Project Structure
The refactored codebase will reorganize existing files into focused modules:
This abstraction is crucial for the long-term vision, as it allows the router to treat both traditional HTTP endpoints and future gRPC connections uniformly.
Replace enum with trait-based design, enabling future dual-mode operation:
```rust
pub trait Router: Send + Sync {
    async fn route(&self, req: HttpRequest, body: Value, route: &str) -> HttpResponse;
    async fn add_worker(&self, worker: Arc<dyn Worker>) -> Result<(), RouterError>;
    async fn remove_worker(&self, url: &str) -> Result<(), RouterError>;
    fn apply_discovery_update(&self, update: DiscoveryUpdate);
}

pub struct RouterFactory;

impl RouterFactory {
    pub async fn create_router(config: &RouterConfig) -> Result<Arc<dyn Router>, RouterError>;
    // Future: create_api_server(config) for full OpenAI API mode
}
```
This factory pattern is essential for supporting both traditional proxy mode and future API server mode, allowing runtime selection based on configuration.
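To make the runtime selection concrete, here is a simplified, synchronous sketch of a factory that returns a trait object based on configuration. It is not the proposed implementation: the `Mode` enum, the reduced `RouterConfig`, the `name`/`worker_count` methods, and the `String` error type are all assumptions chosen to keep the example self-contained.

```rust
use std::sync::Arc;

/// Hypothetical operating mode taken from configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Regular,
    PrefillDecode,
}

/// Reduced, illustrative configuration.
pub struct RouterConfig {
    pub mode: Mode,
    pub worker_urls: Vec<String>,
}

/// Simplified stand-in for the proposal's async `Router` trait.
pub trait Router: Send + Sync {
    fn name(&self) -> &'static str;
    fn worker_count(&self) -> usize;
}

struct RegularRouter {
    workers: Vec<String>,
}

struct PdRouter {
    workers: Vec<String>,
}

impl Router for RegularRouter {
    fn name(&self) -> &'static str {
        "regular"
    }
    fn worker_count(&self) -> usize {
        self.workers.len()
    }
}

impl Router for PdRouter {
    fn name(&self) -> &'static str {
        "prefill-decode"
    }
    fn worker_count(&self) -> usize {
        self.workers.len()
    }
}

pub struct RouterFactory;

impl RouterFactory {
    /// Picks the concrete router at runtime; callers only see `dyn Router`.
    pub fn create_router(config: &RouterConfig) -> Result<Arc<dyn Router>, String> {
        if config.worker_urls.is_empty() {
            return Err("at least one worker URL is required".to_string());
        }
        let workers = config.worker_urls.clone();
        let router: Arc<dyn Router> = match config.mode {
            Mode::Regular => Arc::new(RegularRouter { workers }),
            Mode::PrefillDecode => Arc::new(PdRouter { workers }),
        };
        Ok(router)
    }
}
```

Because callers hold only an `Arc<dyn Router>`, a later API server mode could be added as one more match arm without touching call sites.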
Long-Term Vision
From Load Balancer to Full OpenAI API Server
The architectural improvements proposed in this document are designed with a transformative long-term vision: evolving sgl-router from a simple HTTP proxy into a fully-featured OpenAI-compatible API server that directly integrates with SGLang's backend services.
Target Capabilities
Dual Operating Modes
Traditional Router Mode: Continue supporting the current proxy behavior for backward compatibility
API Server Mode: Full OpenAI API implementation with advanced features
Native OpenAI API Implementation
Streaming response support with proper SSE formatting
Error handling matching OpenAI's API behavior
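For reference, SSE framing itself is small: each event is one or more `data:` lines terminated by a blank line, and OpenAI-style streams end with a `data: [DONE]` sentinel. The helpers below are an illustrative sketch; the names `sse_data_event` and `sse_done` are assumptions, not existing router functions.

```rust
/// Frame a payload as a Server-Sent Events `data` event.
/// Every payload line gets its own `data:` field, and the event is
/// terminated by a blank line, as the SSE wire format requires.
pub fn sse_data_event(payload: &str) -> String {
    let mut out = String::new();
    for line in payload.lines() {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}

/// Terminal sentinel used by OpenAI-style streaming responses.
pub fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```

A streamed chat completion would then be a sequence of `sse_data_event(chunk_json)` frames followed by `sse_done()`.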
Tool Calling Framework
Native support for function/tool calling without relying on backend servers
Extensible executor system (HTTP, Python, Shell, custom integrations)
Tool result integration directly in the conversation flow
Security sandboxing and permission management
Direct gRPC Communication
Replace HTTP forwarding with efficient gRPC calls to SGLang's scheduler
Connection pooling and load balancing
Streaming support for real-time token generation
Reduced latency through protocol optimization
Implementation Phases
Detailed Timeline
gantt
title SGLang Router Improvement Timeline
dateFormat YYYY-MM-DD
section Phase 1
Configuration Module :t1, 2025-06-26, 5d
Worker Abstraction :t2, after t1, 6d
RoutingPolicy Trait :t3, after t2, 7d
Policy Migration :t4, after t3, 6d
section Phase 2
Centralized Observability :t5, after t4, 4d
Enhanced Service Discovery :t6, after t4, 6d
section Phase 3
Router Factory :t7, after t6, 7d
section Testing
Integration Testing :t8, after t7, 5d
Performance Validation :t9, after t8, 3d
Documentation :t10, after t8, 3d
Risk Analysis
Technical Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Performance Regression | High | Medium | Continuous benchmarking, profiling |
| Breaking Changes | High | Low | Feature flags, gradual rollout |
| Memory Leaks | Medium | Low | Stress testing, leak detection |
| Thread Safety Issues | High | Medium | Race condition testing, careful review |
Conclusion
This comprehensive improvement plan addresses fundamental architectural issues while maintaining system stability. The phased approach ensures each improvement builds on the previous, creating a more maintainable, extensible, and reliable routing system for SGLang.
Appendix: Architecture Diagrams
High-Level Architecture
graph TB
subgraph "Client Layer"
PY[Python Client<br/>SGLang]
HTTP[HTTP Client<br/>OpenAI Compatible]
end
subgraph "Router Layer"
R[Router<br/>lib.rs/PyO3]
S[HTTP Server<br/>server.rs]
subgraph "Routing Modes"
REG[Regular Router<br/>router.rs]
PD[PD Router<br/>pd_router.rs]
end
subgraph "Routing Policies"
RND[Random]
RR[RoundRobin]
CA[CacheAware<br/>+ Tree]
P2[PowerOfTwo]
end
end
subgraph "Infrastructure"
SD[Service Discovery<br/>K8s Integration]
PROM[Prometheus<br/>Metrics]
LOG[Logging<br/>tracing]
end
subgraph "Worker Layer"
subgraph "Regular Workers"
W1[Worker 1]
W2[Worker 2]
WN[Worker N]
end
subgraph "PD Workers"
PF1[Prefill 1]
PF2[Prefill 2]
D1[Decode 1]
D2[Decode 2]
end
end
PY --> R
HTTP --> S
R --> S
S --> REG
S --> PD
REG --> RND
REG --> RR
REG --> CA
PD --> RND
PD --> P2
PD --> CA
REG --> W1
REG --> W2
REG --> WN
PD --> PF1
PD --> PF2
PD --> D1
PD --> D2
SD --> REG
SD --> PD
S --> PROM
S --> LOG
Component Interactions
sequenceDiagram
participant C as Client
participant S as Server
participant R as Router
participant P as Policy
participant W as Worker
participant SD as ServiceDiscovery
participant M as Metrics
Note over SD: Continuous Discovery
SD->>R: Update Workers
C->>S: HTTP Request
S->>S: Parse & Validate
S->>R: Route Request
R->>P: Select Worker(s)
alt Regular Mode
P->>P: Apply Policy Logic
P-->>R: Selected Worker
R->>W: Forward Request
W-->>R: Response
else PD Mode
P->>P: Select Prefill & Decode
P-->>R: Worker Pair
par Prefill Request
R->>W: Prefill Request
and Decode Request
R->>W: Decode Request
end
W-->>R: Merged Response
end
R-->>S: Response
S-->>C: HTTP Response
R->>M: Record Metrics
Note over R,W: Health Checks
loop Every 30s
R->>W: Health Check
W-->>R: Status
R->>M: Update Health
end
SGLang Router Architecture Improvement Proposal
4. Cache-Aware Algorithm Detail
flowchart TD
Start([Request Arrives]) --> Extract[Extract Text from Request]
Extract --> CheckBalance{System<br/>Load Balanced?}
CheckBalance -->|Yes| TreeLookup[Lookup in Radix Trees]
CheckBalance -->|No| LoadBalance[Select Least Loaded]
TreeLookup --> FindMatch[Find Best Prefix Match]
FindMatch --> CheckThreshold{Match Rate ><br/>Threshold?}
CheckThreshold -->|Yes| SelectCache[Select Worker<br/>with Best Match]
CheckThreshold -->|No| SelectSmallest[Select Worker with<br/>Smallest Tree]
SelectCache --> UpdateTree
SelectSmallest --> UpdateTree
LoadBalance --> UpdateTree[Update Tree<br/>with Request]
UpdateTree --> Forward[Forward Request]
Forward --> UpdateLoad[Update Load Counter]
UpdateLoad --> End([Return Response])
5. PD Router Architecture (pd_router.rs)
graph TB
subgraph "PD Router Components"
PDR[PD Router]
subgraph "Worker Pools"
PFP[Prefill Pool<br/>RwLock Vec]
DCP[Decode Pool<br/>RwLock Vec]
end
subgraph "Selection Policies"
PRND[Random Selection]
PP2[Power of Two]
PCA[Cache Aware]
end
subgraph "Request Processing"
BSI[Bootstrap Injection]
PAR[Parallel Dispatch]
LPM[Logprob Merger]
end
subgraph "Load Tracking"
PLT[Prefill Load Tracker]
DLT[Decode Load Tracker]
end
end
PDR --> PFP
PDR --> DCP
PDR --> PRND
PDR --> PP2
PDR --> PCA
PRND --> BSI
PP2 --> BSI
PCA --> BSI
BSI --> PAR
PAR --> LPM
PFP --> PLT
DCP --> DLT
6. Service Discovery (service_discovery.rs)
stateDiagram-v2
[*] --> Initializing
Initializing --> Watching: K8s Client Ready
Watching --> Discovering: Timer Tick
Discovering --> Processing: Pods Found
Processing --> Filtering: Apply Selectors
Filtering --> HealthCheck: Valid Pods
HealthCheck --> UpdateWorkers: All Healthy
HealthCheck --> PartialUpdate: Some Healthy
HealthCheck --> Retry: All Failed
UpdateWorkers --> Watching: Success
PartialUpdate --> Watching: Partial Success
Retry --> Discovering: Backoff Wait
Watching --> Error: K8s API Error
Error --> Retry: Exponential Backoff
note right of HealthCheck
Concurrent health checks
with timeout protection
end note
note right of UpdateWorkers
Atomic worker list update
Triggers router refresh
end note
Request Flow Analysis
Regular Mode Request Flow
flowchart LR
subgraph "1. Request Receipt"
REQ[HTTP Request] --> PARSE[Parse JSON]
PARSE --> ADAPT[Adapt to Internal Format]
end
subgraph "2. Routing Decision"
ADAPT --> POLICY{Routing Policy}
POLICY -->|Random| RND_LOGIC[Random Selection]
POLICY -->|RoundRobin| RR_LOGIC[Sequential Selection]
POLICY -->|CacheAware| CA_LOGIC[Cache Analysis]
end
subgraph "3. Worker Selection"
RND_LOGIC --> HEALTH{Health Check}
RR_LOGIC --> HEALTH
CA_LOGIC --> HEALTH
HEALTH -->|Healthy| SELECT[Select Worker]
HEALTH -->|Unhealthy| RETRY[Try Next]
RETRY --> HEALTH
end
subgraph "4. Request Forwarding"
SELECT --> BUILD[Build HTTP Request]
BUILD --> SEND[Send to Worker]
SEND --> WAIT{Response Type}
WAIT -->|Stream| SSE[SSE Handler]
WAIT -->|JSON| JSON[JSON Handler]
end
subgraph "5. Response Processing"
SSE --> STREAM[Stream Response]
JSON --> RETURN[Return Response]
STREAM --> CLIENT[Client]
RETURN --> CLIENT
end
PD Mode Request Flow
flowchart TB
subgraph "1. Request Preparation"
REQ[Request] --> CHECK{Has Bootstrap?}
CHECK -->|No| FETCH[Fetch Bootstrap<br/>from Prefill]
CHECK -->|Yes| INJECT[Use Existing]
FETCH --> INJECT
end
subgraph "2. Worker Selection"
INJECT --> SEL_PF[Select Prefill Worker]
INJECT --> SEL_DC[Select Decode Worker]
SEL_PF --> PF_POLICY{Policy}
SEL_DC --> DC_POLICY{Policy}
PF_POLICY -->|Random| PF_RND[Random Prefill]
PF_POLICY -->|P2| PF_P2[Power of Two Prefill]
DC_POLICY -->|Random| DC_RND[Random Decode]
DC_POLICY -->|P2| DC_P2[Power of Two Decode]
end
subgraph "3. Parallel Dispatch"
PF_RND --> PF_REQ[Prefill Request]
PF_P2 --> PF_REQ
DC_RND --> DC_REQ[Decode Request]
DC_P2 --> DC_REQ
PF_REQ --> PF_WAIT[Wait Prefill]
DC_REQ --> DC_WAIT[Wait Decode]
end
subgraph "4. Response Handling"
DC_WAIT --> CHECK_LP{Logprobs<br/>Requested?}
CHECK_LP -->|Yes| MERGE[Merge Logprobs]
CHECK_LP -->|No| RETURN[Return Decode Response]
PF_WAIT --> MERGE
MERGE --> RETURN
end
Identified Pain Points
1. Type Safety and State Management
Vec<String>)
2. Code Duplication
3. Limited Extensibility
4. Scattered Observability
5. Basic Service Discovery
6. PD Mode Limitations
/add_worker returns error for PD mode
7. Configuration Management
Proposed Improvements
The following improvements are designed to address immediate pain points while laying the groundwork for our long-term vision of transforming sgl-router into a full OpenAI API server. Each phase builds capabilities that serve both current needs and future evolution.
Proposed Project Structure
The refactored codebase will reorganize existing files into focused modules:
Note: pd_types.rs will be merged into pd_router.rs as those types are only used there.
Phase 1: Foundation & Core Abstractions (Weeks 1-3)
Task 001: Centralized Configuration
Create a comprehensive configuration module to eliminate scattered validation:
Implement validation with clear error messages:
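As a rough illustration of what centralized validation with clear error messages could look like, the sketch below checks worker URLs, PD mode requirements, and one policy parameter in a single place. Everything here is hypothetical: the `ConfigError` variants, the flattened `RouterConfig` fields, the URL scheme check, and the assumed `[0.0, 1.0]` range for `cache_threshold` are illustrative choices, not the proposal's definitions.

```rust
use std::fmt;

/// Hypothetical validation errors; each carries enough context for a clear message.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    MissingWorkers(&'static str),
    InvalidUrl(String),
    ThresholdOutOfRange { name: &'static str, value: f32 },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingWorkers(pool) => write!(f, "no {pool} URLs configured"),
            ConfigError::InvalidUrl(url) => {
                write!(f, "invalid worker URL '{url}': expected http:// or https://")
            }
            ConfigError::ThresholdOutOfRange { name, value } => {
                write!(f, "{name} must be within [0.0, 1.0], got {value}")
            }
        }
    }
}

/// Illustrative subset of the router configuration.
pub struct RouterConfig {
    pub worker_urls: Vec<String>,
    pub pd_disaggregation: bool,
    pub prefill_urls: Vec<String>,
    pub decode_urls: Vec<String>,
    pub cache_threshold: f32,
}

impl RouterConfig {
    /// Runs every check in one place and reports the first failure.
    pub fn validate(&self) -> Result<(), ConfigError> {
        // PD mode needs both pools; regular mode needs the plain worker list.
        let pools: Vec<(&'static str, &Vec<String>)> = if self.pd_disaggregation {
            vec![("prefill", &self.prefill_urls), ("decode", &self.decode_urls)]
        } else {
            vec![("worker", &self.worker_urls)]
        };
        for (name, urls) in pools {
            if urls.is_empty() {
                return Err(ConfigError::MissingWorkers(name));
            }
            for url in urls {
                if !(url.starts_with("http://") || url.starts_with("https://")) {
                    return Err(ConfigError::InvalidUrl(url.clone()));
                }
            }
        }
        // ASSUMPTION: cache_threshold is a match ratio and must lie in [0.0, 1.0].
        if !(0.0..=1.0).contains(&self.cache_threshold) {
            return Err(ConfigError::ThresholdOutOfRange {
                name: "cache_threshold",
                value: self.cache_threshold,
            });
        }
        Ok(())
    }
}
```

Calling `validate()` once at startup would let the Python bindings, the server, and the individual routers rely on an already-checked configuration.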
Task 002: Worker Abstraction
Transform workers from strings to typed entities, enabling future support for both HTTP endpoints and gRPC connections:
This abstraction is crucial for the long-term vision, as it allows the router to treat both traditional HTTP endpoints and future gRPC connections uniformly.
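One way such a typed worker could look is sketched below. The trait methods, the `WorkerKind` enum, and the `HttpWorker` struct are assumptions made for illustration; the proposal's actual `Worker` trait may differ. A future gRPC-backed worker would implement the same trait.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical role of a worker within the pool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerKind {
    Regular,
    Prefill,
    Decode,
}

/// Illustrative worker interface: identity, health, and load in one place.
pub trait Worker: Send + Sync {
    fn url(&self) -> &str;
    fn kind(&self) -> WorkerKind;
    fn is_healthy(&self) -> bool;
    fn set_healthy(&self, healthy: bool);
    fn load(&self) -> usize;
    fn begin_request(&self);
    fn end_request(&self);
}

/// HTTP-backed worker; state is atomic so it can be shared across threads.
pub struct HttpWorker {
    url: String,
    kind: WorkerKind,
    healthy: AtomicBool,
    load: AtomicUsize,
}

impl HttpWorker {
    pub fn new(url: impl Into<String>, kind: WorkerKind) -> Self {
        Self {
            url: url.into(),
            kind,
            healthy: AtomicBool::new(true),
            load: AtomicUsize::new(0),
        }
    }
}

impl Worker for HttpWorker {
    fn url(&self) -> &str {
        &self.url
    }
    fn kind(&self) -> WorkerKind {
        self.kind
    }
    fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::Relaxed)
    }
    fn set_healthy(&self, healthy: bool) {
        self.healthy.store(healthy, Ordering::Relaxed);
    }
    fn load(&self) -> usize {
        self.load.load(Ordering::Relaxed)
    }
    fn begin_request(&self) {
        self.load.fetch_add(1, Ordering::Relaxed);
    }
    fn end_request(&self) {
        // Saturating decrement so a stray extra call cannot underflow.
        let _ = self
            .load
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
                Some(n.saturating_sub(1))
            });
    }
}
```

Compared with a bare `Vec<String>`, this keeps health and load next to the URL they describe, so policies no longer need side tables keyed by string.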
Task 003: RoutingPolicy Trait
Unify routing algorithms:
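Purely as an illustration of the idea, a unified policy interface might look like the sketch below, with two policies implemented against it. The trait shape, the `WorkerInfo` snapshot struct, and the policy names are assumptions; the proposal's own `RoutingPolicy` definition may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical read-only view of a worker as seen by a policy.
pub struct WorkerInfo {
    pub url: String,
    pub healthy: bool,
    pub load: usize,
}

/// Illustrative policy interface: pick the index of a worker, if any is usable.
pub trait RoutingPolicy: Send + Sync {
    fn name(&self) -> &'static str;
    fn select(&self, workers: &[WorkerInfo]) -> Option<usize>;
}

/// Round-robin that skips unhealthy workers.
#[derive(Default)]
pub struct RoundRobinPolicy {
    next: AtomicUsize,
}

impl RoutingPolicy for RoundRobinPolicy {
    fn name(&self) -> &'static str {
        "round_robin"
    }
    fn select(&self, workers: &[WorkerInfo]) -> Option<usize> {
        let n = workers.len();
        if n == 0 {
            return None;
        }
        let start = self.next.fetch_add(1, Ordering::Relaxed);
        // Probe at most n slots, starting from the rotating cursor.
        (0..n)
            .map(|offset| start.wrapping_add(offset) % n)
            .find(|&i| workers[i].healthy)
    }
}

/// Picks the healthy worker with the fewest in-flight requests.
pub struct LeastLoadedPolicy;

impl RoutingPolicy for LeastLoadedPolicy {
    fn name(&self) -> &'static str {
        "least_loaded"
    }
    fn select(&self, workers: &[WorkerInfo]) -> Option<usize> {
        workers
            .iter()
            .enumerate()
            .filter(|(_, w)| w.healthy)
            .min_by_key(|(_, w)| w.load)
            .map(|(i, _)| i)
    }
}
```

With a shared trait, regular and PD routers can hold a `Box<dyn RoutingPolicy>` and reuse the same policy implementations instead of duplicating selection logic.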
Task 004: Policy Migration
Implement all policies using the new trait, enabling:
Phase 2: Infrastructure (Week 4)
Task 005: Centralized Observability
Consolidate metrics:
Task 006: Enhanced Service Discovery
Add resilience:
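The service discovery state machine in this document retries with exponential backoff after failures. A minimal sketch of that retry behavior follows; the function names, the doubling factor, and the injected `sleep` callback (used so the example stays synchronous and testable) are assumptions, not the router's actual discovery code.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed
/// (it is always called at least once). `sleep` is injected by the caller.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

In a real watcher, `op` would wrap the Kubernetes API call and `sleep` would be the runtime's timer; capping the delay keeps recovery time bounded after long outages.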
Phase 3: Architecture (Week 5)
Task 007: Router Factory
Replace enum with trait-based design, enabling future dual-mode operation:
This factory pattern is essential for supporting both traditional proxy mode and future API server mode, allowing runtime selection based on configuration.