
SGLang Router Architecture Improvement Proposal #7532

@slin1237


Table of Contents

  1. Summary
  2. Current Architecture Overview
  3. System Components
  4. Request Flow Analysis
  5. Identified Pain Points
  6. Proposed Improvements
  7. Long-Term Vision
  8. Implementation Phases
  9. Risk Analysis
  10. Success Metrics
  11. Conclusion
  12. Appendix: Architecture Diagrams

Summary

This proposal outlines an architectural improvement plan for the SGLang Router, a high-performance load balancer that supports both traditional and disaggregated (Prefill-Decode) routing modes. The improvements focus on enhancing maintainability and extensibility without disrupting existing functionality. These changes lay the foundation for a long-term transformation in which sgl-router evolves from a simple proxy into a full-featured OpenAI API server with native tool calling, session management, and direct gRPC communication with SGLang's backend services.

Current Architecture Overview

The SGLang Router currently operates as an HTTP proxy that distributes requests across multiple SGLang server instances. It supports both regular routing mode and prefill-decode (PD) disaggregated routing mode, with multiple load balancing policies including random, round-robin, cache-aware, and power-of-two selection. The implementation consists of several large monolithic files that mix concerns and make maintenance challenging. (See Appendix for detailed architecture diagrams)

System Components

1. Entry Point (lib.rs)

The main entry point provides Python bindings through PyO3:

#[pyclass]
struct Router {
    // Configuration
    host: String,
    port: u16,
    worker_urls: Vec<String>,
    policy: PolicyType,
    
    // PD Mode specific
    pd_disaggregation: bool,
    prefill_urls: Option<Vec<(String, Option<u16>)>>,
    decode_urls: Option<Vec<String>>,
    
    // Policy parameters
    cache_threshold: f32,
    balance_abs_threshold: usize,
    balance_rel_threshold: f32,
    // ... more fields
}

2. HTTP Server (server.rs)

Actix-web based server exposing multiple endpoints:

graph LR
    subgraph "API Endpoints"
        subgraph "OpenAI API"
            CC["/v1/chat/completions"]
            CO["/v1/completions"]
            GE["/generate"]
        end
        
        subgraph "Management"
            AW["/add_worker"]
            RW["/remove_worker"]
            LW["/list_workers"]
        end
        
        subgraph "Monitoring"
            HE["/health"]
            GL["/get_loads"]
            SI["/get_server_info"]
        end
    end
    
    subgraph "Request Processing"
        RP["Request Parser"]
        RA["Request Adapter"]
        RO["Router Selection"]
    end
    
    subgraph "Response Handling"
        ST["Streaming\n(SSE)"]
        JS["JSON\nResponse"]
        ER["Error\nHandler"]
    end
    
    %% Flow connections
    CC --> RP
    CO --> RP
    GE --> RP

    RP --> RA
    RA --> RO

    RO --> ST
    RO --> JS
    RO --> ER


3. Router Implementation (router.rs)

The router is implemented as an enum with four variants:

classDiagram
    class Router {
        <<enumeration>>
        Random
        RoundRobin
        CacheAware
        PrefillDecode
    }
    
    class Random {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -timeout_secs: u64
        -interval_secs: u64
        +route(request) HttpResponse
        +add_worker(url) Result
        +remove_worker(url) Result
    }
    
    class RoundRobin {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -current_index: AtomicUsize
        -timeout_secs: u64
        +route(request) HttpResponse
        +get_next_worker() String
    }
    
    class CacheAware {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -tree_map: Arc~DashMap~String, Tree~~
        -running_queue: Arc~Mutex~HashMap~String, usize~~~
        -config: CacheAwareConfig
        +route(request) HttpResponse
        +select_by_cache(text) String
        +is_load_balanced() bool
    }
    
    class PrefillDecode {
        -pd_router: Arc~PDRouter~
        +route(request) HttpResponse
        +forward_to_pd() HttpResponse
    }
    
    Router <|-- Random
    Router <|-- RoundRobin
    Router <|-- CacheAware
    Router <|-- PrefillDecode

4. Cache-Aware Algorithm Detail

flowchart TD
    Start([Request Arrives]) --> Extract[Extract Text from Request]
    Extract --> CheckBalance{System<br/>Load Balanced?}
    
    CheckBalance -->|Yes| TreeLookup[Lookup in Radix Trees]
    CheckBalance -->|No| LoadBalance[Select Least Loaded]
    
    TreeLookup --> FindMatch[Find Best Prefix Match]
    FindMatch --> CheckThreshold{Match Rate ><br/>Threshold?}
    
    CheckThreshold -->|Yes| SelectCache[Select Worker<br/>with Best Match]
    CheckThreshold -->|No| SelectSmallest[Select Worker with<br/>Smallest Tree]
    
    SelectCache --> UpdateTree
    SelectSmallest --> UpdateTree
    LoadBalance --> UpdateTree[Update Tree<br/>with Request]
    
    UpdateTree --> Forward[Forward Request]
    Forward --> UpdateLoad[Update Load Counter]
    UpdateLoad --> End([Return Response])
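The threshold branch in the flowchart above can be sketched in plain Rust. This is an illustrative stand-in only: the `TreeStats` shape and the `select_by_cache` helper are hypothetical, and the real router walks per-worker radix trees rather than consuming precomputed match lengths.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for per-worker radix-tree state:
/// (matched prefix length, total tree size), both in characters.
type TreeStats = HashMap<String, (usize, usize)>;

/// Follow the flowchart's threshold branch: route to the worker with the
/// best prefix match if the match rate clears `threshold`, otherwise to
/// the worker with the smallest tree (spreading new prefixes evenly).
fn select_by_cache(stats: &TreeStats, text_len: usize, threshold: f32) -> Option<String> {
    let (best_worker, best_match) = stats
        .iter()
        .max_by_key(|(_, (matched, _))| *matched)
        .map(|(w, (m, _))| (w.clone(), *m))?;

    let match_rate = best_match as f32 / text_len.max(1) as f32;
    if match_rate > threshold {
        // Cache hit: reuse the worker holding the longest shared prefix.
        Some(best_worker)
    } else {
        // Cache miss: send the new prefix to the least-populated tree.
        stats
            .iter()
            .min_by_key(|(_, (_, size))| *size)
            .map(|(w, _)| w.clone())
    }
}

fn main() {
    let mut stats = TreeStats::new();
    stats.insert("http://w1:8000".into(), (80, 500));
    stats.insert("http://w2:8000".into(), (10, 100));
    println!("{:?}", select_by_cache(&stats, 100, 0.5));
}
```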

5. PD Router Architecture (pd_router.rs)

graph TB
    subgraph "PD Router Components"
        PDR[PD Router]
        
        subgraph "Worker Pools"
            PFP[Prefill Pool<br/>RwLock Vec]
            DCP[Decode Pool<br/>RwLock Vec]
        end
        
        subgraph "Selection Policies"
            PRND[Random Selection]
            PP2[Power of Two]
            PCA[Cache Aware]
        end
        
        subgraph "Request Processing"
            BSI[Bootstrap Injection]
            PAR[Parallel Dispatch]
            LPM[Logprob Merger]
        end
        
        subgraph "Load Tracking"
            PLT[Prefill Load Tracker]
            DLT[Decode Load Tracker]
        end
    end
    
    PDR --> PFP
    PDR --> DCP
    PDR --> PRND
    PDR --> PP2
    PDR --> PCA
    
    PRND --> BSI
    PP2 --> BSI
    PCA --> BSI
    
    BSI --> PAR
    PAR --> LPM
    
    PFP --> PLT
    DCP --> DLT

6. Service Discovery (service_discovery.rs)

stateDiagram-v2
    [*] --> Initializing
    Initializing --> Watching: K8s Client Ready
    
    Watching --> Discovering: Timer Tick
    Discovering --> Processing: Pods Found
    Processing --> Filtering: Apply Selectors
    Filtering --> HealthCheck: Valid Pods
    
    HealthCheck --> UpdateWorkers: All Healthy
    HealthCheck --> PartialUpdate: Some Healthy
    HealthCheck --> Retry: All Failed
    
    UpdateWorkers --> Watching: Success
    PartialUpdate --> Watching: Partial Success
    Retry --> Discovering: Backoff Wait
    
    Watching --> Error: K8s API Error
    Error --> Retry: Exponential Backoff
    
    note right of HealthCheck
        Concurrent health checks
        with timeout protection
    end note
    
    note right of UpdateWorkers
        Atomic worker list update
        Triggers router refresh
    end note

Request Flow Analysis

Regular Mode Request Flow

flowchart LR
    subgraph "1. Request Receipt"
        REQ[HTTP Request] --> PARSE[Parse JSON]
        PARSE --> ADAPT[Adapt to Internal Format]
    end
    
    subgraph "2. Routing Decision"
        ADAPT --> POLICY{Routing Policy}
        POLICY -->|Random| RND_LOGIC[Random Selection]
        POLICY -->|RoundRobin| RR_LOGIC[Sequential Selection]
        POLICY -->|CacheAware| CA_LOGIC[Cache Analysis]
    end
    
    subgraph "3. Worker Selection"
        RND_LOGIC --> HEALTH{Health Check}
        RR_LOGIC --> HEALTH
        CA_LOGIC --> HEALTH
        HEALTH -->|Healthy| SELECT[Select Worker]
        HEALTH -->|Unhealthy| RETRY[Try Next]
        RETRY --> HEALTH
    end
    
    subgraph "4. Request Forwarding"
        SELECT --> BUILD[Build HTTP Request]
        BUILD --> SEND[Send to Worker]
        SEND --> WAIT{Response Type}
        WAIT -->|Stream| SSE[SSE Handler]
        WAIT -->|JSON| JSON[JSON Handler]
    end
    
    subgraph "5. Response Processing"
        SSE --> STREAM[Stream Response]
        JSON --> RETURN[Return Response]
        STREAM --> CLIENT[Client]
        RETURN --> CLIENT
    end

PD Mode Request Flow

flowchart TB
    subgraph "1. Request Preparation"
        REQ[Request] --> CHECK{Has Bootstrap?}
        CHECK -->|No| FETCH[Fetch Bootstrap<br/>from Prefill]
        CHECK -->|Yes| INJECT[Use Existing]
        FETCH --> INJECT
    end
    
    subgraph "2. Worker Selection"
        INJECT --> SEL_PF[Select Prefill Worker]
        INJECT --> SEL_DC[Select Decode Worker]
        
        SEL_PF --> PF_POLICY{Policy}
        SEL_DC --> DC_POLICY{Policy}
        
        PF_POLICY -->|Random| PF_RND[Random Prefill]
        PF_POLICY -->|P2| PF_P2[Power of Two Prefill]
        
        DC_POLICY -->|Random| DC_RND[Random Decode]
        DC_POLICY -->|P2| DC_P2[Power of Two Decode]
    end
    
    subgraph "3. Parallel Dispatch"
        PF_RND --> PF_REQ[Prefill Request]
        PF_P2 --> PF_REQ
        DC_RND --> DC_REQ[Decode Request]
        DC_P2 --> DC_REQ
        
        PF_REQ --> PF_WAIT[Wait Prefill]
        DC_REQ --> DC_WAIT[Wait Decode]
    end
    
    subgraph "4. Response Handling"
        DC_WAIT --> CHECK_LP{Logprobs<br/>Requested?}
        CHECK_LP -->|Yes| MERGE[Merge Logprobs]
        CHECK_LP -->|No| RETURN[Return Decode Response]
        PF_WAIT --> MERGE
        MERGE --> RETURN
    end
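The logprob merge in step 4 can be sketched minimally. Assuming (hypothetically) that the prefill response carries input-token logprobs and the decode response carries output-token logprobs as flat arrays, the merged view is their concatenation; the real merger also has to align token offsets and handle streaming chunks.

```rust
/// Illustrative merge: prefill supplies input-token logprobs, decode
/// supplies output-token logprobs; the merged view concatenates them
/// in prompt-then-generation order.
fn merge_logprobs(prefill: &[f64], decode: &[f64]) -> Vec<f64> {
    prefill.iter().chain(decode.iter()).copied().collect()
}

fn main() {
    println!("{:?}", merge_logprobs(&[-0.11, -0.52], &[-0.08]));
}
```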

Identified Pain Points

1. Type Safety and State Management

  • Issue: Workers represented as strings (Vec<String>)
  • Impact: No health/load tracking, type confusion, scattered state
  • Example: Health checks require external HashMap lookups

2. Code Duplication

  • Issue: Routing logic duplicated between regular and PD routers
  • Impact: Maintenance overhead, inconsistent behavior
  • Example: CacheAware implemented twice with slight variations

3. Limited Extensibility

  • Issue: Router enum requires modification for new policies
  • Impact: Violates Open-Closed Principle, risky changes
  • Example: Adding PowerOfTwo to regular mode requires enum changes

4. Scattered Observability

  • Issue: Metrics collection spread across multiple files
  • Impact: Inconsistent naming, missing metrics, hard to dashboard
  • Example: Some endpoints lack request duration metrics

5. Basic Service Discovery

  • Issue: No retry logic, basic error handling
  • Impact: Transient K8s API failures cause worker loss
  • Example: Network blip removes healthy workers permanently

6. PD Mode Limitations

  • Issue: No dynamic worker management in PD mode
  • Impact: Requires restart to add/remove workers
  • Example: /add_worker returns error for PD mode

7. Configuration Management

  • Issue: Configuration validation scattered across multiple locations
  • Impact: Inconsistent validation logic, duplicate code, runtime errors
  • Example: URL validation in Python code, mode compatibility checks in server startup, policy parameter validation in individual routers

Proposed Improvements

The following improvements are designed to address immediate pain points while laying the groundwork for our long-term vision of transforming sgl-router into a full OpenAI API server. Each phase builds capabilities that serve both current needs and future evolution.

Proposed Project Structure

The refactored codebase will reorganize existing files into focused modules:

sgl-router/
├── src/
│   ├── lib.rs                     # Python bindings, main Router struct
│   ├── server.rs                  # HTTP server, actix-web endpoints
│   ├── openai_api_types.rs        # OpenAI API request/response types
│   ├── service_discovery.rs       # K8s service discovery
│   ├── request_adapter.rs         # Request format adaptation
│   │
│   ├── config/                    # Configuration management
│   │   ├── mod.rs
│   │   ├── types.rs               # RouterConfig, PolicyConfig, etc.
│   │   ├── validation.rs          # ConfigValidator
│   │   └── error.rs               # ConfigError
│   │
│   ├── core/                      # Core abstractions
│   │   ├── mod.rs
│   │   └── worker.rs              # Worker trait and implementations
│   │
│   ├── router/                    # Routing logic
│   │   ├── mod.rs
│   │   ├── policies/              # Routing policies
│   │   │   ├── mod.rs
│   │   │   ├── random.rs
│   │   │   ├── round_robin.rs
│   │   │   ├── cache_aware.rs
│   │   │   └── power_of_two.rs
│   │   ├── router.rs              # Router implementations
│   │   ├── pd_router.rs           # PD router logic (includes pd_types)
│   │   ├── tree.rs                # Radix tree for cache-aware routing
│   │   └── factory.rs             # Router factory
│   │
│   └── observability/             # Monitoring
│       ├── mod.rs
│       ├── logging.rs             # Structured logging
│       └── metrics.rs             # Prometheus metrics

Note: pd_types.rs will be merged into pd_router.rs as those types are only used there.

Phase 1: Foundation & Core Abstractions (Weeks 1-3)

Task 001: Centralized Configuration

Create a comprehensive configuration module to eliminate scattered validation:

pub struct RouterConfig {
    pub mode: RoutingMode,
    pub policy: PolicyConfig,
    pub workers: Vec<String>,
    pub host: String,
    pub port: u16,
    // ... other fields
}

pub enum RoutingMode {
    Regular,
    PrefillDecode {
        prefill_urls: Vec<(String, Option<u16>)>,
        decode_urls: Vec<String>,
    },
}

pub enum PolicyConfig {
    Random,
    RoundRobin,
    CacheAware { threshold: f32, /* ... */ },
    PowerOfTwo { interval_secs: u64 },
}

Implement validation with clear error messages:

  • Field-level validation for URLs, ports, thresholds
  • Cross-field compatibility checks (mode vs policy)
  • Early detection of configuration errors
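A minimal sketch of what two such checks might look like, assuming hypothetical names (`ConfigError`, `validate_worker_urls`, `validate_cache_threshold`) for the proposed validation module:

```rust
/// Sketch of the proposed ConfigValidator; error variants are illustrative.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    NoWorkers,
    InvalidUrl(String),
    InvalidThreshold(f32),
}

/// Field-level check: every worker URL must be an http(s) endpoint.
pub fn validate_worker_urls(urls: &[String]) -> Result<(), ConfigError> {
    if urls.is_empty() {
        return Err(ConfigError::NoWorkers);
    }
    for url in urls {
        if !(url.starts_with("http://") || url.starts_with("https://")) {
            return Err(ConfigError::InvalidUrl(url.clone()));
        }
    }
    Ok(())
}

/// Field-level check: the cache-aware threshold is a match *rate*,
/// so it must lie in [0, 1].
pub fn validate_cache_threshold(threshold: f32) -> Result<(), ConfigError> {
    if (0.0..=1.0).contains(&threshold) {
        Ok(())
    } else {
        Err(ConfigError::InvalidThreshold(threshold))
    }
}

fn main() {
    let urls = vec!["http://worker-0:8000".to_string()];
    println!("{:?}", validate_worker_urls(&urls));
}
```

Centralizing these as typed errors means a bad threshold fails at startup with a named variant instead of surfacing as a runtime routing anomaly.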

Task 002: Worker Abstraction

Transform workers from strings to typed entities, enabling future support for both HTTP endpoints and gRPC connections:

// `Clone` is intentionally omitted so the trait stays object-safe and
// workers can be shared as `Arc<dyn Worker>`; `#[async_trait]` makes the
// async method usable in a trait on stable Rust.
#[async_trait]
pub trait Worker: Send + Sync {
    fn url(&self) -> &str;
    fn worker_type(&self) -> WorkerType;
    fn is_healthy(&self) -> bool;
    fn load(&self) -> Arc<AtomicUsize>;
    async fn check_health(&self) -> Result<(), WorkerError>;
}

pub enum WorkerType {
    Regular,
    Prefill { bootstrap_port: Option<u16> },
    Decode,
    // Future: GrpcTokenizer, GrpcScheduler for direct backend connections
}

This abstraction is crucial for the long-term vision, as it allows the router to treat both traditional HTTP endpoints and future gRPC connections uniformly.
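As a concrete (synchronous) sketch of what a typed worker buys over a bare `String`, the hypothetical `HttpWorker` below carries its own health flag and in-flight counter, so no external `HashMap` lookup is needed; the async `check_health` from the trait is omitted here.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Minimal sketch of a worker entity replacing bare String URLs.
/// Health and load live on the worker itself instead of in side tables.
pub struct HttpWorker {
    url: String,
    healthy: AtomicBool,
    load: Arc<AtomicUsize>,
}

impl HttpWorker {
    pub fn new(url: impl Into<String>) -> Self {
        Self {
            url: url.into(),
            healthy: AtomicBool::new(true),
            load: Arc::new(AtomicUsize::new(0)),
        }
    }

    pub fn url(&self) -> &str { &self.url }
    pub fn is_healthy(&self) -> bool { self.healthy.load(Ordering::Relaxed) }
    pub fn set_healthy(&self, ok: bool) { self.healthy.store(ok, Ordering::Relaxed) }

    /// Increment the in-flight counter; the matching decrement happens
    /// when the response completes.
    pub fn start_request(&self) -> usize { self.load.fetch_add(1, Ordering::Relaxed) + 1 }
    pub fn finish_request(&self) -> usize { self.load.fetch_sub(1, Ordering::Relaxed) - 1 }
    pub fn load(&self) -> usize { self.load.load(Ordering::Relaxed) }
}

fn main() {
    let w = HttpWorker::new("http://worker-0:8000");
    w.start_request();
    println!("{} load={}", w.url(), w.load());
}
```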

Task 003: RoutingPolicy Trait

Unify routing algorithms:

#[async_trait]
pub trait RoutingPolicy: Send + Sync {
    async fn select_single(&self, workers: &[Arc<dyn Worker>], request: &Value) 
        -> Result<Arc<dyn Worker>, RoutingError>;
    
    async fn select_pair(&self, prefill: &[Arc<dyn Worker>], decode: &[Arc<dyn Worker>], request: &Value) 
        -> Result<(Arc<dyn Worker>, Arc<dyn Worker>), RoutingError>;
}
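As a synchronous sketch of what one `select_single` implementation reduces to, round-robin needs only an atomic counter. `RoundRobinPolicy` and its `select` helper are illustrative names; the real policy would sit behind the async trait and operate on `Arc<dyn Worker>` slices.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of the selection core of a round-robin RoutingPolicy.
pub struct RoundRobinPolicy {
    counter: AtomicUsize,
}

impl RoundRobinPolicy {
    pub fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Return the index of the next worker. Health filtering is assumed
    /// to have happened before this call.
    pub fn select(&self, n_workers: usize) -> Option<usize> {
        if n_workers == 0 {
            return None;
        }
        // fetch_add hands each concurrent caller a distinct ticket,
        // so selection stays strictly sequential under contention.
        Some(self.counter.fetch_add(1, Ordering::Relaxed) % n_workers)
    }
}

fn main() {
    let p = RoundRobinPolicy::new();
    for _ in 0..4 {
        print!("{:?} ", p.select(3));
    }
    println!();
}
```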

Task 004: Policy Migration

Implement all policies using the new trait, enabling:

  • PowerOfTwo in regular mode
  • All policies in PD mode
  • Consistent behavior across modes
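Power-of-two-choices, today available only in PD mode, is small enough to sketch here: sample two distinct workers and keep the less loaded one. The tiny xorshift PRNG is a std-only stand-in for a real RNG, and all names are illustrative.

```rust
/// Minimal xorshift64 PRNG so the sketch stays dependency-free.
/// Seed must be nonzero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Power-of-two-choices: sample two distinct indices, pick the one with
/// the lower in-flight load.
pub fn power_of_two(loads: &[usize], rng: &mut XorShift) -> Option<usize> {
    match loads.len() {
        0 => None,
        1 => Some(0),
        n => {
            let i = (rng.next() % n as u64) as usize;
            // Offset by 1..n so j is always distinct from i.
            let j = (i + 1 + (rng.next() % (n as u64 - 1)) as usize) % n;
            Some(if loads[i] <= loads[j] { i } else { j })
        }
    }
}

fn main() {
    let mut rng = XorShift(0x9E3779B97F4A7C15);
    println!("{:?}", power_of_two(&[3, 1, 4, 1], &mut rng));
}
```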

Phase 2: Infrastructure (Week 4)

Task 005: Centralized Observability

Consolidate metrics:

pub struct RouterMetrics;

impl RouterMetrics {
    pub fn record_request(route: &str, method: &str);
    pub fn record_duration(route: &str, duration: Duration);
    pub fn record_error(route: &str, error: &str);
    pub fn set_worker_health(url: &str, healthy: bool);
    pub fn record_cache_hit(worker: &str);
}
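A std-only sketch of what this facade could look like internally; the real module would back it with Prometheus counters and histograms under consistent metric names, and the field layout here is purely illustrative.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

/// Sketch of a centralized metrics facade: every call site goes through
/// one struct, so metric names and labels stay consistent.
#[derive(Default)]
pub struct RouterMetrics {
    requests: Mutex<HashMap<(String, String), u64>>,
    durations: Mutex<HashMap<String, Duration>>,
}

impl RouterMetrics {
    pub fn record_request(&self, route: &str, method: &str) {
        let mut m = self.requests.lock().unwrap();
        *m.entry((route.to_string(), method.to_string())).or_insert(0) += 1;
    }

    pub fn record_duration(&self, route: &str, duration: Duration) {
        let mut m = self.durations.lock().unwrap();
        // Summing for simplicity; a histogram would keep buckets instead.
        *m.entry(route.to_string()).or_insert(Duration::ZERO) += duration;
    }

    pub fn request_count(&self, route: &str, method: &str) -> u64 {
        *self
            .requests
            .lock()
            .unwrap()
            .get(&(route.to_string(), method.to_string()))
            .unwrap_or(&0)
    }
}

fn main() {
    let m = RouterMetrics::default();
    m.record_request("/v1/chat/completions", "POST");
    println!("{}", m.request_count("/v1/chat/completions", "POST"));
}
```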

Task 006: Enhanced Service Discovery

Add resilience:

  • Exponential backoff retry
  • Health validation before adding
  • Support for all worker types
  • Graceful degradation
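The retry schedule behind "exponential backoff" is simple to pin down; `backoff_delay` is an illustrative helper, and jitter is omitted here although a production version would add it to avoid thundering herds against the K8s API.

```rust
use std::time::Duration;

/// Capped exponential backoff for service-discovery retries:
/// base * 2^attempt, saturating, never exceeding `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

fn main() {
    for attempt in 0..5 {
        println!(
            "{:?}",
            backoff_delay(attempt, Duration::from_millis(100), Duration::from_secs(5))
        );
    }
}
```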

Phase 3: Architecture (Week 5)

Task 007: Router Factory

Replace enum with trait-based design, enabling future dual-mode operation:

#[async_trait]
pub trait Router: Send + Sync {
    async fn route(&self, req: HttpRequest, body: Value, route: &str) -> HttpResponse;
    async fn add_worker(&self, worker: Arc<dyn Worker>) -> Result<(), RouterError>;
    async fn remove_worker(&self, url: &str) -> Result<(), RouterError>;
    fn apply_discovery_update(&self, update: DiscoveryUpdate);
}

pub struct RouterFactory;

impl RouterFactory {
    pub async fn create_router(config: &RouterConfig) -> Result<Arc<dyn Router>, RouterError>;
    // Future: create_api_server(config) for full OpenAI API mode
}

This factory pattern is essential for supporting both traditional proxy mode and future API server mode, allowing runtime selection based on configuration.
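The dispatch at the heart of the factory can be sketched as a single `match` from config to a boxed trait object. The `SelectPolicy` trait and the trimmed `PolicyConfig` below are illustrative; the point is that adding a policy means adding one match arm plus a file under `router/policies/`, not editing a closed enum of routers.

```rust
/// Illustrative policy trait; the real one is the async RoutingPolicy.
pub trait SelectPolicy: Send + Sync {
    fn name(&self) -> &'static str;
}

pub struct Random;
pub struct RoundRobin;
pub struct CacheAware { pub threshold: f32 }

impl SelectPolicy for Random { fn name(&self) -> &'static str { "random" } }
impl SelectPolicy for RoundRobin { fn name(&self) -> &'static str { "round_robin" } }
impl SelectPolicy for CacheAware { fn name(&self) -> &'static str { "cache_aware" } }

/// Trimmed copy of the PolicyConfig enum from Task 001.
pub enum PolicyConfig {
    Random,
    RoundRobin,
    CacheAware { threshold: f32 },
}

/// Factory dispatch: configuration picks the concrete policy at startup,
/// and the rest of the router only sees the trait object.
pub fn create_policy(cfg: &PolicyConfig) -> Box<dyn SelectPolicy> {
    match cfg {
        PolicyConfig::Random => Box::new(Random),
        PolicyConfig::RoundRobin => Box::new(RoundRobin),
        PolicyConfig::CacheAware { threshold } => {
            Box::new(CacheAware { threshold: *threshold })
        }
    }
}

fn main() {
    println!("{}", create_policy(&PolicyConfig::RoundRobin).name());
}
```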

Long-Term Vision

From Load Balancer to Full OpenAI API Server

The architectural improvements proposed in this document are designed with a transformative long-term vision: evolving sgl-router from a simple HTTP proxy into a fully-featured OpenAI-compatible API server that directly integrates with SGLang's backend services.

Target Capabilities

  1. Dual Operating Modes

    • Traditional Router Mode: Continue supporting the current proxy behavior for backward compatibility
    • API Server Mode: Full OpenAI API implementation with advanced features
  2. Native OpenAI API Implementation

    • Complete endpoint compatibility (chat/completions, completions, embeddings, etc.)
    • Built-in request validation and processing
    • Streaming response support with proper SSE formatting
    • Error handling matching OpenAI's API behavior
  3. Tool Calling Framework

    • Native support for function/tool calling without relying on backend servers
    • Extensible executor system (HTTP, Python, Shell, custom integrations)
    • Tool result integration directly in the conversation flow
    • Security sandboxing and permission management
  4. Direct gRPC Communication

    • Replace HTTP forwarding with efficient gRPC calls to SGLang's scheduler
    • Connection pooling and load balancing
    • Streaming support for real-time token generation
    • Reduced latency by avoiding per-request HTTP parsing and serialization overhead

Implementation Phases

Detailed Timeline

gantt
    title SGLang Router Improvement Timeline
    dateFormat  YYYY-MM-DD
    section Phase 1
    Configuration Module         :t1, 2025-06-26, 5d
    Worker Abstraction           :t2, after t1, 6d
    RoutingPolicy Trait          :t3, after t2, 7d
    Policy Migration             :t4, after t3, 6d
    section Phase 2
    Centralized Observability    :t5, after t4, 4d
    Enhanced Service Discovery   :t6, after t4, 6d
    section Phase 3
    Router Factory               :t7, after t6, 7d
    section Testing
    Integration Testing          :t8, after t7, 5d
    Performance Validation       :t9, after t8, 3d
    Documentation               :t10, after t8, 3d

Risk Analysis

Technical Risks

| Risk | Impact | Probability | Mitigation |
| --- | --- | --- | --- |
| Performance Regression | High | Medium | Continuous benchmarking, profiling |
| Breaking Changes | High | Low | Feature flags, gradual rollout |
| Memory Leaks | Medium | Low | Stress testing, leak detection |
| Thread Safety Issues | High | Medium | Race condition testing, careful review |

Conclusion

This comprehensive improvement plan addresses fundamental architectural issues while maintaining system stability. The phased approach ensures each improvement builds on the previous, creating a more maintainable, extensible, and reliable routing system for SGLang.

Appendix: Architecture Diagrams

High-Level Architecture

graph TB
    subgraph "Client Layer"
        PY[Python Client<br/>SGLang]
        HTTP[HTTP Client<br/>OpenAI Compatible]
    end

    subgraph "Router Layer"
        R[Router<br/>lib.rs/PyO3]
        S[HTTP Server<br/>server.rs]

        subgraph "Routing Modes"
            REG[Regular Router<br/>router.rs]
            PD[PD Router<br/>pd_router.rs]
        end

        subgraph "Routing Policies"
            RND[Random]
            RR[RoundRobin]
            CA[CacheAware<br/>+ Tree]
            P2[PowerOfTwo]
        end
    end

    subgraph "Infrastructure"
        SD[Service Discovery<br/>K8s Integration]
        PROM[Prometheus<br/>Metrics]
        LOG[Logging<br/>tracing]
    end

    subgraph "Worker Layer"
        subgraph "Regular Workers"
            W1[Worker 1]
            W2[Worker 2]
            WN[Worker N]
        end

        subgraph "PD Workers"
            PF1[Prefill 1]
            PF2[Prefill 2]
            D1[Decode 1]
            D2[Decode 2]
        end
    end

    PY --> R
    HTTP --> S
    R --> S
    S --> REG
    S --> PD
    REG --> RND
    REG --> RR
    REG --> CA
    PD --> RND
    PD --> P2
    PD --> CA

    REG --> W1
    REG --> W2
    REG --> WN

    PD --> PF1
    PD --> PF2
    PD --> D1
    PD --> D2

    SD --> REG
    SD --> PD
    S --> PROM
    S --> LOG


Component Interactions

sequenceDiagram
    participant C as Client
    participant S as Server
    participant R as Router
    participant P as Policy
    participant W as Worker
    participant SD as ServiceDiscovery
    participant M as Metrics
    
    Note over SD: Continuous Discovery
    SD->>R: Update Workers
    
    C->>S: HTTP Request
    S->>S: Parse & Validate
    S->>R: Route Request
    
    R->>P: Select Worker(s)
    
    alt Regular Mode
        P->>P: Apply Policy Logic
        P-->>R: Selected Worker
        R->>W: Forward Request
        W-->>R: Response
    else PD Mode
        P->>P: Select Prefill & Decode
        P-->>R: Worker Pair
        par Prefill Request
            R->>W: Prefill Request
        and Decode Request
            R->>W: Decode Request
        end
        W-->>R: Merged Response
    end
    
    R-->>S: Response
    S-->>C: HTTP Response
    
    R->>M: Record Metrics
    
    Note over R,W: Health Checks
    loop Every 30s
        R->>W: Health Check
        W-->>R: Status
        R->>M: Update Health
    end
