
SGLang Router Architecture Improvement Proposal #7532

@slin1237


Table of Contents

  1. Summary
  2. Current Architecture Overview
  3. System Components
  4. Request Flow Analysis
  5. Identified Pain Points
  6. Proposed Improvements
  7. Long-Term Vision
  8. Implementation Phases
  9. Risk Analysis
  10. Success Metrics
  11. Conclusion
  12. Appendix: Architecture Diagrams

Summary

This proposal outlines an architectural improvement plan for the SGLang Router, a high-performance load balancer that supports both traditional and disaggregated (Prefill-Decode) routing modes. The improvements focus on enhancing maintainability and extensibility without disrupting existing functionality. These changes lay the foundation for a long-term transformation in which sgl-router evolves from a simple proxy into a full-featured OpenAI API server with native tool calling, session management, and direct gRPC communication with SGLang's backend services.

Current Architecture Overview

The SGLang Router currently operates as an HTTP proxy that distributes requests across multiple SGLang server instances. It supports both regular routing mode and prefill-decode (PD) disaggregated routing mode, with multiple load balancing policies including random, round-robin, cache-aware, and power-of-two selection. The implementation consists of several large monolithic files that mix concerns and make maintenance challenging. (See Appendix for detailed architecture diagrams)

System Components

1. Entry Point (lib.rs)

The main entry point provides Python bindings through PyO3:

#[pyclass]
struct Router {
    // Configuration
    host: String,
    port: u16,
    worker_urls: Vec<String>,
    policy: PolicyType,
    
    // PD Mode specific
    pd_disaggregation: bool,
    prefill_urls: Option<Vec<(String, Option<u16>)>>,
    decode_urls: Option<Vec<String>>,
    
    // Policy parameters
    cache_threshold: f32,
    balance_abs_threshold: usize,
    balance_rel_threshold: f32,
    // ... more fields
}

2. HTTP Server (server.rs)

Actix-web based server exposing multiple endpoints:

graph LR
    subgraph "API Endpoints"
        subgraph "OpenAI API"
            CC["/v1/chat/completions"]
            CO["/v1/completions"]
            GE["/generate"]
        end
        
        subgraph "Management"
            AW["/add_worker"]
            RW["/remove_worker"]
            LW["/list_workers"]
        end
        
        subgraph "Monitoring"
            HE["/health"]
            GL["/get_loads"]
            SI["/get_server_info"]
        end
    end
    
    subgraph "Request Processing"
        RP["Request Parser"]
        RA["Request Adapter"]
        RO["Router Selection"]
    end
    
    subgraph "Response Handling"
        ST["Streaming\n(SSE)"]
        JS["JSON\nResponse"]
        ER["Error\nHandler"]
    end
    
    %% Flow connections
    CC --> RP
    CO --> RP
    GE --> RP

    RP --> RA
    RA --> RO

    RO --> ST
    RO --> JS
    RO --> ER


3. Router Implementation (router.rs)

The router is implemented as an enum with four variants:

classDiagram
    class Router {
        <<enumeration>>
        Random
        RoundRobin
        CacheAware
        PrefillDecode
    }
    
    class Random {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -timeout_secs: u64
        -interval_secs: u64
        +route(request) HttpResponse
        +add_worker(url) Result
        +remove_worker(url) Result
    }
    
    class RoundRobin {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -current_index: AtomicUsize
        -timeout_secs: u64
        +route(request) HttpResponse
        +get_next_worker() String
    }
    
    class CacheAware {
        -worker_urls: Arc~RwLock~Vec~String~~~
        -tree_map: Arc~DashMap~String, Tree~~
        -running_queue: Arc~Mutex~HashMap~String, usize~~~
        -config: CacheAwareConfig
        +route(request) HttpResponse
        +select_by_cache(text) String
        +is_load_balanced() bool
    }
    
    class PrefillDecode {
        -pd_router: Arc~PDRouter~
        +route(request) HttpResponse
        +forward_to_pd() HttpResponse
    }
    
    Router <|-- Random
    Router <|-- RoundRobin
    Router <|-- CacheAware
    Router <|-- PrefillDecode

4. Cache-Aware Algorithm Detail

flowchart TD
    Start([Request Arrives]) --> Extract[Extract Text from Request]
    Extract --> CheckBalance{System<br/>Load Balanced?}
    
    CheckBalance -->|Yes| TreeLookup[Lookup in Radix Trees]
    CheckBalance -->|No| LoadBalance[Select Least Loaded]
    
    TreeLookup --> FindMatch[Find Best Prefix Match]
    FindMatch --> CheckThreshold{Match Rate ><br/>Threshold?}
    
    CheckThreshold -->|Yes| SelectCache[Select Worker<br/>with Best Match]
    CheckThreshold -->|No| SelectSmallest[Select Worker with<br/>Smallest Tree]
    
    SelectCache --> UpdateTree
    SelectSmallest --> UpdateTree
    LoadBalance --> UpdateTree[Update Tree<br/>with Request]
    
    UpdateTree --> Forward[Forward Request]
    Forward --> UpdateLoad[Update Load Counter]
    UpdateLoad --> End([Return Response])
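The threshold branch in the flowchart above can be sketched in plain Rust. This is an illustrative stand-in only: the `TreeStats` shape and the `select_by_cache` helper are hypothetical, and the real router walks per-worker radix trees rather than consuming precomputed match lengths.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for per-worker radix-tree state:
/// (matched prefix length, total tree size), both in characters.
type TreeStats = HashMap<String, (usize, usize)>;

/// Follow the flowchart's threshold branch: route to the worker with the
/// best prefix match if the match rate clears `threshold`, otherwise to
/// the worker with the smallest tree (spreading new prefixes evenly).
fn select_by_cache(stats: &TreeStats, text_len: usize, threshold: f32) -> Option<String> {
    let (best_worker, best_match) = stats
        .iter()
        .max_by_key(|(_, (matched, _))| *matched)
        .map(|(w, (m, _))| (w.clone(), *m))?;

    let match_rate = best_match as f32 / text_len.max(1) as f32;
    if match_rate > threshold {
        // Cache hit: reuse the worker holding the longest shared prefix.
        Some(best_worker)
    } else {
        // Cache miss: send the new prefix to the least-populated tree.
        stats
            .iter()
            .min_by_key(|(_, (_, size))| *size)
            .map(|(w, _)| w.clone())
    }
}

fn main() {
    let mut stats = TreeStats::new();
    stats.insert("http://w1:8000".into(), (80, 500));
    stats.insert("http://w2:8000".into(), (10, 100));
    println!("{:?}", select_by_cache(&stats, 100, 0.5));
}
```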

5. PD Router Architecture (pd_router.rs)

graph TB
    subgraph "PD Router Components"
        PDR[PD Router]
        
        subgraph "Worker Pools"
            PFP[Prefill Pool<br/>RwLock Vec]
            DCP[Decode Pool<br/>RwLock Vec]
        end
        
        subgraph "Selection Policies"
            PRND[Random Selection]
            PP2[Power of Two]
            PCA[Cache Aware]
        end
        
        subgraph "Request Processing"
            BSI[Bootstrap Injection]
            PAR[Parallel Dispatch]
            LPM[Logprob Merger]
        end
        
        subgraph "Load Tracking"
            PLT[Prefill Load Tracker]
            DLT[Decode Load Tracker]
        end
    end
    
    PDR --> PFP
    PDR --> DCP
    PDR --> PRND
    PDR --> PP2
    PDR --> PCA
    
    PRND --> BSI
    PP2 --> BSI
    PCA --> BSI
    
    BSI --> PAR
    PAR --> LPM
    
    PFP --> PLT
    DCP --> DLT

6. Service Discovery (service_discovery.rs)

stateDiagram-v2
    [*] --> Initializing
    Initializing --> Watching: K8s Client Ready
    
    Watching --> Discovering: Timer Tick
    Discovering --> Processing: Pods Found
    Processing --> Filtering: Apply Selectors
    Filtering --> HealthCheck: Valid Pods
    
    HealthCheck --> UpdateWorkers: All Healthy
    HealthCheck --> PartialUpdate: Some Healthy
    HealthCheck --> Retry: All Failed
    
    UpdateWorkers --> Watching: Success
    PartialUpdate --> Watching: Partial Success
    Retry --> Discovering: Backoff Wait
    
    Watching --> Error: K8s API Error
    Error --> Retry: Exponential Backoff
    
    note right of HealthCheck
        Concurrent health checks
        with timeout protection
    end note
    
    note right of UpdateWorkers
        Atomic worker list update
        Triggers router refresh
    end note

Request Flow Analysis

Regular Mode Request Flow

flowchart LR
    subgraph "1. Request Receipt"
        REQ[HTTP Request] --> PARSE[Parse JSON]
        PARSE --> ADAPT[Adapt to Internal Format]
    end
    
    subgraph "2. Routing Decision"
        ADAPT --> POLICY{Routing Policy}
        POLICY -->|Random| RND_LOGIC[Random Selection]
        POLICY -->|RoundRobin| RR_LOGIC[Sequential Selection]
        POLICY -->|CacheAware| CA_LOGIC[Cache Analysis]
    end
    
    subgraph "3. Worker Selection"
        RND_LOGIC --> HEALTH{Health Check}
        RR_LOGIC --> HEALTH
        CA_LOGIC --> HEALTH
        HEALTH -->|Healthy| SELECT[Select Worker]
        HEALTH -->|Unhealthy| RETRY[Try Next]
        RETRY --> HEALTH
    end
    
    subgraph "4. Request Forwarding"
        SELECT --> BUILD[Build HTTP Request]
        BUILD --> SEND[Send to Worker]
        SEND --> WAIT{Response Type}
        WAIT -->|Stream| SSE[SSE Handler]
        WAIT -->|JSON| JSON[JSON Handler]
    end
    
    subgraph "5. Response Processing"
        SSE --> STREAM[Stream Response]
        JSON --> RETURN[Return Response]
        STREAM --> CLIENT[Client]
        RETURN --> CLIENT
    end

PD Mode Request Flow

flowchart TB
    subgraph "1. Request Preparation"
        REQ[Request] --> CHECK{Has Bootstrap?}
        CHECK -->|No| FETCH[Fetch Bootstrap<br/>from Prefill]
        CHECK -->|Yes| INJECT[Use Existing]
        FETCH --> INJECT
    end
    
    subgraph "2. Worker Selection"
        INJECT --> SEL_PF[Select Prefill Worker]
        INJECT --> SEL_DC[Select Decode Worker]
        
        SEL_PF --> PF_POLICY{Policy}
        SEL_DC --> DC_POLICY{Policy}
        
        PF_POLICY -->|Random| PF_RND[Random Prefill]
        PF_POLICY -->|P2| PF_P2[Power of Two Prefill]
        
        DC_POLICY -->|Random| DC_RND[Random Decode]
        DC_POLICY -->|P2| DC_P2[Power of Two Decode]
    end
    
    subgraph "3. Parallel Dispatch"
        PF_RND --> PF_REQ[Prefill Request]
        PF_P2 --> PF_REQ
        DC_RND --> DC_REQ[Decode Request]
        DC_P2 --> DC_REQ
        
        PF_REQ --> PF_WAIT[Wait Prefill]
        DC_REQ --> DC_WAIT[Wait Decode]
    end
    
    subgraph "4. Response Handling"
        DC_WAIT --> CHECK_LP{Logprobs<br/>Requested?}
        CHECK_LP -->|Yes| MERGE[Merge Logprobs]
        CHECK_LP -->|No| RETURN[Return Decode Response]
        PF_WAIT --> MERGE
        MERGE --> RETURN
    end
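The logprob merge in step 4 can be sketched minimally. Assuming (hypothetically) that the prefill response carries input-token logprobs and the decode response carries output-token logprobs as flat arrays, the merged view is their concatenation; the real merger also has to align token offsets and handle streaming chunks.

```rust
/// Illustrative merge: prefill supplies input-token logprobs, decode
/// supplies output-token logprobs; the merged view concatenates them
/// in prompt-then-generation order.
fn merge_logprobs(prefill: &[f64], decode: &[f64]) -> Vec<f64> {
    prefill.iter().chain(decode.iter()).copied().collect()
}

fn main() {
    println!("{:?}", merge_logprobs(&[-0.11, -0.52], &[-0.08]));
}
```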

Identified Pain Points

1. Type Safety and State Management

  • Issue: Workers represented as strings (Vec<String>)
  • Impact: No health/load tracking, type confusion, scattered state
  • Example: Health checks require external HashMap lookups

2. Code Duplication

  • Issue: Routing logic duplicated between regular and PD routers
  • Impact: Maintenance overhead, inconsistent behavior
  • Example: CacheAware implemented twice with slight variations

3. Limited Extensibility

  • Issue: Router enum requires modification for new policies
  • Impact: Violates Open-Closed Principle, risky changes
  • Example: Adding PowerOfTwo to regular mode requires enum changes

4. Scattered Observability

  • Issue: Metrics collection spread across multiple files
  • Impact: Inconsistent naming, missing metrics, hard to dashboard
  • Example: Some endpoints lack request duration metrics

5. Basic Service Discovery

  • Issue: No retry logic, basic error handling
  • Impact: Transient K8s API failures cause worker loss
  • Example: Network blip removes healthy workers permanently

6. PD Mode Limitations

  • Issue: No dynamic worker management in PD mode
  • Impact: Requires restart to add/remove workers
  • Example: /add_worker returns error for PD mode

7. Configuration Management

  • Issue: Configuration validation scattered across multiple locations
  • Impact: Inconsistent validation logic, duplicate code, runtime errors
  • Example: URL validation in Python code, mode compatibility checks in server startup, policy parameter validation in individual routers

Proposed Improvements

The following improvements are designed to address immediate pain points while laying the groundwork for our long-term vision of transforming sgl-router into a full OpenAI API server. Each phase builds capabilities that serve both current needs and future evolution.

Proposed Project Structure

The refactored codebase will reorganize existing files into focused modules:

sgl-router/
├── src/
│   ├── lib.rs                     # Python bindings, main Router struct
│   ├── server.rs                  # HTTP server, actix-web endpoints
│   ├── openai_api_types.rs        # OpenAI API request/response types
│   ├── service_discovery.rs       # K8s service discovery
│   ├── request_adapter.rs         # Request format adaptation
│   │
│   ├── config/                    # Configuration management
│   │   ├── mod.rs
│   │   ├── types.rs               # RouterConfig, PolicyConfig, etc.
│   │   ├── validation.rs          # ConfigValidator
│   │   └── error.rs               # ConfigError
│   │
│   ├── core/                      # Core abstractions
│   │   ├── mod.rs
│   │   └── worker.rs              # Worker trait and implementations
│   │
│   ├── router/                    # Routing logic
│   │   ├── mod.rs
│   │   ├── policies/              # Routing policies
│   │   │   ├── mod.rs
│   │   │   ├── random.rs
│   │   │   ├── round_robin.rs
│   │   │   ├── cache_aware.rs
│   │   │   └── power_of_two.rs
│   │   ├── router.rs              # Router implementations
│   │   ├── pd_router.rs           # PD router logic (includes pd_types)
│   │   ├── tree.rs                # Radix tree for cache-aware routing
│   │   └── factory.rs             # Router factory
│   │
│   └── observability/             # Monitoring
│       ├── mod.rs
│       ├── logging.rs             # Structured logging
│       └── metrics.rs             # Prometheus metrics

Note: pd_types.rs will be merged into pd_router.rs as those types are only used there.

Phase 1: Foundation & Core Abstractions (Weeks 1-3)

Task 001: Centralized Configuration

Create a comprehensive configuration module to eliminate scattered validation:

pub struct RouterConfig {
    pub mode: RoutingMode,
    pub policy: PolicyConfig,
    pub workers: Vec<String>,
    pub host: String,
    pub port: u16,
    // ... other fields
}

pub enum RoutingMode {
    Regular,
    PrefillDecode {
        prefill_urls: Vec<(String, Option<u16>)>,
        decode_urls: Vec<String>,
    },
}

pub enum PolicyConfig {
    Random,
    RoundRobin,
    CacheAware { threshold: f32, /* ... */ },
    PowerOfTwo { interval_secs: u64 },
}

Implement validation with clear error messages:

  • Field-level validation for URLs, ports, thresholds
  • Cross-field compatibility checks (mode vs policy)
  • Early detection of configuration errors
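A minimal sketch of what two such checks might look like, assuming hypothetical names (`ConfigError`, `validate_worker_urls`, `validate_cache_threshold`) for the proposed validation module:

```rust
/// Sketch of the proposed ConfigValidator; error variants are illustrative.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    NoWorkers,
    InvalidUrl(String),
    InvalidThreshold(f32),
}

/// Field-level check: every worker URL must be an http(s) endpoint.
pub fn validate_worker_urls(urls: &[String]) -> Result<(), ConfigError> {
    if urls.is_empty() {
        return Err(ConfigError::NoWorkers);
    }
    for url in urls {
        if !(url.starts_with("http://") || url.starts_with("https://")) {
            return Err(ConfigError::InvalidUrl(url.clone()));
        }
    }
    Ok(())
}

/// Field-level check: the cache-aware threshold is a match *rate*,
/// so it must lie in [0, 1].
pub fn validate_cache_threshold(threshold: f32) -> Result<(), ConfigError> {
    if (0.0..=1.0).contains(&threshold) {
        Ok(())
    } else {
        Err(ConfigError::InvalidThreshold(threshold))
    }
}

fn main() {
    let urls = vec!["http://worker-0:8000".to_string()];
    println!("{:?}", validate_worker_urls(&urls));
}
```

Centralizing these as typed errors means a bad threshold fails at startup with a named variant instead of surfacing as a runtime routing anomaly.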

Task 002: Worker Abstraction

Transform workers from strings to typed entities, enabling future support for both HTTP endpoints and gRPC connections:

// `Clone` is intentionally omitted so the trait stays object-safe and
// workers can be shared as `Arc<dyn Worker>`; `#[async_trait]` makes the
// async method usable in a trait on stable Rust.
#[async_trait]
pub trait Worker: Send + Sync {
    fn url(&self) -> &str;
    fn worker_type(&self) -> WorkerType;
    fn is_healthy(&self) -> bool;
    fn load(&self) -> Arc<AtomicUsize>;
    async fn check_health(&self) -> Result<(), WorkerError>;
}

pub enum WorkerType {
    Regular,
    Prefill { bootstrap_port: Option<u16> },
    Decode,
    // Future: GrpcTokenizer, GrpcScheduler for direct backend connections
}

This abstraction is crucial for the long-term vision, as it allows the router to treat both traditional HTTP endpoints and future gRPC connections uniformly.
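As a concrete (synchronous) sketch of what a typed worker buys over a bare `String`, the hypothetical `HttpWorker` below carries its own health flag and in-flight counter, so no external `HashMap` lookup is needed; the async `check_health` from the trait is omitted here.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Minimal sketch of a worker entity replacing bare String URLs.
/// Health and load live on the worker itself instead of in side tables.
pub struct HttpWorker {
    url: String,
    healthy: AtomicBool,
    load: Arc<AtomicUsize>,
}

impl HttpWorker {
    pub fn new(url: impl Into<String>) -> Self {
        Self {
            url: url.into(),
            healthy: AtomicBool::new(true),
            load: Arc::new(AtomicUsize::new(0)),
        }
    }

    pub fn url(&self) -> &str { &self.url }
    pub fn is_healthy(&self) -> bool { self.healthy.load(Ordering::Relaxed) }
    pub fn set_healthy(&self, ok: bool) { self.healthy.store(ok, Ordering::Relaxed) }

    /// Increment the in-flight counter; the matching decrement happens
    /// when the response completes.
    pub fn start_request(&self) -> usize { self.load.fetch_add(1, Ordering::Relaxed) + 1 }
    pub fn finish_request(&self) -> usize { self.load.fetch_sub(1, Ordering::Relaxed) - 1 }
    pub fn load(&self) -> usize { self.load.load(Ordering::Relaxed) }
}

fn main() {
    let w = HttpWorker::new("http://worker-0:8000");
    w.start_request();
    println!("{} load={}", w.url(), w.load());
}
```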

Task 003: RoutingPolicy Trait

Unify routing algorithms:

#[async_trait]
pub trait RoutingPolicy: Send + Sync {
    async fn select_single(&self, workers: &[Arc<dyn Worker>], request: &Value) 
        -> Result<Arc<dyn Worker>, RoutingError>;
    
    async fn select_pair(&self, prefill: &[Arc<dyn Worker>], decode: &[Arc<dyn Worker>], request: &Value) 
        -> Result<(Arc<dyn Worker>, Arc<dyn Worker>), RoutingError>;
}
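As a synchronous sketch of what one `select_single` implementation reduces to, round-robin needs only an atomic counter. `RoundRobinPolicy` and its `select` helper are illustrative names; the real policy would sit behind the async trait and operate on `Arc<dyn Worker>` slices.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of the selection core of a round-robin RoutingPolicy.
pub struct RoundRobinPolicy {
    counter: AtomicUsize,
}

impl RoundRobinPolicy {
    pub fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Return the index of the next worker. Health filtering is assumed
    /// to have happened before this call.
    pub fn select(&self, n_workers: usize) -> Option<usize> {
        if n_workers == 0 {
            return None;
        }
        // fetch_add hands each concurrent caller a distinct ticket,
        // so selection stays strictly sequential under contention.
        Some(self.counter.fetch_add(1, Ordering::Relaxed) % n_workers)
    }
}

fn main() {
    let p = RoundRobinPolicy::new();
    for _ in 0..4 {
        print!("{:?} ", p.select(3));
    }
    println!();
}
```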

Task 004: Policy Migration

Implement all policies using the new trait, enabling:

  • PowerOfTwo in regular mode
  • All policies in PD mode
  • Consistent behavior across modes
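Power-of-two-choices, today available only in PD mode, is small enough to sketch here: sample two distinct workers and keep the less loaded one. The tiny xorshift PRNG is a std-only stand-in for a real RNG, and all names are illustrative.

```rust
/// Minimal xorshift64 PRNG so the sketch stays dependency-free.
/// Seed must be nonzero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Power-of-two-choices: sample two distinct indices, pick the one with
/// the lower in-flight load.
pub fn power_of_two(loads: &[usize], rng: &mut XorShift) -> Option<usize> {
    match loads.len() {
        0 => None,
        1 => Some(0),
        n => {
            let i = (rng.next() % n as u64) as usize;
            // Offset by 1..n so j is always distinct from i.
            let j = (i + 1 + (rng.next() % (n as u64 - 1)) as usize) % n;
            Some(if loads[i] <= loads[j] { i } else { j })
        }
    }
}

fn main() {
    let mut rng = XorShift(0x9E3779B97F4A7C15);
    println!("{:?}", power_of_two(&[3, 1, 4, 1], &mut rng));
}
```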

Phase 2: Infrastructure (Week 4)

Task 005: Centralized Observability

Consolidate metrics:

pub struct RouterMetrics;

impl RouterMetrics {
    pub fn record_request(route: &str, method: &str);
    pub fn record_duration(route: &str, duration: Duration);
    pub fn record_error(route: &str, error: &str);
    pub fn set_worker_health(url: &str, healthy: bool);
    pub fn record_cache_hit(worker: &str);
}
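A std-only sketch of what this facade could look like internally; the real module would back it with Prometheus counters and histograms under consistent metric names, and the field layout here is purely illustrative.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

/// Sketch of a centralized metrics facade: every call site goes through
/// one struct, so metric names and labels stay consistent.
#[derive(Default)]
pub struct RouterMetrics {
    requests: Mutex<HashMap<(String, String), u64>>,
    durations: Mutex<HashMap<String, Duration>>,
}

impl RouterMetrics {
    pub fn record_request(&self, route: &str, method: &str) {
        let mut m = self.requests.lock().unwrap();
        *m.entry((route.to_string(), method.to_string())).or_insert(0) += 1;
    }

    pub fn record_duration(&self, route: &str, duration: Duration) {
        let mut m = self.durations.lock().unwrap();
        // Summing for simplicity; a histogram would keep buckets instead.
        *m.entry(route.to_string()).or_insert(Duration::ZERO) += duration;
    }

    pub fn request_count(&self, route: &str, method: &str) -> u64 {
        *self
            .requests
            .lock()
            .unwrap()
            .get(&(route.to_string(), method.to_string()))
            .unwrap_or(&0)
    }
}

fn main() {
    let m = RouterMetrics::default();
    m.record_request("/v1/chat/completions", "POST");
    println!("{}", m.request_count("/v1/chat/completions", "POST"));
}
```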

Task 006: Enhanced Service Discovery

Add resilience:

  • Exponential backoff retry
  • Health validation before adding
  • Support for all worker types
  • Graceful degradation
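The retry schedule behind "exponential backoff" is simple to pin down; `backoff_delay` is an illustrative helper, and jitter is omitted here although a production version would add it to avoid thundering herds against the K8s API.

```rust
use std::time::Duration;

/// Capped exponential backoff for service-discovery retries:
/// base * 2^attempt, saturating, never exceeding `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

fn main() {
    for attempt in 0..5 {
        println!(
            "{:?}",
            backoff_delay(attempt, Duration::from_millis(100), Duration::from_secs(5))
        );
    }
}
```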

Phase 3: Architecture (Week 5)

Task 007: Router Factory

Replace enum with trait-based design, enabling future dual-mode operation:

#[async_trait]
pub trait Router: Send + Sync {
    async fn route(&self, req: HttpRequest, body: Value, route: &str) -> HttpResponse;
    async fn add_worker(&self, worker: Arc<dyn Worker>) -> Result<(), RouterError>;
    async fn remove_worker(&self, url: &str) -> Result<(), RouterError>;
    fn apply_discovery_update(&self, update: DiscoveryUpdate);
}

pub struct RouterFactory;

impl RouterFactory {
    pub async fn create_router(config: &RouterConfig) -> Result<Arc<dyn Router>, RouterError>;
    // Future: create_api_server(config) for full OpenAI API mode
}

This factory pattern is essential for supporting both traditional proxy mode and future API server mode, allowing runtime selection based on configuration.
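The dispatch at the heart of the factory can be sketched as a single `match` from config to a boxed trait object. The `SelectPolicy` trait and the trimmed `PolicyConfig` below are illustrative; the point is that adding a policy means adding one match arm plus a file under `router/policies/`, not editing a closed enum of routers.

```rust
/// Illustrative policy trait; the real one is the async RoutingPolicy.
pub trait SelectPolicy: Send + Sync {
    fn name(&self) -> &'static str;
}

pub struct Random;
pub struct RoundRobin;
pub struct CacheAware { pub threshold: f32 }

impl SelectPolicy for Random { fn name(&self) -> &'static str { "random" } }
impl SelectPolicy for RoundRobin { fn name(&self) -> &'static str { "round_robin" } }
impl SelectPolicy for CacheAware { fn name(&self) -> &'static str { "cache_aware" } }

/// Trimmed copy of the PolicyConfig enum from Task 001.
pub enum PolicyConfig {
    Random,
    RoundRobin,
    CacheAware { threshold: f32 },
}

/// Factory dispatch: configuration picks the concrete policy at startup,
/// and the rest of the router only sees the trait object.
pub fn create_policy(cfg: &PolicyConfig) -> Box<dyn SelectPolicy> {
    match cfg {
        PolicyConfig::Random => Box::new(Random),
        PolicyConfig::RoundRobin => Box::new(RoundRobin),
        PolicyConfig::CacheAware { threshold } => {
            Box::new(CacheAware { threshold: *threshold })
        }
    }
}

fn main() {
    println!("{}", create_policy(&PolicyConfig::RoundRobin).name());
}
```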

Long-Term Vision

From Load Balancer to Full OpenAI API Server

The architectural improvements proposed in this document are designed with a transformative long-term vision: evolving sgl-router from a simple HTTP proxy into a fully-featured OpenAI-compatible API server that directly integrates with SGLang's backend services.

Target Capabilities

  1. Dual Operating Modes

    • Traditional Router Mode: Continue supporting the current proxy behavior for backward compatibility
    • API Server Mode: Full OpenAI API implementation with advanced features
  2. Native OpenAI API Implementation

    • Complete endpoint compatibility (chat/completions, completions, embeddings, etc.)
    • Built-in request validation and processing
    • Streaming response support with proper SSE formatting
    • Error handling matching OpenAI's API behavior
  3. Tool Calling Framework

    • Native support for function/tool calling without relying on backend servers
    • Extensible executor system (HTTP, Python, Shell, custom integrations)
    • Tool result integration directly in the conversation flow
    • Security sandboxing and permission management
  4. Direct gRPC Communication

    • Replace HTTP forwarding with efficient gRPC calls to SGLang's scheduler
    • Connection pooling and load balancing
    • Streaming support for real-time token generation
    • Reduced latency by avoiding per-request HTTP parsing and serialization overhead

Implementation Phases

Detailed Timeline

gantt
    title SGLang Router Improvement Timeline
    dateFormat  YYYY-MM-DD
    section Phase 1
    Configuration Module         :t1, 2025-06-26, 5d
    Worker Abstraction           :t2, after t1, 6d
    RoutingPolicy Trait          :t3, after t2, 7d
    Policy Migration             :t4, after t3, 6d
    section Phase 2
    Centralized Observability    :t5, after t4, 4d
    Enhanced Service Discovery   :t6, after t4, 6d
    section Phase 3
    Router Factory               :t7, after t6, 7d
    section Testing
    Integration Testing          :t8, after t7, 5d
    Performance Validation       :t9, after t8, 3d
    Documentation               :t10, after t8, 3d

Risk Analysis

Technical Risks

| Risk | Impact | Probability | Mitigation |
| --- | --- | --- | --- |
| Performance Regression | High | Medium | Continuous benchmarking, profiling |
| Breaking Changes | High | Low | Feature flags, gradual rollout |
| Memory Leaks | Medium | Low | Stress testing, leak detection |
| Thread Safety Issues | High | Medium | Race condition testing, careful review |

Conclusion

This comprehensive improvement plan addresses fundamental architectural issues while maintaining system stability. The phased approach ensures each improvement builds on the previous, creating a more maintainable, extensible, and reliable routing system for SGLang.

Appendix: Architecture Diagrams

High-Level Architecture

graph TB
    subgraph "Client Layer"
        PY[Python Client<br/>SGLang]
        HTTP[HTTP Client<br/>OpenAI Compatible]
    end

    subgraph "Router Layer"
        R[Router<br/>lib.rs/PyO3]
        S[HTTP Server<br/>server.rs]

        subgraph "Routing Modes"
            REG[Regular Router<br/>router.rs]
            PD[PD Router<br/>pd_router.rs]
        end

        subgraph "Routing Policies"
            RND[Random]
            RR[RoundRobin]
            CA[CacheAware<br/>+ Tree]
            P2[PowerOfTwo]
        end
    end

    subgraph "Infrastructure"
        SD[Service Discovery<br/>K8s Integration]
        PROM[Prometheus<br/>Metrics]
        LOG[Logging<br/>tracing]
    end

    subgraph "Worker Layer"
        subgraph "Regular Workers"
            W1[Worker 1]
            W2[Worker 2]
            WN[Worker N]
        end

        subgraph "PD Workers"
            PF1[Prefill 1]
            PF2[Prefill 2]
            D1[Decode 1]
            D2[Decode 2]
        end
    end

    PY --> R
    HTTP --> S
    R --> S
    S --> REG
    S --> PD
    REG --> RND
    REG --> RR
    REG --> CA
    PD --> RND
    PD --> P2
    PD --> CA

    REG --> W1
    REG --> W2
    REG --> WN

    PD --> PF1
    PD --> PF2
    PD --> D1
    PD --> D2

    SD --> REG
    SD --> PD
    S --> PROM
    S --> LOG


Component Interactions

sequenceDiagram
    participant C as Client
    participant S as Server
    participant R as Router
    participant P as Policy
    participant W as Worker
    participant SD as ServiceDiscovery
    participant M as Metrics
    
    Note over SD: Continuous Discovery
    SD->>R: Update Workers
    
    C->>S: HTTP Request
    S->>S: Parse & Validate
    S->>R: Route Request
    
    R->>P: Select Worker(s)
    
    alt Regular Mode
        P->>P: Apply Policy Logic
        P-->>R: Selected Worker
        R->>W: Forward Request
        W-->>R: Response
    else PD Mode
        P->>P: Select Prefill & Decode
        P-->>R: Worker Pair
        par Prefill Request
            R->>W: Prefill Request
        and Decode Request
            R->>W: Decode Request
        end
        W-->>R: Merged Response
    end
    
    R-->>S: Response
    S-->>C: HTTP Response
    
    R->>M: Record Metrics
    
    Note over R,W: Health Checks
    loop Every 30s
        R->>W: Health Check
        W-->>R: Status
        R->>M: Update Health
    end
