[Feature] Merge PDLB into SGLang Router #7031

@slin1237

Motivation

Overview

Merge Prefill-Decode Load Balancer (PDLB) functionality into SGLang Router to support both traditional load balancing and prefill-decode disaggregated routing.

Key Insight: PDLB has few or no users, so we can implement the preferred design directly without providing a migration path.

System Architecture

graph TB
    subgraph "Unified SGLang Router"
        A[Router Core] --> B{Policy Detection}
        B --> C[Regular Router]
        B --> D[PD Router]
        
        C --> C1[RoundRobin]
        C --> C2[Random] 
        C --> C3[CacheAware]
        
        D --> D1[PD Random]
        D --> D2[PD PowerOfTwo]
        D --> D3[PD CacheAware]
        
        D3 --> E[Tree-Based Selection]
        E --> E1[Prefill Tree]
        E --> E2[Load Tracking]
    end
    
    subgraph "Worker Infrastructure"
        F[Regular Workers]
        G[Prefill Workers]
        H[Decode Workers]
    end
    
    C --> F
    D --> G
    D --> H
    
    style D3 fill:#2E8B57,color:#fff
    style E fill:#DAA520,color:#fff
    style A fill:#1E90FF,color:#fff

Implementation Phases

Phase 1A: Extract PDLB Components

  • Create pd_types.rs with essential PDLB types
  • Extract EngineInfo, Bootstrap trait, SingleOrBatch<T>
  • Add PDSelectionPolicy enum (Random, PowerOfTwo, CacheAware)
// Essential PDLB types extracted
pub struct EngineInfo {
    pub engine_type: EngineType,
    pub url: String,
    pub bootstrap_port: Option<u16>,
}

#[typetag::serde(tag = "type")]
pub trait Bootstrap {
    fn is_stream(&self) -> bool;
    fn get_batch_size(&self) -> Result<Option<usize>, Error>;
    fn add_bootstrap_info(&mut self, prefill_info: &EngineInfo) -> Result<(), Error>;
}

#[derive(Debug, Clone, PartialEq)]
pub enum PDSelectionPolicy {
    Random,
    PowerOfTwo,
    CacheAware {
        cache_threshold: f32,
        balance_abs_threshold: usize,
        balance_rel_threshold: f32,
    },
}
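
For readers unfamiliar with the PowerOfTwo policy, here is a minimal Python sketch of the general power-of-two-choices technique: sample two distinct workers at random and keep the less loaded one. The `loads` mapping and the name `pick_power_of_two` are hypothetical; how the router actually tracks load is not specified in this proposal.

```python
import random


def pick_power_of_two(loads: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct workers and return the less loaded one."""
    urls = list(loads)
    if not urls:
        raise ValueError("no workers available")
    if len(urls) == 1:
        return urls[0]
    first, second = rng.sample(urls, 2)
    # Ties go to the first sampled worker.
    return first if loads[first] <= loads[second] else second


# Hypothetical in-flight request counts per decode worker.
decode_loads = {"http://decode1:8081": 7, "http://decode2:8081": 3}
chosen = pick_power_of_two(decode_loads, random.Random(0))
```

With only two workers both are always sampled, so the less loaded one is chosen; with larger pools the policy trades a little balance quality for constant-time selection.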

Phase 1B: Core PD Router Extension

  • Extend Router enum with PrefillDecode variant
  • Add PrefillDecodeConfig to PolicyConfig
  • Update all Router methods to handle PD mode
  • Python bindings with Router.new_pd() constructor
// Extended Router enum
pub enum Router {
    RoundRobin { /* existing */ },
    Random { /* existing */ },
    CacheAware { /* existing */ },
    PrefillDecode {
        prefill_workers: Arc<RwLock<Vec<EngineInfo>>>,
        decode_workers: Arc<RwLock<Vec<EngineInfo>>>,
        selection_policy: PDSelectionPolicy,
        prefill_tree: Option<Arc<Mutex<Tree>>>,
        timeout_secs: u64,
        interval_secs: u64,
    },
}

Phase 2: Bootstrap & Dual Dispatch

  • Implement request parsing for PD mode (Bytes → Box<dyn Bootstrap>)
  • Bootstrap injection mechanism for batch/single requests
  • Dual dispatch logic (send to BOTH prefill and decode)
  • Stream handling for PD responses
// PD request routing flow
async fn route_pd_request(
    &self,
    client: &reqwest::Client,
    body: &Bytes,
    route: &str,
) -> Result<HttpResponse, Error> {
    // 1. Parse into typed request (`?` requires a Result return type)
    let mut typed_request: Box<dyn Bootstrap> = parse_pd_request(body, route)?;
    
    // 2. Select prefill and decode servers
    let (prefill, decode) = self.select_pd_pair(client).await;
    
    // 3. Bootstrap injection
    typed_request.add_bootstrap_info(&prefill)?;
    
    // 4. Send to BOTH servers concurrently, return only the decode response
    let (_, decode_response) = tokio::join!(
        send_to_prefill(prefill, &typed_request),
        send_to_decode(decode, &typed_request)
    );
    
    Ok(decode_response)
}
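
The dual dispatch step can be illustrated in Python with `asyncio.gather`, which plays the role `tokio::join!` has in the Rust flow. This is only a sketch: the two `send_*` coroutines are hypothetical stand-ins for the real HTTP calls, and the request is a plain dict rather than a typed struct.

```python
import asyncio


async def send_to_prefill(request: dict) -> str:
    # Stand-in for the HTTP call to the prefill server.
    await asyncio.sleep(0)
    return "prefill-ack"


async def send_to_decode(request: dict) -> str:
    # Stand-in for the HTTP call to the decode server.
    await asyncio.sleep(0)
    return f"decoded:{request['text']}"


async def route_pd_request(request: dict) -> str:
    # Both requests are issued concurrently; the prefill result is
    # discarded and only the decode response goes back to the client.
    _, decode_response = await asyncio.gather(
        send_to_prefill(request),
        send_to_decode(request),
    )
    return decode_response


result = asyncio.run(route_pd_request({"text": "hello"}))
```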

Phase 3: Cache-Aware PD Implementation

  • Adapt existing Tree structure for PD routing
  • Text extraction from PD requests for cache matching
  • Load balancing fallback when system is imbalanced
  • PD-specific metrics and monitoring
graph TD
    A[PD Request] --> B[Extract Text]
    B --> C{Load Imbalanced?}
    
    C -->|Yes - Imbalanced| D[Use Load Balancing]
    D --> D1[Select Least Loaded Prefill]
    D --> D2[Select PowerOfTwo Decode]
    
    C -->|No - Balanced| E[Use Cache-Aware]
    E --> F[Tree Prefix Match]
    F --> G{Match Rate > Threshold?}
    
    G -->|Yes - Cache Hit| H[Route to Matched Worker]
    G -->|No - Cache Miss| I[Route to Smallest Tree Worker]
    
    H --> J[Update Tree & Load Tracking]
    I --> J
    D1 --> J
    D2 --> K[Send Requests]
    J --> K
    
    style H fill:#2E8B57,color:#fff
    style I fill:#B22222,color:#fff
    style D1 fill:#1E90FF,color:#fff
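
The decision flow in the diagram can be sketched in Python as follows. Several details are assumptions made for illustration: the imbalance rule (both the absolute and the relative gap must be exceeded), the use of one cached prefix string per worker in place of a real prefix tree, and all function names.

```python
def is_imbalanced(loads, abs_threshold, rel_threshold):
    """Assumed rule: imbalanced when both the absolute and relative gaps are exceeded."""
    high, low = max(loads.values()), min(loads.values())
    return (high - low) > abs_threshold and high > rel_threshold * low


def match_rate(text, cached_prefix):
    """Fraction of `text` covered by its common prefix with `cached_prefix`."""
    if not text:
        return 0.0
    matched = 0
    for a, b in zip(text, cached_prefix):
        if a != b:
            break
        matched += 1
    return matched / len(text)


def select_prefill(text, loads, cached, tree_sizes,
                   cache_threshold=0.5, abs_threshold=32, rel_threshold=1.0001):
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        # Load-balancing path: least loaded prefill worker.
        return min(loads, key=loads.get)
    best = max(cached, key=lambda w: match_rate(text, cached[w]))
    if match_rate(text, cached[best]) > cache_threshold:
        # Cache hit: route to the worker with the best prefix match.
        return best
    # Cache miss: route to the worker with the smallest tree.
    return min(tree_sizes, key=tree_sizes.get)


# Hypothetical state for two prefill workers.
cached = {"p1": "hello wor", "p2": "goodbye"}
tree_sizes = {"p1": 9, "p2": 7}
hit = select_prefill("hello world", {"p1": 2, "p2": 3}, cached, tree_sizes)
miss = select_prefill("unrelated", {"p1": 2, "p2": 3}, cached, tree_sizes)
skewed = select_prefill("hello world", {"p1": 100, "p2": 3}, cached, tree_sizes)
```

In the third call the load gap overrides the cache match, so the request goes to the less loaded worker even though the other one holds the matching prefix.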

Phase 4: Testing & Polish

  • Comprehensive unit and integration tests
  • Performance benchmarking
  • Documentation and examples
  • Production readiness validation

Python API Design

Desired Implementation

from sglang_router import Router

# Clean PD Router creation
router = Router.new_pd(
    prefill_urls=[("http://prefill1:8080", 9000), ("http://prefill2:8080", None)],
    decode_urls=["http://decode1:8081", "http://decode2:8081"],
    policy="cache_aware",  # "random", "po2", "cache_aware"
    host="127.0.0.1",
    port=3001,
    cache_threshold=0.5,
    balance_abs_threshold=32,
    balance_rel_threshold=1.0001,
)

router.start()

Command Line Interface

# PD mode with cache-aware routing
python -m sglang_router.launch_router \
    --policy prefill_decode \
    --prefill-urls http://prefill1:8080:9000 http://prefill2:8080 \
    --decode-urls http://decode1:8081 http://decode2:8081 \
    --pd-policy cache_aware \
    --cache-threshold 0.6 \
    --host 0.0.0.0 \
    --port 8080
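
The `--prefill-urls` values above appear to carry an optional trailing bootstrap port, mirroring the `(url, port)` tuples in the Python API. Here is a hypothetical sketch of how such an argument could be split; the function name and the parsing rule are assumptions, and IPv6 literals are not handled.

```python
from typing import Optional, Tuple


def parse_prefill_url(arg: str) -> Tuple[str, Optional[int]]:
    """Split 'scheme://host:port[:bootstrap_port]' into (url, bootstrap_port)."""
    scheme, sep, rest = arg.partition("://")
    if not sep:
        raise ValueError(f"missing scheme in {arg!r}")
    parts = rest.split(":")
    if len(parts) == 3 and parts[2].isdigit():
        # host:port:bootstrap_port
        return f"{scheme}://{parts[0]}:{parts[1]}", int(parts[2])
    if len(parts) <= 2:
        # host or host:port, no bootstrap port given
        return arg, None
    raise ValueError(f"cannot parse prefill URL {arg!r}")


with_port = parse_prefill_url("http://prefill1:8080:9000")
without_port = parse_prefill_url("http://prefill2:8080")
```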

Key Challenges

1. Request Processing Paradigm Shift

  • Router: Raw Bytes → Single worker selection
  • PDLB: Typed structs → Bootstrap injection → Dual server dispatch
  • Solution: Bridge both paradigms in PD mode

2. Bootstrap Injection Complexity

  • Must handle SingleOrBatch<T> for batch requests
  • Critical for prefill-decode disaggregation
  • Solution: Integrate PDLB's Bootstrap trait system
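
To make the single-versus-batch distinction concrete, here is a small Python sketch of bootstrap injection. The field names `bootstrap_host`, `bootstrap_port`, and `bootstrap_room`, the random room id, and the dict-based request are all assumptions for illustration; the proposal only states that batch and single requests must both be handled.

```python
import random
from urllib.parse import urlsplit


def add_bootstrap_info(request: dict, prefill_url: str, bootstrap_port, rng: random.Random) -> dict:
    """Attach prefill bootstrap details, one entry per item for batch requests."""
    host = urlsplit(prefill_url).hostname
    text = request["text"]
    if isinstance(text, list):
        # Batch request: every item gets its own entry.
        n = len(text)
        request["bootstrap_host"] = [host] * n
        request["bootstrap_port"] = [bootstrap_port] * n
        request["bootstrap_room"] = [rng.getrandbits(63) for _ in range(n)]
    else:
        # Single request: scalar fields.
        request["bootstrap_host"] = host
        request["bootstrap_port"] = bootstrap_port
        request["bootstrap_room"] = rng.getrandbits(63)
    return request


rng = random.Random(0)
single = add_bootstrap_info({"text": "hi"}, "http://prefill1:8080", 9000, rng)
batch = add_bootstrap_info({"text": ["a", "b"]}, "http://prefill2:8080", None, rng)
```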

3. Cache-Aware PD Routing

  • First-of-its-kind cache-aware routing for PD disaggregation
  • Adaptive switching between cache optimization and load balancing
  • Solution: Adapt Router's Tree for prefill selection

Success Metrics

  • Functionality: All PD selection policies working (random, po2, cache_aware)
  • Performance: No significant latency overhead vs regular routing
  • Compatibility: Existing Router functionality unchanged
  • Production Ready: Comprehensive metrics, health checking, and failover (failover is not covered for PD mode)
