[Feature] Merge PDLB into SGLang Router

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

## Overview

Merge the Prefill-Decode Load Balancer (PDLB) functionality into the SGLang Router so that a single router supports both traditional load balancing and prefill-decode (PD) disaggregated routing.

**Key Insight**: Because PDLB has few or no users, we can implement the preferred design directly, without providing a migration path.

## System Architecture

```mermaid
graph TB
    subgraph "Unified SGLang Router"
        A[Router Core] --> B{Policy Detection}
        B --> C[Regular Router]
        B --> D[PD Router]
        
        C --> C1[RoundRobin]
        C --> C2[Random] 
        C --> C3[CacheAware]
        
        D --> D1[PD Random]
        D --> D2[PD PowerOfTwo]
        D --> D3[PD CacheAware]
        
        D3 --> E[Tree-Based Selection]
        E --> E1[Prefill Tree]
        E --> E2[Load Tracking]
    end
    
    subgraph "Worker Infrastructure"
        F[Regular Workers]
        G[Prefill Workers]
        H[Decode Workers]
    end
    
    C --> F
    D --> G
    D --> H
    
    style D3 fill:#2E8B57,color:#fff
    style E fill:#DAA520,color:#fff
    style A fill:#1E90FF,color:#fff
```

## Implementation Phases

### Phase 1A: Extract PDLB Components

- [x] Create `pd_types.rs` with essential PDLB types
- [x] Extract `EngineInfo`, `Bootstrap` trait, `SingleOrBatch<T>`
- [x] Add `PDSelectionPolicy` enum (Random, PowerOfTwo, CacheAware)

```rust
// Essential PDLB types extracted
pub struct EngineInfo {
    pub engine_type: EngineType,
    pub url: String,
    pub bootstrap_port: Option<u16>,
}

#[typetag::serde(tag = "type")]
pub trait Bootstrap {
    fn is_stream(&self) -> bool;
    fn get_batch_size(&self) -> Result<Option<usize>, Error>;
    fn add_bootstrap_info(&mut self, prefill_info: &EngineInfo) -> Result<(), Error>;
}

#[derive(Debug, Clone, PartialEq)]
pub enum PDSelectionPolicy {
    Random,
    PowerOfTwo,
    CacheAware {
        cache_threshold: f32,
        balance_abs_threshold: usize,
        balance_rel_threshold: f32,
    },
}
```
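As an illustration of the `PowerOfTwo` policy, the sketch below assumes it refers to the classic power-of-two-choices technique: sample two workers at random and keep the less loaded one. It is written in Python for brevity. The `loads` map and the `pick_power_of_two` helper are hypothetical names, and the issue does not specify how load is tracked.

```python
import random


def pick_power_of_two(loads, rng=random):
    """Pick a worker URL using power-of-two-choices.

    `loads` maps worker URL -> number of in-flight requests
    (hypothetical bookkeeping, assumed for this sketch).
    """
    urls = list(loads)
    if not urls:
        raise ValueError("no workers available")
    if len(urls) == 1:
        return urls[0]
    # Sample two distinct candidates, then keep the less loaded one.
    first, second = rng.sample(urls, 2)
    return first if loads[first] <= loads[second] else second


loads = {"http://decode1:8081": 7, "http://decode2:8081": 2}
# With exactly two workers both are always sampled, so the less loaded one wins.
print(pick_power_of_two(loads))  # http://decode2:8081
```

Compared with scanning every worker for the minimum load, this only inspects two entries per request while still never selecting the single most loaded worker.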

### Phase 1B: Core PD Router Extension

- [x] Extend Router enum with `PrefillDecode` variant
- [x] Add `PrefillDecodeConfig` to `PolicyConfig`
- [x] Update all Router methods to handle PD mode
- [x] Python bindings with `Router.new_pd()` constructor

```rust
// Extended Router enum
pub enum Router {
    RoundRobin { /* existing */ },
    Random { /* existing */ },
    CacheAware { /* existing */ },
    PrefillDecode {
        prefill_workers: Arc<RwLock<Vec<EngineInfo>>>,
        decode_workers: Arc<RwLock<Vec<EngineInfo>>>,
        selection_policy: PDSelectionPolicy,
        prefill_tree: Option<Arc<Mutex<Tree>>>,
        timeout_secs: u64,
        interval_secs: u64,
    },
}
```

### Phase 2: Bootstrap & Dual Dispatch

- [x] Implement request parsing for PD mode (`Bytes` → `Box<dyn Bootstrap>`)
- [x] Bootstrap injection mechanism for batch/single requests
- [x] Dual dispatch logic (send to BOTH prefill and decode)
- [x] Stream handling for PD responses

```rust
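// PD request routing flow (sketch; the helper functions are not shown)
async fn route_pd_request(
    &self,
    client: &reqwest::Client,
    body: &Bytes,
    route: &str,
) -> HttpResponse {
    // 1. Parse the raw body into a typed request.
    //    `?` cannot be used here because the function returns HttpResponse,
    //    so parse errors are mapped to a 400 response explicitly.
    let mut typed_request: Box<dyn Bootstrap> = match parse_pd_request(body, route) {
        Ok(request) => request,
        Err(e) => return HttpResponse::BadRequest().body(e.to_string()),
    };

    // 2. Select prefill and decode servers
    let (prefill, decode) = self.select_pd_pair(client).await;

    // 3. Inject the selected prefill server's bootstrap info into the request
    if let Err(e) = typed_request.add_bootstrap_info(&prefill) {
        return HttpResponse::InternalServerError().body(e.to_string());
    }

    // 4. Send to BOTH servers concurrently; only the decode response is returned
    let (_, decode_response) = tokio::join!(
        send_to_prefill(prefill, &typed_request),
        send_to_decode(decode, &typed_request)
    );

    decode_response
}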
```

### Phase 3: Cache-Aware PD Implementation

- [x] Adapt existing Tree structure for PD routing
- [x] Text extraction from PD requests for cache matching
- [x] Load balancing fallback when system is imbalanced
- [x] PD-specific metrics and monitoring

```mermaid
graph TD
    A[PD Request] --> B[Extract Text]
    B --> C{Load Imbalanced?}
    
    C -->|Yes - Imbalanced| D[Use Load Balancing]
    D --> D1[Select Least Loaded Prefill]
    D --> D2[Select PowerOfTwo Decode]
    
    C -->|No - Balanced| E[Use Cache-Aware]
    E --> F[Tree Prefix Match]
    F --> G{Match Rate > Threshold?}
    
    G -->|Yes - Cache Hit| H[Route to Matched Worker]
    G -->|No - Cache Miss| I[Route to Smallest Tree Worker]
    
    H --> J[Update Tree & Load Tracking]
    I --> J
    D1 --> J
    D2 --> K[Send Requests]
    J --> K
    
    style H fill:#2E8B57,color:#fff
    style I fill:#B22222,color:#fff
    style D1 fill:#1E90FF,color:#fff
```
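The decision flow in the diagram can be sketched in Python as follows. This is a simplified illustration, not the router's implementation: it replaces the prefix tree with a plain longest-common-prefix scan over previously seen texts, and it assumes the imbalance rule is "the load gap exceeds `balance_abs_threshold` and the maximum load exceeds `balance_rel_threshold` times the minimum load". The `loads` and `cache` dictionaries and all function names are hypothetical; the default thresholds are the values from the Python API example in this issue.

```python
import os


def is_imbalanced(loads, abs_threshold, rel_threshold):
    """Assumed rule: imbalanced when the load gap exceeds both thresholds."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold


def match_rate(text, cached_texts):
    """Longest shared prefix with any cached text, as a fraction of len(text)."""
    if not text:
        return 0.0
    best = max((len(os.path.commonprefix([text, c])) for c in cached_texts), default=0)
    return best / len(text)


def select_prefill(text, loads, cache, cache_threshold=0.5,
                   balance_abs_threshold=32, balance_rel_threshold=1.0001):
    """`loads`: url -> in-flight requests; `cache`: url -> list of texts seen."""
    if is_imbalanced(loads, balance_abs_threshold, balance_rel_threshold):
        # Load-balancing fallback: pick the least loaded prefill worker.
        chosen = min(loads, key=loads.get)
    else:
        rates = {url: match_rate(text, cache.get(url, [])) for url in loads}
        best = max(rates, key=rates.get)
        if rates[best] > cache_threshold:
            chosen = best  # cache hit: route to the matched worker
        else:
            # Cache miss: route to the worker holding the least cached text.
            chosen = min(loads, key=lambda u: sum(len(t) for t in cache.get(u, [])))
    # Update the (stand-in) tree and load tracking.
    cache.setdefault(chosen, []).append(text)
    loads[chosen] += 1
    return chosen


loads = {"p1": 0, "p2": 0}
cache = {"p1": ["You are a helpful assistant. Hi"], "p2": []}
print(select_prefill("You are a helpful assistant. Bye", loads, cache))  # p1
```

In this sketch the load counter is only incremented; a real router would also decrement it when a request completes.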

### Phase 4: Testing & Polish
**Status**: 

- [x] Comprehensive unit and integration tests
- [x] Performance benchmarking
- [x] Documentation and examples
- [x] Production readiness validation

## Python API Design

### Desired Implementation
```python
from sglang_router import Router

# Clean PD Router creation
router = Router.new_pd(
    prefill_urls=[("http://prefill1:8080", 9000), ("http://prefill2:8080", None)],
    decode_urls=["http://decode1:8081", "http://decode2:8081"],
    policy="cache_aware",  # "random", "po2", "cache_aware"
    host="127.0.0.1",
    port=3001,
    cache_threshold=0.5,
    balance_abs_threshold=32,
    balance_rel_threshold=1.0001,
)

router.start()
```

### Command Line Interface

```bash
# PD mode with cache-aware routing
python -m sglang_router.launch_router \
    --policy prefill_decode \
    --prefill-urls http://prefill1:8080:9000 http://prefill2:8080 \
    --decode-urls http://decode1:8081 http://decode2:8081 \
    --pd-policy cache_aware \
    --cache-threshold 0.6 \
    --host 0.0.0.0 \
    --port 8080
```
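The `--prefill-urls` values above appear to carry an optional bootstrap port as a trailing `:PORT` suffix. Assuming that reading is correct, a minimal Python sketch of splitting such an argument into the `(url, bootstrap_port)` tuples used by `Router.new_pd()` could look like this. The `parse_prefill_arg` helper is hypothetical and does not handle IPv6 literals.

```python
def parse_prefill_arg(arg):
    """Split 'scheme://host:port[:bootstrap_port]' into (url, bootstrap_port).

    Assumes a third colon-separated field after the scheme is the bootstrap port.
    """
    scheme, sep, rest = arg.partition("://")
    if not sep:
        raise ValueError(f"missing scheme in {arg!r}")
    parts = rest.split(":")
    if len(parts) == 3:
        host, port, bootstrap = parts
        return f"{scheme}://{host}:{port}", int(bootstrap)
    if len(parts) in (1, 2):
        return arg, None  # no bootstrap port given
    raise ValueError(f"unexpected number of ':' separators in {arg!r}")


print(parse_prefill_arg("http://prefill1:8080:9000"))  # ('http://prefill1:8080', 9000)
print(parse_prefill_arg("http://prefill2:8080"))       # ('http://prefill2:8080', None)
```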

## Key Challenges

### 1. Request Processing Paradigm Shift
- **Router**: Raw `Bytes` → Single worker selection
- **PDLB**: Typed structs → Bootstrap injection → Dual server dispatch
- **Solution**: Bridge both paradigms in PD mode
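To make the bridge concrete, here is a minimal Python sketch of the PD path: raw bytes are parsed into a structured request, the same request is sent to both servers concurrently, and only the decode response is returned. The `send` coroutine is a stand-in for a real HTTP call, and all names here are hypothetical.

```python
import asyncio
import json


async def send(server, request, delay):
    """Stand-in for an HTTP POST; returns a fake response after `delay` seconds."""
    await asyncio.sleep(delay)
    return {"server": server, "request": request}


async def route_pd_request(body, prefill, decode):
    # Raw bytes -> structured request (the "typed" step).
    request = json.loads(body)
    # Dual dispatch: both requests run concurrently; the prefill result is dropped.
    _, decode_response = await asyncio.gather(
        send(prefill, request, 0.01),
        send(decode, request, 0.02),
    )
    return decode_response


response = asyncio.run(
    route_pd_request(b'{"text": "hi"}', "http://prefill1:8080", "http://decode1:8081")
)
print(response["server"])  # http://decode1:8081
```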

### 2. Bootstrap Injection Complexity
- Must handle `SingleOrBatch<T>` for batch requests
- Critical for prefill-decode disaggregation
- **Solution**: Integrate PDLB's Bootstrap trait system
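The single-versus-batch distinction can be sketched in Python as below. The field names `bootstrap_host`, `bootstrap_port`, and `bootstrap_room`, the use of a random room identifier, and the rule "a list-valued `text` means a batch" are all assumptions made for illustration; the issue itself only names the `Bootstrap` trait and `SingleOrBatch<T>`.

```python
import random
from urllib.parse import urlsplit


def add_bootstrap_info(request, prefill_url, bootstrap_port):
    """Return a copy of `request` with bootstrap fields for one prefill server.

    Field names and the random room id are assumptions for this sketch.
    """
    host = urlsplit(prefill_url).hostname
    text = request.get("text")
    out = dict(request)
    if isinstance(text, list):
        # Batch request: one bootstrap entry per prompt.
        n = len(text)
        out["bootstrap_host"] = [host] * n
        out["bootstrap_port"] = [bootstrap_port] * n
        out["bootstrap_room"] = [random.getrandbits(63) for _ in range(n)]
    else:
        # Single request: scalar bootstrap fields.
        out["bootstrap_host"] = host
        out["bootstrap_port"] = bootstrap_port
        out["bootstrap_room"] = random.getrandbits(63)
    return out


single = add_bootstrap_info({"text": "hi"}, "http://prefill1:8080", 9000)
batch = add_bootstrap_info({"text": ["a", "b"]}, "http://prefill1:8080", 9000)
print(single["bootstrap_host"], len(batch["bootstrap_room"]))  # prefill1 2
```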

### 3. Cache-Aware PD Routing
- First-of-its-kind cache-aware routing for PD disaggregation
- Adaptive switching between cache optimization and load balancing
- **Solution**: Adapt Router's Tree for prefill selection

## Success Metrics

- [x] **Functionality**: All PD selection policies working (random, po2, cache_aware)
- [x] **Performance**: No significant latency overhead vs regular routing
- [x] **Compatibility**: Existing Router functionality unchanged
- [x] **Production Ready**: Comprehensive metrics, health checking, and failover (failover is not covered for PD mode)

### Related resources

_No response_

[Feature] Merge PDLB into SGLang Router #7031
