[router] cache-aware load-balancing router v1 #2114
Merged
ByronHsu merged 9 commits into sgl-project:main on Nov 23, 2024
Conversation
Ying1123 reviewed on Nov 23, 2024
    .map(|kv| kv.key().to_owned())
    .unwrap_or("empty".to_string());

// Traverse from the curr node to the root and update the timestamp
Contributor
Maybe not important, but this could happen during the matching process (the traversal takes time, though it is probably not the bottleneck now).
Ying1123 reviewed on Nov 23, 2024
if curr.children.contains_key(first_id) {
    let child = curr.children.get(first_id).unwrap();

pub fn evict_tenant_data(&self, max_size: usize) {
Contributor
The priority queue (actually a linked list is better) could be maintained incrementally rather than recomputed on every eviction. The current implementation is also fine for moving fast, since it is effectively a lazy eviction and the complexity is amortized.
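One way to read this suggestion, sketched in Python for brevity (the router itself is Rust, and the class and method names here are hypothetical): a hash map layered over a doubly linked list, which is exactly what Python's `OrderedDict` provides, gives O(1) "touch" on access and O(1) removal of the least-recent entry, so eviction order is maintained incrementally instead of being recomputed each eviction pass.

```python
from collections import OrderedDict

class LruTenantTracker:
    """Sketch: keep tenants in access order with a doubly linked list
    (OrderedDict) instead of rebuilding a priority queue per eviction."""

    def __init__(self):
        self.order = OrderedDict()  # tenant -> cached size, oldest first
        self.total = 0

    def touch(self, tenant, size=0):
        if tenant in self.order:
            self.order[tenant] += size
            self.total += size
            self.order.move_to_end(tenant)  # O(1) relink to most-recent end
        else:
            self.order[tenant] = size
            self.total += size

    def evict_until(self, max_size):
        # Pop from the least-recent end until we fit under max_size.
        evicted = []
        while self.total > max_size and self.order:
            tenant, size = self.order.popitem(last=False)
            self.total -= size
            evicted.append(tenant)
        return evicted

tracker = LruTenantTracker()
tracker.touch("a", 4)
tracker.touch("b", 3)
tracker.touch("a")  # "a" becomes most recent, so "b" is evicted first
```

The trade-off versus the lazy recompute-on-eviction approach in the PR is constant bookkeeping on every request in exchange for cheap, exact evictions.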
Collaborator (Author)
Could you explain how the linked list would work?
Ying1123 approved these changes on Nov 23, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025
Motivation
Related to #1732
This PR finishes the first version of the cache-aware load-balancing router. For workloads with long shared prefixes, it can achieve 2x throughput compared with the existing round-robin DP controller.
Usage
The router offers two modes:
1. Co-launch workers and router
This will be a drop-in replacement for the existing --dp-size; this part of the code will be moved into sglang core. Under the hood, it uses multiple processes to launch the sglang workers, waits for them to become healthy, then launches the router.
2. Launch only router
This is useful for multi-node DP: you can launch workers on different nodes, then connect the router to them.
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000

$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT]
                        [--worker-urls WORKER_URLS [WORKER_URLS ...]]
                        [--policy {random,round_robin,cache_aware}]
                        [--cache-threshold CACHE_THRESHOLD]
                        [--cache-routing-prob CACHE_ROUTING_PROB]
                        [--eviction-interval EVICTION_INTERVAL]
                        [--max-tree-size MAX_TREE_SIZE]

options:
  -h, --help            show this help message and exit
  --host HOST           Host address to bind the router server (default: 127.0.0.1)
  --port PORT           Port number to bind the router server (default: 30000)
  --worker-urls WORKER_URLS [WORKER_URLS ...]
                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
  --policy {random,round_robin,cache_aware}
                        Load balancing policy to use (default: cache_aware)
  --cache-threshold CACHE_THRESHOLD
                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --cache-routing-prob CACHE_ROUTING_PROB
                        Probability of using cache-aware routing (0.0-1.0) (default: 1.0)
  --eviction-interval EVICTION_INTERVAL
                        Interval in seconds between cache eviction operations (default: 60)
  --max-tree-size MAX_TREE_SIZE
                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)

Strategy
Cache-Aware Load-Balancing Router
This router combines two strategies to optimize both cache utilization and request distribution:
1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
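As a minimal sketch of the idea (the actual router implements a radix tree in Rust, which compresses shared edges; this simplified Python version uses one node per character, and the class names are hypothetical): each worker gets a tree built from its request history, and a new request is scored by how many leading characters it shares with that history.

```python
class Node:
    def __init__(self):
        self.children = {}  # char -> Node

class ApproxPrefixTree:
    """Per-worker approximation of cached prefixes, built from raw text."""

    def __init__(self):
        self.root = Node()

    def insert(self, text):
        # Record a routed request's text so future prefix matches see it.
        curr = self.root
        for ch in text:
            curr = curr.children.setdefault(ch, Node())

    def match_len(self, text):
        # Length of the longest prefix of `text` present in the tree.
        curr, n = self.root, 0
        for ch in text:
            if ch not in curr.children:
                break
            curr = curr.children[ch]
            n += 1
        return n

tree = ApproxPrefixTree()
tree.insert("You are a helpful assistant. Hi")
# A request sharing the system prompt matches a long prefix:
ratio = tree.match_len("You are a helpful assistant. Bye") / 32
```

Comparing this match ratio against a threshold (cf. --cache-threshold) is what lets the router guess whether a worker likely holds the prefix in its KV cache without ever querying the worker.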
Process:
2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
to the least busy worker for optimal load distribution.
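The two strategies above can be combined as sketched below (an assumption about how the pieces fit, mirroring the CLI flag names; `match_ratio` is a hypothetical callback standing in for the per-worker tree lookup, not the actual Rust logic):

```python
import random

def shortest_queue(workers, pending):
    # pending: dict of worker -> number of in-flight requests
    return min(workers, key=lambda w: pending[w])

def route(text, workers, match_ratio, pending,
          cache_routing_prob=1.0, cache_threshold=0.5, rng=random):
    """With probability cache_routing_prob, try cache-aware routing;
    fall back to the least-busy worker otherwise."""
    if rng.random() < cache_routing_prob:
        # Cache-aware: prefer the worker whose approximate tree matches
        # the largest share of this request's text.
        best = max(workers, key=lambda w: match_ratio(w, text))
        if match_ratio(best, text) >= cache_threshold:
            return best  # likely prefix-cache hit
    # Otherwise balance load across workers.
    return shortest_queue(workers, pending)
```

A structure like this also explains the benchmark note below: with one dominant shared prefix, a probability of 1.0 funnels everything to one worker, while lowering it (e.g. to 0.5) recovers load balance.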
Configuration Parameters
- cache_routing_prob: (float, 0.0 to 1.0)
- cache_threshold: (float, 0.0 to 1.0)
- eviction_interval_secs: (integer)
- max_tree_size: (integer)

Benchmark Results
Generated Shared Prefix Dataset
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix \
    --generated-input-path ~/.cache/gen.json --generated-input-save-path ~/.cache/gen.json

ShareGPT Dataset
The performance does not degrade for the non-cache-heavy case.
Multi Turn Dataset
Generated Shared Prefix Dataset, but with only one system prompt
Fully cache-aware routing shows a performance degradation here because all requests are routed to one node. Tuning the cache routing probability to 0.5 beats naive round-robin.
Reference: https://docs.google.com/spreadsheets/d/1Y_dY4EGpk26MsehoWf6K85p7BXBBWlXI-gk6-Ei5-cs/edit?gid=1463925947#gid=1463925947