[router] cache-aware load-balancing router v1 #2114
Merged
ByronHsu merged 9 commits into sgl-project:main on Nov 23, 2024
Conversation
Ying1123 reviewed on Nov 23, 2024
    .map(|kv| kv.key().to_owned())
    .unwrap_or("empty".to_string());

// Traverse from the curr node to the root and update the timestamp
Contributor
Maybe not important, but this could happen during the matching process (the traversal takes time, though it is probably not the bottleneck now).
Ying1123 reviewed on Nov 23, 2024
if curr.children.contains_key(first_id) {
    let child = curr.children.get(first_id).unwrap();

pub fn evict_tenant_data(&self, max_size: usize) {
Contributor
The priority queue (actually a linked list is better) could be maintained incrementally rather than recomputed on every eviction. The current implementation is also fine for moving fast, since it is effectively a lazy eviction and the complexity is amortized.
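One way to read this suggestion, sketched in Python for brevity (the router itself is Rust, and the class and method names here are hypothetical): a hash map layered over a doubly linked list, which is exactly what Python's `OrderedDict` provides, gives O(1) "touch" on access and O(1) removal of the least-recent entry, so eviction order is maintained incrementally instead of being recomputed each eviction pass.

```python
from collections import OrderedDict

class LruTenantTracker:
    """Sketch: keep tenants in access order with a doubly linked list
    (OrderedDict) instead of rebuilding a priority queue per eviction."""

    def __init__(self):
        self.order = OrderedDict()  # tenant -> cached size, oldest first
        self.total = 0

    def touch(self, tenant, size=0):
        if tenant in self.order:
            self.order[tenant] += size
            self.total += size
            self.order.move_to_end(tenant)  # O(1) relink to most-recent end
        else:
            self.order[tenant] = size
            self.total += size

    def evict_until(self, max_size):
        # Pop from the least-recent end until we fit under max_size.
        evicted = []
        while self.total > max_size and self.order:
            tenant, size = self.order.popitem(last=False)
            self.total -= size
            evicted.append(tenant)
        return evicted

tracker = LruTenantTracker()
tracker.touch("a", 4)
tracker.touch("b", 3)
tracker.touch("a")  # "a" becomes most recent, so "b" is evicted first
```

The trade-off versus the lazy recompute-on-eviction approach in the PR is constant bookkeeping on every request in exchange for cheap, exact evictions.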
Collaborator (Author)
Could you explain how the linked list would work?
Ying1123 approved these changes on Nov 23, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025
Motivation
Related to #1732
This PR finishes the first version of the cache-aware load-balancing router. For workloads with long shared prefixes, it can achieve 2x throughput compared with the existing round-robin DP controller.
Usage
The router offers two modes:
1. Co-launch workers and router
This will be a drop-in replacement for the existing --dp-size; this part of the code will be moved into sglang core. Under the hood, it uses multiple processes to launch the sglang workers, waits for them to become healthy, then launches the router.
2. Launch only router
This is useful for multi-node DP: you can launch workers on different nodes, then connect the router to them.
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000

$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT]
                        [--worker-urls WORKER_URLS [WORKER_URLS ...]]
                        [--policy {random,round_robin,cache_aware}]
                        [--cache-threshold CACHE_THRESHOLD]
                        [--cache-routing-prob CACHE_ROUTING_PROB]
                        [--eviction-interval EVICTION_INTERVAL]
                        [--max-tree-size MAX_TREE_SIZE]

options:
  -h, --help            show this help message and exit
  --host HOST           Host address to bind the router server (default: 127.0.0.1)
  --port PORT           Port number to bind the router server (default: 30000)
  --worker-urls WORKER_URLS [WORKER_URLS ...]
                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
  --policy {random,round_robin,cache_aware}
                        Load balancing policy to use (default: cache_aware)
  --cache-threshold CACHE_THRESHOLD
                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --cache-routing-prob CACHE_ROUTING_PROB
                        Probability of using cache-aware routing (0.0-1.0) (default: 1.0)
  --eviction-interval EVICTION_INTERVAL
                        Interval in seconds between cache eviction operations (default: 60)
  --max-tree-size MAX_TREE_SIZE
                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)

Strategy
Cache-Aware Load-Balancing Router
This router combines two strategies to optimize both cache utilization and request distribution:
1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
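As a minimal sketch of the idea (the actual router implements a radix tree in Rust, which compresses shared edges; this simplified Python version uses one node per character, and the class names are hypothetical): each worker gets a tree built from its request history, and a new request is scored by how many leading characters it shares with that history.

```python
class Node:
    def __init__(self):
        self.children = {}  # char -> Node

class ApproxPrefixTree:
    """Per-worker approximation of cached prefixes, built from raw text."""

    def __init__(self):
        self.root = Node()

    def insert(self, text):
        # Record a routed request's text so future prefix matches see it.
        curr = self.root
        for ch in text:
            curr = curr.children.setdefault(ch, Node())

    def match_len(self, text):
        # Length of the longest prefix of `text` present in the tree.
        curr, n = self.root, 0
        for ch in text:
            if ch not in curr.children:
                break
            curr = curr.children[ch]
            n += 1
        return n

tree = ApproxPrefixTree()
tree.insert("You are a helpful assistant. Hi")
# A request sharing the system prompt matches a long prefix:
ratio = tree.match_len("You are a helpful assistant. Bye") / 32
```

Comparing this match ratio against a threshold (cf. --cache-threshold) is what lets the router guess whether a worker likely holds the prefix in its KV cache without ever querying the worker.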
Process:
2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
to the least busy worker for optimal load distribution.
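The two strategies above can be combined as sketched below (an assumption about how the pieces fit, mirroring the CLI flag names; `match_ratio` is a hypothetical callback standing in for the per-worker tree lookup, not the actual Rust logic):

```python
import random

def shortest_queue(workers, pending):
    # pending: dict of worker -> number of in-flight requests
    return min(workers, key=lambda w: pending[w])

def route(text, workers, match_ratio, pending,
          cache_routing_prob=1.0, cache_threshold=0.5, rng=random):
    """With probability cache_routing_prob, try cache-aware routing;
    fall back to the least-busy worker otherwise."""
    if rng.random() < cache_routing_prob:
        # Cache-aware: prefer the worker whose approximate tree matches
        # the largest share of this request's text.
        best = max(workers, key=lambda w: match_ratio(w, text))
        if match_ratio(best, text) >= cache_threshold:
            return best  # likely prefix-cache hit
    # Otherwise balance load across workers.
    return shortest_queue(workers, pending)
```

A structure like this also explains the benchmark note below: with one dominant shared prefix, a probability of 1.0 funnels everything to one worker, while lowering it (e.g. to 0.5) recovers load balance.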
Configuration Parameters
- cache_routing_prob: (float, 0.0 to 1.0)
- cache_threshold: (float, 0.0 to 1.0)
- eviction_interval_secs: (integer)
- max_tree_size: (integer)

Benchmark Results
Generated Shared Prefix Dataset
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix \
    --generated-input-path ~/.cache/gen.json --generated-input-save-path ~/.cache/gen.json

ShareGPT Dataset
The performance does not degrade for the non-cache-heavy case.
Multi Turn Dataset
Generated Shared Prefix Dataset, but with only one system prompt
Fully cache-aware routing shows a performance degradation here because all requests are routed to one node. Tuning the cache routing probability to 0.5 beats naive round-robin.
Reference: https://docs.google.com/spreadsheets/d/1Y_dY4EGpk26MsehoWf6K85p7BXBBWlXI-gk6-Ei5-cs/edit?gid=1463925947#gid=1463925947