[serve.llm] Prefix-aware scheduler [2/N] Configure PrefixAwareReplicaScheduler as default scheduler in LLMServer #52725
Conversation
Signed-off-by: Gene Su <e870252314@gmail.com>
Resolved review threads (now outdated) on:
- python/ray/llm/_internal/serve/deployments/routers/prefix_tree_deployment.py
- python/ray/serve/_private/replica_scheduler/llm_pow_2_scheduler.py
- python/ray/serve/_private/replica_scheduler/old_prefix_aware_scheduler.py
- python/ray/serve/_private/replica_scheduler/prefix_aware_scheduler.py
This was also left from our discussion. For v0 we need some interface + example code like this (it doesn't have to work with the YAML build pattern):

```python
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.router import PrefixTreeDeployment
from ray.serve.replica_scheduler import PrefixAwareReplicaScheduler

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

tree_deployment = PrefixTreeDeployment.bind()
# TODO: Somehow make tree_deployment appear when you do `serve.get_deployment_handle("xyz")`.

# Deploy the application.
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
deployment = deployment.options(replica_scheduler_class=PrefixAwareReplicaScheduler)
llm_app = LLMRouter.as_deployment().bind(
    llm_deployments=[deployment], tree_deployment=tree_deployment
)
serve.run(llm_app, blocking=True)
```
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Benchmark scripts moved to https://github.com/anyscale/serve-llm-replica-scheduler-benchmarks
Switching from the default to the prefix-aware request router looks something like setting `replica_scheduler_class=PrefixAwareReplicaScheduler` in the deployment options, as in the example earlier in this thread.
kouroshHakha left a comment:
Just one major comment about not making this request router the default. The rest of the stuff we can merge as is and come back to it during the next iterations.
Resolved review threads (two) on python/ray/llm/_internal/serve/request_router/prefix_aware/prefix_tree.py
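As background for the prefix_tree.py discussion: the scheduler routes a request to the replica whose cache shares the longest prefix with the request, falling back to the minimum request count on a cache miss. The following is a heavily simplified, hypothetical sketch of that idea (all names are illustrative, not the actual prefix_tree.py implementation):

```python
from collections import defaultdict


class SimplePrefixTracker:
    """Hypothetical sketch of prefix-aware replica selection: route a
    request to the replica that shares the longest already-seen prefix;
    on a miss, fall back to the replica with the minimum request count."""

    def __init__(self, replicas):
        self.prefixes = defaultdict(set)       # replica -> prefixes seen
        self.counts = {r: 0 for r in replicas}  # replica -> request count

    def _match_len(self, replica, text):
        # Length of the longest seen prefix of `text` on this replica.
        return max(
            (len(p) for p in self.prefixes[replica] if text.startswith(p)),
            default=0,
        )

    def pick(self, text):
        best = max(self.counts, key=lambda r: self._match_len(r, text))
        if self._match_len(best, text) == 0:
            # No prefix hit anywhere: pick any replica whose count equals
            # the minimum count (the min-count tie-break).
            min_count = min(self.counts.values())
            best = next(r for r, c in self.counts.items() if c == min_count)
        self.counts[best] += 1
        self.prefixes[best].add(text)
        return best
```

Repeated requests with a shared prefix stick to one replica (keeping its KV cache warm), while unrelated requests spread to the least-loaded replica.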
Quoted diff context (comment anchored at `def start_eviction_loop(`):

```python
            if count == min_count
        ]

    def start_eviction_loop(
```
This should be more like a background thread (the event loop should not be kept busy because of eviction).
WIP
P0:
P1:
Why are these changes needed?
Related issue number
Checks
- [ ] I've signed off every commit (by using `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.