[serve.llm] Prefix-aware scheduler [2/N] Configure PrefixAwareReplicaScheduler as default scheduler in LLMServer by jujipotle · Pull Request #52725 · ray-project/ray

jujipotle · 2025-05-01T18:50:45Z

WIP

P0:

Configure LLMRouter to use PrefixAwareScheduler.
Investigate and resolve scheduling task out-of-order issue.
Update the tree with both input and response text using callbacks from the router.
Handle autoscaling.
Update the tree without causing race conditions.
Add tests.

P1:

Implement eviction policy for vLLM replicas
Investigate load balancing vs prefix hit rate
Stress-testing the tree itself
Investigate impact of tree size on traversal time
Investigate SGLang / Dynamo strategies

Why are these changes needed?

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Gene Su <e870252314@gmail.com>

…xt directly Signed-off-by: Gene Su <e870252314@gmail.com>

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

…eploymentConfig currently not working Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

python/ray/llm/_internal/serve/deployments/routers/prefix_tree_deployment.py

python/ray/llm/_internal/serve/deployments/routers/router.py

python/ray/serve/_private/replica_scheduler/llm_pow_2_scheduler.py

python/ray/serve/_private/replica_scheduler/old_prefix_aware_scheduler.py

python/ray/serve/_private/replica_scheduler/prefix_aware_scheduler.py

python/ray/serve/_private/constants.py

python/ray/serve/_private/replica_scheduler/prefix_aware_scheduler.py

python/ray/llm/_internal/serve/deployments/routers/prefix_tree_deployment.py

kouroshHakha · 2025-05-01T23:13:04Z

This was also left from our discussion. For v0 we need some interface + example code like this (It doesn't have to work with yaml build pattern):

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.router import PrefixTreeDeployment
from ray.serve.replica_scheduler import PrefixAwareReplicaScheduler

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)


tree_deployement = PrefixTreeDeployment.bind()
# TODO: Some how make tree_deployment appear when you do 'serve.get_deployment_handle("xyz")`. 

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
deployment = deployment.options(replica_scheduler_class=PrefixAwareReplicaScheduler)
llm_app = LLMRouter.as_deployment().bind(llm_deplyments=[deployment], tree_deployment=tree_deployement)
serve.run(llm_app, blocking=True)

…the deployment config Signed-off-by: Gene Su <e870252314@gmail.com>

Signed-off-by: Gene Su <e870252314@gmail.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji · 2025-06-04T01:03:45Z

Results looking as expected following _benchmarking_scripts/replication_tutorial.md. Will be moving _benchmarking_scripts to an internal repo.

…ca-scheduler-benchmarks Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

… in LLM Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji · 2025-06-05T00:55:14Z

Benchmark scripts moved to https://github.com/anyscale/serve-llm-replica-scheduler-benchmarks

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji · 2025-06-06T02:07:06Z

To change from the default prefix aware request router looks something like this:

from ray import serve
from ray.serve.llm import LLMConfig
from ray.serve._private.request_router.pow_2_router import PowerOfTwoChoicesRequestRouter
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        ),
        request_router_class=PowerOfTwoChoicesRequestRouter
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

kouroshHakha

just one major comment about not making this request router the default. For the rest of the stuff we can merge as is and come back to it during next iterations.

python/ray/llm/_internal/serve/deployments/llm/llm_server.py

python/ray/llm/_internal/serve/request_router/prefix_aware/prefix_tree.py

kouroshHakha · 2025-06-07T01:04:08Z

python/ray/llm/_internal/serve/request_router/prefix_aware/prefix_tree.py

+                if count == min_count
+            ]
+
+    def start_eviction_loop(


This should be more like a background thread. (event loop should not be kept busy because of eviction)

python/ray/serve/_private/request_router/prefix_aware_router.py

kouroshHakha

just one major comment about not making this request router the default. For the rest of the stuff we can merge as is and come back to it during next iterations.

kouroshHakha

just one major comment about not making this request router the default. For the rest of the stuff we can merge as is and come back to it during next iterations.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

python/ray/llm/_internal/serve/deployments/llm/llm_server.py

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

GeneDer and others added 23 commits April 1, 2025 08:04

WIP

594a425

Signed-off-by: Gene Su <e870252314@gmail.com>

Merge branch 'master' into prototype-custom-replica-scheduler

e3fc0a9

WIP: refactor locality minxin

854af70

Signed-off-by: Gene Su <e870252314@gmail.com>

Merge branch 'master' into prototype-custom-replica-scheduler

4f44870

use context vars for tracking flags

33dc3a3

Signed-off-by: Gene Su <e870252314@gmail.com>

serialize replica scheduler

603d074

Signed-off-by: Gene Su <e870252314@gmail.com>

fix update mulitplexed mode id

33d7a53

Signed-off-by: Gene Su <e870252314@gmail.com>

more refactor

2fe7ddf

Signed-off-by: Gene Su <e870252314@gmail.com>

WIP: still breaking, need to test passing the _RequestSchedulingConte…

2b4a89e

…xt directly Signed-off-by: Gene Su <e870252314@gmail.com>

fix bug

df1ac0d

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Merge branch 'master' into prefix-aware-scheduler

762e7ce

Merge branch 'master' into prototype-custom-replica-scheduler

d69314d

begin migrate prefix aware

70b1953

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Opening up replica_scheduler

4105417

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

implement get_deployment_config

9b5e440

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Minimal changes to swap in custom scheduler using llm config, since d…

5b28af4

…eploymentConfig currently not working Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Match requests based on internal_request_id

fea040a

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Refactor prefix aware scheduler with newer design

b205643

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Wrap ObjectRefGenerator so callback can peek without iterating

bf9d5c6

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Fix callback for update tree to work with streaming responses

60c9975

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Clean up

57dd69c

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

Merge branch 'master' into prototype-custom-replica-scheduler

28fee7d

Callback takes longer with streaming

19eed43

Signed-off-by: Justin Ji <justinji@college.harvard.edu>

kouroshHakha reviewed May 1, 2025

View reviewed changes

GeneDer added 5 commits May 1, 2025 16:15

add get_deployment_config api and use the replica scheduler set onto …

75512bf

…the deployment config Signed-off-by: Gene Su <e870252314@gmail.com>

lint

632568e

Signed-off-by: Gene Su <e870252314@gmail.com>

fix getting deployment config

6b88c51

Signed-off-by: Gene Su <e870252314@gmail.com>

fix the key passed into choose_replicas

3d58a07

Signed-off-by: Gene Su <e870252314@gmail.com>

WIP

2ff36e6

Signed-off-by: Gene Su <e870252314@gmail.com>

Adapt to new choose_replicas signature

aaadb12

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji added 6 commits June 4, 2025 09:58

Move benchmark scripts to https://github.com/anyscale/serve-llm-repli…

2efabd2

…ca-scheduler-benchmarks Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Remove injected stats logger

a00c7dc

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Convert warning logs to debugs

25c3cdf

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Expose prefix aware router via request_router __init__.py

5b21833

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Don't export PrefixAwareRequestRouter from serve since it's only used…

ecaca59

… in LLM Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Merge branch 'master' into prefix-aware-scheduler

16b2426

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

eicherseiji added 5 commits June 5, 2025 09:55

Remove _track_metrics from PR

62d9a47

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Update to call base class with warning that autoscaling not supported

b4c3f2b

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Change name to PrefixAwarePow2ReplicaScheduler

0cc56e4

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Use a detached actor to avoid issues with actor lifetime

aa3b071

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Change name to end with 'Router'

e7c8a34

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Autoscaling is now handled by the prefix tree with detached lifetime

dae5f4e

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

kouroshHakha reviewed Jun 7, 2025

View reviewed changes

Set default LLMServer router to PowerOfTwoChoicesRequestRouter

e5c39c7

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

kouroshHakha reviewed Jun 9, 2025

View reviewed changes

python/ray/llm/_internal/serve/deployments/llm/llm_server.py Outdated Show resolved Hide resolved

Don't override Ray Serve router class in LLMDeployment

fd089ac

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

kouroshHakha approved these changes Jun 9, 2025

View reviewed changes

kouroshHakha enabled auto-merge (squash) June 9, 2025 16:20

kouroshHakha merged commit 93192cc into ray-project:master Jun 9, 2025
6 checks passed

eicherseiji mentioned this pull request Jun 11, 2025

(serve.llm) Make deployment conf configurable #53724

Closed

8 tasks

This was referenced Jun 18, 2025

[Serve] Add RouterConfig field to DeploymentConfig to configure RequestRouter #53870

Merged

[serve.llm] Prefix aware router eviction thread improvements #53957

Merged

Conversation

jujipotle commented May 1, 2025 • edited by kouroshHakha Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

WIP

Why are these changes needed?

Related issue number

Checks

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kouroshHakha commented May 1, 2025

Uh oh!

eicherseiji commented Jun 4, 2025

Uh oh!

eicherseiji commented Jun 5, 2025

Uh oh!

eicherseiji commented Jun 6, 2025

Uh oh!

kouroshHakha left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kouroshHakha Jun 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kouroshHakha left a comment

Choose a reason for hiding this comment

Uh oh!

kouroshHakha left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

jujipotle commented May 1, 2025 •

edited by kouroshHakha

Loading