[Serve] model multiplexing and batching do not work together #56633

@patches11

Description

What happened + What you expected to happen

When using multiplexing without batching, everything works fine. However, when batching is added, it appears the RequestContext is incorrect: whichever model is loaded first is used for all subsequent requests.

See the reproduction script below.

Start it locally:

serve run multiplexing_issue:app

Correct output with non_batched:

(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=aaa&arg=1&kind=non_batched"
"Response from model_obj_for_aaa 1"

(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=bbb&arg=1&kind=non_batched"
"Response from model_obj_for_bbb 1"

Correct log output:

INFO 2025-09-17 14:12:00,330 serve 17151 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=17189) _RequestContext(route='/predict', request_id='52c45e21-3731-45c1-900c-fd3d2dc7d7e1', _internal_request_id='0748d457-6a2b-40ca-9a53-daa1a1117fab', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Loading model 'aaa'.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Loading model: aaa
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,515 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Successfully loaded model 'aaa' in 0.1ms.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,526 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- CALL /predict OK 13.0ms
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,496 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- model_id aaa
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,504 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x11914a590>.
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,527 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- POST /predict 200 32.8ms
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Request for model_id: bbb
(ServeReplica:default:MultiplexedModel pid=17189) _RequestContext(route='/predict', request_id='abfa01e8-c9c5-42a9-8bc3-70d50be329b7', _internal_request_id='2479658b-eb9a-437f-a214-0d35f28ce254', app_name='default', multiplexed_model_id='bbb', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Loading model 'bbb'.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Loading model: bbb
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Successfully loaded model 'bbb' in 0.2ms.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- CALL /predict OK 1.3ms
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:19,218 default_APIIngress 9irqpcfm abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- model_id bbb
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:19,222 default_APIIngress 9irqpcfm abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- POST /predict 200 6.0ms

Incorrect output with batched:

(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=aaa&arg=1&kind=batched"
"Response from model_obj_for_aaa 1"

(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=bbb&arg=1&kind=batched"
"Response from model_obj_for_aaa 1"  

Log output for the incorrect case:

INFO 2025-09-17 14:14:47,709 serve 18282 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,193 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- model_id aaa
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,208 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x11b427f10>.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,271 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=18307) _RequestContext(route='/predict', request_id='7ec54120-889e-4be3-96ea-2ec8558141f1', _internal_request_id='dd19a6df-9bf3-4d8e-b5a7-8ef23c826444', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Loading model 'aaa'.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Loading model: aaa
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Successfully loaded model 'aaa' in 0.1ms.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,277 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- CALL /predict OK 57.5ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,278 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- POST /predict 200 86.0ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:53,303 default_APIIngress 19714ahf 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- model_id bbb
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:53,356 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=18307) _RequestContext(route='/predict', request_id='7ec54120-889e-4be3-96ea-2ec8558141f1', _internal_request_id='dd19a6df-9bf3-4d8e-b5a7-8ef23c826444', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:53,356 default_MultiplexedModel 9r086hrb 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- CALL /predict OK 51.5ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:53,357 default_APIIngress 19714ahf 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- POST /predict 200 55.5ms

I am not a Ray expert, but in this output the RequestContext is identical for both requests, including request_id and _internal_request_id, which seems wrong to me.
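This is consistent with how Python ContextVars behave with a single long-lived worker task. The following is an illustration only (plain asyncio, not Ray internals): if one batch-worker task is created under the first request's context, ContextVar values set by later requests never reach it, which would match the logs above.

```python
import asyncio
import contextvars

model_id_var = contextvars.ContextVar("model_id", default=None)
results = []

async def batch_worker(queue):
    # One task handles all batched items; it sees the ContextVar value
    # snapshotted when the task was created, not the enqueuing request's.
    while True:
        item = await queue.get()
        if item is None:
            return
        results.append((item, model_id_var.get()))

async def request(queue, model_id, item):
    model_id_var.set(model_id)  # visible only inside this request's task
    await queue.put(item)

async def main():
    queue = asyncio.Queue()
    model_id_var.set("aaa")  # first request's context at worker creation
    worker = asyncio.create_task(batch_worker(queue))
    await asyncio.create_task(request(queue, "aaa", 1))
    await asyncio.create_task(request(queue, "bbb", 2))
    await queue.put(None)
    await worker

asyncio.run(main())
print(results)  # both items observe model_id "aaa"
```

If `@serve.batch` spawns its batch loop this way, it would explain why the second request's multiplexed_model_id is never seen.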

Versions / Dependencies

Ray 2.48.0
Python 3.11.6

Reproduction script

from ray import serve
from fastapi import FastAPI
import logging

logger = logging.getLogger("ray.serve")

app = FastAPI()

@serve.deployment
class MultiplexedModel:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        logger.info(f"Loading model: {model_id}")
        return f"model_obj_for_{model_id}"

    @serve.batch(max_batch_size=2, batch_wait_timeout_s=0.05, max_concurrent_batches=1)
    async def batched(self, lst):
        model_id = serve.get_multiplexed_model_id()
        logger.info(f"Request for model_id: {model_id}\n{serve.context._get_serve_request_context()}")
        model = await self.get_model(model_id)

        result = []
        for item in lst:
            result.append(f"Response from {model} {item}")
        return result

    async def non_batched(self, item):
        model_id = serve.get_multiplexed_model_id()
        logger.info(f"Request for model_id: {model_id}\n{serve.context._get_serve_request_context()}")
        model = await self.get_model(model_id)

        return f"Response from {model} {item}"

@serve.deployment
@serve.ingress(app)
class APIIngress:
    def __init__(self, model_handle) -> None:
        self.model_handle = model_handle

    @app.post("/predict")
    async def predict(self, model_id: str, arg: str, kind: str):
        logger.info(f"model_id {model_id}")
        if kind == "batched":
            return await self.model_handle.options(multiplexed_model_id=model_id).batched.remote(arg)
        else:
            return await self.model_handle.options(multiplexed_model_id=model_id).non_batched.remote(arg)

app = APIIngress.bind(MultiplexedModel.bind())

Issue Severity

Medium: It is a significant difficulty but I can work around it.
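The workaround I use is roughly the following (a sketch in plain Python, with `load_model` standing in for the `@serve.multiplexed` loader): pass the model id inside each batched item instead of reading it from the request context, then group by model inside the batch handler.

```python
import asyncio
from collections import defaultdict

async def load_model(model_id):
    # Stand-in for the @serve.multiplexed model loader.
    return f"model_obj_for_{model_id}"

async def batched(requests):
    # Each request is a (model_id, arg) tuple supplied by the caller,
    # so the handler never touches the (possibly stale) request context.
    results = [None] * len(requests)
    by_model = defaultdict(list)
    for i, (model_id, arg) in enumerate(requests):
        by_model[model_id].append((i, arg))
    for model_id, items in by_model.items():
        model = await load_model(model_id)
        for i, arg in items:
            results[i] = f"Response from {model} {arg}"
    return results

out = asyncio.run(batched([("aaa", "1"), ("bbb", "2")]))
print(out)  # each item gets its own model
```

Note this bypasses per-model routing: a batch may mix model ids, so replica cache locality is lost compared to working multiplexing.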

Labels

bug (Something that is supposed to be working; but isn't), community-backlog, serve (Ray Serve Related Issue), stability, triage (Needs triage)
