[Serve] model multiplexing and batching does not work together #56633
Closed
Labels
bug (Something that is supposed to be working; but isn't), community-backlog, serve (Ray Serve Related Issue), stability, triage (Needs triage, eg: priority, bug/not-bug, and owning component)
Description
What happened + What you expected to happen
When using multiplexing without batching, everything works as expected. However, once batching is added, the RequestContext appears to be incorrect: whichever model is loaded first is used for all subsequent requests.
See reproduction script attached.
Start it locally:
serve run multiplexing_issue:app
Correct output with non_batched:
(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=aaa&arg=1&kind=non_batched"
"Response from model_obj_for_aaa 1"
(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=bbb&arg=1&kind=non_batched"
"Response from model_obj_for_bbb 1"
Correct log output:
INFO 2025-09-17 14:12:00,330 serve 17151 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=17189) _RequestContext(route='/predict', request_id='52c45e21-3731-45c1-900c-fd3d2dc7d7e1', _internal_request_id='0748d457-6a2b-40ca-9a53-daa1a1117fab', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Loading model 'aaa'.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,514 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Loading model: aaa
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,515 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Successfully loaded model 'aaa' in 0.1ms.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:12,526 default_MultiplexedModel anlir3bn 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- CALL /predict OK 13.0ms
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,496 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- model_id aaa
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,504 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x11914a590>.
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:12,527 default_APIIngress 9irqpcfm 52c45e21-3731-45c1-900c-fd3d2dc7d7e1 -- POST /predict 200 32.8ms
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Request for model_id: bbb
(ServeReplica:default:MultiplexedModel pid=17189) _RequestContext(route='/predict', request_id='abfa01e8-c9c5-42a9-8bc3-70d50be329b7', _internal_request_id='2479658b-eb9a-437f-a214-0d35f28ce254', app_name='default', multiplexed_model_id='bbb', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Loading model 'bbb'.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Loading model: bbb
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- Successfully loaded model 'bbb' in 0.2ms.
(ServeReplica:default:MultiplexedModel pid=17189) INFO 2025-09-17 14:12:19,221 default_MultiplexedModel anlir3bn abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- CALL /predict OK 1.3ms
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:19,218 default_APIIngress 9irqpcfm abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- model_id bbb
(ServeReplica:default:APIIngress pid=17179) INFO 2025-09-17 14:12:19,222 default_APIIngress 9irqpcfm abfa01e8-c9c5-42a9-8bc3-70d50be329b7 -- POST /predict 200 6.0ms
Incorrect output with batched:
(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=aaa&arg=1&kind=batched"
"Response from model_obj_for_aaa 1"
(venv) patches@computer ray-serve-debris % curl -X POST -H "Content-Type: application/octet-stream" "http://localhost:8000/predict?model_id=bbb&arg=1&kind=batched"
"Response from model_obj_for_aaa 1"
Logging for incorrect output:
INFO 2025-09-17 14:14:47,709 serve 18282 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,193 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- model_id aaa
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,208 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x11b427f10>.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,271 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=18307) _RequestContext(route='/predict', request_id='7ec54120-889e-4be3-96ea-2ec8558141f1', _internal_request_id='dd19a6df-9bf3-4d8e-b5a7-8ef23c826444', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Loading model 'aaa'.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Loading model: aaa
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,272 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Successfully loaded model 'aaa' in 0.1ms.
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:51,277 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- CALL /predict OK 57.5ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:51,278 default_APIIngress 19714ahf 7ec54120-889e-4be3-96ea-2ec8558141f1 -- POST /predict 200 86.0ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:53,303 default_APIIngress 19714ahf 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- model_id bbb
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:53,356 default_MultiplexedModel 9r086hrb 7ec54120-889e-4be3-96ea-2ec8558141f1 -- Request for model_id: aaa
(ServeReplica:default:MultiplexedModel pid=18307) _RequestContext(route='/predict', request_id='7ec54120-889e-4be3-96ea-2ec8558141f1', _internal_request_id='dd19a6df-9bf3-4d8e-b5a7-8ef23c826444', app_name='default', multiplexed_model_id='aaa', grpc_context=None, is_http_request=False, cancel_on_parent_request_cancel=False)
(ServeReplica:default:MultiplexedModel pid=18307) INFO 2025-09-17 14:14:53,356 default_MultiplexedModel 9r086hrb 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- CALL /predict OK 51.5ms
(ServeReplica:default:APIIngress pid=18311) INFO 2025-09-17 14:14:53,357 default_APIIngress 19714ahf 0d229b95-331c-47b0-8ab6-78a9bb81d5cf -- POST /predict 200 55.5ms
I am not a Ray expert, but in this output the RequestContext is identical for both requests, including request_id and _internal_request_id, which seems wrong: the second request (for model bbb) is served with the first request's context (for model aaa).
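A plausible mechanism (my assumption, not verified against Serve internals): if the batching wrapper hands items off to a long-lived asyncio task, that task keeps the contextvars snapshot captured when it was created by the first request, so every later item observes the first request's context. A minimal standalone sketch of that asyncio/contextvars behavior, with `request_ctx`, `batch_worker`, and `submit` as hypothetical names:

```python
import asyncio
import contextvars

# Hypothetical stand-in for Serve's internal per-request context variable.
request_ctx = contextvars.ContextVar("request_ctx", default=None)

worker_task = None  # long-lived worker, lazily created by the first request


async def batch_worker(queue):
    # Runs forever in the contextvars snapshot captured at create_task()
    # time, i.e. the context of whichever request created it.
    while True:
        item, fut = await queue.get()
        fut.set_result(f"worker sees ctx={request_ctx.get()} for item={item}")


async def submit(queue, ctx_value, item):
    global worker_task
    request_ctx.set(ctx_value)  # per-request value, like multiplexed_model_id
    if worker_task is None:
        # The first request starts the worker, freezing its context snapshot.
        worker_task = asyncio.create_task(batch_worker(queue))
    fut = asyncio.get_running_loop().create_future()
    await queue.put((item, fut))
    return await fut


async def main():
    queue = asyncio.Queue()
    r1 = await submit(queue, "aaa", 1)
    r2 = await submit(queue, "bbb", 2)
    return r1, r2


results = asyncio.run(main())
print(results[0])  # worker sees ctx=aaa for item=1
print(results[1])  # worker sees ctx=aaa for item=2 -- stale context
```

The second request sets `request_ctx` to "bbb" in its own context, but the worker task still sees "aaa", matching the duplicated context in the logs above.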
Versions / Dependencies
Ray 2.48.0
Python 3.11.6
Reproduction script
from ray import serve
from fastapi import FastAPI
import logging

logger = logging.getLogger("ray.serve")

app = FastAPI()


@serve.deployment
class MultiplexedModel:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        logger.info(f"Loading model: {model_id}")
        return f"model_obj_for_{model_id}"

    @serve.batch(max_batch_size=2, batch_wait_timeout_s=0.05, max_concurrent_batches=1)
    async def batched(self, lst):
        model_id = serve.get_multiplexed_model_id()
        logger.info(f"Request for model_id: {model_id}\n{serve.context._get_serve_request_context()}")
        model = await self.get_model(model_id)
        result = []
        for item in lst:
            result.append(f"Response from {model} {item}")
        return result

    async def non_batched(self, item):
        model_id = serve.get_multiplexed_model_id()
        logger.info(f"Request for model_id: {model_id}\n{serve.context._get_serve_request_context()}")
        model = await self.get_model(model_id)
        return f"Response from {model} {item}"


@serve.deployment
@serve.ingress(app)
class APIIngress:
    def __init__(self, model_handle) -> None:
        self.model_handle = model_handle

    @app.post("/predict")
    async def predict(self, model_id: str, arg: str, kind: str):
        logger.info(f"model_id {model_id}")
        if kind == "batched":
            return await self.model_handle.options(multiplexed_model_id=model_id).batched.remote(arg)
        else:
            return await self.model_handle.options(multiplexed_model_id=model_id).non_batched.remote(arg)


app = APIIngress.bind(MultiplexedModel.bind())

Issue Severity
Medium: It is a significant difficulty but I can work around it.
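One possible workaround (my suggestion, not an approach documented by Ray): carry the model id inside each batch element so the batched method never reads the ambient request context. A standalone sketch without Ray, mirroring the names in the reproduction script; `get_model` and `batched` are plain functions here, and the model "objects" are strings as in the repro:

```python
import asyncio

# Plain-Python mock of the repro's multiplexed loader: model "objects"
# are strings, cached by id.
_models = {}


async def get_model(model_id: str) -> str:
    return _models.setdefault(model_id, f"model_obj_for_{model_id}")


async def batched(batch):
    # Each element is a (model_id, arg) tuple, so every item resolves its
    # own model even if the ambient request context is stale.
    results = []
    for model_id, item in batch:
        model = await get_model(model_id)
        results.append(f"Response from {model} {item}")
    return results


results = asyncio.run(batched([("aaa", 1), ("bbb", 2)]))
print(results)
# ['Response from model_obj_for_aaa 1', 'Response from model_obj_for_bbb 2']
```

On the ingress side this would mean sending `(model_id, arg)` as the payload, e.g. `handle.options(multiplexed_model_id=model_id).batched.remote((model_id, arg))`, keeping the `options` call so routing still groups requests by model. The trade-off is that the batched method must be prepared for a batch containing items for more than one model.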