[Data] [LLM] Stage of engine support flexible compute.size #55480
Description
The engine stage uses a fixed `config.concurrency` and sets both `compute.min_size` and `compute.max_size` to `config.concurrency`:
```python
if isinstance(config.concurrency, int):
    # For CPU-only stages, we leverage auto-scaling to recycle resources.
    processor_concurrency = (1, config.concurrency)
else:
    raise ValueError(
        "``concurrency`` is expected to be set as an integer,"
        f" but got: {config.concurrency}."
    )
```

```python
compute=ray.data.ActorPoolStrategy(
    # vLLM start up time is significant, so if user give fixed
    # concurrency, start all instances without auto-scaling.
    min_size=config.concurrency,
    max_size=config.concurrency,
    max_tasks_in_flight_per_actor=config.experimental.get(
        "max_tasks_in_flight_per_actor", DEFAULT_MAX_TASKS_IN_FLIGHT
    ),
),
```

Use case
In my opinion, `config.concurrency` should be `Tuple[int, int]` rather than `int`, or at least accept `Tuple[int, int]` as an option. The application scenarios of `data.llm` lean toward offline batch processing. Compared with online tasks, offline tasks generally have lower priority (they are often preempted by online tasks) and are less sensitive to latency. In such scenarios, dynamic resource allocation matters more than the overhead of engine startup and shutdown. If the quantity of the most contended resource, GPUs, is fixed, the usability of `data.llm` in offline scenarios is greatly reduced.
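To make the proposal concrete, here is a minimal sketch of how the stage could normalize either form of `concurrency` into the `(min_size, max_size)` pair that `ActorPoolStrategy` expects. The helper name `resolve_concurrency` is hypothetical, not part of the Ray Data API; this only illustrates the requested behavior.

```python
from typing import Tuple, Union


def resolve_concurrency(
    concurrency: Union[int, Tuple[int, int]]
) -> Tuple[int, int]:
    """Hypothetical helper: map ``concurrency`` to (min_size, max_size)."""
    if isinstance(concurrency, int):
        # Current behavior for GPU stages: fixed pool, no auto-scaling.
        return (concurrency, concurrency)
    if (
        isinstance(concurrency, tuple)
        and len(concurrency) == 2
        and all(isinstance(n, int) for n in concurrency)
        and 0 < concurrency[0] <= concurrency[1]
    ):
        # Proposed behavior: elastic pool that Ray Data can scale
        # between min and max as resources are granted or preempted.
        return concurrency
    raise ValueError(
        "``concurrency`` must be an int or a (min, max) tuple of ints,"
        f" but got: {concurrency!r}."
    )
```

The resolved pair would then be passed as `min_size`/`max_size` to `ray.data.ActorPoolStrategy`, so `concurrency=8` keeps today's fixed-pool semantics while `concurrency=(1, 8)` lets the pool shrink under preemption and grow back when GPUs free up.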