[Serve] Provide backpressure on handle metrics push #45776
Description
It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.
Relevant code is around these locations:
ray/python/ray/serve/_private/metrics_utils.py (lines 48 to 73 in 9835610):

```python
async def metrics_task(self, name: str):
    """Periodically runs `task_func` every `interval_s` until `stop_event` is set.

    If `task_func` raises an error, an exception will be logged.
    """
    wait_for_stop_event = asyncio.create_task(self.stop_event.wait())
    while True:
        if wait_for_stop_event.done():
            return

        try:
            self._tasks[name].task_func()
        except Exception as e:
            logger.exception(f"Failed to run metrics task '{name}': {e}")

        sleep_task = asyncio.create_task(
            self._async_sleep(self._tasks[name].interval_s)
        )
        await asyncio.wait(
            [sleep_task, wait_for_stop_event],
            return_when=asyncio.FIRST_COMPLETED,
        )
        if not sleep_task.done():
            sleep_task.cancel()
```

ray/python/ray/serve/_private/router.py (lines 258 to 265 in 9835610):

```python
self._controller_handle.record_handle_metrics.remote(
    send_timestamp=time.time(),
    deployment_id=self._deployment_id,
    handle_id=self._handle_id,
    actor_id=self._self_actor_id,
    handle_source=self._handle_source,
    **self._get_aggregated_requests(),
)
```
Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
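One possible shape for the fix (a minimal pure-asyncio sketch, not Ray code; `push_with_backpressure`, `push_fn`, and the skip-counting are illustrative, not the actual Serve implementation): track the in-flight push and skip a scheduled tick while the previous push has not completed, so pushes can never pile up faster than the receiver drains them.

```python
import asyncio


async def push_with_backpressure(push_fn, interval_s: float, num_iterations: int) -> int:
    """Call `push_fn` roughly every `interval_s` seconds, but skip a tick
    if the previous push is still in flight instead of starting another
    concurrent push. Returns the number of skipped ticks (for visibility)."""
    in_flight = None
    skipped = 0
    for _ in range(num_iterations):
        if in_flight is None or in_flight.done():
            # Previous push finished (or first iteration): start a new one.
            in_flight = asyncio.create_task(push_fn())
        else:
            # Previous push still running: apply backpressure by skipping.
            skipped += 1
        await asyncio.sleep(interval_s)
    if in_flight is not None:
        await in_flight
    return skipped
```

In the Serve setting, `push_fn` would await the `record_handle_metrics` call (ObjectRefs are awaitable in async actors), so a slow controller naturally throttles the push rate rather than accumulating stale tasks.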
Use case
Our system is running a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes, which leads to an ever-growing backlog of increasingly stale `record_handle_metrics` tasks queued on the controller; the controller then eventually runs out of memory and crashes.