[Serve] Provide backpressure on handle metrics push #45776

@JoshKarpel

Description

It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.

Relevant code is around these locations:

  • async def metrics_task(self, name: str):
        """Periodically runs `task_func` every `interval_s` until `stop_event` is set.
        If `task_func` raises an error, an exception will be logged.
        """
        wait_for_stop_event = asyncio.create_task(self.stop_event.wait())
        while True:
            if wait_for_stop_event.done():
                return
            try:
                self._tasks[name].task_func()
            except Exception as e:
                logger.exception(f"Failed to run metrics task '{name}': {e}")
            sleep_task = asyncio.create_task(
                self._async_sleep(self._tasks[name].interval_s)
            )
            await asyncio.wait(
                [sleep_task, wait_for_stop_event],
                return_when=asyncio.FIRST_COMPLETED,
            )
            if not sleep_task.done():
                sleep_task.cancel()
  • self._controller_handle.record_handle_metrics.remote(
        send_timestamp=time.time(),
        deployment_id=self._deployment_id,
        handle_id=self._handle_id,
        actor_id=self._self_actor_id,
        handle_source=self._handle_source,
        **self._get_aggregated_requests(),
    )

Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
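
One possible shape for the backpressure, as a minimal sketch rather than a proposed patch: keep a reference to the in-flight push and skip the next interval's push if the previous one has not finished. The sketch uses plain asyncio and no Ray calls. `BackpressuredPusher`, `slow_push`, `skipped`, and `demo` are hypothetical names for illustration, and it assumes the push can be expressed as an awaitable whose completion signals that the controller has handled it.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def _log_failure(task: "asyncio.Future") -> None:
    # Retrieve the result so a failed push is logged instead of silently lost.
    if not task.cancelled() and task.exception() is not None:
        logger.error("Metrics push failed: %r", task.exception())


class BackpressuredPusher:
    """Hypothetical helper: runs `push` every `interval_s`, but never starts
    a new push while the previous one is still in flight."""

    def __init__(self, push: Callable[[], Awaitable[None]], interval_s: float):
        self._push = push
        self._interval_s = interval_s
        self._in_flight: Optional["asyncio.Future"] = None
        self.skipped = 0  # number of intervals where the push was skipped

    def _maybe_push(self) -> bool:
        # Backpressure: if the previous push is unfinished, drop this one.
        if self._in_flight is not None and not self._in_flight.done():
            self.skipped += 1
            return False
        self._in_flight = asyncio.ensure_future(self._push())
        self._in_flight.add_done_callback(_log_failure)
        return True

    async def run(self, stop_event: asyncio.Event) -> None:
        while not stop_event.is_set():
            self._maybe_push()
            try:
                # Sleep for one interval, waking early if asked to stop.
                await asyncio.wait_for(stop_event.wait(), timeout=self._interval_s)
            except asyncio.TimeoutError:
                pass


async def demo() -> Tuple[int, int]:
    started = 0

    async def slow_push() -> None:
        # Stand-in for an overloaded controller: each push takes far
        # longer than the push interval.
        nonlocal started
        started += 1
        await asyncio.sleep(1.0)

    pusher = BackpressuredPusher(slow_push, interval_s=0.01)
    stop = asyncio.Event()
    runner = asyncio.ensure_future(pusher.run(stop))
    await asyncio.sleep(0.1)
    stop.set()
    await runner
    return started, pusher.skipped


if __name__ == "__main__":
    started, skipped = asyncio.run(demo())
    print(f"pushes started: {started}, pushes skipped: {skipped}")
```

With this approach a slow controller sees at most one outstanding push per handle, at the cost of less frequent metrics updates while it is overloaded. Skipping is only one option; lengthening the interval after a slow push would be another.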

Use case

Our system runs a very large number of DeploymentHandles (see #44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes. This leads to an ever-increasing number of increasingly stale record_handle_metrics tasks sitting idle on the controller, which eventually runs out of memory and crashes.

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • enhancement: Request for new feature and/or capability
  • serve: Ray Serve Related Issue
