[Dashboard] metrics page is extremely slow and unreliable #55499

@eric-higgins-ai

What happened + What you expected to happen

When I launch a simple job and open the "Metrics" page in the Ray Dashboard, the page takes 15-30s to load, whereas the same dashboards load in <1s in Grafana. It also often crashes my Chrome tab with SIGTRAP or Error code 5, and it uses far more memory: opening the Default dashboard in Grafana uses 85MB and the Data dashboard uses 102MB, while viewing them in Ray uses 1.4GB.

As far as I can tell, this is happening because the page embeds 60 Grafana iframes. I don't believe the page load time is an issue with my Grafana instance, because most of the time is spent loading static assets and they're all loaded from the memory or disk caches. I'm using the official Grafana image, so it shouldn't be an application code issue there.

I mostly wanted to open this issue to check if there's a reason why the Ray Dashboard embeds each panel individually instead of just embedding the entire Default and Data dashboards. I'm happy to make a PR if this sounds like a good change. We can of course work around this by opening the dashboards in Grafana, but we really like the Ray Dashboard and want to centralize our user flows as much as possible there.
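To make the proposal concrete, here is a minimal sketch of the two embedding strategies, using Grafana's `/d-solo/` (single panel) and `/d/` (whole dashboard) routes. The helper names, the `DashboardRef` shape, the UIDs, and the panel IDs are hypothetical, and the `kiosk=1` parameter is an assumption about how the full dashboard would be shown without Grafana's own chrome. This is not the Ray Dashboard's actual code.

```typescript
// Hypothetical description of a Grafana dashboard to embed.
interface DashboardRef {
  baseUrl: string; // Grafana host, e.g. "http://localhost:3000" (placeholder)
  uid: string;     // dashboard UID (placeholder)
  slug: string;    // dashboard slug (placeholder)
}

// Per-panel embedding: one iframe URL per panel via the "solo" route.
function soloPanelUrl(d: DashboardRef, panelId: number): string {
  const params = new URLSearchParams({ panelId: String(panelId) });
  return `${d.baseUrl}/d-solo/${d.uid}/${d.slug}?${params.toString()}`;
}

// Whole-dashboard embedding: a single iframe URL per dashboard.
// Assumption: kiosk=1 hides Grafana's navigation around the panels.
function fullDashboardUrl(d: DashboardRef): string {
  const params = new URLSearchParams({ kiosk: "1" });
  return `${d.baseUrl}/d/${d.uid}/${d.slug}?${params.toString()}`;
}

// Number of iframes the page would create under each strategy.
function iframeCount(panelIdsPerDashboard: number[][], perPanel: boolean): number {
  return perPanel
    ? panelIdsPerDashboard.reduce((sum, ids) => sum + ids.length, 0)
    : panelIdsPerDashboard.length;
}
```

With two dashboards (Default and Data), the whole-dashboard approach creates two iframes regardless of how many panels each contains, whereas the per-panel approach creates one Grafana page load per panel.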

Versions / Dependencies

Ray 2.47.1
Python 3.10
Ubuntu 24.04
Grafana 12.0.1

Reproduction script

Launch any Ray job and open the Metrics page in the Ray Dashboard.

Issue Severity

Low: It annoys or frustrates me.

Labels

P1, bug, community-backlog, observability, performance
