
[core] improve the robustness of metric reporting#55152

Merged

edoakes merged 1 commit into master from can-otel16 on Aug 21, 2025

Conversation

@can-anyscale (Contributor) commented Aug 1, 2025

Context
This PR is related to the Ray metrics infrastructure. In Ray, each node runs a centralized process called the DashboardAgent, which acts as a gRPC server. Other processes on the same node send their per-process metrics to this server. The DashboardAgent then aggregates these metrics and exports them to Prometheus.
The DashboardAgent is spawned by the raylet.

Problem
Currently, the GCS server also depends on the DashboardAgent to export its internal metrics. However, the DashboardAgent is often spawned significantly later than the GCS. This leads to failed RPC requests from GCS to the DashboardAgent, resulting in dropped metrics and noisy error logs in GCS.

In theory, the same issue also applies to raylet and core_worker, although I did not observe the initialization gap during my testing, likely because the gap is small.

Solution
To address this, this PR introduces an InitExporter function inside the MetricsAgentClient that repeatedly retries the connection to the DashboardAgent until it succeeds. Only after a successful connection is established will GCS/raylet/core-worker begin sending metrics.

To check if the gRPC server in DashboardAgent is ready, the implementation creates a new HealthCheck endpoint on the server side of the MetricsAgent.
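The retry-until-healthy pattern described above can be sketched as follows. This is an illustrative, dependency-injected sketch (names like `health_check` and `on_ready` are stand-ins for the new HealthCheck RPC and the exporter-init callback, not Ray's actual API):

```python
import time

def init_exporter(health_check, on_ready, max_retries=30, retry_delay_s=1.0,
                  sleep=time.sleep):
    """Probe the agent until it is healthy, then start exporting metrics.

    health_check: stand-in for the HealthCheck gRPC call to the DashboardAgent.
    on_ready:     stand-in for e.g. stats::InitOpenTelemetryExporter(...).
    """
    for _attempt in range(max_retries):
        if health_check():
            on_ready()       # only now do GCS/raylet/core_worker send metrics
            return True
        sleep(retry_delay_s)  # wait before the next probe
    return False              # gave up: metrics will not be exported
```

The `sleep` parameter is injectable so tests can run without real delays.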

I fixed this only for the OpenTelemetry-backed infrastructure. Addressing it for OpenCensus is complicated (it doesn't lazily buffer metrics before initialization), so I didn't spend more effort on something that will soon be deprecated.

Test:

  • CI
  • test_metrics_agent.py is a comprehensive test that provides confidence that everything is working properly. Run it locally and check that errors are no longer observed in the GCS log.

@gemini-code-assist (bot, Contributor) left a comment:

Summary of Changes

Hello @can-anyscale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the reliability of metric reporting from the GCS server by introducing a retry mechanism for connecting to the DashboardAgent. It also refactors the metric initialization process to be more resilient to startup order dependencies, ensuring that critical system metrics are consistently collected without generating excessive error logs.

Highlights

  • Robust GCS Metric Initialization: I've implemented a new InitStats function within the GCS server that robustly retries connecting to the DashboardAgent for metric reporting. This addresses the issue where the DashboardAgent might not be ready when GCS starts, leading to dropped metrics and noisy logs.
  • Separation of Metric Agent Initialization: The OpenTelemetry metric agent registration logic has been refactored into a dedicated InitOpenTelemetryMetricAgent function. This allows for delayed and retried initialization, ensuring metrics are only sent once the agent is confirmed to be running, while the legacy OpenCensus path remains in stats::Init.
  • Enhanced Logging for Metrics: Added new log statements in the reporter_agent.py and open_telemetry_metric_recorder.cc to provide better visibility into when histogram metrics are registered and when any metrics are being exported or recorded.
  • Test Suite Adjustments: Modified test_metrics_agent.py by removing a macOS-specific skip condition for OpenTelemetry and commenting out a significant portion of existing tests. This suggests either a temporary disablement or a planned refactor of the test suite.

@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request improves the robustness of GCS metric reporting by introducing a retry mechanism for connecting to the DashboardAgent. This prevents dropped metrics and noisy logs when the GCS starts before the agent. The changes involve refactoring the stats initialization logic and adding the retry logic in the GCS server.

My review has identified a few issues:

  • A critical issue in test_metrics_agent.py where a large number of tests are commented out, which severely impacts test coverage.
  • A high-severity logic flaw in gcs_server.cc where the check for consecutive successful probes is not correctly implemented.
  • Several medium-severity issues related to leftover debugging logs in Python and C++ code that should be removed or changed to a lower log level before merging.

The overall approach is sound, but these issues should be addressed to ensure code quality and correctness.

@can-anyscale can-anyscale force-pushed the can-otel16 branch 2 times, most recently from 139312f to 12b5ccd Compare August 1, 2025 21:10
gcs_task_manager_->SetUsageStatsClient(usage_stats_client_.get());
}

void GcsServer::InitStats(int retry_count, int port_check_success_count) {
@can-anyscale (Contributor, Author) commented:

main logic of this PR

@can-anyscale can-anyscale marked this pull request as ready for review August 1, 2025 21:11
@can-anyscale can-anyscale requested a review from a team as a code owner August 1, 2025 21:11
if (!is_stopped_) {
RAY_LOG(INFO) << "Stopping GCS server.";

// Cancel any pending stats initialization timer
@owenowenisme (Member) commented:

I'm curious: wouldn't stats_init_timer_->cancel() immediately cancel the pending stats_init_timer_ right after InitStats schedules it?

IIUC, this is async, so InitStats() doesn't block there (and shouldn't), but this defeats the purpose of waiting for the timer interval.

@can-anyscale (Contributor, Author) replied:

thanks @owenowenisme, this is called inside the shutdown path of GcsServer, so we do want to stop stats_init_timer_ and any async work in its queue.
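The shutdown interaction being discussed here can be illustrated with a small stdlib analog (not Ray code; the class and method names are hypothetical): once stop() runs, the pending retry timer is cancelled and no further InitStats attempts fire.

```python
import threading

class StatsInitRetrier:
    """Illustrative analog of a retrying stats-init timer with a shutdown path."""

    def __init__(self, try_init, delay_s=0.01):
        self._try_init = try_init   # returns True once initialization succeeds
        self._delay_s = delay_s
        self._timer = None
        self._stopped = False

    def schedule_retry(self):
        if self._stopped:
            return                   # shutdown already happened; do nothing
        self._timer = threading.Timer(self._delay_s, self._run)
        self._timer.start()

    def _run(self):
        if not self._stopped and not self._try_init():
            self.schedule_retry()    # keep retrying until success or stop()

    def stop(self):
        # Shutdown path: cancel the pending timer and drop queued retries.
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()
```

This mirrors the point above: cancel() in the shutdown path does not defeat the retry interval, because it only runs when the whole server is being torn down.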

@owenowenisme (Member) replied:

Thanks for clarifying! The GcsServer::Stop() section wasn't expanded in the diff, so I misread it, haha.

@edoakes (Collaborator) commented Aug 4, 2025:

To check if the gRPC server in DashboardAgent is ready, the implementation uses CheckPortFree as a proxy. Since this check is not foolproof, the function waits for at least two consecutive successful probes before assuming the DashboardAgent is ready to accept requests.

This is hacky/error prone. Why don't we just send an RPC to the agent? Can add another healthz method.

@can-anyscale (Contributor, Author) replied:

@edoakes - @jjyao was suggesting the same thing. I’ll investigate. I had initially hoped to avoid the complexity of sending actual gRPC requests for validation, but I agree that doing so would be more reliable.

@can-anyscale (Contributor, Author):

@edoakes, per @jjyao's comments: added the HealthCheck endpoint to the reporter agent and used it to validate whether the gRPC server is already running.
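The server side of this idea can be sketched as a readiness flag that the health endpoint reports. This is a minimal hypothetical sketch (the real endpoint lives in reporter_agent.py and speaks gRPC; the class and status strings here are illustrative assumptions):

```python
class ReporterAgentStub:
    """Hypothetical sketch: report healthy only once the server is up."""

    def __init__(self):
        self._ready = False

    def start_grpc_server(self):
        # Real code would bind and start the gRPC server here, then flip
        # the flag once the server is actually accepting requests.
        self._ready = True

    def HealthCheck(self):
        # Clients (GCS/raylet/core_worker) poll this before sending metrics.
        return {"status": "SERVING" if self._ready else "NOT_SERVING"}
```

The key property is that the endpoint cannot answer SERVING before the server it guards is accepting requests, which is exactly what the port-probe approach could not guarantee.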

@can-anyscale can-anyscale force-pushed the can-otel16 branch 3 times, most recently from 9c01cae to b2192b9 Compare August 5, 2025 00:52
@can-anyscale can-anyscale added the go label (add ONLY when ready to merge, run all tests) Aug 5, 2025
@can-anyscale can-anyscale force-pushed the can-otel16 branch 4 times, most recently from f9c1361 to c4418f7 Compare August 8, 2025 04:52
Comment on lines +542 to +543
# This is a health check endpoint for the reporter agent.
# It is used to check if the reporter agent is ready to receive requests.
Collaborator:

nit: docstring format

Comment on lines +121 to +135
std::chrono::milliseconds(
absl::ToInt64Milliseconds(StatsConfig::instance().GetReportInterval())),
std::chrono::milliseconds(
absl::ToInt64Milliseconds(StatsConfig::instance().GetHarvestInterval())));
Collaborator:

should add inline comments for what these params are, like: /*foobar_timeout_ms=*/...

gcs_task_manager_->SetUsageStatsClient(usage_stats_client_.get());
}

void GcsServer::InitStats(int retry_count) {
Collaborator:

Hm -- why are we making this async at all? I would expect that once the GcsServer initialization sequence is done, everything is fully initialized. I don't see a strong detractor from blocking here until metrics are available and giving up after some generous timeout.

@can-anyscale (Contributor, Author) replied Aug 11, 2025:

Mostly due to performance concerns: I'm seeing a 5–10s delay before the dashboard agent starts, which would be a noticeable regression in GCS startup time if we blocked on it. On the other hand, a small latency for the few metrics emitted during GCS startup is, I think, easier to accept.

But let me know otherwise

Collaborator:

Where is the 5-10s coming from? If we can fix that, best of both worlds :)

@can-anyscale (Contributor, Author):

It's quite complex and part of the raylet startup sequence: the raylet process starts the node_manager (main.cc#L777), which in turn starts the DashboardAgent (node_manager.cc#L213). The DashboardAgent then spawns a Python process running agent.py#L168, which starts the gRPC server. This sequence is synchronous/blocking, so I believe the gRPC server is already running as soon as the node_manager has started, @jjyao. The majority of the 5-10s likely comes from this sequence (minus the delay of the io_service queue, which I think is minimal).

Collaborator:

Hm, all of those steps should be very fast, so I would only expect this to take ~ms. There is hidden time going somewhere...

@can-anyscale (Contributor, Author):

You're right! I profiled some steps and found that most of the time is spent initializing reporter_agent.py itself, not in the path leading to it. Within its initialization, this line alone takes over 5 seconds.

This line is essentially just a wrapper around a Prometheus API, so it's puzzling why Prometheus initialization takes so long. The bad news is that the gRPC server we need depends on the Prometheus client starting first.

[Screenshot: profiling output, 2025-08-13]

@can-anyscale can-anyscale force-pushed the can-otel16 branch 2 times, most recently from 69874ea to 348e6a6 Compare August 11, 2025 14:39
@can-anyscale can-anyscale requested a review from edoakes August 11, 2025 14:39
@can-anyscale (Contributor, Author): addressed @edoakes's comments.

@can-anyscale can-anyscale requested a review from jjyao August 13, 2025 16:51
@can-anyscale (Contributor, Author):

After chatting offline with @jjyao, I now realize that the issue this PR addresses also applies to the raylet; we've just been lucky not to see it in raylet-DashboardAgent communication, likely because the timing gap there is much smaller. Here's what I have in mind for mid- and long-term fixes based on all our discussions:

Mid-term fix:

  • Essentially what this PR proposes: introduce a C++ class/component to verify if the DashboardAgent is up and running. This component would exist in both GCS and raylet, providing a reliable point for metrics and event exporters to establish their connections.
  • Metrics and event exporters would then be required to act as buffers, storing events and sending them once the DashboardAgent is ready.
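The buffer-until-ready behavior proposed in the second bullet can be sketched like this. It is an assumed design (not Ray's implementation); `send` stands in for the gRPC call to the DashboardAgent:

```python
class BufferingExporter:
    """Sketch: hold metrics until the DashboardAgent is ready, then flush."""

    def __init__(self, send):
        self._send = send          # e.g. the gRPC export call to the agent
        self._buffer = []
        self._agent_ready = False

    def export(self, metric):
        if self._agent_ready:
            self._send(metric)     # normal path once the agent is up
        else:
            self._buffer.append(metric)  # hold until the agent is ready

    def on_agent_ready(self):
        # Called by the health-check/retry component once the agent responds.
        self._agent_ready = True
        for metric in self._buffer:      # flush everything recorded so far
            self._send(metric)
        self._buffer.clear()
```

Metrics emitted during startup are thus delayed rather than dropped, which matches the trade-off discussed earlier in the thread.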

Long-term fix (thinking out loud here)

  • Decouple the DashboardAgent startup from both GCS and raylet, making it an independent process. ray up would spawn the DashboardAgent directly, alongside GCS, raylet, etc. as it does today.
  • The DashboardAgent’s startup would be blocking and outlive all other Ray processes. Its startup time would need to be fast enough that this blocking is negligible.

If that sounds good, my thought is to merge this PR first. I need this PR for a smoother rollout of the OTel backend (because OTel is noisier in terms of gRPC errors than the existing OpenCensus implementation). The next step is to apply the approach in this PR to the raylet. The long-term fix could be a good candidate for rewriting the DashboardAgent in Rust, maybe.

CC: @edoakes , @jjyao

@ray-gardener bot added the core (Issues that should be addressed in Ray Core) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Aug 15, 2025
@can-anyscale can-anyscale force-pushed the can-otel16 branch 3 times, most recently from 0334dc1 to 64706a6 Compare August 19, 2025 21:56
@can-anyscale (Contributor, Author) commented Aug 19, 2025:

Per @edoakes's comments: moved the healthcheck/retry/exporter-init logic inside the metrics_agent_client.

  • Also apply the fix to raylet + core worker

@can-anyscale can-anyscale requested a review from edoakes August 19, 2025 21:56
@can-anyscale can-anyscale changed the title [core] improve the robustness of GCS metric reporting [core] improve the robustness of metric reporting Aug 19, 2025
@can-anyscale can-anyscale force-pushed the can-otel16 branch 2 times, most recently from 1e7d52c to d9e5f30 Compare August 20, 2025 01:19
@can-anyscale can-anyscale force-pushed the can-otel16 branch 4 times, most recently from 74effa9 to b87e3f1 Compare August 20, 2025 16:26
@can-anyscale (Contributor, Author):

the test //python/ray/dashboard:modules/job/tests/test_sdk that is failing on CI is also failing/flaky on master

@edoakes (Collaborator) left a comment:

Looks good! Some minor comments. Let's add a unit test for MetricsAgentClient to test the InitExporter logic.

I'd structure this as roughly:

  • Pass a fake RPC client and timer into metrics_agent_client_ to avoid relying on real timeouts etc.
  • In the test, use the callback passed to InitExporter to verify the behavior
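The suggested test structure can be sketched as follows: a fake health-check client with scripted replies and a fake timer that records callbacks instead of sleeping, so retries run synchronously. The interfaces here are assumptions for illustration, not Ray's actual MetricsAgentClient API:

```python
class FakeHealthClient:
    """Scripted HealthCheck replies, in order."""

    def __init__(self, responses):
        self.responses = list(responses)

    def health_check(self):
        return self.responses.pop(0)

class FakeTimer:
    """Records scheduled callbacks instead of waiting on real timeouts."""

    def __init__(self):
        self.pending = []

    def schedule(self, fn, delay_s):
        self.pending.append(fn)

    def fire_all(self):
        while self.pending:
            self.pending.pop(0)()   # run each pending retry synchronously

def init_exporter(client, timer, on_ready):
    # Simplified version of the retry logic under test: probe once, and on
    # failure schedule another attempt via the (fake) timer.
    if client.health_check():
        on_ready()
    else:
        timer.schedule(lambda: init_exporter(client, timer, on_ready), 1.0)
```

Driving `fire_all()` lets the test verify exactly when the `on_ready` callback runs, with no real delays involved.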

Comment on lines +790 to +791
metrics_agent_client_->InitExporter(
[this]() { stats::InitOpenTelemetryExporter(options_.metrics_agent_port); });
Collaborator:

is there anywhere that we need to InitExporter separately from constructing the class? If not, I'd put the logic into the constructor instead of splitting.

@can-anyscale (Contributor, Author) replied:

There are call sites inside metric_exporter.cc; that file is OpenCensus-related, so it will probably be deleted soon. But even with that file deleted, I need to think a bit about whether we want to move this into the constructor: stats::InitOpenTelemetryExporter should only be called once per process, so it might not be suitable to call from a constructor.

Collaborator:

kk we can keep this pattern for now

virtual void InitExporter(std::function<void()> init_exporter_fn) = 0;
};

class MetricsAgentClientImpl : public MetricsAgentClient {
Collaborator:

nit: let's standardize on FooBarInterface for virtual class and FooBar for concrete implementation (that's what we have mostly)

can you rename in a separate PR sometime?

@can-anyscale (Contributor, Author):

sounds good

Comment on lines +98 to +101
/// The maximum number of retries to initialize the exporter. The following values
/// allow for 30 seconds of retries.
int kMetricAgentInitMaxRetries = 30;
int kMetricAgentInitRetryDelayMs = 1000;
Collaborator:

nit: these should either be global constants or else regular instance variables that are set in the constructor

Comment on lines +47 to +52
if (retry_count >= max_retry) {
RAY_LOG(ERROR) << "Failed to initialize exporter. Data will not be exported to the "
"metrics agent.";
return;
}
Collaborator:

IMO we should call the callback with a status (in this case, not OK) and let it figure out what to do (better for testing as well)

@can-anyscale (Contributor, Author):

you mean to defer error handling to init_exporter_fn? let me look into it

Comment on lines +278 to +279
// Init OpenTelemetry exporter.
metrics_agent_client_->InitExporter(
Collaborator:

nit: this is a minor abstraction leak, I would call it something more direct like WaitForServerReady

@can-anyscale (Contributor, Author):

sounds good

@can-anyscale (Contributor, Author) commented Aug 20, 2025:

Addressed @edoakes's comments and added a test for MetricsAgentClient. Since GrpcClient isn't overridable (its methods are template-based), I couldn't create a mock version of it. Instead, I directly mocked the auto-generated HealthCheck function of the client. Given that this function is auto-generated with fixed semantics (from the point of view of the class under test), I think it's reasonable to mock it directly without concern that the mock's semantics will become meaningless in the future; but let me know if there is a better way to achieve the goal.

@can-anyscale can-anyscale force-pushed the can-otel16 branch 3 times, most recently from 54a54fe to 3cd1762 Compare August 21, 2025 00:19
Signed-off-by: Cuong Nguyen <can@anyscale.com>
@edoakes edoakes merged commit 8590c61 into master Aug 21, 2025
5 checks passed
@edoakes edoakes deleted the can-otel16 branch August 21, 2025 21:43
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025

Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling


4 participants