[core] improve the robustness of metric reporting #55152
Conversation
Summary of Changes
Hello @can-anyscale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the reliability of metric reporting from the GCS server by introducing a retry mechanism for connecting to the DashboardAgent. It also refactors the metric initialization process to be more resilient to startup order dependencies, ensuring that critical system metrics are consistently collected without generating excessive error logs.
Highlights
- **Robust GCS Metric Initialization:** I've implemented a new `InitStats` function within the GCS server that robustly retries connecting to the DashboardAgent for metric reporting. This addresses the issue where the DashboardAgent might not be ready when GCS starts, leading to dropped metrics and noisy logs.
- **Separation of Metric Agent Initialization:** The OpenTelemetry metric agent registration logic has been refactored into a dedicated `InitOpenTelemetryMetricAgent` function. This allows for delayed and retried initialization, ensuring metrics are only sent once the agent is confirmed to be running, while the legacy OpenCensus path remains in `stats::Init`.
- **Enhanced Logging for Metrics:** Added new log statements in `reporter_agent.py` and `open_telemetry_metric_recorder.cc` to provide better visibility into when histogram metrics are registered and when any metrics are being exported or recorded.
- **Test Suite Adjustments:** Modified `test_metrics_agent.py` by removing a macOS-specific skip condition for OpenTelemetry and commenting out a significant portion of existing tests. This suggests either a temporary disablement or a planned refactor of the test suite.
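The retry-until-ready pattern described in the highlights can be sketched as follows. This is a minimal Python illustration under assumptions of my own (the names `init_with_retry` and `flaky_probe` are hypothetical); the PR's actual implementation is the C++ `InitStats`/`InitOpenTelemetryMetricAgent` path:

```python
import time

def init_with_retry(probe, on_ready, max_retries=30, delay_s=0.01):
    """Retry `probe` until it succeeds, then run `on_ready` exactly once.

    Returns True if the agent became ready within the retry budget.
    """
    for _ in range(max_retries):
        if probe():
            on_ready()
            return True
        time.sleep(delay_s)  # wait before the next probe
    return False

# Demo: a probe that only succeeds on the third attempt, standing in for
# the DashboardAgent becoming reachable some time after GCS starts.
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    return calls["n"] >= 3

ready = []
ok = init_with_retry(flaky_probe, lambda: ready.append(True))
print(ok, calls["n"])  # → True 3
```

The key property, matching the PR's intent, is that metrics emission (`on_ready`) is deferred until a probe succeeds, instead of failing loudly on every report interval.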
Code Review
This pull request improves the robustness of GCS metric reporting by introducing a retry mechanism for connecting to the DashboardAgent. This prevents dropped metrics and noisy logs when the GCS starts before the agent. The changes involve refactoring the stats initialization logic and adding the retry logic in the GCS server.
My review has identified a few issues:

- A critical issue in `test_metrics_agent.py` where a large number of tests are commented out, which severely impacts test coverage.
- A high-severity logic flaw in `gcs_server.cc` where the check for consecutive successful probes is not correctly implemented.
- Several medium-severity issues related to leftover debugging logs in Python and C++ code that should be removed or changed to a lower log level before merging.

The overall approach is sound, but these issues should be addressed to ensure code quality and correctness.
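The consecutive-probe flaw called out above has a small correct shape: the success counter must reset to zero on any failure, rather than accumulating. A minimal Python sketch (function and threshold names are hypothetical, not from the PR):

```python
def wait_for_consecutive_successes(probe, required=3, max_probes=100):
    """Return True once `probe` succeeds `required` times in a row.

    The streak counter resets on any failure; keeping a cumulative
    count instead is exactly the bug the review points at.
    """
    streak = 0
    for _ in range(max_probes):
        if probe():
            streak += 1
            if streak >= required:
                return True
        else:
            streak = 0  # a failure breaks the streak
    return False

# Demo: a success, then a failure, then three successes in a row.
results = iter([True, False, True, True, True])
ok = wait_for_consecutive_successes(lambda: next(results))
print(ok)  # → True
```

With a cumulative counter, the sequence above would (wrongly) declare success one probe earlier, counting the pre-failure success toward the total.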
force-pushed from 139312f to 12b5ccd
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
```cpp
  gcs_task_manager_->SetUsageStatsClient(usage_stats_client_.get());
}

void GcsServer::InitStats(int retry_count, int port_check_success_count) {
```
main logic of this PR
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
```cpp
if (!is_stopped_) {
  RAY_LOG(INFO) << "Stopping GCS server.";

  // Cancel any pending stats initialization timer
```
I'm curious: wouldn't `stats_init_timer_->cancel();` immediately cancel the waiting `stats_init_timer_` right after `InitStats` starts it?
IIUC this is async, so `InitStats()` doesn't block there (and shouldn't), but this defeats the purpose of waiting for the timer interval.
thanks @owenowenisme, this is called inside the shutdown path of GcsServer, so we do want to stop `stats_init_timer_` and anything async in its queue.
Thanks for clarifying! The row of GcsServer::Stop() didn't expand so I misread haha
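The exchange above is about cancelling a pending retry timer on the shutdown path. A rough Python analogue of that cancellation semantics, using `threading.Timer` (illustrative only; the actual code uses an asio timer in C++, and `RetryTimer` is a made-up name):

```python
import threading

class RetryTimer:
    """A pending one-shot retry that the shutdown path can cancel."""

    def __init__(self, delay_s, fn):
        self.fired = False
        def run():
            self.fired = True
            fn()
        self._timer = threading.Timer(delay_s, run)
        self._timer.start()

    def cancel(self):
        # Safe to call even if the timer has already fired.
        self._timer.cancel()

# Shutdown path: cancel the pending retry before it fires.
t = RetryTimer(0.2, lambda: print("retrying"))
t.cancel()
t._timer.join()   # wait until the timer thread settles
print(t.fired)    # → False
```

As in the thread above, cancelling only matters on shutdown: a timer that has already expired simply runs its callback, while a pending one is dropped from the queue.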
This is hacky/error-prone. Why don't we just send an RPC to the agent? Can add another
force-pushed from 12b5ccd to 98292e4
force-pushed from 9c01cae to b2192b9
force-pushed from f9c1361 to c4418f7
```python
# This is a health check endpoint for the reporter agent.
# It is used to check if the reporter agent is ready to receive requests.
```
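The hunk above documents the new health-check endpoint. As a weaker but self-contained illustration of a readiness probe, one can simply test whether the agent's port accepts TCP connections. Note this is not what the PR does (it adds an application-level gRPC `HealthCheck`), precisely because a port accepting connections does not guarantee the server is actually ready to serve:

```python
import socket

def port_is_ready(host, port, timeout_s=0.5):
    """Probe readiness by attempting a TCP connection to the agent port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Demo against a throwaway listener on localhost.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

ready_before = port_is_ready("127.0.0.1", port)
server.close()
ready_after = port_is_ready("127.0.0.1", port)
print(ready_before, ready_after)  # → True False
```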
```cpp
std::chrono::milliseconds(
    absl::ToInt64Milliseconds(StatsConfig::instance().GetReportInterval())),
std::chrono::milliseconds(
    absl::ToInt64Milliseconds(StatsConfig::instance().GetHarvestInterval())));
```
should add inline comments for what these params are, like: `/*foobar_timeout_ms=*/...`
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
```cpp
  gcs_task_manager_->SetUsageStatsClient(usage_stats_client_.get());
}

void GcsServer::InitStats(int retry_count) {
```
Hm -- why are we making this async at all? I would expect that once the GcsServer initialization sequence is done, everything is fully initialized. I don't see a strong detractor from blocking here until metrics are available and giving up after some generous timeout.
Mostly due to performance concerns: I'm seeing a 5–10s delay before the dashboard agent starts, which would be a noticeable regression in GCS startup time. On the other hand, a small latency for the few metrics emitted during GCS startup is, I think, easier to accept.
But let me know otherwise
Where is the 5-10s coming from? If we can fix that, best of both worlds :)
It's quite complex and part of the raylet startup sequence: the raylet process starts the node_manager (main.cc#L777), which in turn starts the DashboardAgent (node_manager.cc#L213). The DashboardAgent then spawns a Python process running agent.py#L168, which starts the gRPC server. This sequence is synchronous/blocking, so I believe the gRPC server is already running as soon as the node_manager has started, @jjyao. The majority of the 5–10s likely comes from this sequence (minus the delay of the io_service queue, which I think is minimal).
Hm, all of those steps should be very fast, so I would only expect this to take ~ms. There is hidden time going somewhere...
You're right! I profiled some steps and found that most of the time is spent initializing reporter_agent.py itself, not in the path leading to it. Within its initialization, this line alone takes over 5 seconds.
The line is essentially just a wrapper around a Prometheus API, so it's puzzling why Prometheus initialization takes so long. The bad news is that the gRPC server we need depends on the Prometheus client starting first.
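For reference, the kind of step-by-step timing used to locate a slow init line can be sketched like this (the `time.sleep` stands in for the suspect call; nothing here is from the PR):

```python
import time

def timed(label, fn):
    """Run `fn` and report how long it took."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Stand-in for a suspect init step (the real culprit was a Prometheus
# wrapper inside reporter_agent.py's initialization).
_, elapsed = timed("suspect_init", lambda: time.sleep(0.05))
```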
force-pushed from 69874ea to 348e6a6
@edoakes's comments
After chatting offline with @jjyao, I now realize that the issue this PR addresses also applies to the raylet; we've just been lucky not to see it in raylet–DashboardAgent communication, likely because the timing gap there is much smaller. Here's what I have in mind for mid- and long-term fixes based on all our discussions:

Mid-term fix:

Long-term fix (thinking out loud here):

If that sounds good, my thought is to merge this PR first. I need this PR for a smoother rollout of the otel backend (because otel is louder in terms of gRPC errors than the existing OpenCensus implementation). The next step is to apply the approach in this PR to the raylet. The long-term fix could be a good candidate for rewriting the DashboardAgent in Rust, maybe.
force-pushed from 0334dc1 to 64706a6
@edoakes's comments: move the healthcheck/retry/exporter-init logic inside the metrics_agent_client
force-pushed from 1e7d52c to d9e5f30
force-pushed from 74effa9 to b87e3f1
the test
edoakes left a comment
Looks good! Some minor comments. Let's add a unit test for MetricsAgentClient to test the InitExporter logic.
I'd structure this as roughly:

- Pass a fake RPC client and timer into `metrics_agent_client_` to avoid relying on real timeouts etc.
- In the test, use the callback passed to `InitExporter` to verify the behavior
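The suggested test structure, translated into a minimal Python sketch (the real test is C++ against `MetricsAgentClient`; all class and method names here are hypothetical stand-ins):

```python
class FakeHealthClient:
    """Scripted health-check responses, replacing the real RPC client."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def health_check(self):
        return next(self._responses)

class ExporterInitializer:
    """Retries the health check, then fires the init callback once."""
    def __init__(self, client, sleep=lambda s: None, max_retries=5):
        self._client = client
        self._sleep = sleep  # injected so tests never really wait
        self._max_retries = max_retries

    def init_exporter(self, on_ready):
        for _ in range(self._max_retries):
            if self._client.health_check():
                on_ready()
                return True
            self._sleep(1.0)
        return False

# Agent becomes healthy on the third probe; callback fires exactly once.
slept = []
client = FakeHealthClient([False, False, True])
init = ExporterInitializer(client, sleep=slept.append)
fired = []
ok = init.init_exporter(lambda: fired.append(True))
print(ok, fired, slept)  # → True [True] [1.0, 1.0]
```

Injecting the sleep/timer keeps the test deterministic and instant, which is the point of the reviewer's "avoid relying on real timeouts" remark.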
```cpp
metrics_agent_client_->InitExporter(
    [this]() { stats::InitOpenTelemetryExporter(options_.metrics_agent_port); });
```
is there anywhere that we need to InitExporter separately from constructing the class? If not, I'd put the logic into the constructor instead of splitting.
There is one inside metric_exporter.cc; that file is OpenCensus-related, so it will probably be deleted soon. But even with that file deleted, I need to think a bit about whether we want to move this into a constructor: stats::InitOpenTelemetryExporter should only be called once inside each process, so it might not be suitable to call from a constructor.
kk we can keep this pattern for now
```cpp
  virtual void InitExporter(std::function<void()> init_exporter_fn) = 0;
};

class MetricsAgentClientImpl : public MetricsAgentClient {
```
nit: let's standardize on FooBarInterface for virtual class and FooBar for concrete implementation (that's what we have mostly)
can you rename in a separate PR sometime?
src/ray/rpc/metrics_agent_client.h
Outdated
```cpp
/// The maximum number of retries to initialize the exporter. The following values
/// allow for 30 seconds of retries.
int kMetricAgentInitMaxRetries = 30;
int kMetricAgentInitRetryDelayMs = 1000;
```
nit: these should either be global constants or else regular instance variables that are set in the constructor
```cpp
if (retry_count >= max_retry) {
  RAY_LOG(ERROR) << "Failed to initialize exporter. Data will not be exported to the "
                    "metrics agent.";
  return;
}
```
IMO we should call the callback with a status (in this case, not OK) and let it figure out what to do (better for testing as well)
you mean to defer error handling to init_exporter_fn? let me look into it
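The reviewer's suggestion, deferring error handling to the callback via a status argument, could look roughly like this (Python sketch; `Status` here is a stand-in, not Ray's `ray::Status`):

```python
from dataclasses import dataclass

@dataclass
class Status:
    ok: bool
    message: str = ""

def init_exporter(probe, callback, max_retries=3):
    """Invoke `callback` with an OK status on success, or an error status
    after the retry budget is exhausted, instead of only logging."""
    for _ in range(max_retries):
        if probe():
            callback(Status(ok=True))
            return
    callback(Status(ok=False, message="metrics agent never became healthy"))

seen = []
init_exporter(lambda: False, seen.append)  # never healthy
init_exporter(lambda: True, seen.append)   # healthy immediately
print(seen[0].ok, seen[1].ok)  # → False True
```

Because the caller now sees the failure explicitly, a unit test can assert on the status instead of scraping error logs, which is what "better for testing" refers to.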
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
```cpp
// Init OpenTelemetry exporter.
metrics_agent_client_->InitExporter(
```
nit: this is a minor abstraction leak, I would call it something more direct like WaitForServerReady
force-pushed from 67d5a6e to 5918c3c
@edoakes's comments + added a test for MetricsAgentClient. Since GrpcClient isn't override-able (its methods are template-based), I couldn't create a mock version of it. Instead, I directly mocked the auto-generated HealthCheck function of the client. Given that this function is auto-generated with fixed semantics (from the point of view of the class under test), I think it's reasonable to mock it directly without concern that the mock's semantics will become meaningless in the future. But let me know if there is a better way to achieve the goal.
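The direct-mocking approach described above resembles the following `unittest.mock` sketch (illustrative only; the actual test is C++ with a mocked auto-generated `HealthCheck`, and `GeneratedStub` is a made-up name):

```python
from unittest import mock

class GeneratedStub:
    """Stands in for the auto-generated gRPC client stub."""
    def HealthCheck(self, request):
        raise ConnectionError("real RPC not available in unit tests")

stub = GeneratedStub()
# Patch only the generated method; its semantics are fixed from the
# point of view of the class under test, so the mock cannot drift.
with mock.patch.object(GeneratedStub, "HealthCheck", return_value="SERVING"):
    reply = stub.HealthCheck(request=None)
print(reply)  # → SERVING
```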
force-pushed from 54a54fe to 3cd1762
Signed-off-by: Cuong Nguyen <can@anyscale.com>
force-pushed from 3cd1762 to a008e1f
**Context**

This PR is related to the Ray metrics infrastructure. In Ray, each node runs a centralized process called the DashboardAgent, which acts as a gRPC server. Other processes on the same node send their per-process metrics to this server. The DashboardAgent then aggregates these metrics and exports them to Prometheus. The DashboardAgent is spawned by the raylet.

**Problem**

Currently, the `GCS` server also depends on the DashboardAgent to export its internal metrics. However, the DashboardAgent is often spawned significantly later than the GCS. This leads to failed RPC requests from GCS to the DashboardAgent, resulting in dropped metrics and noisy error logs in GCS. The same issue in theory also applies to the `raylet` and `core_worker`, even though the initialization gap was not observed during my testing, likely because the gap is small.

**Solution**

To address this, this PR introduces an `InitExporter` function inside the `MetricsAgentClient` that repeatedly retries the connection to the DashboardAgent until it succeeds. Only after a successful connection is established will GCS/raylet/core-worker begin sending metrics. To check whether the gRPC server in the DashboardAgent is ready, the implementation adds a new `HealthCheck` endpoint on the server side of the `MetricsAgent`.

I fixed this only for the OpenTelemetry-backed infrastructure. Addressing it for OpenCensus has complications (it doesn't lazily store metrics before init), so I won't spend more effort on something that will soon be deprecated.

Test:
- CI
- `test_metrics_agent.py` is a comprehensive test that provides confidence everything is working properly. I ran this test locally and checked that errors are no longer observed in the GCS log.

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>