Skip to content

[core][telemetry/01] migrate python telemetry recorder to opentelemetry#53098

Merged
can-anyscale merged 2 commits intomasterfrom
can-tel01
May 27, 2025
Merged

[core][telemetry/01] migrate python telemetry recorder to opentelemetry#53098
can-anyscale merged 2 commits intomasterfrom
can-tel01

Conversation

@can-anyscale
Copy link
Copy Markdown
Contributor

@can-anyscale can-anyscale commented May 16, 2025

This is a series of PR to migrate metric collection from opencencus to openlemetry. For context, ray has two main parts that instrument and record metric through opencencus: c++ and python part.

The python part consists of three components:

  • A singleton python object (let's call it the TelemetryMetricRecorder) that stores all telemetry that got sent to it in memories.
  • A PrometheusServer that holds this TelemetryMetricRecorder object. The server takes a snapshot of this object, convert telemetries into prometheus metrics and persist the metrics every few seconds.
  • A DashboardAgent, which runs as a cronjob on every ray node, that collects node-level telemetries and send them to the TelemetryMetricRecorder

The c++ part consists of one component (to be deep dived more):

  • A cronjob runs on every ray worker and send telemetries to TelemetryMetricRecorder via grpc messages.

This PR migrates the TelemetryMetricRecorder from using opencencus to opentelemetry. The migration is also control via a flag (which turns on the new stuff by default). Fairly small PR (most files are auto-generated).

Test:

  • CI + new unit test
  • test_metrics_agent.py + test_reporter.py has a lot of e2e tests and they still work
  • e2e test in anyscale (go/cantel01)

@can-anyscale can-anyscale requested a review from a team May 16, 2025 19:24
@can-anyscale can-anyscale added the go add ONLY when ready to merge, run all tests label May 16, 2025
Copy link
Copy Markdown
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approve for dependency change.

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented May 16, 2025

🙌

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented May 16, 2025

@MengjinYan can you review this pls?

@can-anyscale can-anyscale changed the title [can][telemetry/01] migrate python telemetry recorder to opentelemetry [core][telemetry/01] migrate python telemetry recorder to opentelemetry May 17, 2025
@can-anyscale can-anyscale force-pushed the can-tel01 branch 2 times, most recently from 23327c7 to 3c4f326 Compare May 20, 2025 13:30
@can-anyscale can-anyscale force-pushed the can-tel01 branch 3 times, most recently from 3345501 to 1e0aa1e Compare May 22, 2025 02:26
@can-anyscale
Copy link
Copy Markdown
Contributor Author

@MengjinYan's comments

@can-anyscale can-anyscale requested a review from MengjinYan May 22, 2025 23:10
@can-anyscale can-anyscale force-pushed the can-tel01 branch 2 times, most recently from a744a44 to 4ff61cc Compare May 23, 2025 01:48
@can-anyscale
Copy link
Copy Markdown
Contributor Author

@MengjinYan: I changed _observations_by_gauge_name back to the version that deduplicates tags instead of simply appending data points. Previously, I reset _observations_by_gauge_name at collection time (triggered by the /metrics endpoint), but this broke test_reporter in a way that is updated recently by @dayshah: the test calls /metrics twice in a row and the second call returns no data (see here). To keep the behavior as close as it was, I removed the reset of _observations_by_gauge_name. It now needs to be a map rather than a vector of data points to prevent memory leaks.

We can discuss more in person next week perhaps if this makes no sense.

can-anyscale added a commit that referenced this pull request Jun 6, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

---------------

This PR
- Use open telemetry to export gauge metric to grpc endpoint
- Grpc endpoint also export received metrics to prometheus server
- Add unit tests + e2e tests. Note that I create a test job separately
for open telemetry, the grpc server needs to use the network host
directly (it doesn't work in a docker-in-docker environment from CI for
reasons I don't 100% understand yet).

Overall, when `RAY_experimental_enable_open_telemetry_on_core` is True,
gauge metric uses open telemetry and everything else keeps using
opencensus. Otherwise, everything will use opencensus. This ensure
correctness for both options.

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
can-anyscale added a commit that referenced this pull request Jun 11, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

---------------

This PR
- Support Counter metric for `open_telemetry_metric_recorder.cc`
- Add a unittest to make sure it works
- This feature is not used anywhere yet, since the counter part
`open_telemetry_metric_recorder.py` needs to be able to consume the
counter metric as well, if this part is used. I'll add that in the next
PR

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

---------------

This PR
- Support Counter metric for `open_telemetry_metric_recorder.cc`
- Add a unittest to make sure it works
- This feature is not used anywhere yet, since the counter part
`open_telemetry_metric_recorder.py` needs to be able to consume the
counter metric as well, if this part is used. I'll add that in the next
PR

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
can-anyscale added a commit that referenced this pull request Jun 19, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

------------

This PR
- Support Counter metric on dashboard agent side
- Support Counter metric e2e (from worker to dashboard agent)

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
ray-project#53098.

------------

This PR
- Support Counter metric on dashboard agent side
- Support Counter metric e2e (from worker to dashboard agent)

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

---------------

This PR
- Support Counter metric for `open_telemetry_metric_recorder.cc`
- Add a unittest to make sure it works
- This feature is not used anywhere yet, since the counter part
`open_telemetry_metric_recorder.py` needs to be able to consume the
counter metric as well, if this part is used. I'll add that in the next
PR

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
This is a series of PR to migrate metric collection from opencencus to
openlemetry. For context on the existing components, see
#53098.

------------

This PR
- Support Counter metric on dashboard agent side
- Support Counter metric e2e (from worker to dashboard agent)

Test:
- CI

---------

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants