[core][telemetry/08] record counter metric e2e by can-anyscale · Pull Request #53449 · ray-project/ray

can-anyscale · 2025-05-30T21:31:45Z

This is part of a series of PRs to migrate metric collection from OpenCensus to OpenTelemetry. For context on the existing components, see #53098.

This PR

Support Counter metric on dashboard agent side
Support Counter metric e2e (from worker to dashboard agent)

Test:

CI

Signed-off-by: can <can@anyscale.com>

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

dayshah · 2025-06-18T20:37:20Z

src/ray/stats/tests/metric_with_open_telemetry_test.cc

+TEST_F(MetricTest, TestCounterMetric) {
+  ASSERT_TRUE(OpenTelemetryMetricRecorder::GetInstance().IsMetricRegistered(
+      "metric_counter_test"));
+  STATS_metric_counter_test.Record(100.0, {{"Tag1", "Value1"}, {"Tag2", "Value2"}});


not testing anything after the record?

yeah no, OpenTelemetry doesn't have an API to read a value back after recording, so I just test that it doesn't crash; the e2e tests (test_metrics_agent.py) check the values.

dayshah · 2025-06-18T20:39:37Z

src/ray/stats/tests/metric_with_open_telemetry_test.cc

  ASSERT_EQ(OpenTelemetryMetricRecorder::GetInstance().GetObservableMetricValue(
-                "metric_test", {{"Tag1", "Value1"}, {"Tag2", "Value2"}}),
+                "metric_gauge_test", {{"Tag1", "Value1"}, {"Tag2", "Value2"}}),
            42.0);


also one thing that I thought of, some of the get functions that exist only exist to be used in tests

we shouldn't have functions that are only used in tests, if tests really really need something private, they can be defined as friends or you can even have a function in the fixture class that accesses that private part and just make the fixture class a friend

i can look into the friend stuff

dayshah · 2025-06-18T20:40:34Z

src/ray/stats/metric.h

    OpenTelemetryMetricRecorder::GetInstance().RegisterGaugeMetric(name, description);
+  } else if (T == COUNT &&
+             ::RayConfig::instance().experimental_enable_open_telemetry_on_core()) {
+    OpenTelemetryMetricRecorder::GetInstance().RegisterCounterMetric(name, description);


nit but should just have a larger if around that checks for experimental_enable_open_telemetry_on_core

can-anyscale · 2025-06-18T21:33:57Z

@dayshah's comments

dayshah · 2025-06-19T00:15:01Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+        """
+        Check if a metric with the given name is an observable metric.
+        """
+        return self._observations_by_name.get(name) is not None


do any of these 3 need to be separate functions? they're small and only seem to be used in one place each. Also they don't lock, and the fact that they exist increases the chance they're used without the lock

i can merge

merged \o/; i think in general though mutex should only be used at the top of the call chain and internal/private functions should not acquire or hold locks themselves (to avoid deadlock)
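A minimal sketch of the locking convention described above, assuming a hypothetical `MetricRegistry` class (the class, method, and attribute names are illustrative and are not Ray's actual API). The public method is the only place that acquires the lock; the private helper documents that it expects the caller to hold it.

```python
import threading


class MetricRegistry:
    """Hypothetical registry used only to illustrate the locking convention."""

    def __init__(self):
        # threading.Lock is not reentrant: acquiring it twice on the same
        # thread blocks forever, which is why helpers must not lock again.
        self._lock = threading.Lock()
        self._values = {}

    def record(self, name, value):
        # Public entry point: the lock is taken once, at the top of the call chain.
        with self._lock:
            if not self._is_registered_locked(name):
                self._values[name] = 0.0
            self._values[name] += value
            return self._values[name]

    def _is_registered_locked(self, name):
        # Private helper: assumes the caller already holds self._lock.
        return name in self._values


registry = MetricRegistry()
first = registry.record("requests", 1.0)
second = registry.record("requests", 2.0)
```

The trade-off raised in the review still applies: a helper like `_is_registered_locked` can be called without the lock by mistake, which is one argument for inlining small helpers that have a single caller.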

dayshah · 2025-06-19T00:16:14Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+        with self._lock:
+            if name in self._registered_instruments:
+                # Counter with the same name is already registered.
+                return


should this condition ever happen?

in our code base probably not, but it's an api so it tries to handle potential failures from the users (who are also us basically but well)

Oh, actually this does happen and is expected behavior in [1]: for each batch of metrics received from the Export function, metrics are uniquely registered within the context of a single Export call, but not necessarily across multiple Export calls.

Can we add a comment describing the case?
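One way the requested comment and the early return could look, as a sketch only. `RecorderSketch` is a hypothetical stand-in for the recorder, the stored dictionary replaces the real OpenTelemetry instrument, and the boolean return value is added here purely so the behavior is observable.

```python
import threading


class RecorderSketch:
    """Hypothetical stand-in for the recorder; not the actual Ray class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registered_instruments = {}

    def register_counter_metric(self, name, description):
        with self._lock:
            if name in self._registered_instruments:
                # Expected case: metric names are unique within a single
                # Export call, but the same name can arrive again in a later
                # Export call. Re-registration is therefore a no-op.
                return False
            # Placeholder for creating the real counter instrument.
            self._registered_instruments[name] = {
                "kind": "counter",
                "description": description,
            }
            return True


recorder = RecorderSketch()
# Two illustrative Export batches; "tasks_total" appears in both.
export_batches = [["tasks_total", "bytes_total"], ["tasks_total"]]
created = [
    recorder.register_counter_metric(name, "")
    for batch in export_batches
    for name in batch
]
```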

can-anyscale · 2025-06-19T00:28:38Z

@dayshah's comments

can-anyscale · 2025-06-19T00:51:25Z

python/ray/dashboard/modules/reporter/reporter_agent.py

+                        data_points = metric.gauge.data_points
+                    # counter metrics
+                    if metric.WhichOneof("data") == "sum" and metric.sum.is_monotonic:
+                        self._open_telemetry_metric_recorder.register_counter_metric(


can-anyscale · 2025-06-19T00:53:20Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+        with self._lock:
+            if name in self._registered_instruments:
+                # Counter with the same name is already registered.
+                return


Oh, actually this does happen and is expected behavior in [1]: for each batch of metrics received from the Export function, metrics are uniquely registered within the context of a single Export call, but not necessarily across multiple Export calls.

Signed-off-by: can <can@anyscale.com> Signed-off-by: Cuong Nguyen <can@anyscale.com>

MengjinYan · 2025-07-08T07:21:14Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

        """
-        Set the value of a metric with the given name and tags.
-        This will create a gauge if it does not exist.
+        Set the value of a metric with the given name and tags. If the metric is not


From the logic of the function, it seems that we will not store the value if the metric is not registered, even for observable metrics, and that this differs from the logic on the C++ side.

Do we assume that the metric is already registered by the time this function is called?

The C++ side behaves similarly: we ignore the metric value if it is not registered (https://github.com/ray-project/ray/blob/master/src/ray/telemetry/open_telemetry_metric_recorder.cc#L189). The only difference is that C++ throws while the Python side just emits a warning.

I don't remember throwing on the C++ side, but I think the C++ side should also emit a warning instead of throwing.
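The warn-and-ignore behavior discussed here can be sketched as follows. `WarnOnUnregisteredRecorder` and its methods are hypothetical, the in-memory dictionary stands in for the real instruments, and the boolean return value exists only to make the outcome easy to check.

```python
import logging

logger = logging.getLogger("recorder_sketch")


class WarnOnUnregisteredRecorder:
    """Hypothetical recorder that drops values for unregistered metrics."""

    def __init__(self):
        self._registered = set()
        self._values = {}

    def register(self, name):
        self._registered.add(name)

    def set_metric_value(self, name, tags, value):
        if name not in self._registered:
            # Emit a warning and ignore the value instead of raising.
            logger.warning("Metric %s is not registered; dropping value.", name)
            return False
        # Key the stored value by metric name plus its tag set.
        self._values[(name, frozenset(tags.items()))] = value
        return True


sketch = WarnOnUnregisteredRecorder()
dropped = sketch.set_metric_value("test_gauge", {"k": "v"}, 1.0)
sketch.register("test_gauge")
stored = sketch.set_metric_value("test_gauge", {"k": "v"}, 1.0)
```

Warning rather than raising keeps a single unknown metric from failing the whole batch, at the cost of making a missing registration easier to overlook.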

MengjinYan · 2025-07-08T07:23:26Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+        with self._lock:
+            if name in self._registered_instruments:
+                # Counter with the same name is already registered.
+                return


Can we add a comment describing the case?

MengjinYan · 2025-07-08T07:38:49Z

python/ray/tests/test_open_telemetry_metric_recorder.py

+    recorder.set_metric_value(
+        name="test_counter",
+        tags={"label_key": "label_value"},
+        value=10.0,


Should we verify the value of the counter here as well?

For a counter metric we cannot retrieve its value (it is internal state within the OpenTelemetry API). There is another e2e test that calls the /metric endpoint to actually test the value ;)
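As a rough illustration of how an end-to-end check could read a counter back, the sketch below parses Prometheus text exposition output and looks up one sample. It assumes the endpoint returns that format; `parse_sample`, `read_counter`, and the `SAMPLE` payload are hypothetical and are not taken from the actual test. The parser is deliberately simple and does not handle label values containing commas or spaces, or lines with trailing timestamps.

```python
from typing import Dict, Optional, Tuple

# Illustrative payload only; not output captured from a real Ray cluster.
SAMPLE = """\
# HELP test_counter An illustrative counter.
# TYPE test_counter counter
test_counter{label_key="label_value"} 10.0
"""


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample line into (metric name, labels, value)."""
    series, _, value = line.rpartition(" ")
    name, _, raw_labels = series.partition("{")
    labels = {}
    for pair in raw_labels.rstrip("}").split(","):
        if pair:
            key, _, val = pair.partition("=")
            labels[key.strip()] = val.strip().strip('"')
    return name, labels, float(value)


def read_counter(text: str, name: str, tags: Dict[str, str]) -> Optional[float]:
    """Return the first sample matching the name and tags, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        sample_name, labels, value = parse_sample(line)
        if sample_name == name and all(labels.get(k) == v for k, v in tags.items()):
            return value
    return None


found = read_counter(SAMPLE, "test_counter", {"label_key": "label_value"})
missing = read_counter(SAMPLE, "other_counter", {})
```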

can-anyscale force-pushed the can-tel07 branch from 071d17b to ce2ed9e Compare May 30, 2025 21:51

can-anyscale force-pushed the can-tel06 branch 7 times, most recently from efc22ae to 461d11f Compare June 2, 2025 17:53

can-anyscale force-pushed the can-tel07 branch from ce2ed9e to 6e1bdd1 Compare June 2, 2025 19:11

can-anyscale force-pushed the can-tel06 branch from 461d11f to ec83360 Compare June 2, 2025 21:05

can-anyscale force-pushed the can-tel07 branch 3 times, most recently from c3dab2e to d6343cf Compare June 2, 2025 22:19

can-anyscale marked this pull request as ready for review June 2, 2025 22:20

can-anyscale changed the title ~~Can tel07~~ [core][telemetry/08] record counter metric e2e Jun 2, 2025

can-anyscale force-pushed the can-tel06 branch from ec83360 to 8ee1528 Compare June 2, 2025 22:28

can-anyscale force-pushed the can-tel07 branch 2 times, most recently from 4ca8343 to 7efc7b3 Compare June 2, 2025 22:59

can-anyscale force-pushed the can-tel06 branch from 8ee1528 to 25a9a97 Compare June 3, 2025 00:03

can-anyscale force-pushed the can-tel07 branch from 7efc7b3 to eee0f9a Compare June 3, 2025 00:06

can-anyscale force-pushed the can-tel06 branch from 25a9a97 to 5afff64 Compare June 3, 2025 00:43

can-anyscale force-pushed the can-tel07 branch from eee0f9a to 7aa0eb3 Compare June 3, 2025 00:43

can-anyscale force-pushed the can-tel06 branch from 5afff64 to 048f045 Compare June 3, 2025 00:57

can-anyscale force-pushed the can-tel07 branch from 7aa0eb3 to 35aaecb Compare June 3, 2025 01:01

can-anyscale force-pushed the can-tel06 branch from 048f045 to 850fbee Compare June 3, 2025 03:05

can-anyscale force-pushed the can-tel07 branch 3 times, most recently from 1337adc to 241bb02 Compare June 3, 2025 04:23

can-anyscale force-pushed the can-tel06 branch from 850fbee to 6fda341 Compare June 3, 2025 04:23

can-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Jun 3, 2025

can and others added 19 commits June 11, 2025 23:18

[core][telemetry/01] migrate python metric collection to opentelemetry

2a3a503

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

dff68c4

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

d95dde1

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

fe0b250

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

fd621eb

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

0cb9f70

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

af59d2b

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

3783283

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

f296adc

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

e677f01

Signed-off-by: can <can@anyscale.com>

[core][telemetry/05] refactor open_telemetry_metric_recorder.py

ff0080d

Signed-off-by: can <can@anyscale.com>

[core][telemetry/02] opentelemetry recorder grpc service

bf768e9

Signed-off-by: can <can@anyscale.com>

[core][telemetry/01] migrate python metric collection to opentelemetry

9b8c51d

Signed-off-by: can <can@anyscale.com>

[core][telemetry/07] support counter metric

f886abc

Signed-off-by: can <can@anyscale.com>

[core][telemetry/07] support counter metric

39090cc

Signed-off-by: can <can@anyscale.com>

[core][telemetry/07] support counter metric

27df0b9

Signed-off-by: can <can@anyscale.com>

can-anyscale force-pushed the can-tel07 branch from b417498 to a06800a Compare June 11, 2025 23:19

dayshah reviewed Jun 18, 2025

can-anyscale force-pushed the can-tel07 branch from a06800a to 5e8708c Compare June 18, 2025 21:33

dayshah reviewed Jun 19, 2025

can-anyscale commented Jun 19, 2025

dayshah approved these changes Jun 19, 2025

[core][telemetry/08] record counter metric e2e

78e509f

Signed-off-by: can <can@anyscale.com> Signed-off-by: Cuong Nguyen <can@anyscale.com>

MengjinYan reviewed Jul 8, 2025


3 participants
