Skip to content

[core][telemetry/08] record counter metric e2e#53449

Merged
can-anyscale merged 286 commits intomasterfrom
can-tel07
Jun 19, 2025
Merged

[core][telemetry/08] record counter metric e2e#53449
can-anyscale merged 286 commits intomasterfrom
can-tel07

Conversation

@can-anyscale
Copy link
Copy Markdown
Contributor

@can-anyscale can-anyscale commented May 30, 2025

This is a series of PR to migrate metric collection from opencencus to openlemetry. For context on the existing components, see #53098.


This PR

  • Support Counter metric on dashboard agent side
  • Support Counter metric e2e (from worker to dashboard agent)

Test:

  • CI

@can-anyscale can-anyscale force-pushed the can-tel06 branch 7 times, most recently from efc22ae to 461d11f Compare June 2, 2025 17:53
@can-anyscale can-anyscale force-pushed the can-tel07 branch 3 times, most recently from c3dab2e to d6343cf Compare June 2, 2025 22:19
@can-anyscale can-anyscale marked this pull request as ready for review June 2, 2025 22:20
@can-anyscale can-anyscale changed the title Can tel07 [core][telemetry/08] record counter metric e2e Jun 2, 2025
@can-anyscale can-anyscale force-pushed the can-tel07 branch 2 times, most recently from 4ca8343 to 7efc7b3 Compare June 2, 2025 22:59
@can-anyscale can-anyscale force-pushed the can-tel07 branch 3 times, most recently from 1337adc to 241bb02 Compare June 3, 2025 04:23
@can-anyscale can-anyscale added the go add ONLY when ready to merge, run all tests label Jun 3, 2025
can and others added 19 commits June 11, 2025 23:18
Signed-off-by: can <can@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes #1234" -->

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: can <can@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes #1234" -->

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: can <can@anyscale.com>
Signed-off-by: can <can@anyscale.com>
Signed-off-by: can <can@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

<!-- Please give a short summary of the change and the problem this
solves. -->

<!-- For example: "Closes #1234" -->

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: can <can@anyscale.com>
Signed-off-by: can <can@anyscale.com>
TEST_F(MetricTest, TestCounterMetric) {
ASSERT_TRUE(OpenTelemetryMetricRecorder::GetInstance().IsMetricRegistered(
"metric_counter_test"));
STATS_metric_counter_test.Record(100.0, {{"Tag1", "Value1"}, {"Tag2", "Value2"}});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not testing anything after the record?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah no, open telemetry doesn't have an api to get the value after recording so i just test that it's not crashing; the e2e tests (test_metrics_agent.py) will check the values.

ASSERT_EQ(OpenTelemetryMetricRecorder::GetInstance().GetObservableMetricValue(
"metric_test", {{"Tag1", "Value1"}, {"Tag2", "Value2"}}),
"metric_gauge_test", {{"Tag1", "Value1"}, {"Tag2", "Value2"}}),
42.0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also one thing that I thought of, some of the get functions that exist only exist to be used in tests

we shouldn't have functions that are only used in tests, if tests really really need something private, they can be defined as friends or you can even have a function in the fixture class that accesses that private part and just make the fixture class a friend

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can look into the friend stuff

OpenTelemetryMetricRecorder::GetInstance().RegisterGaugeMetric(name, description);
} else if (T == COUNT &&
::RayConfig::instance().experimental_enable_open_telemetry_on_core()) {
OpenTelemetryMetricRecorder::GetInstance().RegisterCounterMetric(name, description);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but should just have a larger if around that checks for experimental_enable_open_telemetry_on_core

@can-anyscale
Copy link
Copy Markdown
Contributor Author

@dayshah's comments

"""
Check if a metric with the given name is an observable metric.
"""
return self._observations_by_name.get(name) is not None
Copy link
Copy Markdown
Contributor

@dayshah dayshah Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do any of these 3 need to be separate functions? they're small and only seem to be used in one place each. Also they don't lock, and the fact that they exist increases the chance they're used without the lock

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can merge

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged \o/; i think in general though mutex should only be used at the top of the call chain and internal/private functions should not acquire or hold locks themselves (to avoid deadlock)

with self._lock:
if name in self._registered_instruments:
# Counter with the same name is already registered.
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this condition ever happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in our code base probably not, but it's an api so it tries to handle potential failures from the users (who are also us basically but well)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually this does happen and is expected behavior in [1]: for each batch of metrics received from the Export function, metrics are uniquely registered within the context of a single Export call, but not necessarily across multiple Export calls.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment describing the case?

@can-anyscale
Copy link
Copy Markdown
Contributor Author

@dayshah's comments

data_points = metric.gauge.data_points
# counter metrics
if metric.WhichOneof("data") == "sum" and metric.sum.is_monotonic:
self._open_telemetry_metric_recorder.register_counter_metric(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1]

with self._lock:
if name in self._registered_instruments:
# Counter with the same name is already registered.
return
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually this does happen and is expected behavior in [1]: for each batch of metrics received from the Export function, metrics are uniquely registered within the context of a single Export call, but not necessarily across multiple Export calls.

Signed-off-by: can <can@anyscale.com>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
"""
Set the value of a metric with the given name and tags.
This will create a gauge if it does not exist.
Set the value of a metric with the given name and tags. If the metric is not
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the logic of the function, it seems that we will not store the value if the metric is not registered even for observable metrics and it seems that the logic is different from the logic on c++ side.

Wondering do we assume that once this function is called, the metric should be registered?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a similar behavior in the c++ side that we'll ignore the metric value if it is not registered (https://github.com/ray-project/ray/blob/master/src/ray/telemetry/open_telemetry_metric_recorder.cc#L189). The only difference is that c++ will throw while python side just emits a warning.

I don't remember I throw on c++ side but I feel like c++ side should also emit a warning instead of throwing.

with self._lock:
if name in self._registered_instruments:
# Counter with the same name is already registered.
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment describing the case?

recorder.set_metric_value(
name="test_counter",
tags={"label_key": "label_value"},
value=10.0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering should we verify the value of the counter here as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for counter metric we cannot retrieve its value (internal state within open-telemetry api) - there is another e2e test that we call the /metric endpoint to actually test the value ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants