mgr/prometheus: expose daemon health metrics #48843

Merged
pereman2 merged 1 commit into ceph:main from rhcs-dashboard:expose_slow_ops on Dec 20, 2022

Conversation


@pereman2 pereman2 commented Nov 11, 2022

Until now, daemon health metrics were stored without being used. One of the most helpful of these is SLOW_OPS for OSDs and MONs. This commit exposes it to provide fine-grained metrics for locating troublesome OSDs, instead of a single cluster-wide slow-ops health check.

New Grafana panel in the OSD overview:
[screenshot]

And in host details:
[screenshot]

New slow_ops metrics:
[screenshot]

Aggregated slow_ops of all daemons:
[screenshot]

Previous slow_ops metric, which was not as fine-grained as the new one:
[screenshot]

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/58094
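For illustration, here is a rough Python sketch of how per-daemon health metrics could be flattened into Prometheus exposition lines and then aggregated cluster-wide. This is not the actual mgr/prometheus code; the metric and label names (`ceph_daemon_health_metrics`, `ceph_daemon`, `type`) are assumptions based on this PR.

```python
# Sketch only: metric/label names are assumptions based on this PR,
# not a copy of the mgr/prometheus module's code.

def format_health_metrics(daemons):
    """daemons: mapping of daemon name -> list of (metric_type, value)."""
    lines = []
    for daemon, metrics in daemons.items():
        for mtype, value in metrics:
            lines.append(
                f'ceph_daemon_health_metrics'
                f'{{ceph_daemon="{daemon}",type="{mtype}"}} {value}'
            )
    return lines

daemons = {
    'osd.0': [('SLOW_OPS', 3)],
    'osd.1': [('SLOW_OPS', 0)],
    'mon.a': [('SLOW_OPS', 1)],
}

lines = format_health_metrics(daemons)

# Cluster-wide aggregate, analogous to summing the metric in a Grafana panel:
total_slow_ops = sum(
    value
    for metrics in daemons.values()
    for mtype, value in metrics
    if mtype == 'SLOW_OPS'
)
```

A per-daemon series like this is what lets a dashboard rank individual OSDs by slow-op count rather than showing only one cluster-wide number.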


@epuertat (Member) left a comment:

Once my comments are addressed, LGTM! Nice work @pereman2!


PyObject* ActivePyModules::get_daemon_health_metrics()
{
  without_gil_t no_gil;
Member:

We need a Senior Principal GIL Engineer to review this 🙈

Contributor Author (@pereman2):

The GIL is acquired when the PyFormatter starts processing the new PyObject in daemon_state.with_daemons_by_server. There is no need to hold the GIL when not performing PyObject tasks.

                labelvalues=(stats['poolid'],))

    def get_all_daemon_health_metrics(self):
        daemon_metrics = self.get_daemon_health_metrics()
Member:

Just curious, is this a metric that will/should be moved to the ceph-exporter?

Contributor Author (@pereman2):

It is possible to extend the admin_socket to expose a new endpoint that serves the same output. We would have to extract the functionality from ActivePyModules.cc and add formatting capabilities to the DaemonHealthMetric class itself, so the list of metrics can easily be translated at the socket.
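A hypothetical Python sketch of that idea: give the metric class its own formatting method so a socket endpoint can serialize the list directly. The class and field names here mimic the C++ DaemonHealthMetric by name only; the fields and the `dump()` method are assumptions for this sketch.

```python
import json
from dataclasses import dataclass

# Illustrative only: this mirrors the C++ DaemonHealthMetric in name,
# but the fields and dump() method are assumptions, not Ceph's API.
@dataclass
class DaemonHealthMetric:
    type: str   # e.g. "SLOW_OPS"
    value: int

    def dump(self):
        # A self-formatting metric makes a socket endpoint a one-liner.
        return {"type": self.type, "value": self.value}

def format_for_socket(metrics):
    """Serialize a list of metrics for an admin_socket-style endpoint."""
    return json.dumps([m.dump() for m in metrics])

out = format_for_socket([DaemonHealthMetric("SLOW_OPS", 5),
                         DaemonHealthMetric("PENDING_CREATING_PGS", 0)])
```

Keeping formatting on the metric type itself would let both the mgr module and a socket endpoint reuse one serialization path.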

@pereman2 pereman2 requested a review from jmolmo November 28, 2022 09:43
@pereman2 pereman2 marked this pull request as ready for review November 28, 2022 14:35
@pereman2 pereman2 requested review from a team as code owners November 28, 2022 14:35
@anthonyeleven (Contributor) left a comment:

Nice. I've implemented similar panels when drafting my own dashboards. To that end, I'd like to request an additional panel of slow ops by node, which is helpful when trying to find a pattern in slow ops.

@nizamial09 (Member) left a comment:

LGTM!

@pereman2 (Contributor Author):

@anthonyeleven I've added the top 10 hosts with the highest slow-op count now.

@anthonyeleven (Contributor):

> @anthonyeleven I've added the top 10 hosts with the highest slow-op count now.

Thanks!

@pereman2 (Contributor Author) commented Dec 9, 2022:

jenkins retest this please

@pereman2 pereman2 force-pushed the expose_slow_ops branch 2 times, most recently from 2af243b to d033d48 Compare December 12, 2022 12:00
@avanthakkar (Contributor):

jenkins test make check

@avanthakkar (Contributor):

jenkins test dashboard cephadm

@avanthakkar (Contributor):

jenkins test dashboard

@pereman2 (Contributor Author):

jenkins test make check

Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@pereman2 (Contributor Author):

jenkins test make check

        for health_metric in health_metrics:
            path = f'daemon_health_metrics{daemon_name}{health_metric["type"]}'
            self.metrics[path] = Metric(
                'counter',
@idryomov (Contributor) commented Apr 26, 2023:

@pereman2 I wonder if this metric should be a gauge instead of a counter? The "Health metrics for Ceph daemons" description is super opaque, but, judging by the rest of the PR, the primary use case is SLOW_OPS count and that number can go both up and down.

Contributor Author (@pereman2):

You're right, it should be a gauge. SLOW_OPS is the latest number of slow operations reported, if I'm not wrong.
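A minimal sketch of the distinction, using toy classes rather than the mgr/prometheus Metric implementation or the Prometheus client library: a counter is monotonically non-decreasing, while SLOW_OPS reports the current number of slow operations, which can drop between scrapes — gauge semantics.

```python
# Toy metric classes to illustrate counter vs. gauge semantics; these are
# not Ceph's Metric class or the official prometheus client.
class Counter:
    def __init__(self):
        self.value = 0.0

    def inc(self, delta=1.0):
        # Counters may only ever increase (resets aside).
        if delta < 0:
            raise ValueError("counter cannot decrease")
        self.value += delta

class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        # Gauges represent current state and may move in either direction.
        self.value = v

readings = [5, 2, 7]   # successive SLOW_OPS samples for one daemon

g = Gauge()
for v in readings:
    g.set(v)           # the gauge always reflects the latest reading

c = Counter()
c.inc(readings[0])
drop_rejected = False
try:
    c.inc(readings[1] - readings[0])   # 2 - 5 = -3: can't encode the drop
except ValueError:
    drop_rejected = True
```

This is also why PromQL functions like `rate()` are only meaningful on counters: applying them to a value that can go down produces nonsense.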



6 participants