mgr/prometheus: expose daemon health metrics (#48843)
Conversation
(force-pushed from 6360217 to bd1a92d)
Quoted hunk (ActivePyModules.cc, excerpt):

```cpp
PyObject* ActivePyModules::get_daemon_health_metrics()
{
  without_gil_t no_gil;
```
We need a Senior Principal GIL Engineer to review this 🙈
The GIL is acquired when the PyFormatter starts processing the new PyObject in daemon_state.with_daemons_by_server. There is no need to hold the GIL while we are not performing PyObject operations.
Quoted hunk (excerpt):

```python
def get_all_daemon_health_metrics(self):
    daemon_metrics = self.get_daemon_health_metrics()
```
Just curious, is this a metric that will/should be moved to the ceph-exporter?
It is possible to extend the admin_socket to expose a new endpoint where you can parse the same output. We would have to extract the functionality from ActivePyModules.cc and add formatting capabilities to the DaemonHealthMetric class itself, so the list of metrics can easily be translated on the socket.
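As a rough illustration of that idea, here is a hypothetical Python sketch of a DaemonHealthMetric that can format itself, so an admin-socket endpoint could emit the same data the mgr exposes. The class and field names are assumptions for illustration, not the actual C++ API in the Ceph tree.

```python
import json


class DaemonHealthMetric:
    """Hypothetical self-formatting health metric (names assumed)."""

    def __init__(self, metric_type, value):
        self.metric_type = metric_type
        self.value = value

    def dump(self):
        # Serialize one metric to a plain dict, ready for JSON output
        # on an admin-socket endpoint.
        return {'type': self.metric_type, 'value': self.value}


# Example: the metrics a single OSD might report.
metrics = [
    DaemonHealthMetric('SLOW_OPS', 3),
    DaemonHealthMetric('PENDING_CREATING_PGS', 0),
]
print(json.dumps([m.dump() for m in metrics]))
```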
(force-pushed from bd1a92d to 91c0c76)
(force-pushed from 91c0c76 to 873c20a)
anthonyeleven left a comment:
Nice. I've implemented similar panels when drafting my own dashboards. To that end, I'd like to request an additional panel of slow ops by node, which is helpful when trying to find a pattern in slow ops.
(force-pushed from 873c20a to ba05602)
@anthonyeleven I've added the top 10 hosts with the highest slow op count now.

Thanks!
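The aggregation behind that panel can be sketched in a few lines of Python. This is a hypothetical illustration (the real panel would do this in a dashboard query): sum the per-daemon slow op counts by host, then take the top N hosts.

```python
from collections import defaultdict


def top_hosts(slow_ops_by_daemon, host_of, n=10):
    """Sum slow ops per host and return the n busiest hosts."""
    per_host = defaultdict(int)
    for daemon, count in slow_ops_by_daemon.items():
        per_host[host_of[daemon]] += count
    # Sort hosts by descending slow-op total and keep the top n.
    return sorted(per_host.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Example data (made up): three OSDs on two hosts.
hosts = top_hosts(
    {'osd.0': 4, 'osd.1': 1, 'osd.2': 7},
    {'osd.0': 'node-a', 'osd.1': 'node-a', 'osd.2': 'node-b'},
    n=2,
)
print(hosts)  # [('node-b', 7), ('node-a', 5)]
```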
(force-pushed from ba05602 to a6e04a6)
jenkins retest this please
(force-pushed from 2af243b to d033d48)
(force-pushed from d033d48 to 13f1bb4)
jenkins test make check

jenkins test dashboard cephadm

jenkins test dashboard
(force-pushed from 13f1bb4 to 28b25af)
jenkins test make check
(force-pushed from 28b25af to 5a2b7c2)
jenkins test make check
Quoted hunk (excerpt):

```python
for health_metric in health_metrics:
    path = f'daemon_health_metrics{daemon_name}{health_metric["type"]}'
    self.metrics[path] = Metric(
        'counter',
```
@pereman2 I wonder if this metric should be a gauge instead of a counter? The "Health metrics for Ceph daemons" description is super opaque, but judging by the rest of the PR, the primary use case is the SLOW_OPS count, and that number can go both up and down.
You're right, it should be a gauge. SLOW_OPS is the latest number of slow operations reported, if I'm not mistaken.
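The counter-vs-gauge distinction can be shown with a minimal pure-Python sketch. These are hypothetical classes for illustration, not the mgr module's actual Metric type: a counter is monotonic (a lower sample implies a reset), while a gauge simply records the latest value, which is what SLOW_OPS needs since slow ops can drain back down between scrapes.

```python
class Counter:
    """Monotonic metric: samples may only grow (illustrative only)."""

    def __init__(self):
        self.value = 0

    def observe(self, v):
        if v < self.value:
            # A real scraper would interpret this as a process restart.
            raise ValueError('counter went backwards')
        self.value = v


class Gauge:
    """Point-in-time metric: any current value is valid."""

    def __init__(self):
        self.value = 0

    def observe(self, v):
        self.value = v


g = Gauge()
for sample in (3, 7, 2):  # slow ops rise, then drain back down
    g.observe(sample)
print(g.value)  # 2 -- the latest reported slow-op count
```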
Until now, daemon health metrics were stored without being used. One of the most helpful metrics there is SLOW_OPS for OSDs and MONs, which this commit exposes to provide fine-grained metrics for finding troublesome OSDs, instead of a single cluster-wide slow ops health check.
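To make the result concrete, here is a sketch of the kind of Prometheus exposition-format output such a per-daemon metric family could produce. The family and label names here are assumptions for illustration, not taken verbatim from the PR diff.

```python
def render(family, samples):
    """Render (labels, value) samples in Prometheus exposition format."""
    lines = [f'# TYPE {family} gauge']
    for labels, value in samples:
        labelstr = ','.join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f'{family}{{{labelstr}}} {value}')
    return '\n'.join(lines)


# Hypothetical samples: one slow-op count per daemon.
out = render('ceph_daemon_health_metrics', [
    ({'ceph_daemon': 'osd.0', 'type': 'SLOW_OPS'}, 5),
    ({'ceph_daemon': 'mon.a', 'type': 'SLOW_OPS'}, 0),
])
print(out)
```

With per-daemon labels like this, a dashboard can filter or aggregate slow ops by daemon or by host instead of relying on one cluster-wide health check.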
New Grafana panel in OSD overview and host details: (screenshots)

New slow_op metrics: (screenshot)

Aggregated slow_ops of all daemons: (screenshot)

Previous slow_op metric, which wasn't as fine-grained as the new one: (screenshot)
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/58094