mgr/prometheus: expose daemon health metrics #48843

Merged
pereman2 merged 1 commit into ceph:main from rhcs-dashboard:expose_slow_ops on Dec 20, 2022

Conversation


@pereman2 pereman2 commented Nov 11, 2022

Until now, daemon health metrics were stored without being used. One of the most helpful of these is SLOW_OPS for OSDs and MONs. This commit exposes it to provide fine-grained metrics for locating troublesome OSDs, instead of a single cluster-wide slow-ops health check.

New Grafana panel in the OSD overview:
[screenshot]

And in host details:
[screenshot]

New slow_ops metrics:
[screenshot]

Aggregated slow_ops of all daemons:
[screenshot]

Previous slow_ops metric, which was not as fine-grained as the new one:
[screenshot]

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/58094
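For illustration, here is a rough Python sketch of how per-daemon health metrics could be flattened into Prometheus exposition lines and then aggregated cluster-wide. This is not the actual mgr/prometheus code; the metric and label names (`ceph_daemon_health_metrics`, `ceph_daemon`, `type`) are assumptions based on this PR.

```python
# Sketch only: metric/label names are assumptions based on this PR,
# not a copy of the mgr/prometheus module's code.

def format_health_metrics(daemons):
    """daemons: mapping of daemon name -> list of (metric_type, value)."""
    lines = []
    for daemon, metrics in daemons.items():
        for mtype, value in metrics:
            lines.append(
                f'ceph_daemon_health_metrics'
                f'{{ceph_daemon="{daemon}",type="{mtype}"}} {value}'
            )
    return lines

daemons = {
    'osd.0': [('SLOW_OPS', 3)],
    'osd.1': [('SLOW_OPS', 0)],
    'mon.a': [('SLOW_OPS', 1)],
}

lines = format_health_metrics(daemons)

# Cluster-wide aggregate, analogous to summing the metric in a Grafana panel:
total_slow_ops = sum(
    value
    for metrics in daemons.values()
    for mtype, value in metrics
    if mtype == 'SLOW_OPS'
)
```

A per-daemon series like this is what lets a dashboard rank individual OSDs by slow-op count rather than showing only one cluster-wide number.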


@epuertat (Member) left a comment:

Once my comments are addressed, LGTM! Nice work @pereman2!


PyObject* ActivePyModules::get_daemon_health_metrics()
{
  without_gil_t no_gil;
Member:

We need a Senior Principal GIL Engineer to review this 🙈

Contributor Author (@pereman2):

The GIL is acquired when the PyFormatter starts processing the new PyObject in daemon_state.with_daemons_by_server. There is no need to hold the GIL when not performing PyObject tasks.

                labelvalues=(stats['poolid'],))

    def get_all_daemon_health_metrics(self):
        daemon_metrics = self.get_daemon_health_metrics()
Member:

Just curious, is this a metric that will/should be moved to the ceph-exporter?

Contributor Author (@pereman2):

It is possible to extend the admin_socket to expose a new endpoint that serves the same output. We would have to extract the functionality from ActivePyModules.cc and add formatting capabilities to the DaemonHealthMetric class itself, so the list of metrics can easily be translated at the socket.
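A hypothetical Python sketch of that idea: give the metric class its own formatting method so a socket endpoint can serialize the list directly. The class and field names here mimic the C++ DaemonHealthMetric by name only; the fields and the `dump()` method are assumptions for this sketch.

```python
import json
from dataclasses import dataclass

# Illustrative only: this mirrors the C++ DaemonHealthMetric in name,
# but the fields and dump() method are assumptions, not Ceph's API.
@dataclass
class DaemonHealthMetric:
    type: str   # e.g. "SLOW_OPS"
    value: int

    def dump(self):
        # A self-formatting metric makes a socket endpoint a one-liner.
        return {"type": self.type, "value": self.value}

def format_for_socket(metrics):
    """Serialize a list of metrics for an admin_socket-style endpoint."""
    return json.dumps([m.dump() for m in metrics])

out = format_for_socket([DaemonHealthMetric("SLOW_OPS", 5),
                         DaemonHealthMetric("PENDING_CREATING_PGS", 0)])
```

Keeping formatting on the metric type itself would let both the mgr module and a socket endpoint reuse one serialization path.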

@pereman2 pereman2 requested a review from jmolmo November 28, 2022 09:43
@pereman2 pereman2 marked this pull request as ready for review November 28, 2022 14:35
@pereman2 pereman2 requested review from a team as code owners November 28, 2022 14:35
@anthonyeleven (Contributor) left a comment:

Nice. I've implemented similar panels when drafting my own dashboards. To that end, I'd like to request an additional panel of slow ops by node, which is helpful when trying to find a pattern in slow ops.

@nizamial09 (Member) left a comment:

LGTM!

@pereman2 (Contributor Author):

@anthonyeleven I've added the top 10 hosts with the highest slow-op count now.

@anthonyeleven (Contributor):

> @anthonyeleven I've added the top 10 hosts with the highest slow-op count now.

Thanks!

@pereman2 (Contributor Author) commented Dec 9, 2022:

jenkins retest this please

@pereman2 pereman2 force-pushed the expose_slow_ops branch 2 times, most recently from 2af243b to d033d48 Compare December 12, 2022 12:00
@avanthakkar (Contributor):

jenkins test make check

@avanthakkar (Contributor):

jenkins test dashboard cephadm

@avanthakkar (Contributor):

jenkins test dashboard

@pereman2 (Contributor Author):

jenkins test make check

Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@pereman2 (Contributor Author):

jenkins test make check

        for health_metric in health_metrics:
            path = f'daemon_health_metrics{daemon_name}{health_metric["type"]}'
            self.metrics[path] = Metric(
                'counter',
@idryomov (Contributor) commented Apr 26, 2023:

@pereman2 I wonder if this metric should be a gauge instead of a counter? The "Health metrics for Ceph daemons" description is super opaque, but, judging by the rest of the PR, the primary use case is SLOW_OPS count and that number can go both up and down.

Contributor Author (@pereman2):

You're right, it should be a gauge. SLOW_OPS is the latest number of slow operations reported, if I'm not wrong.
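A minimal sketch of the distinction, using toy classes rather than the mgr/prometheus Metric implementation or the Prometheus client library: a counter is monotonically non-decreasing, while SLOW_OPS reports the current number of slow operations, which can drop between scrapes — gauge semantics.

```python
# Toy metric classes to illustrate counter vs. gauge semantics; these are
# not Ceph's Metric class or the official prometheus client.
class Counter:
    def __init__(self):
        self.value = 0.0

    def inc(self, delta=1.0):
        # Counters may only ever increase (resets aside).
        if delta < 0:
            raise ValueError("counter cannot decrease")
        self.value += delta

class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        # Gauges represent current state and may move in either direction.
        self.value = v

readings = [5, 2, 7]   # successive SLOW_OPS samples for one daemon

g = Gauge()
for v in readings:
    g.set(v)           # the gauge always reflects the latest reading

c = Counter()
c.inc(readings[0])
drop_rejected = False
try:
    c.inc(readings[1] - readings[0])   # 2 - 5 = -3: can't encode the drop
except ValueError:
    drop_rejected = True
```

This is also why PromQL functions like `rate()` are only meaningful on counters: applying them to a value that can go down produces nonsense.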



6 participants