mgr/dashboard: add prometheus federation config for multi-cluster monitoring by aaSharma14 · Pull Request #54964 · ceph/ceph

aaSharma14 · 2023-12-19T15:32:43Z

Introduce Prometheus federation in the Ceph dashboard. This is done by adding a federate job to the Prometheus configuration. Targets (each remote cluster's Prometheus service endpoint) can be added to or removed from this job to scrape data from different clusters. The targets are written to the Prometheus config file through two new orch CLI commands:

ceph orch prometheus set-target
ceph orch prometheus remove-target
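
For illustration, the federate job described above might take roughly the following shape once targets have been added. This is a minimal sketch under stated assumptions, not the template or code from this PR: the helper names (`add_target`, `remove_target`, `build_federate_job`), the job name, the `match[]` selector, and the example addresses are all hypothetical. Only `honor_labels`, `metrics_path`, `params`, and `static_configs` are standard Prometheus scrape-config fields.

```python
import json
from typing import Dict, List


def add_target(targets: List[str], target: str) -> List[str]:
    """Return a new target list that contains `target` exactly once."""
    return list(targets) if target in targets else list(targets) + [target]


def remove_target(targets: List[str], target: str) -> List[str]:
    """Return a new target list without `target`."""
    return [t for t in targets if t != target]


def build_federate_job(targets: List[str]) -> Dict[str, object]:
    """Build a scrape job that pulls series from remote Prometheus instances.

    Hypothetical helper for illustration; not part of the Ceph code base.
    """
    return {
        "job_name": "federate",                            # assumed job name
        "honor_labels": True,                              # keep labels set by the remote side
        "metrics_path": "/federate",                       # Prometheus federation endpoint
        "params": {"match[]": ['{__name__=~"ceph_.*"}']},  # assumed series selector
        "static_configs": [{"targets": list(targets)}],    # host:port entries, no scheme
    }


if __name__ == "__main__":
    targets = add_target([], "192.0.2.10:9095")        # example addresses only
    targets = add_target(targets, "192.0.2.11:9095")
    print(json.dumps(build_federate_job(targets), indent=2))
```

Keeping the target list free of duplicates means that setting the same target twice leaves the generated configuration unchanged.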

nizamial09

some initial impressions!

src/pybind/mgr/cephadm/module.py

src/pybind/mgr/cephadm/services/monitoring.py

src/pybind/mgr/orchestrator/_interface.py

src/python-common/ceph/deployment/service_spec.py

src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2

src/pybind/mgr/cephadm/module.py

src/pybind/mgr/cephadm/services/monitoring.py

src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2

src/pybind/mgr/orchestrator/_interface.py

src/python-common/ceph/deployment/service_spec.py

src/pybind/mgr/cephadm/module.py

src/pybind/mgr/orchestrator/module.py

aaSharma14 · 2024-01-16T14:20:34Z

jenkins retest this please

nizamial09 · 2024-01-29T04:36:26Z

jenkins test make check

adk3798

minor comments. Can't speak much to the changes to the actual prometheus conf, but generally the code looks okay outside of the tests failing.

src/pybind/mgr/cephadm/services/monitoring.py

src/pybind/mgr/orchestrator/module.py

src/pybind/mgr/cephadm/module.py

rkachach

I just left some minor comments, plus some others more specific to security. I'd like to know whether we have done any security assessment of the implications of enabling security together with this new feature. Is the system still secure? If not, what security issues could we face when enabling this feature, and what should we do to overcome them?

src/pybind/mgr/cephadm/module.py

src/pybind/mgr/cephadm/services/monitoring.py

rkachach · 2024-02-05T12:26:03Z

src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2

+    relabel_configs:
+    - source_labels: [__address__]
+      target_label: cluster
+      replacement: {{ cluster_fsid }}


This section assumes you are using secure communication. Is this the case? What security implications does this new feature have? Are we taking them into account? Have we done any security assessment of the impact?
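
To make the effect of the relabel rule quoted above concrete, here is a small simulation of it. It is a sketch, not Prometheus code: `apply_static_relabel` is a hypothetical helper, the fsid value is made up, and the replacement is treated as a literal string (real Prometheus also expands `$1`-style references). It relies on the documented defaults for `relabel_configs`: action `replace`, regex `(.*)`, and separator `;`.

```python
import re
from typing import Dict, List


def apply_static_relabel(
    labels: Dict[str, str],
    source_labels: List[str],
    target_label: str,
    replacement: str,
    regex: str = "(.*)",
    separator: str = ";",
) -> Dict[str, str]:
    """Simulate a `replace` relabel rule with a literal replacement value.

    Hypothetical helper; capture-group references in `replacement`
    are deliberately not supported in this sketch.
    """
    # Prometheus joins the source label values with the separator ...
    value = separator.join(labels.get(name, "") for name in source_labels)
    # ... and applies the regex anchored at both ends.
    if re.fullmatch(regex, value) is None:
        return dict(labels)  # no match: labels are left untouched
    result = dict(labels)
    result[target_label] = replacement
    return result


if __name__ == "__main__":
    scraped = {"__address__": "192.0.2.10:9095"}  # example target address
    print(apply_static_relabel(scraped, ["__address__"], "cluster", "example-fsid"))
```

Because the default regex matches any address, every target in the job ends up with the same `cluster` label value, which is what lets dashboards tell clusters apart.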

nizamial09 · 2024-02-08T13:09:04Z

src/pybind/mgr/orchestrator/module.py

        except ArgumentError as e:
            return HandleCommandResult(-errno.EINVAL, "", (str(e)))

+    @_cli_write_command('orch prometheus set-target')


If I give a target like http://<ip>:port, it effectively kills the prometheus daemon, and I had to remove the target and restart the prometheus module to get it working again. Should it fail like that for a simple error? If this is a crucial mistake, there should be proper validation in place, or we might end up breaking a deployment.

At least some help text describing what a prometheus target should look like would be helpful.

@nizamial09, this issue is being tracked at https://tracker.ceph.com/issues/64369. I will open a separate PR for the mentioned issues soon.
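
As an illustration of the kind of validation requested in this thread, a check along the following lines could reject a scheme-prefixed target before it reaches the Prometheus config. This is only a sketch under assumptions: `validate_target` is a hypothetical helper, not the fix tracked in the issue above, and the accepted format (`host:port` with no scheme, as Prometheus `static_configs` targets are normally written) is inferred from the failure described in the comment.

```python
import re

# host (name or IPv4) or bracketed IPv6 literal, followed by ":port"
_HOST_PORT = re.compile(r"(?:[A-Za-z0-9.-]+|\[[0-9A-Fa-f:]+\]):(\d{1,5})")


def validate_target(target: str) -> str:
    """Return the stripped target if it looks like host:port, else raise ValueError.

    Hypothetical helper for illustration; not part of the Ceph code base.
    """
    target = target.strip()
    if "://" in target:
        raise ValueError(
            f"invalid target {target!r}: use host:port without a URL scheme"
        )
    match = _HOST_PORT.fullmatch(target)
    if match is None:
        raise ValueError(f"invalid target {target!r}: expected host:port")
    port = int(match.group(1))
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid target {target!r}: port out of range")
    return target


if __name__ == "__main__":
    for candidate in ("192.0.2.10:9095", "http://192.0.2.10:9095"):
        try:
            print("accepted:", validate_target(candidate))
        except ValueError as exc:
            print("rejected:", exc)
```

Failing early with a message that names the expected format would also cover the request for a hint about what a target should look like.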

cloudbehl · 2024-03-01T06:42:54Z

@adk3798 can we merge this if the teuthology run is okay? (asking since there are a series of bug fixes planned after #55574, which currently depends on this PR) thanks.

yeah, it looks like my testing tag got removed so this wasn't in my last run, but I'll do another run in the next day or 2 and if this doesn't break anything we can merge.

@adk3798 any updates?

nizamial09 · 2024-03-04T07:01:59Z

@aaSharma14 i saw these in the unit test failures

DEBUG    cephadm.serve:serve.py:828 Daemons that will be removed: []
DEBUG    cephadm.serve:serve.py:909 Placing haproxy.ingress.test.mhaubs on host test
DEBUG    cephadm.services.ingress:ingress.py:71 prepare_create haproxy.ingress.test.mhaubs on host test with spec IngressSpec.from_json(yaml.safe_load('''service_type: ingress
service_id: ingress
service_name: ingress.ingress
placement:
  count: 2
spec:
  backend_service: rgw.foo
  first_virtual_router_id: 50
  frontend_port: 8089
  keepalived_password: '12345'
  monitor_password: '12345'
  monitor_port: 8999
  monitor_user: admin
  virtual_ip: 1.2.3.4/32
'''))
DEBUG    cephadm.services.ingress:ingress.py:171 enabled default server opts: []
DEBUG    asyncio:selector_events.py:54 Using selector: EpollSelector
INFO     cephadm.serve:serve.py:1340 Deploying daemon haproxy.ingress.test.mhaubs on test
DEBUG    cephadm.inventory:inventory.py:894 Host test has no devices to save
DEBUG    cephadm.serve:serve.py:909 Placing keepalived.ingress.test.uonwnj on host test
DEBUG    cephadm.services.ingress:ingress.py:215 prepare_create keepalived.ingress.test.uonwnj on host test with spec IngressSpec.from_json(yaml.safe_load('''service_type: ingress
service_id: ingress
service_name: ingress.ingress
placement:
  count: 2
spec:
  backend_service: rgw.foo
  first_virtual_router_id: 50
  frontend_port: 8089
  keepalived_password: '12345'
  monitor_password: '12345'
  monitor_port: 8999
  monitor_user: admin
  virtual_ip: 1.2.3.4/32
'''))
INFO     cephadm.services.ingress:ingress.py:263 1.2.3.4 is in 1.2.3.0/24 on test interface if0

adk3798 · 2024-03-04T15:49:11Z

https://pulpito.ceph.com/adking-2024-03-02_21:18:48-orch:cephadm-wip-adk-testing-2024-03-01-1302-distro-default-smithi/

The failures were all the "in cluster log" stuff, which is okay while we work on the ignorelist for the orch/cephadm suite; the mds upgrade sequence test failures, which are a known issue; and the test_repos test failing to get jammy packages for quincy, which is also a known issue. So no regressions are caused by this, and I'm okay with merging once the CI here is passing.

aaSharma14 · 2024-03-05T13:37:14Z

jenkins test api

Rendering the dashboards and alerts with showMultiCluster=True allows them to work with multiple clusters storing their metrics in a single Prometheus instance. This works via the cluster label, and that functionality already existed. This just fixes some inconsistencies in applying the label filters.

Additionally, this contains updates to the tests so that they succeed with both configurations and avoid the introduction of regressions regarding multiCluster in the future.

There are also some consistency cleanups here and there:

* `datasource` was not used consistently
* `cluster` label_values are determined from `ceph_health_status`
* `job` template and filters on this label were removed to align multi-cluster support solely via the `cluster` label
* `ceph_hosts` filter now uses label_values from any ceph_metadata metric, so that it shows not all instance values but only those of hosts with some Ceph component / daemon
* Enable showMultiCluster=True since the `cluster` label is now always present, via ceph#54964

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>

aaSharma14 requested review from adk3798, cloudbehl, nizamial09 and rkachach December 19, 2023 15:32

aaSharma14 requested review from a team as code owners December 19, 2023 15:32

github-actions bot added cephadm dashboard monitoring orchestrator pybind labels Dec 19, 2023

nizamial09 reviewed Dec 19, 2023

cloudbehl reviewed Dec 22, 2023

cloudbehl mentioned this pull request Dec 22, 2023

mgr/dashboard: add thanos querier and sidecar containers to dashboard #54824

Closed

14 tasks

aaSharma14 force-pushed the add-prometheus-federation-cli branch 2 times, most recently from 8e64de8 to 9ad8cd4 Compare December 27, 2023 06:02

nizamial09 reviewed Jan 4, 2024

adk3798 reviewed Jan 8, 2024

aaSharma14 force-pushed the add-prometheus-federation-cli branch from 9ad8cd4 to bc95834 Compare January 10, 2024 12:56

aaSharma14 requested review from adk3798, cloudbehl and nizamial09 January 10, 2024 12:59

adk3798 added the wip-adk-testing label Jan 16, 2024

adk3798 reviewed Jan 30, 2024

aaSharma14 force-pushed the add-prometheus-federation-cli branch 2 times, most recently from 96182fd to c21d7ac Compare February 5, 2024 11:14

rkachach requested changes Feb 5, 2024

nizamial09 reviewed Feb 8, 2024

github-actions bot added cephadm orchestrator pybind and removed needs-rebase labels Feb 29, 2024

aaSharma14 force-pushed the add-prometheus-federation-cli branch from d7faaaf to 852ecb8 Compare March 4, 2024 06:03

adk3798 removed the wip-adk-testing label Mar 4, 2024

aaSharma14 force-pushed the add-prometheus-federation-cli branch from 852ecb8 to f8c0940 Compare March 5, 2024 04:58

mgr/dashboard: add prometheus federation config for multi-cluster monitoring (82b50b4)

Signed-off-by: Aashish Sharma <aasharma@redhat.com>

aaSharma14 force-pushed the add-prometheus-federation-cli branch from f8c0940 to 82b50b4 Compare March 5, 2024 06:29

nizamial09 merged commit 7d58640 into ceph:main Mar 6, 2024

nizamial09 deleted the add-prometheus-federation-cli branch March 6, 2024 06:29

rkachach mentioned this pull request Mar 8, 2024

mgr: adding filters for prometheus endpoint configuration rook/rook#13889

Closed

This was referenced May 3, 2024

squid: mgr/dashboard: add prometheus federation config for multi-cluster monitoring #57254

Merged

reef: mgr/dashboard: add prometheus federation config for multi-cluster monitoring #57255

Merged

jmolmo mentioned this pull request May 30, 2024

Duplicates series on Prom alert Rules CephOSDFlapping and CephPGImbalance while monitoring several clusters rook/rook#13575

Closed
