Bug #74148

Prometheus module experiences connection issues related to cherrypy

Added by Laura Flores 3 months ago. Updated 8 days ago.

Status: Pending Backport
Priority: Normal
Category: prometheus module
Target version: -
% Done: 0%
Source:
Backport: tentacle, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v20.3.0-5831-g7d9f8f3b5f
Released In:
Upkeep Timestamp: 2026-03-05T06:50:59+00:00

Description

/a/teuthology-2025-12-07_20:00:23-rados-main-distro-default-smithi/8644597

2025-12-07T21:26:39.859 INFO:teuthology.orchestra.run.smithi060.stderr:+ jq -e '.status == "success"'
2025-12-07T21:26:39.863 INFO:teuthology.orchestra.run.smithi060.stdout:{"status":"success","data":{"yaml":"global:\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  evaluation_interval: 10s\n  external_labels:\n    cluster: e8525cbe-d3b1-11f0-87af-adfe0268badd\nruntime:\n  gogc: 75\nalerting:\n  alertmanagers:\n  - follow_redirects: true\n    enable_http2: true\n    scheme: http\n    timeout: 10s\n    api_version: v2\n    http_sd_configs:\n    - follow_redirects: true\n      enable_http2: true\n      refresh_interval: 1m\n      url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=alertmanager\n    - follow_redirects: true\n      enable_http2: true\n      refresh_interval: 1m\n      url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=alertmanager\nrule_files:\n- /etc/prometheus/alerting/*\nscrape_configs:\n- job_name: ceph\n  honor_labels: true\n  honor_timestamps: true\n  track_timestamps_staleness: false\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  always_scrape_classic_histograms: false\n  convert_classic_histograms_to_nhcb: false\n  metrics_path: /metrics\n  scheme: http\n  enable_compression: true\n  metric_name_validation_scheme: utf8\n  metric_name_escaping_scheme: allow-utf-8\n  follow_redirects: true\n  enable_http2: true\n  relabel_configs:\n  - source_labels: [__address__]\n    separator: ;\n    target_label: cluster\n    replacement: e8525cbe-d3b1-11f0-87af-adfe0268badd\n    action: replace\n  - source_labels: [instance]\n    separator: ;\n    target_label: instance\n    replacement: ceph_cluster\n    action: replace\n  http_sd_configs:\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=ceph\n  - 
follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=ceph\n- job_name: node-exporter\n  honor_labels: true\n  honor_timestamps: true\n  track_timestamps_staleness: false\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  always_scrape_classic_histograms: false\n  convert_classic_histograms_to_nhcb: false\n  metrics_path: /metrics\n  scheme: http\n  enable_compression: true\n  metric_name_validation_scheme: utf8\n  metric_name_escaping_scheme: allow-utf-8\n  follow_redirects: true\n  enable_http2: true\n  relabel_configs:\n  - source_labels: [__address__]\n    separator: ;\n    target_label: cluster\n    replacement: e8525cbe-d3b1-11f0-87af-adfe0268badd\n    action: replace\n  http_sd_configs:\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=node-exporter\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=node-exporter\notlp:\n  translation_strategy: UnderscoreEscapingWithSuffixes\n"}}true
2025-12-07T21:26:39.863 INFO:teuthology.orchestra.run.smithi060.stderr:+ curl -s http://172.21.15.99:9095/api/v1/alerts
2025-12-07T21:26:39.868 INFO:teuthology.orchestra.run.smithi060.stderr:+ curl -s http://172.21.15.99:9095/api/v1/alerts
2025-12-07T21:26:39.868 INFO:teuthology.orchestra.run.smithi060.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2025-12-07T21:26:40.488 DEBUG:teuthology.orchestra.run:got remote process result: 4
2025-12-07T21:26:40.488 INFO:teuthology.orchestra.run.smithi060.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"e8525cbe-d3b1-11f0-87af-adfe0268badd","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2025-12-07T21:21:33.245200013Z","value":"0e+00"}]}}
2025-12-07T21:26:40.490 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_6ce249e0e13e12a74d5c855ed12d6b50671977c9/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/git.ceph.com_ceph_6ce249e0e13e12a74d5c855ed12d6b50671977c9/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi060 with status 4: 'sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:6ce249e0e13e12a74d5c855ed12d6b50671977c9 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid e8525cbe-d3b1-11f0-87af-adfe0268badd -- bash -c \'set -e\nset -x\nceph orch apply node-exporter\nceph orch apply grafana\nceph orch apply alertmanager\nceph orch apply prometheus\nsleep 240\nceph orch ls\nceph orch ps\nceph orch host ls\nMON_DAEMON=$(ceph orch ps --daemon-type mon -f json | jq -r \'"\'"\'last | .daemon_name\'"\'"\')\nGRAFANA_HOST=$(ceph orch ps --daemon-type grafana -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nPROM_HOST=$(ceph orch ps --daemon-type prometheus -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nALERTM_HOST=$(ceph orch ps --daemon-type alertmanager -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nGRAFANA_IP=$(ceph orch host ls -f json | jq -r --arg GRAFANA_HOST "$GRAFANA_HOST" \'"\'"\'.[] | select(.hostname==$GRAFANA_HOST) | .addr\'"\'"\')\nPROM_IP=$(ceph orch host ls -f json | jq -r --arg PROM_HOST "$PROM_HOST" \'"\'"\'.[] | select(.hostname==$PROM_HOST) | .addr\'"\'"\')\nALERTM_IP=$(ceph orch host ls -f json | jq -r --arg ALERTM_HOST "$ALERTM_HOST" \'"\'"\'.[] | select(.hostname==$ALERTM_HOST) | .addr\'"\'"\')\n# check each host node-exporter metrics endpoint is responsive\nALL_HOST_IPS=$(ceph orch host ls -f json | jq -r \'"\'"\'.[] | .addr\'"\'"\')\nfor ip in $ALL_HOST_IPS; do\n  curl -s http://${ip}:9100/metric\ndone\n# check grafana endpoints are responsive and database health is okay\ncurl -k -s https://${GRAFANA_IP}:3000/api/health\ncurl -k -s https://${GRAFANA_IP}:3000/api/health | jq -e \'"\'"\'.database == "ok"\'"\'"\'\n# stop mon daemon in order to trigger an alert\nceph orch daemon stop $MON_DAEMON\nsleep 120\n# check prometheus endpoints are responsive and mon 
down alert is firing\ncurl -s http://${PROM_IP}:9095/api/v1/status/config\ncurl -s http://${PROM_IP}:9095/api/v1/status/config | jq -e \'"\'"\'.status == "success"\'"\'"\'\ncurl -s http://${PROM_IP}:9095/api/v1/alerts\ncurl -s http://${PROM_IP}:9095/api/v1/alerts | jq -e \'"\'"\'.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"\'"\'"\'\n# check alertmanager endpoints are responsive and mon down alert is active\ncurl -s http://${ALERTM_IP}:9093/api/v2/status\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts | jq -e \'"\'"\'.[] | select(.labels | .alertname == "CephMonDown") | .status | .state == "active"\'"\'"\'\n# check prometheus metrics endpoint is not empty and make sure we can get metrics\nMETRICS_URL=$(

/a/teuthology-2025-12-07_20:00:23-rados-main-distro-default-smithi/8644597/remote/smithi060/log/e8525cbe-d3b1-11f0-87af-adfe0268badd/ceph-mgr.a.log.gz

2025-12-07T21:26:41.895+0000 7ff8f7017640  0 [prometheus INFO cherrypy.error] [07/Dec/2025:21:26:41] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) shut down
2025-12-07T21:26:41.895+0000 7ff8f7017640  0 [prometheus INFO cherrypy.error] [07/Dec/2025:21:26:41] ENGINE Bus STOPPED

I suspect this PR: https://github.com/ceph/ceph/pull/65245
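The failing check above can be reproduced outside teuthology. Below is a minimal sketch (not part of the test suite; the helper name and the abbreviated payload are illustrative) that mirrors the test's jq filter `.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"` against the alerts JSON from the log, where only CephMgrPrometheusModuleInactive is firing:

```python
import json

def alert_is_firing(alerts_payload: str, alertname: str) -> bool:
    """Mirror the test's jq filter: select the alert by name and
    check its state. Returns False when no matching alert exists,
    which is the case where `jq -e` exits with status 4 (no output)."""
    data = json.loads(alerts_payload)
    for alert in data.get("data", {}).get("alerts", []):
        if alert.get("labels", {}).get("alertname") == alertname:
            return alert.get("state") == "firing"
    return False

# Abbreviated payload from the failure above: only the
# CephMgrPrometheusModuleInactive alert is present and firing.
payload = json.dumps({"status": "success", "data": {"alerts": [
    {"labels": {"alertname": "CephMgrPrometheusModuleInactive"},
     "state": "firing"}]}})

print(alert_is_firing(payload, "CephMonDown"))                      # False
print(alert_is_firing(payload, "CephMgrPrometheusModuleInactive"))  # True
```

This matches the log: the expected CephMonDown alert never fired because the mgr/prometheus module itself went down (hence the CephMgrPrometheusModuleInactive alert), so the jq selector produced no output and the command exited 4.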


Related issues 7 (2 open, 5 closed)

Related to mgr - Bug #74149: Prometheus module fails when trying to load security configuration JSON (Resolved, Nitzan Mordechai)
Related to mgr - Backport #74056: tentacle: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Related to mgr - Backport #74057: squid: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Has duplicate mgr - Bug #74564: Rocky10 - prometheus not active (Duplicate, Nitzan Mordechai)
Has duplicate RADOS - Bug #74784: rados/cephadm/test_monitoring_stack_basic - failed to jq -e "CephMonDown" (Closed)
Copied to mgr - Backport #75344: tentacle: Prometheus module experiences connection issues related to cherrypy (Duplicate, Nitzan Mordechai)
Copied to mgr - Backport #75345: squid: Prometheus module experiences connection issues related to cherrypy (Duplicate, Nitzan Mordechai)
#1

Updated by Laura Flores 3 months ago

  • Related to Bug #74149: Prometheus module fails when trying to load security configuration JSON added
#2

Updated by Aishwarya Mathuria 3 months ago

/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639563

2025-12-03T16:55:03.646 INFO:teuthology.orchestra.run.smithi062.stderr:+ curl -s http://172.21.15.78:9095/api/v1/alerts
2025-12-03T16:55:03.652 INFO:teuthology.orchestra.run.smithi062.stderr:+ curl -s http://172.21.15.78:9095/api/v1/alerts
2025-12-03T16:55:03.652 INFO:teuthology.orchestra.run.smithi062.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2025-12-03T16:55:04.255 DEBUG:teuthology.orchestra.run:got remote process result: 4
2025-12-03T16:55:04.256 INFO:teuthology.orchestra.run.smithi062.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"1b3bd86a-d067-11f0-87ab-adfe0268badd","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2025-12-03T16:51:13.245200013Z","value":"0e+00"}]}}
2025-12-03T16:55:04.257 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_151fc19e8957de33a9ab329f5cd67d0d2eab7212/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_151fc19e8957de33a9ab329f5cd67d0d2eab7212/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(

#3

Updated by Nitzan Mordechai 3 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 66570
#4

Updated by Nitzan Mordechai 3 months ago · Edited

  • Pull request ID changed from 66570 to 66571
#5

Updated by Nitzan Mordechai 3 months ago

  • Related to Backport #74056: tentacle: ceph-mgr memory leak in prometheus module added
  • Related to Backport #74057: squid: ceph-mgr memory leak in prometheus module added
#6

Updated by Nitzan Mordechai 3 months ago

I'm not adding new backport trackers, since we are using the backports from https://tracker.ceph.com/issues/68989. The issue was found on the main branch; the tentacle and squid backports are on hold until that tracker is resolved.

#7

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/11956

#8

Updated by Nitzan Mordechai about 2 months ago

  • Related to Bug #74564: Rocky10 - prometheus not active added
#10

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-26_23:21:06-rados-wip-yuri12-testing-2026-01-22-2045-distro-default-trial/19097

#11

Updated by Sridhar Seshasayee about 2 months ago

/a/skanta-2026-01-27_05:35:03-rados-wip-bharath1-testing-2026-01-26-1242-distro-default-trial/19767

#12

Updated by Nitzan Mordechai about 2 months ago

/a/yuriw-2026-01-29_18:33:05-rados-wip-yuri2-testing-2026-01-28-1643-tentacle-distro-default-trial/26512

#13

Updated by Nitzan Mordechai about 2 months ago

  • Backport set to tentacle
#14

Updated by Nitzan Mordechai about 2 months ago

  • Related to deleted (Bug #74564: Rocky10 - prometheus not active)
#15

Updated by Nitzan Mordechai about 2 months ago

  • Has duplicate Bug #74564: Rocky10 - prometheus not active added
#16

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28574

#17

Updated by Connor Fawcett about 1 month ago

/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19866

#18

Updated by Laura Flores about 1 month ago

/a/yuriw-2026-02-03_16:00:06-rados-wip-yuri4-testing-2026-02-02-2122-distro-default-trial/31737

#20

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39119

#21

Updated by Laura Flores about 1 month ago

  • Has duplicate Bug #74784: rados/cephadm/test_monitoring_stack_basic - failed to jq -e "CephMonDown" added
#22

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35658

#23

Updated by Nitzan Mordechai about 1 month ago

  • Backport changed from tentacle to tentacle, squid
#24

Updated by Aishwarya Mathuria about 1 month ago

Seen in squid, possibly because https://github.com/ceph/ceph/pull/66483 was included in the QA batch by mistake.
https://pulpito.ceph.com/yuriw-2026-02-17_20:43:43-rados-wip-yuri6-testing-2026-02-17-1732-squid-distro-default-trial/53883/

2026-02-17T21:15:44.683 INFO:teuthology.orchestra.run.trial127.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-02-17T21:15:44.729 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-02-17T21:15:44.729 INFO:teuthology.orchestra.run.trial127.stdout:{"status":"success","data":{"alerts":[]}}
2026-02-17T21:15:44.729 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_ceph-c_d855f53b89fdcec760fd9232a5fb55ed4fb111a1/qa/tasks/cephadm.py", line 1492, in shell
    _shell(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_d855f53b89fdcec760fd9232a5fb55ed4fb111a1/qa/tasks/cephadm.py", line 110, in _shell
    return remote.run(
           ^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on trial127 with status 4: 'sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:d855f53b89fdcec760fd9232a5fb55ed4fb111a1 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 6955d983-0c44-11f1-b9a6-d404e6e7d460 -- bash -c \'set -e\nset -x\nceph orch apply node-exporter\nceph orch apply grafana\nceph orch apply alertmanager\nceph orch apply prometheus\nsleep 240\nceph orch ls\nceph orch ps\nceph orch host ls\nMON_DAEMON=$(ceph orch ps --daemon-type mon -f json | jq -r \'"\'"\'last | .daemon_name\'"\'"\')\nGRAFANA_HOST=$(ceph orch ps --daemon-type grafana -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nPROM_HOST=$(ceph orch ps --daemon-type prometheus -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nALERTM_HOST=$(ceph orch ps --daemon-type alertmanager -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nGRAFANA_IP=$(ceph orch host ls -f json | jq -r --arg GRAFANA_HOST "$GRAFANA_HOST" \'"\'"\'.[] | select(.hostname==$GRAFANA_HOST) | .addr\'"\'"\')\nPROM_IP=$(ceph orch host ls -f json | jq -r --arg PROM_HOST "$PROM_HOST" \'"\'"\'.[] | select(.hostname==$PROM_HOST) | .addr\'"\'"\')\nALERTM_IP=$(ceph orch host ls -f json | jq -r --arg ALERTM_HOST "$ALERTM_HOST" \'"\'"\'.[] | select(.hostname==$ALERTM_HOST) | .addr\'"\'"\')\n# check each host node-exporter metrics endpoint is responsive\nALL_HOST_IPS=$(ceph orch host ls -f json | jq -r \'"\'"\'.[] | .addr\'"\'"\')\nfor ip in $ALL_HOST_IPS; do\n  curl -s http://${ip}:9100/metric\ndone\n# check grafana endpoints are responsive and database health is okay\ncurl -k -s https://${GRAFANA_IP}:3000/api/health\ncurl -k -s https://${GRAFANA_IP}:3000/api/health | jq -e \'"\'"\'.database == "ok"\'"\'"\'\n# stop mon daemon in order to trigger an alert\nceph orch daemon stop $MON_DAEMON\nsleep 120\n# check prometheus endpoints are responsive and mon down alert is firing\ncurl 
-s http://${PROM_IP}:9095/api/v1/status/config\ncurl -s http://${PROM_IP}:9095/api/v1/status/config | jq -e \'"\'"\'.status == "success"\'"\'"\'\ncurl -s http://${PROM_IP}:9095/api/v1/alerts\ncurl -s http://${PROM_IP}:9095/api/v1/alerts | jq -e \'"\'"\'.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"\'"\'"\'\n# check alertmanager endpoints are responsive and mon down alert is active\ncurl -s http://${ALERTM_IP}:9093/api/v2/status\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts | jq -e \'"\'"\'.[] | select(.labels | .alertname == "CephMonDown") | .status | .state == "active"\'"\'"\'\n\''
2026-02-17T21:15:44.731 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
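Note that status 4 here has a specific meaning: with `-e`, jq exits 0 when the last output is neither false nor null, 1 when it is false or null, and 4 when the filter produced no output at all. Both failure modes in this tracker (an empty `alerts` array, as in this comment, and a payload where only CephMgrPrometheusModuleInactive is present, as in the description) hit the "no output" case. A small sketch of that rule (function names are illustrative, not jq API):

```python
import json

def jq_e_exit_status(outputs):
    """Emulate jq's -e exit-status rule: 0 if the last output is
    neither false nor null, 1 if it is false or null, and 4 if the
    filter produced no output at all."""
    if not outputs:
        return 4
    return 1 if outputs[-1] in (False, None) else 0

def run_filter(payload, alertname):
    """Apply the test's filter: select alerts by name, emit state == 'firing'."""
    alerts = json.loads(payload)["data"]["alerts"]
    return [a["state"] == "firing"
            for a in alerts
            if a["labels"].get("alertname") == alertname]

# The payload from this comment: alerts list is empty.
empty = json.dumps({"status": "success", "data": {"alerts": []}})
print(jq_e_exit_status(run_filter(empty, "CephMonDown")))  # 4, as in the log
```

So "got remote process result: 4" does not mean the alerts endpoint was down; it means the CephMonDown selector matched nothing.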

#25

Updated by Sridhar Seshasayee 26 days ago

/a/skanta-2026-02-22_05:18:48-rados-wip-bharath21-testing-2026-02-20-1039-distro-default-trial/63350

#26

Updated by Upkeep Bot 16 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Merge Commit set to 7d9f8f3b5f2112299079105c5582c6208348002d
  • Fixed In set to v20.3.0-5831-g7d9f8f3b5f
  • Upkeep Timestamp set to 2026-03-05T06:50:59+00:00
#27

Updated by Upkeep Bot 16 days ago

  • Copied to Backport #75344: tentacle: Prometheus module experiences connection issues related to cherrypy added
#28

Updated by Upkeep Bot 16 days ago

  • Copied to Backport #75345: squid: Prometheus module experiences connection issues related to cherrypy added
#29

Updated by Upkeep Bot 16 days ago

  • Tags (freeform) set to backport_processed
#30

Updated by Sridhar Seshasayee 8 days ago

Looks like the fix is merged. The following run did not have this fix:
/a/skanta-2026-03-04_23:53:38-rados-wip-bharath1-testing-2026-03-04-1011-distro-default-trial/85634
