Bug #74148

Prometheus module experiences connection issues related to cherrypy

Added by Laura Flores 3 months ago. Updated 8 days ago.

Status: Pending Backport
Priority: Normal
Category: prometheus module
Target version: -
% Done: 0%
Source:
Backport: tentacle, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v20.3.0-5831-g7d9f8f3b5f
Released In:
Upkeep Timestamp: 2026-03-05T06:50:59+00:00

Description

/a/teuthology-2025-12-07_20:00:23-rados-main-distro-default-smithi/8644597

2025-12-07T21:26:39.859 INFO:teuthology.orchestra.run.smithi060.stderr:+ jq -e '.status == "success"'
2025-12-07T21:26:39.863 INFO:teuthology.orchestra.run.smithi060.stdout:{"status":"success","data":{"yaml":"global:\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  evaluation_interval: 10s\n  external_labels:\n    cluster: e8525cbe-d3b1-11f0-87af-adfe0268badd\nruntime:\n  gogc: 75\nalerting:\n  alertmanagers:\n  - follow_redirects: true\n    enable_http2: true\n    scheme: http\n    timeout: 10s\n    api_version: v2\n    http_sd_configs:\n    - follow_redirects: true\n      enable_http2: true\n      refresh_interval: 1m\n      url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=alertmanager\n    - follow_redirects: true\n      enable_http2: true\n      refresh_interval: 1m\n      url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=alertmanager\nrule_files:\n- /etc/prometheus/alerting/*\nscrape_configs:\n- job_name: ceph\n  honor_labels: true\n  honor_timestamps: true\n  track_timestamps_staleness: false\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  always_scrape_classic_histograms: false\n  convert_classic_histograms_to_nhcb: false\n  metrics_path: /metrics\n  scheme: http\n  enable_compression: true\n  metric_name_validation_scheme: utf8\n  metric_name_escaping_scheme: allow-utf-8\n  follow_redirects: true\n  enable_http2: true\n  relabel_configs:\n  - source_labels: [__address__]\n    separator: ;\n    target_label: cluster\n    replacement: e8525cbe-d3b1-11f0-87af-adfe0268badd\n    action: replace\n  - source_labels: [instance]\n    separator: ;\n    target_label: instance\n    replacement: ceph_cluster\n    action: replace\n  http_sd_configs:\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=ceph\n  - 
follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=ceph\n- job_name: node-exporter\n  honor_labels: true\n  honor_timestamps: true\n  track_timestamps_staleness: false\n  scrape_interval: 10s\n  scrape_timeout: 10s\n  scrape_protocols:\n  - OpenMetricsText1.0.0\n  - OpenMetricsText0.0.1\n  - PrometheusText1.0.0\n  - PrometheusText0.0.4\n  always_scrape_classic_histograms: false\n  convert_classic_histograms_to_nhcb: false\n  metrics_path: /metrics\n  scheme: http\n  enable_compression: true\n  metric_name_validation_scheme: utf8\n  metric_name_escaping_scheme: allow-utf-8\n  follow_redirects: true\n  enable_http2: true\n  relabel_configs:\n  - source_labels: [__address__]\n    separator: ;\n    target_label: cluster\n    replacement: e8525cbe-d3b1-11f0-87af-adfe0268badd\n    action: replace\n  http_sd_configs:\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.60:8765/sd/prometheus/sd-config?service=node-exporter\n  - follow_redirects: true\n    enable_http2: true\n    refresh_interval: 1m\n    url: http://172.21.15.99:8765/sd/prometheus/sd-config?service=node-exporter\notlp:\n  translation_strategy: UnderscoreEscapingWithSuffixes\n"}}true
2025-12-07T21:26:39.863 INFO:teuthology.orchestra.run.smithi060.stderr:+ curl -s http://172.21.15.99:9095/api/v1/alerts
2025-12-07T21:26:39.868 INFO:teuthology.orchestra.run.smithi060.stderr:+ curl -s http://172.21.15.99:9095/api/v1/alerts
2025-12-07T21:26:39.868 INFO:teuthology.orchestra.run.smithi060.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2025-12-07T21:26:40.488 DEBUG:teuthology.orchestra.run:got remote process result: 4
2025-12-07T21:26:40.488 INFO:teuthology.orchestra.run.smithi060.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"e8525cbe-d3b1-11f0-87af-adfe0268badd","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2025-12-07T21:21:33.245200013Z","value":"0e+00"}]}}
2025-12-07T21:26:40.490 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_6ce249e0e13e12a74d5c855ed12d6b50671977c9/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/git.ceph.com_ceph_6ce249e0e13e12a74d5c855ed12d6b50671977c9/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on smithi060 with status 4: 'sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:6ce249e0e13e12a74d5c855ed12d6b50671977c9 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid e8525cbe-d3b1-11f0-87af-adfe0268badd -- bash -c \'set -e\nset -x\nceph orch apply node-exporter\nceph orch apply grafana\nceph orch apply alertmanager\nceph orch apply prometheus\nsleep 240\nceph orch ls\nceph orch ps\nceph orch host ls\nMON_DAEMON=$(ceph orch ps --daemon-type mon -f json | jq -r \'"\'"\'last | .daemon_name\'"\'"\')\nGRAFANA_HOST=$(ceph orch ps --daemon-type grafana -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nPROM_HOST=$(ceph orch ps --daemon-type prometheus -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nALERTM_HOST=$(ceph orch ps --daemon-type alertmanager -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nGRAFANA_IP=$(ceph orch host ls -f json | jq -r --arg GRAFANA_HOST "$GRAFANA_HOST" \'"\'"\'.[] | select(.hostname==$GRAFANA_HOST) | .addr\'"\'"\')\nPROM_IP=$(ceph orch host ls -f json | jq -r --arg PROM_HOST "$PROM_HOST" \'"\'"\'.[] | select(.hostname==$PROM_HOST) | .addr\'"\'"\')\nALERTM_IP=$(ceph orch host ls -f json | jq -r --arg ALERTM_HOST "$ALERTM_HOST" \'"\'"\'.[] | select(.hostname==$ALERTM_HOST) | .addr\'"\'"\')\n# check each host node-exporter metrics endpoint is responsive\nALL_HOST_IPS=$(ceph orch host ls -f json | jq -r \'"\'"\'.[] | .addr\'"\'"\')\nfor ip in $ALL_HOST_IPS; do\n  curl -s http://${ip}:9100/metric\ndone\n# check grafana endpoints are responsive and database health is okay\ncurl -k -s https://${GRAFANA_IP}:3000/api/health\ncurl -k -s https://${GRAFANA_IP}:3000/api/health | jq -e \'"\'"\'.database == "ok"\'"\'"\'\n# stop mon daemon in order to trigger an alert\nceph orch daemon stop $MON_DAEMON\nsleep 120\n# check prometheus endpoints are responsive and mon 
down alert is firing\ncurl -s http://${PROM_IP}:9095/api/v1/status/config\ncurl -s http://${PROM_IP}:9095/api/v1/status/config | jq -e \'"\'"\'.status == "success"\'"\'"\'\ncurl -s http://${PROM_IP}:9095/api/v1/alerts\ncurl -s http://${PROM_IP}:9095/api/v1/alerts | jq -e \'"\'"\'.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"\'"\'"\'\n# check alertmanager endpoints are responsive and mon down alert is active\ncurl -s http://${ALERTM_IP}:9093/api/v2/status\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts | jq -e \'"\'"\'.[] | select(.labels | .alertname == "CephMonDown") | .status | .state == "active"\'"\'"\'\n# check prometheus metrics endpoint is not empty and make sure we can get metrics\nMETRICS_URL=$(

/a/teuthology-2025-12-07_20:00:23-rados-main-distro-default-smithi/8644597/remote/smithi060/log/e8525cbe-d3b1-11f0-87af-adfe0268badd/ceph-mgr.a.log.gz

2025-12-07T21:26:41.895+0000 7ff8f7017640  0 [prometheus INFO cherrypy.error] [07/Dec/2025:21:26:41] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) shut down
2025-12-07T21:26:41.895+0000 7ff8f7017640  0 [prometheus INFO cherrypy.error] [07/Dec/2025:21:26:41] ENGINE Bus STOPPED

I suspect this PR: https://github.com/ceph/ceph/pull/65245
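The failing check above can be reproduced outside teuthology. Below is a minimal sketch (not part of the test suite; the helper name and the abbreviated payload are illustrative) that mirrors the test's jq filter `.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"` against the alerts JSON from the log, where only CephMgrPrometheusModuleInactive is firing:

```python
import json

def alert_is_firing(alerts_payload: str, alertname: str) -> bool:
    """Mirror the test's jq filter: select the alert by name and
    check its state. Returns False when no matching alert exists,
    which is the case where `jq -e` exits with status 4 (no output)."""
    data = json.loads(alerts_payload)
    for alert in data.get("data", {}).get("alerts", []):
        if alert.get("labels", {}).get("alertname") == alertname:
            return alert.get("state") == "firing"
    return False

# Abbreviated payload from the failure above: only the
# CephMgrPrometheusModuleInactive alert is present and firing.
payload = json.dumps({"status": "success", "data": {"alerts": [
    {"labels": {"alertname": "CephMgrPrometheusModuleInactive"},
     "state": "firing"}]}})

print(alert_is_firing(payload, "CephMonDown"))                      # False
print(alert_is_firing(payload, "CephMgrPrometheusModuleInactive"))  # True
```

This matches the log: the expected CephMonDown alert never fired because the mgr/prometheus module itself went down (hence the CephMgrPrometheusModuleInactive alert), so the jq selector produced no output and the command exited 4.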


Related issues 7 (2 open, 5 closed)

Related to mgr - Bug #74149: Prometheus module fails when trying to load security configuration JSON (Resolved, Nitzan Mordechai)
Related to mgr - Backport #74056: tentacle: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Related to mgr - Backport #74057: squid: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Has duplicate mgr - Bug #74564: Rocky10 - prometheus not active (Duplicate, Nitzan Mordechai)
Has duplicate RADOS - Bug #74784: rados/cephadm/test_monitoring_stack_basic - failed to jq -e "CephMonDown" (Closed)
Copied to mgr - Backport #75344: tentacle: Prometheus module experiences connection issues related to cherrypy (Duplicate, Nitzan Mordechai)
Copied to mgr - Backport #75345: squid: Prometheus module experiences connection issues related to cherrypy (Duplicate, Nitzan Mordechai)
#1

Updated by Laura Flores 3 months ago

  • Related to Bug #74149: Prometheus module fails when trying to load security configuration JSON added
#2

Updated by Aishwarya Mathuria 3 months ago

/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639563

2025-12-03T16:55:03.646 INFO:teuthology.orchestra.run.smithi062.stderr:+ curl -s http://172.21.15.78:9095/api/v1/alerts
2025-12-03T16:55:03.652 INFO:teuthology.orchestra.run.smithi062.stderr:+ curl -s http://172.21.15.78:9095/api/v1/alerts
2025-12-03T16:55:03.652 INFO:teuthology.orchestra.run.smithi062.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2025-12-03T16:55:04.255 DEBUG:teuthology.orchestra.run:got remote process result: 4
2025-12-03T16:55:04.256 INFO:teuthology.orchestra.run.smithi062.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"1b3bd86a-d067-11f0-87ab-adfe0268badd","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2025-12-03T16:51:13.245200013Z","value":"0e+00"}]}}
2025-12-03T16:55:04.257 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_151fc19e8957de33a9ab329f5cd67d0d2eab7212/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_151fc19e8957de33a9ab329f5cd67d0d2eab7212/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(

#3

Updated by Nitzan Mordechai 3 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 66570
#4

Updated by Nitzan Mordechai 3 months ago · Edited

  • Pull request ID changed from 66570 to 66571
#5

Updated by Nitzan Mordechai 3 months ago

  • Related to Backport #74056: tentacle: ceph-mgr memory leak in prometheus module added
  • Related to Backport #74057: squid: ceph-mgr memory leak in prometheus module added
#6

Updated by Nitzan Mordechai 3 months ago

I'm not adding new backport trackers, since we are using the backports from https://tracker.ceph.com/issues/68989. The issue was found on the main branch; the tentacle and squid backports are on hold until that tracker is resolved.

#7

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/11956

#8

Updated by Nitzan Mordechai about 2 months ago

  • Related to Bug #74564: Rocky10 - prometheus not active added
#10

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-26_23:21:06-rados-wip-yuri12-testing-2026-01-22-2045-distro-default-trial/19097

#11

Updated by Sridhar Seshasayee about 2 months ago

/a/skanta-2026-01-27_05:35:03-rados-wip-bharath1-testing-2026-01-26-1242-distro-default-trial/19767

#12

Updated by Nitzan Mordechai about 2 months ago

/a/yuriw-2026-01-29_18:33:05-rados-wip-yuri2-testing-2026-01-28-1643-tentacle-distro-default-trial/26512

#13

Updated by Nitzan Mordechai about 2 months ago

  • Backport set to tentacle
#14

Updated by Nitzan Mordechai about 2 months ago

  • Related to deleted (Bug #74564: Rocky10 - prometheus not active)
#15

Updated by Nitzan Mordechai about 2 months ago

  • Has duplicate Bug #74564: Rocky10 - prometheus not active added
#16

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28574

#17

Updated by Connor Fawcett about 1 month ago

/a/skanta-2026-01-27_07:02:07-rados-wip-bharath3-testing-2026-01-26-1323-distro-default-trial/19866

#18

Updated by Laura Flores about 1 month ago

/a/yuriw-2026-02-03_16:00:06-rados-wip-yuri4-testing-2026-02-02-2122-distro-default-trial/31737

#20

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39119

#21

Updated by Laura Flores about 1 month ago

  • Has duplicate Bug #74784: rados/cephadm/test_monitoring_stack_basic - failed to jq -e "CephMonDown" added
#22

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35658

#23

Updated by Nitzan Mordechai about 1 month ago

  • Backport changed from tentacle to tentacle, squid
#24

Updated by Aishwarya Mathuria about 1 month ago

Seen in squid, possibly because https://github.com/ceph/ceph/pull/66483 was included in the QA batch by mistake.
https://pulpito.ceph.com/yuriw-2026-02-17_20:43:43-rados-wip-yuri6-testing-2026-02-17-1732-squid-distro-default-trial/53883/

2026-02-17T21:15:44.683 INFO:teuthology.orchestra.run.trial127.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-02-17T21:15:44.729 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-02-17T21:15:44.729 INFO:teuthology.orchestra.run.trial127.stdout:{"status":"success","data":{"alerts":[]}}
2026-02-17T21:15:44.729 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_ceph-c_d855f53b89fdcec760fd9232a5fb55ed4fb111a1/qa/tasks/cephadm.py", line 1492, in shell
    _shell(
  File "/home/teuthworker/src/github.com_ceph_ceph-c_d855f53b89fdcec760fd9232a5fb55ed4fb111a1/qa/tasks/cephadm.py", line 110, in _shell
    return remote.run(
           ^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_5f66ecfb34c0370410e78b3ee641753d19da653b/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed on trial127 with status 4: 'sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:d855f53b89fdcec760fd9232a5fb55ed4fb111a1 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 6955d983-0c44-11f1-b9a6-d404e6e7d460 -- bash -c \'set -e\nset -x\nceph orch apply node-exporter\nceph orch apply grafana\nceph orch apply alertmanager\nceph orch apply prometheus\nsleep 240\nceph orch ls\nceph orch ps\nceph orch host ls\nMON_DAEMON=$(ceph orch ps --daemon-type mon -f json | jq -r \'"\'"\'last | .daemon_name\'"\'"\')\nGRAFANA_HOST=$(ceph orch ps --daemon-type grafana -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nPROM_HOST=$(ceph orch ps --daemon-type prometheus -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nALERTM_HOST=$(ceph orch ps --daemon-type alertmanager -f json | jq -e \'"\'"\'.[]\'"\'"\' | jq -r \'"\'"\'.hostname\'"\'"\')\nGRAFANA_IP=$(ceph orch host ls -f json | jq -r --arg GRAFANA_HOST "$GRAFANA_HOST" \'"\'"\'.[] | select(.hostname==$GRAFANA_HOST) | .addr\'"\'"\')\nPROM_IP=$(ceph orch host ls -f json | jq -r --arg PROM_HOST "$PROM_HOST" \'"\'"\'.[] | select(.hostname==$PROM_HOST) | .addr\'"\'"\')\nALERTM_IP=$(ceph orch host ls -f json | jq -r --arg ALERTM_HOST "$ALERTM_HOST" \'"\'"\'.[] | select(.hostname==$ALERTM_HOST) | .addr\'"\'"\')\n# check each host node-exporter metrics endpoint is responsive\nALL_HOST_IPS=$(ceph orch host ls -f json | jq -r \'"\'"\'.[] | .addr\'"\'"\')\nfor ip in $ALL_HOST_IPS; do\n  curl -s http://${ip}:9100/metric\ndone\n# check grafana endpoints are responsive and database health is okay\ncurl -k -s https://${GRAFANA_IP}:3000/api/health\ncurl -k -s https://${GRAFANA_IP}:3000/api/health | jq -e \'"\'"\'.database == "ok"\'"\'"\'\n# stop mon daemon in order to trigger an alert\nceph orch daemon stop $MON_DAEMON\nsleep 120\n# check prometheus endpoints are responsive and mon down alert is firing\ncurl 
-s http://${PROM_IP}:9095/api/v1/status/config\ncurl -s http://${PROM_IP}:9095/api/v1/status/config | jq -e \'"\'"\'.status == "success"\'"\'"\'\ncurl -s http://${PROM_IP}:9095/api/v1/alerts\ncurl -s http://${PROM_IP}:9095/api/v1/alerts | jq -e \'"\'"\'.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"\'"\'"\'\n# check alertmanager endpoints are responsive and mon down alert is active\ncurl -s http://${ALERTM_IP}:9093/api/v2/status\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts\ncurl -s http://${ALERTM_IP}:9093/api/v2/alerts | jq -e \'"\'"\'.[] | select(.labels | .alertname == "CephMonDown") | .status | .state == "active"\'"\'"\'\n\''
2026-02-17T21:15:44.731 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
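Note that status 4 here has a specific meaning: with `-e`, jq exits 0 when the last output is neither false nor null, 1 when it is false or null, and 4 when the filter produced no output at all. Both failure modes in this tracker (an empty `alerts` array, as in this comment, and a payload where only CephMgrPrometheusModuleInactive is present, as in the description) hit the "no output" case. A small sketch of that rule (function names are illustrative, not jq API):

```python
import json

def jq_e_exit_status(outputs):
    """Emulate jq's -e exit-status rule: 0 if the last output is
    neither false nor null, 1 if it is false or null, and 4 if the
    filter produced no output at all."""
    if not outputs:
        return 4
    return 1 if outputs[-1] in (False, None) else 0

def run_filter(payload, alertname):
    """Apply the test's filter: select alerts by name, emit state == 'firing'."""
    alerts = json.loads(payload)["data"]["alerts"]
    return [a["state"] == "firing"
            for a in alerts
            if a["labels"].get("alertname") == alertname]

# The payload from this comment: alerts list is empty.
empty = json.dumps({"status": "success", "data": {"alerts": []}})
print(jq_e_exit_status(run_filter(empty, "CephMonDown")))  # 4, as in the log
```

So "got remote process result: 4" does not mean the alerts endpoint was down; it means the CephMonDown selector matched nothing.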

#25

Updated by Sridhar Seshasayee 26 days ago

/a/skanta-2026-02-22_05:18:48-rados-wip-bharath21-testing-2026-02-20-1039-distro-default-trial/63350

#26

Updated by Upkeep Bot 16 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Merge Commit set to 7d9f8f3b5f2112299079105c5582c6208348002d
  • Fixed In set to v20.3.0-5831-g7d9f8f3b5f
  • Upkeep Timestamp set to 2026-03-05T06:50:59+00:00
#27

Updated by Upkeep Bot 16 days ago

  • Copied to Backport #75344: tentacle: Prometheus module experiences connection issues related to cherrypy added
#28

Updated by Upkeep Bot 16 days ago

  • Copied to Backport #75345: squid: Prometheus module experiences connection issues related to cherrypy added
#29

Updated by Upkeep Bot 16 days ago

  • Tags (freeform) set to backport_processed
#30

Updated by Sridhar Seshasayee 8 days ago

Looks like the fix is merged. The following run did not have this fix:
/a/skanta-2026-03-04_23:53:38-rados-wip-bharath1-testing-2026-03-04-1011-distro-default-trial/85634
