Bug #74564
Rocky10 - prometheus not active
Status: Closed
Description
/a/nmordech-2026-01-25_11:10:14-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/17059
2026-01-25T12:03:46.976 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-01-25T12:03:47.032 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-01-25T12:03:47.032 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"d67caa45-f9e4-11f0-8e4c-d404e6e7d460","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2026-01-25T11:58:13.245200013Z","value":"0e+00"}]}}
2026-01-25T12:03:47.033 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
           ^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
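The command that returned exit status 4 is the `curl | jq` pipeline shown above: it queries Prometheus's alerts API and tests whether a `CephMonDown` alert is firing. As a minimal sketch of that same check in Python (the host and port are copied from this log and are only an example; this is not part of the qa suite):

```python
import json
import urllib.request

# Endpoint taken from the log excerpt above; adjust for your cluster.
PROM_URL = "http://10.20.193.120:9095/api/v1/alerts"

def cephmondown_firing(alerts_json: dict) -> bool:
    """Mirror of the jq filter: select alerts whose labels.alertname is
    "CephMonDown" and report whether any of them is in the firing state."""
    alerts = alerts_json.get("data", {}).get("alerts", [])
    return any(
        a.get("labels", {}).get("alertname") == "CephMonDown"
        and a.get("state") == "firing"
        for a in alerts
    )

def check_prometheus(url: str = PROM_URL) -> bool:
    """Fetch the alerts endpoint and apply the filter (network call)."""
    with urllib.request.urlopen(url) as resp:
        return cephmondown_firing(json.load(resp))
```

Applied to the alert payload in the log (which only contains `CephMgrPrometheusModuleInactive`), `cephmondown_firing` returns False, matching the nonzero exit from `jq -e`.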
Updated by Nitzan Mordechai about 2 months ago
Could be infra issues:
2026-01-25 11:56:40,239 7f5ef2d14e80 QUIET systemctl: stdout inactive
2026-01-25 11:56:40,239 7f5ef2d14e80 DEBUG firewalld.service is not enabled
2026-01-25 11:56:40,239 7f5ef2d14e80 DEBUG Not possible to open ports <[8443]>. firewalld.service is not available
2026-01-25 11:56:40,240 7f5ef2d14e80 INFO Ceph Dashboard is now available at:
Updated by David Galloway about 2 months ago
If you look higher up in the log, prometheus is running
2026-01-25T12:01:44.969 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch ls
2026-01-25T12:01:45.122 INFO:teuthology.orchestra.run.trial028.stdout:NAME PORTS RUNNING REFRESHED AGE PLACEMENT
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:alertmanager ?:9093,9094 1/1 3m ago 4m count:1
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:grafana ?:3000 1/1 3m ago 4m count:1
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:mgr 2/2 3m ago 4m trial028=a;trial120=b;count:2
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:mon 3/3 3m ago 4m trial028:10.20.193.28=a;trial120:10.20.193.120=b;trial140:10.20.193.140=c;count:3
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter ?:9100 3/3 3m ago 4m *
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:osd.default 3 3m ago 4m trial140
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:prometheus ?:9095 1/1 3m ago 4m count:1
2026-01-25T12:01:45.133 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch ps
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:alertmanager.trial140 trial140 *:9093,9094 running (3m) 3m ago 3m 20.4M - 0.28.1 91c01b3cec9b c0dc3fe0c79d
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:grafana.trial028 trial028 *:3000 running (3m) 3m ago 3m 117M - 12.3.1 5cdab57891ea ede5cc12b05c
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:mgr.a trial028 *:9283,8765 running (5m) 3m ago 5m 209M - 20.3.0-4942-gb62a951f 181aee340bfc ef89407f2d34
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mgr.b trial120 *:8443,8765 running (4m) 3m ago 4m 148M - 20.3.0-4942-gb62a951f 181aee340bfc a299469fca70
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.a trial028 running (5m) 3m ago 5m 52.8M 2048M 20.3.0-4942-gb62a951f 181aee340bfc f84b57af0716
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.b trial120 running (4m) 3m ago 4m 45.5M 2048M 20.3.0-4942-gb62a951f 181aee340bfc 0377ffd6563e
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.c trial140 running (4m) 3m ago 4m 44.6M 2048M 20.3.0-4942-gb62a951f 181aee340bfc d447cc23e0bc
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial028 trial028 *:9100 running (3m) 3m ago 3m 9755k - 1.9.1 255ec253085f 67ae65e6a772
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial120 trial120 *:9100 running (3m) 3m ago 3m 5817k - 1.9.1 255ec253085f 6a33e0ab4902
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial140 trial140 *:9100 running (3m) 3m ago 3m 5754k - 1.9.1 255ec253085f fa8c9d239888
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.0 trial028 running (4m) 3m ago 4m 48.0M 80.4G 20.3.0-4942-gb62a951f 181aee340bfc ad0a6b93eef3
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.1 trial120 running (4m) 3m ago 4m 40.0M 80.4G 20.3.0-4942-gb62a951f 181aee340bfc 598c84d3ac8d
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.2 trial140 running (4m) 3m ago 4m 48.9M 84.4G 20.3.0-4942-gb62a951f 181aee340bfc ced2e1ff3233
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:prometheus.trial120 trial120 *:9095 running (3m) 3m ago 3m 34.4M - 3.6.0 4fcecf061b74 3b0b6caea023
2026-01-25T12:01:45.296 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch host ls
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:HOST ADDR LABELS STATUS
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial028 10.20.193.28
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial120 10.20.193.120
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial140 10.20.193.140
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:3 hosts in cluster
2026-01-25T12:01:45.458 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type mon -f json
2026-01-25T12:01:45.458 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r 'last | .daemon_name'
2026-01-25T12:01:45.620 INFO:teuthology.orchestra.run.trial028.stderr:+ MON_DAEMON=mon.c
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type grafana -f json
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -e '.[]'
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r .hostname
2026-01-25T12:01:45.785 INFO:teuthology.orchestra.run.trial028.stderr:+ GRAFANA_HOST=trial028
2026-01-25T12:01:45.785 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type prometheus -f json
2026-01-25T12:01:45.786 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -e '.[]'
2026-01-25T12:01:45.786 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r .hostname
2026-01-25T12:01:45.958 INFO:teuthology.orchestra.run.trial028.stderr:+ PROM_HOST=trial120
And we can communicate with it
2026-01-25T12:03:46.973 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/status/config
2026-01-25T12:03:46.973 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.status == "success"'
2026-01-25T12:03:46.976 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success"
The test was expecting to see a CephMonDown alert but didn't, so it failed:
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-01-25T12:03:47.032 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-01-25T12:03:47.032 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"d67caa45-f9e4-11f0-8e4c-d404e6e7d460","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2026-01-25T11:58:13.245200013Z","value":"0e+00"}]}}
Not infra.
Updated by Nizamudeen A about 2 months ago
I see the prometheus web server was started on the standby mgr, which is mgr.b:
2026-01-25T11:58:03.091 INFO:journalctl@ceph.mgr.b.trial120.stdout:Jan 25 11:58:03 trial120 ceph-d67caa45-f9e4-11f0-8e4c-d404e6e7d460-mgr-b[15624]: [25/Jan/2026:11:58:03] ENGINE Serving on http://:::9283
2026-01-25T11:58:03.091 INFO:journalctl@ceph.mgr.b.trial120.stdout:Jan 25 11:58:03 trial120 ceph-d67caa45-f9e4-11f0-8e4c-d404e6e7d460-mgr-b[15624]: [25/Jan/2026:11:58:03] ENGINE Bus STARTED
But it never started the server on the mgr that was later active, mgr.a, even after that mgr came back up (maybe a race condition?).
That would explain why these endpoints returned "404 page not found": I think service discovery might have failed for the active mgr.
2026-01-25T12:01:46.768 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.28:9100/metric
2026-01-25T12:01:46.770 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found
2026-01-25T12:01:46.771 INFO:teuthology.orchestra.run.trial028.stderr:+ for ip in $ALL_HOST_IPS
2026-01-25T12:01:46.771 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9100/metric
2026-01-25T12:01:46.773 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found
2026-01-25T12:01:46.774 INFO:teuthology.orchestra.run.trial028.stderr:+ for ip in $ALL_HOST_IPS
2026-01-25T12:01:46.774 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.140:9100/metric
2026-01-25T12:01:46.778 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found
I could confirm this, or say more, if the test captured more info, such as whether the service-discovery endpoint lists the active mgr. We could get that from
curl http://10.20.193.28:8765/sd/prometheus/sd-config?service=ceph
Maybe we can add this to the test?
That's what I could find so far. If it's reproducible, we should either check whether service discovery returns the daemon that is actually running the server, or fail the mgr over to the host the prometheus server is being run from.
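The suggested check could be scripted along these lines. This is only a sketch, assuming the mgr's sd-config endpoint returns the standard Prometheus HTTP-SD shape (`[{"targets": [...], "labels": {...}}, ...]`); the URL is the one from the comment above and the helper names are hypothetical, not existing test code:

```python
import json
import urllib.request

# URL from the comment above; the response shape is assumed to be the
# standard Prometheus HTTP-SD format: [{"targets": [...], "labels": {...}}, ...]
SD_URL = "http://10.20.193.28:8765/sd/prometheus/sd-config?service=ceph"

def sd_targets(sd_json):
    """Flatten an HTTP-SD response into a plain list of host:port targets."""
    return [t for group in sd_json for t in group.get("targets", [])]

def active_mgr_listed(sd_json, active_mgr_addr):
    """True if the active mgr's address appears among the discovered targets."""
    return any(active_mgr_addr in t for t in sd_targets(sd_json))

def fetch_sd(url=SD_URL):
    """Fetch the sd-config endpoint (network call; requires cluster access)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

A test could then assert `active_mgr_listed(fetch_sd(), <active mgr IP>)` right before the alert check, which would tell us whether service discovery ever pointed Prometheus at the active mgr.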
Updated by Laura Flores about 2 months ago ยท Edited
@Nitzan Mordechai could this issue be related to https://tracker.ceph.com/issues/74148?
If so, this issue also exists on main and is not specific to the rocky10 changes. I understand that the root causes might be a little different between these two issues though- thoughts?
Updated by Laura Flores about 2 months ago
- Subject changed from Rocky10 - promethouse not active to Rocky10 - prometheus not active
Updated by Laura Flores about 2 months ago
- Related to QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 added
Updated by Nitzan Mordechai about 2 months ago
Laura Flores wrote in #note-5:
@Nitzan Mordechai could this issue be related to https://tracker.ceph.com/issues/74148?
If so, this issue also exists on main and is not specific to the rocky10 changes. I understand that the root causes might be a little different between these two issues though- thoughts?
Great catch! It's very likely the issue here; it looks like a deadlock.
Updated by Nitzan Mordechai about 2 months ago
- Related to Bug #74148: Prometheus module experiences connection issues related to cherrypy added
Updated by Nitzan Mordechai about 2 months ago
- Status changed from New to In Progress
Updated by Yaarit Hatuka about 2 months ago
This failure happens in both rocky and main tests.
Updated by Nitzan Mordechai about 2 months ago
- Related to deleted (Bug #74148: Prometheus module experiences connection issues related to cherrypy)
Updated by Nitzan Mordechai about 2 months ago
- Is duplicate of Bug #74148: Prometheus module experiences connection issues related to cherrypy added
Updated by Laura Flores about 1 month ago
- Status changed from In Progress to Duplicate