Bug #74564


Rocky10 - prometheus not active

Added by Nitzan Mordechai about 2 months ago. Updated about 1 month ago.

Status:
Duplicate
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:
Tags:

Description

/a/nmordech-2026-01-25_11:10:14-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/17059

2026-01-25T12:03:46.976 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-01-25T12:03:47.032 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-01-25T12:03:47.032 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"d67caa45-f9e4-11f0-8e4c-d404e6e7d460","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2026-01-25T11:58:13.245200013Z","value":"0e+00"}]}}
2026-01-25T12:03:47.033 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/cephadm.py", line 1467, in shell
    _shell(
  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/cephadm.py", line 41, in _shell
    return remote.run(
           ^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/remote.py", line 575, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_c433f1062990a0488dc29a553589bc609a460691/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(

Related issues 2 (2 open, 0 closed)

Related to Ceph QA - QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 (QA Needs Approval, Laura Flores)
Is duplicate of mgr - Bug #74148: Prometheus module experiences connection issues related to cherrypy (Pending Backport, Nitzan Mordechai)

Actions #1

Updated by Nitzan Mordechai about 2 months ago

Could be infra issues:
2026-01-25 11:56:40,239 7f5ef2d14e80 QUIET systemctl: stdout inactive
2026-01-25 11:56:40,239 7f5ef2d14e80 DEBUG firewalld.service is not enabled
2026-01-25 11:56:40,239 7f5ef2d14e80 DEBUG Not possible to open ports <[8443]>. firewalld.service is not available
2026-01-25 11:56:40,240 7f5ef2d14e80 INFO Ceph Dashboard is now available at:

Actions #2

Updated by David Galloway about 2 months ago

If you look higher up in the log, prometheus is running:

2026-01-25T12:01:44.969 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch ls
2026-01-25T12:01:45.122 INFO:teuthology.orchestra.run.trial028.stdout:NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:alertmanager   ?:9093,9094      1/1  3m ago     4m   count:1
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:grafana        ?:3000           1/1  3m ago     4m   count:1
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:mgr                             2/2  3m ago     4m   trial028=a;trial120=b;count:2
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:mon                             3/3  3m ago     4m   trial028:10.20.193.28=a;trial120:10.20.193.120=b;trial140:10.20.193.140=c;count:3
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter  ?:9100           3/3  3m ago     4m   *
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:osd.default                       3  3m ago     4m   trial140
2026-01-25T12:01:45.123 INFO:teuthology.orchestra.run.trial028.stdout:prometheus     ?:9095           1/1  3m ago     4m   count:1
2026-01-25T12:01:45.133 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch ps
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:NAME                    HOST      PORTS        STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:alertmanager.trial140   trial140  *:9093,9094  running (3m)     3m ago   3m    20.4M        -  0.28.1                 91c01b3cec9b  c0dc3fe0c79d
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:grafana.trial028        trial028  *:3000       running (3m)     3m ago   3m     117M        -  12.3.1                 5cdab57891ea  ede5cc12b05c
2026-01-25T12:01:45.285 INFO:teuthology.orchestra.run.trial028.stdout:mgr.a                   trial028  *:9283,8765  running (5m)     3m ago   5m     209M        -  20.3.0-4942-gb62a951f  181aee340bfc  ef89407f2d34
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mgr.b                   trial120  *:8443,8765  running (4m)     3m ago   4m     148M        -  20.3.0-4942-gb62a951f  181aee340bfc  a299469fca70
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.a                   trial028               running (5m)     3m ago   5m    52.8M    2048M  20.3.0-4942-gb62a951f  181aee340bfc  f84b57af0716
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.b                   trial120               running (4m)     3m ago   4m    45.5M    2048M  20.3.0-4942-gb62a951f  181aee340bfc  0377ffd6563e
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:mon.c                   trial140               running (4m)     3m ago   4m    44.6M    2048M  20.3.0-4942-gb62a951f  181aee340bfc  d447cc23e0bc
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial028  trial028  *:9100       running (3m)     3m ago   3m    9755k        -  1.9.1                  255ec253085f  67ae65e6a772
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial120  trial120  *:9100       running (3m)     3m ago   3m    5817k        -  1.9.1                  255ec253085f  6a33e0ab4902
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:node-exporter.trial140  trial140  *:9100       running (3m)     3m ago   3m    5754k        -  1.9.1                  255ec253085f  fa8c9d239888
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.0                   trial028               running (4m)     3m ago   4m    48.0M    80.4G  20.3.0-4942-gb62a951f  181aee340bfc  ad0a6b93eef3
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.1                   trial120               running (4m)     3m ago   4m    40.0M    80.4G  20.3.0-4942-gb62a951f  181aee340bfc  598c84d3ac8d
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:osd.2                   trial140               running (4m)     3m ago   4m    48.9M    84.4G  20.3.0-4942-gb62a951f  181aee340bfc  ced2e1ff3233
2026-01-25T12:01:45.286 INFO:teuthology.orchestra.run.trial028.stdout:prometheus.trial120     trial120  *:9095       running (3m)     3m ago   3m    34.4M        -  3.6.0                  4fcecf061b74  3b0b6caea023
2026-01-25T12:01:45.296 INFO:teuthology.orchestra.run.trial028.stderr:+ ceph orch host ls
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:HOST      ADDR           LABELS  STATUS
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial028  10.20.193.28
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial120  10.20.193.120
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:trial140  10.20.193.140
2026-01-25T12:01:45.446 INFO:teuthology.orchestra.run.trial028.stdout:3 hosts in cluster
2026-01-25T12:01:45.458 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type mon -f json
2026-01-25T12:01:45.458 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r 'last | .daemon_name'
2026-01-25T12:01:45.620 INFO:teuthology.orchestra.run.trial028.stderr:+ MON_DAEMON=mon.c
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type grafana -f json
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -e '.[]'
2026-01-25T12:01:45.621 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r .hostname
2026-01-25T12:01:45.785 INFO:teuthology.orchestra.run.trial028.stderr:+ GRAFANA_HOST=trial028
2026-01-25T12:01:45.785 INFO:teuthology.orchestra.run.trial028.stderr:++ ceph orch ps --daemon-type prometheus -f json
2026-01-25T12:01:45.786 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -e '.[]'
2026-01-25T12:01:45.786 INFO:teuthology.orchestra.run.trial028.stderr:++ jq -r .hostname
2026-01-25T12:01:45.958 INFO:teuthology.orchestra.run.trial028.stderr:+ PROM_HOST=trial120

And we can communicate with it:

2026-01-25T12:03:46.973 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/status/config
2026-01-25T12:03:46.973 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.status == "success"'
2026-01-25T12:03:46.976 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success" 

The test was expecting to see a CephMonDown alert but didn't, so the test failed:

2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9095/api/v1/alerts
2026-01-25T12:03:46.979 INFO:teuthology.orchestra.run.trial028.stderr:+ jq -e '.data | .alerts | .[] | select(.labels | .alertname == "CephMonDown") | .state == "firing"'
2026-01-25T12:03:47.032 DEBUG:teuthology.orchestra.run:got remote process result: 4
2026-01-25T12:03:47.032 INFO:teuthology.orchestra.run.trial028.stdout:{"status":"success","data":{"alerts":[{"labels":{"alertname":"CephMgrPrometheusModuleInactive","cluster":"d67caa45-f9e4-11f0-8e4c-d404e6e7d460","instance":"ceph_cluster","job":"ceph","oid":"1.3.6.1.4.1.50495.1.2.1.6.2","severity":"critical","type":"ceph_default"},"annotations":{"description":"The mgr/prometheus module at ceph_cluster is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.","summary":"The mgr/prometheus module is not available"},"state":"firing","activeAt":"2026-01-25T11:58:13.245200013Z","value":"0e+00"}]}}

Not infra.
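
The failing jq predicate can be exercised offline. Below is a small Python sketch (the names `payload` and `mon_down_firing` are mine, and the sample JSON is trimmed to the relevant fields) that mirrors the test's filter: select alerts labelled `CephMonDown` and ask whether any of them is firing. Against a payload like the one above, where only `CephMgrPrometheusModuleInactive` is present, it returns `False`, consistent with jq's exit status 4 (no output produced by the filter).

```python
import json

# Trimmed sample mirroring the failing run: the only firing alert is
# CephMgrPrometheusModuleInactive; there is no CephMonDown entry.
payload = json.loads(
    '{"status":"success","data":{"alerts":['
    '{"labels":{"alertname":"CephMgrPrometheusModuleInactive"},'
    '"state":"firing"}]}}'
)

def mon_down_firing(doc):
    # Python equivalent of the test's filter:
    #   jq -e '.data | .alerts | .[]
    #          | select(.labels | .alertname == "CephMonDown")
    #          | .state == "firing"'
    return any(
        alert.get("state") == "firing"
        for alert in doc["data"]["alerts"]
        if alert.get("labels", {}).get("alertname") == "CephMonDown"
    )

print(mon_down_firing(payload))  # False: the expected alert never fired
```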

Actions #3

Updated by Nizamudeen A about 2 months ago

I see the prometheus web server was started on the standby mgr, which is mgr.b:

2026-01-25T11:58:03.091 INFO:journalctl@ceph.mgr.b.trial120.stdout:Jan 25 11:58:03 trial120 ceph-d67caa45-f9e4-11f0-8e4c-d404e6e7d460-mgr-b[15624]: [25/Jan/2026:11:58:03] ENGINE Serving on http://:::9283
2026-01-25T11:58:03.091 INFO:journalctl@ceph.mgr.b.trial120.stdout:Jan 25 11:58:03 trial120 ceph-d67caa45-f9e4-11f0-8e4c-d404e6e7d460-mgr-b[15624]: [25/Jan/2026:11:58:03] ENGINE Bus STARTED

But it never ran the server on the active mgr, mgr.a, even after the mgr came back up (maybe some race condition?).

And that explains why these endpoints returned "404 page not found"; I think the service discovery might have failed for the active mgr:

2026-01-25T12:01:46.768 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.28:9100/metric
2026-01-25T12:01:46.770 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found
2026-01-25T12:01:46.771 INFO:teuthology.orchestra.run.trial028.stderr:+ for ip in $ALL_HOST_IPS
2026-01-25T12:01:46.771 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.120:9100/metric
2026-01-25T12:01:46.773 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found
2026-01-25T12:01:46.774 INFO:teuthology.orchestra.run.trial028.stderr:+ for ip in $ALL_HOST_IPS
2026-01-25T12:01:46.774 INFO:teuthology.orchestra.run.trial028.stderr:+ curl -s http://10.20.193.140:9100/metric
2026-01-25T12:01:46.778 INFO:teuthology.orchestra.run.trial028.stdout:404 page not found

I could confirm or say more if the test had more info, like whether the service-discovery endpoint lists the active mgr. We could get that from:

curl http://10.20.193.28:8765/sd/prometheus/sd-config?service=ceph

Maybe we can add this to the test?

But that's all I could find so far. If it's reproducible, maybe we need to check whether service discovery returns the active daemon and where it's running, or fail the mgr over to the host where the prometheus server is running.
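
The suggested check could be sketched roughly as below. This is a hypothetical sketch, not code from the test: it assumes the mgr's service-discovery endpoint (`/sd/prometheus/sd-config?service=ceph`) returns the Prometheus HTTP SD shape, a JSON list of `{"targets": [...], "labels": {...}}` objects, and the address and the helper name `lists_active_mgr` are illustrative, not taken from this run.

```python
import json

# Illustrative SD response; the target address is made up, not from this run.
sample_sd = json.loads('[{"targets": ["10.20.193.28:9283"], "labels": {}}]')

def lists_active_mgr(sd_config, active_mgr_addr):
    # True if the active mgr's address appears among the scrape targets,
    # i.e. service discovery knows about the daemon serving /metrics.
    return any(
        active_mgr_addr in target
        for entry in sd_config
        for target in entry.get("targets", [])
    )

print(lists_active_mgr(sample_sd, "10.20.193.28"))  # True for this sample
```

A test could fetch the real endpoint with curl, then fail early with a clear message when the active mgr is missing from the targets, instead of timing out waiting for the alert.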

Actions #4

Updated by Nitzan Mordechai about 2 months ago

  • Assignee set to Nitzan Mordechai
Actions #5

Updated by Laura Flores about 2 months ago · Edited

@Nitzan Mordechai could this issue be related to https://tracker.ceph.com/issues/74148?

If so, this issue also exists on main and is not specific to the rocky10 changes. I understand that the root causes might be a little different between these two issues, though. Thoughts?

Actions #6

Updated by Laura Flores about 2 months ago

  • Subject changed from Rocky10 - promethouse not active to Rocky10 - prometheus not active
Actions #7

Updated by Laura Flores about 2 months ago

  • Related to QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 added
Actions #8

Updated by Nitzan Mordechai about 2 months ago

Laura Flores wrote in #note-5:

@Nitzan Mordechai could this issue be related to https://tracker.ceph.com/issues/74148?

If so, this issue also exists on main and is not specific to the rocky10 changes. I understand that the root causes might be a little different between these two issues, though. Thoughts?

Great catch! It's very likely the issue here; it looks like a deadlock.

Actions #9

Updated by Nitzan Mordechai about 2 months ago

  • Related to Bug #74148: Prometheus module experiences connection issues related to cherrypy added
Actions #10

Updated by Nitzan Mordechai about 2 months ago

  • Status changed from New to In Progress
Actions #11

Updated by Yaarit Hatuka about 2 months ago

This failure happens in both rocky and main tests.

Actions #12

Updated by Nitzan Mordechai about 2 months ago

  • Related to deleted (Bug #74148: Prometheus module experiences connection issues related to cherrypy)
Actions #13

Updated by Nitzan Mordechai about 2 months ago

  • Is duplicate of Bug #74148: Prometheus module experiences connection issues related to cherrypy added
Actions #14

Updated by Laura Flores about 1 month ago

  • Status changed from In Progress to Duplicate