mgr/cephadm: fixing prometheus port handling by rkachach · Pull Request #45241 · ceph/ceph

rkachach · 2022-03-03T10:45:33Z

Fixes: https://tracker.ceph.com/issues/51072

Signed-off-by: Redouane Kachach rkachach@redhat.com

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

adk3798 · 2022-03-06T18:30:28Z

seems to work in testing

[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  4m ago     19m  count:1    
crash                           3/3  4m ago     19m  *          
grafana        ?:3000           1/1  4m ago     19m  count:1    
mgr                             2/2  4m ago     19m  count:2    
mon                             3/5  4m ago     19m  count:5    
node-exporter  ?:9100           3/3  4m ago     19m  *          
prometheus     ?:9095           1/1  4m ago     30s  count:1    
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (17m)     6m ago  20m    15.5M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (20m)     6m ago  20m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (18m)     5m ago  18m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (18m)     6m ago  18m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (17m)     6m ago  19m    24.2M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (21m)     6m ago  21m     453M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (18m)     5m ago  18m     417M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (21m)     6m ago  21m    41.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (18m)     5m ago  18m    30.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (18m)     6m ago  18m    33.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (19m)     6m ago  19m    8242k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (18m)     5m ago  18m    8212k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (18m)     6m ago  18m    8224k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:9095       running (17m)     6m ago  17m    34.8M        -  2.18.1                 de242295e225  80a3863c1724  
[ceph: root@vm-00 /]# cat /usr/share/ceph/mgr/cephadm/module.py | grep monitoring_spec.port
            monitoring_spec.port = cast(int, self._ceph_get_module_option('prometheus', 'server_port'))
[ceph: root@vm-00 /]# ceph config set mgr mgr/prometheus/server_port 12765
[ceph: root@vm-00 /]# ceph config get mgr mgr/promeheus/server_port
Error ENOENT: 
[ceph: root@vm-00 /]# ceph config get mgr mgr/prometheus/server_port
12765
[ceph: root@vm-00 /]# ceph orch apply prometheus
Scheduled prometheus update...
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (21m)    10s ago  23m    15.8M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (23m)    10s ago  23m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (21m)     8m ago  21m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (21m)     9m ago  21m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (20m)    10s ago  23m    24.9M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (24m)    10s ago  24m     456M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (21m)     8m ago  21m     417M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (24m)    10s ago  24m    45.9M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (21m)     8m ago  21m    30.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (21m)     9m ago  21m    33.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (23m)    10s ago  23m    8493k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (21m)     8m ago  21m    8212k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (21m)     9m ago  21m    8224k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:12765      running (18s)    10s ago  18s    17.3M        -  2.18.1                 de242295e225  c1eba972b487  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  13s ago    24m  count:1    
crash                           3/3  9m ago     24m  *          
grafana        ?:3000           1/1  13s ago    24m  count:1    
mgr                             2/2  8m ago     24m  count:2    
mon                             3/5  9m ago     24m  count:5    
node-exporter  ?:9100           3/3  9m ago     24m  *          
prometheus     ?:12765          1/1  13s ago    32s  count:1    
[ceph: root@vm-00 /]# ceph orch ls --service-name prometheus -f yaml
service_type: prometheus
service_name: prometheus
placement:
  count: 1
spec:
  port: 12765
status:
  created: '2022-03-06T18:19:31.965101Z'
  last_refresh: '2022-03-06T18:19:51.454906Z'
  ports:
  - 12765
  running: 1
  size: 1
events:
- 2022-03-06T18:14:36.221492Z service:prometheus [INFO] "service was created"
[ceph: root@vm-00 /]# ceph config set mgr mgr/prometheus/server_port 7457 
[ceph: root@vm-00 /]# ceph orch apply prometheus
Scheduled prometheus update...
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (23m)    19s ago  26m    15.9M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (26m)    19s ago  26m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (24m)    71s ago  24m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (24m)    71s ago  24m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (23m)    19s ago  25m    25.3M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (27m)    19s ago  27m     457M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (24m)    71s ago  24m     419M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (27m)    19s ago  27m    48.9M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (24m)    71s ago  24m    40.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (24m)    71s ago  24m    40.3M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (25m)    19s ago  25m    8568k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (24m)    71s ago  24m    8334k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (23m)    71s ago  23m    8514k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:7457       running (27s)    19s ago  27s    17.2M        -  2.18.1                 de242295e225  9458ba412396  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  21s ago    26m  count:1    
crash                           3/3  73s ago    26m  *          
grafana        ?:3000           1/1  21s ago    26m  count:1    
mgr                             2/2  73s ago    26m  count:2    
mon                             3/5  73s ago    26m  count:5    
node-exporter  ?:9100           3/3  73s ago    26m  *          
prometheus     ?:7457           1/1  21s ago    41s  count:1
[ceph: root@vm-00 /]# exit
exit
[root@vm-00 ~]# sudo netstat -tulpn | grep prometheus
tcp6       0      0 :::7457                 :::*                    LISTEN      36133/prometheus

src/pybind/mgr/cephadm/module.py

rkachach · 2022-03-07T13:09:15Z

@adk3798 when I change the port, everything works correctly. However, if I then disable and re-enable the cephadm mgr module, I see the following error, and I'm not sure whether it has to do with cephadm or with the prometheus mgr module itself.

[07/Mar/2022:13:06:44] ENGINE Bus STOPPING
[07/Mar/2022:13:06:44] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 7777)) already shut down
[07/Mar/2022:13:06:44] ENGINE Bus STOPPED
[07/Mar/2022:13:06:44] ENGINE Bus EXITING
[07/Mar/2022:13:06:44] ENGINE Bus EXITED
2022-03-07T13:06:44.562+0000 7f3c92123700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'prometheus' while running on mgr.ceph-node-0.zdhcaf: Timeout('Port 7777 not free on ::.',)
2022-03-07T13:06:44.562+0000 7f3c92123700 -1 prometheus.serve:
2022-03-07T13:06:44.564+0000 7f3c92123700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1741, in serve
    cherrypy.engine.start()
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 283, in start
    raise e_info
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 268, in start
    self.publish('start')
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 248, in publish
    raise exc
cherrypy.process.wspbus.ChannelFailures: Timeout('Port 7777 not free on ::.',)
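
The traceback above shows the server failing to start because port 7777 was not reported free in time. For checking by hand whether something is still listening on a port, a minimal sketch follows. The helper names `port_is_listening` and `wait_until_free` are hypothetical and are not part of Ceph or CherryPy, and the connect-based probe is only one way to test this, not necessarily how the mgr does it.

```python
import socket
import time


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on (host, port)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treated here as "not listening".
        return False


def wait_until_free(host: str, port: int, deadline_s: float = 5.0) -> bool:
    """Poll until nothing listens on (host, port).

    Returns False if the port is still in use when the deadline passes.
    """
    end = time.monotonic() + deadline_s
    while True:
        if not port_is_listening(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(0.1)
```

Calling `wait_until_free("127.0.0.1", 7777)` before re-enabling the module would show whether the old listener has actually gone away.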

rkachach · 2022-03-10T11:57:05Z

jenkins retest this please

rkachach · 2022-03-10T15:48:37Z

jenkins retest this please

rkachach · 2022-03-11T15:50:03Z

jenkins retest this please

rkachach · 2022-03-16T15:47:36Z

jenkins retest this please

p-se

This code works well, except that the output of ceph mgr services is not updated after the port has changed. This may lead to issues with config generation, as that output is used in cephadm to retrieve the active services.

➜  build git:(rkachach-fix_issue_51072) ✗ ceph config set mgr mgr/prometheus/server_port 8888
➜  build git:(rkachach-fix_issue_51072) ✗ curl -fSsl http://home:8888/metrics | wc -l
2561
➜  build git:(rkachach-fix_issue_51072) ✗ ceph mgr services
{
    "dashboard": "https://192.168.1.2:41481/",
    "prometheus": "http://home:7777/"
}

I'm not sure if the result of ceph mgr services can be updated without a restart after the port has changed. Should that not work, we'd need to ensure the mgr_map returned reflects the actual configuration, which might mean not dynamically restarting cherrypy when a port changes (in notify_config), at least for this setting. When the mgr is restarted (like in ceph mgr module disable ... && ceph mgr module enable ...), then the change is reflected properly.
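
The mismatch described above (server_port set to 8888 while ceph mgr services still advertises 7777) can be detected mechanically. The sketch below works on the JSON text printed by ceph mgr services. The function name `advertised_port` is hypothetical, and the sample input is copied from the output shown in this comment.

```python
import json
from typing import Optional
from urllib.parse import urlsplit


def advertised_port(services_json: str, module: str = "prometheus") -> Optional[int]:
    """Extract the port a module advertises in `ceph mgr services` JSON output.

    Returns None if the module is not listed or its URL carries no port.
    """
    services = json.loads(services_json)
    url = services.get(module)
    if url is None:
        return None
    return urlsplit(url).port


# Sample output taken from the comment above.
sample = """
{
    "dashboard": "https://192.168.1.2:41481/",
    "prometheus": "http://home:7777/"
}
"""

configured_port = 8888  # value passed to `ceph config set mgr mgr/prometheus/server_port`
stale = advertised_port(sample) != configured_port
print("advertised URI is stale:", stale)
```

With the sample above, the advertised port (7777) differs from the configured one (8888), which is exactly the inconsistency being reported.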

rkachach · 2022-03-22T16:33:36Z

@p-se Thanks for reviewing this.

I fixed the issue: basically, a call to
self.set_uri(build_url(scheme='http', host=server_addr, port=server_port, path='/'))
was missing when a port change notification is received. Now the URL is updated correctly and the change is reflected immediately. I re-tested the new changes both on the active mgr and after forcing a failover to the standby mgr.
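
To illustrate the idea of the fix, here is a toy model of the bookkeeping: when the port changes, the advertised URI has to be rebuilt as well. This is not the actual prometheus module code. `ToyExporter` and `on_port_change` are hypothetical names, and the `build_url` below is a local stand-in whose behaviour is assumed from the call quoted above.

```python
from typing import Optional


def build_url(scheme: str, host: str, port: int, path: str = "/") -> str:
    """Local stand-in for the build_url helper (assumed behaviour)."""
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"  # bracket IPv6 literals so the port stays unambiguous
    return f"{scheme}://{host}:{port}{path}"


class ToyExporter:
    """Hypothetical module that only models the advertised-URI bookkeeping."""

    def __init__(self, server_addr: str, server_port: int) -> None:
        self.server_addr = server_addr
        self.server_port = server_port
        self.uri: Optional[str] = None
        self.set_uri(build_url("http", server_addr, server_port))

    def set_uri(self, uri: str) -> None:
        # In this toy model the URI is just stored on the object.
        self.uri = uri

    def on_port_change(self, new_port: int) -> None:
        """React to a port change notification."""
        self.server_port = new_port
        # A real module would also restart its HTTP server at this point.
        # The step that was missing: refresh the advertised URI too.
        self.set_uri(build_url(scheme="http", host=self.server_addr,
                               port=new_port, path="/"))


exporter = ToyExporter("192.168.100.100", 7777)
exporter.on_port_change(6666)
print(exporter.uri)
```

Without the `set_uri` call in `on_port_change`, the object would keep advertising the old port even though the server had moved, which matches the stale ceph mgr services output reported in the review.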

On mgr active node (ceph-node-0):


[ceph: root@ceph-node-0 /]# ceph config set mgr mgr/prometheus/server_port 6666
[ceph: root@ceph-node-0 /]# ceph mgr services
{
    "prometheus": "http://192.168.100.100:6666/"
}

[root@ceph-node-0 ~]# ss -tulpn | grep ceph-mgr
tcp   LISTEN 0      5                    *:6666            *:*    users:(("ceph-mgr",pid=11436,fd=24))

From testing host:

fix_issue_51072 >  curl --silent 192.168.100.100:6666/metrics | wc -l    
2049

Force a failover:

[ceph: root@ceph-node-0 /]# ceph mgr fail

On the new active node (ceph-node-1):

[ceph: root@ceph-node-1 /]# ceph mgr services
{
    "prometheus": "http://192.168.100.101:6666/"
}

[root@ceph-node-1 ~]# ss -tulpn | grep ceph-mgr
tcp   LISTEN 0      512            0.0.0.0:6808      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=29))     
tcp   LISTEN 0      512            0.0.0.0:6809      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=30))     
tcp   LISTEN 0      5      192.168.100.101:7150      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=64))     
tcp   LISTEN 0      5                    *:6666            *:*    users:(("ceph-mgr",pid=10949,fd=37))

From testing host:

 curl --silent 192.168.100.101:6666/metrics | wc -l 
2049

adk3798 · 2022-03-22T23:50:25Z

jenkins test api

src/pybind/mgr/prometheus/module.py

adk3798 · 2022-03-24T19:08:26Z

http://pulpito.front.sepia.ceph.com/adking-2022-03-23_02:54:35-orch:cephadm-wip-adk-testing-2022-03-22-2000-distro-basic-smithi/

2 failures, caused by a wrong error code from the host add command due to another PR included in the run.

rkachach requested a review from a team as a code owner March 3, 2022 10:45

github-actions bot added cephadm pybind labels Mar 3, 2022

mgfritch requested a review from p-se March 3, 2022 15:32

adk3798 reviewed Mar 6, 2022

rkachach force-pushed the fix_issue_51072 branch from f32b13e to e10a788 Compare March 7, 2022 12:36

rkachach requested a review from adk3798 March 7, 2022 12:37

rkachach force-pushed the fix_issue_51072 branch from e10a788 to 4542a5c Compare March 8, 2022 17:11

github-actions bot added the monitoring label Mar 8, 2022

rkachach force-pushed the fix_issue_51072 branch from 4542a5c to 395b298 Compare March 9, 2022 11:04

rkachach added the wip-rkachach-testing label Mar 10, 2022

rkachach force-pushed the fix_issue_51072 branch from 395b298 to 9dbe8d4 Compare March 11, 2022 12:04

rkachach force-pushed the fix_issue_51072 branch from 9dbe8d4 to 3328d54 Compare March 14, 2022 09:17

p-se reviewed Mar 22, 2022

rkachach force-pushed the fix_issue_51072 branch from 3328d54 to b09012a Compare March 22, 2022 16:24

rkachach requested a review from p-se March 22, 2022 16:36

adk3798 approved these changes Mar 22, 2022

adk3798 added the wip-adk-testing label Mar 22, 2022

mgr/cephadm: fixing prometheus port handling

8eb1397

Fixes: https://tracker.ceph.com/issues/51072 Signed-off-by: Redouane Kachach <rkachach@redhat.com>

rkachach force-pushed the fix_issue_51072 branch from b09012a to 8eb1397 Compare March 23, 2022 09:28

rkachach removed the wip-rkachach-testing label Mar 23, 2022

adk3798 merged commit 0bdf0ef into ceph:master Mar 25, 2022

rkachach deleted the fix_issue_51072 branch March 25, 2022 14:17

adk3798 mentioned this pull request Mar 30, 2022

Cephadm Pacific Batch Backport March #45716

Merged

14 tasks

adk3798 mentioned this pull request Apr 27, 2022

quincy: Cephadm Batch Backport April #46055

Merged

14 tasks
