
mgr/cephadm: fixing prometheus port handling#45241

Merged
adk3798 merged 1 commit intoceph:masterfrom
rkachach:fix_issue_51072
Mar 25, 2022

@rkachach
Contributor

@rkachach rkachach commented Mar 3, 2022

Fixes: https://tracker.ceph.com/issues/51072

Signed-off-by: Redouane Kachach <rkachach@redhat.com>

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@rkachach rkachach requested a review from a team as a code owner March 3, 2022 10:45
@mgfritch mgfritch requested a review from p-se March 3, 2022 15:32
@adk3798
Contributor

adk3798 commented Mar 6, 2022

seems to work in testing

[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  4m ago     19m  count:1    
crash                           3/3  4m ago     19m  *          
grafana        ?:3000           1/1  4m ago     19m  count:1    
mgr                             2/2  4m ago     19m  count:2    
mon                             3/5  4m ago     19m  count:5    
node-exporter  ?:9100           3/3  4m ago     19m  *          
prometheus     ?:9095           1/1  4m ago     30s  count:1    
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (17m)     6m ago  20m    15.5M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (20m)     6m ago  20m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (18m)     5m ago  18m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (18m)     6m ago  18m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (17m)     6m ago  19m    24.2M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (21m)     6m ago  21m     453M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (18m)     5m ago  18m     417M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (21m)     6m ago  21m    41.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (18m)     5m ago  18m    30.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (18m)     6m ago  18m    33.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (19m)     6m ago  19m    8242k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (18m)     5m ago  18m    8212k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (18m)     6m ago  18m    8224k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:9095       running (17m)     6m ago  17m    34.8M        -  2.18.1                 de242295e225  80a3863c1724  
[ceph: root@vm-00 /]# cat /usr/share/ceph/mgr/cephadm/module.py | grep monitoring_spec.port
            monitoring_spec.port = cast(int, self._ceph_get_module_option('prometheus', 'server_port'))
[ceph: root@vm-00 /]# ceph config set mgr mgr/prometheus/server_port 12765
[ceph: root@vm-00 /]# ceph config get mgr mgr/promeheus/server_port
Error ENOENT: 
[ceph: root@vm-00 /]# ceph config get mgr mgr/prometheus/server_port
12765
[ceph: root@vm-00 /]# ceph orch apply prometheus
Scheduled prometheus update...
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (21m)    10s ago  23m    15.8M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (23m)    10s ago  23m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (21m)     8m ago  21m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (21m)     9m ago  21m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (20m)    10s ago  23m    24.9M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (24m)    10s ago  24m     456M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (21m)     8m ago  21m     417M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (24m)    10s ago  24m    45.9M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (21m)     8m ago  21m    30.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (21m)     9m ago  21m    33.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (23m)    10s ago  23m    8493k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (21m)     8m ago  21m    8212k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (21m)     9m ago  21m    8224k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:12765      running (18s)    10s ago  18s    17.3M        -  2.18.1                 de242295e225  c1eba972b487  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  13s ago    24m  count:1    
crash                           3/3  9m ago     24m  *          
grafana        ?:3000           1/1  13s ago    24m  count:1    
mgr                             2/2  8m ago     24m  count:2    
mon                             3/5  9m ago     24m  count:5    
node-exporter  ?:9100           3/3  9m ago     24m  *          
prometheus     ?:12765          1/1  13s ago    32s  count:1    
[ceph: root@vm-00 /]# ceph orch ls --service-name prometheus -f yaml
service_type: prometheus
service_name: prometheus
placement:
  count: 1
spec:
  port: 12765
status:
  created: '2022-03-06T18:19:31.965101Z'
  last_refresh: '2022-03-06T18:19:51.454906Z'
  ports:
  - 12765
  running: 1
  size: 1
events:
- 2022-03-06T18:14:36.221492Z service:prometheus [INFO] "service was created"
[ceph: root@vm-00 /]# ceph config set mgr mgr/prometheus/server_port 7457 
[ceph: root@vm-00 /]# ceph orch apply prometheus
Scheduled prometheus update...
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (23m)    19s ago  26m    15.9M        -  0.20.0                 0881eb8f169f  efe844cfef9b  
crash.vm-00          vm-00               running (26m)    19s ago  26m    7155k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  29dcefad4cca  
crash.vm-01          vm-01               running (24m)    71s ago  24m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  9a3dc115dc73  
crash.vm-02          vm-02               running (24m)    71s ago  24m    7109k        -  17.0.0-8135-gf5b96461  a0a69fd29e57  b7638f27b6ab  
grafana.vm-00        vm-00  *:3000       running (23m)    19s ago  25m    25.3M        -  6.7.4                  557c83e11646  d4049437d170  
mgr.vm-00.syxfti     vm-00  *:9283       running (27m)    19s ago  27m     457M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  a48844d4b598  
mgr.vm-01.utpsdw     vm-01  *:8443,9283  running (24m)    71s ago  24m     419M        -  17.0.0-8135-gf5b96461  a0a69fd29e57  58cb3c911fcc  
mon.vm-00            vm-00               running (27m)    19s ago  27m    48.9M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  e06cfb279a05  
mon.vm-01            vm-01               running (24m)    71s ago  24m    40.0M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  09c3b549852e  
mon.vm-02            vm-02               running (24m)    71s ago  24m    40.3M    2048M  17.0.0-8135-gf5b96461  a0a69fd29e57  20c7a9b0c000  
node-exporter.vm-00  vm-00  *:9100       running (25m)    19s ago  25m    8568k        -  0.18.1                 e5a616e4b9cf  9639d68a726c  
node-exporter.vm-01  vm-01  *:9100       running (24m)    71s ago  24m    8334k        -  0.18.1                 e5a616e4b9cf  c589fc6f95be  
node-exporter.vm-02  vm-02  *:9100       running (23m)    71s ago  23m    8514k        -  0.18.1                 e5a616e4b9cf  fa0ba27c858b  
prometheus.vm-00     vm-00  *:7457       running (27s)    19s ago  27s    17.2M        -  2.18.1                 de242295e225  9458ba412396  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  21s ago    26m  count:1    
crash                           3/3  73s ago    26m  *          
grafana        ?:3000           1/1  21s ago    26m  count:1    
mgr                             2/2  73s ago    26m  count:2    
mon                             3/5  73s ago    26m  count:5    
node-exporter  ?:9100           3/3  73s ago    26m  *          
prometheus     ?:7457           1/1  21s ago    41s  count:1
[ceph: root@vm-00 /]# exit
exit
[root@vm-00 ~]# sudo netstat -tulpn | grep prometheus
tcp6       0      0 :::7457                 :::*                    LISTEN      36133/prometheus 
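
For anyone who wants to repeat this check from a script, here is a minimal sketch that pulls the PORTS column out of `ceph orch ls` text output like the one above. The helper name `service_ports` is hypothetical, and the assumption that ports always show up as a second field prefixed with `?:` is inferred only from this transcript; parsing the `-f yaml` output shown above is probably more robust.

```python
from typing import Dict, List


def service_ports(orch_ls_output: str) -> Dict[str, List[int]]:
    """Map service name -> list of ports, based on `ceph orch ls` text output.

    Assumption: the layout matches the transcript above (header line first,
    ports in the second column as '?:<port>[,<port>...]' when present).
    """
    ports: Dict[str, List[int]] = {}
    for line in orch_ls_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if len(fields) > 1 and fields[1].startswith("?:"):
            ports[name] = [int(p) for p in fields[1][2:].split(",")]
        else:
            ports[name] = []  # service without a PORTS entry (e.g. crash)
    return ports


# A few rows copied from the transcript above.
SAMPLE = """\
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094      1/1  13s ago    24m  count:1
crash                           3/3  9m ago     24m  *
prometheus     ?:12765          1/1  13s ago    32s  count:1
"""

if __name__ == "__main__":
    print(service_ports(SAMPLE)["prometheus"])
```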

@rkachach rkachach requested a review from adk3798 March 7, 2022 12:37
@rkachach
Contributor Author

rkachach commented Mar 7, 2022

@adk3798 when I change the port everything works correctly. However, if I then disable and re-enable the cephadm mgr module, I see the following error, and I'm not sure whether it has to do with cephadm or with the prometheus mgr module itself.

[07/Mar/2022:13:06:44] ENGINE Bus STOPPING
[07/Mar/2022:13:06:44] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 7777)) already shut down
[07/Mar/2022:13:06:44] ENGINE Bus STOPPED
[07/Mar/2022:13:06:44] ENGINE Bus EXITING
[07/Mar/2022:13:06:44] ENGINE Bus EXITED
2022-03-07T13:06:44.562+0000 7f3c92123700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'prometheus' while running on mgr.ceph-node-0.zdhcaf: Timeout('Port 7777 not free on ::.',)
2022-03-07T13:06:44.562+0000 7f3c92123700 -1 prometheus.serve:
2022-03-07T13:06:44.564+0000 7f3c92123700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1741, in serve
    cherrypy.engine.start()
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 283, in start
    raise e_info
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 268, in start
    self.publish('start')
  File "/lib/python3.6/site-packages/cherrypy/process/wspbus.py", line 248, in publish
    raise exc
cherrypy.process.wspbus.ChannelFailures: Timeout('Port 7777 not free on ::.',)
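
To narrow down a "Port 7777 not free" timeout like the one above, one option is to check by hand whether something is still bound to the port. The sketch below is a generic bind test using only the Python standard library; the name `port_is_free` is hypothetical, and this is not how CherryPy itself performs its check, so the two can disagree in edge cases (for example sockets lingering in TIME_WAIT).

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Assumption: a failed bind means the port is in use. This is a
    conservative check and may report 'not free' for lingering sockets.
    """
    # Pick the address family from the host string ('::' style means IPv6).
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # 7777 is the port from the traceback above.
    print("port 7777 free:", port_is_free(7777))
```

Running it on the mgr host right after disabling the module would show whether the old listener has actually gone away before the module is enabled again.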

@rkachach
Contributor Author

jenkins retest this please

@rkachach
Contributor Author

jenkins retest this please

@rkachach
Contributor Author

jenkins retest this please

@rkachach
Contributor Author

jenkins retest this please

Contributor

@p-se p-se left a comment

This code works well, with the exception that the result of ceph mgr services is not updated after the port has changed. This may lead to issues with config generation, as that result is used in cephadm to retrieve the active services.

build git:(rkachach-fix_issue_51072) ✗ ceph config set mgr mgr/prometheus/server_port 8888
build git:(rkachach-fix_issue_51072) ✗ curl -fSsl http://home:8888/metrics | wc -l
2561
build git:(rkachach-fix_issue_51072) ✗ ceph mgr services
{
    "dashboard": "https://192.168.1.2:41481/",
    "prometheus": "http://home:7777/"
}

I'm not sure if the result of ceph mgr services can be updated without a restart after the port has changed. Should that not work, we'd need to ensure the mgr_map returned reflects the actual configuration, which might make it necessary to not dynamically restart cherrypy on change of a port (in notify_config). At least for this setting. When the mgr is restarted (like in ceph mgr module disable ... && ceph mgr module enable ...), then the change is reflected properly.
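
The mismatch shown above can also be detected mechanically. The following is a minimal sketch that extracts the advertised port from the JSON printed by `ceph mgr services` and compares it with the configured one; the function name `advertised_port` is hypothetical, and the JSON is passed in as a string rather than fetched from a cluster.

```python
import json
from typing import Optional
from urllib.parse import urlparse


def advertised_port(mgr_services_json: str,
                    service: str = "prometheus") -> Optional[int]:
    """Return the port of `service` in `ceph mgr services` JSON output.

    Returns None if the service is not listed or its URL carries no port.
    """
    services = json.loads(mgr_services_json)
    url = services.get(service)
    if url is None:
        return None
    return urlparse(url).port


# Output copied from the comment above, taken after server_port was set to 8888.
STALE_OUTPUT = """
{
    "dashboard": "https://192.168.1.2:41481/",
    "prometheus": "http://home:7777/"
}
"""

if __name__ == "__main__":
    configured = 8888
    advertised = advertised_port(STALE_OUTPUT)
    if advertised != configured:
        print(f"stale URI: advertised {advertised}, configured {configured}")
```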

@rkachach
Contributor Author

rkachach commented Mar 22, 2022

This code works well, with the exception that the result of ceph mgr services is not updated after the port has changed. This may lead to issues with config generation, as that result is used in cephadm to retrieve the active services.

➜  build git:(rkachach-fix_issue_51072) ✗ ceph config set mgr mgr/prometheus/server_port 8888
➜  build git:(rkachach-fix_issue_51072) ✗ curl -fSsl http://home:8888/metrics | wc -l
2561
➜  build git:(rkachach-fix_issue_51072) ✗ ceph mgr services
{
    "dashboard": "https://192.168.1.2:41481/",
    "prometheus": "http://home:7777/"
}

I'm not sure if the result of ceph mgr services can be updated without a restart after the port has changed. Should that not work, we'd need to ensure the mgr_map returned reflects the actual configuration, which might make it necessary to not dynamically restart cherrypy on change of a port (in notify_config). At least for this setting. When the mgr is restarted (like in ceph mgr module disable ... && ceph mgr module enable ...), then the change is reflected properly.

@p-se Thanks for reviewing this.

I fixed the issue. Basically, a call to
self.set_uri(build_url(scheme='http', host=server_addr, port=server_port, path='/'))
was missing when a port change notification is received. Now the URL is updated correctly and the change is reflected immediately. I re-tested the new changes both on the active mgr and after forcing a failover to the standby mgr.
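
For readers without the diff at hand, the idea of the fix can be sketched roughly as follows. This is not the actual module code: `ModuleSketch`, `on_port_change`, and the local `build_url` are hypothetical stand-ins that only mirror the keyword arguments of the call quoted above, and the HTTP server restart is left out.

```python
def build_url(scheme: str, host: str, port: int, path: str = "/") -> str:
    """Local stand-in for the helper named above (assumed behaviour)."""
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"  # bracket bare IPv6 addresses
    return f"{scheme}://{host}:{port}{path}"


class ModuleSketch:
    """Hypothetical stand-in for the prometheus mgr module."""

    def __init__(self, server_addr: str, server_port: int) -> None:
        self.server_addr = server_addr
        self.server_port = server_port
        self.uri = ""
        self._publish_uri()

    def set_uri(self, uri: str) -> None:
        # Stand-in: the real call is what makes `ceph mgr services` report the URL.
        self.uri = uri

    def _publish_uri(self) -> None:
        self.set_uri(build_url(scheme="http", host=self.server_addr,
                               port=self.server_port, path="/"))

    def on_port_change(self, new_port: int) -> None:
        """Handle a port change notification."""
        self.server_port = new_port
        # (the real module also restarts its HTTP server at this point)
        self._publish_uri()  # the step that was missing before the fix


if __name__ == "__main__":
    module = ModuleSketch("192.168.100.100", 7777)
    module.on_port_change(6666)
    print(module.uri)
```

Without the `_publish_uri()` call in `on_port_change`, the sketch would keep advertising the old port, which matches the stale `ceph mgr services` output reported in the review.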

On mgr active node (ceph-node-0):


[ceph: root@ceph-node-0 /]# ceph config set mgr mgr/prometheus/server_port 6666
[ceph: root@ceph-node-0 /]# ceph mgr services
{
    "prometheus": "http://192.168.100.100:6666/"
}

[root@ceph-node-0 ~]# ss -tulpn | grep ceph-mgr
tcp   LISTEN 0      5                    *:6666            *:*    users:(("ceph-mgr",pid=11436,fd=24))

From testing host:

fix_issue_51072 >  curl --silent 192.168.100.100:6666/metrics | wc -l    
2049

Force a failover:

[ceph: root@ceph-node-0 /]# ceph mgr fail

On the new active node (ceph-node-1):

[ceph: root@ceph-node-1 /]# ceph mgr services
{
    "prometheus": "http://192.168.100.101:6666/"
}

[root@ceph-node-1 ~]# ss -tulpn | grep ceph-mgr
tcp   LISTEN 0      512            0.0.0.0:6808      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=29))     
tcp   LISTEN 0      512            0.0.0.0:6809      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=30))     
tcp   LISTEN 0      5      192.168.100.101:7150      0.0.0.0:*    users:(("ceph-mgr",pid=10949,fd=64))     
tcp   LISTEN 0      5                    *:6666            *:*    users:(("ceph-mgr",pid=10949,fd=37)) 

From testing host:

 curl --silent 192.168.100.101:6666/metrics | wc -l 
2049

@rkachach rkachach requested a review from p-se March 22, 2022 16:36
@adk3798
Contributor

adk3798 commented Mar 22, 2022

jenkins test api

Fixes: https://tracker.ceph.com/issues/51072

Signed-off-by: Redouane Kachach <rkachach@redhat.com>
@adk3798
Contributor

adk3798 commented Mar 24, 2022

http://pulpito.front.sepia.ceph.com/adking-2022-03-23_02:54:35-orch:cephadm-wip-adk-testing-2022-03-22-2000-distro-basic-smithi/

2 failures, both caused by a wrong error code from the host add command, which comes from another PR included in the run.

@adk3798 adk3798 merged commit 0bdf0ef into ceph:master Mar 25, 2022
@rkachach rkachach deleted the fix_issue_51072 branch March 25, 2022 14:17