Bug #54419
Status: Closed
`ceph orch upgrade start` seems to never reach completion
Description
Pretty much consistently reproducible here - http://pulpito.front.sepia.ceph.com/yuriw-2022-02-25_15:53:18-fs-wip-yuri11-testing-2022-02-21-0831-quincy-distro-default-smithi/6705843/
Yaml matrix
fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap centos_8.stream_container_tools conf/{client mds mon osd} overrides/{pg-warn syntax whitelist_health whitelist_wrongly_marked_down} roles tasks/{0-from/v16.2.4 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/yes 3-inline/yes 4-verify} 2-client 3-upgrade-with-workload 4-verify}}
Upgrade starts:
2022-02-25T16:20:16.424 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/ceph:v16.2.4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid 08be78d6-9656-11ec-8c35-001a4aab830c -e sha1=4fba29ce98c0f535f72d6211e12a92b0f5cc66df -- bash -c 'ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:$sha1'
This check never seems to reach completion:
- cephadm.shell:
env:
- sha1
host.a:
- while ceph orch upgrade status | jq '.in_progress' | grep true ; do ceph orch ps ; ceph versions ; ceph fs dump; sleep 30 ; done
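The loop above polls `ceph orch upgrade status` and keeps running while jq finds `in_progress` set to true. A minimal Python sketch of the same decision, using hand-written sample payloads shaped like the status output (keys other than `in_progress` are illustrative):

```python
import json

def upgrade_in_progress(status_json: str) -> bool:
    """Mirror the jq '.in_progress' check from the teuthology loop."""
    return bool(json.loads(status_json).get("in_progress", False))

# Sample payloads resembling `ceph orch upgrade status` output.
running = '{"target_image": "quay.ceph.io/ceph-ci/ceph:v17", "in_progress": true}'
done = '{"target_image": null, "in_progress": false}'

assert upgrade_in_progress(running) is True
assert upgrade_in_progress(done) is False
```

In the failing run, this predicate never goes false, so the loop spins until the test times out.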
Last check info (`ceph orch ps`):
2022-02-25T22:34:15.621 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.620+0000 7fec97fff700 1 -- 172.21.15.133:0/2733944680 --> [v2:172.21.15.133:6800/3763011160,v1:172.21.15.133:6801/3763011160] -- mgr_command(tid 0: {"prefix": "orch ps", "target":
["mon-mgr", ""]}) v1 -- 0x7fec980fab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stderr:2022-02-25T22:34:15.628+0000 7fec7f7fe700 1 -- 172.21.15.133:0/2733944680 <== mgr.14162 v2:172.21.15.133:6800/3763011160 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+2992 (secure 0 0 0) 0x7fec980f
ab10 con 0x7fec80060a40
2022-02-25T22:34:15.629 INFO:teuthology.orchestra.run.smithi133.stdout:NAME HOST PORTS STATUS REFRESHED AGE VERSION IMAGE ID CONTAINER ID
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:alertmanager.smithi133 smithi133 *:9093,9094 running (6h) 5m ago 6h 0.20.0 0881eb8f169f 6e5319c197ce
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi133 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 bcb7d2ac9bc5
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:crash.smithi140 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ff644256fecb
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:grafana.smithi133 smithi133 *:3000 running (6h) 5m ago 6h 6.7.4 557c83e11646 a3ea39cc9870
2022-02-25T22:34:15.630 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.heswfq smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 4872e1b9c65b
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi133.znzevk smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 c7321edf1b47
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.hsukve smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 a9aca818bda0
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mds.cephfs.smithi140.kdgefj smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 51be41e99316
2022-02-25T22:34:15.631 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi133.myobmx smithi133 *:9283 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 2c4687932e0d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mgr.smithi140.bjvbbe smithi140 *:8443,9283 running (6h) 3m ago 6h 17.0.0-10430-g4fba29ce 049fbe5af4ba e53ceb73c69d
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi133 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 119b013df37b
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:mon.smithi140 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 2b43fb2a6c28
2022-02-25T22:34:15.632 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi133 smithi133 *:9100 running (6h) 5m ago 6h 0.18.1 e5a616e4b9cf 8c3a40d0e2e7
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:node-exporter.smithi140 smithi140 *:9100 running (6h) 3m ago 6h 0.18.1 e5a616e4b9cf ec3bf7d18486
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.0 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 1fc8dffde333
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.1 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 943fe5d8ce93
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.2 smithi133 running (6h) 5m ago 6h 16.2.4 8d91d370c2b8 700ff7f81ead
2022-02-25T22:34:15.633 INFO:teuthology.orchestra.run.smithi133.stdout:osd.3 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ed20ffd50d9b
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.4 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 fb188f04ee5f
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:osd.5 smithi140 running (6h) 3m ago 6h 16.2.4 8d91d370c2b8 ba02f87240e8
2022-02-25T22:34:15.634 INFO:teuthology.orchestra.run.smithi133.stdout:prometheus.smithi133 smithi133 *:9095 running (6h) 5m ago 6h 2.18.1 de242295e225 b0a184237a7a
Only one ceph-mgr was upgraded to 17.*; the rest of the ceph daemons are still running 16.2.4 - not sure why.
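The stuck state is visible if you group the daemon versions from the table above by type; a small sketch (the daemon list is a hand-picked subset of the `ceph orch ps` output, not fetched from a cluster):

```python
from collections import defaultdict

def versions_by_daemon_type(daemons):
    """Group daemon versions per type, similar to what `ceph versions` reports."""
    summary = defaultdict(set)
    for d in daemons:
        summary[d["daemon_type"]].add(d["version"])
    return summary

# Illustrative subset of the `ceph orch ps` table above.
daemons = [
    {"daemon_type": "mgr", "name": "mgr.smithi133.myobmx", "version": "16.2.4"},
    {"daemon_type": "mgr", "name": "mgr.smithi140.bjvbbe", "version": "17.0.0-10430-g4fba29ce"},
    {"daemon_type": "mon", "name": "mon.smithi133", "version": "16.2.4"},
    {"daemon_type": "osd", "name": "osd.0", "version": "16.2.4"},
]

mixed = {t for t, vs in versions_by_daemon_type(daemons).items() if len(vs) > 1}
assert mixed == {"mgr"}  # only the mgrs are split across releases
```

So the upgrade stalled right after the first step (upgrading one mgr) and never moved on to the mons, OSDs, or MDSs.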
Updated by Venky Shankar about 4 years ago
Adam,
I did a cursory check for similar issues, but couldn't find any. There is tracker #54411, but that one has MDSs crashing.
MDSs and other daemons are still on 16.2.4 - what could cause this?
Cheers,
Venky
Updated by Venky Shankar about 4 years ago
Adam,
I spent some time looking into this:
The upgrade starts fine, with cephadm trying to update the standby ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700 4 mgr get_store get_store key: mgr/cephadm/extra_ceph_conf
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999e
c4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] Have connection to smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] mgr.smithi174.vklqpz container image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85
2022-03-09T14:26:46.051+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] args: --image quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85 deploy --fsid ceaf2912-9fb3-11ec-8c35-001a4aab830c --name mgr.smithi174.vklqpz --met
a-json {"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]
} --config-json - --tcp-ports 8443 9283 --allow-ptrace
Here, it probably tries to deploy (and redeploy?) ceph-mgr:
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm INFO cephadm.serve] Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Deploying daemon mgr.smithi174.vklqpz on smithi174
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : command = deploy
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] _run_cephadm : args = ['--name', 'mgr.smithi174.vklqpz', '--meta-json', '{"service_name": "mgr", "ports": [8443, 9283], "ip": null, "deployed_by": ["docker.io/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337", "docker.io/ceph/ceph@sha256:54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949"]}', '--config-json', '-', '--tcp-ports', '8443 9283', '--allow-ptrace']
2022-03-09T14:26:46.050+0000 7fcf96cf6700 0 [cephadm DEBUG root] Have connection to smithi174
.....
.....
.....
.....
.....
2022-03-09T14:27:16.687+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:17.392+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] code: 0
2022-03-09T14:27:17.392+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.serve] err: Redeploy daemon mgr.smithi174.vklqpz ...
2022-03-09T14:27:17.393+0000 7fcf96cf6700 1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/host.smithi174}] v 0) v1 -- 0x55ff583a4000 con 0x55ff56bb8400
Then, when it comes to upgrading itself, there is no standby ceph-mgr available:
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] Upgrade: Checking mgr daemons
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm INFO cephadm.upgrade] Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 log_channel(cephadm) log [INF] : Upgrade: Need to upgrade myself (mgr.smithi119.czhgre)
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz container digest correct
2022-03-09T14:27:28.827+0000 7fcf96cf6700 0 [cephadm DEBUG cephadm.upgrade] daemon mgr.smithi174.vklqpz not deployed by correct version
2022-03-09T14:27:28.828+0000 7fcf96cf6700 0 [cephadm ERROR cephadm.upgrade] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 -1 log_channel(cephadm) log [ERR] : Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon
2022-03-09T14:27:28.828+0000 7fcf96cf6700 1 -- 172.21.15.119:0/3384159902 --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mon_command([{prefix=config-key set, key=mgr/cephadm/upgrade_state}] v 0) v1 -- 0x55ff583a4600 con 0x55ff56bb8400
2022-03-09T14:27:28.838+0000 7fcfc3e79700 15 mgr notify_all queuing notify to cephadm
2022-03-09T14:27:28.838+0000 7fcfc3e79700 20 mgr update_kv_data set mgr/cephadm/upgrade_state = {"target_name": "quay.ceph.io/ceph-ci/ceph:e98697fdcb3b7b8eab3fc453719d4e18f0d62be4", "progress_id": "066fd2ec-6d47-45c0-ad4c-7c87aec0d07f", "target_id": "a26d38fa99d22957938f77f7d65fb1b93b80f520b00ecb8334618c543bd3d3a9", "target_digests": ["quay.ceph.io/ceph-ci/ceph@sha256:0dacea6c1eb3ffb15f584f5d72137b793530e47098bdc4f1d9c14fbf1debbe85"], "target_version": "17.0.0-11006-ge98697fd", "fs_original_max_mds": null, "error": "UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon", "paused": true}
2022-03-09T14:27:28.838+0000 7fcfc3e79700 1 -- 172.21.15.119:0/3384159902 <== mon.0 v2:172.21.15.119:3300/0 1753 ==== mon_command_ack([{prefix=config-key set, key=mgr/cephadm/upgrade_state}]=0 set mgr/cephadm/upgrade_state v134)=0 set mgr/cephadm/upgrade_state v134) v1 ==== 661+0+0 (secure 0 0 0) 0x55ff56c8f1e0 con 0x55ff56bb8400
... and the upgrade is "paused".
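The pause appears to come from the mgr-upgrade gate: before the active mgr can fail over and upgrade itself, cephadm wants a standby that is already on the target image, and here the freshly redeployed standby is judged "not deployed by correct version" despite running the right container digest. A simplified, hypothetical model of that gate (not the actual cephadm code):

```python
def can_upgrade_active_mgr(mgrs, target_digest):
    """Return True if some standby mgr already runs the target digest.

    Simplified model of cephadm's UPGRADE_NO_STANDBY_MGR gate: the
    active mgr may only be upgraded (via failover) when a standby on
    the new image is available.
    """
    return any(
        not m["active"] and m["container_digest"] == target_digest
        for m in mgrs
    )

target = "sha256:0dacea6c"  # truncated digest, for illustration only
mgrs = [
    {"name": "mgr.smithi119.czhgre", "active": True, "container_digest": "sha256:70536e31"},
    # The standby runs the new image, so the digest check alone would
    # pass -- yet the additional "deployed by correct version" check in
    # the logs above still rejected it, pausing the upgrade.
    {"name": "mgr.smithi174.vklqpz", "active": False, "container_digest": target},
]

assert can_upgrade_active_mgr(mgrs, target) is True
```

That mismatch - digest correct, but "not deployed by correct version" - is what makes the failure puzzling: by the digest criterion alone the standby should have been accepted.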
The standby mgr seems to be up, however:
2022-03-09T14:27:17.003+0000 7fb0753eb000 0 ceph version 17.0.0-11006-ge98697fd (e98697fdcb3b7b8eab3fc453719d4e18f0d62be4) quincy (dev), process ceph-mgr, pid 7
2022-03-09T14:27:17.004+0000 7fb0753eb000 0 pidfile_write: ignore empty --pid-file
2022-03-09T14:27:17.006+0000 7fb0753eb000 1 Processor -- start
2022-03-09T14:27:17.006+0000 7fb0753eb000 1 -- start start
.....
.....
.....
.....
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr tick tick
2022-03-09T14:27:36.461+0000 7fb06576b700 20 mgr send_beacon standby
2022-03-09T14:27:36.461+0000 7fb06576b700 10 mgr send_beacon sending beacon as gid 24457
2022-03-09T14:27:36.462+0000 7fb06576b700 1 -- 172.21.15.174:0/2967250110 --> [v2:172.21.15.174:3300/0,v1:172.21.15.174:6789/0] -- mgrbeacon mgr.smithi174.vklqpz(ceaf2912-9fb3-11ec-8c35-001a4aab830c,24457, , 0) v10 -- 0x55d6ef1c2c80 con 0x55d6e6c5a800
... and continues to send beacons (as standby) until the test times out and the daemons are terminated.
I'm not sure what's going on.
Updated by Laura Flores over 3 years ago
@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
Updated by Adam King over 3 years ago
Laura Flores wrote:
@Venky @Adam is https://tracker.ceph.com/issues/57255 a dupe of this Tracker?
Most likely, yes. I think this tracker and https://tracker.ceph.com/issues/57255 are just how the problem expresses itself before and after https://github.com/ceph/ceph/pull/45361, respectively.
Updated by Venky Shankar over 2 years ago
- Related to Bug #57255: rados/cephadm/mds_upgrade_sequence, pacific : cephadm [ERR] Upgrade: Paused due to UPGRADE_NO_STANDBY_MGR: Upgrade: Need standby mgr daemon added