Bug #65494
ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists"
Status: Closed
Description
We hit a random error (roughly 1 in 200 deploys): after successfully creating a rook
cephcluster and cephblockpool, configuring rbd mirroring, and adding a cephrbdmirror,
the cephrbdmirror never becomes ready (we waited a few hours).
ceph status shows:
cluster:
id: dbf6c8b8-dd8b-4117-933e-93778b1a7274
health: HEALTH_ERR
Module 'devicehealth' has failed: table Device already exists
In the rook-ceph-mgr-a pod logs we see:
debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
return func(self, *args, **kwargs)
File "/usr/share/ceph/mgr/devicehealth/module.py", line 350, in _do_serve
if self.db_ready() and self.enable_monitoring:
File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
return self.db is not None
File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
self._db = self.open_db()
File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
self.configure_db(db)
File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
self.load_schema(db)
File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
self.maybe_upgrade(db, int(row['value']))
File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/devicehealth/module.py", line 394, in serve
self._do_serve()
File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
self.open_db();
File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
self.configure_db(db)
File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
self.load_schema(db)
File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
self.maybe_upgrade(db, int(row['value']))
File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists
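The failure mode can be reproduced outside the mgr. A minimal sketch (the SCHEMA below is illustrative, loosely modeled on the devicehealth tables, not copied from mgr_module.py): if an earlier run committed CREATE TABLE Device but died before finishing the schema script, re-running the script raises exactly this error.

    import sqlite3

    # Illustrative schema, loosely modeled on the devicehealth tables.
    SCHEMA = """
    CREATE TABLE Device (devid TEXT PRIMARY KEY);
    CREATE TABLE DeviceHealthMetrics (
        time DATETIME DEFAULT (strftime('%s', 'now')),
        devid TEXT NOT NULL REFERENCES Device (devid),
        raw_smart TEXT NOT NULL
    );
    """

    db = sqlite3.connect(":memory:")

    # "First start": the daemon dies after the first CREATE TABLE has
    # committed but before the rest of the script ran.
    db.execute("CREATE TABLE Device (devid TEXT PRIMARY KEY)")

    # "Second start": load_schema() re-runs the whole script and trips
    # over the half-applied schema.
    db.executescript(SCHEMA)  # sqlite3.OperationalError: table Device already exists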
Restarting the ceph-mgr pod does not help; rbd mirroring stays broken and
we don't have any workaround.
For testing ramen this is not that bad, since we can delete the environment and
recreate it in 10 minutes, but for a real deployment this looks bad.
See also
- Upstream issue: https://github.com/RamenDR/ramen/issues/1298
Updated by Nir Soffer almost 2 years ago
Thread from rook slack:
https://rook-io.slack.com/archives/CK9CF5H2R/p1711467112958679
Updated by Yaarit Hatuka almost 2 years ago
- Project changed from mgr to cephsqlite
Looks like a sqlite issue. Patrick, can you take a look please?
Updated by Nir Soffer almost 2 years ago
- image: quay.io/ceph/ceph:v18
- imageID: quay.io/ceph/ceph@sha256:8c1697a0a924bbd625c9f1b33893bbc47b97b8a8c66666a715fe60b353b1d93e
Updated by Patrick Donnelly almost 2 years ago
debug 2024-04-09T13:05:18.368+0000 7f644947d700 0 log_channel(cluster) log [DBG] : pgmap v7: 1 pgs: 1 creating+peering; 0 B data, 26 MiB used, 50 GiB / 50 GiB avail
debug 2024-04-09T13:05:18.654+0000 7f64180bc700 0 [devicehealth INFO root] creating main.db for devicehealth
ignoring --setuser ceph since I am not root
ignoring --setgroup ceph since I am not root
debug 2024-04-09T13:05:18.784+0000 7fd37cd18200 0 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable), process ceph-mgr, pid 1
mgr restarted while db was being initialized
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth ERROR root] Caught fatal database error: table Device already exists
There is a nuance of sqlite transactions with Python that I was not aware of. I'll write up a fix soon.
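Presumably the nuance is that Cursor.executescript() only commits any pending transaction and then runs the script's statements bare, with each DDL statement committing on its own, so an interrupted script leaves a partial schema behind. A hedged sketch of one fix direction (the actual PR may do this differently): put explicit transaction control inside the script so it applies atomically.

    import sqlite3

    # Same illustrative schema as above, now wrapped in BEGIN/COMMIT.
    SCHEMA = """
    BEGIN TRANSACTION;
    CREATE TABLE Device (devid TEXT PRIMARY KEY);
    CREATE TABLE DeviceHealthMetrics (
        time DATETIME DEFAULT (strftime('%s', 'now')),
        devid TEXT NOT NULL REFERENCES Device (devid),
        raw_smart TEXT NOT NULL
    );
    COMMIT;
    """

    db = sqlite3.connect(":memory:")

    # With BEGIN/COMMIT inside the script, a crash part-way through
    # rolls back cleanly: either the whole schema lands or none of it.
    db.executescript(SCHEMA)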
Updated by Patrick Donnelly almost 2 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v20.0.0
- Source set to Community (dev)
- Backport set to squid,reef
Updated by Ilya Dryomov almost 2 years ago
Nir Soffer wrote:
Restarting the ceph-mgr pod does not help, rbd-mirroring is broken and
we don't have any workaround.
Hi Nir,
I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.913+0000 7f0632607700 0 [devicehealth ERROR root] Caught fatal database error: table Device already exists
debug 2024-04-09T13:05:48.947+0000 7f0632607700 0 [devicehealth INFO root] creating main.db for devicehealth
debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.a: table Device already exists
[...]
debug 2024-04-09T13:06:36.965+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:06:36.966+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:06:36.967+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [
{
"interval": "2m",
"start_time": "14:00:00-05:00"
}
]
[...]
debug 2024-04-09T13:07:36.980+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T13:07:36.980+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T13:07:36.985+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [
{
"interval": "2m",
"start_time": "14:00:00-05:00"
}
]
[...]
debug 2024-04-09T15:41:38.808+0000 7f06225e7700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2024-04-09T15:41:38.811+0000 7f06225e7700 0 [rbd_support INFO root] load_schedules: replicapool, start_after=
debug 2024-04-09T15:41:38.813+0000 7f06225e7700 0 [rbd_support INFO root] load_schedule: 2 [
{
"interval": "2m",
"start_time": "14:00:00-05:00"
}
]
Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?
Updated by Nir Soffer almost 2 years ago
Ilya Dryomov wrote in #note-6:
Nir Soffer wrote:
Restarting the ceph-mgr pod does not help, rbd-mirroring is broken and
we don't have any workaround.
Hi Nir,
I don't think this is related to RBD mirroring. From the manager log, it's clear that MirrorSnapshotScheduleHandler in rbd_support module continued running after that exception was thrown in devicehealth module:
[...]
Have you tried restarting rbd-mirror pod like you did in https://tracker.ceph.com/issues/65487? Could the RBD mirroring part of this ticket be a duplicate of that?
I probably tried restarting rbd-mirror, but this is not the same issue. In this case the
cephrbdmirror never becomes Ready - it never reports any status.
In https://tracker.ceph.com/issues/65487 the cephrbdmirror becomes Ready and everything
looks normal, but the cephblockpool .status.mirroringStatus.summary.daemon_health never
becomes OK.
In our test log you can see:
cephcluster is created successfully:
2024-04-09 16:02:40,569 DEBUG [dr2/0] Deploying rook ceph cluster
2024-04-09 16:02:41,236 DEBUG [dr2/0] cephblockpool.ceph.rook.io/builtin-mgr created
2024-04-09 16:02:41,250 DEBUG [dr2/0] cephcluster.ceph.rook.io/my-cluster created
2024-04-09 16:02:41,253 DEBUG [dr2/0] Waiting until rook ceph cluster is ready
2024-04-09 16:02:41,318 DEBUG [dr2/0] 'cephcluster/my-cluster' output='jsonpath={.status.phase}' found in 0.06 seconds
2024-04-09 16:05:07,151 DEBUG [dr2/0] cephcluster.ceph.rook.io/my-cluster condition met
2024-04-09 16:05:07,234 DEBUG [dr2/0] ceph cluster status:
2024-04-09 16:05:07,234 DEBUG [dr2/0] conditions:
2024-04-09 16:05:07,234 DEBUG [dr2/0] - lastHeartbeatTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG [dr2/0] lastTransitionTime: '2024-04-09T13:05:07Z'
2024-04-09 16:05:07,234 DEBUG [dr2/0] message: Cluster created successfully
2024-04-09 16:05:07,234 DEBUG [dr2/0] reason: ClusterCreated
2024-04-09 16:05:07,234 DEBUG [dr2/0] status: 'True'
2024-04-09 16:05:07,234 DEBUG [dr2/0] type: Ready
2024-04-09 16:05:07,235 DEBUG [dr2/0] message: Cluster created successfully
2024-04-09 16:05:07,235 DEBUG [dr2/0] observedGeneration: 2
2024-04-09 16:05:07,235 DEBUG [dr2/0] phase: Ready
2024-04-09 16:05:07,235 DEBUG [dr2/0] state: Created
2024-04-09 16:05:07,235 DEBUG [dr2/0] storage:
2024-04-09 16:05:07,235 DEBUG [dr2/0] osd:
2024-04-09 16:05:07,235 DEBUG [dr2/0] storeType:
2024-04-09 16:05:07,235 DEBUG [dr2/0] bluestore: 1
2024-04-09 16:05:07,235 DEBUG [dr2/0] version:
2024-04-09 16:05:07,235 DEBUG [dr2/0] image: quay.io/ceph/ceph:v18
2024-04-09 16:05:07,235 DEBUG [dr2/0] version: 18.2.2-0
ceph status after cluster is created:
2024-04-09 16:05:10,870 DEBUG [dr2/0] deployment "rook-ceph-tools" successfully rolled out
2024-04-09 16:05:10,872 DEBUG [dr2/0] ceph status:
2024-04-09 16:05:11,683 DEBUG [dr2/0] cluster:
2024-04-09 16:05:11,684 DEBUG [dr2/0]   id: 2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:11,684 DEBUG [dr2/0]   health: HEALTH_OK
2024-04-09 16:05:11,684 DEBUG [dr2/0]
2024-04-09 16:05:11,684 DEBUG [dr2/0] services:
2024-04-09 16:05:11,684 DEBUG [dr2/0]   mon: 1 daemons, quorum a (age 42s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]   mgr: a(active, since 3s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]   osd: 1 osds: 0 up, 1 in (since 10s)
2024-04-09 16:05:11,684 DEBUG [dr2/0]
2024-04-09 16:05:11,684 DEBUG [dr2/0] data:
2024-04-09 16:05:11,684 DEBUG [dr2/0]   pools: 0 pools, 0 pgs
2024-04-09 16:05:11,684 DEBUG [dr2/0]   objects: 0 objects, 0 B
2024-04-09 16:05:11,684 DEBUG [dr2/0]   usage: 0 B used, 0 B / 0 B avail
2024-04-09 16:05:11,684 DEBUG [dr2/0]   pgs:
cephblockpool created successfully:
2024-04-09 16:05:11,827 DEBUG [dr2/0] Creating rbd pool and storage class
2024-04-09 16:05:12,020 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool created
2024-04-09 16:05:12,033 DEBUG [dr2/0] storageclass.storage.k8s.io/rook-ceph-block created
2024-04-09 16:05:12,036 DEBUG [dr2/0] Waiting until ceph block pool is ready
2024-04-09 16:05:16,353 DEBUG [dr2/0] 'cephblockpool/replicapool' output='jsonpath={.status.phase}' found in 4.32 seconds
2024-04-09 16:05:41,886 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:41,888 DEBUG [dr2/0] Waiting for replica pool peer token
2024-04-09 16:05:42,048 DEBUG [dr2/0] cephblockpool.ceph.rook.io/replicapool condition met
2024-04-09 16:05:42,127 DEBUG [dr2/0] ceph pool status:
2024-04-09 16:05:42,127 DEBUG [dr2/0] info:
2024-04-09 16:05:42,127 DEBUG [dr2/0] rbdMirrorBootstrapPeerSecretName: pool-peer-token-replicapool
2024-04-09 16:05:42,127 DEBUG [dr2/0] mirroringInfo:
2024-04-09 16:05:42,127 DEBUG [dr2/0] lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0] mode: image
2024-04-09 16:05:42,127 DEBUG [dr2/0] site_name: 2aeb5e7d-352f-4227-9c99-6d2d422751f9
2024-04-09 16:05:42,127 DEBUG [dr2/0] mirroringStatus:
2024-04-09 16:05:42,127 DEBUG [dr2/0] lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0] summary:
2024-04-09 16:05:42,127 DEBUG [dr2/0] daemon_health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG [dr2/0] health: UNKNOWN
2024-04-09 16:05:42,127 DEBUG [dr2/0] image_health: OK
2024-04-09 16:05:42,127 DEBUG [dr2/0] states: {}
2024-04-09 16:05:42,127 DEBUG [dr2/0] observedGeneration: 1
2024-04-09 16:05:42,127 DEBUG [dr2/0] phase: Ready
2024-04-09 16:05:42,127 DEBUG [dr2/0] snapshotScheduleStatus:
2024-04-09 16:05:42,127 DEBUG [dr2/0] lastChecked: '2024-04-09T13:05:42Z'
2024-04-09 16:05:42,127 DEBUG [dr2/0] snapshotSchedules:
2024-04-09 16:05:42,127 DEBUG [dr2/0] - image: '-'
2024-04-09 16:05:42,127 DEBUG [dr2/0] items:
2024-04-09 16:05:42,127 DEBUG [dr2/0] - interval: 2m
2024-04-09 16:05:42,127 DEBUG [dr2/0] start_time: 14:00:00-05:00
2024-04-09 16:05:42,127 DEBUG [dr2/0] namespace: '-'
2024-04-09 16:05:42,127 DEBUG [dr2/0] pool: replicapool
Configuring rbd mirroring on both clusters:
2024-04-09 16:06:50,002 DEBUG [rdr/0] Getting mirroring info site name for cluster 'dr1'
2024-04-09 16:06:50,084 DEBUG [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.08 seconds
2024-04-09 16:06:50,084 DEBUG [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr1'
2024-04-09 16:06:50,183 DEBUG [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr1'
2024-04-09 16:06:50,253 DEBUG [rdr/0] Getting mirroring info site name for cluster 'dr2'
2024-04-09 16:06:50,369 DEBUG [rdr/0] 'cephblockpools.ceph.rook.io/replicapool' output='jsonpath={.status.mirroringInfo.site_name}' found in 0.12 seconds
2024-04-09 16:06:50,369 DEBUG [rdr/0] Getting rbd mirror boostrap peer secret name for cluster 'dr2'
2024-04-09 16:06:50,454 DEBUG [rdr/0] Getting secret pool-peer-token-replicapool token for cluster 'dr2'
2024-04-09 16:06:50,547 DEBUG [rdr/0] Setting up mirroring from 'dr2' to 'dr1'
2024-04-09 16:06:50,547 DEBUG [rdr/0] Applying rbd mirror secret in cluster 'dr1'
2024-04-09 16:06:50,708 DEBUG [rdr/0] secret/2aeb5e7d-352f-4227-9c99-6d2d422751f9 created
2024-04-09 16:06:50,711 DEBUG [rdr/0] Configure peers for cluster 'dr1'
2024-04-09 16:06:50,842 DEBUG [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:50,847 DEBUG [rdr/0] Apply rbd mirror to cluster 'dr1'
2024-04-09 16:06:50,984 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created
2024-04-09 16:06:50,990 DEBUG [rdr/0] Setting up mirroring from 'dr1' to 'dr2'
2024-04-09 16:06:50,990 DEBUG [rdr/0] Applying rbd mirror secret in cluster 'dr2'
2024-04-09 16:06:51,146 DEBUG [rdr/0] secret/c5123696-8638-4c9e-aaf4-2a59333bf389 created
2024-04-09 16:06:51,152 DEBUG [rdr/0] Configure peers for cluster 'dr2'
2024-04-09 16:06:51,241 DEBUG [rdr/0] cephblockpool.ceph.rook.io/replicapool patched
2024-04-09 16:06:51,244 DEBUG [rdr/0] Apply rbd mirror to cluster 'dr2'
2024-04-09 16:06:51,399 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror created
Waiting for cephrbdmirror for cluster dr1 succeeded in 5 seconds:
2024-04-09 16:06:51,406 DEBUG [rdr/0] Waiting until rbd mirror is ready in cluster 'dr1'
2024-04-09 16:06:55,809 DEBUG [rdr/0] 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' found in 4.40 seconds
2024-04-09 16:06:55,986 DEBUG [rdr/0] cephrbdmirror.ceph.rook.io/my-rbd-mirror condition met
2024-04-09 16:06:56,057 DEBUG [rdr/0] Cluster '{cluster}' rbd mirror status:
2024-04-09 16:06:56,057 DEBUG [rdr/0] observedGeneration: 1
2024-04-09 16:06:56,057 DEBUG [rdr/0] phase: Ready
2024-04-09 16:06:56,057 DEBUG [rdr/0]
Waiting for the cephrbdmirror .status.phase for cluster dr2 times out after 300 seconds;
this resource is never reconciled:
2024-04-09 16:06:56,057 DEBUG [rdr/0] Waiting until rbd mirror is ready in cluster 'dr2'
2024-04-09 16:11:56,172 ERROR Command failed
Traceback (most recent call last):
...
drenv.commands.Error: Command failed:
command: ('addons/rbd-mirror/start', 'dr1', 'dr2')
exitcode: 1
error:
Traceback (most recent call last):
...
RuntimeError: Timeout waiting for 'cephrbdmirror/my-rbd-mirror' output='jsonpath={.status.phase}' namespace='rook-ceph' profile='dr2' timeout=300
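For reference, a hedged sketch of the kind of polling loop that times out here (this is not the actual drenv code; the resource, namespace, and context names are taken from the log above):

    import subprocess
    import time

    def wait_for_phase(resource, namespace, context, want="Ready", timeout=300):
        # Poll `kubectl get -o jsonpath={.status.phase}` until the resource
        # reports the wanted phase or the timeout expires.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            phase = subprocess.run(
                ["kubectl", "get", resource,
                 "--namespace", namespace,
                 "--context", context,
                 "--output", "jsonpath={.status.phase}"],
                capture_output=True, text=True,
            ).stdout.strip()
            if phase == want:
                return
            time.sleep(5)
        raise RuntimeError(
            f"Timeout waiting for {resource} phase={want} in context {context}")

    # The step that timed out on cluster dr2: the cephrbdmirror never
    # reports any .status.phase, so the loop exhausts its 300 seconds.
    wait_for_phase("cephrbdmirror/my-rbd-mirror", "rook-ceph", "dr2")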
Updated by Patrick Donnelly almost 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56997
Updated by Patrick Donnelly almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot almost 2 years ago
- Copied to Backport #65730: squid: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Updated by Upkeep Bot almost 2 years ago
- Copied to Backport #65731: reef: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Updated by Patrick Donnelly almost 2 years ago
- Backport changed from squid,reef to squid,reef,quincy
Updated by Upkeep Bot almost 2 years ago
- Copied to Backport #65736: quincy: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" added
Updated by Nir Soffer almost 2 years ago
Do we have an image I can test, or maybe use until the fix is released?
Updated by Nir Soffer over 1 year ago
Nir Soffer wrote in #note-16:
Do we have an image I can test, or maybe use until the fix is released?
Can we get a fix for the current ceph version (v19)?
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Patrick Donnelly over 1 year ago
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 07afb4ae0916fddb8bb920b5663e374dc650df6e
- Fixed In set to v19.3.0-1888-g07afb4ae091
- Upkeep Timestamp set to 2025-07-09T13:46:37+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-1888-g07afb4ae091 to v19.3.0-1888-g07afb4ae09
- Upkeep Timestamp changed from 2025-07-09T13:46:37+00:00 to 2025-07-14T17:40:35+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~3003
- Upkeep Timestamp changed from 2025-07-14T17:40:35+00:00 to 2025-11-01T00:57:47+00:00
Updated by Upkeep Bot about 2 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp changed from 2025-11-01T00:57:47+00:00 to 2026-02-03T01:13:22+00:00