Bug #55606
Status: closed
[ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown
Description
/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955
2022-04-29T19:34:39.756 INFO:teuthology.orchestra.run.smithi169.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2022-04-29T19:34:39.767 DEBUG:teuthology.orchestra.run.smithi155:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: self.db.execute(SQL, (key, value))
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:sqlite3.InternalError: unknown operation
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:
2022-04-29T19:34:39.834 INFO:tasks.ceph.osd.3.smithi155.stderr:2022-04-29T19:34:39.833+0000 7f38469fe700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3 (PID: 91832) UID: 0
Updated by Laura Flores almost 4 years ago
- Project changed from RADOS to cephsqlite
Updated by Yaarit Hatuka almost 4 years ago
- Related to Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Updated by Laura Flores almost 4 years ago
Nitzan Mordechai wrote:
/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955
[...]
mgr log at the time of the crash:
2022-04-29T19:34:39.785+0000 7f5664339700 10 MonCommandCompletion::finish()
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr Gil Switched to new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr ~Gil Destroying new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 10 mgr notify_all notify_all: notify_all command
2022-04-29T19:34:39.785+0000 7f5664339700 15 mgr notify_all queuing notify (command) to restful
2022-04-29T19:34:39.785+0000 7f5604101700 0 [devicehealth DEBUG root] skipping duplicate INTEL_SSDPEDMD400G4_CVFT623300GW400BGN
2022-04-29T19:34:39.785+0000 7f5604101700 10 ceph_option_get device_failure_prediction_mode found: none
2022-04-29T19:34:39.785+0000 7f5604101700 0 [devicehealth DEBUG root] set_kv('last_scrape', '20220429-193321')
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr Gil Switched to new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr ~Gil Destroying new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
self.db.execute(SQL, (key, value))
sqlite3.InternalError: unknown operation
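For context on the failing call: the logged value `set_kv('last_scrape', '20220429-193321')` implies a timestamp format equivalent to the sketch below. The format string is inferred from the logged value, not quoted from the module's actual `TIME_FORMAT` constant.

```python
from datetime import datetime

# Format inferred from the logged value '20220429-193321'
TIME_FORMAT = "%Y%m%d-%H%M%S"

last_scrape = datetime(2022, 4, 29, 19, 33, 21)
print(last_scrape.strftime(TIME_FORMAT))  # → 20220429-193321
```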
Updated by Patrick Donnelly almost 4 years ago
Problem appears to be that libcephsqlite lost its RADOS lock:
2022-04-29T19:34:39.315+0000 7f5669343700 1 -- 172.21.15.155:0/832204322 <== osd.7 v1:172.21.15.169:6804/91796 154 ==== osd_op_reply(146 main.db.0000000000000000 [call] v59'104 uv103 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (unknown 3461770130 0 0) 0x559d5fa1d8c0 con 0x559d5f026000
2022-04-29T19:34:39.315+0000 7f55ef0d7700 -1 client.4184: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
At the time, the mgr was reporting that some 10 PGs were inactive.
I will think about how to make this mgr module more resilient to this type of situation.
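One resiliency approach would be to treat the SQLite error as a signal to reopen the database and retry, rather than letting the exception kill the serve loop. The sketch below is hypothetical (the class and method names are mine, and it is not the actual devicehealth fix), demonstrated with a plain in-memory SQLite database rather than libcephsqlite:

```python
import sqlite3


class ResilientKV:
    """Hedged sketch: reopen the database and retry once when SQLite
    reports an internal/operational error (as it does when the
    libcephsqlite VFS loses its RADOS lock). Hypothetical names; not
    the actual devicehealth fix."""

    def __init__(self, opener):
        self._opener = opener  # callable returning a fresh sqlite3.Connection
        self._db = self._open()

    def _open(self):
        db = self._opener()
        db.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
        return db

    def set_kv(self, key, value):
        try:
            self._db.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        except (sqlite3.InternalError, sqlite3.OperationalError):
            # e.g. "unknown operation" after a lost lock: reopen, retry once
            self._db = self._open()
            self._db.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))

    def get_kv(self, key):
        row = self._db.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None


kv = ResilientKV(lambda: sqlite3.connect(":memory:"))
kv.set_kv("last_scrape", "20220429-193321")
print(kv.get_kv("last_scrape"))  # → 20220429-193321
```

A retry only helps if reopening actually re-acquires the lock; if the lock holder is gone for good, the module still has to surface the failure.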
Updated by Yaarit Hatuka almost 4 years ago
Hi Patrick,
Can you please explain why an unknown state of PGs results in this behavior?
The commonality we see among these issues so far is that they happen after an upgrade (on gibba and the LRC) or after daemon restarts.
Updated by Laura Flores over 3 years ago
/a/yuriw-2022-12-07_15:47:33-rados-wip-yuri-testing-2022-12-06-1204-distro-default-smithi/7106555
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Updated by Patrick Donnelly about 3 years ago
- Related to deleted (Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error)
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #58652: Module 'devicehealth' has failed: disk I/O error added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) added
Updated by Patrick Donnelly about 3 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v18.0.0
- Source set to Community (user)
- Backport set to quincy
Updated by Satoru Takeuchi about 3 years ago
I hit quite a similar problem; my cluster became unhealthy as follows.
```
$ kubectl exec -n ${NS} deployment/rook-ceph-tools -- ceph status
  cluster:
    id:     b52d5f3d-ba14-442e-a089-0bca47b83758
    health: HEALTH_ERR
            441 large omap objects
            Module 'devicehealth' has failed: unknown operation
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 1282/321355596 objects degraded (0.000%), 88 pgs degraded, 106 pgs undersized
            62 pgs not deep-scrubbed in time
            62 pgs not scrubbed in time
```
Before this problem occurred, the following message was shown in mgr.a's log.
```
2023-03-16 09:08:24 debug 2023-03-16T00:08:24.478+0000 7feacbcf5700 -1 client.107899129: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
```
At that time, a mgr failover did not happen. In addition, the problem was resolved after restarting mgr.a.
I have two questions about this:
a. Can my problem be considered an instance of this issue's bug?
b. If a is true, is there any configuration to work around this problem before 50291 is merged?
software environment:
- ceph: v17.2.5
- rook: v1.10.7
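Since a mgr restart clears the condition, one external stopgap is to watch for the failed-module health check and then fail the active mgr (`ceph mgr fail`). A minimal detection sketch against canned JSON, assuming the `MGR_MODULE_ERROR` health check code that recent Ceph releases use for "Module ... has failed" (verify the code against your version):

```python
import json


def mgr_module_failed(health_json: str) -> bool:
    """Return True if `ceph health detail --format json` output reports a
    failed mgr module. MGR_MODULE_ERROR is assumed to be the health
    check code; confirm it for your Ceph release."""
    checks = json.loads(health_json).get("checks", {})
    return "MGR_MODULE_ERROR" in checks


# Canned sample resembling the HEALTH_ERR output quoted in this tracker:
sample = json.dumps({
    "status": "HEALTH_ERR",
    "checks": {"MGR_MODULE_ERROR": {"severity": "HEALTH_ERR"}},
})
print(mgr_module_failed(sample))  # → True
```

A cron job or operator could combine this check with `ceph mgr fail` to automate the manual restart workaround until the fix lands.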
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v1) updated (diff)
- Crash signature (v2) updated (diff)
- Affected Versions v17.0.0, v17.2.0, v17.2.3, v17.2.4, v17.2.5 added
Sanitized backtrace:
File "mgr/devicehealth/module.py", in serve: self.scrape_all()
File "mgr/devicehealth/module.py", in scrape_all: self.put_device_metrics(device, data)
File "mgr/devicehealth/module.py", in put_device_metrics: self._create_device(devid)
File "mgr/devicehealth/module.py", in _create_device: cursor = self.db.execute(SQL, (devid,))
Crash dump sample:
{
"backtrace": [
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))",
"<redacted>"
],
"ceph_version": "17.2.3",
"crash_id": "2022-12-23T00:08:19.380001Z_532f16aa-9667-4774-9740-7eb6486407c1",
"entity_name": "mgr.380e06557672265883c1723194701ca26b08aabe",
"mgr_module": "devicehealth",
"mgr_module_caller": "PyModuleRunner::serve",
"mgr_python_exception": "InternalError",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "92fb822a43775ec1de7d41c75c8d0ec0bbb72ba5429a46000f34101c1bc6524e",
"timestamp": "2022-12-23T00:08:19.380001Z",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-87-generic",
"utsname_sysname": "Linux",
"utsname_version": "#98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021"
}
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Patrick Donnelly almost 3 years ago
Updated by Patrick Donnelly almost 3 years ago
- Has duplicate Bug #58351: Module 'devicehealth' has failed: unknown operation added
Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #62022: reef: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #62023: quincy: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Updated by Patrick Donnelly about 2 years ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 8 months ago
- Merge Commit set to deae3a6f18ccbc1950b62f4a70f2ec5e5ddefa0b
- Fixed In set to v18.0.0-4968-gdeae3a6f18c
- Released In set to v19.2.0~2001
- Upkeep Timestamp set to 2025-07-13T20:09:17+00:00