Bug #55606
Status: closed
[ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown
Description
/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955
2022-04-29T19:34:39.756 INFO:teuthology.orchestra.run.smithi169.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2022-04-29T19:34:39.767 DEBUG:teuthology.orchestra.run.smithi155:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.787 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr:2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
2022-04-29T19:34:39.788 INFO:tasks.ceph.mgr.y.smithi155.stderr: self.db.execute(SQL, (key, value))
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:sqlite3.InternalError: unknown operation
2022-04-29T19:34:39.789 INFO:tasks.ceph.mgr.y.smithi155.stderr:
2022-04-29T19:34:39.834 INFO:tasks.ceph.osd.3.smithi155.stderr:2022-04-29T19:34:39.833+0000 7f38469fe700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3 (PID: 91832) UID: 0
Updated by Laura Flores almost 4 years ago
- Project changed from RADOS to cephsqlite
Updated by Yaarit Hatuka almost 4 years ago
- Related to Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Updated by Laura Flores almost 4 years ago
Nitzan Mordechai wrote:
/home/teuthworker/archive/yuriw-2022-04-29_15:44:49-rados-wip-yuri5-testing-2022-04-28-1007-distro-default-smithi/6813955
[...]
mgr log at the time of the crash:
2022-04-29T19:34:39.785+0000 7f5664339700 10 MonCommandCompletion::finish()
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr Gil Switched to new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 20 mgr ~Gil Destroying new thread state 0x559d5f25c000
2022-04-29T19:34:39.785+0000 7f5664339700 10 mgr notify_all notify_all: notify_all command
2022-04-29T19:34:39.785+0000 7f5664339700 15 mgr notify_all queuing notify (command) to restful
2022-04-29T19:34:39.785+0000 7f5604101700 0 [devicehealth DEBUG root] skipping duplicate INTEL_SSDPEDMD400G4_CVFT623300GW400BGN
2022-04-29T19:34:39.785+0000 7f5604101700 10 ceph_option_get device_failure_prediction_mode found: none
2022-04-29T19:34:39.785+0000 7f5604101700 0 [devicehealth DEBUG root] set_kv('last_scrape', '20220429-193321')
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr Gil Switched to new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f56297f6700 20 mgr ~Gil Destroying new thread state 0x559d5ea22400
2022-04-29T19:34:39.785+0000 7f5604101700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.y: unknown operation
2022-04-29T19:34:39.785+0000 7f5604101700 -1 devicehealth.serve:
2022-04-29T19:34:39.785+0000 7f5604101700 -1 Traceback (most recent call last):
File "/usr/share/ceph/mgr/devicehealth/module.py", line 376, in serve
self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT))
File "/usr/share/ceph/mgr/mgr_module.py", line 1117, in set_kv
self.db.execute(SQL, (key, value))
sqlite3.InternalError: unknown operation
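For context on the failing call: the logged value `set_kv('last_scrape', '20220429-193321')` implies a timestamp format equivalent to the sketch below. The format string is inferred from the logged value, not quoted from the module's actual `TIME_FORMAT` constant.

```python
from datetime import datetime

# Format inferred from the logged value '20220429-193321'
TIME_FORMAT = "%Y%m%d-%H%M%S"

last_scrape = datetime(2022, 4, 29, 19, 33, 21)
print(last_scrape.strftime(TIME_FORMAT))  # → 20220429-193321
```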
Updated by Patrick Donnelly almost 4 years ago
Problem appears to be that libcephsqlite lost its RADOS lock:
2022-04-29T19:34:39.315+0000 7f5669343700 1 -- 172.21.15.155:0/832204322 <== osd.7 v1:172.21.15.169:6804/91796 154 ==== osd_op_reply(146 main.db.0000000000000000 [call] v59'104 uv103 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (unknown 3461770130 0 0) 0x559d5fa1d8c0 con 0x559d5f026000
2022-04-29T19:34:39.315+0000 7f55ef0d7700 -1 client.4184: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
At the time, the mgr was reporting that some 10 PGs were inactive.
I will think about how to make this mgr module more resilient to this type of situation.
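One resiliency approach would be to treat the SQLite error as a signal to reopen the database and retry, rather than letting the exception kill the serve loop. The sketch below is hypothetical (the class and method names are mine, and it is not the actual devicehealth fix), demonstrated with a plain in-memory SQLite database rather than libcephsqlite:

```python
import sqlite3


class ResilientKV:
    """Hedged sketch: reopen the database and retry once when SQLite
    reports an internal/operational error (as it does when the
    libcephsqlite VFS loses its RADOS lock). Hypothetical names; not
    the actual devicehealth fix."""

    def __init__(self, opener):
        self._opener = opener  # callable returning a fresh sqlite3.Connection
        self._db = self._open()

    def _open(self):
        db = self._opener()
        db.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
        return db

    def set_kv(self, key, value):
        try:
            self._db.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        except (sqlite3.InternalError, sqlite3.OperationalError):
            # e.g. "unknown operation" after a lost lock: reopen, retry once
            self._db = self._open()
            self._db.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))

    def get_kv(self, key):
        row = self._db.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None


kv = ResilientKV(lambda: sqlite3.connect(":memory:"))
kv.set_kv("last_scrape", "20220429-193321")
print(kv.get_kv("last_scrape"))  # → 20220429-193321
```

A retry only helps if reopening actually re-acquires the lock; if the lock holder is gone for good, the module still has to surface the failure.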
Updated by Yaarit Hatuka almost 4 years ago
Hi Patrick,
Can you please explain why an unknown state of PGs results in this behavior?
The commonality we see among these issues so far is that they happen after an upgrade (on gibba and the LRC) or after daemon restarts.
Updated by Laura Flores over 3 years ago
/a/yuriw-2022-12-07_15:47:33-rados-wip-yuri-testing-2022-12-06-1204-distro-default-smithi/7106555
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error added
Updated by Patrick Donnelly about 3 years ago
- Related to deleted (Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error)
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56844: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #58652: Module 'devicehealth' has failed: disk I/O error added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56266: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56205: crash: File "mgr/devicehealth/module.py", in serve: self.scrape_all() added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56279: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56287: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56291: crash: File "mgr/devicehealth/module.py", in get_recent_device_metrics: return self._get_device_metrics(devid, min_sample=min_sample) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56297: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56312: crash: File "mgr/devicehealth/module.py", in serve: ls = self.get_kv('last_scrape') added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56321: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56322: crash: File "mgr/devicehealth/module.py", in serve: if self.db_ready() and self.enable_monitoring: added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56858: crash: File "mgr/devicehealth/module.py", in serve: self.set_kv('last_scrape', last_scrape.strftime(TIME_FORMAT)) added
Updated by Patrick Donnelly about 3 years ago
- Has duplicate Bug #56898: crash: File "mgr/devicehealth/module.py", in show_device_metrics: res = self._get_device_metrics(devid, sample=sample) added
Updated by Patrick Donnelly about 3 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v18.0.0
- Source set to Community (user)
- Backport set to quincy
Updated by Satoru Takeuchi about 3 years ago
I hit quite a similar problem; my cluster became unhealthy as follows.
```
$ kubectl exec -n ${NS} deployment/rook-ceph-tools -- ceph status
  cluster:
    id:     b52d5f3d-ba14-442e-a089-0bca47b83758
    health: HEALTH_ERR
            441 large omap objects
            Module 'devicehealth' has failed: unknown operation
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 1282/321355596 objects degraded (0.000%), 88 pgs degraded, 106 pgs undersized
            62 pgs not deep-scrubbed in time
            62 pgs not scrubbed in time
```
Before this problem occurred, the following message was shown in mgr.a's log.
```
2023-03-16 09:08:24 debug 2023-03-16T00:08:24.478+0000 7feacbcf5700 -1 client.107899129: SimpleRADOSStriper: lock_keeper_main: main.db: lock renewal failed: (2) No such file or directory
```
At that time, a mgr failover did not happen. In addition, the problem was resolved after restarting mgr.a.
I have two questions about this:
a. Can my problem be considered an instance of this issue's bug?
b. If a is true, is there any configuration to work around this problem before 50291 is merged?
software environment:
- ceph: v17.2.5
- rook: v1.10.7
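Since a mgr restart clears the condition, one external stopgap is to watch for the failed-module health check and then fail the active mgr (`ceph mgr fail`). A minimal detection sketch against canned JSON, assuming the `MGR_MODULE_ERROR` health check code that recent Ceph releases use for "Module ... has failed" (verify the code against your version):

```python
import json


def mgr_module_failed(health_json: str) -> bool:
    """Return True if `ceph health detail --format json` output reports a
    failed mgr module. MGR_MODULE_ERROR is assumed to be the health
    check code; confirm it for your Ceph release."""
    checks = json.loads(health_json).get("checks", {})
    return "MGR_MODULE_ERROR" in checks


# Canned sample resembling the HEALTH_ERR output quoted in this tracker:
sample = json.dumps({
    "status": "HEALTH_ERR",
    "checks": {"MGR_MODULE_ERROR": {"severity": "HEALTH_ERR"}},
})
print(mgr_module_failed(sample))  # → True
```

A cron job or operator could combine this check with `ceph mgr fail` to automate the manual restart workaround until the fix lands.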
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v1) updated (diff)
- Crash signature (v2) updated (diff)
- Affected Versions v17.0.0, v17.2.0, v17.2.3, v17.2.4, v17.2.5 added
Sanitized backtrace:
File "mgr/devicehealth/module.py", in serve: self.scrape_all()
File "mgr/devicehealth/module.py", in scrape_all: self.put_device_metrics(device, data)
File "mgr/devicehealth/module.py", in put_device_metrics: self._create_device(devid)
File "mgr/devicehealth/module.py", in _create_device: cursor = self.db.execute(SQL, (devid,))
Crash dump sample:
{
"backtrace": [
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)",
" File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))",
"<redacted>"
],
"ceph_version": "17.2.3",
"crash_id": "2022-12-23T00:08:19.380001Z_532f16aa-9667-4774-9740-7eb6486407c1",
"entity_name": "mgr.380e06557672265883c1723194701ca26b08aabe",
"mgr_module": "devicehealth",
"mgr_module_caller": "PyModuleRunner::serve",
"mgr_python_exception": "InternalError",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "92fb822a43775ec1de7d41c75c8d0ec0bbb72ba5429a46000f34101c1bc6524e",
"timestamp": "2022-12-23T00:08:19.380001Z",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-87-generic",
"utsname_sysname": "Linux",
"utsname_version": "#98~18.04.1-Ubuntu SMP Wed Sep 22 10:45:04 UTC 2021"
}
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Telemetry Bot almost 3 years ago
Updated by Telemetry Bot almost 3 years ago
- Crash signature (v2) updated (diff)
Updated by Patrick Donnelly almost 3 years ago
Updated by Patrick Donnelly almost 3 years ago
- Has duplicate Bug #58351: Module 'devicehealth' has failed: unknown operation added
Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #62022: reef: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #62023: quincy: [ERR] Unhandled exception from module ''devicehealth'' while running on mgr.y: unknown added
Updated by Patrick Donnelly about 2 years ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 8 months ago
- Merge Commit set to deae3a6f18ccbc1950b62f4a70f2ec5e5ddefa0b
- Fixed In set to v18.0.0-4968-gdeae3a6f18c
- Released In set to v19.2.0~2001
- Upkeep Timestamp set to 2025-07-13T20:09:17+00:00