Bug #74605
Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)
Description
/a/nmordech-2026-01-25_11:10:14-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/17021
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:======================================================================
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 103, in test_sql_autocommit1
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: return self._test_sql_autocommit(1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 92, in _test_sql_autocommit
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: self._wait_for_killpoint_death()
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 58, in _wait_for_killpoint_death
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner: self.wait_until_true(killpoint_dead, timeout=30)
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/ceph_test_case.py", line 339, in wait_until_true
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner: raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:Ran 2 tests in 93.894s
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
This was found in a Rocky10 run. I'm not certain yet whether it is exclusive to the Rocky10 work; for now, it has only been seen on this run.
Updated by Laura Flores about 2 months ago
- Related to QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 added
Updated by Patrick Donnelly about 2 months ago
Odd debugging output:
2026-01-25T11:45:32.387+0000 7f42ad7f46c0 10 mgr notify_all notify_all: notify_all command
2026-01-25T11:45:32.388+0000 7f41d5c146c0 0 [volumes DEBUG root] mon_command: 'osd pool application enable' -> 0 in 1.002s
2026-01-25T11:45:32.388+0000 7f41d5c146c0 0 [volumes DEBUG root] using uri file:///.mgr:devicehealth/main.db?vfs=ceph
The log shows this is for the "volumes" plugin but it should be "devicehealth".
Continuing investigation...
Updated by Patrick Donnelly about 2 months ago
Evidence it looked up the killpoint:
2026-01-25T11:45:33.054+0000 7f41d5c146c0 20 mgr get_config key: mgr/devicehealth/sqlite3_killpoint
2026-01-25T11:45:33.054+0000 7f41d5c146c0 10 mgr get_typed_config get_typed_config sqlite3_killpoint found
and note that it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.
Updated by Laura Flores about 2 months ago
Patrick Donnelly wrote in #note-6:
Evidence it looked up the killpoint:
[...]
and note it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.
The use of "volumes" instead of "devicehealth" looks similar to what's going on here: https://tracker.ceph.com/issues/74543#note-19
Updated by Laura Flores about 2 months ago
- Related to Bug #74543: Rocky10 - AttributeError in dashboard module added
Updated by Laura Flores about 2 months ago
@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.
Updated by Patrick Donnelly about 2 months ago
Laura Flores wrote in #note-9:
@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.
It may be related.
We also have evidence the mgr observed the correct value:
2026-01-25T11:45:28.374+0000 7f42ae7f66c0 20 mgr ms_dispatch2 set config/mgr/mgr/devicehealth/sqlite3_killpoint = 1
but still the code did not call _exit at that killpoint. Since this code has not really changed and I don't immediately see why or how it's failing, I suggest cataloguing this with the other py3.12 issues; hopefully it will be resolved or explained as the more pressing problems, like #74543, are fixed.
Updated by Nitzan Mordechai about 1 month ago
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']
Updated by Nitzan Mordechai about 1 month ago
@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']
it looks like we are waiting on _wait_for_killpoint_death and timeout:
2026-02-08T02:33:37.146 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:37.248 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:37.349 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:37.450 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:42.450 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:42.551 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:42.652 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:42.753 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:46.476 DEBUG:teuthology.orchestra.run.trial089:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:46.479 DEBUG:teuthology.orchestra.run.trial121:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:47.754 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:47.855 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:47.955 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:48.056 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:53.057 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:53.158 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:53.259 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:53.359 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:58.360 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:58.461 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:58.562 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:58.663 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:34:03.664 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:34:03.764 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:34:03.865 INFO:tasks.mgr.test_devicehealth:z
The retry count doesn't increase? And it looks like we are in the middle of a restart.
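For context, here is a minimal sketch of a wait_until_true-style polling loop; this is NOT the actual code in qa/tasks/ceph_test_case.py, and the names and default period are assumptions. It illustrates why "retry_count=0" can appear in every waiting message: if nothing ever extends the deadline, the counter is never incremented, and the loop eventually raises with "0 retries".

```python
# Hedged sketch of a wait_until_true-style polling loop; illustrative only,
# not the real teuthology implementation.
import time


class TestTimeoutError(RuntimeError):
    pass


def wait_until_true(condition, timeout, period=5):
    elapsed = 0
    retry_count = 0  # nothing in this sketch ever increments it
    while not condition():
        if elapsed >= timeout:
            raise TestTimeoutError(
                "Timed out after {0}s and {1} retries".format(elapsed, retry_count))
        print("wait_until_true: waiting (timeout={0} retry_count={1})...".format(
            timeout, retry_count))
        time.sleep(period)
        elapsed += period
```

In this sketch the condition (here, "the mgr died at the killpoint") simply never becomes true, so the loop logs the same retry_count=0 line until the timeout fires.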
Updated by Patrick Donnelly about 1 month ago
Nitzan Mordechai wrote in #note-12:
@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']
it looks like we are waiting on _wait_for_killpoint_death and timeout:
[...]
The retry count doesn't increase? And it looks like we are in the middle of a restart.
The issue is that the mgr never exits due to a call to _exit, as we would expect. I talked about that in https://tracker.ceph.com/issues/74605#note-10
To be clear, I do not think this has anything to do with libcephsqlite. This is just the only killpoint test in the mgr which uses _exit to immediately terminate the mgr.
Updated by Nitzan Mordechai about 1 month ago
/a/yaarit-2026-02-10_02:32:59-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/42034
Updated by Nitzan Mordechai about 1 month ago
/a/yaarit-2026-02-10_23:45:18-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial
['44147', '44177', '44157', '44167']
Updated by Nitzan Mordechai about 1 month ago
/a/nmordech-2026-02-17_05:44:40-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-13-1771001916-distro-default-trial/51654
Updated by Nitzan Mordechai 27 days ago
/a/nmordech-2026-02-21_21:31:01-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-21-1771692874-tentacle-distro-default-trial/62998
Updated by Nitzan Mordechai 24 days ago
/a/nmordech-2026-02-25_11:34:16-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/69835
Updated by Laura Flores 23 days ago
/a/nmordech-2026-02-25_11:36:23-rados-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/70008
Updated by Laura Flores 23 days ago
- Related to QA Run #75162: wip-rocky10-branch-of-the-day-2026-02-24-1771941190 added
Updated by Nitzan Mordechai 19 days ago
/a/yaarit-2026-02-26_20:20:34-rados-wip-rocky10-branch-of-the-day-2026-02-26-1772108951-distro-default-trial/72900
Updated by Nitzan Mordechai 18 days ago
@Patrick Donnelly, the logging issue was fixed, but this failure keeps showing up.
I think there are differences in Python 3.12's sqlite3 autocommit handling that are causing this issue. I found this PR: https://github.com/ceph/ceph/pull/61763
and the Python 3.12 docs: https://docs.python.org/3.12/library/sqlite3.html#transaction-control
Since we are using isolation_level=None, it seems that every statement is now autocommitted, and the test will fail.
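Nitzan's observation about isolation_level=None can be reproduced with plain sqlite3 outside Ceph. A minimal sketch (the shared-cache in-memory URI and table name are made up for illustration): with isolation_level=None the connection runs in autocommit mode, so each statement is committed as it executes, and a second connection sees the write immediately without any explicit commit().

```python
# Hedged sketch: demonstrates sqlite3 autocommit behavior with
# isolation_level=None. The URI and schema are illustrative, not Ceph's.
import sqlite3

# Shared-cache in-memory database so two connections see the same data.
db = "file:demo?mode=memory&cache=shared"

# isolation_level=None: no implicit BEGIN is issued, so every statement
# is committed immediately (autocommit mode).
writer = sqlite3.connect(db, uri=True, isolation_level=None)
reader = sqlite3.connect(db, uri=True)

writer.execute("CREATE TABLE t (v INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")  # committed as it executes

# The reader sees the row even though writer never called commit().
print(reader.execute("SELECT v FROM t").fetchall())  # [(1,)]
```

This matters for the killpoint test because any behavior that depends on an uncommitted transaction being rolled back on crash would no longer hold if statements are autocommitted.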
Updated by Patrick Donnelly 18 days ago
The bug is caused by:
commit 78983ad0d0cce422da32dc4876ac186f6d32c3f5
Author: Adam King <adking@redhat.com>
Date: Wed Jun 16 08:02:15 2021 -0400
mgr/cephadm: cephadm agent 2.0
Creates an http endpoint in mgr/cephadm to receive
http requests and an agent that can be deployed on
each host that will gather metadata on the host and
send it to the mgr/cephadm http endpoint. Should save the
cephadm mgr module a lot of time it would have to spend
repeatedly ssh-ing into each host to gather the metadata
and help performance on larger clusters.
Fixes: https://tracker.ceph.com/issues/51004
Signed-off-by: Adam King <adking@redhat.com>
diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 7b5676ac5d5..8a5ff3de81d 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -28,6 +28,8 @@ from ceph.deployment.service_spec import \
from ceph.utils import str_to_datetime, datetime_to_str, datetime_now
from cephadm.serve import CephadmServe
from cephadm.services.cephadmservice import CephadmDaemonDeploySpec
+from cephadm.agent import CherryPyThread, CephadmAgentHelpers
+
from mgr_module import MgrModule, HandleCommandResult, Option
from mgr_util import create_self_signed_cert
@@ -45,7 +47,7 @@ from . import utils
from . import ssh
from .migrations import Migrations
from .services.cephadmservice import MonService, MgrService, MdsService, RgwService, \
- RbdMirrorService, CrashService, CephadmService, CephfsMirrorService
+ RbdMirrorService, CrashService, CephadmService, CephfsMirrorService, CephadmAgent
from .services.ingress import IngressService
from .services.container import CustomContainerService
from .services.iscsi import IscsiService
@@ -79,6 +81,16 @@ Host *
ConnectTimeout=30
"""
+# cherrypy likes to sys.exit on error. don't let it take us down too!
+
+
+def os_exit_noop(status: int) -> None:
+ pass
+
+
+os._exit = os_exit_noop # type: ignore
We are using os._exit for the sqlite3 killpoints. Either this monkeypatch needs to be reverted, or we need an internal function to call _exit.
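The effect of that monkeypatch can be demonstrated in isolation. A minimal sketch (the save/restore step is an illustrative workaround, not necessarily what the merged fix does): rebinding os._exit is process-wide, so any later caller in the same interpreter, such as a sqlite3 killpoint in another mgr module, silently becomes a no-op.

```python
# Hedged sketch: shows that rebinding os._exit affects every caller in the
# process, which is why the devicehealth killpoint stopped terminating the
# mgr after mgr/cephadm installed its no-op at import time.
import os

_real_exit = os._exit  # keep a reference before patching


def os_exit_noop(status: int) -> None:
    pass


os._exit = os_exit_noop  # process-wide rebinding, as in mgr/cephadm/module.py

os._exit(120)            # a killpoint elsewhere now silently does nothing
print("still alive")     # reached only because _exit was neutered

os._exit = _real_exit    # restore; illustrative workaround, not the merged fix
```

An internal wrapper that captures the real _exit before any module can patch it (as suggested above) would make the killpoints immune to this kind of import-time rebinding.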
Updated by Patrick Donnelly 18 days ago
- Status changed from New to Fix Under Review
- Target version set to v21.0.0
- Source set to Q/A
- Pull request ID set to 67632
Updated by Upkeep Bot 3 days ago
- Status changed from Fix Under Review to Pending Backport
- Merge Commit set to 2de737fa184892e6c38f4da5744e04c2771b9a7c
- Fixed In set to v20.3.0-6220-g2de737fa18
- Upkeep Timestamp set to 2026-03-18T18:15:06+00:00
Updated by Upkeep Bot 3 days ago
- Copied to Backport #75604: tentacle: Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1) added