Bug #74605

open

Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)

Added by Laura Flores about 2 months ago. Updated 3 days ago.

Status:
Pending Backport
Priority:
Normal
Category:
devicehealth module
Target version:
% Done:

0%

Source:
Q/A
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-6220-g2de737fa18
Released In:
Upkeep Timestamp:
2026-03-18T18:15:06+00:00
Tags:

Description

/a/nmordech-2026-01-25_11:10:14-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/17021

2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:======================================================================
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 103, in test_sql_autocommit1
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    return self._test_sql_autocommit(1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 92, in _test_sql_autocommit
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    self._wait_for_killpoint_death()
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 58, in _wait_for_killpoint_death
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    self.wait_until_true(killpoint_dead, timeout=30)
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/ceph_test_case.py", line 339, in wait_until_true
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:Ran 2 tests in 93.894s
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:FAILED (errors=1)

This was found in a Rocky10 run. I'm not certain yet if it is exclusive to the Rocky10 work, but for now, this has only been seen on this run.


Related issues (4): 3 open, 1 closed

Related to Ceph QA - QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 (QA Needs Approval, Laura Flores)
Related to mgr - Bug #74543: Rocky10 - AttributeError in dashboard module (Duplicate, Samuel Just)
Related to Ceph QA - QA Run #75162: wip-rocky10-branch-of-the-day-2026-02-24-1771941190 (QA Testing)
Copied to mgr - Backport #75604: tentacle: Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1) (New, Patrick Donnelly)
Actions #1

Updated by Laura Flores about 2 months ago

  • Assignee deleted (Yaarit Hatuka)
Actions #2

Updated by Laura Flores about 2 months ago

  • Related to QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 added
Actions #4

Updated by Patrick Donnelly about 2 months ago

  • Assignee set to Patrick Donnelly
Actions #5

Updated by Patrick Donnelly about 2 months ago

Odd debugging output:

2026-01-25T11:45:32.387+0000 7f42ad7f46c0 10 mgr notify_all notify_all: notify_all command
2026-01-25T11:45:32.388+0000 7f41d5c146c0  0 [volumes DEBUG root] mon_command: 'osd pool application enable' -> 0 in 1.002s
2026-01-25T11:45:32.388+0000 7f41d5c146c0  0 [volumes DEBUG root] using uri file:///.mgr:devicehealth/main.db?vfs=ceph

The log shows this is for the "volumes" plugin but it should be "devicehealth".

Continuing investigation...

Actions #6

Updated by Patrick Donnelly about 2 months ago

Evidence it looked up the killpoint:

2026-01-25T11:45:33.054+0000 7f41d5c146c0 20 mgr get_config  key: mgr/devicehealth/sqlite3_killpoint
2026-01-25T11:45:33.054+0000 7f41d5c146c0 10 mgr get_typed_config get_typed_config sqlite3_killpoint found

https://github.com/ceph/ceph/blob/390843f79eeb97df5ba92a8b67f3715816fc8f7c/src/pybind/mgr/mgr_module.py#L1305

and note it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.

Actions #7

Updated by Laura Flores about 2 months ago

Patrick Donnelly wrote in #note-6:

Evidence it looked up the killpoint:

[...]

https://github.com/ceph/ceph/blob/390843f79eeb97df5ba92a8b67f3715816fc8f7c/src/pybind/mgr/mgr_module.py#L1305

and note it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.

The use of "volumes" instead of "devicehealth" looks similar to what's going on here: https://tracker.ceph.com/issues/74543#note-19

Actions #8

Updated by Laura Flores about 2 months ago

  • Related to Bug #74543: Rocky10 - AttributeError in dashboard module added
Actions #9

Updated by Laura Flores about 2 months ago

@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.

Actions #10

Updated by Patrick Donnelly about 2 months ago

Laura Flores wrote in #note-9:

@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.

It may be related.

We also have evidence the mgr observed the correct value:

2026-01-25T11:45:28.374+0000 7f42ae7f66c0 20 mgr ms_dispatch2 set config/mgr/mgr/devicehealth/sqlite3_killpoint = 1

but still the code did not call _exit at that killpoint. Since this code has not really changed and I don't immediately see why/how it's not working, I'm going to suggest that this be catalogued with the other py3.12 issues and, hopefully, resolved or figured out as the other, more pressing problems like #74543 are addressed.

Actions #11

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

Actions #12

Updated by Nitzan Mordechai about 1 month ago

@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

it looks like we are waiting in _wait_for_killpoint_death and timing out:

2026-02-08T02:33:37.146 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:37.248 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:37.349 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:37.450 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:42.450 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:42.551 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:42.652 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:42.753 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:46.476 DEBUG:teuthology.orchestra.run.trial089:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:46.479 DEBUG:teuthology.orchestra.run.trial121:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:47.754 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:47.855 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:47.955 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:48.056 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:53.057 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:53.158 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:53.259 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:53.359 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:58.360 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:58.461 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:58.562 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:58.663 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:34:03.664 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:34:03.764 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:34:03.865 INFO:tasks.mgr.test_devicehealth:z

The retry count doesn't increase, and it looks like we are in the middle of a restart.
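For context, the helper being awaited has roughly this shape (an illustrative sketch, not the actual teuthology `wait_until_true` source): it polls a predicate and raises once the timeout elapses, so a killpoint that never fires surfaces as the `TestTimeoutError` seen in the traceback.

```python
import time


class TestTimeoutError(Exception):
    """Raised when the predicate never becomes true in time."""


def wait_until_true(predicate, timeout, period=5):
    # Illustrative only: poll `predicate` every `period` seconds
    # and give up once `timeout` seconds have elapsed.
    elapsed = 0
    while not predicate():
        if elapsed >= timeout:
            raise TestTimeoutError(f"Timed out after {elapsed}s")
        time.sleep(period)
        elapsed += period
```

In this sketch the test's 30-second window is spent entirely inside one predicate loop, which would match the log showing `retry_count=0` throughout.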

Actions #13

Updated by Patrick Donnelly about 1 month ago

Nitzan Mordechai wrote in #note-12:

@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

it looks like we are waiting in _wait_for_killpoint_death and timing out:
[...]

The retry count doesn't increase, and it looks like we are in the middle of a restart.

The issue is that the mgr never exits due to a call to _exit, as we would expect. I talked about that in https://tracker.ceph.com/issues/74605#note-10

To be clear, I do not think this has anything to do with libcephsqlite. This is just the only killpoint test in the mgr which uses _exit to immediately terminate the mgr.

Actions #14

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_02:32:59-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/42034

Actions #15

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_23:45:18-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial
['44147', '44177', '44157', '44167']

Actions #16

Updated by Nitzan Mordechai about 1 month ago

/a/nmordech-2026-02-17_05:44:40-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-13-1771001916-distro-default-trial/51654

Actions #17

Updated by Nitzan Mordechai 27 days ago

/a/nmordech-2026-02-21_21:31:01-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-21-1771692874-tentacle-distro-default-trial/62998

Actions #18

Updated by Nitzan Mordechai 27 days ago

  • Backport set to tentacle
Actions #19

Updated by Nitzan Mordechai 24 days ago

/a/nmordech-2026-02-25_11:34:16-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/69835

Actions #20

Updated by Laura Flores 23 days ago

/a/nmordech-2026-02-25_11:36:23-rados-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/70008

Actions #21

Updated by Laura Flores 23 days ago

  • Related to QA Run #75162: wip-rocky10-branch-of-the-day-2026-02-24-1771941190 added
Actions #22

Updated by Nitzan Mordechai 19 days ago

/a/yaarit-2026-02-26_20:20:34-rados-wip-rocky10-branch-of-the-day-2026-02-26-1772108951-distro-default-trial/72900

Actions #23

Updated by Nitzan Mordechai 18 days ago

@Patrick Donnelly, the logging issue was fixed, but this failure keeps showing up in the tracker.
I think some differences in Python 3.12's sqlite3 transaction handling and autocommit are causing this issue. I found this PR: https://github.com/ceph/ceph/pull/61763

and the Python 3.12 docs: https://docs.python.org/3.12/library/sqlite3.html#transaction-control
Since we are using isolation_level=None, it seems that every statement is now autocommitted, and the test fails.
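The autocommit behavior described in those docs can be seen with a small standalone script (an illustration only, not the mgr code): with isolation_level=None the sqlite3 module issues no implicit BEGIN, so every statement commits immediately and is visible to other connections without an explicit commit().

```python
import os
import sqlite3
import tempfile

# Illustration only (not Ceph code): with isolation_level=None the
# sqlite3 module runs in autocommit mode, so each statement is
# committed as soon as it executes.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("CREATE TABLE kv (k TEXT, v TEXT)")
con.execute("INSERT INTO kv VALUES ('a', '1')")  # no commit() call anywhere

# A second, independent connection already sees the row.
other = sqlite3.connect(path)
rows = other.execute("SELECT * FROM kv").fetchall()
```

A test that expects to observe an open, uncommitted transaction at a killpoint would therefore behave differently under this mode.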

Actions #24

Updated by Patrick Donnelly 18 days ago

The bug is caused by:

commit 78983ad0d0cce422da32dc4876ac186f6d32c3f5
Author: Adam King <adking@redhat.com>
Date:   Wed Jun 16 08:02:15 2021 -0400

    mgr/cephadm: cephadm agent 2.0

    Creates an http endpoint in mgr/cephadm to receive
    http requests and an agent that can be deployed on
    each host that will gather metadata on the host and
    send it to the mgr/cephadm http endpoint. Should save the
    cephadm mgr module a lot of time it would have to spend
    repeatedly ssh-ing into each host to gather the metadata
    and help performance on larger clusters.

    Fixes: https://tracker.ceph.com/issues/51004

    Signed-off-by: Adam King <adking@redhat.com>

diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 7b5676ac5d5..8a5ff3de81d 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -28,6 +28,8 @@ from ceph.deployment.service_spec import \
 from ceph.utils import str_to_datetime, datetime_to_str, datetime_now
 from cephadm.serve import CephadmServe
 from cephadm.services.cephadmservice import CephadmDaemonDeploySpec
+from cephadm.agent import CherryPyThread, CephadmAgentHelpers
+

 from mgr_module import MgrModule, HandleCommandResult, Option
 from mgr_util import create_self_signed_cert
@@ -45,7 +47,7 @@ from . import utils
 from . import ssh
 from .migrations import Migrations
 from .services.cephadmservice import MonService, MgrService, MdsService, RgwService, \
-    RbdMirrorService, CrashService, CephadmService, CephfsMirrorService
+    RbdMirrorService, CrashService, CephadmService, CephfsMirrorService, CephadmAgent
 from .services.ingress import IngressService
 from .services.container import CustomContainerService
 from .services.iscsi import IscsiService
@@ -79,6 +81,16 @@ Host *
   ConnectTimeout=30
 """ 

+# cherrypy likes to sys.exit on error.  don't let it take us down too!
+
+
+def os_exit_noop(status: int) -> None:
+    pass
+
+
+os._exit = os_exit_noop   # type: ignore

We are using os._exit for sqlite3 killpoints. Either this needs to be reverted or we need an internal function to call _exit.
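The effect of that hunk can be reproduced in isolation (a minimal sketch, not Ceph code): the assignment patches os._exit process-wide, so a later call in any other module of the same interpreter, such as the devicehealth killpoint, silently becomes a no-op.

```python
import os

real_exit = os._exit  # keep a reference to the real function

def os_exit_noop(status: int) -> None:
    pass

os._exit = os_exit_noop  # what the cephadm agent commit does, process-wide

# Elsewhere in the same process (e.g. a devicehealth killpoint),
# this call now does nothing instead of terminating the interpreter:
os._exit(120)
survived = True  # only reachable because os._exit was neutered

os._exit = real_exit  # restoring the saved reference undoes the patch
```

One workaround along the lines suggested above would be for code that relies on the real behavior to bind a reference like `real_exit = os._exit` before mgr/cephadm is imported; whether that import ordering can be guaranteed in the mgr is an assumption here, not something the tracker establishes.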

Actions #25

Updated by Patrick Donnelly 18 days ago

  • Status changed from New to Fix Under Review
  • Target version set to v21.0.0
  • Source set to Q/A
  • Pull request ID set to 67632
Actions #26

Updated by Upkeep Bot 3 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Merge Commit set to 2de737fa184892e6c38f4da5744e04c2771b9a7c
  • Fixed In set to v20.3.0-6220-g2de737fa18
  • Upkeep Timestamp set to 2026-03-18T18:15:06+00:00
Actions #27

Updated by Upkeep Bot 3 days ago

  • Copied to Backport #75604: tentacle: Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1) added
Actions #28

Updated by Upkeep Bot 3 days ago

  • Tags (freeform) set to backport_processed
Actions
