Bug #74605

open

Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)

Added by Laura Flores about 2 months ago. Updated 3 days ago.

Status:
Pending Backport
Priority:
Normal
Category:
devicehealth module
Target version:
% Done:

0%

Source:
Q/A
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-6220-g2de737fa18
Released In:
Upkeep Timestamp:
2026-03-18T18:15:06+00:00
Tags:

Description

/a/nmordech-2026-01-25_11:10:14-rados-wip-rocky10-branch-of-the-day-2026-01-23-1769128778-distro-default-trial/17021

2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:======================================================================
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 103, in test_sql_autocommit1
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    return self._test_sql_autocommit(1)
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 92, in _test_sql_autocommit
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    self._wait_for_killpoint_death()
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/mgr/test_devicehealth.py", line 58, in _wait_for_killpoint_death
2026-01-25T11:46:03.749 INFO:tasks.cephfs_test_runner:    self.wait_until_true(killpoint_dead, timeout=30)
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_7f2fdcf9ada173165e852daae5c39da5989bddd1/qa/tasks/ceph_test_case.py", line 339, in wait_until_true
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:Ran 2 tests in 93.894s
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:
2026-01-25T11:46:03.750 INFO:tasks.cephfs_test_runner:FAILED (errors=1)

This was found in a Rocky10 run. I'm not certain yet if it is exclusive to the Rocky10 work, but for now, this has only been seen on this run.


Related issues (4): 3 open, 1 closed

Related to Ceph QA - QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 (QA Needs Approval, Laura Flores)
Related to mgr - Bug #74543: Rocky10 - AttributeError in dashboard module (Duplicate, Samuel Just)
Related to Ceph QA - QA Run #75162: wip-rocky10-branch-of-the-day-2026-02-24-1771941190 (QA Testing)
Copied to mgr - Backport #75604: tentacle: Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1) (New, Patrick Donnelly)
Actions #1

Updated by Laura Flores about 2 months ago

  • Assignee deleted (Yaarit Hatuka)
Actions #2

Updated by Laura Flores about 2 months ago

  • Related to QA Run #74540: wip-rocky10-branch-of-the-day-2026-01-23-1769128778 added
Actions #4

Updated by Patrick Donnelly about 2 months ago

  • Assignee set to Patrick Donnelly
Actions #5

Updated by Patrick Donnelly about 2 months ago

Odd debugging output:

2026-01-25T11:45:32.387+0000 7f42ad7f46c0 10 mgr notify_all notify_all: notify_all command
2026-01-25T11:45:32.388+0000 7f41d5c146c0  0 [volumes DEBUG root] mon_command: 'osd pool application enable' -> 0 in 1.002s
2026-01-25T11:45:32.388+0000 7f41d5c146c0  0 [volumes DEBUG root] using uri file:///.mgr:devicehealth/main.db?vfs=ceph

The log shows this is for the "volumes" plugin but it should be "devicehealth".

Continuing investigation...

Actions #6

Updated by Patrick Donnelly about 2 months ago

Evidence it looked up the killpoint:

2026-01-25T11:45:33.054+0000 7f41d5c146c0 20 mgr get_config  key: mgr/devicehealth/sqlite3_killpoint
2026-01-25T11:45:33.054+0000 7f41d5c146c0 10 mgr get_typed_config get_typed_config sqlite3_killpoint found

https://github.com/ceph/ceph/blob/390843f79eeb97df5ba92a8b67f3715816fc8f7c/src/pybind/mgr/mgr_module.py#L1305

and note it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.

Actions #7

Updated by Laura Flores about 2 months ago

Patrick Donnelly wrote in #note-6:

Evidence it looked up the killpoint:

[...]

https://github.com/ceph/ceph/blob/390843f79eeb97df5ba92a8b67f3715816fc8f7c/src/pybind/mgr/mgr_module.py#L1305

and note it uses the correct "devicehealth" name. Odd. Not sure what's going on here yet.

The use of "volumes" instead of "devicehealth" looks similar to what's going on here: https://tracker.ceph.com/issues/74543#note-19

Actions #8

Updated by Laura Flores about 2 months ago

  • Related to Bug #74543: Rocky10 - AttributeError in dashboard module added
Actions #9

Updated by Laura Flores about 2 months ago

@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.

Actions #10

Updated by Patrick Donnelly about 2 months ago

Laura Flores wrote in #note-9:

@Patrick Donnelly note that the other issue I marked as "Related" probably has a different root cause than this one; they just share the same incorrect logging issue.

It may be related.

We also have evidence the mgr observed the correct value:

2026-01-25T11:45:28.374+0000 7f42ae7f66c0 20 mgr ms_dispatch2 set config/mgr/mgr/devicehealth/sqlite3_killpoint = 1

but still the code did not call _exit at that killpoint. Since this code has not really changed and I don't immediately see why/how it's not working, I'm going to suggest that this be catalogued with the other py3.12 issues and, hopefully, resolved or figured out as the other, more pressing problems like #74543 are addressed.

Actions #11

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

Actions #12

Updated by Nitzan Mordechai about 1 month ago

@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

it looks like we are waiting in _wait_for_killpoint_death and timing out:

2026-02-08T02:33:37.146 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:37.248 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:37.349 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:37.450 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:42.450 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:42.551 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:42.652 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:42.753 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:46.476 DEBUG:teuthology.orchestra.run.trial089:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:46.479 DEBUG:teuthology.orchestra.run.trial121:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2026-02-08T02:33:47.754 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:47.855 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:47.955 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:48.056 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:53.057 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:53.158 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:53.259 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:53.359 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:33:58.360 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:33:58.461 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:33:58.562 INFO:tasks.mgr.test_devicehealth:z
2026-02-08T02:33:58.663 DEBUG:tasks.ceph_test_case:wait_until_true: waiting (timeout=30 retry_count=0)...
2026-02-08T02:34:03.664 INFO:tasks.mgr.test_devicehealth:x
2026-02-08T02:34:03.764 INFO:tasks.mgr.test_devicehealth:y
2026-02-08T02:34:03.865 INFO:tasks.mgr.test_devicehealth:z

The retry count doesn't increase, and it looks like we are in the middle of a restart.
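For context, the helper being awaited has roughly this shape (an illustrative sketch, not the actual teuthology `wait_until_true` source): it polls a predicate and raises once the timeout elapses, so a killpoint that never fires surfaces as the `TestTimeoutError` seen in the traceback.

```python
import time


class TestTimeoutError(Exception):
    """Raised when the predicate never becomes true in time."""


def wait_until_true(predicate, timeout, period=5):
    # Illustrative only: poll `predicate` every `period` seconds
    # and give up once `timeout` seconds have elapsed.
    elapsed = 0
    while not predicate():
        if elapsed >= timeout:
            raise TestTimeoutError(f"Timed out after {elapsed}s")
        time.sleep(period)
        elapsed += period
```

In this sketch the test's 30-second window is spent entirely inside one predicate loop, which would match the log showing `retry_count=0` throughout.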

Actions #13

Updated by Patrick Donnelly about 1 month ago

Nitzan Mordechai wrote in #note-12:

@Patrick Donnelly it reoccurred on the new branch a few times:
/a/yaarit-2026-02-08_02:23:05-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-06-1770413686-distro-default-trial/
4 jobs: ['39830', '39800', '39820', '39810']

it looks like we are waiting in _wait_for_killpoint_death and timing out:
[...]

The retry count doesn't increase, and it looks like we are in the middle of a restart.

The issue is that the mgr never exits due to a call to _exit, as we would expect. I talked about that in https://tracker.ceph.com/issues/74605#note-10

To be clear, I do not think this has anything to do with libcephsqlite. This is just the only killpoint test in the mgr which uses _exit to immediately terminate the mgr.

Actions #14

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_02:32:59-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/42034

Actions #15

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_23:45:18-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial
['44147', '44177', '44157', '44167']

Actions #16

Updated by Nitzan Mordechai about 1 month ago

/a/nmordech-2026-02-17_05:44:40-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-13-1771001916-distro-default-trial/51654

Actions #17

Updated by Nitzan Mordechai 27 days ago

/a/nmordech-2026-02-21_21:31:01-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-21-1771692874-tentacle-distro-default-trial/62998

Actions #18

Updated by Nitzan Mordechai 27 days ago

  • Backport set to tentacle
Actions #19

Updated by Nitzan Mordechai 24 days ago

/a/nmordech-2026-02-25_11:34:16-rados:mgr-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/69835

Actions #20

Updated by Laura Flores 23 days ago

/a/nmordech-2026-02-25_11:36:23-rados-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/70008

Actions #21

Updated by Laura Flores 23 days ago

  • Related to QA Run #75162: wip-rocky10-branch-of-the-day-2026-02-24-1771941190 added
Actions #22

Updated by Nitzan Mordechai 19 days ago

/a/yaarit-2026-02-26_20:20:34-rados-wip-rocky10-branch-of-the-day-2026-02-26-1772108951-distro-default-trial/72900

Actions #23

Updated by Nitzan Mordechai 18 days ago

@Patrick Donnelly, the logging issue was fixed, but this failure keeps showing up in the tracker.
I think some differences in Python 3.12's sqlite3 transaction handling and autocommit are causing this issue. I found this PR: https://github.com/ceph/ceph/pull/61763

and the Python 3.12 docs: https://docs.python.org/3.12/library/sqlite3.html#transaction-control
Since we are using isolation_level=None, it seems that every statement is now autocommitted, and the test fails.
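The autocommit behavior described in those docs can be seen with a small standalone script (an illustration only, not the mgr code): with isolation_level=None the sqlite3 module issues no implicit BEGIN, so every statement commits immediately and is visible to other connections without an explicit commit().

```python
import os
import sqlite3
import tempfile

# Illustration only (not Ceph code): with isolation_level=None the
# sqlite3 module runs in autocommit mode, so each statement is
# committed as soon as it executes.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("CREATE TABLE kv (k TEXT, v TEXT)")
con.execute("INSERT INTO kv VALUES ('a', '1')")  # no commit() call anywhere

# A second, independent connection already sees the row.
other = sqlite3.connect(path)
rows = other.execute("SELECT * FROM kv").fetchall()
```

A test that expects to observe an open, uncommitted transaction at a killpoint would therefore behave differently under this mode.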

Actions #24

Updated by Patrick Donnelly 18 days ago

The bug is caused by:

commit 78983ad0d0cce422da32dc4876ac186f6d32c3f5
Author: Adam King <adking@redhat.com>
Date:   Wed Jun 16 08:02:15 2021 -0400

    mgr/cephadm: cephadm agent 2.0

    Creates an http endpoint in mgr/cephadm to receive
    http requests and an agent that can be deployed on
    each host that will gather metadata on the host and
    send it to the mgr/cephadm http endpoint. Should save the
    cephadm mgr module a lot of time it would have to spend
    repeatedly ssh-ing into each host to gather the metadata
    and help performance on larger clusters.

    Fixes: https://tracker.ceph.com/issues/51004

    Signed-off-by: Adam King <adking@redhat.com>

diff --git a/src/pybind/mgr/cephadm/module.py b/src/pybind/mgr/cephadm/module.py
index 7b5676ac5d5..8a5ff3de81d 100644
--- a/src/pybind/mgr/cephadm/module.py
+++ b/src/pybind/mgr/cephadm/module.py
@@ -28,6 +28,8 @@ from ceph.deployment.service_spec import \
 from ceph.utils import str_to_datetime, datetime_to_str, datetime_now
 from cephadm.serve import CephadmServe
 from cephadm.services.cephadmservice import CephadmDaemonDeploySpec
+from cephadm.agent import CherryPyThread, CephadmAgentHelpers
+

 from mgr_module import MgrModule, HandleCommandResult, Option
 from mgr_util import create_self_signed_cert
@@ -45,7 +47,7 @@ from . import utils
 from . import ssh
 from .migrations import Migrations
 from .services.cephadmservice import MonService, MgrService, MdsService, RgwService, \
-    RbdMirrorService, CrashService, CephadmService, CephfsMirrorService
+    RbdMirrorService, CrashService, CephadmService, CephfsMirrorService, CephadmAgent
 from .services.ingress import IngressService
 from .services.container import CustomContainerService
 from .services.iscsi import IscsiService
@@ -79,6 +81,16 @@ Host *
   ConnectTimeout=30
 """ 

+# cherrypy likes to sys.exit on error.  don't let it take us down too!
+
+
+def os_exit_noop(status: int) -> None:
+    pass
+
+
+os._exit = os_exit_noop   # type: ignore

We are using os._exit for sqlite3 killpoints. Either this needs to be reverted or we need an internal function to call _exit.
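The effect of that hunk can be reproduced in isolation (a minimal sketch, not Ceph code): the assignment patches os._exit process-wide, so a later call in any other module of the same interpreter, such as the devicehealth killpoint, silently becomes a no-op.

```python
import os

real_exit = os._exit  # keep a reference to the real function

def os_exit_noop(status: int) -> None:
    pass

os._exit = os_exit_noop  # what the cephadm agent commit does, process-wide

# Elsewhere in the same process (e.g. a devicehealth killpoint),
# this call now does nothing instead of terminating the interpreter:
os._exit(120)
survived = True  # only reachable because os._exit was neutered

os._exit = real_exit  # restoring the saved reference undoes the patch
```

One workaround along the lines suggested above would be for code that relies on the real behavior to bind a reference like `real_exit = os._exit` before mgr/cephadm is imported; whether that import ordering can be guaranteed in the mgr is an assumption here, not something the tracker establishes.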

Actions #25

Updated by Patrick Donnelly 18 days ago

  • Status changed from New to Fix Under Review
  • Target version set to v21.0.0
  • Source set to Q/A
  • Pull request ID set to 67632
Actions #26

Updated by Upkeep Bot 3 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Merge Commit set to 2de737fa184892e6c38f4da5744e04c2771b9a7c
  • Fixed In set to v20.3.0-6220-g2de737fa18
  • Upkeep Timestamp set to 2026-03-18T18:15:06+00:00
Actions #27

Updated by Upkeep Bot 3 days ago

  • Copied to Backport #75604: tentacle: Rocky10 - ERROR: test_sql_autocommit1 (tasks.mgr.test_devicehealth.TestDeviceHealth.test_sql_autocommit1) added
Actions #28

Updated by Upkeep Bot 3 days ago

  • Tags (freeform) set to backport_processed
Actions
