Bug #71915


qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally

Added by Kotresh Hiremath Ravishankar 9 months ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Testing
Target version:
% Done:

0%

Source:
Q/A
Backport:
tentacle,squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-1897-gcf379ffe12
Released In:
Upkeep Timestamp:
2025-07-25T11:18:03+00:00

Description

The test test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally on the main branch. This was found while testing an upstream PR.

https://pulpito.ceph.com/khiremat-2025-06-28_17:10:36-fs:functional-wip-khiremat-qa-test-1-distro-default-smithi/8355022

2025-06-28T18:52:31.443 INFO:tasks.cephfs_test_runner:Test when a CephFS has 2 active MDSs and one of them have either ... FAIL
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:======================================================================
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:FAIL: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:Test when a CephFS has 2 active MDSs and one of them have either
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/cephfs/test_admin.py", line 2846, in test_with_health_warn_with_2_active_MDSs
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:    self.negtest_ceph_cmd(args=f'mds fail {non_hw_mds_id}', retval=1,
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 114, in negtest_ceph_cmd
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:    self._verify(proc, retval, errmsgs)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 69, in _verify
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:    self.assert_retval(proc.returncode, exp_retval)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 60, in assert_retval
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:    assert proc_retval == exp_retval, msg
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:AssertionError: expected return value: 1
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:received return value: 0

Here is the health detail output from just before the failure.

2025-06-28T18:52:20.355 DEBUG:teuthology.orchestra.run.smithi072:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:52:20.756 INFO:teuthology.orchestra.run.smithi072.stderr:2025-06-28T18:52:20.753+0000 7f457effd640  1 -- 172.21.15.72:0/162872624 <== mon.0 v2:172.21.15.72:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0  v0) ==== 92+0+291 (secure 0 0 0) 0x7f4578061690 con 0x7f4588165740
2025-06-28T18:52:20.757 INFO:teuthology.orchestra.run.smithi072.stdout:
2025-06-28T18:52:20.757 INFO:teuthology.orchestra.run.smithi072.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}
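Note that only one MDS (mds.d) appears in the MDS_CACHE_OVERSIZED detail above, which is what makes the negative test's expectation racy. The daemon names that raised the warning can be pulled out of the `ceph health detail --format json` output directly; a minimal standalone sketch (the parsing helper is illustrative, not the QA framework's actual code):

```python
import json
import re

def unhealthy_mds_names(health_json, check="MDS_CACHE_OVERSIZED"):
    """Return the MDS daemon names mentioned in a health check's detail."""
    details = json.loads(health_json).get("checks", {}).get(check, {}).get("detail", [])
    names = []
    for entry in details:
        # Detail messages look like "mds.d(mds.0): MDS cache is too large ..."
        m = re.match(r"mds\.([^(]+)\(", entry["message"])
        if m:
            names.append(m.group(1))
    return names

# The health detail output captured just before the failure:
output = '{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}'

print(unhealthy_mds_names(output))  # → ['d']
```

Only mds.d shows up, so at this point the second active MDS had no warning against it.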


Related issues 3 (2 open, 1 closed)

Copied to CephFS - Backport #72280: squid: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally (New; Kotresh Hiremath Ravishankar)
Copied to CephFS - Backport #72281: tentacle: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally (Resolved; Jos Collin)
Copied to CephFS - Backport #72282: reef: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally (In QA Testing; Kotresh Hiremath Ravishankar)
Actions #1

Updated by Kotresh Hiremath Ravishankar 9 months ago

Looking at the logs of the successful runs:

1. https://pulpito.ceph.com/khiremat-2025-06-28_17:10:36-fs:functional-wip-khiremat-qa-test-1-distro-default-smithi/8355020/

2025-06-28T18:31:33.699 DEBUG:teuthology.orchestra.run.smithi088:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stderr:2025-06-28T18:31:34.102+0000 7f0b23fff640  1 -- 172.21.15.88:0/1494550706 <== mon.0 v2:172.21.15.88:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0  v0) ==== 92+0+396 (secure 0 0 0) 0x7f0b28032590 con 0x7f0b340ba280
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stdout:
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"2 MDSs report oversized cache","count":2},"detail":[{"message":"mds.c(mds.1): MDS cache is too large (48kB/1kB); 0 inodes in use by clients, 0 stray files"},{"message":"mds.a(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}

2. https://pulpito.ceph.com/khiremat-2025-06-28_17:10:36-fs:functional-wip-khiremat-qa-test-1-distro-default-smithi/8355021/

2025-06-28T18:37:12.436 DEBUG:teuthology.orchestra.run.smithi028:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:37:12.869 INFO:teuthology.orchestra.run.smithi028.stderr:2025-06-28T18:37:12.867+0000 7f1da27fc640  1 -- 172.21.15.28:0/3110406769 <== mon.0 v2:172.21.15.28:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0  v0) ==== 92+0+291 (secure 0 0 0) 0x7f1db00617a0 con 0x7f1db4167e30
2025-06-28T18:37:12.870 INFO:teuthology.orchestra.run.smithi028.stdout:
2025-06-28T18:37:12.870 INFO:teuthology.orchestra.run.smithi028.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.c(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}

3. https://pulpito.ceph.com/khiremat-2025-06-28_17:10:36-fs:functional-wip-khiremat-qa-test-1-distro-default-smithi/8355023/

2025-06-28T18:36:16.715 DEBUG:teuthology.orchestra.run.smithi055:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stderr:2025-06-28T18:36:17.134+0000 7f66227fc640  1 -- 172.21.15.55:0/2590277424 <== mon.0 v2:172.21.15.55:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0  v0) ==== 92+0+396 (secure 0 0 0) 0x7f6618011040 con 0x7f6624099040
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stdout:
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"2 MDSs report oversized cache","count":2},"detail":[{"message":"mds.a(mds.1): MDS cache is too large (51kB/1kB); 0 inodes in use by clients, 0 stray files"},{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}

So most of the time both active MDSs have generated the health warning, and hence the test passes even with the PR https://github.com/ceph/ceph/pull/61554. In the runs where only one MDS had generated the health warning and the job still passed, I suspect that by the time `mds fail` actually ran, the other MDS had also raised a health warning!?

Here is the test case for reference.

    def test_with_health_warn_with_2_active_MDSs(self):
        '''
        Test when a CephFS has 2 active MDSs and one of them has either
        health warning MDS_TRIM or MDS_CACHE_OVERSIZED, running "ceph mds fail"
        fails for both MDSs without the confirmation flag and passes for both
        when the confirmation flag is passed.
        '''
        health_warn = 'MDS_CACHE_OVERSIZED'
        self.fs.set_max_mds(2)
        self.gen_health_warn_mds_cache_oversized()
        mds1_id, mds2_id = self.fs.get_active_names()

        # MDS ID for which health warning has been generated.
        hw_mds_id = self._get_unhealthy_mds_id(health_warn)
        if mds1_id == hw_mds_id:
            non_hw_mds_id = mds2_id
        elif mds2_id == hw_mds_id:
            non_hw_mds_id = mds1_id
        else:
            raise RuntimeError('There are only 2 MDSs right now but apparently '
                               'health warning was raised for an MDS other '
                               'than these two. This is definitely an error.')

        # actual testing begins now...
        self.negtest_ceph_cmd(args=f'mds fail {non_hw_mds_id}', retval=1,
                              errmsgs=health_warn)
        self.negtest_ceph_cmd(args=f'mds fail {hw_mds_id}', retval=1,
                              errmsgs=health_warn)
        self.run_ceph_cmd(f'mds fail {mds1_id} --yes-i-really-mean-it')
        self.run_ceph_cmd(f'mds fail {mds2_id} --yes-i-really-mean-it')
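One way to make the negative test deterministic would be to poll until every active MDS appears in the warning's detail before asserting that `mds fail` is rejected. A hedged sketch of such a wait loop (the `get_health_detail` injection point and all names here are hypothetical illustrations, not necessarily what the merged fix in PR 64297 does):

```python
import json
import re
import time

def mdss_reporting_warning(health_json, check):
    """Extract MDS daemon names from a health check's detail messages."""
    details = json.loads(health_json).get("checks", {}).get(check, {}).get("detail", [])
    found = set()
    for d in details:
        # Messages look like "mds.a(mds.1): MDS cache is too large ..."
        m = re.match(r"mds\.([^(]+)\(", d["message"])
        if m:
            found.add(m.group(1))
    return found

def wait_for_warning_on(get_health_detail, mds_ids, check="MDS_CACHE_OVERSIZED",
                        timeout=60, interval=1):
    """Poll until all given MDSs show up in the warning detail, or time out.

    get_health_detail is a callable returning the JSON string produced by
    `ceph health detail --format json` (a hypothetical injection point so
    the loop can be exercised without a live cluster).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if set(mds_ids) <= mdss_reporting_warning(get_health_detail(), check):
            return True
        time.sleep(interval)
    return False

# Demo with a canned health payload in which both MDSs already report:
sample = json.dumps({"checks": {"MDS_CACHE_OVERSIZED": {"detail": [
    {"message": "mds.a(mds.1): MDS cache is too large"},
    {"message": "mds.d(mds.0): MDS cache is too large"},
]}}})
print(wait_for_warning_on(lambda: sample, ["a", "d"], timeout=2))  # → True
```

With such a gate in place the `negtest_ceph_cmd(args=f'mds fail {non_hw_mds_id}', ...)` assertion would only run once both MDSs are known to carry the warning, removing the race described above.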

Actions #2

Updated by Kotresh Hiremath Ravishankar 9 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 64297
Actions #3

Updated by Kotresh Hiremath Ravishankar 8 months ago

  • Backport set to tentacle,squid,reef
Actions #4

Updated by Venky Shankar 8 months ago

  • Category set to Testing
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v21.0.0
  • Source set to Q/A
Actions #5

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to cf379ffe1205d8583a6cacb61d70b2cbe16f8e91
  • Fixed In set to v20.3.0-1897-gcf379ffe12
  • Upkeep Timestamp set to 2025-07-25T11:18:03+00:00
Actions #6

Updated by Upkeep Bot 8 months ago

  • Copied to Backport #72280: squid: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added
Actions #7

Updated by Upkeep Bot 8 months ago

  • Copied to Backport #72281: tentacle: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added
Actions #8

Updated by Upkeep Bot 8 months ago

  • Copied to Backport #72282: reef: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added
Actions #9

Updated by Upkeep Bot 8 months ago

  • Tags (freeform) set to backport_processed