Bug #71915
openqa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally
Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Testing
Target version:
% Done:
0%
Source:
Q/A
Backport:
tentacle,squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Merge Commit:
Fixed In:
v20.3.0-1897-gcf379ffe12
Released In:
Upkeep Timestamp:
2025-07-25T11:18:03+00:00
Description
The test test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally on the main branch. This was found while testing an upstream PR.
2025-06-28T18:52:31.443 INFO:tasks.cephfs_test_runner:Test when a CephFS has 2 active MDSs and one of them have either ... FAIL
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:======================================================================
2025-06-28T18:52:31.444 INFO:tasks.cephfs_test_runner:FAIL: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:Test when a CephFS has 2 active MDSs and one of them have either
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/cephfs/test_admin.py", line 2846, in test_with_health_warn_with_2_active_MDSs
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: self.negtest_ceph_cmd(args=f'mds fail {non_hw_mds_id}', retval=1,
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 114, in negtest_ceph_cmd
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: self._verify(proc, retval, errmsgs)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 69, in _verify
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: self.assert_retval(proc.returncode, exp_retval)
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/github.com_kotreshhr_ceph_2ac2b127953bf014f6dae347a56f973dc93e6069/qa/tasks/ceph_test_case.py", line 60, in assert_retval
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner: assert proc_retval == exp_retval, msg
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:AssertionError: expected return value: 1
2025-06-28T18:52:31.445 INFO:tasks.cephfs_test_runner:received return value: 0
The health detail output just before the failure:
2025-06-28T18:52:20.355 DEBUG:teuthology.orchestra.run.smithi072:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:52:20.756 INFO:teuthology.orchestra.run.smithi072.stderr:2025-06-28T18:52:20.753+0000 7f457effd640 1 -- 172.21.15.72:0/162872624 <== mon.0 v2:172.21.15.72:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0 v0) ==== 92+0+291 (secure 0 0 0) 0x7f4578061690 con 0x7f4588165740
2025-06-28T18:52:20.757 INFO:teuthology.orchestra.run.smithi072.stdout:
2025-06-28T18:52:20.757 INFO:teuthology.orchestra.run.smithi072.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}
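The failure comes down to which (and how many) MDSs the MDS_CACHE_OVERSIZED check lists. For illustration, the MDS names can be pulled out of the `health detail --format json` output above with a few lines of Python (a sketch; `mdss_reporting` is a hypothetical helper, not part of the QA suite):

```python
import json

def mdss_reporting(health_json, check):
    """Return the mds names (e.g. 'mds.d') named in the detail messages
    of the given health check, or [] if the check is absent."""
    health = json.loads(health_json)
    details = health.get("checks", {}).get(check, {}).get("detail", [])
    # each detail message starts like "mds.d(mds.0): MDS cache is too large ..."
    return [d["message"].split("(")[0] for d in details]

# the failing run's output above names only one MDS:
failing = '{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}'
print(mdss_reporting(failing, "MDS_CACHE_OVERSIZED"))  # ['mds.d']
```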
Updated by Kotresh Hiremath Ravishankar 9 months ago
Looking at the logs of the successful runs:
2025-06-28T18:31:33.699 DEBUG:teuthology.orchestra.run.smithi088:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stderr:2025-06-28T18:31:34.102+0000 7f0b23fff640 1 -- 172.21.15.88:0/1494550706 <== mon.0 v2:172.21.15.88:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0 v0) ==== 92+0+396 (secure 0 0 0) 0x7f0b28032590 con 0x7f0b340ba280
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stdout:
2025-06-28T18:31:34.108 INFO:teuthology.orchestra.run.smithi088.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"2 MDSs report oversized cache","count":2},"detail":[{"message":"mds.c(mds.1): MDS cache is too large (48kB/1kB); 0 inodes in use by clients, 0 stray files"},{"message":"mds.a(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}
2025-06-28T18:37:12.436 DEBUG:teuthology.orchestra.run.smithi028:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:37:12.869 INFO:teuthology.orchestra.run.smithi028.stderr:2025-06-28T18:37:12.867+0000 7f1da27fc640 1 -- 172.21.15.28:0/3110406769 <== mon.0 v2:172.21.15.28:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0 v0) ==== 92+0+291 (secure 0 0 0) 0x7f1db00617a0 con 0x7f1db4167e30
2025-06-28T18:37:12.870 INFO:teuthology.orchestra.run.smithi028.stdout:
2025-06-28T18:37:12.870 INFO:teuthology.orchestra.run.smithi028.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"1 MDSs report oversized cache","count":1},"detail":[{"message":"mds.c(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}
2025-06-28T18:36:16.715 DEBUG:teuthology.orchestra.run.smithi055:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph health detail --format json
...
...
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stderr:2025-06-28T18:36:17.134+0000 7f66227fc640 1 -- 172.21.15.55:0/2590277424 <== mon.0 v2:172.21.15.55:3300/0 7 ==== mon_command_ack([{"prefix": "health", "detail": "detail", "format": "json"}]=0 v0) ==== 92+0+396 (secure 0 0 0) 0x7f6618011040 con 0x7f6624099040
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stdout:
2025-06-28T18:36:17.138 INFO:teuthology.orchestra.run.smithi055.stdout:{"status":"HEALTH_WARN","checks":{"MDS_CACHE_OVERSIZED":{"severity":"HEALTH_WARN","summary":{"message":"2 MDSs report oversized cache","count":2},"detail":[{"message":"mds.a(mds.1): MDS cache is too large (51kB/1kB); 0 inodes in use by clients, 0 stray files"},{"message":"mds.d(mds.0): MDS cache is too large (1MB/1kB); 401 inodes in use by clients, 0 stray files"}],"muted":false}},"mutes":[]}
So most of the time both active MDSs have generated health warnings, and hence the test passes even with the PR https://github.com/ceph/ceph/pull/61554. In the cases where only one MDS had generated a health warning and the job still passed, I think that MDS already had the health warning by the time `mds fail` was run!?
Here is the test case for reference.
def test_with_health_warn_with_2_active_MDSs(self):
    '''
    Test when a CephFS has 2 active MDSs and one of them have either
    health warning MDS_TRIM or MDS_CACHE_OVERSIZE, running "ceph mds fail"
    fails for both MDSs without confirmation flag and passes for both when
    confirmation flag is passed.
    '''
    health_warn = 'MDS_CACHE_OVERSIZED'
    self.fs.set_max_mds(2)
    self.gen_health_warn_mds_cache_oversized()
    mds1_id, mds2_id = self.fs.get_active_names()
    # MDS ID for which health warning has been generated.
    hw_mds_id = self._get_unhealthy_mds_id(health_warn)
    if mds1_id == hw_mds_id:
        non_hw_mds_id = mds2_id
    elif mds2_id == hw_mds_id:
        non_hw_mds_id = mds1_id
    else:
        raise RuntimeError('There are only 2 MDSs right now but apparently '
                           'health warning was raised for an MDS other '
                           'than these two. This is definitely an error.')
    # actual testing begins now...
    self.negtest_ceph_cmd(args=f'mds fail {non_hw_mds_id}', retval=1,
                          errmsgs=health_warn)
    self.negtest_ceph_cmd(args=f'mds fail {hw_mds_id}', retval=1,
                          errmsgs=health_warn)
    self.run_ceph_cmd(f'mds fail {mds1_id} --yes-i-really-mean-it')
    self.run_ceph_cmd(f'mds fail {mds2_id} --yes-i-really-mean-it')
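Given the analysis above, one hypothetical way to make the timing deterministic would be to poll until the expected health state is reached before running `mds fail`, rather than assuming the warning is already up. A minimal generic polling helper (a sketch only; the actual fix landed via PR 64297):

```python
import time

def wait_until(predicate, timeout=60, interval=2):
    """Poll predicate() until it returns a truthy value or timeout
    seconds elapse. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# e.g. in the test, wait for the warning before the negative checks:
#   wait_until(lambda: self._get_unhealthy_mds_id(health_warn) is not None,
#              timeout=120)
```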
Updated by Kotresh Hiremath Ravishankar 9 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 64297
Updated by Kotresh Hiremath Ravishankar 8 months ago
- Backport set to tentacle,squid,reef
Updated by Venky Shankar 8 months ago
- Category set to Testing
- Status changed from Fix Under Review to Pending Backport
- Target version set to v21.0.0
- Source set to Q/A
Updated by Upkeep Bot 8 months ago
- Merge Commit set to cf379ffe1205d8583a6cacb61d70b2cbe16f8e91
- Fixed In set to v20.3.0-1897-gcf379ffe12
- Upkeep Timestamp set to 2025-07-25T11:18:03+00:00
Updated by Upkeep Bot 8 months ago
- Copied to Backport #72280: squid: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added
Updated by Upkeep Bot 8 months ago
- Copied to Backport #72281: tentacle: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added
Updated by Upkeep Bot 8 months ago
- Copied to Backport #72282: reef: qa: test_with_health_warn_with_2_active_MDSs (tasks.cephfs.test_admin.TestMDSFail) fails occasionally added