Bug #63931
qa: test_mirroring_init_failure_with_recovery failure
Status: Closed
% Done: 0%
Description
Seen in a few qa runs recently.
https://pulpito.ceph.com/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968329
https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862
https://pulpito.ceph.com/vshankar-2023-12-13_09:42:45-fs-wip-vshankar-testing3-2023-12-13-1225-quincy-testing-default-smithi/7489625
https://pulpito.ceph.com/yuriw-2023-12-28_15:45:00-fs-wip-yuri7-testing-2023-12-27-1008-pacific-distro-default-smithi/7502932
tasks.cephfs.test_mirroring.TestMirroring.test_mirroring_init_failure_with_recovery

2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ERROR
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:======================================================================
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_5b5a60d2d510bfaa9213c41e32227fb008245a30/qa/tasks/cephfs/test_mirroring.py", line 742, in test_mirroring_init_failure_with_recovery
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:    while proceed():
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_9680d2aba1e76f64ade295c81599c97907457848/teuthology/contextutil.py", line 134, in __call__
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (21) after waiting for 100 seconds
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:Ran 29 tests in 6496.852s
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
Files
Updated by Kotresh Hiremath Ravishankar over 2 years ago
- Related to Bug #50224: qa: test_mirroring_init_failure_with_recovery failure added
Updated by Venky Shankar over 2 years ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Jos Collin
- Target version set to v19.0.0
- Backport set to quincy,reef
Updated by Jos Collin about 2 years ago
- Status changed from Triaged to In Progress
@Venky,
When we disable mirroring using 'ceph mgr module disable mirroring' and then enable it again using 'ceph fs mirror enable <primary_fs_name>', the 'fs mirror status' command doesn't output the "state: failed" status. Mirroring appears to have failed, but "state: failed" is never shown.
It never hits 'FSMirror::init', so m_init_failed is never set to true.
Updated by Venky Shankar about 2 years ago
- Target version changed from v19.0.0 to v20.0.0
- Backport changed from quincy,reef to quincy,reef,squid
Updated by Venky Shankar about 2 years ago
Jos Collin wrote:
@Venky,
When we disable mirroring using 'ceph mgr module disable mirroring' and then enable it again using 'ceph fs mirror enable <primary_fs_name>', the 'fs mirror status' command doesn't output the "state: failed" status. Mirroring appears to have failed, but "state: failed" is never shown.
It never hits 'FSMirror::init', so m_init_failed is never set to true.
So, some change likely broke the part where m_init_failed gets set or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.
Updated by Jos Collin about 2 years ago
Venky Shankar wrote:
So, some change likely broke the part where m_init_failed gets set, or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.
It doesn't seem to be a delayed check for m_init_failed. The debug always shows m_init_failed = false, even after waiting for a few minutes:
460 void FSMirror::mirror_status(Formatter *f) {
461 std::scoped_lock locker(m_lock);
!462 f->open_object_section("status");
463 if (m_init_failed) {
464 f->dump_string("state", "failed");
465 } else if (is_blocklisted(locker)) {
466 f->dump_string("state", "blocklisted");
467 } else {
468 // dump rados addr for blocklist test
469 f->dump_string("rados_inst", m_addrs);
470 f->open_object_section("peers");
471 for ([[maybe_unused]] auto &[peer, peer_replayer] : m_peer_replayers) {
472 peer.dump(f);
473 }
─── Variables ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
arg this = 0x5601b6cfe000: {m_cct = 0x5601b1e9a000,m_filesystem = {fscid = 1,fs_name = "a"},m_pool_id …, f = 0x5601b6d28e00: {_vptr.Formatter = 0x7fb7cf694b30 <vtable for ceph::JSONFormatter+16>}
loc locker = {_M_device = @0x5601b6cfe060}
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>> p m_init_failed
$4 = false
The command:
./bin/ceph --admin-daemon asok/client.admin.342404.asok fs mirror status a@1
works, but it doesn't show the 'state'. So the problem appears to be that the m_init_failed value is not being set. Checking that.
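For context, the wait that times out in the traceback can be sketched as a retry loop in the spirit of teuthology's safe_while. This is a minimal illustrative sketch, not the actual test code (the real loop is in qa/tasks/cephfs/test_mirroring.py, line 742; the helper name here is hypothetical):

```python
import json
import time


def wait_for_failed_state(get_status, tries=21, sleep=5):
    """Poll mirror status until 'state' == 'failed'; raise after
    `tries` attempts (21 tries roughly matches the 100 seconds
    reported by MaxWhileTries in the traceback)."""
    for attempt in range(tries):
        status = json.loads(get_status())
        # With m_init_failed never set, FSMirror::mirror_status()
        # dumps 'rados_inst'/'peers' instead of a 'state' key, so
        # this condition never holds and the loop times out.
        if status.get("state") == "failed":
            return attempt
        time.sleep(sleep)
    raise TimeoutError(
        f"'wait for failed state' reached maximum tries ({tries})")
```

With a healthy-looking status (the 'rados_inst'/'peers' output seen on the vstart cluster), the loop exhausts its tries exactly as in the failed runs.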
Updated by Venky Shankar about 2 years ago
Jos Collin wrote:
It doesn't seem to be a delayed check for m_init_failed. The debug always shows m_init_failed = false. The 'fs mirror status' admin socket command works, but it doesn't show the 'state'. So the problem appears to be that the m_init_failed value is not being set. Checking that.
Do you have the run link? I'll have a look now.
Updated by Jos Collin about 2 years ago
Venky Shankar wrote:
Do you have the run link? I'll have a look now.
The above finding is from a reproduction on a vstart cluster.
I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.
Updated by Venky Shankar about 2 years ago
Jos Collin wrote:
The above finding is from a reproduction on a vstart cluster. I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.
No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.
Updated by Jos Collin about 2 years ago
- File client.admin.367187.log client.admin.367187.log added
- File client.mirror_remote.367187.log client.mirror_remote.367187.log added
Venky Shankar wrote:
No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.
Logs attached.
Updated by Jos Collin almost 2 years ago
Building the failed branch again, since none of the qa runs mentioned in the Description has remote logs.
Updated by Jos Collin almost 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56193
test_mirroring_init_failure_with_recovery makes the mirror daemon fail. After this, the mirror daemon didn't restart to show the 'state' == 'failed' status. This is confirmed by the error message in the Description, and the subsequent 'mirror status' command shows 'Command failed' in teuthology.log. This issue is fixed by PR#56193.
Updated by Venky Shankar almost 2 years ago
Jos Collin wrote in #note-12:
test_mirroring_init_failure_with_recovery makes the mirror daemon fail. After this, the mirror daemon didn't restart to show the 'state' == 'failed' status. This is confirmed by the error message in the Description, and the subsequent 'mirror status' command shows 'Command failed' in teuthology.log. This issue is fixed by PR#56193.
Do you mean the FSMirror instance didn't restart or the mirror daemon crashed?
Updated by Jos Collin almost 2 years ago
Venky Shankar wrote in #note-13:
Do you mean the FSMirror instance didn't restart or the mirror daemon crashed?
In test_mirroring_init_failure_with_recovery, as I understand it, we fail the mirror daemon. It should then restart after 30 seconds, and the test checks for the 'state: failed' status. Because of the time check bug (fixed by https://github.com/ceph/ceph/pull/56193/commits/518fe18c0220bbe755df7a80db040c18667765a8), the mirror daemon does not restart. There's no crash noted in the logs; it simply fails to restart. That is why this tracker is also referenced in PR#56193.
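The actual fix is in the linked commit; purely as an illustration of the failure mode described here, the sketch below (hypothetical names and interval, not the Ceph code, which is C++) shows how an inverted time comparison keeps a failed instance from ever being restarted by a periodic check:

```python
import time

# Hypothetical restart interval; the test expects the daemon to come
# back roughly 30 seconds after the induced init failure.
RESTART_INTERVAL = 30.0


def should_restart(failed_at, now=None):
    """Correct check: restart once at least RESTART_INTERVAL seconds
    have elapsed since the failure was recorded."""
    now = time.time() if now is None else now
    return (now - failed_at) >= RESTART_INTERVAL


def should_restart_buggy(failed_at, now=None):
    """Inverted comparison: true only *before* the interval elapses.
    A periodic check that only runs after the failure has aged past
    30s therefore never restarts the instance -- the daemon stays
    down without crashing, matching the symptom in the logs."""
    now = time.time() if now is None else now
    return (now - failed_at) < RESTART_INTERVAL
```

With the buggy variant, every evaluation made after the interval has passed returns false, so the restart never fires and the test's wait for 'state: failed' times out.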
Updated by Venky Shankar almost 2 years ago
Jos Collin wrote in #note-14:
Because of the time check bug (fixed by https://github.com/ceph/ceph/pull/56193/commits/518fe18c0220bbe755df7a80db040c18667765a8), the mirror daemon does not restart. There's no crash noted in the logs; it simply fails to restart.
kk, thanks.
Updated by Venky Shankar almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
- Source set to Q/A
Updated by Jos Collin almost 2 years ago
- Copied to Backport #66972: squid: qa: test_mirroring_init_failure_with_recovery failure added
Updated by Jos Collin almost 2 years ago
- Copied to Backport #66973: reef: qa: test_mirroring_init_failure_with_recovery failure added
Updated by Jos Collin almost 2 years ago
- Copied to Backport #66974: quincy: qa: test_mirroring_init_failure_with_recovery failure added
Updated by Upkeep Bot almost 2 years ago
- Tags (freeform) set to backport_processed
Updated by Jos Collin over 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 10 months ago
- Merge Commit set to 62eb72731aca5d403ed6239946c6ea66f3be36e7
- Fixed In set to v19.3.0-3451-g62eb72731ac
- Upkeep Timestamp set to 2025-07-02T03:42:06+00:00
Updated by Upkeep Bot 10 months ago
- Fixed In changed from v19.3.0-3451-g62eb72731ac to v19.3.0-3451-g62eb72731a
- Upkeep Timestamp changed from 2025-07-02T03:42:06+00:00 to 2025-07-14T16:45:37+00:00
Updated by Upkeep Bot 6 months ago
- Released In set to v20.2.0~2488
- Upkeep Timestamp changed from 2025-07-14T16:45:37+00:00 to 2025-11-01T01:27:08+00:00