Project

General

Profile

Actions

Bug #68187

open

cephfs_mirror: test_mirroring_init_failure_with_recovery failed to get state: "failed"

Added by Jos Collin over 1 year ago. Updated about 2 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

100%

Source:
Community (dev)
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror
Labels (FS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

This happens only in fs:valgrind tests:

https://pulpito.ceph.com/vshankar-2024-09-18_11:34:04-fs-wip-vshankar-testing-20240917.043942-debug-testing-default-smithi/7911148

2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:======================================================================
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_66579315f64c180ea2f8cffbe69b7e2cb8c46860/qa/tasks/cephfs/test_mirroring.py", line 905, in test_mirroring_init_failure_with_recovery
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:    while proceed():
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_3752d3834a7b6cd13dc17dfaa6c2fd3658f3439a/teuthology/contextutil.py", line 134, in __call__
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds

Related issues 1 (0 open1 closed)

Related to CephFS - Bug #70515: squid: qa: test_mirroring_init_failure_with_recovery: `teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds`DuplicateJos Collin

Actions
Actions #1

Updated by Jos Collin over 1 year ago

  • Status changed from New to In Progress
Actions #2

Updated by Jos Collin over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 60052
Actions #6

Updated by Jos Collin over 1 year ago

  • Description updated (diff)
Actions #8

Updated by Jos Collin 12 months ago

  • Related to Bug #70515: squid: qa: test_mirroring_init_failure_with_recovery: `teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds` added
Actions #9

Updated by Jos Collin 8 months ago

- https://pulpito.ceph.com/vshankar-2025-07-30_12:00:00-fs-wip-vshankar-testing-20250730.064735-debug-testing-default-smithi/8416877/

The mirror daemon seems to have identified the failure (missing cephfs_mirror objects):

2025-07-30T18:06:44.303+0000 13579640 20 cephfs::mirror::Mirror update_fs_mirrors
2025-07-30T18:06:44.303+0000 13579640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:06:44.303+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:06:44.304+0000 13579640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=64, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:06:44.304+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=64, fs_name=cephfs}

However, the asok command fails with:

2025-07-30T18:06:40.355 DEBUG:teuthology.orchestra.run.smithi191:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64
2025-07-30T18:06:40.460 INFO:teuthology.orchestra.run.smithi191.stderr:no valid command found; 10 closest matches:
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:0
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:1
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:2
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:abort
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:assert
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config diff
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config diff get <var>
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config get <var>
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config help [<var>]
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config set <var> <val>...
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:admin_socket: invalid command
2025-07-30T18:06:40.462 DEBUG:teuthology.orchestra.run:got remote process result: 22
2025-07-30T18:06:40.462 WARNING:tasks.cephfs.test_mirroring:mirror daemon command with label "mirror status for fs: cephfs" failed: Command failed (mirror status for fs: cephfs) on smithi191 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64'
Actions #10

Updated by Jos Collin 8 months ago

@Venky Shankar

From the logs, the mirror daemon shows restarting.

So we shouldn't expect this command to run successfully and print failed status ('state' == 'failed') during the restarting.

mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64

Instead, we should catch "invalid command" and continue checking if it would show the failed status, when the restart finishes.

Actions #11

Updated by Jos Collin 8 months ago ยท Edited

The mirror daemon didn't restart it seems.

The client log shows incomplete logs. At the end of ceph-client.mirror.48688.log:

2025-07-30T18:11:04.680+0000 13579640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:11:04.681+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 13579640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=64, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:11:04.681+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=64, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 13579640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0xe1c1430) after 2 seconds
2025-07-30T18:11:04.681+0000 12d78640 20 cephfs::mirror::FSMirror ~FSMirror
2025-07-30T18:11:04.681+0000 12d78640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 12d78640 10 cephfs::mirror::ServiceDaemon: 0x14e0e840 add_or_update_fs_attribute: fscid=62
2025-07-30T18:11:04.682+0000 12d78640 20 cephfs::mirror::FSMirror init
2025-07-30T18:11:04.682+0000 12d78640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2025-07-30T18:11:05.139+0000 12d78640 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.191
2025-07-30T18:11:06.681+0000 13579640 20 cephfs::mirror::Mirror update_fs_mirrors

After the mirror daemon failure, it shows:

2025-07-30T18:06:01.674+0000 12d78640 -1 asok(0x8ebbe10) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2025-07-30T18:06:01.826+0000 12d78640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2025-07-30T18:06:01.896+0000 12d78640 -1 cephfs::mirror::FSMirror init: error accessing local pool (id=124): (2) No such file or directory
2025-07-30T18:06:01.898+0000 12d78640 20 cephfs::mirror::Mirror handle_enable_mirroring: filesystem={fscid=62, fs_name=cephfs}, peers=, r=-2
2025-07-30T18:06:01.898+0000 12d78640 -1 cephfs::mirror::Mirror handle_enable_mirroring: failed to initialize FSMirror for filesystem={fscid=62, fs_name=cephfs}: (2) No such file or directory
2025-07-30T18:06:01.898+0000 12d78640 10 cephfs::mirror::ServiceDaemon: 0x14e0e840 add_or_update_fs_attribute: fscid=62

Mirror daemon restart failure:

ceph-client.mirror.48688.log:2025-07-30T18:05:37.851+0000 12d78640 -1 cephfs::mirror::Mirror handle_enable_mirroring: failed to initialize FSMirror for filesystem={fscid=62, fs_name=cephfs}: (2) No such file or directory

Actions #12

Updated by Jos Collin 8 months ago

@Venky Shankar

Today, I've tried this in vstart cluster, where mirroring was enabled:

$ ./bin/ceph mgr module disable mirroring
$ ./bin/ceph fs mirror enable a

Wait for sometime and this doesn't display the "state:" in the following command output, which explains why the qa test failed.

jcollin@telecast:~/workspace/ceph/build$ ./bin/ceph --admin-daemon asok/client.admin.33771.asok fs mirror status a@1
{
    "rados_inst": "127.0.0.1:0/102182576",
    "peers": {
        "6557125c-fa18-4e33-aab4-be16d80321bb": {
            "remote": {
                "client_name": "client.mirror_remote",
                "cluster_name": "ceph",
                "fs_name": "remotefs" 
            }
        }
    },
    "snap_dirs": {
        "dir_count": 1
    }
}
Actions #13

Updated by Jos Collin 8 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID changed from 60052 to 64846
Actions #14

Updated by Jos Collin about 2 months ago

  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF