
tools/cephfs_mirror: Multi-threaded Mirroring#66572

Merged
vshankar merged 31 commits into ceph:main from kotreshhr:mirror-multithreaded
Feb 25, 2026

Conversation

@kotreshhr
Copy link
Contributor

@kotreshhr kotreshhr commented Dec 9, 2025

Adds a pool of datasync threads that sync files concurrently.
The existing thread crawls and adds files to the queue, which the
datasync threads drain concurrently. The existing thread also takes
care of mkdir, file deletes, and snapshot renames and deletes.

Fixes: https://tracker.ceph.com/issues/73452
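The crawler/datasync split described above can be sketched as a simple producer/consumer queue. This is an illustrative sketch only; `DataSyncQueue` and `demo_sync` are hypothetical names, not the PR's actual classes.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: one crawler thread enqueues file paths while a
// pool of "datasync" worker threads drains the queue concurrently.
class DataSyncQueue {
public:
  void push(std::string path) {
    {
      std::lock_guard<std::mutex> l(m_lock);
      m_queue.push(std::move(path));
    }
    m_cv.notify_one();
  }

  // Called by the crawler once it has finished enumerating the snapshot.
  void close() {
    {
      std::lock_guard<std::mutex> l(m_lock);
      m_closed = true;
    }
    m_cv.notify_all();
  }

  // Returns false when the queue is closed and fully drained.
  bool pop(std::string &out) {
    std::unique_lock<std::mutex> l(m_lock);
    m_cv.wait(l, [this] { return m_closed || !m_queue.empty(); });
    if (m_queue.empty())
      return false;
    out = std::move(m_queue.front());
    m_queue.pop();
    return true;
  }

private:
  std::mutex m_lock;
  std::condition_variable m_cv;
  std::queue<std::string> m_queue;
  bool m_closed = false;
};

// Play the crawler role for nfiles entries and drain with nworkers
// threads; returns the number of files the workers "synced" (counted).
size_t demo_sync(size_t nfiles, int nworkers) {
  DataSyncQueue q;
  std::atomic<size_t> synced{0};
  std::vector<std::thread> workers;
  for (int i = 0; i < nworkers; ++i) {
    workers.emplace_back([&q, &synced] {
      std::string path;
      while (q.pop(path))
        synced.fetch_add(1);  // real code would copy file data here
    });
  }
  for (size_t i = 0; i < nfiles; ++i)  // crawler role
    q.push("file-" + std::to_string(i));
  q.close();
  for (auto &t : workers)
    t.join();
  return synced.load();
}
```

The `close()`/`pop()` pair lets the workers exit cleanly once the crawl is done and the queue is drained.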

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added the cephfs Ceph File System label Dec 9, 2025
@kotreshhr kotreshhr requested a review from vshankar December 9, 2025 12:51
@kotreshhr
Copy link
Contributor Author

kotreshhr commented Dec 9, 2025

It's just a draft PR; a lot of cleanup and error handling is still required.
This approach has a couple of downsides that need to be addressed.

  1. The existing (crawler) thread also takes care of entry operations (deletes, mkdir, etc.). Since a snapshot has to be taken at the remote site at the end of each snapshot sync from the primary, it can't proceed with crawling the next snapshot. A separate lightweight crawler probably needs to be introduced.
  2. The existing (crawler) threads each pick up one mirror directory. So multiple mirror directories sync in parallel, but each directory is single threaded. The datasync threads added in this PR are long running, similar to the existing threads, and pick up jobs from the SyncMechanism's data queue. Since each SyncMechanism represents a single snapshot being synced, the pool of datasync threads syncs one snapshot at a time. Even when some of the threads are free (finished syncing), they wait for the pending threads to finish in order to synchronize on taking a consistent snapshot.

So even though the entry operations happen in parallel for different mirrored directories, the data sync happens one after the other.
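The "wait for pending threads before snapshotting" point above is essentially a completion barrier. A minimal sketch, with hypothetical names (`SnapshotGate`, `demo_gate` are not the PR's classes):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hedged sketch of the synchronization point: the crawler must not take
// the remote snapshot until every datasync job for the current snapshot
// has finished.
class SnapshotGate {
public:
  void job_started() {
    std::lock_guard<std::mutex> l(m_lock);
    ++m_inflight;
  }
  void job_finished() {
    std::lock_guard<std::mutex> l(m_lock);
    if (--m_inflight == 0)
      m_cv.notify_all();
  }
  // Crawler blocks here; only after this returns is it safe to snapshot.
  void wait_all_done() {
    std::unique_lock<std::mutex> l(m_lock);
    m_cv.wait(l, [this] { return m_inflight == 0; });
  }

private:
  std::mutex m_lock;
  std::condition_variable m_cv;
  int m_inflight = 0;
};

// Spawn njobs "datasync" jobs and verify the gate only opens after all
// of them have completed.
bool demo_gate(int njobs) {
  SnapshotGate gate;
  std::atomic<int> done{0};
  std::vector<std::thread> ts;
  for (int i = 0; i < njobs; ++i) {
    gate.job_started();  // register before the worker can finish
    ts.emplace_back([&] {
      done.fetch_add(1);
      gate.job_finished();
    });
  }
  gate.wait_all_done();             // crawler would snapshot here
  bool ok = (done.load() == njobs); // all jobs observed complete
  for (auto &t : ts)
    t.join();
  return ok;
}
```

Registering the job before spawning the worker avoids the race where the gate opens before a late-starting job is counted.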

@github-actions
Copy link

github-actions bot commented Dec 19, 2025

Config Diff Tool Output

+ added: cephfs_mirror_max_concurrent_data_sync_threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of directory snapshots that can be synchronized concurrently by cephfs-mirror daemon. Controls the number of synchronization threads. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of directory snapshots that can be crawled concurrently by cephfs-mirror daemon. Controls the number of synchronization crawler threads. Note that the crawler threads also do entry operations like directory creations, file deletes and snapshot deletes/renames. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of concurrent snapshot synchronization threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of concurrent snapshot synchronization crawler threads (cephfs-mirror.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch 4 times, most recently from cd4979a to 34a23d7 Compare January 14, 2026 20:30
@kotreshhr kotreshhr marked this pull request as ready for review January 14, 2026 20:30
@kotreshhr
Copy link
Contributor Author

@vshankar This is ready for review and testing. PTAL

@vshankar vshankar requested a review from a team January 27, 2026 09:10
Copy link
Contributor

@vshankar vshankar left a comment


I haven't done a line-by-line review of this change yet, but overall it seems to do the correct thing and the data structures are well laid out for easy review. I like the approach and implementation. Good work @kotreshhr

Additionally there are some things we can do in the change to further enhance user experience (when reporting metrics in the future) and generally as suggestions for this PR:

  • Not use blockdiff when the file size is less than 8/16MB since the blockdiff call itself could be more expensive than the full data transfer. We should make this configurable by having a minimum file size below which the mirror daemon would not use blockdiff.
  • Add some negative tests to verify that the edge cases are appropriately handled
  • Run this change through fs:mirror suite to see if some tests fail (now that lab is up and running). I expect some test failures, especially those which rely on timing.
  • Add documentation for the introduced config
  • Add to PendingReleaseNotes

I will do a detailed review when the above points are incorporated.
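The first suggestion above is a simple size-threshold policy. A minimal sketch; the 16 MiB default matches what the PR ended up adopting, but the helper and constant names here are hypothetical:

```cpp
#include <cstdint>

// Assumed default mirroring the PR's eventual
// cephfs_mirror_blockdiff_min_file_size value (16 MiB).
constexpr uint64_t kBlockdiffMinFileSize = 16ULL * 1024 * 1024;

// Below the threshold a full copy is cheaper than computing a block
// diff, so block-level delta sync is used only for larger files.
bool use_blockdiff(uint64_t file_size,
                   uint64_t min_size = kBlockdiffMinFileSize) {
  return file_size > min_size;
}
```

Making `min_size` a parameter mirrors the suggestion to expose the threshold as a daemon config rather than hard-coding it.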

@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch from 5b4ae86 to becd6e8 Compare February 4, 2026 08:54
@kotreshhr
Copy link
Contributor Author

Rebased and added a few metrics to the existing asok command

@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch from becd6e8 to 3b924a6 Compare February 4, 2026 11:17
@kotreshhr kotreshhr requested a review from a team as a code owner February 4, 2026 11:17
@kotreshhr
Copy link
Contributor Author

/config check ok

@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch from 6425ffe to b7fd4bd Compare February 15, 2026 20:06
@kotreshhr
Copy link
Contributor Author

I haven't done a line-by-line review of this change yet, but overall it seems to do the correct thing and the data structures are well laid out for easy review. I like the approach and implementation. Good work @kotreshhr

Additionally there are some things we can do in the change to further enhance user experience (when reporting metrics in the future) and generally as suggestions for this PR:

  • Not use blockdiff when the file size is less than 8/16MB since the blockdiff call itself could be more expensive than the full data transfer. We should make this configurable by having a minimum file size below which the mirror daemon would not use blockdiff.

Done; blockdiff is used only if the file size is greater than 16 MiB.

  • Add some negative tests to verify that the edge cases are appropriately handled

The existing tests do cover the blocklist/cancel/shutdown cases.

  • Run this change through fs:mirror suite to see if some tests fail (now that lab is up and running). I expect some test failures, especially those which rely on timing.

Ran once. I have scheduled one more.

  • Add documentation for the introduced config

Done

  • Add to PendingReleaseNotes

Done

I will do a detailed review when the above points are incorporated.

@kotreshhr
Copy link
Contributor Author

jenkins test make check

@kotreshhr
Copy link
Contributor Author

jenkins test make check arm64

@kotreshhr
Copy link
Contributor Author

kotreshhr commented Feb 17, 2026

Rebased and fixed the flake8 errors causing the make check failure, introduced by #67305. Not sure why that PR didn't catch them.

@vshankar
Copy link
Contributor

Rebased and fixed the flake8 errors causing the make check failure, introduced by #67305. Not sure why that PR didn't catch them.

Could you send 03a52ff as a separate change so that I can merge it sooner?

@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch from 03a52ff to 4ea9928 Compare February 17, 2026 07:57
…wait

1. Add is_stopping() predicate at sdq_cv wait
2. Use the existing should_backoff() routine to validate
   shutdown/blocklist/cancel errors and set corresponding errors.
3. Handle notify logic at the end
4. In shutdown(), notify all syncm's sdq_cv wait

Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
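Point 1 and point 4 of the commit above go together: the wait's predicate must also check the stopping flag, and shutdown() must notify the condition variable so a blocked worker wakes promptly. A minimal sketch; `sdq_cv` follows the commit's naming, but the surrounding `Worker` struct is illustrative:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative sketch of an is_stopping() predicate at the sdq_cv wait.
struct Worker {
  std::mutex lock;
  std::condition_variable sdq_cv;
  std::queue<int> sdq;               // stand-in for the sync data queue
  std::atomic<bool> stopping{false};

  bool is_stopping() const { return stopping.load(); }

  // Returns false when asked to stop, true when a job was taken.
  bool wait_for_job(int &job) {
    std::unique_lock<std::mutex> l(lock);
    sdq_cv.wait(l, [this] { return is_stopping() || !sdq.empty(); });
    if (is_stopping())
      return false;
    job = sdq.front();
    sdq.pop();
    return true;
  }

  // Flip the flag under the lock so a waiter can't miss the wakeup,
  // then notify every thread blocked on sdq_cv.
  void shutdown() {
    {
      std::lock_guard<std::mutex> l(lock);
      stopping.store(true);
    }
    sdq_cv.notify_all();
  }
};

// A worker blocked on an empty queue returns promptly once shutdown()
// is called.
bool demo_shutdown_unblocks() {
  Worker w;
  bool returned_stop = false;
  std::thread t([&] {
    int job;
    returned_stop = !w.wait_for_job(job);  // blocks until shutdown()
  });
  w.shutdown();
  t.join();
  return returned_stop;
}
```

Setting the flag while holding the mutex is what prevents the lost-wakeup race between the predicate check and the actual wait.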
Introduce a new configuration option,
'cephfs_mirror_blockdiff_min_file_size', to control the minimum file
size above which block-level diff is used during CephFS mirroring.

Files smaller than the configured threshold are synchronized using
full file copy, while larger files attempt block-level delta sync.
This provides better flexibility across environments with varying
file size distributions and performance constraints.

The default value is set to 16_M (16 MiB). The value is read once
at the beginning of every snapshot sync.

Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Do a remote fs sync once just before taking the snapshot,
as it's faster than doing fsync on each fd after
file copy.

Moreover, all the datasync threads use the same single libcephfs
connection, and doing ceph_fsync concurrently on different fds over
a single libcephfs connection could cause a hang, as observed in
testing (see below). This issue is tracked at
https://tracker.ceph.com/issues/75070

-----
Thread 2 (Thread 0xffff644cc400 (LWP 74020) "d_replayer-0"):
0  0x0000ffff8e82656c in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
1  0x0000ffff8e828ff0 [PAC] in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libc.so.6
2  0x0000ffff8fc90fd4 [PAC] in ceph::condition_variable_debug::wait ...
3  0x0000ffff9080fc9c in ceph::condition_variable_debug::wait<Client::wait_on_context_list ...
4  Client::wait_on_context_list ... at /lsandbox/upstream/ceph/src/client/Client.cc:4540
5  0x0000ffff9083fae8 in Client::_fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13299
6  0x0000ffff90840278 in Client::_fsync ...
7  0x0000ffff90840514 in Client::fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13042
8  0x0000ffff907f06e0 in ceph_fsync ... at /lsandbox/upstream/ceph/src/libcephfs.cc:316
9  0x0000aaaaad5b2f88 in cephfs::mirror::PeerReplayer::copy_to_remote ...
----

Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
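The cost difference the commit describes can be sketched with a mock "remote" that counts flush calls. In the real daemon the per-file path issues ceph_fsync() per fd and the new path does one filesystem-wide sync before snapshotting; the mock below is purely illustrative.

```cpp
// Mock remote endpoint: counts how many flush operations each
// strategy issues. No real I/O is performed.
struct MockRemote {
  int fsync_calls = 0;
  int syncfs_calls = 0;
  void copy_file() { /* data transfer elided */ }
  void fsync_fd() { ++fsync_calls; }   // stands in for ceph_fsync()
  void sync_fs() { ++syncfs_calls; }   // stands in for an fs-wide sync
};

// Old approach: fsync every fd right after copying it.
void sync_per_file(MockRemote &r, int nfiles) {
  for (int i = 0; i < nfiles; ++i) {
    r.copy_file();
    r.fsync_fd();
  }
}

// New approach: copy everything, then one fs-wide sync just before
// taking the remote snapshot.
void sync_once(MockRemote &r, int nfiles) {
  for (int i = 0; i < nfiles; ++i)
    r.copy_file();
  r.sync_fs();
}

int flushes_per_file(int nfiles) {
  MockRemote r;
  sync_per_file(r, nfiles);
  return r.fsync_calls;
}

int flushes_once(int nfiles) {
  MockRemote r;
  sync_once(r, nfiles);
  return r.syncfs_calls;
}
```

Besides being one call instead of N, the single fs-wide sync also sidesteps the concurrent-ceph_fsync hang tracked in https://tracker.ceph.com/issues/75070.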
The following configs are introduced:
  - cephfs_mirror_max_datasync_threads
  - cephfs_mirror_blockdiff_min_file_size

Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Also mentions that blockdiff is no longer used
for small files and documents the new configuration
introduced.

Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Lock order 1:
InstanceWatcher::m_lock ----> FSMirror::m_lock
Lock order 2:
FSMirror::m_lock -----> InstanceWatcher::m_lock

Lock order 1 is the one that aborts, and it happens
during blocklisting. InstanceWatcher::handle_rewatch_complete()
acquires InstanceWatcher::m_lock and calls
m_elistener.set_blocklisted_ts(), which tries to acquire
FSMirror::m_lock.

Lock order 2 exists in the mirror peer status command.
FSMirror::mirror_status(Formatter *f) takes FSMirror::m_lock
and calls is_blocklisted(), which takes InstanceWatcher::m_lock.

Fix:
FSMirror::m_blocklisted_ts and FSMirror::m_failed_ts are converted
to std::atomic, and the scope of m_lock is fixed in
InstanceWatcher::handle_rewatch_complete() and
MirrorWatcher::handle_rewatch_complete()

Look at the tracker for traceback and further details.

Fixes: https://tracker.ceph.com/issues/74953
Signed-off-by: Kotresh HR <khiremat@redhat.com>
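The fix pattern is to make the shared timestamp atomic so the watcher callback can publish it without ever needing FSMirror::m_lock, breaking the A->B vs B->A ordering cycle. A simplified sketch; the member name follows the commit, but the surrounding struct is not the real class:

```cpp
#include <atomic>
#include <ctime>
#include <thread>

// Simplified sketch: a timestamp formerly guarded by FSMirror::m_lock
// becomes atomic, so neither side needs the other's mutex.
struct FSMirrorTs {
  std::atomic<time_t> m_blocklisted_ts{0};

  // Watcher callback path (holds only the watcher's own lock):
  // publishes the timestamp without touching FSMirror::m_lock.
  void set_blocklisted_ts(time_t ts) {
    m_blocklisted_ts.store(ts, std::memory_order_relaxed);
  }

  // Status command path: reads lock-free, so it never needs to take
  // the watcher's lock while holding FSMirror::m_lock.
  time_t get_blocklisted_ts() const {
    return m_blocklisted_ts.load(std::memory_order_relaxed);
  }
};

// A writer thread publishes the timestamp; after join, the reader
// observes it without any mutex being involved.
bool demo_atomic_ts() {
  FSMirrorTs m;
  std::thread writer([&] { m.set_blocklisted_ts(1234); });
  writer.join();  // join orders the store before the load below
  return m.get_blocklisted_ts() == 1234;
}
```

With no mutex on either access, the two call paths can no longer form an acquisition cycle, which is exactly why the deadlock disappears.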
@kotreshhr kotreshhr force-pushed the mirror-multithreaded branch from 7d765e0 to 4efa9e5 Compare February 22, 2026 20:31
@kotreshhr
Copy link
Contributor Author

There was a bug causing the test failure in test_cephfs_mirror_sync_with_purged_snapshot; it's fixed now.
Also renamed the max datasync threads config from cephfs_mirror_max_concurrent_data_sync_threads to cephfs_mirror_max_datasync_threads.

@kotreshhr
Copy link
Contributor Author

Two new runs are scheduled -

  1. https://pulpito.ceph.com/khiremat-2026-02-23_03:24:28-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/

The results look good, except for the known failure https://tracker.ceph.com/issues/74984. Note that this failure can be seen in two test cases, as they do similar things:

  1. tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_incremental_sync
  2. tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_sync_with_purged_snapshot

Tests other than the above ran fine.

  1. All tests ran fine, except that tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_incremental_sync was skipped.
$ pwd
/teuthology/khiremat-2026-02-23_03:24:28-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/64727

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner  |wc -l
32

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner
2026-02-23T03:33:50.213 INFO:tasks.cephfs_test_runner:test_add_ancestor_and_child_directory (tasks.cephfs.test_mirroring.TestMirroring.test_add_ancestor_and_child_directory) ... ok
2026-02-23T03:34:57.539 INFO:tasks.cephfs_test_runner:test_add_directory_path_normalization (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_path_normalization) ... ok
2026-02-23T03:35:28.973 INFO:tasks.cephfs_test_runner:test_add_directory_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_with_mirroring_disabled) ... ok
2026-02-23T03:36:25.299 INFO:tasks.cephfs_test_runner:test_add_relative_directory_path (tasks.cephfs.test_mirroring.TestMirroring.test_add_relative_directory_path) ... ok
2026-02-23T03:37:28.917 INFO:tasks.cephfs_test_runner:test_basic_mirror_commands (tasks.cephfs.test_mirroring.TestMirroring.test_basic_mirror_commands) ... ok
2026-02-23T03:40:17.256 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_blocklist) ... ok
2026-02-23T03:43:59.137 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths ... ok
2026-02-23T03:50:38.242 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_cancel_sync) ... ok
2026-02-23T03:53:24.705 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_failed_sync_with_correction (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_failed_sync_with_correction) ... ok
2026-02-23T03:56:07.255 INFO:tasks.cephfs_test_runner:Test incremental snapshot synchronization with file type changes. ... ok
2026-02-23T03:57:24.988 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_peer_add_primary (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_peer_add_primary) ... ok
2026-02-23T03:58:51.697 INFO:tasks.cephfs_test_runner:Test importing peer bootstrap token ... ok
2026-02-23T04:01:57.187 INFO:tasks.cephfs_test_runner:That making changes to the remote .snap directory shows 'peer status' state: "failed" ... ok
2026-02-23T04:05:04.742 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_remove_on_stall (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_remove_on_stall) ... ok
2026-02-23T04:09:18.818 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_restart_sync_on_blocklist) ... ok
2026-02-23T04:12:38.159 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_service_daemon_status) ... ok
2026-02-23T04:15:13.276 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_stats (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_stats) ... ok
2026-02-23T04:16:57.177 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_symlink_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_symlink_sync) ... ok
2026-02-23T04:18:16.659 INFO:tasks.cephfs_test_runner:That mirroring syncs the already existing snapshot correctly. ... ok
2026-02-23T04:21:04.260 INFO:tasks.cephfs_test_runner:Test snapshot synchronization in midst of snapshot deletes. ... ok
2026-02-23T04:22:56.688 INFO:tasks.cephfs_test_runner:Test snapshot synchronization with parent directory snapshots ... ok
2026-02-23T04:24:45.250 INFO:tasks.cephfs_test_runner:test_directory_command_ls (tasks.cephfs.test_mirroring.TestMirroring.test_directory_command_ls) ... ok
2026-02-23T04:26:02.583 INFO:tasks.cephfs_test_runner:test_directory_commands (tasks.cephfs.test_mirroring.TestMirroring.test_directory_commands) ... ok
2026-02-23T04:26:41.921 INFO:tasks.cephfs_test_runner:That get/set ceph.mirror.dirty_snap_id attribute succeeds in a remote filesystem. ... ok
2026-02-23T04:28:00.294 INFO:tasks.cephfs_test_runner:test_local_and_remote_dir_root_mode (tasks.cephfs.test_mirroring.TestMirroring.test_local_and_remote_dir_root_mode) ... ok
2026-02-23T04:28:56.640 INFO:tasks.cephfs_test_runner:test_matching_peer (tasks.cephfs.test_mirroring.TestMirroring.test_matching_peer) ... ok
2026-02-23T04:30:04.128 INFO:tasks.cephfs_test_runner:test_mirror_disable_with_peer (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_disable_with_peer) ... ok
2026-02-23T04:31:33.484 INFO:tasks.cephfs_test_runner:test_mirror_peer_add_existing (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_add_existing) ... ok
2026-02-23T04:32:59.092 INFO:tasks.cephfs_test_runner:test_mirror_peer_commands (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_commands) ... ok
2026-02-23T04:33:37.210 INFO:tasks.cephfs_test_runner:Test mirror daemon init failure ... ok
2026-02-23T04:34:15.177 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ok
2026-02-23T04:34:49.720 INFO:tasks.cephfs_test_runner:test_peer_commands_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_peer_commands_with_mirroring_disabled) ... ok
  2. This is the job where tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_sync_with_purged_snapshot failed (it wasn't skipped), hence not all tests ran.
$ pwd
/teuthology/khiremat-2026-02-23_03:24:28-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/64729

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner  |wc -l
19

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner
2026-02-23T03:39:29.806 INFO:tasks.cephfs_test_runner:test_add_ancestor_and_child_directory (tasks.cephfs.test_mirroring.TestMirroring.test_add_ancestor_and_child_directory) ... ok
2026-02-23T03:40:38.614 INFO:tasks.cephfs_test_runner:test_add_directory_path_normalization (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_path_normalization) ... ok
2026-02-23T03:41:10.490 INFO:tasks.cephfs_test_runner:test_add_directory_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_with_mirroring_disabled) ... ok
2026-02-23T03:42:07.220 INFO:tasks.cephfs_test_runner:test_add_relative_directory_path (tasks.cephfs.test_mirroring.TestMirroring.test_add_relative_directory_path) ... ok
2026-02-23T03:43:12.192 INFO:tasks.cephfs_test_runner:test_basic_mirror_commands (tasks.cephfs.test_mirroring.TestMirroring.test_basic_mirror_commands) ... ok
2026-02-23T03:46:09.417 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_blocklist) ... ok
2026-02-23T03:50:05.537 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths ... ok
2026-02-23T03:55:41.598 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_cancel_sync) ... ok
2026-02-23T03:58:29.605 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_failed_sync_with_correction (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_failed_sync_with_correction) ... ok
2026-02-23T04:00:58.354 INFO:tasks.cephfs_test_runner:Test incremental snapshot synchronization with file type changes. ... ok
2026-02-23T04:02:06.204 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_peer_add_primary (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_peer_add_primary) ... ok
2026-02-23T04:03:34.343 INFO:tasks.cephfs_test_runner:Test importing peer bootstrap token ... ok
2026-02-23T04:06:46.345 INFO:tasks.cephfs_test_runner:That making changes to the remote .snap directory shows 'peer status' state: "failed" ... ok
2026-02-23T04:09:55.414 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_remove_on_stall (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_remove_on_stall) ... ok
2026-02-23T04:16:17.796 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_restart_sync_on_blocklist) ... ok
2026-02-23T04:19:39.479 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_service_daemon_status) ... ok
2026-02-23T04:22:53.057 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_stats (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_stats) ... ok
2026-02-23T04:24:33.398 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_symlink_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_symlink_sync) ... ok
2026-02-23T04:26:12.535 INFO:tasks.cephfs_test_runner:That mirroring syncs the already existing snapshot correctly. ... ok
  2. https://pulpito.ceph.com/khiremat-2026-02-23_03:25:11-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/
$ pwd
/teuthology/khiremat-2026-02-23_03:25:11-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/64732

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner  |wc -l
32

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner
2026-02-23T03:34:33.170 INFO:tasks.cephfs_test_runner:test_add_ancestor_and_child_directory (tasks.cephfs.test_mirroring.TestMirroring.test_add_ancestor_and_child_directory) ... ok
2026-02-23T03:35:41.473 INFO:tasks.cephfs_test_runner:test_add_directory_path_normalization (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_path_normalization) ... ok
2026-02-23T03:36:13.038 INFO:tasks.cephfs_test_runner:test_add_directory_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_with_mirroring_disabled) ... ok
2026-02-23T03:37:16.534 INFO:tasks.cephfs_test_runner:test_add_relative_directory_path (tasks.cephfs.test_mirroring.TestMirroring.test_add_relative_directory_path) ... ok
2026-02-23T03:38:12.662 INFO:tasks.cephfs_test_runner:test_basic_mirror_commands (tasks.cephfs.test_mirroring.TestMirroring.test_basic_mirror_commands) ... ok
2026-02-23T03:41:00.277 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_blocklist) ... ok
2026-02-23T03:44:34.919 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths ... ok
2026-02-23T03:51:15.704 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_cancel_sync) ... ok
2026-02-23T03:54:10.430 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_failed_sync_with_correction (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_failed_sync_with_correction) ... ok
2026-02-23T03:56:50.111 INFO:tasks.cephfs_test_runner:Test incremental snapshot synchronization with file type changes. ... ok
2026-02-23T03:57:57.630 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_peer_add_primary (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_peer_add_primary) ... ok
2026-02-23T03:59:16.953 INFO:tasks.cephfs_test_runner:Test importing peer bootstrap token ... ok
2026-02-23T04:02:21.570 INFO:tasks.cephfs_test_runner:That making changes to the remote .snap directory shows 'peer status' state: "failed" ... ok
2026-02-23T04:05:28.927 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_remove_on_stall (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_remove_on_stall) ... ok
2026-02-23T04:09:58.600 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_restart_sync_on_blocklist) ... ok
2026-02-23T04:13:22.583 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_service_daemon_status) ... ok
2026-02-23T04:15:47.701 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_stats (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_stats) ... ok
2026-02-23T04:17:45.850 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_symlink_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_symlink_sync) ... ok
2026-02-23T04:19:05.603 INFO:tasks.cephfs_test_runner:That mirroring syncs the already existing snapshot correctly. ... ok
2026-02-23T04:21:46.810 INFO:tasks.cephfs_test_runner:Test snapshot synchronization in midst of snapshot deletes. ... ok
2026-02-23T04:23:39.187 INFO:tasks.cephfs_test_runner:Test snapshot synchronization with parent directory snapshots ... ok
2026-02-23T04:25:27.667 INFO:tasks.cephfs_test_runner:test_directory_command_ls (tasks.cephfs.test_mirroring.TestMirroring.test_directory_command_ls) ... ok
2026-02-23T04:26:52.722 INFO:tasks.cephfs_test_runner:test_directory_commands (tasks.cephfs.test_mirroring.TestMirroring.test_directory_commands) ... ok
2026-02-23T04:27:42.448 INFO:tasks.cephfs_test_runner:That get/set ceph.mirror.dirty_snap_id attribute succeeds in a remote filesystem. ... ok
2026-02-23T04:28:59.996 INFO:tasks.cephfs_test_runner:test_local_and_remote_dir_root_mode (tasks.cephfs.test_mirroring.TestMirroring.test_local_and_remote_dir_root_mode) ... ok
2026-02-23T04:30:03.673 INFO:tasks.cephfs_test_runner:test_matching_peer (tasks.cephfs.test_mirroring.TestMirroring.test_matching_peer) ... ok
2026-02-23T04:31:18.404 INFO:tasks.cephfs_test_runner:test_mirror_disable_with_peer (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_disable_with_peer) ... ok
2026-02-23T04:32:57.897 INFO:tasks.cephfs_test_runner:test_mirror_peer_add_existing (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_add_existing) ... ok
2026-02-23T04:34:16.187 INFO:tasks.cephfs_test_runner:test_mirror_peer_commands (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_commands) ... ok
2026-02-23T04:35:01.501 INFO:tasks.cephfs_test_runner:Test mirror daemon init failure ... ok
2026-02-23T04:35:39.574 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ok
2026-02-23T04:36:13.883 INFO:tasks.cephfs_test_runner:test_peer_commands_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_peer_commands_with_mirroring_disabled) ... ok
$ pwd
/teuthology/khiremat-2026-02-23_03:25:11-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/64734
$ grep "\.\.\. ok$" teuthology.log  | grep test_runner  |wc -l
32

$ grep "\.\.\. ok$" teuthology.log  | grep test_runner
2026-02-23T03:39:45.545 INFO:tasks.cephfs_test_runner:test_add_ancestor_and_child_directory (tasks.cephfs.test_mirroring.TestMirroring.test_add_ancestor_and_child_directory) ... ok
2026-02-23T03:40:54.338 INFO:tasks.cephfs_test_runner:test_add_directory_path_normalization (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_path_normalization) ... ok
2026-02-23T03:41:26.191 INFO:tasks.cephfs_test_runner:test_add_directory_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_add_directory_with_mirroring_disabled) ... ok
2026-02-23T03:42:22.845 INFO:tasks.cephfs_test_runner:test_add_relative_directory_path (tasks.cephfs.test_mirroring.TestMirroring.test_add_relative_directory_path) ... ok
2026-02-23T03:43:19.723 INFO:tasks.cephfs_test_runner:test_basic_mirror_commands (tasks.cephfs.test_mirroring.TestMirroring.test_basic_mirror_commands) ... ok
2026-02-23T03:46:08.641 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_blocklist) ... ok
2026-02-23T03:49:56.015 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths ... ok
2026-02-23T03:55:32.990 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_cancel_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_cancel_sync) ... ok
2026-02-23T03:58:14.989 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_failed_sync_with_correction (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_failed_sync_with_correction) ... ok
2026-02-23T04:00:48.077 INFO:tasks.cephfs_test_runner:Test incremental snapshot synchronization with file type changes. ... ok
2026-02-23T04:01:55.933 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_peer_add_primary (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_peer_add_primary) ... ok
2026-02-23T04:03:24.092 INFO:tasks.cephfs_test_runner:Test importing peer bootstrap token ... ok
2026-02-23T04:06:39.114 INFO:tasks.cephfs_test_runner:That making changes to the remote .snap directory shows 'peer status' state: "failed" ... ok
2026-02-23T04:09:47.948 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_remove_on_stall (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_remove_on_stall) ... ok
2026-02-23T04:16:17.483 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_restart_sync_on_blocklist) ... ok
2026-02-23T04:19:44.061 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_service_daemon_status) ... ok
2026-02-23T04:23:03.422 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_stats (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_stats) ... ok
2026-02-23T04:24:43.735 INFO:tasks.cephfs_test_runner:test_cephfs_mirror_symlink_sync (tasks.cephfs.test_mirroring.TestMirroring.test_cephfs_mirror_symlink_sync) ... ok
2026-02-23T04:26:27.863 INFO:tasks.cephfs_test_runner:That mirroring syncs the already existing snapshot correctly. ... ok
2026-02-23T04:28:49.854 INFO:tasks.cephfs_test_runner:Test snapshot synchronization in midst of snapshot deletes. ... ok
2026-02-23T04:30:51.760 INFO:tasks.cephfs_test_runner:Test snapshot synchronization with parent directory snapshots ... ok
2026-02-23T04:32:41.063 INFO:tasks.cephfs_test_runner:test_directory_command_ls (tasks.cephfs.test_mirroring.TestMirroring.test_directory_command_ls) ... ok
2026-02-23T04:34:00.123 INFO:tasks.cephfs_test_runner:test_directory_commands (tasks.cephfs.test_mirroring.TestMirroring.test_directory_commands) ... ok
2026-02-23T04:34:40.178 INFO:tasks.cephfs_test_runner:That get/set ceph.mirror.dirty_snap_id attribute succeeds in a remote filesystem. ... ok
2026-02-23T04:36:03.592 INFO:tasks.cephfs_test_runner:test_local_and_remote_dir_root_mode (tasks.cephfs.test_mirroring.TestMirroring.test_local_and_remote_dir_root_mode) ... ok
2026-02-23T04:37:01.451 INFO:tasks.cephfs_test_runner:test_matching_peer (tasks.cephfs.test_mirroring.TestMirroring.test_matching_peer) ... ok
2026-02-23T04:38:09.551 INFO:tasks.cephfs_test_runner:test_mirror_disable_with_peer (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_disable_with_peer) ... ok
2026-02-23T04:39:39.327 INFO:tasks.cephfs_test_runner:test_mirror_peer_add_existing (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_add_existing) ... ok
2026-02-23T04:40:58.195 INFO:tasks.cephfs_test_runner:test_mirror_peer_commands (tasks.cephfs.test_mirroring.TestMirroring.test_mirror_peer_commands) ... ok
2026-02-23T04:41:36.922 INFO:tasks.cephfs_test_runner:Test mirror daemon init failure ... ok
2026-02-23T04:42:30.935 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ok
2026-02-23T04:43:20.121 INFO:tasks.cephfs_test_runner:test_peer_commands_with_mirroring_disabled (tasks.cephfs.test_mirroring.TestMirroring.test_peer_commands_with_mirroring_disabled) ... ok
khiremat@trial200:/teuthology/khiremat-2026-02-23_03:25:11-fs-wip-khiremat-mulithread-mirror-66572-reviewed-5-fix-purge-snap-distro-default-trial/64734$

vshankar added a commit to vshankar/ceph that referenced this pull request Feb 24, 2026
…h.com/issues/74984)

To be tested with ceph#66572

Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/75130.

vshankar added a commit to vshankar/ceph that referenced this pull request Feb 25, 2026
@vshankar
Contributor

Most of the mirroring test ran as expected. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/

However, there is one job with an additional test failure. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/69012/

test_mirroring_init_failure_with_recovery fails in that job.

@vshankar
Contributor

It seems the mirror daemon did not recover from the initialisation failure that the test induced.

2026-02-25T09:28:20.235 INFO:tasks.cephfs_test_runner:======================================================================
2026-02-25T09:28:20.235 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring.test_mirroring_init_failure_with_recovery)
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_vshankar_ceph_7c9c9f709b02b16b3e2005e450ecf519f2bb54de/qa/tasks/cephfs/test_mirroring.py", line 959, in test_mirroring_init_failure_with_recovery
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:    while proceed():
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:          ^^^^^^^^^
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_fa17720d0088c3ac28e473468bfc79eeaff5cd38/teuthology/contextutil.py", line 134, in __call__
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2026-02-25T09:28:20.236 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds
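The MaxWhileTries above comes from teuthology's bounded-retry helper (contextutil): the test polls a condition a fixed number of times and raises once the budget is exhausted. A minimal sketch of that pattern (hypothetical names, not teuthology's actual implementation) is:

```python
import time

class MaxTriesExceeded(Exception):
    """Raised when the condition never became true within the retry budget."""

def wait_until(condition, tries=20, sleep=5, label="condition"):
    # Poll `condition` up to `tries` times, sleeping `sleep` seconds between
    # attempts; 20 tries x 5s matches the "maximum tries (20) after waiting
    # for 100 seconds" in the traceback above.
    for attempt in range(tries):
        if condition():
            return attempt + 1  # number of attempts it took
        time.sleep(sleep)
    raise MaxTriesExceeded(f"'{label}' reached maximum tries ({tries}) "
                           f"after waiting for {tries * sleep} seconds")
```

Here the test's condition ("wait for failed state") never became true, so the helper gave up after its 20 attempts.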

@kotreshhr could you RCA please?

@vshankar
Contributor

Once the above failure is RCA'd, this change is good to be merged.

@kotreshhr
Contributor Author

Once the above failure is RCA'd, this change is good to be merged.

As per the teuthology logs, the peer status command failed with 22, so the admin socket command was not registered yet.

2026-02-25T09:26:26.464 DEBUG:teuthology.orchestra.run.trial076:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph mgr module enable mirroring
2026-02-25T09:26:26.555 INFO:tasks.ceph.mds.b.trial076.stderr:2026-02-25T09:26:26.555+0000 7fdc4c5f1640 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2026-02-25T09:26:27.391 DEBUG:teuthology.orchestra.run.trial076:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph mgr module disable mirroring
2026-02-25T09:26:28.358 INFO:tasks.ceph.mgr.x.trial076.stderr:/usr/lib64/python3.9/site-packages/scipy/__init__.py:73: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-in
terpreters. This will likely work for most users but might cause hard to track down issues or subtle bugs. A common user of the rare sub-interpreter feature is wsgi which also allows single-interpreter mode.
2026-02-25T09:26:28.359 INFO:tasks.ceph.mgr.x.trial076.stderr:Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
2026-02-25T09:26:28.359 INFO:tasks.ceph.mgr.x.trial076.stderr:  from numpy import show_config as show_numpy_config
2026-02-25T09:26:28.390 DEBUG:teuthology.orchestra.run.trial076:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs mirror enable cephfs
2026-02-25T09:26:29.395 INFO:teuthology.orchestra.run:Running command with timeout 30
2026-02-25T09:26:29.396 DEBUG:teuthology.orchestra.run.trial076:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:no valid command found; 10 closest matches:
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:0
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:1
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:2
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:abort
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:assert
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:config diff
2026-02-25T09:26:29.503 INFO:teuthology.orchestra.run.trial076.stderr:config diff get <var>
2026-02-25T09:26:29.504 INFO:teuthology.orchestra.run.trial076.stderr:config get <var>
2026-02-25T09:26:29.504 INFO:teuthology.orchestra.run.trial076.stderr:config help [<var>]
2026-02-25T09:26:29.504 INFO:teuthology.orchestra.run.trial076.stderr:config set <var> <val>...
2026-02-25T09:26:29.504 INFO:teuthology.orchestra.run.trial076.stderr:admin_socket: invalid command
2026-02-25T09:26:29.505 DEBUG:teuthology.orchestra.run:got remote process result: 22
2026-02-25T09:26:29.505 WARNING:tasks.cephfs.test_mirroring:mirror daemon command with label "mirror status for fs: cephfs" failed: Command failed (mirror status for fs: cephfs) on trial076 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64'

Let's look at the mirror daemon logs

2026-02-25T09:26:29.201+0000 ec04640 20 cephfs::mirror::Mirror update_fs_mirrors
2026-02-25T09:26:29.201+0000 ec04640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2026-02-25T09:26:29.201+0000 ec04640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2026-02-25T09:26:29.201+0000 ec04640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x85d8650) after 2 seconds
2026-02-25T09:26:29.201+0000 e403640 20 cephfs::mirror::FSMirror ~FSMirror
2026-02-25T09:26:29.201+0000 e403640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=62, fs_name=cephfs}
2026-02-25T09:26:29.202+0000 e403640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 add_or_update_fs_attribute: fscid=62
2026-02-25T09:26:29.202+0000 e403640 20 cephfs::mirror::FSMirror init
2026-02-25T09:26:29.202+0000 e403640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2026-02-25T09:26:29.347+0000 e403640 10 cephfs::mirror::Utils connect: using mon addr=10.20.193.76
2026-02-25T09:26:29.390+0000 1380d640 20 cephfs::mirror::ClusterWatcher handle_fsmap
2026-02-25T09:26:29.390+0000 1380d640  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[{fscid=64, fs_name=cephfs}], mirroring_disabled=[]
2026-02-25T09:26:29.390+0000 1380d640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 add_filesystem: fscid=64, fs_name=cephfs
2026-02-25T09:26:29.390+0000 1380d640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 schedule_update_status
2026-02-25T09:26:29.390+0000 1380d640 10 cephfs::mirror::Mirror mirroring_enabled: filesystem={fscid=64, fs_name=cephfs}, pool_id=128
2026-02-25T09:26:29.407+0000 e403640 -1 asok(0x11dd5c00) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2026-02-25T09:26:29.482+0000 e403640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2026-02-25T09:26:29.520+0000 e403640 -1 cephfs::mirror::FSMirror init: error accessing local pool (id=124): (2) No such file or directory
2026-02-25T09:26:29.522+0000 e403640 20 cephfs::mirror::Mirror handle_enable_mirroring: filesystem={fscid=62, fs_name=cephfs}, peers=, r=-2
2026-02-25T09:26:29.522+0000 e403640 -1 cephfs::mirror::Mirror handle_enable_mirroring: failed to initialize FSMirror for filesystem={fscid=62, fs_name=cephfs}: (2) No such file or directory
2026-02-25T09:26:29.522+0000 1380d640  5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2026-02-25T09:26:29.523+0000 e403640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 add_or_update_fs_attribute: fscid=62
2026-02-25T09:26:30.390+0000 1300c640 20 cephfs::mirror::ServiceDaemon: 0x11d11630 update_status: 1 filesystem(s)
2026-02-25T09:26:31.201+0000 ec04640 20 cephfs::mirror::Mirror update_fs_mirrors
2026-02-25T09:26:31.201+0000 ec04640  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2026-02-25T09:26:31.202+0000 ec04640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2026-02-25T09:26:31.202+0000 ec04640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=64, fs_name=cephfs}
2026-02-25T09:26:31.202+0000 ec04640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 add_or_update_fs_attribute: fscid=64
2026-02-25T09:26:31.202+0000 ec04640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 schedule_update_status
2026-02-25T09:26:31.202+0000 ec04640 20 cephfs::mirror::FSMirror init
2026-02-25T09:26:31.202+0000 ec04640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2026-02-25T09:26:31.352+0000 ec04640 10 cephfs::mirror::Utils connect: using mon addr=10.20.193.76
2026-02-25T09:26:31.410+0000 ec04640 -1 asok(0x8b1eb50) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2026-02-25T09:26:31.482+0000 ec04640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2026-02-25T09:26:31.493+0000 ec04640 20 cephfs::mirror::Utils mount: filesystem={fscid=64, fs_name=cephfs}
2026-02-25T09:26:31.673+0000 ec04640 10 cephfs::mirror::Utils mount: mounted filesystem={fscid=64, fs_name=cephfs}
2026-02-25T09:26:31.673+0000 ec04640 10 cephfs::mirror::FSMirror init: rados addrs=10.20.193.76:0/2116387652
2026-02-25T09:26:31.673+0000 ec04640 20 cephfs::mirror::FSMirror init_instance_watcher
2026-02-25T09:26:31.674+0000 ec04640 20 cephfs::mirror::InstanceWatcher init
2026-02-25T09:26:31.674+0000 ec04640 20 cephfs::mirror::InstanceWatcher create_instance
2026-02-25T09:26:31.674+0000 ec04640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x156bdf60) after 2 seconds
2026-02-25T09:26:31.674+0000 e403640 20 cephfs::mirror::FSMirror ~FSMirror
2026-02-25T09:26:31.674+0000 e403640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=62, fs_name=cephfs}
2026-02-25T09:26:31.674+0000 e403640 10 cephfs::mirror::ServiceDaemon: 0x11d11630 add_or_update_fs_attribute: fscid=62
2026-02-25T09:26:31.674+0000 e403640 20 cephfs::mirror::FSMirror init

@vshankar, it's pretty obvious that the admin socket command registration failed. But I am not able to relate this to this PR.
Maybe this exists without this change? Let's create a tracker and debug this separately?
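The `(17) File exists` asok bind failure in the daemon log suggests the socket path was already occupied; Ceph's admin socket appears to surface EEXIST rather than the raw kernel error when the path is taken. The underlying collision can be reproduced in a few lines (paths hypothetical, unrelated to the daemon's actual asok handling):

```python
import errno
import os
import socket
import tempfile

# Bind a UNIX domain socket, then try to bind a second socket to the same
# path: the kernel rejects it because the filesystem entry already exists.
path = os.path.join(tempfile.mkdtemp(), "demo.asok")  # hypothetical path

s1 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s1.bind(path)  # first bind creates the socket file

s2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s2.bind(path)  # second bind fails: the path is already taken
    bind_errno = None
except OSError as e:
    bind_errno = e.errno
finally:
    s1.close()
    s2.close()

assert bind_errno == errno.EADDRINUSE  # kernel-level "address already in use"
os.unlink(path)  # a daemon would unlink a stale path before re-binding
```

If a previous daemon instance left the socket file behind (or another instance is still holding it), every subsequent bind attempt fails until the path is cleaned up, which matches the repeated bind failures in the log.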

@vshankar
Contributor

vshankar commented Feb 25, 2026

Once the above failure is RCA'd, this change is good to be merged.

As per teuthology logs, the peer status command has failed with 22, so the admin socket command is not initiated yet.

[teuthology and mirror daemon logs quoted in the comment above]

@vshankar, it's pretty obvious the admin socket command initiation failed. But I am not able to relate this to this PR. Maybe this exists without this change? Let's create a tracker and debug this separately?

Yeh, very likely unrelated. I just wanted to be sure before merging.

I will create a redmine tracker. I think there is something buggy with the way the mirror daemon initializes the admin socket. It needs a closer inspection.

Contributor

@vshankar vshankar left a comment


@vshankar vshankar merged commit 81cb89b into ceph:main Feb 25, 2026
13 checks passed
@github-actions

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/22396729770


kshtsk pushed a commit to kshtsk/ceph that referenced this pull request Feb 25, 2026
…h.com/issues/74984)

To be tested with ceph#66572

Signed-off-by: Venky Shankar <vshankar@redhat.com>
kginonredhat pushed a commit to kginonredhat/ceph that referenced this pull request Mar 13, 2026
…h.com/issues/74984)

To be tested with ceph#66572

Signed-off-by: Venky Shankar <vshankar@redhat.com>
JonBailey1993 pushed a commit to JonBailey1993/ceph that referenced this pull request Mar 17, 2026
…h.com/issues/74984)

To be tested with ceph#66572

Signed-off-by: Venky Shankar <vshankar@redhat.com>
