tools/cephfs_mirror: Multi-threaded Mirroring #66572
Conversation
|
It's just a draft PR and there is a lot of cleanup and error handling still required.
So even though the entry operations happen in parallel for different mirrored directories, the data sync happens one after the other. |
Config Diff Tool Output
+ added: cephfs_mirror_max_concurrent_data_sync_threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of directory snapshots that can be synchronized concurrently by cephfs-mirror daemon. Controls the number of synchronization threads. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of directory snapshots that can be crawled concurrently by cephfs-mirror daemon. Controls the number of synchronization crawler threads. Note that the crawler threads also does entry operations like directory creations, file deletes and snapshot deletes/renames. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of concurrent snapshot synchronization threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of concurrent snapshot synchronization crawler threads (cephfs-mirror.yaml.in)
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary. |
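For reference, a new entry in cephfs-mirror.yaml.in would follow the existing Ceph option schema. The sketch below is illustrative only: the type, level, default, and description are guesses, not the PR's actual values.

```yaml
- name: cephfs_mirror_max_concurrent_data_sync_threads
  type: uint
  level: advanced
  desc: maximum number of data sync threads used per snapshot synchronization
  default: 3
  services:
  - cephfs-mirror
```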
cd4979a to 34a23d7
|
@vshankar This is ready for review and testing. PTAL |
I haven't done a line-by-line review of this change yet, but overall it seems to do the correct thing and the data structures are well laid out for easy review. I like the approach and implementation. Good work @kotreshhr
Additionally, there are some things we can do in this change to further enhance the user experience (when reporting metrics in the future), and generally as suggestions for this PR:
- Do not use blockdiff when the file size is less than 8/16 MB, since the blockdiff call itself could be more expensive than the full data transfer. We should make this configurable by having a minimum file size below which the mirror daemon would not use blockdiff.
- Add some negative tests to verify that the edge cases are appropriately handled
- Run this change through fs:mirror suite to see if some tests fail (now that lab is up and running). I expect some test failures, especially those which rely on timing.
- Add documentation for the introduced config
- Add to PendingReleaseNotes
I will do a detailed review when the above points are incorporated.
5b4ae86 to becd6e8
|
Rebased and added a few metrics to the existing asok command |
becd6e8 to 3b924a6
|
/config check ok |
6425ffe to b7fd4bd
Done, block diff is used only if the file size is greater than 16 MiB
The existing test cases do cover the blocklist/cancel/shutdown edge cases
Ran once. I have scheduled one more.
Done
Done
|
|
jenkins test make check |
|
jenkins test make check arm64 |
b7fd4bd to 03a52ff
|
Rebased and fixed flake8 errors causing |
03a52ff to 4ea9928
…wait
1. Add an is_stopping() predicate at the sdq_cv wait
2. Use the existing should_backoff() routine to validate shutdown/blocklist/cancel errors and set the corresponding errors
3. Handle the notify logic at the end
4. In shutdown(), notify all syncm's sdq_cv waits
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Introduce a new configuration option, 'cephfs_mirror_blockdiff_min_file_size', to control the minimum file size above which block-level diff is used during CephFS mirroring. Files smaller than the configured threshold are synchronized using a full file copy, while larger files attempt a block-level delta sync. This provides better flexibility across environments with varying file size distributions and performance constraints. The default value is set to 16_M (16 MiB). The value is read once at the beginning of every snapshot sync.
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
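The size cutoff the commit describes can be sketched as follows. This is a Python model for illustration only, not the actual C++ code; `choose_sync_strategy` and its return values are hypothetical names, and the exact behavior at the boundary (file size equal to the threshold) is an assumption.

```python
# Files below the configured threshold are fully copied; larger files
# attempt a block-level delta sync, since blockdiff setup costs can
# outweigh the savings for small files.
DEFAULT_BLOCKDIFF_MIN_FILE_SIZE = 16 * 1024 * 1024  # 16 MiB, read once per snapshot sync

def choose_sync_strategy(file_size, min_size=DEFAULT_BLOCKDIFF_MIN_FILE_SIZE):
    """Return 'full-copy' for small files, 'blockdiff' for large ones."""
    if file_size < min_size:
        return "full-copy"   # blockdiff call would cost more than the transfer
    return "blockdiff"       # delta sync pays off for large files
```

Reading the value once per snapshot sync (rather than per file) keeps the strategy consistent for the whole snapshot even if the config changes mid-sync.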
Do a remote fs sync once just before taking the snapshot, as it's faster than doing fsync on each fd after file copy. Moreover, all the datasync threads use the same single libcephfs connection, and doing ceph_fsync concurrently on different fds on a single libcephfs connection could cause a hang, as observed in testing below. This issue is tracked at https://tracker.ceph.com/issues/75070
-----
Thread 2 (Thread 0xffff644cc400 (LWP 74020) "d_replayer-0"):
0 0x0000ffff8e82656c in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
1 0x0000ffff8e828ff0 [PAC] in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libc.so.6
2 0x0000ffff8fc90fd4 [PAC] in ceph::condition_variable_debug::wait ...
3 0x0000ffff9080fc9c in ceph::condition_variable_debug::wait<Client::wait_on_context_list ...
4 Client::wait_on_context_list ... at /lsandbox/upstream/ceph/src/client/Client.cc:4540
5 0x0000ffff9083fae8 in Client::_fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13299
6 0x0000ffff90840278 in Client::_fsync ...
7 0x0000ffff90840514 in Client::fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13042
8 0x0000ffff907f06e0 in ceph_fsync ... at /lsandbox/upstream/ceph/src/libcephfs.cc:316
9 0x0000aaaaad5b2f88 in cephfs::mirror::PeerReplayer::copy_to_remote ...
----
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
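The change above can be modeled as follows; this Python stand-in is for illustration only, using `os.sync()` as a rough analogue of a filesystem-wide sync (the PR itself operates over libcephfs, not the local filesystem), and `copy_files_then_sync` is a hypothetical name.

```python
import os

def copy_files_then_sync(payloads, dst_dir):
    """Write all files WITHOUT per-fd fsync, then flush everything once.

    One filesystem-wide sync just before the snapshot replaces N fsync
    calls, avoiding concurrent fsync on one shared connection.
    """
    paths = []
    for name, data in payloads.items():
        path = os.path.join(dst_dir, name)
        with open(path, "wb") as f:
            f.write(data)          # note: no os.fsync(f.fileno()) here
        paths.append(path)
    os.sync()                      # single flush before taking the snapshot
    return paths
```

The design point is that durability is only needed at the snapshot boundary, so batching the flush there is both faster and avoids the concurrent-fsync hang seen in the traceback.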
Following configs are introduced: - cephfs_mirror_max_datasync_threads - cephfs_mirror_blockdiff_min_file_size Fixes: https://tracker.ceph.com/issues/73452 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Also mentions blockdiff, which is no longer used for small files, and the new configuration introduced.
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Fixes: https://tracker.ceph.com/issues/73452 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Lock order 1: InstanceWatcher::m_lock ----> FSMirror::m_lock
Lock order 2: FSMirror::m_lock ----> InstanceWatcher::m_lock
Lock order 1 is where it's aborted, and it happens during blocklisting. InstanceWatcher::handle_rewatch_complete() acquires InstanceWatcher::m_lock and calls m_elistener.set_blocklisted_ts(), which tries to acquire FSMirror::m_lock.
Lock order 2 exists in the mirror peer status command. FSMirror::mirror_status(Formatter *f) takes FSMirror::m_lock and calls is_blocklisted(), which takes InstanceWatcher::m_lock.
Fix: FSMirror::m_blocklisted_ts and FSMirror::m_failed_ts are converted to std::atomic, and the scope of m_lock in InstanceWatcher::handle_rewatch_complete() and MirrorWatcher::handle_rewatch_complete() is also fixed.
Look at the tracker for the traceback and further details.
Fixes: https://tracker.ceph.com/issues/74953
Signed-off-by: Kotresh HR <khiremat@redhat.com>
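A minimal model of the fix, in Python for illustration: the timestamps become lock-free values (plain attributes under the GIL here, analogous to std::atomic in the C++ code), so setting them no longer requires FSMirror's lock and the A→B / B→A lock cycle disappears. Class and method names mirror the commit message but the bodies are a sketch, not the actual implementation.

```python
import threading
import time

class FSMirror:
    def __init__(self):
        self._lock = threading.Lock()
        # Plain attribute reads/writes are atomic under the GIL,
        # analogous to converting the C++ members to std::atomic.
        self.blocklisted_ts = 0.0
        self.failed_ts = 0.0

    def set_blocklisted_ts(self):
        # Called while InstanceWatcher's lock is held; since no FSMirror
        # lock is taken here, lock order 1 no longer exists.
        self.blocklisted_ts = time.time()

    def mirror_status(self, is_blocklisted):
        # is_blocklisted stands in for the call that takes
        # InstanceWatcher::m_lock (lock order 2, now the only order).
        with self._lock:
            return {"blocklisted": is_blocklisted(),
                    "blocklisted_ts": self.blocklisted_ts}
```

With only one lock-acquisition order left (FSMirror → InstanceWatcher), the deadlock precondition of two conflicting orders is gone.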
7d765e0 to 4efa9e5
|
There was a bug and hence the |
|
Two new runs are scheduled - |
The results look good, except for the known failure https://tracker.ceph.com/issues/74984. Please note that this failure can be seen in two test cases, as they do similar stuff.
Tests other than the above ran fine.
|
…h.com/issues/74984) To be tested with ceph#66572 Signed-off-by: Venky Shankar <vshankar@redhat.com>
|
This PR is under test in https://tracker.ceph.com/issues/75130. |
* refs/pull/66572/head:
|
Most of the mirroring tests ran as expected. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/ However, one job additionally fails test_mirroring_init_failure_with_recovery. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/69012/ |
|
Seems like the mirror daemon did not recover from the initialisation failure that the test induced. @kotreshhr, could you RCA please? |
|
Once the above failure is RCA'd, this change is good to be merged. |
As per the teuthology logs, the peer status command failed with 22 (EINVAL), so the admin socket command was not initiated yet. Let's look at the mirror daemon logs. @vshankar, it's pretty obvious the admin socket command initiation failed, but I am not able to relate this to this PR. |
Yeah, very likely unrelated. I just wanted to be sure before merging. I will create a redmine tracker. I think there is something buggy with the way the mirror daemon initializes the admin socket. Needs a closer inspection. |
vshankar left a comment
fs:mirror and mirror-ha runs fine. Failures are unrelated or known issues.
|
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker Update Log: https://github.com/ceph/ceph/actions/runs/22396729770 |
Adds a pool of datasync threads which concurrently sync the files.
The existing thread crawls and adds files to the queue, which the
datasync threads sync concurrently. The existing thread also takes
care of mkdir, deletes, and snap renames and deletes.
Fixes: https://tracker.ceph.com/issues/73452
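The crawler-plus-worker-pool architecture described in this commit can be sketched as a standard producer/consumer queue. This Python model is illustrative only: `run_sync` and `sync_file` are hypothetical names, and the `None` sentinel stands in loosely for the shutdown notification on sdq_cv described earlier.

```python
import queue
import threading

def run_sync(files, num_datasync_threads, sync_file):
    """One 'crawler' enqueues file-sync work; a pool of datasync
    threads drains the queue and syncs file data concurrently."""
    work = queue.Queue()
    done = []   # list.append is thread-safe under the GIL

    def datasync_worker():
        while True:
            item = work.get()
            if item is None:        # shutdown sentinel: wake and exit
                break
            done.append(sync_file(item))

    workers = [threading.Thread(target=datasync_worker)
               for _ in range(num_datasync_threads)]
    for w in workers:
        w.start()
    for f in files:                 # the crawler: enqueue data-sync work
        work.put(f)                 # (it also handles entry ops inline)
    for _ in workers:               # one sentinel per worker on shutdown
        work.put(None)
    for w in workers:
        w.join()
    return done
```

Keeping entry operations (mkdir, deletes, snap renames) on the single crawler thread preserves their ordering, while the order-insensitive bulk data copies are what get parallelized.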