tools/cephfs_mirror: Multi-threaded Mirroring #66572
Conversation
|
It's just a draft PR and there is a lot of cleanup and error handling still required.
So even though the entry operations happen in parallel for different mirrored directories, the data sync happens one after the other. |
Config Diff Tool Output
+ added: cephfs_mirror_max_concurrent_data_sync_threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of directory snapshots that can be synchronized concurrently by cephfs-mirror daemon. Controls the number of synchronization threads. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of directory snapshots that can be crawled concurrently by cephfs-mirror daemon. Controls the number of synchronization crawler threads. Note that the crawler threads also does entry operations like directory creations, file deletes and snapshot deletes/renames. (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: old: maximum number of concurrent snapshot synchronization threads (cephfs-mirror.yaml.in)
! changed: cephfs_mirror_max_concurrent_directory_syncs: new: maximum number of concurrent snapshot synchronization crawler threads (cephfs-mirror.yaml.in)
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary. |
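For reference, a new entry in cephfs-mirror.yaml.in would follow the existing Ceph option schema. The sketch below is illustrative only: the type, level, default, and description are guesses, not the PR's actual values.

```yaml
- name: cephfs_mirror_max_concurrent_data_sync_threads
  type: uint
  level: advanced
  desc: maximum number of data sync threads used per snapshot synchronization
  default: 3
  services:
  - cephfs-mirror
```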
cd4979a to 34a23d7
|
@vshankar This is ready for review and testing. PTAL |
I haven't done a line-by-line review of this change yet, but overall it seems to do the correct thing and the data structures are well laid out for easy review. I like the approach and implementation. Good work @kotreshhr
Additionally, there are some things we can do in this change to further enhance the user experience (when reporting metrics in the future), and generally as suggestions for this PR:
- Do not use blockdiff when the file size is less than 8/16 MB, since the blockdiff call itself could be more expensive than the full data transfer. We should make this configurable by having a minimum file size below which the mirror daemon would not use blockdiff.
- Add some negative tests to verify that the edge cases are appropriately handled
- Run this change through fs:mirror suite to see if some tests fail (now that lab is up and running). I expect some test failures, especially those which rely on timing.
- Add documentation for the introduced config
- Add to PendingReleaseNotes
I will do a detailed review when the above points are incorporated.
5b4ae86 to becd6e8
|
Rebased and added a few metrics to the existing asok command |
becd6e8 to 3b924a6
|
/config check ok |
6425ffe to b7fd4bd
Done, block diff is used only if the file size is greater than 16 MiB
The existing test cases do cover the blocklist/cancel/shutdown edge cases
Ran once. I have scheduled one more.
Done
Done
|
|
jenkins test make check |
|
jenkins test make check arm64 |
b7fd4bd to 03a52ff
|
Rebased and fixed flake8 errors causing |
03a52ff to 4ea9928
…wait
1. Add an is_stopping() predicate at the sdq_cv wait
2. Use the existing should_backoff() routine to validate shutdown/blocklist/cancel errors and set the corresponding errors
3. Handle the notify logic at the end
4. In shutdown(), notify all syncm's sdq_cv waits
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Introduce a new configuration option, 'cephfs_mirror_blockdiff_min_file_size', to control the minimum file size above which block-level diff is used during CephFS mirroring. Files smaller than the configured threshold are synchronized using a full file copy, while larger files attempt a block-level delta sync. This provides better flexibility across environments with varying file size distributions and performance constraints. The default value is set to 16_M (16 MiB). The value is read once at the beginning of every snapshot sync.
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
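The size cutoff the commit describes can be sketched as follows. This is a Python model for illustration only, not the actual C++ code; `choose_sync_strategy` and its return values are hypothetical names, and the exact behavior at the boundary (file size equal to the threshold) is an assumption.

```python
# Files below the configured threshold are fully copied; larger files
# attempt a block-level delta sync, since blockdiff setup costs can
# outweigh the savings for small files.
DEFAULT_BLOCKDIFF_MIN_FILE_SIZE = 16 * 1024 * 1024  # 16 MiB, read once per snapshot sync

def choose_sync_strategy(file_size, min_size=DEFAULT_BLOCKDIFF_MIN_FILE_SIZE):
    """Return 'full-copy' for small files, 'blockdiff' for large ones."""
    if file_size < min_size:
        return "full-copy"   # blockdiff call would cost more than the transfer
    return "blockdiff"       # delta sync pays off for large files
```

Reading the value once per snapshot sync (rather than per file) keeps the strategy consistent for the whole snapshot even if the config changes mid-sync.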
Do a remote fs sync once just before taking the snapshot, as it's faster than doing fsync on each fd after file copy. Moreover, all the datasync threads use the same single libcephfs connection, and doing ceph_fsync concurrently on different fds on a single libcephfs connection could cause a hang, as observed in testing below. This issue is tracked at https://tracker.ceph.com/issues/75070
-----
Thread 2 (Thread 0xffff644cc400 (LWP 74020) "d_replayer-0"):
0 0x0000ffff8e82656c in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
1 0x0000ffff8e828ff0 [PAC] in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libc.so.6
2 0x0000ffff8fc90fd4 [PAC] in ceph::condition_variable_debug::wait ...
3 0x0000ffff9080fc9c in ceph::condition_variable_debug::wait<Client::wait_on_context_list ...
4 Client::wait_on_context_list ... at /lsandbox/upstream/ceph/src/client/Client.cc:4540
5 0x0000ffff9083fae8 in Client::_fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13299
6 0x0000ffff90840278 in Client::_fsync ...
7 0x0000ffff90840514 in Client::fsync ... at /lsandbox/upstream/ceph/src/client/Client.cc:13042
8 0x0000ffff907f06e0 in ceph_fsync ... at /lsandbox/upstream/ceph/src/libcephfs.cc:316
9 0x0000aaaaad5b2f88 in cephfs::mirror::PeerReplayer::copy_to_remote ...
----
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
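The change above can be modeled as follows; this Python stand-in is for illustration only, using `os.sync()` as a rough analogue of a filesystem-wide sync (the PR itself operates over libcephfs, not the local filesystem), and `copy_files_then_sync` is a hypothetical name.

```python
import os

def copy_files_then_sync(payloads, dst_dir):
    """Write all files WITHOUT per-fd fsync, then flush everything once.

    One filesystem-wide sync just before the snapshot replaces N fsync
    calls, avoiding concurrent fsync on one shared connection.
    """
    paths = []
    for name, data in payloads.items():
        path = os.path.join(dst_dir, name)
        with open(path, "wb") as f:
            f.write(data)          # note: no os.fsync(f.fileno()) here
        paths.append(path)
    os.sync()                      # single flush before taking the snapshot
    return paths
```

The design point is that durability is only needed at the snapshot boundary, so batching the flush there is both faster and avoids the concurrent-fsync hang seen in the traceback.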
Following configs are introduced: - cephfs_mirror_max_datasync_threads - cephfs_mirror_blockdiff_min_file_size Fixes: https://tracker.ceph.com/issues/73452 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Also mentions blockdiff, which is no longer used for small files, and the new configuration introduced.
Fixes: https://tracker.ceph.com/issues/73452
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Fixes: https://tracker.ceph.com/issues/73452 Signed-off-by: Kotresh HR <khiremat@redhat.com>
Lock order 1: InstanceWatcher::m_lock ----> FSMirror::m_lock
Lock order 2: FSMirror::m_lock ----> InstanceWatcher::m_lock
Lock order 1 is where it's aborted, and it happens during blocklisting. InstanceWatcher::handle_rewatch_complete() acquires InstanceWatcher::m_lock and calls m_elistener.set_blocklisted_ts(), which tries to acquire FSMirror::m_lock.
Lock order 2 exists in the mirror peer status command. FSMirror::mirror_status(Formatter *f) takes FSMirror::m_lock and calls is_blocklisted(), which takes InstanceWatcher::m_lock.
Fix: FSMirror::m_blocklisted_ts and FSMirror::m_failed_ts are converted to std::atomic, and the scope of m_lock in InstanceWatcher::handle_rewatch_complete() and MirrorWatcher::handle_rewatch_complete() is also fixed.
Look at the tracker for the traceback and further details.
Fixes: https://tracker.ceph.com/issues/74953
Signed-off-by: Kotresh HR <khiremat@redhat.com>
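A minimal model of the fix, in Python for illustration: the timestamps become lock-free values (plain attributes under the GIL here, analogous to std::atomic in the C++ code), so setting them no longer requires FSMirror's lock and the A→B / B→A lock cycle disappears. Class and method names mirror the commit message but the bodies are a sketch, not the actual implementation.

```python
import threading
import time

class FSMirror:
    def __init__(self):
        self._lock = threading.Lock()
        # Plain attribute reads/writes are atomic under the GIL,
        # analogous to converting the C++ members to std::atomic.
        self.blocklisted_ts = 0.0
        self.failed_ts = 0.0

    def set_blocklisted_ts(self):
        # Called while InstanceWatcher's lock is held; since no FSMirror
        # lock is taken here, lock order 1 no longer exists.
        self.blocklisted_ts = time.time()

    def mirror_status(self, is_blocklisted):
        # is_blocklisted stands in for the call that takes
        # InstanceWatcher::m_lock (lock order 2, now the only order).
        with self._lock:
            return {"blocklisted": is_blocklisted(),
                    "blocklisted_ts": self.blocklisted_ts}
```

With only one lock-acquisition order left (FSMirror → InstanceWatcher), the deadlock precondition of two conflicting orders is gone.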
7d765e0 to 4efa9e5
|
There was a bug and hence the |
|
Two new runs are scheduled - |
The results look good, except for the known failure https://tracker.ceph.com/issues/74984. Please note that this failure can be seen in two test cases, as they do similar stuff.
Tests other than the above ran fine.
|
…h.com/issues/74984) To be tested with ceph#66572 Signed-off-by: Venky Shankar <vshankar@redhat.com>
|
This PR is under test in https://tracker.ceph.com/issues/75130. |
* refs/pull/66572/head:
|
Most of the mirroring tests ran as expected. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/ However, one job additionally fails test_mirroring_init_failure_with_recovery. See: https://pulpito.ceph.com/vshankar-2026-02-25_07:42:50-fs-wip-vshankar-testing-20260224.100235-testing-default-trial/69012/ |
|
Seems like the mirror daemon did not recover from the initialisation failure that the test induced. @kotreshhr, could you RCA please? |
|
Once the above failure is RCA'd, this change is good to be merged. |
As per the teuthology logs, the peer status command failed with 22 (EINVAL), so the admin socket command was not initiated yet. Let's look at the mirror daemon logs. @vshankar, it's pretty obvious the admin socket command initiation failed, but I am not able to relate this to this PR. |
Yeah, very likely unrelated. I just wanted to be sure before merging. I will create a redmine tracker. I think there is something buggy with the way the mirror daemon initializes the admin socket. Needs a closer inspection. |
vshankar left a comment
fs:mirror and mirror-ha runs fine. Failures are unrelated or known issues.
|
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker Update Log: https://github.com/ceph/ceph/actions/runs/22396729770 |
Adds a pool of datasync threads which concurrently sync the files.
The existing thread crawls and adds files to the queue, which the
datasync threads sync concurrently. The existing thread also takes
care of mkdir, deletes, and snap renames and deletes.
Fixes: https://tracker.ceph.com/issues/73452
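The crawler-plus-worker-pool architecture described in this commit can be sketched as a standard producer/consumer queue. This Python model is illustrative only: `run_sync` and `sync_file` are hypothetical names, and the `None` sentinel stands in loosely for the shutdown notification on sdq_cv described earlier.

```python
import queue
import threading

def run_sync(files, num_datasync_threads, sync_file):
    """One 'crawler' enqueues file-sync work; a pool of datasync
    threads drains the queue and syncs file data concurrently."""
    work = queue.Queue()
    done = []   # list.append is thread-safe under the GIL

    def datasync_worker():
        while True:
            item = work.get()
            if item is None:        # shutdown sentinel: wake and exit
                break
            done.append(sync_file(item))

    workers = [threading.Thread(target=datasync_worker)
               for _ in range(num_datasync_threads)]
    for w in workers:
        w.start()
    for f in files:                 # the crawler: enqueue data-sync work
        work.put(f)                 # (it also handles entry ops inline)
    for _ in workers:               # one sentinel per worker on shutdown
        work.put(None)
    for w in workers:
        w.join()
    return done
```

Keeping entry operations (mkdir, deletes, snap renames) on the single crawler thread preserves their ordering, while the order-insensitive bulk data copies are what get parallelized.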