
cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory (updated) #63950

Closed
sajibreadd wants to merge 1 commit into ceph:main from sajibreadd:wip-cephfs-mirror-update

Conversation


@sajibreadd sajibreadd commented Jun 16, 2025

Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib mahamudur.sajib@croit.io

Concurrent Full Mirroring Design

Overview

The goal of this design is to improve the efficiency of directory mirroring at scale.
In the previous approach, concurrency was present but provided little benefit when:

  • A single large directory was being mirrored, or
  • The number of active directories being synced was small.

To solve this, the mirroring process is restructured around two dedicated task thread pools.


Thread Pools

1. FileSyncPool

  • Responsible for file transfers.
  • Files are leaf nodes of the directory tree → no dependencies.
  • Each file transfer request is pushed into this pool’s queue and executed asynchronously.
  • This achieves maximum concurrency for file transfers with no complexity.

2. DirScanPool

  • Responsible for directory scanning and creation, directory I/O, and file metadata I/O.

  • Directory creation has dependencies:

              (1)
             /   \
          (2)     (3)
         /   \   /   \
      (4)   (5) (6)   (7)
    

In this tree, directory (2) must exist before (4) and (5) can be created.

  • To respect ordering, tasks are defined as:

“Create a directory and push tasks for its sub-directories into the pool.”

  • Example:
  • Task 1 creates directory (1) and enqueues tasks for (2) and (3).
  • One thread takes task 2, creates directory (2), and enqueues tasks for (4) and (5).
  • Another thread takes task 3, creates directory (3), and enqueues tasks for (6) and (7).
  • This results in a concurrent breadth-first traversal of the tree.

Queue Size Constraint

  • In production, we manage hundreds of millions of directories (up to ~800M).
  • Storing every directory as a task in the queue is infeasible.
  • The queue size is therefore capped (e.g., 1e5 tasks).

Avoiding Deadlocks

A naïve wait–notify model can deadlock. Example:

  • Queue size = 1, worker count = 1.
  • While processing Task 1, the worker cannot enqueue Task 3 because the queue is full.
  • Task 1 never finishes → deadlock.

Solution:

  • If the queue is full, the worker does not enqueue sub-directory tasks; instead it processes them inline, depth-first.
  • During the inline DFS, the worker keeps checking for free queue slots and enqueues sub-directory tasks whenever one opens up, so I/O load stays evenly distributed across threads.
  • This ensures continuous progress and avoids deadlocks.

Benefits

  • Uniform Workload Distribution: Keeps all threads busy scanning directories.
  • Continuous Throughput: Ensures FileSyncPool is constantly fed with file-transfer tasks.
  • Scalability: Handles massive directory trees without blowing up memory.
  • Robustness: Prevents deadlocks under high load.

Summary

This design achieves true concurrent mirroring by separating file transfers and directory operations into dedicated pools, applying dependency-aware task definitions, limiting queue size, and adopting an inline execution fallback when the queue is full.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@sajibreadd sajibreadd changed the title cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory(updated) Jun 16, 2025
@sajibreadd sajibreadd requested a review from vshankar June 16, 2025 08:00
@sajibreadd sajibreadd force-pushed the wip-cephfs-mirror-update branch from 86722da to 293ca4e Compare June 16, 2025 08:01
@sajibreadd sajibreadd requested a review from joscollin June 16, 2025 08:17
@sajibreadd sajibreadd force-pushed the wip-cephfs-mirror-update branch from 293ca4e to 10a3b63 Compare June 16, 2025 08:41

@joscollin joscollin left a comment


@vshankar
Test failure: https://pulpito.ceph.com/jcollin-2025-06-17_02:42:47-fs:mirror-wip-jcollin-testing-con-160625-distro-default-smithi/

We had all green for blockdiff. So this failure is introduced by this PR.

- cephfs-mirror
min: 1
with_legacy: true
- name: cephfs_mirror_sync_latest_snapshot

@vshankar @sajibreadd
cephfs_mirror_sync_latest_snapshot was already implemented by #61929 three months ago, so it should be dropped from this PR.

@sajibreadd
Member Author

@joscollin According to the traceback it's failing here


2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_d61f5f0fd2af84e2eaa69ab6dc4e2bcff0ef8cbe/qa/tasks/cephfs/test_mirroring.py", line 1420, in test_cephfs_mirror_cancel_mirroring_and_readd
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:    self.peer_add(self.primary_fs_name, self.primary_fs_id, "client.mirror_remote@ceph", self.secondary_fs_name)
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_d61f5f0fd2af84e2eaa69ab6dc4e2bcff0ef8cbe/qa/tasks/cephfs/test_mirroring.py", line 113, in peer_add
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:    self.verify_peer_added(fs_name, fs_id, peer_spec, remote_fs_name)
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_d61f5f0fd2af84e2eaa69ab6dc4e2bcff0ef8cbe/qa/tasks/cephfs/test_mirroring.py", line 91, in verify_peer_added
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:    res = self.mirror_daemon_command(f'mirror status for fs: {fs_name}',
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_d61f5f0fd2af84e2eaa69ab6dc4e2bcff0ef8cbe/qa/tasks/cephfs/test_mirroring.py", line 296, in mirror_daemon_command
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:    p = self.mount_a.client_remote.run(args=
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_434ff78d8bc6a539432ba7c1c9e64d92eb7bb71e/teuthology/orchestra/remote.py", line 535, in run
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2025-06-17T03:14:48.242 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_434ff78d8bc6a539432ba7c1c9e64d92eb7bb71e/teuthology/orchestra/run.py", line 461, in run
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:    r.wait()
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_434ff78d8bc6a539432ba7c1c9e64d92eb7bb71e/teuthology/orchestra/run.py", line 161, in wait
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_434ff78d8bc6a539432ba7c1c9e64d92eb7bb71e/teuthology/orchestra/run.py", line 181, in _raise_for_status
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:    raise CommandFailedError(
2025-06-17T03:14:48.243 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed (mirror status for fs: cephfs) on smithi195 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@14'

It fails on this line, which means the status command itself is failing:

self.peer_add(self.primary_fs_name, self.primary_fs_id, "client.mirror_remote@ceph", self.secondary_fs_name)

Before this line there was

self.add_directory(self.primary_fs_name, self.primary_fs_id, '/d0')
self.add_directory(self.primary_fs_name, self.primary_fs_id, '/d1')
self.add_directory(self.primary_fs_name, self.primary_fs_id, '/d2')

These also internally call the status command, but they didn't fail. Could it be that adding the peer crashes the daemon itself?

@sajibreadd
Member Author

sajibreadd commented Jun 17, 2025

Which also internally calls the status command, but didn't fail. Can it be possible adding that peer crashes the daemon itself?
Yes, it was crashing. Found the problem.

if (g_ceph_context->_conf->cephfs_mirror_sync_latest_snapshot) {
    auto it = local_snap_map.rbegin(); // invalid iterator when the map is empty
    if (it->first != last_snap_id) { // crash here
        ...
    }
}

@joscollin Since cephfs_mirror_sync_latest_snapshot is true by default, this comparison dereferences an invalid iterator and segfaults whenever the directory has no snapshots. In any case, you told me to remove this part of the code, so removing it will fix the error.

@sajibreadd sajibreadd force-pushed the wip-cephfs-mirror-update branch from 10a3b63 to 340f138 Compare June 17, 2025 13:53
@sajibreadd
Member Author

@joscollin the error should be fixed now.

@sajibreadd sajibreadd force-pushed the wip-cephfs-mirror-update branch from 340f138 to 970e64f Compare June 17, 2025 13:59
 1. Concurrent file syncing using a user-defined thread count.
 2. Concurrent directory scanning using a user-defined thread count.
 3. When the remote is inconsistent (a previous sync attempt failed), we
    traverse the unchanged part between the previous snapshot and the
    current snapshot using the snapdiff strategy (as it is intact on the
    remote as well). For the changed part we use the remote state as the
    diff base, which saves us from syncing already transferred entries.

Some minor improvements:
 1. Saved stat reading and updating using a change mask: a bitmask that
    records all the stat changes and the state of the entry, so that only
    the changed stats are updated.
 2. When we know an entry is new (or the previous snapshot has a different
    type), we simply create the entries under its subtree without consulting
    the remote or the previous snapshot. More specifically, we avoid
    unnecessary stat calls against the base state (previous snapshot or
    remote). Since the failed-attempt case is already handled by the major
    improvements above, this strategy works in all cases.
 3. In propagate_deleted_entries, we made stat calls to find deleted
    entries, including entries that exist in both the current and the
    previous (or remote-state) snapshot. I now save the change mask in a
    map so we don't need to stat those entries again. Due to obvious memory
    limitations we cannot keep all entries, so the map is capped at 1e5
    entries; at any moment each thread can hold a map with at most 1e5
    entries in the worst case.
 4. Added some extra stats.
Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
@sajibreadd sajibreadd force-pushed the wip-cephfs-mirror-update branch from 970e64f to 07e0495 Compare June 18, 2025 07:41
@joscollin
Member

@sajibreadd
Member Author

Failed again: https://pulpito.ceph.com/jcollin-2025-06-18_06:10:05-fs:mirror-wip-jcollin-testing-con-180625-distro-default-smithi/

Okay, the old one passed, but this failure is in test_cephfs_mirror_incremental_sync_with_type_mixup. Checking.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 19, 2025
@github-actions github-actions bot removed the stale label Sep 9, 2025
@thomascazampoure

Hi, just checking on the status of this PR. Any update on its progress?

@vshankar
Contributor

Hi, just checking on the status of this PR. Any update on its progress?

Hey @thomascazampoure - it is a largish feature and the team hasn't yet gotten to reviewing it in depth. We are planning to get this included in the next (Umbrella) release.

@vshankar
Contributor

@sajibreadd I created a feature tracker to use asynchronous IO interface in libcephfs in the mirror daemon. See: https://tracker.ceph.com/issues/73577

The interfaces are not stable yet, but the plan is to stabilise them for the Umbrella release. Please let me know your thoughts on this. cc @thomascazampoure

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Dec 20, 2025
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jan 19, 2026
@sajibreadd sajibreadd deleted the wip-cephfs-mirror-update branch March 16, 2026 08:23