
cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory #61245

Closed
sajibreadd wants to merge 1 commit into ceph:main from sajibreadd:wip-69190

Conversation

@sajibreadd
Member

@sajibreadd sajibreadd commented Jan 7, 2025

Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib mahamudur.sajib@croit.io

The idea here is to make the full mirroring concurrent. There is concurrency in the existing code, but it becomes meaningless when a single large directory is being mirrored or when the number of directories actively syncing is small. I tried a different approach:
I maintain 2 task thread pools, one for file transfers and one for other operations (directory operations / traversing the directory tree).

The file transfer thread pool is largely self-explanatory: whenever we need to transfer a file, we push that task into the pool's queue and the file is transferred asynchronously. Since files are just leaves of the tree, there is no dependency or complication in transferring them asynchronously.

For other operations, it is not as easy as file transfers. Here is an example:

        1
      /   \
     2     3
    / \   / \
   4   5 6   7

Let's assume all nodes in this tree are directories. Then (2) must be created before (4) and (5), so we cannot just throw a directory-creation task into a thread pool. The idea is to define the task carefully;
a task is defined as:
"create a directory and push tasks for creating its sub-directories into the thread pool."
For example, task 1 creates directory 1 and pushes tasks for 2 and 3 into the thread pool. This way we are effectively doing a BFS concurrently. One observation is about queue size: in our production we have 800 million directories, and it is not wise to keep 800 million tasks in the task queue, so the idea is to limit the queue to 1e5 tasks. The problem is that the normal wait/notify pattern will not work here. For example, suppose we have 1 thread and a queue size of 1. When that thread is executing task 1, it will never finish (it deadlocks), because there is not enough space to push the task for subtree (3). So the idea is that whenever a thread finds the queue full (which indicates all threads are busy), instead of pushing tasks for sub-directories it does them itself. This way we keep all threads more or less uniformly busy.
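The push-or-run-inline scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BoundedTaskPool`, `submit`, and the other names are invented for the example.

```cpp
// Sketch of a bounded task pool where a producer that finds the queue
// full executes the task itself instead of blocking. This is what
// prevents the deadlock described above: a task that needs to enqueue
// child tasks can always make progress.
#include <atomic>               // for the demo assertions
#include <cassert>              // for the demo assertions
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class BoundedTaskPool {
public:
  BoundedTaskPool(std::size_t nthreads, std::size_t queue_limit)
      : limit(queue_limit) {
    for (std::size_t i = 0; i < nthreads; ++i)
      workers.emplace_back([this] { worker_loop(); });
  }

  ~BoundedTaskPool() {
    {
      std::lock_guard<std::mutex> l(m);
      stopping = true;
    }
    cv.notify_all();
    for (auto &t : workers)
      t.join();                 // workers drain the queue before exiting
  }

  // Queue the task if there is room; otherwise run it in the caller's
  // thread (a full queue implies all workers are already busy).
  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(m);
      if (q.size() < limit) {
        q.push(std::move(task));
        cv.notify_one();
        return;
      }
    }
    task();                     // fallback: do the work inline
  }

private:
  void worker_loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return stopping || !q.empty(); });
        if (q.empty())
          return;               // only reached when stopping
        task = std::move(q.front());
        q.pop();
      }
      task();
    }
  }

  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  std::size_t limit;
  bool stopping = false;
  std::vector<std::thread> workers;
};
```

In this sketch a directory-creation task would call `submit()` once per sub-directory after creating its own directory, so the tree is walked BFS-style; the fallback `task()` in `submit()` is what keeps the bounded queue from ever deadlocking the BFS, regardless of thread count or queue limit.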

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs (Ceph File System) and common labels Jan 7, 2025
@sajibreadd sajibreadd force-pushed the wip-69190 branch 4 times, most recently from ca570e2 to 1f75983 on January 13, 2025 03:33
@vshankar vshankar requested a review from a team January 13, 2025 05:22
- cephfs-mirror
min: 0
max: 11
- name: cephfs_mirror_max_concurrent_file_transfer
Contributor

How about making this (number of threads per peer) dynamic, so that it can increase based on how busy the sync threads are, especially when large files are being transferred? There should obviously be a max cap for this. What do you think? @sajibreadd

Member Author

@sajibreadd sajibreadd Jan 13, 2025

Did you mean that whenever a new file transfer starts, it starts in a new thread, and if the number of threads already operating has reached the max cap then it should wait? @vshankar

Contributor

Right, something like a dynamic thread pool.

Member Author

@sajibreadd sajibreadd Feb 3, 2025

Hi Venky, I was thinking that when a user adds a directory for sync, they would have the option to specify the number of threads to sync that directory. That way we can sync directories independently, each with its own associated thread pool, instead of channeling all tasks into a single thread pool. The user would also be able to update it at runtime. Let me know your thoughts.

Contributor

That can be abused in many ways by the user. What can be done is to prioritize the directories whose snapshots are to be synced. That way, higher-priority directories will be given a relatively larger number of threads from the thread pool.

Contributor

@sajibreadd Try reading src/common/WorkQueue.h::ThreadPool. Basically it allows implementing a work queue shared among multiple threads. With the priority mechanism I mentioned above, it would allow a larger share of threads to operate on a directory that has a higher sync-priority.

@sajibreadd sajibreadd force-pushed the wip-69190 branch 2 times, most recently from cb70b8e to 9777553 on January 14, 2025 18:55
Member

@joscollin joscollin left a comment

Will check further in the following days.

Member

@joscollin joscollin left a comment

[1] https://pulpito.ceph.com/jcollin-2025-01-16_05:46:12-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/

[2] https://pulpito.ceph.com/jcollin-2025-01-16_08:04:03-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/

While testing these changes, I got 2 failures of incremental sync tests in 2 different runs, [1] and [2]. It looks like one of the incremental sync tests fails every time, and they are two different errors. Failure [1] seems to be a new one and Failure [2] is tracked here. Please check the logs and fix them.

@sajibreadd
Member Author

> [1] https://pulpito.ceph.com/jcollin-2025-01-16_05:46:12-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/
>
> [2] https://pulpito.ceph.com/jcollin-2025-01-16_08:04:03-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/
>
> While testing these changes, I got 2 failures of incremental sync tests in 2 different runs, [1] and [2]. It looks like one of the incremental sync tests fails every time, and they are two different errors. Failure [1] seems to be a new one and Failure [2] is tracked here. Please check the logs and fix them.

@joscollin You mean my PR is failing those tests? Or the old cephfs-mirror incremental sync?

@joscollin
Member

joscollin commented Jan 16, 2025

> [1] https://pulpito.ceph.com/jcollin-2025-01-16_05:46:12-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/
> [2] https://pulpito.ceph.com/jcollin-2025-01-16_08:04:03-fs:mirror-wip-jcollin-testing-1501-distro-default-smithi/
> While testing these changes, I got 2 failures of incremental sync tests in 2 different runs, [1] and [2]. It looks like one of the incremental sync tests fails every time, and they are two different errors. Failure [1] seems to be a new one and Failure [2] is tracked here. Please check the logs and fix them.

> @joscollin You mean my PR is failing those tests? Or the old cephfs-mirror incremental sync?

The changes in this PR are failing those tests. You could also try running the fs:mirror suite with this PR's commits in a test branch.

"cephfs_mirror_max_concurrent_file_transfer")
<< "cephfs_mirror_max_number_of_threads="
<< g_ceph_context->_conf.get_val<uint64_t>(
"cephfs_mirror_max_number_of_threads")
Member

You have already read these values from the config above, right? Just use a variable instead of reading them again?

void PeerReplayer::OpHandlerThreadPool::handle_other_task_async_sync(
C_MirrorContext *task) {
bool success = false;
if (other_task_queue.size() < task_queue_limit) {
Member

This check is unnecessary. We are getting the success status anyway from do_other_task_async, and do_other_task_async already checks the size.

Member Author

@joscollin Yes, I deliberately added the check without locking the mutex. It saves unnecessary locking of the queue mutex: if the queue is full, we would otherwise lock the mutex, check the size, fail to push the task, and end up doing the task right away anyway. So I check the queue once outside the lock. The unlocked read can give a stale value, but it is never far from the locked value, and using a stale value here is harmless: in the worst case we just do the task right away.
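The fast path being described might look like the following sketch (illustrative only, not the PR's code; `TaskQueue` and its members are invented). Using an atomic counter for the approximate size keeps the lock-free read free of data races:

```cpp
// Bounded queue with a lock-free "is it full?" fast path. A stale read
// of approx_size is harmless: the worst case is that the caller runs
// the task inline even though a slot just opened up.
#include <atomic>
#include <cassert>              // for the demo assertions
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

struct TaskQueue {
  explicit TaskQueue(std::size_t limit_) : limit(limit_) {}

  // Returns true if the task was queued; false tells the caller to run
  // the task synchronously instead.
  bool try_push(std::function<void()> task) {
    if (approx_size.load(std::memory_order_relaxed) >= limit)
      return false;                      // fast path: no lock taken
    std::lock_guard<std::mutex> l(m);
    if (q.size() >= limit)               // recheck under the lock
      return false;
    q.push(std::move(task));
    approx_size.fetch_add(1, std::memory_order_relaxed);
    return true;
  }

  bool try_pop(std::function<void()> &out) {
    std::lock_guard<std::mutex> l(m);
    if (q.empty())
      return false;
    out = std::move(q.front());
    q.pop();
    approx_size.fetch_sub(1, std::memory_order_relaxed);
    return true;
  }

private:
  std::mutex m;
  std::queue<std::function<void()>> q;
  std::atomic<std::size_t> approx_size{0};
  std::size_t limit;
};
```

The design point under discussion is exactly the fast-path `return false`: the check is only an optimization, so a conservative (stale) answer costs nothing more than running one task inline.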

if (other_task_queue.size() < task_queue_limit) {
success = do_other_task_async(task);
}
if (!success) {
Member

We should have if (success) { here for the task to execute one of the replayer-> tasks, right?

Member Author

No, here success represents that we successfully pushed the task into the queue (so it will be done asynchronously). If we are not successful, we do it right away, synchronously. And yes, either way it will invoke the replayer-> task.

@vshankar
Contributor

@joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@joscollin
Member

> @joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.

cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory

Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
@sajibreadd
Member Author

sajibreadd commented Feb 13, 2025

In this new change I updated the code to remove unnecessary stats (get stats / update stats), which has been a major complaint about cephfs-mirror. I also have a question:

Suppose, for a path /x/y/z (a directory): in the previous snapshot /x/y is a symbolic link (pointing to a directory that contains z as a sub-directory), and in the current snapshot /x/y is a real directory (not a symbolic link) that also has z as a sub-directory. In that case, what will the snapdiff between the current and previous snapshots output for the path /x/y/z?

@sajibreadd
Member Author

> @joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

> Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.

@joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

2025-02-16T04:45:09.448+0100 732997200640  0 client.4234 _readdir_r_cb-->/snap_1/ceph-qa-suite/suites, fg=*, [1148764438241738754, /snap_2/ceph-qa-suite/suites/stress],
[1148856551331594242, /snap_2/ceph-qa-suite/suites/multi-version],
[1148889733443616770, /snap_2/ceph-qa-suite/suites/powercycle],
[1149295304521023490, /snap_2/ceph-qa-suite/suites/rbd],
[1149336403599949826, /snap_2/ceph-qa-suite/suites/rest],
[1149467676859432962, /snap_2/ceph-qa-suite/suites/rgw],
[1149563574083911682, /snap_1/ceph-qa-suite/suites/ceph-deploy],
[1149563574083911683, /snap_2/ceph-qa-suite/suites/ceph-deploy],
[1149655363709042690, /snap_2/ceph-qa-suite/suites/tgt],
[1149686119432978434, /snap_2/ceph-qa-suite/suites/samba],
[1150138713154519042, /snap_2/ceph-qa-suite/suites/big],
[1150442851063037954, /snap_2/ceph-qa-suite/suites/mixed-clients],
[1150489865385672706, /snap_2/ceph-qa-suite/suites/fs],
[1150767220951875586, /snap_2/ceph-qa-suite/suites/krbd],
[1151002859500732418, /snap_2/ceph-qa-suite/suites/rados],
[1151014028563185666, /snap_2/ceph-qa-suite/suites/knfs],
[1151089665084751874, /snap_2/ceph-qa-suite/suites/experimental],
[1151253723440218114, /snap_2/ceph-qa-suite/suites/marginal],
[1151274782201741314, /snap_2/ceph-qa-suite/suites/kcephfs],
[1151349434370490370, /snap_2/ceph-qa-suite/suites/upgrade],
[1151399614084022274, /snap_2/ceph-qa-suite/suites/dummy],
[1151472702884675586, /snap_2/ceph-qa-suite/suites/hadoop],
[1152014975254921218, /snap_2/ceph-qa-suite/suites/multimds],
[1152826851849142274, /snap_2/ceph-qa-suite/suites/smoke],

I logged the list of dentries being returned. All the dentries exist in snap_1 and snap_2, but ceph-qa-suite/suites/ceph-deploy is returned twice, once for snap_1 and once for snap_2. I debugged at the mds level:

2025-02-16T04:45:09.447+0100 7f32d6000640  0 mds.0.server check-->/ceph_repo/ceph-qa-suite/suites-->/ceph_repo/ceph-qa-suite/suites/stress
/ceph_repo/ceph-qa-suite/suites/multi-version
/ceph_repo/ceph-qa-suite/suites/powercycle
/ceph_repo/ceph-qa-suite/suites/rbd
/ceph_repo/ceph-qa-suite/suites/rest
/ceph_repo/ceph-qa-suite/suites/rgw
/ceph_repo/ceph-qa-suite/suites/ceph-deploy
/ceph_repo/ceph-qa-suite/suites/ceph-deploy
/ceph_repo/ceph-qa-suite/suites/tgt
/ceph_repo/ceph-qa-suite/suites/samba
/ceph_repo/ceph-qa-suite/suites/big
/ceph_repo/ceph-qa-suite/suites/mixed-clients
/ceph_repo/ceph-qa-suite/suites/fs
/ceph_repo/ceph-qa-suite/suites/krbd
/ceph_repo/ceph-qa-suite/suites/rados
/ceph_repo/ceph-qa-suite/suites/knfs
/ceph_repo/ceph-qa-suite/suites/experimental
/ceph_repo/ceph-qa-suite/suites/marginal
/ceph_repo/ceph-qa-suite/suites/kcephfs
/ceph_repo/ceph-qa-suite/suites/upgrade
/ceph_repo/ceph-qa-suite/suites/dummy
/ceph_repo/ceph-qa-suite/suites/hadoop
/ceph_repo/ceph-qa-suite/suites/multimds
/ceph_repo/ceph-qa-suite/suites/smoke

I got these two entries at the mds level as well. I think it is either some bug that the git workload exposes during snapshot creation, or something snapdiff callers are supposed to handle. Let me know your thoughts.

@vshankar
Contributor

> @joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

That's odd and should not happen. Can you open a tracker for this with the steps to reproduce (an isolated reproducer will be handy)?

@joscollin
Member

> @joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

> Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.

> @joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

@vshankar @sajibreadd This is normal for snapdiff if there is a change in ceph-deploy or any of its sub-directories or files.
In this case, there should be a change either in ceph-deploy or somewhere within its sub-tree. That's why it comes up for snap1 and again for snap2, as per the snapdiff API implementation. @sajibreadd You could verify that.

@vshankar
Contributor

> @joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

> Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.

> @joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

> @vshankar @sajibreadd This is normal for snapdiff if there is a change in ceph-deploy or any of its sub-directories or files. In this case, there should be a change either in ceph-deploy or somewhere within its sub-tree. That's why it comes up for snap1 and again for snap2, as per the snapdiff API implementation. @sajibreadd You could verify that.

A snapdiff call would return (entry, snap-id), and depending on the snap-id the caller infers whether the entry is new/modified or deleted. So, are you saying snapdiff could return (entry, snapid1) and (entry, snapid2)? @joscollin

@joscollin
Member

> @joscollin Is this change waiting on comments to be addressed by @sajibreadd ?

> Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.

> @joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

> @vshankar @sajibreadd This is normal for snapdiff if there is a change in ceph-deploy or any of its sub-directories or files. In this case, there should be a change either in ceph-deploy or somewhere within its sub-tree. That's why it comes up for snap1 and again for snap2, as per the snapdiff API implementation. @sajibreadd You could verify that.

> A snapdiff call would return (entry, snap-id), and depending on the snap-id the caller infers whether the entry is new/modified or deleted. So, are you saying snapdiff could return (entry, snapid1) and (entry, snapid2)? @joscollin

I don't recall whether snapdiff returns (entry, snap-id). It always returns entries starting from the root. So if we have a/b/c, the snapdiff call would return entries a, b, and c for a change in c. That's how the snapdiff API works.

> [1149563574083911682, /snap_1/ceph-qa-suite/suites/ceph-deploy],

So this^ should be for a change in the ceph-deploy directory while snapshotting snap_1,

and

> [1149563574083911683, /snap_2/ceph-qa-suite/suites/ceph-deploy],

this^ should be for another change under the ceph-deploy directory while snapshotting snap_2.

For both changes, ceph-deploy is a parent directory. In the qa test, we are creating the repo and doing git reset and git pull, so ceph-deploy would undergo creation and further updates before each corresponding snapshotting/mirroring.

By the way, these logs are not generated by ceph/qa/tasks/cephfs/test_mirroring.py. We don't have the words snap_1 and snap_2 anywhere in our fs:mirror tests, so they must have come from somewhere else.

@sajibreadd
Member Author

sajibreadd commented Feb 24, 2025

> A snapdiff call would return (entry, snap-id), and depending on the snap-id the caller infers whether the entry is new/modified or deleted. So, are you saying snapdiff could return (entry, snapid1) and (entry, snapid2)? @joscollin

I talked to @ifed01; according to him it is working as expected, and it should return two entries. Having the same entry twice, for snap_1 and snap_2, means it was deleted first and then recreated. So we need to handle this case in the cephfs-mirror code using the snapid:

  • if we find an entry with snapid = previous snapshot, and a stat on the current snapshot (which we are already doing) shows it exists in the current snapshot, we just ignore it.
  • if the stat shows it does not exist, we delete it.
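The two bullets above can be sketched as a small decision function (illustrative only; `classify_snapdiff_entry` and the `Action` enum are hypothetical names, not the PR's actual code):

```cpp
// Decide what to do with one snapdiff entry, given the snap-id it was
// reported against and whether a stat() on the current snapshot finds
// it. Names here are hypothetical sketches, not the PR's code.
#include <cassert>              // for the demo assertions
#include <cstdint>

enum class Action { Ignore, Delete, CreateOrUpdate };

Action classify_snapdiff_entry(std::uint64_t entry_snapid,
                               std::uint64_t prev_snapid,
                               bool exists_in_current) {
  if (entry_snapid == prev_snapid) {
    // Reported against the previous snapshot: deleted there. If it
    // also exists in the current snapshot it was recreated, and the
    // current-snapshot entry will drive the sync, so ignore this one.
    return exists_in_current ? Action::Ignore : Action::Delete;
  }
  // Reported against the current snapshot: new or modified.
  return Action::CreateOrUpdate;
}
```

An entry reported against the previous snapshot's id means "gone in that snapshot"; whether it was recreated is decided by the stat against the current snapshot.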

@vshankar
Contributor

> @joscollin @vshankar I tried those tests. What I found is that snapdiff is returning the same entry multiple times. Is it supposed to do that? I debugged at the client level and even at the mds level; it is returning the same dentry twice.

That can happen when the file is present in both snapshots but the file type is different (dir vs file).

@vshankar
Contributor

> A snapdiff call would return (entry, snap-id), and depending on the snap-id the caller infers whether the entry is new/modified or deleted. So, are you saying snapdiff could return (entry, snapid1) and (entry, snapid2)? @joscollin

> I talked to @ifed01; according to him it is working as expected, and it should return two entries. Having the same entry twice, for snap_1 and snap_2, means it was deleted first and then recreated. So we need to handle this case in the cephfs-mirror code using the snapid:
>
> • if we find an entry with snapid = previous snapshot, and a stat on the current snapshot (which we are already doing) shows it exists in the current snapshot, we just ignore it.
> • if the stat shows it does not exist, we delete it.

That's right. I left a comment about this when I checked the code (I didn't recall this behaviour earlier). For some reason, the comment I left from the GitHub mobile app didn't get posted.

@vshankar
Contributor

Along with #61937, this should give a nice performance boost to the cephfs mirror daemon. @sajibreadd FYI.

@sajibreadd
Member Author

> Along with #61937, this should give a nice performance boost to the cephfs mirror daemon. @sajibreadd FYI.

@vshankar Sorry, I was busy trying a non-snapdiff (concurrent) version on one of our customer clusters to see how things work; I found a client bug and some bugs in the cephfs-mirror code (I will mention them here later). That's why I couldn't wrap up this PR. Today I will try to wrap it up so that it can be reviewed next week.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label May 12, 2025
@vshankar
Contributor

> @vshankar Sorry, I was busy trying a non-snapdiff (concurrent) version on one of our customer clusters to see how things work; I found a client bug and some bugs in the cephfs-mirror code (I will mention them here later). That's why I couldn't wrap up this PR. Today I will try to wrap it up so that it can be reviewed next week.

Hey @sajibreadd -- Would it be possible to refresh/rebase the changes to take it forward?

@sajibreadd
Member Author

sajibreadd commented May 13, 2025

> Hey @sajibreadd -- Would it be possible to refresh/rebase the changes to take it forward?

Yes, on it.

@sajibreadd
Member Author

sajibreadd commented Jun 5, 2025

I will try finishing this PR this week.

@sajibreadd
Member Author

@vshankar @joscollin I am closing this PR as I created a new one: #63950

@sajibreadd sajibreadd closed this Jun 16, 2025
@sajibreadd sajibreadd deleted the wip-69190 branch March 16, 2026 08:23