cephfs-mirror: cephfs-mirror should be able to do operations concurrently for large directory #61245
sajibreadd wants to merge 1 commit into ceph:main from
Conversation
Force-pushed from ca570e2 to 1f75983
- cephfs-mirror
  min: 0
  max: 11
- name: cephfs_mirror_max_concurrent_file_transfer
How about making this (the number of threads per peer) dynamic, so it can increase based on how busy the sync threads are, especially when large files are being transferred? There should obviously be a max cap for this. What do you think? @sajibreadd
Did you mean that whenever a new file transfer starts, it will run on a (newly created) thread, and if the number of running threads has already reached the max cap, it should wait? @vshankar
Right, something like a dynamic thread pool.
Hi Venky, I was thinking that when a user adds a directory for sync, they would have the privilege to specify the number of threads for syncing that directory. That way we could sync directories independently, each with its own associated thread pool, instead of channeling all tasks into a single thread pool. The user would also be able to update it at runtime. Let me know your thoughts.
That can be abused in many ways by the user. What can be done is to prioritize the directories whose snapshots are to be synced. That way, higher-priority directories will be given a relatively larger number of threads from the thread pool.
@sajibreadd Try reading src/common/WorkQueue.h::ThreadPool. Basically it allows implementing a shared work queue amongst multiple threads. With the priority thing I mentioned above, it would allow a large share of threads to operate on a directory that has a higher sync-priority.
Force-pushed from cb70b8e to 9777553
joscollin
left a comment
Will check further in the following days.
Force-pushed from e719af4 to 07f900f
joscollin
left a comment
While testing these changes, I got 2 failures of incremental sync tests in 2 different runs, [1] and [2]. It looks like one of the incremental sync tests fails every time, and the two are different errors. Failure [1] seems to be a new one and Failure [2] is tracked here. Please check the logs and fix them.
@joscollin You mean my PR is failing those tests? Or the old cephfs-mirror incremental sync?
The changes in this PR are failing those tests. You could also try running
        "cephfs_mirror_max_concurrent_file_transfer")
    << "cephfs_mirror_max_number_of_threads="
    << g_ceph_context->_conf.get_val<uint64_t>(
        "cephfs_mirror_max_number_of_threads")
You have already read these values from the config above, right? Just use a variable instead of reading again?
void PeerReplayer::OpHandlerThreadPool::handle_other_task_async_sync(
    C_MirrorContext *task) {
  bool success = false;
  if (other_task_queue.size() < task_queue_limit) {
This check is unnecessary. We are getting the success status anyway from do_other_task_async, and do_other_task_async already checks the size.
@joscollin Yes, I added the check deliberately without locking the mutex. It saves unnecessary locking of the queue mutex: if the queue is full, we would otherwise lock the mutex, check the size, fail to push the task, and end up doing the task right away anyway. So I just check the queue size once from outside the lock. The unlocked read can give a stale value, but it won't be far from the locked value, and using a stale value for this check is not harmful; in the worst case we just do the task right away.
  if (other_task_queue.size() < task_queue_limit) {
    success = do_other_task_async(task);
  }
  if (!success) {
We should have if (success) { here for the task to execute one of the replayer-> tasks, right?
No, here success represents that we successfully pushed the task into the queue (i.e. it will be done asynchronously). If we are not successful, we do it right away, synchronously. And yes, that path will invoke the replayer-> task.
@joscollin Is this change waiting on comments to be addressed by @sajibreadd?
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Yes, this comment needs to be addressed mainly. I'll continue reviewing this tomorrow.
Force-pushed from 3132511 to eca2787
…ntly for large directory
Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
Force-pushed from eca2787 to adfb476
In this new change I updated the code to remove unnecessary stats (get stats / update stats), which has been a major complaint about cephfs-mirror. Also I have a question: suppose for a path
@joscollin @vshankar I tried those tests. What I found is that snapdiff is providing the same entry multiple times. Is it supposed to do that? I debugged it from the client level and even from the MDS level, and it is returning the same dentry twice. I logged the list of dentries it returns; all the dentries exist, and I got these two entries in the MDS as well. I think it is either some bug that gets exposed during snapshot creation, or something snapdiff is supposed to handle. Let me know your thoughts.
That's odd and should not happen. Can you open a tracker for this with the steps to reproduce (an isolated reproducer will be handy)?
@vshankar @sajibreadd This is normal as per snapdiff, if there's a change in ceph-deploy or any of its sub-directories or files.
A snapdiff call would return (entry, snap-id), and depending on the snap-id the caller infers whether the entry is new/modified or deleted. So, are you saying snapdiff could return both (entry, snapid1) and (entry, snapid2)? @joscollin
I don't recall if the snapdiff returns
and
For both changes, By the way, these logs are not generated by
I talked to @ifed01; according to him, it is working as expected and it should return two entries. So basically, getting the same entry twice, once for snap_1 and once for snap_2, means it was deleted first and then recreated. So we need to handle this case in the cephfs-mirror code using the snapid.
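The handling being described, using the snap-id to tell "deleted and recreated" apart from a plain create or delete, might look roughly like this. This is purely an illustration of the inference, not cephfs-mirror code; the tagging convention (older snap-id means the entry from snap_1 is gone, newer snap-id means it is present in snap_2) is an assumption for the sketch:

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class Action { CreateOrModify, Delete, DeleteThenRecreate };

// Classify snapdiff-style (entry, snap-id) pairs. Assumed convention:
// an entry tagged with the older snap-id existed in snap_1 but is gone,
// one tagged with the newer snap-id is present (new or changed) in
// snap_2, and a name appearing with BOTH ids was deleted and recreated
// between the snapshots -- so the mirror must purge it on the peer
// before re-syncing, instead of treating the duplicate as a bug.
std::map<std::string, Action>
classify(const std::vector<std::pair<std::string, int>> &entries,
         int old_snap, int new_snap) {
  std::set<std::string> in_old, in_new;
  for (const auto &[name, snapid] : entries) {
    if (snapid == old_snap) {
      in_old.insert(name);
    } else if (snapid == new_snap) {
      in_new.insert(name);
    }
  }
  std::map<std::string, Action> out;
  for (const auto &n : in_new) {
    out[n] = in_old.count(n) ? Action::DeleteThenRecreate
                             : Action::CreateOrModify;
  }
  for (const auto &n : in_old) {
    if (!in_new.count(n)) {
      out[n] = Action::Delete;
    }
  }
  return out;
}
```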
That can happen when the file is present in both snapshots but the file type is different (dir vs file).
That's right. I left a comment regarding the same when I checked the code (I didn't recall this behaviour earlier). For some reason, the comment I left from the GitHub mobile app didn't get posted.
Along with #61937, this should give a nice performance boost to the cephfs mirror daemon. @sajibreadd FYI.
@vshankar Sorry, I was busy trying the non-snapdiff (concurrent) version in one of our customer clusters to see how things work; I found a client bug and some bugs in the cephfs-mirror code (will mention them later here). That's why I couldn't wrap up this PR. Today I will try to wrap it up so that it can be reviewed next week.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Hey @sajibreadd -- Would it be possible to refresh/rebase the changes to take it forward?
yes, on it.
I will try finishing this PR this week.
@vshankar @joscollin I am closing this PR as I created a new one
Fixes: https://tracker.ceph.com/issues/69190
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
The idea here is to make the full mirroring concurrent. Previously there was concurrency, but it becomes meaningless when a single large directory is being mirrored or when the number of directories actively syncing is small. I tried a different approach:
I maintain 2 task thread pools: one for file transfer and one for other operations (directory operations / traversing the directory tree).
The file transfer thread pool is mostly self-explanatory: whenever we need to transfer a file, we push the task into that pool's queue and the file is transferred asynchronously. As files are just leaves of the tree, there is no dependency or complication in transferring them asynchronously.
For other operations, it is not as easy as file transferring. Here is an example:
Let's assume all nodes in this tree are directories; then (2) must be created before (4) and (5). So we cannot just throw a directory-creation task into a thread pool. The idea is to define the task intelligently:
A task is defined as: "create a directory, and push tasks for creating its sub-directories into the thread pool."
So here, for example, task1 creates directory (1) and pushes the tasks for (2) and (3) into the thread pool. This way we are actually doing a BFS concurrently.

One observation concerns the queue size. In our production environment we have 800 million directories, and it is not wise to keep 800 million tasks in the task queue, so the idea is to limit the queue to 1e5 tasks. The problem is that the normal wait-notify pattern will not work here. For example, suppose we have 1 thread and a queue size of 1. Task1 would then never finish (it falls into a deadlock), as there is not enough space to push the task for subtree (3). So the idea is: whenever a thread finds that the queue is full (which indicates all threads are working), instead of pushing tasks for its sub-directories, it does them itself. This way we keep all threads roughly uniformly busy.
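The scheme above, a bounded queue plus fall-back to inline execution when the queue is full, can be sketched as follows. This is a simplified, self-contained illustration (an in-memory tree standing in for real directory operations; all names are hypothetical), but it shows why the traversal cannot deadlock even with one thread and a queue of size one:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical in-memory stand-in for a directory subtree.
struct Dir {
  std::vector<Dir> children;
};

class TreeSyncPool {
  std::queue<const Dir *> q;
  std::mutex mtx;
  std::condition_variable cv;
  const size_t queue_limit;
  size_t pending = 0; // dirs queued or currently being processed
  bool stopping = false;
  std::vector<std::thread> workers;
  std::atomic<size_t> dirs_done{0};

  bool try_enqueue(const Dir *d) {
    std::lock_guard<std::mutex> l(mtx);
    if (q.size() >= queue_limit) {
      return false; // queue full: caller must process inline
    }
    q.push(d);
    ++pending;
    cv.notify_one();
    return true;
  }

  void process(const Dir *d) {
    ++dirs_done; // stands in for "create the directory on the peer"
    // Hand children to the pool; if the queue is full (all threads
    // busy), recurse inline instead of blocking -- this is what
    // prevents the 1-thread / queue-size-1 deadlock.
    for (const Dir &c : d->children) {
      if (!try_enqueue(&c)) {
        process(&c);
      }
    }
  }

public:
  TreeSyncPool(int nthreads, size_t limit) : queue_limit(limit) {
    for (int i = 0; i < nthreads; ++i) {
      workers.emplace_back([this] {
        for (;;) {
          const Dir *d;
          {
            std::unique_lock<std::mutex> l(mtx);
            cv.wait(l, [this] { return stopping || !q.empty(); });
            if (q.empty()) {
              return; // stopping and drained
            }
            d = q.front();
            q.pop();
          }
          process(d);
          {
            std::lock_guard<std::mutex> l(mtx);
            if (--pending == 0) {
              cv.notify_all(); // wake sync() waiters
            }
          }
        }
      });
    }
  }

  // Blocks until the whole subtree has been processed.
  void sync(const Dir &root) {
    if (!try_enqueue(&root)) {
      process(&root);
    }
    std::unique_lock<std::mutex> l(mtx);
    cv.wait(l, [this] { return pending == 0; });
  }

  size_t count() const { return dirs_done; }

  ~TreeSyncPool() {
    {
      std::lock_guard<std::mutex> l(mtx);
      stopping = true;
    }
    cv.notify_all();
    for (auto &t : workers) {
      t.join();
    }
  }
};
```

Because every queued child increments `pending` before the parent's task finishes, `pending` only reaches zero when the whole tree is done, and a full queue always degrades to depth-first inline work rather than blocking.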