Skip to content

reef: cephfs_mirror: use snapdiff api for incremental syncing#58985

Open
joscollin wants to merge 10 commits intoceph:reeffrom
joscollin:wip-65222-reef
Open

reef: cephfs_mirror: use snapdiff api for incremental syncing#58985
joscollin wants to merge 10 commits intoceph:reeffrom
joscollin:wip-65222-reef

Conversation

@joscollin
Copy link
Member

@joscollin joscollin commented Aug 1, 2024

backport tracker: https://tracker.ceph.com/issues/65222
backport tracker: https://tracker.ceph.com/issues/69244
backport tracker: https://tracker.ceph.com/issues/72612
Fixes: https://tracker.ceph.com/issues/71110


backport of #54633
parent tracker: https://tracker.ceph.com/issues/61334

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@joscollin joscollin added this to the reef milestone Aug 1, 2024
@joscollin joscollin added the cephfs Ceph File System label Aug 1, 2024
@lxbsz
Copy link
Member

lxbsz commented Aug 6, 2024

jenkins test make check

@rishabh-d-dave
Copy link
Contributor

This needs a re-run and a rebuild since binaries built on shaman have been deleted by now . See - https://tracker.ceph.com/issues/67493#note-11.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 26, 2024
@vshankar vshankar removed the stale label Nov 26, 2024
@vshankar
Copy link
Contributor

@joscollin
Copy link
Member Author

@joscollin PTAL at the failure - https://pulpito.ceph.com/rishabh-2024-09-10_14:18:13-fs-wip-rishabh-testing-20240910.073325-reef-distro-default-smithi/7899232/

(which is reproducible)

@vshankar
This is a known issue [1] https://tracker.ceph.com/issues/68567. As I remember, this failure exists even before snapdiff change and still exist in main. This is getting failed intermittently. So I did a check. My observation is: The source and destination snapshots doesn't show any differences and it's been synced correctly. So it's failing because of something else, different from incremental sync, which need to be RCA'ed.

My opinion is we shouldn't block this PR because of this and should be tracked separately [1].

@joscollin
Copy link
Member Author

@joscollin PTAL at the failure - https://pulpito.ceph.com/rishabh-2024-09-10_14:18:13-fs-wip-rishabh-testing-20240910.073325-reef-distro-default-smithi/7899232/

(which is reproducible)

@vshankar
Please include this fix #61068 and rerun the QA.

@joscollin
Copy link
Member Author

@vshankar Added more commits. Please re-run the QA.

@vshankar
Copy link
Contributor

vshankar commented Jan 3, 2025

jenkins retest this please

@vshankar
Copy link
Contributor

Blocked due to: #54633 (review)

@joscollin
Copy link
Member Author

Fix created: #61539.

@mchangir
Copy link
Contributor

mchangir commented Feb 7, 2025

jenkins test docs

@mchangir
Copy link
Contributor

mchangir commented Feb 7, 2025

jenkins test make check

@mchangir
Copy link
Contributor

This PR is under test in https://tracker.ceph.com/issues/69881.

@joscollin
Copy link
Member Author

This PR is under test in https://tracker.ceph.com/issues/69881.

@mchangir This PR is DNM and waiting for the fix #61539.

@mchangir
Copy link
Contributor

This PR is under test in https://tracker.ceph.com/issues/69881.

@mchangir This PR is DNM and waiting for the fix #61539.

Thanks for the reminder.

@joscollin joscollin changed the title [DNM] reef: cephfs_mirror: use snapdiff api for incremental syncing reef: cephfs_mirror: use snapdiff api for incremental syncing Jul 30, 2025
joscollin and others added 10 commits July 30, 2025 11:13
Use snapdiff api to sync only the delta of files between two snapshots.

Fixes: https://tracker.ceph.com/issues/61334
Signed-off-by: Jos Collin <jcollin@redhat.com>
(cherry picked from commit 96c351c)
Likely ceph#55619 missed that.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 6f1d21c)
ceph#55619 eliminates the need for ceph_close() call after successful ceph_fdopendir() one.
And introduces automatic file descriptor's close when corresponding ceph_closedir() is called.
That hasty ceph_close() call makes file descriptor
available for a new allocation which might conflict with the automatic fd close in the above ceph_closedir().

Full PeerReplayer::do_synchronize() has been reworked to close fds
properly, depending on whether ceph_fdopendir() has been already applied
to them.

Additionally for the sake of uniformity this reworks incremental do_synchronize()
in a way to do final fd closings similar to full implementation.

Plus this effectively reverts
ceph@bd78bdc
as it looks like a wrong approach to fight broken file descriptor
references. No much sense in reopening of the current snapshot's root folder
on each new entry processing. Instead this patch just doesn't close it from the
beginning.

Fixes: https://tracker.ceph.com/issues/68853

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 7a747bc)
This avoids considering latest changes from the source filesystem for
the mirroring of already existing snapshots. Thus the destination
filesystem and snapshots would be created based only on the source snapshots.
The destination fs would be a replica of the last snapshot taken.

Fixes: https://tracker.ceph.com/issues/68567
Signed-off-by: Jos Collin <jcollin@redhat.com>
(cherry picked from commit 23e4cd5)
Fixes: https://tracker.ceph.com/issues/68567
Signed-off-by: Jos Collin <jcollin@redhat.com>
(cherry picked from commit 830626b)
Fixes: https://tracker.ceph.com/issues/69671
Signed-off-by: Jos Collin <jcollin@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit d9ac431)

 Conflicts:
    src/tools/cephfs_mirror/PeerReplayer.h
    - Resolved cherry-pick conflicts due to 691ed01 not backported to reef
As a debug aid when the snapdiff request is erroring out.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit ae9b035)

 Conflicts:
    src/mds/Server.cc
    - Resolved cherry-pick conflicts
Fixes: https://tracker.ceph.com/issues/59067
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 749c770)
Fixes: http://tracker.ceph.com/issues/70287
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 33c6f23)
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit e16820f)
@vshankar
Copy link
Contributor

@joscollin - as discussed, we need the additional commit from the main branch that was dropped in the latest update. And if there is a bug in that commit when backporting, then we should find that first.

@joscollin
Copy link
Member Author

@joscollin - as discussed, we need the additional commit from the main branch that was dropped in the latest update. And if there is a bug in that commit when backporting, then we should find that first.

I can debug and fix it. But as I mentioned in the discussion, that fix would make reef newer than main in git.
If we are backporting the entire file blockdiff feature, it makes sense. Otherwise, backporting subclassing alone doesn't fix anything in reef AFAIU.

@vshankar
Copy link
Contributor

@joscollin - as discussed, we need the additional commit from the main branch that was dropped in the latest update. And if there is a bug in that commit when backporting, then we should find that first.

I can debug and fix it. But as I mentioned in the discussion, that fix would make reef newer than main in git.

I don't understand how it will be newer than main. The subclassing change is a commit that needs backporting since the main branch tests have been running with that.

If we are backporting the entire file blockdiff feature, it makes sense. Otherwise, backporting subclassing alone doesn't fix anything in reef AFAIU.

The subclassing commit(*) has the remote and snapdiff sync class mechanism. The commit is standalone and does not reply on blockdiff.

(*): d9ac431#diff-902f4a8c220846756f7af155a115519a0382579cd04dcc7069121665c245edfe

@joscollin
Copy link
Member Author

jenkins test make check

1 similar comment
@joscollin
Copy link
Member Author

jenkins test make check

@vshankar
Copy link
Contributor

@joscollin any update on this?

@joscollin
Copy link
Member Author

@joscollin any update on this?

@vshankar
Yes, there is a bug in readdir_snapdiff. readdir_snapdiff->_readdir_get_frag fails (EINVAL) only for the directory type. The directory type delete --> recreate --> read fails. The interesting thing is it fails along with the fix 33c6f23.

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

@joscollin
Copy link
Member Author

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

No, it doesn't fix the issue found in this PR.
https://pulpito.ceph.com/jcollin-2025-08-19_15:07:28-fs:mirror-wip-jcollin-testing-reef-65222-1908_2-distro-default-smithi/8451654/

@vshankar
Copy link
Contributor

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

No, it doesn't fix the issue found in this PR. https://pulpito.ceph.com/jcollin-2025-08-19_15:07:28-fs:mirror-wip-jcollin-testing-reef-65222-1908_2-distro-default-smithi/8451654/

Yeh, I was about to leave a comment since I reviewed PR #64995 and it does not look like the test failed due to that bug.

@joscollin
Copy link
Member Author

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

No, it doesn't fix the issue found in this PR. https://pulpito.ceph.com/jcollin-2025-08-19_15:07:28-fs:mirror-wip-jcollin-testing-reef-65222-1908_2-distro-default-smithi/8451654/

Yeh, I was about to leave a comment since I reviewed PR #64995 and it does not look like the test failed due to that bug.

I'm debugging this.

@vshankar
Copy link
Contributor

vshankar commented Oct 7, 2025

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

No, it doesn't fix the issue found in this PR. https://pulpito.ceph.com/jcollin-2025-08-19_15:07:28-fs:mirror-wip-jcollin-testing-reef-65222-1908_2-distro-default-smithi/8451654/

Yeh, I was about to leave a comment since I reviewed PR #64995 and it does not look like the test failed due to that bug.

I'm debugging this.

@joscollin - any update on this?

@joscollin
Copy link
Member Author

I think @ifed01 also faced the same issue and working on a fix for it? I need to review this PR: #64995.

No, it doesn't fix the issue found in this PR. https://pulpito.ceph.com/jcollin-2025-08-19_15:07:28-fs:mirror-wip-jcollin-testing-reef-65222-1908_2-distro-default-smithi/8451654/

Yeh, I was about to leave a comment since I reviewed PR #64995 and it does not look like the test failed due to that bug.

I'm debugging this.

@joscollin - any update on this?

@vshankar
This backport has some failures, which could easily be avoided by dropping the commits from https://github.com/ceph/ceph/pull/62250/commits. Since the upcoming reef release is the final release, I think fixing it that way should be fine IMO.

@github-actions
Copy link

github-actions bot commented Oct 8, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@smacfaulk
Copy link

Is this not making the 18.2.8 release? Docs currently state that snapdiff is used in reef mirroring.

@vshankar
Copy link
Contributor

Is this not making the 18.2.8 release? Docs currently state that snapdiff is used in reef mirroring.

Unfortunately, not. I see the docs are incorrect, possibly due to the fact that doc changes got backported first.

@vshankar
Copy link
Contributor

Is this not making the 18.2.8 release? Docs currently state that snapdiff is used in reef mirroring.

https://tracker.ceph.com/issues/74985

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants