Skip to content

cephfs_mirror: avoid latest changes on the source fs to enable mirroring#61539

Closed
joscollin wants to merge 3 commits intoceph:mainfrom
joscollin:wip-B68567_B69671
Closed

cephfs_mirror: avoid latest changes on the source fs to enable mirroring#61539
joscollin wants to merge 3 commits intoceph:mainfrom
joscollin:wip-B68567_B69671

Conversation

@joscollin
Copy link
Member

@joscollin joscollin commented Jan 27, 2025

This avoids considering latest changes from the source filesystem for
the mirroring of already existing snapshots. Thus the destination filesystem
and snapshots would be created based only on the source snapshots.
The destination fs would be a replica of the last snapshot taken.

To check this, do:

$ sudo ./bin/mount.ceph admin@.a=/ /mnt/cephfs1 -o conf=$(pwd)/ceph.conf;
$ sudo ./bin/mount.ceph admin@.remotefs=/ /mnt/cephfs2 -o conf=$(pwd)/ceph.conf;

Make some changes and take a snapshot:

$ mkdir -p /mnt/cephfs1/dir/d1
$ touch /mnt/cephfs1/dir/d1/file1
$ mkdir -p /mnt/cephfs1/dir/.snap/snap1; 

Make some more changes & take another snapshot:

$ touch /mnt/cephfs1/dir/d1/file2
$ mkdir -p /mnt/cephfs1/dir/.snap/snap2;

Make a change and don't take snapshot:
$ rm /mnt/cephfs1/dir/d1/file2

Add dir for mirroring:
$ ./bin/ceph fs snapshot mirror add a /dir

Results:

$ tree /mnt/cephfs1/dir/.snap
/mnt/cephfs1/dir/.snap
├── snap1
│   └── d1
│       └── file1
└── snap2
    └── d1
        ├── file1
        └── file2
$ tree /mnt/cephfs2/dir/.snap
/mnt/cephfs2/dir/.snap
├── snap1
│   └── d1
│       └── file1
└── snap2
    └── d1
        ├── file1
        └── file2
$ tree /mnt/cephfs1/dir
/mnt/cephfs1/dir
└── d1
    └── file1
$ tree /mnt/cephfs2/dir
/mnt/cephfs2/dir
└── d1
    ├── file1
    └── file2

Fixes: https://tracker.ceph.com/issues/68567
Fixes: https://tracker.ceph.com/issues/69671

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Jan 27, 2025
@joscollin
Copy link
Member Author

jenkins test api

2 similar comments
@joscollin
Copy link
Member Author

jenkins test api

@joscollin
Copy link
Member Author

jenkins test api

@joscollin joscollin requested a review from a team January 28, 2025 05:35
@joscollin joscollin changed the title cephfs_mirror: avoid latest changes on the source fs for mirroring cephfs_mirror: avoid latest changes on the source fs to enable mirroring Jan 28, 2025
@joscollin
Copy link
Member Author

QA test case in progress.

@joscollin
Copy link
Member Author

jenkins test make check

@joscollin
Copy link
Member Author

jenkins test windows

@joscollin
Copy link
Member Author

@joscollin
Copy link
Member Author

jenkins test make check

@joscollin
Copy link
Member Author

Putting DNM, as this needs more fix for purged_snapshot failure.

@joscollin joscollin removed the DNM label Feb 1, 2025
@joscollin
Copy link
Member Author

@joscollin
Copy link
Member Author

@vshankar
This PR is ready for review.
Please note that PeerReplayer::validate_source is more or less a copy-paste of PeerReplayer::pre_sync_check_and_open_handles. I would prefer using PeerReplayer::pre_sync_check_and_open_handles instead as it does the switching of local prev snapshot to remote dir_root. WDYT?

@joscollin
Copy link
Member Author

@vshankar
Copy link
Contributor

vshankar commented Feb 7, 2025

The new qa test code updated and fs:mirror and fs:valgrind passed:
https://pulpito.ceph.com/jcollin-2025-02-06_13:57:56-fs:mirror-wip-B68567_B69671-distro-default-smithi/
https://pulpito.ceph.com/jcollin-2025-02-06_13:59:33-fs:valgrind-wip-B68567_B69671-distro-default-smithi/

@vshankar This PR is ready for review.

Thanks. I will have a look and put it to test.

CEPH_STATX_MODE | CEPH_STATX_UID | CEPH_STATX_GID |
CEPH_STATX_SIZE | CEPH_STATX_ATIME | CEPH_STATX_MTIME,
AT_STATX_DONT_SYNC | AT_SYMLINK_NOFOLLOW);
npath = entry_path(epath, nname);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joscollin Could you share a sample log snippet involving these paths without and with the fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have it handy. I need to reproduce it with old code then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joscollin - if possible please share this, otherwise I will check it when I pull in this change to test the file block diff integration with the mirror daemon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar Let me see, if I could reproduce it. I don't have a build now. So please don't wait for me. If you are testing block diff, please pull this PR.

break
self.disable_mirroring(self.primary_fs_name, self.primary_fs_id)

def test_cephfs_mirror_sync_already_existing_snapshots(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a redundant test? I believe there is at least one test that checks if snapshots are synchronized as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this should be the only test that does add_directory after snapshot-ting. Is there any function does that already?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, how is this test relevant to this fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar
This test checks if the changes are taken from the previous snapshot or not, while adding a directory.
There are multiple things gets fixed by this PR.

  1. Incremental sync after adding a directory for mirroring -> We have test for this.
  2. Incremental sync of the already existing snapshots while adding a directory for mirroring -> We don't have test for this.
  3. switch to remote dir_root when local prev snapshot is unavailable -> We have test for this.

Hope that clarifies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't (2) pretty much covered with (1)? Do we really need a separate test for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar I think it's better to have a separate test for it (add_directory after snapshot-ting), as the same thing is reported here: #54633 (comment).

}

// Determines if the source is the local (previous) snapshot or the remote dir_root
int PeerReplayer::validate_source(const std::string &dir_root,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - I don't understand why is this required. PeerReplayer would use local snapshot comparison (and hence local mount handles) when it's sure that the data on the remote file system (live) exactly matches the last snapshot transferred. This is determined using an xattr (from the remote file system directory).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But this change is made to resolve the test failure https://tracker.ceph.com/issues/69671. This happens only when the source snapshot is deleted during an incremental sync. Then the incremental sync has to check and make the switch to remote dir_root as per test_cephfs_mirror_sync_with_purged_snapshot.

Previously, before snapdiff, this was taken care of as everything was a full sync. But with incremental sync, as per the test (test_cephfs_mirror_sync_with_purged_snapshot), the source has to be validated each time. Hope this clarifies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the bug was that the mirror daemon was (incorrectly) using incremental sync with remote directory snapshot comparison?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar
There's no issue with incremental sync with remote directory snapshot comparison. But the snapdiff code (the new do_synchronize function) wasn't making the switch to the remote dir_root. There was no such mechanism in the new do_synchronize. I think I've missed it back then. That's the reason for this failure https://tracker.ceph.com/issues/69671. With this PR, the switch happens, as per the original design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. So, if a snapshot is deleted when it was used for incremental sync, the mirror daemon code was still incorrectly using incremental sync in the next iteration, yes? If that's the case, then can you put this up in the commit message please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was.
Updated the commit message @vshankar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar There's no issue with incremental sync with remote directory snapshot comparison. But the snapdiff code (the new do_synchronize function) wasn't making the switch to the remote dir_root. There was no such mechanism in the new do_synchronize. I think I've missed it back then. That's the reason for this failure https://tracker.ceph.com/issues/69671. With this PR, the switch happens, as per the original design.

Sorry for the wrong explanation. I didn't miss to implement the switching. This is a regression caused by 7a747bc. That's the reason test_cephfs_mirror_sync_with_purged_snapshot fails now. It was passing earlier, but now it fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the commit message with the details @vshankar

@joscollin joscollin requested a review from vshankar February 12, 2025 13:55
@joscollin
Copy link
Member Author

@vshankar ping

@joscollin
Copy link
Member Author

@vshankar ping :)
This is needed for other PRs to test.

@vshankar
Copy link
Contributor

@vshankar ping :)
This is needed for other PRs to test.

I am planning to test this with #61937 w/ mirror daemon change to use file block diff. I still have to push that bit of code to the PR.

@vshankar
Copy link
Contributor

@joscollin a suggestion - please do not mark a comment as resolved until the participants in a comment have agreed to either incorporate the fix or leave the change as status-quo.

@joscollin
Copy link
Member Author

@joscollin a suggestion - please do not mark a comment as resolved until the participants in a comment have agreed to either incorporate the fix or leave the change as status-quo.

@vshankar I was following what you've suggested earlier. You wanted it to mark the comment resolved after replying. Anyway, I'll stop clicking resolve button.

@vshankar
Copy link
Contributor

@joscollin a suggestion - please do not mark a comment as resolved until the participants in a comment have agreed to either incorporate the fix or leave the change as status-quo.

@vshankar I was following what you've suggested earlier. You wanted it to mark the comment resolved after replying. Anyway, I'll stop clicking resolve button.

That's not what I meant - If there is agreement on the comment then its fine to mark the comment as resolved.

This avoids considering latest changes from the source filesystem for
the mirroring of already existing snapshots. Thus the destination
filesystem and snapshots would be created based only on the source snapshots.
The destination fs would be a replica of the last snapshot taken.

Fixes: https://tracker.ceph.com/issues/68567
Signed-off-by: Jos Collin <jcollin@redhat.com>
…napshot is unavailable

Fixes the incremental sync not switching the source to remote dir_root, when the
local previous snapshot is deleted. This is a regression caused by
ceph@7a747bc.

Fixes: https://tracker.ceph.com/issues/69671
Signed-off-by: Jos Collin <jcollin@redhat.com>
@joscollin
Copy link
Member Author

jenkins test make check arm64

@vshankar
Copy link
Contributor

Relevant commits pulled into PR #62250

@vshankar vshankar closed this Mar 28, 2025
@joscollin joscollin deleted the wip-B68567_B69671 branch March 28, 2025 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants