Skip to content

qa: fix test_cephfs_mirror_incremental_sync failure#61068

Closed
joscollin wants to merge 2 commits intoceph:mainfrom
joscollin:wip-B68567-clear-git-logs
Closed

qa: fix test_cephfs_mirror_incremental_sync failure#61068
joscollin wants to merge 2 commits intoceph:mainfrom
joscollin:wip-B68567-clear-git-logs

Conversation

@joscollin
Copy link
Member

@joscollin joscollin commented Dec 12, 2024

Fixes: https://tracker.ceph.com/issues/68567

This PR fixes a long standing QA failure, by avoiding the git logs for the snapshot sync as they change unexpectedly.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@joscollin
Copy link
Member Author

self.add_directory(self.primary_fs_name, self.primary_fs_id, f'/{repo_path}')

# clear git logs before taking snapshot
self.mount_a.run_shell(['rm', '-rf', f'{repo_path}/.git/logs'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not 100% clear to me that git logs are causing the test failure. Could you explain how?

Copy link
Member Author

@joscollin joscollin Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens mainly during the incremental sync. The md5sum differs consistently for the files .git/logs/HEAD and .git/logs/refs/heads/giant for the source and destination snapshots (thus the checksum differs), which means it changes during the incremental sync. I think that's because it's a repo and writes something to those files, as the path changes ?

As our goal is to test the incremental sync, avoiding the logs (a few KBs) altogether, works very well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I've checked further, this issue doesn't exist in vstart. It happens in teuthology only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens mainly during the incremental sync. The md5sum differs consistently for the files .git/logs/HEAD and .git/logs/refs/heads/main for the source and destination snapshots (thus the checksum differs), which means it changes during the incremental sync. I think that's because it's a repo and writes something to those files, as the path changes ?

But the checksum are for snapshots, which are immutable, so I don't understand how the checksum would change 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snapshot-diff-screenshot

Please see the screenshot^. This issue exist for the test test_cephfs_mirror_sync_with_purged_snapshot also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens mainly during the incremental sync. The md5sum differs consistently for the files .git/logs/HEAD and .git/logs/refs/heads/main for the source and destination snapshots (thus the checksum differs), which means it changes during the incremental sync. I think that's because it's a repo and writes something to those files, as the path changes ?

But the checksum are for snapshots, which are immutable, so I don't understand how the checksum would change 🤔

@vshankar
Look at the above instance, there's a difference in .git/refs/heads/giant too.

I think the solution would be to separate the working directory (.git repo) and the dir_root. So that the dir_root would contain the updated ceph-qa-suite without the .git dir, ready to sync.

Copy link
Member Author

@joscollin joscollin Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens mainly during the incremental sync. The md5sum differs consistently for the files .git/logs/HEAD and .git/logs/refs/heads/main for the source and destination snapshots (thus the checksum differs), which means it changes during the incremental sync. I think that's because it's a repo and writes something to those files, as the path changes ?

But the checksum are for snapshots, which are immutable, so I don't understand how the checksum would change 🤔

@vshankar
It could be because of this issue #54633 (comment), but as I remember test_cephfs_mirror_incremental_sync fails intermittently even before snapdiff was merged.

Anyway, let me know your decision on this. Can I drop this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, we need to fix it there rather than working around it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution would be to separate the working directory (.git repo) and the dir_root. So that the dir_root would contain the updated ceph-qa-suite without the .git dir, ready to sync.

@joscollin I really don't understand why are we tinkering around with .git stuff. The repo was cloned in cephfs and snapshots were taken. So, the snapshotted directory should be correctly replicated irrespective of what lives under that directory.

Clear the git logs before taking a new snapshot,
as they are not good candidates to verify an incremental sync.

Fixes: https://tracker.ceph.com/issues/68567
Signed-off-by: Jos Collin <jcollin@redhat.com>
It's easier to save the HEAD and reset to it instead of using git pull.

Signed-off-by: Jos Collin <jcollin@redhat.com>
@joscollin joscollin force-pushed the wip-B68567-clear-git-logs branch from 50586e0 to dd9c172 Compare December 16, 2024 11:07
@joscollin
Copy link
Member Author

jenkins test make check arm64

@joscollin joscollin closed this Jan 13, 2025
@sajibreadd
Copy link
Member

Is cephfs-mirror incremental sync(using snapdiff) fixed? I still found some problem. I can share the reproducer.

  1. Take linux github repo for syncing.
  2. After cloning take snapshot 1.
  3. Checkout branch v6.11 and take snapshot 2.
  4. Start syncing.

Firs snapshot will be synced successfully as it does not use snapdiff. But for the second snapshot, I found this.
2025-01-16T10:17:45.317+0100 7ebf51a00640 -1 cephfs::mirror::PeerReplayer(662f65da-5464-4bc1-aa0e-a7a350c40aee) copy_to_remote: failed to create remote file path=linux/include/dt-bindings/clock/qcom,sm8650-dispcc.h: (40) Too many levels of symbolic links
Just want to acknowledge I tested this applying #60909 this PR as I know cephfs-mirror is facing some mis-mirroring problem because of libcephfs bug. Also note that I tried with non-snapdiff version in the reef for which this problem didn't occur.

Also I feel we need to take a bit care when incremental sync fails. Because if it fails, I am not sure whether it mismirrors in the next phase or not.

@joscollin
Copy link
Member Author

Is cephfs-mirror incremental sync(using snapdiff) fixed? I still found some problem. I can share the reproducer.

  1. Take linux github repo for syncing.
  2. After cloning take snapshot 1.
  3. Checkout branch v6.11 and take snapshot 2.
  4. Start syncing.

Firs snapshot will be synced successfully as it does not use snapdiff. But for the second snapshot, I found this. 2025-01-16T10:17:45.317+0100 7ebf51a00640 -1 cephfs::mirror::PeerReplayer(662f65da-5464-4bc1-aa0e-a7a350c40aee) copy_to_remote: failed to create remote file path=linux/include/dt-bindings/clock/qcom,sm8650-dispcc.h: (40) Too many levels of symbolic links Just want to acknowledge I tested this applying #60909 this PR as I know cephfs-mirror is facing some mis-mirroring problem because of libcephfs bug. Also note that I tried with non-snapdiff version in the reef for which this problem didn't occur.

Also I feel we need to take a bit care when incremental sync fails. Because if it fails, I am not sure whether it mismirrors in the next phase or not.

@sajibreadd
From the description, it's clear that these are not same issues. This PR tried to fix an intermittent QA failure, which happens only during the incremental sync of a repo (the checksum differs for .git directory) and it's certainly not a copy_to_remote: failed to create remote file path issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants