
cephfs_mirror, qa: fix mirror daemon doesn't restart when blocklisted or failed #56193

Merged
vshankar merged 3 commits into ceph:main from joscollin:wip-B64927-test_cephfs_mirror_blocklist-fail
Jul 15, 2024

Conversation

@joscollin
Member

@joscollin joscollin commented Mar 14, 2024

Fixes: https://tracker.ceph.com/issues/64927
Fixes: https://tracker.ceph.com/issues/51964
Fixes: https://tracker.ceph.com/issues/63931
Fixes: https://tracker.ceph.com/issues/63089

@github-actions github-actions bot added the cephfs (Ceph File System) and tests labels Mar 14, 2024
@joscollin joscollin marked this pull request as draft March 15, 2024 03:01
@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch from 7e3eacf to 1c563a1 Compare March 15, 2024 09:06
@joscollin joscollin marked this pull request as ready for review March 15, 2024 09:07
@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch 2 times, most recently from 194f9c6 to 3d96002 Compare March 19, 2024 02:43
@joscollin joscollin requested a review from a team March 19, 2024 02:44
@joscollin
Member Author

jenkins test make check

@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch from 3d96002 to d80d586 Compare March 19, 2024 14:56
@joscollin
Member Author

jenkins test api

vshankar added a commit to vshankar/ceph that referenced this pull request Apr 12, 2024
* refs/pull/56193/head:
	qa: fixes test_cephfs_mirror_blocklist raises KeyError: 'rados_inst'
vshankar added a commit to vshankar/ceph that referenced this pull request Apr 30, 2024
* refs/pull/56193/head:
	qa: fixes test_cephfs_mirror_blocklist raises KeyError: 'rados_inst'
vshankar added a commit to vshankar/ceph that referenced this pull request May 6, 2024
* refs/pull/56193/head:
	qa: fixes test_cephfs_mirror_blocklist raises KeyError: 'rados_inst'
Contributor

@vshankar vshankar left a comment

PTAL at the failure.

@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch 3 times, most recently from c30ceb1 to 3b8b4ca Compare May 30, 2024 15:23
@joscollin joscollin changed the title from "qa: fixes test_cephfs_mirror_blocklist raises KeyError: 'rados_inst'" to "cephfs_mirror, qa: fix mirror daemon doesn't restart when blocklisted or failed" May 30, 2024
@joscollin joscollin marked this pull request as draft June 3, 2024 06:04
@joscollin
Member Author

@vshankar Testing and fix on Listener in progress. Changing this to draft.

@vshankar
Contributor

vshankar commented Jun 3, 2024

@vshankar Testing and fix on Listener in progress. Changing this to draft.

Move it back when it's ready for review.

@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch 2 times, most recently from acf1adc to 7f9f3e2 Compare June 4, 2024 04:26
@joscollin
Member Author

joscollin commented Jun 4, 2024

Test Passed: https://pulpito.ceph.com/jcollin-2024-06-04_00:41:08-fs:mirror-wip-jcollin-testing14-distro-default-smithi/.
@vshankar This test ran on top of your failed branch wip-vshankar-testing-20240506.064357, and it passes now. The PR is ready for review.

a61d165 is the actual fix for the failure.
The qa change is just a better way of doing the same thing and could be dropped. However, I recommend that the test wait at least 60 seconds before querying the new rados_inst.

@joscollin joscollin marked this pull request as ready for review June 4, 2024 04:51
@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch from 6178129 to 886f92b Compare June 7, 2024 10:05
@joscollin
Member Author

@vshankar Thanks for the approval.

I'm working on this tracker today: https://tracker.ceph.com/issues/63931. From teuthology.log, I see that the failed mirror daemon didn't restart, so this PR most likely fixes https://tracker.ceph.com/issues/63931 as well. I'll confirm that this evening.

The problem is that we don't have remote logs for the failed runs in https://tracker.ceph.com/issues/63931. They are too old and have probably been erased. I could reproduce the failure from the same branch, but there's no centos8 support available at the moment, so I'm trying to reproduce it after rebasing the branch onto main. If that doesn't work, I'll rely on teuthology.log and the source code and confirm it by this evening.

@vshankar
https://tracker.ceph.com/issues/63931 also needs the fix in this PR. The commit messages are updated accordingly and the PR is rebased onto main. Please check.

@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch from 886f92b to 330b7b9 Compare June 10, 2024 03:03
@joscollin
Member Author

rebased

@joscollin
Member Author

@vshankar Nothing has changed since your approval; I only updated the commit messages and rebased.

@joscollin
Member Author

jenkins test make check arm64

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/66521.

@joscollin
Member Author

This PR is under test in https://tracker.ceph.com/issues/66521.

The test result looks good. No cephfs-mirror failures.

joscollin added 3 commits July 5, 2024 10:14
…tamp in FSMirror

Have FSMirror register a listener with InstanceWatcher/MirrorWatcher which would get invoked when the mirror daemon is blocklisted or failed.
Thus FSMirror can maintain the last blocklisted/failed timestamp and use that for restarting the mirror daemon.

Fixes: https://tracker.ceph.com/issues/64927
Fixes: https://tracker.ceph.com/issues/51964
Fixes: https://tracker.ceph.com/issues/63931
Fixes: https://tracker.ceph.com/issues/63089
Signed-off-by: Jos Collin <jcollin@redhat.com>
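
The listener arrangement described in the commit message above can be illustrated roughly as follows. This is only a sketch in Python, not the cephfs-mirror C++ code: the class names `Watcher` and `FSMirrorSketch`, the method names, the injectable `clock`, and the choice to measure the interval from the most recent error are all assumptions made for illustration. The 30-second default mirrors the restart timeout mentioned in the qa commit message.

```python
import time
from typing import Callable, Optional


class Watcher:
    """Hypothetical stand-in for InstanceWatcher/MirrorWatcher."""

    def __init__(self, on_blocklisted: Callable[[], None],
                 on_failed: Callable[[], None]) -> None:
        self._on_blocklisted = on_blocklisted
        self._on_failed = on_failed

    def handle_error(self, blocklisted: bool) -> None:
        # Invoke the registered listener so the owner learns about the
        # error immediately, rather than having to poll the watcher.
        if blocklisted:
            self._on_blocklisted()
        else:
            self._on_failed()


class FSMirrorSketch:
    """Hypothetical owner that records when the last error happened."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.blocklisted_ts: Optional[float] = None
        self.failed_ts: Optional[float] = None
        # Register this object's setters as the watcher's listener.
        self.watcher = Watcher(self._set_blocklisted_ts, self._set_failed_ts)

    def _set_blocklisted_ts(self) -> None:
        self.blocklisted_ts = self._clock()

    def _set_failed_ts(self) -> None:
        self.failed_ts = self._clock()

    def should_restart(self, restart_interval: float = 30.0) -> bool:
        # Assumption: the interval is measured from the most recent error.
        stamps = [t for t in (self.blocklisted_ts, self.failed_ts)
                  if t is not None]
        if not stamps:
            return False
        return self._clock() - max(stamps) >= restart_interval
```

Keeping the timestamps in the owner means the restart decision depends only on state the owner holds itself, which appears to be the intent of the change.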
After the mirror daemon is blocklisted or failed, wait for it to restart
(which happens after a 30-second timeout) and then check for the new rados_inst.

Fixes: https://tracker.ceph.com/issues/64927
Signed-off-by: Jos Collin <jcollin@redhat.com>
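
A wait of this kind could be sketched as a polling loop. This is not the code from the qa suite: the helper name `wait_for_new_rados_inst`, the `get_status` callable (assumed to return the daemon status as a dict), and the injectable `sleep` and `clock` parameters are hypothetical. The 60-second default follows the recommendation made earlier in this conversation.

```python
import time
from typing import Any, Callable, Dict


def wait_for_new_rados_inst(
    get_status: Callable[[], Dict[str, Any]],
    old_inst: str,
    timeout: float = 60.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status reports a rados_inst different from old_inst."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        # Use .get() because the key may be absent while the daemon is
        # restarting; indexing directly would raise KeyError: 'rados_inst'.
        inst = status.get("rados_inst")
        if inst is not None and inst != old_inst:
            return inst
        if clock() >= deadline:
            raise TimeoutError("mirror daemon did not report a new rados_inst")
        sleep(interval)
```

Polling with a deadline lets the test proceed as soon as the daemon is back, instead of always sleeping for the full timeout.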
@joscollin joscollin force-pushed the wip-B64927-test_cephfs_mirror_blocklist-fail branch from 330b7b9 to a9a5691 Compare July 5, 2024 04:45
@joscollin
Member Author

@vshankar Nothing has changed; I only updated the commit messages and rebased.

@joscollin
Member Author

jenkins test make check

@joscollin
Member Author

jenkins test make check arm64

@joscollin
Member Author

jenkins test make check

joscollin added a commit to joscollin/ceph that referenced this pull request Jul 12, 2024
* refs/pull/56193/head:
	qa: Wait for mirror daemon restart before getting new rados_inst
	cephfs_mirror: Fixed negative seconds
	cephfs_mirror: Add ErrorListener to maintain blocklisted/failed timestamp in FSMirror

Reviewed-by: Venky Shankar <vshankar@redhat.com>
Contributor

@vshankar vshankar left a comment

@vshankar vshankar merged commit 62eb727 into ceph:main Jul 15, 2024
@joscollin joscollin deleted the wip-B64927-test_cephfs_mirror_blocklist-fail branch July 15, 2024 12:12
NitzanMordhai pushed a commit to NitzanMordhai/ceph that referenced this pull request Aug 1, 2024
…irror_blocklist-fail

cephfs_mirror, qa: fix mirror daemon doesn't restart when blocklisted or failed

Reviewed-by: Venky Shankar <vshankar@redhat.com>