
fix client mount in mgr/volumes and clean up snapshots in test_snap_schedules.py #58771

Closed
mchangir wants to merge 3 commits into ceph:main from mchangir:mgr-volumes-test-async_job-debug

Conversation

@mchangir
Contributor

@mchangir mchangir commented Jul 24, 2024

Problem 1: During test tearDown and setUp, there's a small window where the mgr/volumes cloner thread attempts to fetch the next job from the deleted filesystem. This causes the mount to hang.
Solution 1: Conditionally get the next job only if there is a valid fs_map. This avoids the implicit mount in the get next job path.
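A minimal sketch of the guard described in Solution 1 (the class and field names here are illustrative stand-ins, not the actual mgr/volumes code):

```python
class StubMgr:
    """Illustrative stand-in for the mgr handle used by the cloner thread."""

    def __init__(self, fs_map):
        self.fs_map = fs_map

    @property
    def has_fs_map(self):
        # Report a usable fs_map only when at least one filesystem exists,
        # so a deleted filesystem never triggers the implicit mount.
        return bool(self.fs_map and self.fs_map.get("filesystems"))


def get_next_job(mgr, queue):
    # Fetch the next clone job only under a valid fs_map; otherwise
    # return None instead of attempting a mount that would hang.
    if not mgr.has_fs_map:
        return None
    return queue.pop(0) if queue else None


print(get_next_job(StubMgr({"filesystems": []}), ["clone-1"]))          # None
print(get_next_job(StubMgr({"filesystems": ["cephfs"]}), ["clone-1"]))  # clone-1
```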


Problem 2: The mgr declares itself as available as soon as it dispatches code to the finisher to load the python extension modules. This causes the mgr dump command to return the mgr as available even when the modules aren't in the active state to handle commands.
Solution 2: Sleep for a minute to allow the mgr finisher to instantiate the python extension modules so that they can start handling commands.


Problem 3: Mgr uses stale mounts to fetch clone jobs.
Solution 3: Fail the mgr between two tests so that there is a failover and a state cleanup.
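For Solutions 2 and 3, a generic poll helper (a sketch, not the actual test-suite code) captures the "fail the mgr, then wait until a standby is active and available" idea better than a fixed sleep:

```python
import time

def wait_until(predicate, timeout=60, interval=1):
    # Poll until the condition holds or the timeout expires; after
    # `ceph mgr fail`, the predicate would check `ceph mgr dump` for an
    # active, available mgr whose modules have finished loading.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Usage sketch: simulate a mgr that becomes available on the third poll.
state = {"polls": 0}
def mgr_available():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(mgr_available, timeout=5, interval=0.01))  # True
```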


Problem 4: snap-schedule tests incorrectly try to remove the subvolume when snapshots exist.
Solution 4: Remove the snapshot schedule before deleting the snapshots so that there's no race between snapshot creation and subvolume removal.


Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>

@github-actions github-actions bot added cephfs Ceph File System pybind tests labels Jul 24, 2024
mchangir added 3 commits July 24, 2024 08:01
Problem: During test tearDown and setUp, there's a small window where the
mgr/volumes cloner thread attempts to fetch the next job from the deleted
filesystem. This causes the mount to hang.

Solution: Conditionally get the next job only if there is a valid fs_map.
This avoids the implicit mount in the get next job path.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
Problem 1: The mgr declares itself as available as soon as it dispatches
code to the finisher to load the python extension modules. This causes the
mgr dump command to return the mgr as available even when the modules aren't
in the active state to handle commands.

Solution 1: Sleep for a minute to allow the mgr finisher to instantiate the
python extension modules so that they can start handling commands.
-----
Problem 2: Mgr uses stale mounts to fetch clone jobs.

Solution 2: Fail the mgr between two tests so that there is a failover and
a state cleanup.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
Problem: snap-schedule tests incorrectly try to remove the subvolume when
snapshots exist.

Solution: Remove the snapshot schedule before deleting the snapshots so
that there's no race between snapshot creation and subvolume removal.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
@mchangir
Contributor Author

The teuthology job set with the fixes has passed with 100% GREEN.

@mchangir
Contributor Author

The state of the PR is that the next job is fetched by the cloner thread if there is a valid fs_map. This was a quick fix for teuthology tests since we usually have only a single volume.

The next job for the specific volume should be fetched only if the name of the volume exists in the fs_map. I'll update on this further.
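That tighter check might look like this (the field names follow the mgr fs_map layout but should be treated as assumptions):

```python
def volume_in_fs_map(fs_map, volname):
    # Fetch a clone job for `volname` only if that volume still exists
    # in the fs_map; a deleted or renamed volume yields no job.
    return any(fs.get("mdsmap", {}).get("fs_name") == volname
               for fs in fs_map.get("filesystems", []))

fs_map = {"filesystems": [{"mdsmap": {"fs_name": "cephfs"}}]}
print(volume_in_fs_map(fs_map, "cephfs"))   # True
print(volume_in_fs_map(fs_map, "deleted"))  # False
```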

@mchangir
Contributor Author

Also, the mgr restarts if the set of enabled modules changes. This is good. But if the fs_map changes, it affects the functioning of the volumes module, as in this case.

So, ideally the mgr should also restart if the number of volumes changes or if the volume is renamed.

```diff
 self.async_job.jobs[m] = []
 self.async_job.q.append(m)
-vol_job = self.async_job.get_job()
+vol_job = self.async_job.get_job() if self.vc.mgr.has_fs_map else None
```
Contributor

This just narrows the race window. IMO, we should check if it's possible to pass a custom mount timeout (something like 10 seconds) for mounts required by mgr/volumes, since it's safe to time out the mount process, as it gets retried. That would prevent the finisher thread in ceph-mgr from being blocked for a long time.
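The reviewer's suggestion could be sketched like this (the mount command and the timeout value are placeholders; only the timeout-and-retry shape matters):

```python
import subprocess
import sys

def try_mount(cmd, timeout=10):
    # Bound the mount attempt: timing out is safe because the clone job
    # gets retried, and it keeps the ceph-mgr finisher thread unblocked.
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False

# Simulated with subprocesses that finish instantly vs. hang:
print(try_mount([sys.executable, "-c", "pass"]))                        # True
print(try_mount([sys.executable, "-c", "import time; time.sleep(30)"],
                timeout=0.5))                                           # False
```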

```diff
 self.configs_set = set()

+if self.last_active_mgr:
+    self.mon_manager.revive_mgr(self.last_active_mgr)
```
Contributor

Doesn't it suffice to fail the manager (done during teardown) and then wait a bit so that it's active and available for servicing requests? What's the need to revive the manager instance that was failed?

@vshankar
Contributor

Not required as it's superseded by #58547. @mchangir would you agree?

@mchangir
Contributor Author

Not required as it's superseded by #58547. @mchangir would you agree?

Yes, correct; this PR is superseded.

@mchangir
Contributor Author

closing this PR

@mchangir mchangir closed this Sep 11, 2024
