fix client mount in mgr/volumes and clean up snapshots in test_snap_schedules.py#58771
Conversation
Problem: During test tearDown and setUp, there's a small window where the mgr/volumes cloner thread attempts to fetch the next job from the deleted filesystem. This causes the mount to hang.

Solution: Conditionally get the next job only if there is a valid fs_map. This avoids the implicit mount in the get-next-job path.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
Problem 1: The mgr declares itself as available as soon as it dispatches code to the finisher to load the python extension modules. This causes the mgr dump command to return the mgr as available even when the modules aren't in the active state to handle commands.

Solution 1: Sleep for a minute to allow the mgr finisher to instantiate the python extension modules so that they can start handling commands.

Problem 2: Mgr uses stale mounts to fetch clone jobs.

Solution 2: Fail the mgr between two tests so that there is a failover and a state cleanup.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
Problem: snap-schedule tests incorrectly try to remove the subvolume while snapshots still exist.

Solution: Remove the snapshot schedule before deleting the snapshots so that there's no race between snapshot creation and subvolume removal.

Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
The Teuthology job set with the fixes has passed 100% GREEN.
The current state of the PR is that the cloner thread fetches the next job only if there is a valid fs_map. This was a quick fix for the teuthology tests, since we usually have only a single volume. Ideally, the next job for a specific volume should be fetched only if that volume's name exists in the fs_map. I'll update on this further.
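The per-volume check described above could look something like this sketch. It assumes the fs_map layout produced by `ceph fs dump --format=json` (a `filesystems` list whose entries carry the name under `mdsmap.fs_name`); `volume_exists` is a hypothetical helper, not code from this PR:

```python
def volume_exists(fs_map, volname):
    """Return True if a volume (filesystem) with the given name is in the fs_map.

    Assumes fs_map is a dict with a 'filesystems' list whose entries
    carry the filesystem name under ['mdsmap']['fs_name'].
    """
    return any(fs.get('mdsmap', {}).get('fs_name') == volname
               for fs in fs_map.get('filesystems', []))
```

With such a helper, the cloner thread would call `get_job()` for a volume only when `volume_exists(fs_map, volname)` holds, instead of merely checking that any fs_map is present.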
Also, the mgr restarts if the set of enabled modules changes, which is good. But if the fs_map changes, it affects the functioning of the volumes module, as in this case. So ideally the mgr should also restart if the number of volumes changes or a volume is renamed.
```diff
         self.async_job.jobs[m] = []
         self.async_job.q.append(m)
-        vol_job = self.async_job.get_job()
+        vol_job = self.async_job.get_job() if self.vc.mgr.has_fs_map else None
```
This just narrows the race window. IMO, we should check if it's possible to pass a custom mount timeout (something like 10 seconds) for mounts required by mgr/volumes, since it's safe to time out the mount process because it gets retried. That would prevent the finisher thread in ceph-mgr from being blocked for a long time.
```diff
         self.configs_set = set()
+
+        if self.last_active_mgr:
+            self.mon_manager.revive_mgr(self.last_active_mgr)
```
Doesn't it suffice to fail the manager (done during teardown) and then wait a bit so that it's active and available for servicing requests? What's the need to revive the manager instance that was failed?
Closing this PR.
Problem 1: During test tearDown and setUp, there's a small window where the mgr/volumes cloner thread attempts to fetch the next job from the deleted filesystem. This causes the mount to hang.
Solution 1: Conditionally get the next job only if there is a valid fs_map. This avoids the implicit mount in the get next job path.
Problem 2: The mgr declares itself as available as soon as it dispatches code to the finisher to load the python extension modules. This causes the `mgr dump` command to return the mgr as `available` even when the modules aren't in the `active` state to handle commands.

Solution 2: Sleep for a minute to allow the mgr finisher to instantiate the python extension modules so that they can start handling commands.
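The fixed one-minute sleep could alternatively be expressed as a bounded poll. A minimal sketch (`wait_until` is a hypothetical helper, not part of this PR; the predicate passed in would be whatever readiness check the test can perform, e.g. inspecting `mgr dump` output):

```python
import time

def wait_until(predicate, timeout=60, interval=5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded within the deadline, else False.
    A test could pass a predicate that checks mgr module readiness rather
    than sleeping unconditionally for the full minute.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

This returns as soon as the condition holds, so the common case is faster than a flat sleep while the worst case is still bounded by `timeout`.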
Problem 3: Mgr uses stale mounts to fetch clone jobs.
Solution 3: Fail the mgr between two tests so that there is a failover and a state cleanup.
Problem 4: snap-schedule tests incorrectly try to remove the subvolume when snapshots exist.
Solution 4: Remove the snapshot schedule before deleting the snapshots so that there's no race between snapshot creation and subvolume removal.
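The teardown ordering in Solution 4 can be sketched like this. It is a hypothetical wrapper around the ceph CLI; `run` stands in for however the test harness executes commands, and the exact argument forms may differ from the real suite:

```python
def teardown_scheduled_subvolume(run, volname, subvol, sched_path, snapshots):
    """Tear down a subvolume that has an active snapshot schedule.

    Order matters: remove the schedule first so no new snapshot is
    created while the existing snapshots and the subvolume are removed.
    """
    # 1. Stop the schedule so snapshot creation can no longer race with removal.
    run(['ceph', 'fs', 'snap-schedule', 'remove', sched_path])
    # 2. Delete the existing snapshots.
    for snap in snapshots:
        run(['ceph', 'fs', 'subvolume', 'snapshot', 'rm', volname, subvol, snap])
    # 3. Only now remove the subvolume itself.
    run(['ceph', 'fs', 'subvolume', 'rm', volname, subvol])
```

Reversing steps 1 and 2 reintroduces the race: the scheduler can create a fresh snapshot between the last snapshot deletion and the subvolume removal.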
Fixes: https://tracker.ceph.com/issues/66009
Signed-off-by: Milind Changire <mchangir@redhat.com>
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.