mgr/volumes: fetch trash and clone entries without blocking volume access#33413
batrick merged 3 commits into ceph:master
ajarr
left a comment
I don't completely follow the deadlock issue. Is it between the main command dispatcher thread and the purge queue thread? Looking at the remove_subvolume() method (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/volumes/fs/volume.py#L153),
after acquiring the volume lock, it renames the subvolume and kicks off the purge thread. It doesn't wait on the purge threads.
qa/tasks/cephfs/test_volumes.py
Outdated
        # verify trash dir is clean
        self._wait_for_trash_empty()

    def test_async_subvolume_rm_large(self):
This test passed for me locally even without the change below.
It's a best-effort test to validate the fix, since it's a lock acquisition race.
By the way, I think we can just stick to removing a large number of subvolumes in the original async rm test.
qa/tasks/cephfs/test_volumes.py
Outdated
        self.mount_a.mount()

        # verify trash dir is clean
        self._wait_for_trash_empty(timeout=300)
There is a bug in the _wait_for_trash_empty() method: it doesn't pass the timeout argument down to the _wait_for_dir_empty() method. Please correct that in this PR.
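To illustrate the kind of fix being requested, here is a minimal sketch, not the actual test code. The function names mirror the helpers mentioned above, but their bodies, the `list_entries` callback, and the polling interval are assumptions made for illustration only.

```python
import time


def wait_for_dir_empty(list_entries, timeout=30, interval=0.01):
    """Hypothetical stand-in for _wait_for_dir_empty(): poll until
    list_entries() returns nothing, or raise once `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while list_entries():
        if time.monotonic() >= deadline:
            raise RuntimeError("directory not empty after {}s".format(timeout))
        time.sleep(interval)


def wait_for_trash_empty(list_trash, timeout=30):
    """Hypothetical stand-in for _wait_for_trash_empty()."""
    # The requested fix: forward the caller's timeout instead of silently
    # falling back to the inner helper's default.
    wait_for_dir_empty(list_trash, timeout=timeout)
```

With the timeout forwarded, a call such as `wait_for_trash_empty(list_trash, timeout=300)` actually waits up to 300 seconds rather than the inner default.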
Right. The command dispatcher thread, which already holds the volume lock, calls queue_job(), which acquires AsyncJobs::lock. Meanwhile, the purge threads acquire AsyncJobs::lock and then access the volume in exclusive mode.
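The lock-order inversion described above can be reproduced in isolation. The following is a sketch under stated assumptions: `volume_lock` and `jobs_lock` are plain `threading.Lock` stand-ins for the volume-access lock and AsyncJobs::lock, the thread names are illustrative, and timeouts are used so the demo reports the inversion instead of hanging.

```python
import threading


def run_inverted(timeout=0.2):
    volume_lock = threading.Lock()  # stand-in for the global volume-access lock
    jobs_lock = threading.Lock()    # stand-in for AsyncJobs::lock
    both_hold_first = threading.Barrier(2)
    both_tried = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Wait until each thread holds its first lock.
            both_hold_first.wait()
            # Try the other lock; a real deadlock would block here forever.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            # Keep holding the first lock until both threads have tried.
            both_tried.wait()
            if acquired:
                second.release()

    threads = [
        # Dispatcher: volume lock first, then the jobs lock (via queue_job()).
        threading.Thread(target=worker, args=("dispatcher", volume_lock, jobs_lock)),
        # Purge thread: jobs lock first, then the volume lock.
        threading.Thread(target=worker, args=("purge", jobs_lock, volume_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second
```

Neither thread can obtain its second lock while the other holds it, so both acquisitions time out. Without the timeouts, both threads would block indefinitely.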
Saw a deadlock when deleting a lot of subvolumes: purge threads were stuck acquiring the global lock for volume access. This can happen when there is a concurrent remove (which renames the subvolume and signals the purge threads) while a purge thread is just about to scan the trash directory for entries. As a fix, purge threads now fetch entries by accessing the volume in lockless mode. This is functionally safe because the filesystem correctly handles a rename racing with a directory scan. In the worst case, the purge thread picks up the trash entry on its next scan, so a stale trash entry is never left behind. Signed-off-by: Venky Shankar <vshankar@redhat.com>
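The "picked up on the next scan" argument can be sketched with a local directory. This is only an illustration under assumptions: it uses Python's `os` module on a temporary directory rather than the CephFS bindings used by mgr/volumes, and `scan_trash`, the `subvolumes` directory, and the `sv1` entry are hypothetical names.

```python
import os
import tempfile


def scan_trash(trash_dir):
    """Return the names of trash entries, sorted. No lock is taken: the scan
    relies on the filesystem to keep rename and readdir consistent."""
    with os.scandir(trash_dir) as entries:
        return sorted(e.name for e in entries if e.is_dir())


def demo():
    with tempfile.TemporaryDirectory() as root:
        subvols = os.path.join(root, "subvolumes")
        trash = os.path.join(root, "_deleting")
        os.makedirs(subvols)
        os.makedirs(trash)
        os.mkdir(os.path.join(subvols, "sv1"))

        # The purge thread scans just before the remove renames the subvolume.
        first = scan_trash(trash)
        # The concurrent remove moves the subvolume into the trash directory.
        os.rename(os.path.join(subvols, "sv1"), os.path.join(trash, "sv1"))
        # The entry missed by the first scan shows up on the next one.
        second = scan_trash(trash)
        return first, second
```

Here the first scan sees an empty trash directory and the second sees `sv1`, which matches the worst case the commit message describes.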
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Fixes: http://tracker.ceph.com/issues/44207 Signed-off-by: Venky Shankar <vshankar@redhat.com>
@batrick http://pulpito.ceph.com/vshankar-2020-02-24_18:41:47-fs-wip-vshankar-testing-2020-02-24-102202-testing-basic-smithi/ EDIT: "-N 10" for repeat runs
The test failures occur because the file system was deleted and the cleanup connection thread hangs. That is not related to this PR, so I will merge. Thanks Venky!
Normally, the cleanup thread would know that the volume no longer exists. If the volume got removed after the cleanup thread fetched the fs handle, the thread can hang.