mgr/volumes: fetch trash and clone entries without blocking volume access by vshankar · Pull Request #33413 · ceph/ceph

vshankar · 2020-02-19T14:30:30Z

No description provided.

ajarr

I don't completely follow the deadlock issue. Is it between the main command dispatcher thread and the purge queue thread? Looking at the remove_subvolume() method https://github.com/ceph/ceph/blob/master/src/pybind/mgr/volumes/fs/volume.py#L153
after acquiring the volume lock, it renames the subvolume and kick-starts the purge thread. It doesn't wait on the purge threads.

ajarr · 2020-02-20T13:25:11Z

qa/tasks/cephfs/test_volumes.py

        # verify trash dir is clean
        self._wait_for_trash_empty()
+
+    def test_async_subvolume_rm_large(self):


This test passed for me locally even without the change below.

It's a best-effort test to validate the fix, since it's a lock acquisition race.

btw, I think we can just stick to removing a large number of subvolumes in the original async rm test.

ajarr · 2020-02-20T13:27:12Z

qa/tasks/cephfs/test_volumes.py

+        self.mount_a.mount()
+
+        # verify trash dir is clean
+        self._wait_for_trash_empty(timeout=300)


There is a bug in the _wait_for_trash_empty() method. It doesn't pass the timeout argument down to the _wait_for_dir_empty() method. Please correct that in this PR.
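The bug pattern described in this comment can be sketched in isolation as follows. This is only an illustration, not the code from qa/tasks/cephfs: the function names, the polling loop, and the default values below are hypothetical stand-ins. The point it shows is that a wrapper which accepts `timeout` must forward it, otherwise callers passing `timeout=300` silently get the default.

```python
import os
import time


def wait_for_dir_empty(path, timeout=30, interval=0.1):
    """Hypothetical helper: poll `path` until it has no entries.

    Raises RuntimeError if the directory is still non-empty after
    `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while os.listdir(path):
        if time.monotonic() >= deadline:
            raise RuntimeError(f"timed out waiting for {path} to become empty")
        time.sleep(interval)


def wait_for_trash_empty_buggy(trashdir, timeout=30):
    # Bug pattern: `timeout` is accepted but never forwarded, so the
    # helper always falls back to its own default.
    wait_for_dir_empty(trashdir)


def wait_for_trash_empty(trashdir, timeout=30):
    # Fixed pattern: forward the caller's timeout to the helper.
    wait_for_dir_empty(trashdir, timeout=timeout)
```

With the fixed wrapper, a call such as `wait_for_trash_empty(trashdir, timeout=300)` actually waits up to 300 seconds, which is what a test that removes many subvolumes needs.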

vshankar · 2020-02-24T04:30:33Z

I don't completely follow the deadlock issue. Is it between the main command dispatcher thread and the purge queue thread? Looking at the remove_subvolume() method https://github.com/ceph/ceph/blob/master/src/pybind/mgr/volumes/fs/volume.py#L153

Right -- the command dispatcher thread calls ->queue_job(), which acquires AsyncJobs::lock, while the purge threads acquire this lock and then access the volume in exclusive mode.

Saw a deadlock when deleting a lot of subvolumes -- purge threads were stuck acquiring the global lock for volume access. This can happen when there is a concurrent remove (which renames and signals the purge threads) and a purge thread is just about to scan the trash directory for entries.

For the fix, purge threads fetch entries by accessing the volume in lockless mode. This is safe from a functionality point of view, as the rename and directory scan are correctly handled by the filesystem. In the worst case, the purge thread picks up the trash entry on its next scan, so no stale trash entry is ever left behind.

Signed-off-by: Venky Shankar <vshankar@redhat.com>

Signed-off-by: Venky Shankar <vshankar@redhat.com>

Fixes: http://tracker.ceph.com/issues/44207
Signed-off-by: Venky Shankar <vshankar@redhat.com>
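The lock interaction discussed above can be illustrated with a small sketch. It is not the mgr/volumes implementation: `volume_lock` and `jobs_lock` are plain `threading.Lock` stand-ins for the global volume access lock and AsyncJobs::lock, the function names are hypothetical, and a local directory stands in for the CephFS trash directory. The assumption is that the dispatcher takes the volume lock and then the jobs lock, so a purge thread that holds the jobs lock and then waits for the volume lock would invert that order. The sketch shows the shape of the fix: the purge side scans the trash directory without taking the volume lock.

```python
import os
import threading

volume_lock = threading.Lock()  # stand-in for the global volume access lock
jobs_lock = threading.Lock()    # stand-in for AsyncJobs::lock


def remove_subvolume(subvol_path, trash_dir):
    """Dispatcher side: rename into trash under the volume lock, then queue.

    Lock order here is volume_lock -> jobs_lock.
    """
    with volume_lock:
        target = os.path.join(trash_dir, os.path.basename(subvol_path))
        os.rename(subvol_path, target)
        with jobs_lock:
            pass  # a real queue_job() would signal the purge threads here


def next_trash_entry(trash_dir):
    """Purge side: called with jobs_lock held.

    Deliberately does NOT take volume_lock; doing so would give the order
    jobs_lock -> volume_lock, the inverse of remove_subvolume(). The scan
    relies on the filesystem handling a concurrent rename correctly.
    """
    with os.scandir(trash_dir) as entries:
        for entry in entries:
            return entry.name
    return None


def purge_once(trash_dir):
    """Fetch one trash entry under jobs_lock and remove it (empty dirs only)."""
    with jobs_lock:
        name = next_trash_entry(trash_dir)
    if name is not None:
        os.rmdir(os.path.join(trash_dir, name))
    return name
```

If a rename lands just after the scan, `purge_once()` simply returns `None` for that pass and the entry is picked up on the next call, which mirrors the "worst case" described in the commit message.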

vshankar · 2020-02-24T18:47:27Z

@batrick http://pulpito.ceph.com/vshankar-2020-02-24_18:41:47-fs-wip-vshankar-testing-2020-02-24-102202-testing-basic-smithi/

EDIT: "-N 10" for repeat runs

batrick · 2020-02-24T18:58:38Z

https://tracker.ceph.com/issues/44276

batrick · 2020-02-25T02:17:25Z

Test failures are because the file system was deleted and the cleanup connection thread hangs. Not related to this PR so I will merge. Thanks Venky!

vshankar · 2020-02-25T02:43:22Z

Test failures are because the file system was deleted and the cleanup connection thread hangs. Not related to this PR so I will merge. Thanks Venky!

Normally, the cleanup thread would know that the volume no longer exists. If the volume is removed after the cleanup thread has fetched the fs handle, the thread can hang.

batrick · 2020-02-25T03:46:05Z

https://tracker.ceph.com/issues/44281

vshankar added the cephfs Ceph File System label Feb 19, 2020

vshankar requested review from ajarr and batrick February 19, 2020 14:30

ajarr reviewed Feb 20, 2020

vshankar added 3 commits February 24, 2020 04:27

test: pass timeout argument to mount::wait_for_dir_empty()

5ec09a2

Signed-off-by: Venky Shankar <vshankar@redhat.com>

test: verify purge queue w/ large number of subvolumes

92b2008

Fixes: http://tracker.ceph.com/issues/44207
Signed-off-by: Venky Shankar <vshankar@redhat.com>

vshankar force-pushed the wip-44207 branch from 66aab71 to 92b2008 Compare February 24, 2020 09:27

vshankar added wip-vshankar-testing needs-review labels Feb 24, 2020

batrick approved these changes Feb 25, 2020

batrick merged commit 0908d48 into ceph:master Feb 25, 2020

vshankar deleted the wip-44207 branch February 25, 2020 03:11

vshankar mentioned this pull request Feb 28, 2020

mgr/volumes: allow canceling in-progress/pending clones #33532

Closed

batrick merged 3 commits into ceph:master from vshankar:wip-44207