
qa/tasks/cephfs/test_scrub.py: use umount_wait to avoid possible ceph-fuse daemon stuck #35328

Merged
batrick merged 1 commit into ceph:master from lxbsz:stuck_mds
Jun 3, 2020

Conversation

lxbsz commented Jun 1, 2020

If the ceph-fuse client needs to flush the caps and does a sync wait,
umount() will still return successfully right away. The netns container
is then destroyed and the network becomes unreachable, but the
ceph-fuse daemon is still stuck waiting for the flush caps ack.

This leaves the ceph-fuse daemon stuck forever. If the MDS daemons
are restarted, they will try to reconnect the clients, but the stuck
ceph-fuse daemon won't reply, because the ceph-fuse client is no
longer reachable.
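
The ordering problem described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `FakeFuseMount`, `teardown_netns`, and `run` are hypothetical names invented for the illustration, and a background thread stands in for the ceph-fuse daemon waiting on the flush caps ack. It is not the real qa mount code.

```python
import threading
import time


class FakeFuseMount:
    """Hypothetical stand-in for a ceph-fuse mount (not the real qa class)."""

    def __init__(self, flush_seconds=0.2):
        self.flush_seconds = flush_seconds
        self.network_up = True
        self.daemon_exited_cleanly = None
        self._daemon = None

    def _daemon_shutdown(self):
        # Simulate the daemon waiting for the MDS to ack the cap flush.
        time.sleep(self.flush_seconds)
        # The ack can only arrive while the network namespace still exists.
        self.daemon_exited_cleanly = self.network_up

    def umount(self):
        # Request the unmount and return immediately, without waiting.
        self._daemon = threading.Thread(target=self._daemon_shutdown)
        self._daemon.start()

    def umount_wait(self):
        # Request the unmount, then block until the daemon has finished.
        self.umount()
        self._daemon.join()

    def teardown_netns(self):
        # Simulate destroying the netns container: the network goes away.
        self.network_up = False


def run(use_wait):
    """Unmount, tear down the network, and report whether the flush was acked."""
    mount = FakeFuseMount()
    if use_wait:
        mount.umount_wait()
    else:
        mount.umount()
    mount.teardown_netns()
    # Joined here only so the simulation can report the outcome.
    mount._daemon.join()
    return mount.daemon_exited_cleanly


if __name__ == "__main__":
    print("umount_wait acked:", run(use_wait=True))
    print("umount acked:", run(use_wait=False))
```

In the simulation, the non-waiting path tears the network down while the flush is still pending, so the ack never arrives. The waiting path lets the daemon finish before the network is removed.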

Fixes: https://tracker.ceph.com/issues/45665
Signed-off-by: Xiubo Li <xiubli@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

lxbsz requested review from batrick and fullerdj June 1, 2020 02:07
lxbsz changed the title from "qa/tasks/cephfs/test_scrub.py: use umount_wait to avoid possible mds stuck" to "qa/tasks/cephfs/test_scrub.py: use umount_wait to avoid possible ceph-fuse daemon stuck" Jun 1, 2020

2 participants