Bug #66077

closed

qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname

Added by Rishabh Dave almost 2 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Testing
Target version:
-
% Done:

0%

Source:
Q/A
Backport:
squid,reef,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa, qa-failure
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3413-g3f4aee27ee
Released In:
v20.2.0~2501
Upkeep Timestamp:
2025-11-01T01:36:12+00:00

Description

https://pulpito.ceph.com/rishabh-2024-05-16_08:48:59-fs:functional-main-testing-default-smithi/7709044

2024-05-16T09:30:58.167 DEBUG:teuthology.orchestra.run.smithi116:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd blocklist ls
...
2024-05-16T09:30:58.479 INFO:teuthology.orchestra.run.smithi116.stderr:2024-05-16T09:30:58.480+0000 7f2841f51640  1 -- 172.21.15.116:0/3567133676 wait complete.
2024-05-16T09:30:58.489 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.1...
2024-05-16T09:30:58.489 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-05-16T09:30:58.489 DEBUG:teuthology.orchestra.run.smithi149:> sudo umount /home/ubuntu/cephtest/mnt.1
2024-05-16T09:31:05.113 DEBUG:teuthology.orchestra.run.smithi116:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:31:05.116 DEBUG:teuthology.orchestra.run.smithi149:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:31:35.177 DEBUG:teuthology.orchestra.run.smithi116:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...
2024-05-16T09:35:36.295 DEBUG:teuthology.orchestra.run.smithi149:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:35:58.494 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 745, in recv_stderr
    out = self.in_stderr_buffer.read(nbytes, self.timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/buffered_pipe.py", line 154, in read
    raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 323, in copy_file_to
    copy_to_log(src, logger, capture=stream, quiet=quiet)
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 276, in copy_to_log
    for line in f:
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/file.py", line 109, in __next__
    line = self.readline()
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/file.py", line 275, in readline
    new_data = self._read(n)
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 1374, in _read
    return self.channel.recv_stderr(size)
  File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 747, in recv_stderr
    raise socket.timeout()
socket.timeout
2024-05-16T09:35:58.496 ERROR:teuthology:Uncaught exception (Hub)

Related issues: 5 (0 open, 5 closed)

Related to CephFS - Bug #66160: qa/cephfs: allow max tests to run in test_admin.py (Resolved, Rishabh Dave)

Related to CephFS - Bug #66946: qa/cephfs: unmount hangs after test_fs_rename_fails_for_non_existent_fs (Resolved, Xiubo Li)

Copied to CephFS - Backport #66930: quincy: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname (Rejected, Rishabh Dave)

Copied to CephFS - Backport #66931: reef: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname (Resolved, Rishabh Dave)

Copied to CephFS - Backport #66932: squid: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname (Resolved, Rishabh Dave)
Actions #1

Updated by Rishabh Dave almost 2 years ago

  • Component(FS) qa-suite added
  • Labels (FS) qa, qa-failure added
Actions #2

Updated by Venky Shankar almost 2 years ago

  • Category set to Testing
  • Status changed from New to Triaged
  • Assignee set to Rishabh Dave
  • Priority changed from Normal to High
  • Source set to Q/A

This run, too, has no daemon logs, similar to the other runs in https://pulpito.ceph.com/rishabh-2024-05-16_08:48:59-fs:functional-main-testing-default-smithi/

@Rishabh Dave - is this reproducible?

Actions #3

Updated by Rishabh Dave almost 2 years ago

  • Related to Bug #66160: qa/cephfs: allow max tests to run in test_admin.py added
Actions #4

Updated by Venky Shankar over 1 year ago

@Rishabh Dave Let's get an update/RCA on this soon.

Actions #5

Updated by Rishabh Dave over 1 year ago · Edited

test_single_path_authorize_on_nonalphanumeric_fsname has 2 clients -- self.mount_a and self.mount_b. At the beginning of the test, the former client is unmounted, the CephFS already available on the cluster is deleted, a new FS is created, self.mount_a mounts this new FS, and the rest of the test proceeds on it. During teardown, an attempt is made to unmount both clients. The former is unmounted smoothly, but unmounting the latter client hangs and causes the test to crash.

There are 2 ways to prevent this: unmount self.mount_b at the beginning of the test and keep only 1 client throughout, or remount self.mount_b on the new FS. The former is better since only one client is required throughout the test; the second client is redundant. Code for it is posted here - https://github.com/ceph/ceph/pull/58311.
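The teardown-order problem described above can be sketched as a small, self-contained mock (this is illustrative only, not the code in PR 58311; StubMount and its methods are hypothetical stand-ins for the qa/cephfs mount helpers, whose real unmount call is umount_wait on self.mount_a/self.mount_b):

```python
class StubMount:
    """Hypothetical stand-in for a qa/cephfs mount object (kernel or FUSE client)."""
    def __init__(self, name):
        self.name = name
        self.mounted = True
        self.fs_alive = True   # whether the FS backing this mount still exists

    def umount_wait(self):
        # In the real harness, unmounting a kernel client whose FS has been
        # deleted (MDS gone) can hang; model that as a TimeoutError here.
        if not self.fs_alive:
            raise TimeoutError(f"umount of {self.name} hung: FS deleted")
        self.mounted = False

def run_test(unmount_extra_client_first: bool):
    mount_a, mount_b = StubMount("client.0"), StubMount("client.1")
    if unmount_extra_client_first:
        mount_b.umount_wait()        # the fix: drop the unused client early
    # The test deletes the pre-existing FS, which dies under both clients...
    for m in (mount_a, mount_b):
        m.fs_alive = False
    # ...then mount_a remounts the newly created FS and testing proceeds.
    mount_a.mounted = True
    mount_a.fs_alive = True
    # Teardown: unmount whatever is still mounted.
    for m in (mount_a, mount_b):
        if m.mounted:
            m.umount_wait()

run_test(unmount_extra_client_first=True)     # completes cleanly
try:
    run_test(unmount_extra_client_first=False)
except TimeoutError as e:
    print("reproduced:", e)                   # teardown blocks on client.1
```

With the extra client released up front, teardown never touches a mount whose file system was deleted underneath it, which is the essence of the first approach.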

However, this patch doesn't seem to be the right approach to fix this. Since, without the patch, the test passes with the FUSE client (1) and fails with the kernel client (2), the bug lies in the kclient rather than in this test code.

From (1) -

The issue is the same as what was seen originally -

2024-06-27T17:14:38.903 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.1...
2024-06-27T17:14:38.903 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-06-27T17:14:38.903 DEBUG:teuthology.orchestra.run.smithi159:> sudo umount /home/ubuntu/cephtest/mnt.1
2024-06-27T17:14:55.246 DEBUG:teuthology.orchestra.run.smithi067:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...(repetitive logrotate entries removed)
2024-06-27T17:19:26.317 DEBUG:teuthology.orchestra.run.smithi159:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-06-27T17:19:38.907 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_544fecbcd55f3d2b6f478823823ce40cbefef1d4/virtualenv/lib/python3.10/site-packages/paramiko/channel.py", line 745, in recv_stderr
    out = self.in_stderr_buffer.read(nbytes, self.timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_544fecbcd55f3d2b6f478823823ce40cbefef1d4/virtualenv/lib/python3.10/site-packages/paramiko/buffered_pipe.py", line 154, in read
    raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout

It was reproducible with vstart_runner originally, but it isn't anymore, perhaps because the kernel has been upgraded since then.

@Venky Shankar
cc @Xiubo Li

(1) https://pulpito.ceph.com/rishabh-2024-06-27_16:29:08-fs:functional-main-testing-default-smithi/7776315
(2) https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312

Actions #7

Updated by Rishabh Dave over 1 year ago

the bug lies in kclient rather than this test code.

If this is correct, I propose that we add a separate test case that reproduces this kclient bug. In the meantime, we should remove the unused client (at least temporarily) from this test, so that the test can run as usual when PRs are tested against the main branch. This ensures that test coverage is not reduced while the kclient is being fixed.

Actions #8

Updated by Venky Shankar over 1 year ago · Edited

@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/

@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.

Actions #9

Updated by Rishabh Dave over 1 year ago

Venky Shankar wrote in #note-8:

@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/

@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.

Xiubo didn't try to reproduce this (AFAIK). He checked whether a CephFS that was mounted using the testing kernel can be unmounted after the MDS has been killed, without hanging, crashing, or using --force or --lazy.

Actions #10

Updated by Venky Shankar over 1 year ago

Rishabh Dave wrote in #note-9:

Venky Shankar wrote in #note-8:

@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/

@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.

Xiubo didn't try to reproduce this (AFAIK). He checked whether a CephFS that was mounted using the testing kernel can be unmounted after the MDS has been killed, without hanging, crashing, or using --force or --lazy.

OK, I misunderstood our conversation then. When the MDS is not reachable, the user should pass --force to allow unmounting; otherwise umount would hang due to the client trying to flush data to the MDS.

So, it's pretty obvious now that the client that's left mounted after the file system is deleted is the one blocked on umount. So, either run the test with one client or umount the extra client. Would that suffice, @Rishabh Dave?
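The umount escalation described above (plain unmount, then --force, then --lazy when the MDS is unreachable) can be sketched like this. This is an illustrative helper, not what teuthology actually runs; umount_attempts and umount_with_fallback are hypothetical names, while the -f and -l flags are the standard util-linux umount options:

```python
import subprocess

def umount_attempts(mountpoint: str):
    """Escalating unmount commands: plain, then forced, then lazy detach."""
    return [
        ["umount", mountpoint],        # may hang flushing data to an unreachable MDS
        ["umount", "-f", mountpoint],  # -f: force unmount, abort pending requests
        ["umount", "-l", mountpoint],  # -l: lazy detach, clean up once unused
    ]

def umount_with_fallback(mountpoint: str, timeout: int = 300):
    """Try each unmount variant in turn; return the command that succeeded."""
    for cmd in umount_attempts(mountpoint):
        try:
            subprocess.run(cmd, check=True, timeout=timeout)
            return cmd
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            continue
    raise RuntimeError(f"could not unmount {mountpoint}")
```

The 300-second timeout mirrors the "Running command with timeout 300" seen in the teuthology logs above; the fallback steps are the manual recovery a user would apply, not a suggestion that the qa suite should mask the underlying kclient bug this way.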

Actions #11

Updated by Rishabh Dave over 1 year ago

So, either run the test with one client or umount the extra client. Would that suffice, @Rishabh Dave?

Yes. I've already implemented this approach and tested this. See: https://tracker.ceph.com/issues/66077#note-6

Actions #12

Updated by Rishabh Dave over 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 58311
Actions #13

Updated by Rishabh Dave over 1 year ago

  • Backport set to squid,reef,quincy
Actions #14

Updated by Rishabh Dave over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Rishabh Dave over 1 year ago

  • Copied to Backport #66930: quincy: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Actions #16

Updated by Rishabh Dave over 1 year ago

  • Copied to Backport #66931: reef: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Actions #17

Updated by Rishabh Dave over 1 year ago

  • Copied to Backport #66932: squid: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Actions #18

Updated by Rishabh Dave over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #19

Updated by Xiubo Li over 1 year ago

  • Related to Bug #66946: qa/cephfs: unmount hangs after test_fs_rename_fails_for_non_existent_fs added
Actions #20

Updated by Upkeep Bot 9 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp set to 2025-07-08T18:35:53+00:00
Actions #21

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 3f4aee27ee08821e287d808e7b9dc5f90136b531
  • Fixed In set to v19.3.0-3413-g3f4aee27ee
  • Upkeep Timestamp changed from 2025-07-08T18:35:53+00:00 to 2025-08-02T04:51:15+00:00
Actions #22

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2501
  • Upkeep Timestamp changed from 2025-08-02T04:51:15+00:00 to 2025-11-01T01:36:12+00:00