Bug #66077
qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname
Status: Closed
Description
2024-05-16T09:30:58.167 DEBUG:teuthology.orchestra.run.smithi116:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd blocklist ls
...
2024-05-16T09:30:58.479 INFO:teuthology.orchestra.run.smithi116.stderr:2024-05-16T09:30:58.480+0000 7f2841f51640 1 -- 172.21.15.116:0/3567133676 wait complete.
2024-05-16T09:30:58.489 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.1...
2024-05-16T09:30:58.489 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-05-16T09:30:58.489 DEBUG:teuthology.orchestra.run.smithi149:> sudo umount /home/ubuntu/cephtest/mnt.1
2024-05-16T09:31:05.113 DEBUG:teuthology.orchestra.run.smithi116:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:31:05.116 DEBUG:teuthology.orchestra.run.smithi149:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:31:35.177 DEBUG:teuthology.orchestra.run.smithi116:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...
2024-05-16T09:35:36.295 DEBUG:teuthology.orchestra.run.smithi149:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-05-16T09:35:58.494 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 745, in recv_stderr
out = self.in_stderr_buffer.read(nbytes, self.timeout)
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/buffered_pipe.py", line 154, in read
raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 323, in copy_file_to
copy_to_log(src, logger, capture=stream, quiet=quiet)
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/teuthology/orchestra/run.py", line 276, in copy_to_log
for line in f:
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/file.py", line 109, in __next__
line = self.readline()
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/file.py", line 275, in readline
new_data = self._read(n)
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 1374, in _read
return self.channel.recv_stderr(size)
File "/home/teuthworker/src/git.ceph.com_teuthology_1ae7ad82388e92a475afff437d49054826c019a1/virtualenv/lib/python3.8/site-packages/paramiko/channel.py", line 747, in recv_stderr
raise socket.timeout()
socket.timeout
2024-05-16T09:35:58.496 ERROR:teuthology:Uncaught exception (Hub)
Updated by Rishabh Dave almost 2 years ago
- Component(FS) qa-suite added
- Labels (FS) qa, qa-failure added
Updated by Venky Shankar almost 2 years ago
- Category set to Testing
- Status changed from New to Triaged
- Assignee set to Rishabh Dave
- Priority changed from Normal to High
- Source set to Q/A
This run, too, has no daemon logs, similar to the other runs in https://pulpito.ceph.com/rishabh-2024-05-16_08:48:59-fs:functional-main-testing-default-smithi/
@Rishabh Dave - is this reproducible?
Updated by Rishabh Dave almost 2 years ago
- Related to Bug #66160: qa/cephfs: allow max tests to run in test_admin.py added
Updated by Venky Shankar over 1 year ago
@Rishabh Dave Let's get an update/RCA on this soon.
Updated by Rishabh Dave over 1 year ago · Edited
test_single_path_authorize_on_nonalphanumeric_fsname has 2 clients -- self.mount_a and self.mount_b. At the beginning of the test, the former client is unmounted, the CephFS already present on the cluster is deleted, a new FS is created, self.mount_a mounts this new FS, and the rest of the testing proceeds on it. During teardown, an attempt is made to unmount both clients. The former unmounts smoothly, but unmounting the latter hangs and causes the test to crash.
There are 2 ways to prevent this: unmount self.mount_b at the beginning of the test and keep only 1 client throughout, or remount self.mount_b on the new FS. The former is better, since only one client is required throughout the test and the second client is redundant. Code for it is posted here - https://github.com/ceph/ceph/pull/58311.
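The failure mode and the proposed fix can be sketched as a toy model: a client that stays mounted on a file system which is then deleted has no MDS left to flush to, so its later unmount would block, whereas unmounting the redundant client up front avoids ever reaching that state. This is a simplified illustration only -- the class and method names are invented and are not the actual qa framework code or the code in PR 58311.

```python
# Toy model of the test flow (hypothetical names, not the real qa code).
class MockClient:
    """Stand-in for a CephFS mount object."""
    def __init__(self, name):
        self.name = name
        self.mounted = True
        self.fs_deleted = False   # set when the backing FS is removed

    def umount(self):
        # A real kernel client would hang here if its FS is gone;
        # the mock just reports whether the unmount would block.
        if self.fs_deleted:
            return "would-hang"
        self.mounted = False
        return "ok"

def run_test(unmount_b_first):
    mount_a, mount_b = MockClient("client.0"), MockClient("client.1")
    mount_a.umount()                  # the test always unmounts mount_a first
    if unmount_b_first:
        mount_b.umount()              # the fix: drop the redundant client early
    # delete the old FS; any client still mounted on it is now stranded
    for m in (mount_a, mount_b):
        if m.mounted:
            m.fs_deleted = True
    mount_a.mounted = True            # mount_a remounts the new FS
    mount_a.fs_deleted = False
    # teardown: unmount everything still mounted
    return [m.umount() for m in (mount_a, mount_b) if m.mounted]

print(run_test(unmount_b_first=False))  # ['ok', 'would-hang']
print(run_test(unmount_b_first=True))   # ['ok']
```

With the old flow the teardown hits the stranded second client; with the fix only one client remains and teardown completes cleanly.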
However, this patch doesn't seem to be the right approach to fix this. Since without the patch the test passes with the FUSE client (1) and fails with the kernel client (2), the bug lies in the kclient rather than in this test code.
From (1) -
The issue is same as what was seen originally -
2024-06-27T17:14:38.903 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.1...
2024-06-27T17:14:38.903 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-06-27T17:14:38.903 DEBUG:teuthology.orchestra.run.smithi159:> sudo umount /home/ubuntu/cephtest/mnt.1
2024-06-27T17:14:55.246 DEBUG:teuthology.orchestra.run.smithi067:> sudo logrotate /etc/logrotate.d/ceph-test.conf
...(repetitive logrotate entries removed)
2024-06-27T17:19:26.317 DEBUG:teuthology.orchestra.run.smithi159:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-06-27T17:19:38.907 ERROR:teuthology:Uncaught exception (Hub)
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_544fecbcd55f3d2b6f478823823ce40cbefef1d4/virtualenv/lib/python3.10/site-packages/paramiko/channel.py", line 745, in recv_stderr
out = self.in_stderr_buffer.read(nbytes, self.timeout)
File "/home/teuthworker/src/git.ceph.com_teuthology_544fecbcd55f3d2b6f478823823ce40cbefef1d4/virtualenv/lib/python3.10/site-packages/paramiko/buffered_pipe.py", line 154, in read
raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout
It was originally reproducible with vstart_runner, but it isn't anymore -- perhaps because the kernel has been upgraded since then.
(1) https://pulpito.ceph.com/rishabh-2024-06-27_16:29:08-fs:functional-main-testing-default-smithi/7776315
(2) https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312
Updated by Rishabh Dave over 1 year ago
The patch at https://github.com/ceph/ceph/pull/58311 works fine -
kernel: https://pulpito.ceph.com/rishabh-2024-06-27_16:29:14-fs:functional-main-testing-default-smithi/7776318/teuthology.log
fuse: http://qa-proxy.ceph.com/teuthology/rishabh-2024-06-27_16:29:26-fs:functional-main-testing-default-smithi/7776321/teuthology.log
The FUSE job failed, but long after test_single_path_authorize_on_nonalphanumeric_fsname passes.
Updated by Rishabh Dave over 1 year ago
the bug lies in kclient rather than this test code.
If this is correct, I propose that we add a separate test case that reproduces this kclient bug. And, in the meantime, we remove the unused client (at least temporarily) from this test, so that the test can run as usual when PRs are tested against the main branch. This ensures that test coverage is not reduced while the kclient is being fixed.
Updated by Venky Shankar over 1 year ago · Edited
@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/
@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.
Updated by Rishabh Dave over 1 year ago
Venky Shankar wrote in #note-8:
@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/
@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.
Xiubo didn't try to reproduce this (AFAIK). He checked whether a CephFS that was mounted using the testing kernel can be unmounted after the MDS has been killed, without hanging or crashing and without using --force or --lazy.
Updated by Venky Shankar over 1 year ago
Rishabh Dave wrote in #note-9:
Venky Shankar wrote in #note-8:
@Xiubo Li I think this bug is when the stock kernel gets chosen for the run: https://pulpito.ceph.com/rishabh-2024-06-27_16:28:49-fs:functional-main-testing-default-smithi/7776312/
@Rishabh Dave mentioned that you were unable to reproduce this with the testing kernel.
Xiubo didn't try to reproduce this (AFAIK). He checked whether a CephFS that was mounted using the testing kernel can be unmounted after the MDS has been killed, without hanging or crashing and without using --force or --lazy.
OK, I misunderstood our conversation then. When the MDS is not reachable, the user should pass --force to allow unmounting; otherwise umount hangs because the client keeps trying to flush data to the MDS.
So, it's pretty obvious now that the client whose mount is left in place after the file system is deleted is the one blocked on umount. So, either run the test with one client or unmount the extra client. Would that suffice, @Rishabh Dave?
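The umount behaviour described above can be modelled in a few lines: without --force the client insists on flushing dirty data to the MDS first, so an unreachable MDS makes the call block, while a forced unmount skips the flush. This is a toy sketch of that decision, not the actual kernel client logic.

```python
# Toy model of the umount decision: a normal unmount must flush dirty
# data to the MDS first, so an unreachable MDS blocks it; a forced
# unmount (umount -f) skips the flush. Illustration only -- this is not
# the real kclient implementation.
def try_umount(mds_reachable, force=False):
    dirty_data = True                 # assume the client has data to flush
    if dirty_data and not force:
        if not mds_reachable:
            return "hangs: waiting to flush dirty data to MDS"
        # flush succeeds, then the mount is released
    return "unmounted"

print(try_umount(mds_reachable=True))               # unmounted
print(try_umount(mds_reachable=False))              # hangs: waiting to flush...
print(try_umount(mds_reachable=False, force=True))  # unmounted
```

This matches the observed failure: the stranded kernel client sits in the "flush to MDS" state until teuthology's 300-second command timeout fires.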
Updated by Rishabh Dave over 1 year ago
So, either run the test with one client or umount the extra client. Would that suffice, @Rishabh Dave ?
Yes. I've already implemented this approach and tested this. See: https://tracker.ceph.com/issues/66077#note-6
Updated by Rishabh Dave over 1 year ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 58311
Updated by Rishabh Dave over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Rishabh Dave over 1 year ago
- Copied to Backport #66930: quincy: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Updated by Rishabh Dave over 1 year ago
- Copied to Backport #66931: reef: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Updated by Rishabh Dave over 1 year ago
- Copied to Backport #66932: squid: qa/cephfs: unmount hangs after test_single_path_authorize_on_nonalphanumeric_fsname added
Updated by Rishabh Dave over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Xiubo Li over 1 year ago
- Related to Bug #66946: qa/cephfs: unmount hangs after test_fs_rename_fails_for_non_existent_fs added
Updated by Upkeep Bot 9 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-08T18:35:53+00:00
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 3f4aee27ee08821e287d808e7b9dc5f90136b531
- Fixed In set to v19.3.0-3413-g3f4aee27ee
- Upkeep Timestamp changed from 2025-07-08T18:35:53+00:00 to 2025-08-02T04:51:15+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2501
- Upkeep Timestamp changed from 2025-08-02T04:51:15+00:00 to 2025-11-01T01:36:12+00:00