Bug #71572
Closed: missing debugfs status file in testing kernel (possibly intermittent)
Description
Right after kernel mount, the "status" file in debugfs is read (by code in qa/cephfs). In some runs, this particular debugfs file isn't present.
2025-06-03T08:10:45.440 DEBUG:teuthology.orchestra.run.smithi133:> (cd /home/ubuntu/cephtest/mnt.0 && exec stdin-killer --timeout=300 -- bash -c 'sudo dd if=/sys/kernel/debug/ceph/ff8cc1ea-9b92-4ce0-aa21-aff1539c4292.client4821/status')
2025-06-03T08:10:45.520 INFO:teuthology.orchestra.run.smithi133.stderr:2025-06-03T08:10:45 stdin-killer INFO: expiration expected; waiting 300 seconds for command to complete
2025-06-03T08:10:45.540 INFO:teuthology.orchestra.run.smithi133.stderr:dd: failed to open '/sys/kernel/debug/ceph/ff8cc1ea-9b92-4ce0-aa21-aff1539c4292.client4821/status': No such file or directory
2025-06-03T08:10:45.542 INFO:teuthology.orchestra.run.smithi133.stderr:2025-06-03T08:10:45 stdin-killer INFO: command exited with status 1: exiting normally with same code!
2025-06-03T08:10:45.550 DEBUG:teuthology.orchestra.run:got remote process result: 1
2025-06-03T08:10:45.551 WARNING:tasks.cephfs.kernel_mount:failed to fetch mount info - tests depending on mount addr/inst may fail!
Although the test runs to completion, any code path that relies on the mount address or instance will fail. One such path is in umount(), which checks whether the client is blocklisted. To do this, the qa code runs ceph osd blocklist ls and checks whether the mount address is in the list of addresses returned by the blocklist command. However, the address (self.addr) is None because the mount info was never fetched, and checking for None in a string raises a TypeError, resulting in the following trace:
2025-06-03T09:40:22.179 ERROR:teuthology.run_tasks:Manager failed: kclient
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_eaeb97003cfc43fc86754e4e45e7b398c784dedf/teuthology/run_tasks.py", line 160, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_25d5c1a9e98389c24b632e49b2b72796c0934fc0/qa/tasks/kclient.py", line 139, in task
    forced = umount_all()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_25d5c1a9e98389c24b632e49b2b72796c0934fc0/qa/tasks/kclient.py", line 121, in umount_all
    mount.umount()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_25d5c1a9e98389c24b632e49b2b72796c0934fc0/qa/tasks/cephfs/kernel_mount.py", line 152, in umount
    if self.is_blocked():
  File "/home/teuthworker/src/git.ceph.com_ceph-c_25d5c1a9e98389c24b632e49b2b72796c0934fc0/qa/tasks/cephfs/mount.py", line 204, in is_blocked
    return self.addr in output
TypeError: 'in <string>' requires string as left operand, not NoneType
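For illustration only, the failure mode in the trace can be reproduced outside teuthology. The sketch below is not the qa code: `is_blocked_safe` is a hypothetical standalone function, and treating an unknown address as "not blocked" is an assumption about what a defensive check might do, not a fix that was applied in this tracker.

```python
from typing import Optional


def is_blocked_unsafe(addr: Optional[str], blocklist_output: str) -> bool:
    # Mirrors the shape of `return self.addr in output` from the trace.
    # Raises TypeError when addr is None.
    return addr in blocklist_output


def is_blocked_safe(addr: Optional[str], blocklist_output: str) -> bool:
    # Hypothetical guard: if the mount address was never fetched,
    # we cannot match it against the blocklist, so report False
    # (an assumption; raising a clearer error is another option).
    if addr is None:
        return False
    return addr in blocklist_output


if __name__ == "__main__":
    try:
        is_blocked_unsafe(None, "example blocklist output")
    except TypeError as exc:
        print("unsafe check failed:", exc)
    print("safe check:", is_blocked_safe(None, "example blocklist output"))
```

Either way, a guard like this only hides the symptom; the missing debugfs status file is the underlying problem.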
Updated by Venky Shankar 10 months ago
- Subject changed from missing debugs status file in testing kernel (possibly intermittent) to missing debugfs status file in testing kernel (possibly intermittent)
Updated by Venky Shankar 10 months ago
FWIW, a workaround to get tests running would be to use -k for-linus with the teuthology-suite command. That branch was built ~40 days ago.
Updated by Venky Shankar 10 months ago
Venky Shankar wrote in #note-2:
FWIW, a workaround to get tests running would be to use -k for-linus with the teuthology-suite command. That branch was built ~40 days ago.
... with the obvious caveat that any changes b/w that and now will not be tested.
Updated by Laura Flores 10 months ago
- Backport set to tentacle
/a/teuthology-2025-06-04_01:08:04-upgrade-tentacle-distro-default-smithi/8309119
Updated by Venky Shankar 9 months ago
@Alex Markuze tells me that no changes are in the kernel driver related to debugfs stuff. So, this could be something in the build?
Updated by Venky Shankar 9 months ago · Edited
... and it seems that the issue is not intermittent.
Updated by Ilya Dryomov 9 months ago
Venky Shankar wrote in #note-6:
@Alex Markuze tells me that no changes are in the kernel driver related to debugfs stuff. So, this could be something in the build?
I suspect a problem with Slava's commit in the testing branch: https://github.com/ceph/ceph-client/commit/c26b22533e47783368fbf0b0b84b08fa396c3ad4. The change seems over-complicated for what it needs to do and definitely fishy because the new have_mon_and_osd_map atomic counter is incremented in ceph_monc_init() and ceph_osdc_init() and therefore can be set to 2 (CEPH_CLIENT_HAS_MON_AND_OSD_MAP) before any map is actually received. This can mess with "wait until both monmap and osdmap is received" logic in __ceph_open_session() which is also responsible for initializing debugfs directory for the client.
I'm going to drop it from the testing branch, please retry once the new build is ready.
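To make the suspected problem concrete, here is a toy Python model of the counter behaviour described above. It is not the kernel code: `ClientModel` and its methods are hypothetical names, and the only details taken from this comment are that the counter is incremented in both init functions and that the value 2 stands for CEPH_CLIENT_HAS_MON_AND_OSD_MAP.

```python
HAS_MON_AND_OSD_MAP = 2  # stands in for CEPH_CLIENT_HAS_MON_AND_OSD_MAP


class ClientModel:
    """Toy model: a counter bumped at init time vs. maps actually received."""

    def __init__(self) -> None:
        self.have_mon_and_osd_map = 0
        self.monmap_received = False
        self.osdmap_received = False

    def monc_init(self) -> None:
        # Models the increment in ceph_monc_init(): no map has arrived yet.
        self.have_mon_and_osd_map += 1

    def osdc_init(self) -> None:
        # Models the increment in ceph_osdc_init(): no map has arrived yet.
        self.have_mon_and_osd_map += 1

    def ready_by_counter(self) -> bool:
        return self.have_mon_and_osd_map >= HAS_MON_AND_OSD_MAP

    def ready_by_maps(self) -> bool:
        return self.monmap_received and self.osdmap_received


if __name__ == "__main__":
    client = ClientModel()
    client.monc_init()
    client.osdc_init()
    # The counter claims both maps are present although neither was received.
    print(client.ready_by_counter(), client.ready_by_maps())
```

In this model, any "wait until both maps are received" check keyed on the counter would return immediately after initialization.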
Updated by Ilya Dryomov 9 months ago
Ilya Dryomov wrote in #note-8:
This can mess with "wait until both monmap and osdmap is received" logic in __ceph_open_session() which is also responsible for initializing debugfs directory for the client.
Confirmed -- debugfs directory ends up being created as 00000000-0000-0000-0000-000000000000.client0 because neither the cluster FSID nor the client ID are known at that point.
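As a rough sketch of why the qa lookup misses the file: the path in the failing dd command has the shape /sys/kernel/debug/ceph/<fsid>.client<id>/status, so a directory named from an all-zero FSID and client ID 0 can never match the path built from the real values. The helper names below are hypothetical, and the path layout is inferred from the log in the description.

```python
import uuid

DEBUGFS_CEPH_ROOT = "/sys/kernel/debug/ceph"  # as seen in the dd command


def debugfs_dir_name(fsid: uuid.UUID, client_id: int) -> str:
    # Directory name shape inferred from the log: "<fsid>.client<id>".
    return f"{fsid}.client{client_id}"


def status_path(fsid: uuid.UUID, client_id: int) -> str:
    return f"{DEBUGFS_CEPH_ROOT}/{debugfs_dir_name(fsid, client_id)}/status"


if __name__ == "__main__":
    # Name created when neither the FSID nor the client ID is known yet.
    print(debugfs_dir_name(uuid.UUID(int=0), 0))
    # Path the qa code tried to read in the failing run.
    print(status_path(uuid.UUID("ff8cc1ea-9b92-4ce0-aa21-aff1539c4292"), 4821))
```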
Updated by Venky Shankar 9 months ago
Ilya Dryomov wrote in #note-9:
Ilya Dryomov wrote in #note-8:
This can mess with "wait until both monmap and osdmap is received" logic in __ceph_open_session() which is also responsible for initializing debugfs directory for the client.
Confirmed -- debugfs directory ends up being created as 00000000-0000-0000-0000-000000000000.client0 because neither the cluster FSID nor the client ID are known at that point.
Thanks for confirming that, @Ilya Dryomov
Updated by Venky Shankar 9 months ago
- Status changed from New to Closed
Using the newest testing kernel build which has the problematic commit reverted, these failures aren't seen. Closing this tracker.
Updated by Venky Shankar 9 months ago
- Related to Bug #71611: ERROR: test_newops_getvxattr (tasks.cephfs.test_newops.TestNewOps) added
Updated by Laura Flores 9 months ago
- Related to deleted (Bug #71611: ERROR: test_newops_getvxattr (tasks.cephfs.test_newops.TestNewOps))
Updated by Laura Flores 9 months ago
- Has duplicate Bug #71611: ERROR: test_newops_getvxattr (tasks.cephfs.test_newops.TestNewOps) added