Bug #66009
Closed
qa: `fs volume ls` command times out waiting for fs to come online
Description
- no stack trace in any of the mgr, mds, or mon logs
- in the mon logs, for 4 seconds after the timeout happened for `fs volume ls` there was no rank 0; then mds.b got rank 0
Updated by Milind Changire almost 2 years ago
- Subject changed from qa: `fs volume ls` command times out waiting for fs to come online to qa: `fs volume ls` command times out waiting for fs to come online
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Milind Changire
- Priority changed from Normal to High
- Target version set to v20.0.0
Updated by Venky Shankar almost 2 years ago
Updated by Venky Shankar almost 2 years ago
Discussed a bit with Milind - the `volume ls` command in mgr/volumes does not take the global lock - it just iterates the mdsmap and returns a list of file systems (volumes). So, it's not clear where and which thread is being blocked to cause the command to time out. I suggested that Milind increase the command timeout and then get a stack trace of all threads on ceph-mgr when the command hangs in a teuthology run (since the test failure is pretty much reproducible).
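For context, a minimal sketch of what such a lock-free listing looks like (hypothetical names, not the actual mgr/volumes code): the handler only reads a snapshot of the FS map, so nothing in it should block.

```python
# Hypothetical sketch of a lock-free "volume ls"-style handler: it only
# iterates an immutable snapshot of the FS map, so no global module lock
# is taken and no thread should block here.
import json

def list_volumes(fsmap):
    """Return one {'name': ...} entry per file system in the map snapshot."""
    return [{'name': fs['mdsmap']['fs_name']}
            for fs in fsmap.get('filesystems', [])]

fsmap_snapshot = {'filesystems': [{'mdsmap': {'fs_name': 'cephfs'}}]}
print(json.dumps(list_volumes(fsmap_snapshot)))  # [{"name": "cephfs"}]
```

Since the handler itself cannot block, any hang has to come from whatever sits in front of it (the command dispatch path), which is exactly what the later comments dig into.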
Updated by Venky Shankar almost 2 years ago
One recent commit in ceph-mgr is
commit f34b439d626621df3f67d8d20162d581836003cb
Merge: b71803b6702 190f37978f7
Author: Patrick Donnelly <pdonnell@redhat.com>
Date: Mon Apr 22 15:07:27 2024 -0400
Merge PR #56047 into main
* refs/pull/56047/head:
mgr: process map before notifying clients
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
I can't say for certain that it is causing this, but it seems to be related to processing cluster maps.
Updated by Patrick Donnelly almost 2 years ago
I'm not seeing the usual amount of "volumes" debugging in the mgr log:
2024-05-14T07:19:45.505+0000 7fd2c5e94640 0 [volumes INFO mgr_util] scanning for idle connections..
2024-05-14T07:19:45.505+0000 7fd2c5e94640 0 [volumes INFO mgr_util] cleaning up connections: []
...
2024-05-14T07:18:09.645+0000 7fd2dbf80640 1 -- [v2:172.21.15.69:6832/1346735391,v1:172.21.15.69:6833/1346735391] <== client.5796 172.21.15.19:0/2037185831 1 ==== mgr_command(tid 0: {"prefix": "fs volume ls", "target": ["mon-mgr", ""]}) ==== 77+0+0 (secure 0 0 0) 0x55e51587b1e0 con 0x55e515fdb400
2024-05-14T07:18:09.645+0000 7fd2dbf80640 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 is_capable service=py module=volumes command=fs volume ls read addr - on cap allow *
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 allow so far , doing grant allow *
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 allow all
2024-05-14T07:18:09.646+0000 7fd2dbf80640 10 mgr.server _allowed_command client.admin capable
2024-05-14T07:18:09.646+0000 7fd2dbf80640 0 log_channel(audit) log [DBG] : from='client.5796 -' entity='client.admin' cmd=[{"prefix": "fs volume ls", "target": ["mon-mgr", ""]}]: dispatch
2024-05-14T07:18:09.646+0000 7fd2dbf80640 10 mgr.server _handle_command passing through command 'fs volume ls' size 2
...
From: /teuthology/mchangir-2024-05-14_06:44:38-fs-wip-mchangir-mgr-snap_schedule-correctly-fetch-mds_max_snaps_per_dir-from-mds-distro-default-smithi/7705653/remote/smithi069/log/ceph-mgr.x.log.gz
There's no indication why the command is never dispatched. I don't think this can be related to the notify change but it's possible. The volumes plugin doesn't even implement the notify call (and all we did was reorder things).
So, I would recommend:
- make sure we're getting full debugging on volumes/snap_schedule
- increase debug_finisher for the manager to 20
Hopefully that can shed some light on what's going on.
Updated by Venky Shankar almost 2 years ago
Milind mentioned that after increasing the command timeout (to 900s), the command hang wasn't seen; however, all snap-schedule jobs failed with an mds "evicted client XXX" warning. I'm seeing something similar in tasks/nfs. See: https://tracker.ceph.com/issues/66049
Updated by Venky Shankar almost 2 years ago
- Related to Bug #66049: qa, tasks/nfs: client.15263 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs added
Updated by Milind Changire almost 2 years ago
RCA of "Test failure: test_multi_snap_schedule": the test is stalled due to an ansible failure
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-9:
RCA of "Test failure: test_multi_snap_schedule": the test is stalled due to an ansible failure
The (ansible) fix is merged. Let's get the testing going.
Updated by Milind Changire almost 2 years ago
Will mgr client sessions go stale on the mds side if the mgr is restarted?
I think the client ID will be lost across a mgr restart.
If so, then
"2024-05-22T17:44:21.152633+0000 mds.d (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi150:y (8757), after 301.178 seconds" in cluster log
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
handle_mgr_map respawning because set of enabled modules changed!
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-11:
Will mgr client sessions go stale on mds side if mgr is restarted?
Yes, that's possible. The mgr periodically sends its rados addresses to ceph-mon for blocklisting; however, there is a window where it doesn't (racing with an mgr restart), in which case the MDS would blocklist and evict those stale libcephfs handles.
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
Updated by Milind Changire almost 2 years ago
Venky Shankar wrote in #note-12:
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
I'm not sure about other runs, but I got to see this warning when the job was declared as failed.
This warning showed up after bumping the timeout to 900 for the `fs volume ls` command.
fyi - I don't think the timeout has any relation to the mgr client eviction.
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-13:
Venky Shankar wrote in #note-12:
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
I'm not sure about other runs, but I got to see this warning when the job was declared as failed.
This warning showed up after bumping the timeout to 900 for the `fs volume ls` command.
fyi - I don't think the timeout has any relation to the mgr client eviction.
Yeh, those are two separate issues. The command timeout masks the eviction warning, I think.
BTW, have you tried reverting the patch mentioned in note-5 to see if the issue reproduces (with the default command timeout)?
Updated by Milind Changire almost 2 years ago
Venky Shankar wrote in #note-14:
BTW, have you tried reverting the patch mentioned in note-5 to see if the issue reproduces (with the default command timeout)?
I'll try that next.
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-16:
The teuthology job failed with the default timeout even after reverting the PR mentioned in note-5
Oh dear. We are back to square one on this. The fact that the command (`volume ls`) hang isn't observed with a larger command timeout is now confusing.
Can you upload the thread backtraces you captured during the perceived hang to this ticket, @Milind Changire?
Updated by Milind Changire almost 2 years ago
Updated by Milind Changire almost 2 years ago
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr config_notify Calling volumes._config_notify...
...
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_delay
2024-05-15T10:35:47.760+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_delay not found
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/periodic_async_work
2024-05-15T10:35:47.760+0000 7ff6a02bc640 10 mgr get_typed_config periodic_async_work not found
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option periodic_async_work = False
2024-05-15T10:35:47.760+0000 7ff6b031c640 20 mgr Gil Switched to new thread state 0x56234880c6c0
...
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_no_wait = True
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_level
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_level not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_level =
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_file
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_file not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_file = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/sqlite3_killpoint
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config sqlite3_killpoint not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option sqlite3_killpoint = 0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster_level
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster_level not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster_level = info
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread done with [0x56234816d7e0,0]
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread doing [0x56234816cdc0,0,0x5623489c6700,0]
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
...
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_delay
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_delay not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/periodic_async_work
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config periodic_async_work not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option periodic_async_work = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_no_wait
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_no_wait not found
2024-05-15T10:40:45.466+0000 7ff6a4ac5640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a52c6640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a5ac7640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a62c8640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a32c2640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a1abf640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a2ac1640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a22c0640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_no_wait = True
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_level
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_level not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_level =
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_file
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_file not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_file = False
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/sqlite3_killpoint
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config sqlite3_killpoint not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option sqlite3_killpoint = 0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster = False
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster_level
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster_level not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster_level = info
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr.server operator() dispatching command 'fs volume ls' size 2
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes INFO volumes.module] Starting _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes INFO volumes.module] Finishing _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr.server reply success
2024-05-15T10:40:45.467+0000 7ff6a02bc640 1 -- [v2:172.21.15.193:6835/3001227076,v1:172.21.15.193:6837/3001227076] --> 172.21.15.193:0/3903477197 -- mgr_command_reply(tid 0: 0 ) -- 0x562348d881a0 con 0x5623489c8000
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr.server operator() command returned 0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread done with [0x56234816cdc0,0,0x5623489c6700,0]
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread empty
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread sleeping
The above sequence of mgr logs points to some contention on the Python GIL.
I'm not able to conclusively say if it is because of the pending mount operation, which fails with ETIMEDOUT while waiting for the mds to respond. The mount operation is initiated in the purgequeue job path.
Also, if the timeout is bumped to 900 seconds, then eventually the `fs volume ls` passes ... presumably because the mds finally comes up and responds to the fs operations.
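The queueing behavior described above can be modeled with a toy single-threaded finisher (this is not Ceph code, just an illustration): callbacks run strictly in order, so a slow `_config_notify` queued first delays a cheap `fs volume ls` handler queued behind it, no matter how fast the handler itself is.

```python
# Toy model of a single finisher thread: jobs execute strictly in queue
# order, so a slow job (the config_notify stand-in) delays every job
# queued behind it (the volume_ls stand-in).
import queue
import threading
import time

finisher_q = queue.Queue()
done = []

def finisher_thread():
    while True:
        job = finisher_q.get()
        if job is None:  # shutdown sentinel
            break
        job()

t = threading.Thread(target=finisher_thread)
t.start()

# A slow callback queued first...
finisher_q.put(lambda: (time.sleep(0.2), done.append('config_notify')))
# ...delays the cheap command handler queued behind it.
finisher_q.put(lambda: done.append('volume_ls'))
finisher_q.put(None)
t.join()

print(done)  # ['config_notify', 'volume_ls']
```

With a large enough command timeout the command still completes once the slow job finishes, which matches the observation that bumping the timeout to 900s makes `fs volume ls` pass.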
Updated by Venky Shankar almost 2 years ago
- Related to Bug #66483: qa: CommandFailedError in test_multi_snap_schedule added
Updated by Venky Shankar over 1 year ago
@Milind Changire bisected the commits, and it looks like https://github.com/ceph/ceph/pull/55823/commits/329edd3c56baff81cb76c5008ac2eaf7c110f9e1#diff-501f5d864f91e98b1c214da58065308a41554b87ce6366f2314a84c19c287ca7R239 is the culprit.
Updated by Milind Changire over 1 year ago
Git bisect of merges resulted in the following:
BAD  c2b21dbf694 Leonid S. Usov  2024-04-30 13:53:46 +0300 Merge pull request #55823 from ceph/wip-lusov-quiescer
GOOD 19c2c8cef15 Matan Breizman  2024-04-30 12:29:59 +0300 Merge pull request #56806 from xxhdx1985126/wip-65399
The aim is to revert patch:
commit 329edd3c56baff81cb76c5008ac2eaf7c110f9e1
Author: Leonid Usov <leonid.usov@ibm.com>
Date: Thu Apr 25 19:41:57 2024 +0300
qa/tasks/cephfs: don't create a new CephManager if there is one in the context
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
and run the --suite fs --filter-all snap-schedule job set for about 20 consecutive iterations and see if it behaves well, to form a conclusive opinion.
Updated by Milind Changire over 1 year ago
The `fs volume ls` command getting stuck is just a side-effect of `_config_notify` being called for the module. The `fs volume ls` command gets queued up behind the `_config_notify`.
Currently, I'm trying to set up a thread, parallel to the worker finisher thread, that watches for a deadline to elapse during the Context execution call before forcefully doing a `ceph_assert(false)`.
I hope to catch a sensible core dump for analysis, to see which mutex the `_config_notify` is blocking on.
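A minimal sketch of the watchdog idea (all names hypothetical, not the actual mgr change): a side thread flags a callback that overruns its deadline, standing in for the `ceph_assert(false)` that would force a core dump in the real experiment.

```python
# Hypothetical watchdog sketch: a side thread waits for the job to signal
# completion; if the deadline elapses first, it flags a timeout (the real
# version would do ceph_assert(false) to capture a core dump).
import threading
import time

def run_with_watchdog(job, deadline):
    finished = threading.Event()
    timed_out = []

    def watchdog():
        # Event.wait returns False if the timeout elapsed before set().
        if not finished.wait(deadline):
            timed_out.append(True)  # real version: ceph_assert(false)

    w = threading.Thread(target=watchdog)
    w.start()
    job()            # execute the Context-style callback
    finished.set()   # tell the watchdog we completed
    w.join()
    return bool(timed_out)

# A fast job beats the deadline; a slow one trips the watchdog.
assert run_with_watchdog(lambda: None, deadline=0.5) is False
assert run_with_watchdog(lambda: time.sleep(0.2), deadline=0.05) is True
```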
Updated by Venky Shankar over 1 year ago
Milind Changire wrote in #note-24:
The `fs volume ls` command getting stuck is just a side-effect of `_config_notify` being called for the module. The `fs volume ls` command gets queued up behind the `_config_notify`.
Do you have a backtrace for the stuck thread?
Updated by Milind Changire over 1 year ago
- File t-a-a-bt.txt added
Updated by Venky Shankar over 1 year ago
@Milind Changire pointed me at a failed run: /a/mchangir-2024-07-06_16:10:22-fs-wip-mchangir-log-use-gettid-instead-of-pthread_self-debug-distro-default-smithi/7789504
The backtrace points to a config_notify() call in mgr/volumes where the GIL is being held and the mgr volumes finisher is blocked. For that thread:
zgrep " b286 " ./remote/smithi117/log/ceph-mgr.x.log.gz 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread start 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread empty 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread sleeping 2024-07-06T16:45:47.566+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread doing [0x55e982fc43a0,0] 2024-07-06T16:45:47.567+0000 b286 20 mgr Gil Switched to new thread state 0x55e982fb6900 2024-07-06T16:45:47.567+0000 b286 20 mgr config_notify Calling volumes._config_notify... 2024-07-06T16:45:47.567+0000 b286 10 ceph_option_get debug_mgr found: 20/20 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_level 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_level not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_cluster_level 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_cluster_level not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_file 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_file not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_cluster 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_cluster not found 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/max_concurrent_clones 2024-07-06T16:45:47.568+0000 b286 10 mgr get_typed_config max_concurrent_clones not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/snapshot_clone_delay 2024-07-06T16:45:47.568+0000 b286 10 mgr get_typed_config snapshot_clone_delay not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/periodic_async_work 2024-07-06T16:45:47.568+0000 
b286 10 mgr get_typed_config periodic_async_work not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option periodic_async_work = False
(note: the `b286` tid in the log is due to a custom change Milind added to match the thread id with gdb)
Anyway, one can see the GIL is acquired
2024-07-06T16:45:47.567+0000 b286 20 mgr Gil Switched to new thread state 0x55e982fb6900
but there is no corresponding release of the GIL. So, my guess is that config_notify() in mgr/volumes threw a backtrace due to an unhandled exception (which isn't logged!) and the finisher thread is blocked, thereby queueing up the `fs volume ls` command.
If this gets validated, then I think we should relook at how unhandled exceptions are logged.
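If the unhandled-exception theory holds, one possible shape for the fix is a logging wrapper around the notify call (a sketch with hypothetical names, not the actual mgr code): the exception is caught and logged with a full traceback instead of silently leaving the finisher blocked.

```python
# Hypothetical sketch: wrap a module's notify callback so an unhandled
# exception is logged with a full traceback rather than vanishing.
import logging
import traceback

log = logging.getLogger('mgr')

def safe_notify(module_name, notify_fn):
    """Run notify_fn(); log (never propagate) any unhandled exception."""
    try:
        notify_fn()
    except Exception:
        log.error('unhandled exception in %s._config_notify:\n%s',
                  module_name, traceback.format_exc())
```

A caller would do `safe_notify('volumes', module._config_notify)` so the finisher thread always regains control, whatever the module's callback does.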
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Sorry, my bad. config_notify does use the per-module finisher thread, and it takes the GIL before calling the module's _config_notify:
void ActivePyModules::config_notify()
{
  std::lock_guard l(lock);
  for (auto& [name, module] : modules) {
    // Send all python calls down a Finisher to avoid blocking
    // C++ code, and avoid any potential lock cycles.
    dout(15) << "notify (config) " << name << dendl;
    Finisher& mod_finisher = py_module_registry.get_active_module_finisher(name);
    // workaround for https://bugs.llvm.org/show_bug.cgi?id=35984
    mod_finisher.queue(new LambdaContext([module=module](int r){
      module->config_notify();
    }));
  }
}

void ActivePyModule::config_notify()
{
  if (is_dead()) {
    dout(5) << "cancelling config_notify" << dendl;
    return;
  }
  Gil gil(py_module->pMyThreadState, true);
  dout(20) << "Calling " << py_module->get_name() << "._config_notify..."
           << dendl;
  auto remoteResult = PyObject_CallMethod(pClassInstance,
                                          const_cast<char*>("_config_notify"),
                                          (char*)NULL);
  if (remoteResult != nullptr) {
    Py_DECREF(remoteResult);
  }
}
Updated by Milind Changire over 1 year ago
Updated by Venky Shankar over 1 year ago · Edited
Milind Changire wrote in #note-29:
@Kotresh Hiremath Ravishankar fyi - latest teuthology job set
Milind mentioned that the cloner thread holds the lock for a long time (~5 minutes), which causes the config_notify() call via the volumes finisher to block waiting to acquire it. AFAICT, the cloner threads (and the purge threads) run without holding async_job::lock, but it's good to check if that still holds true (maybe some change got merged that affected it).
Updated by Milind Changire over 1 year ago
For this job, the following is an excerpt of the mgr logs:
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.1).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] processing 1 volume entries
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_cloner] fetching clone entry for volume 'cephfs'
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] [get] volume: (cephfs) connection: ([])
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] [get] spawning new connection since no connection is unused and we still have room for more
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] Connecting to cephfs 'cephfs'
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] Setting user ID and group ID of CephFS mount as root...
...
2024-07-10T15:10:46.535+0000 b37d 0 [volumes DEBUG mgr_util] CephFS mounting...
...
2024-07-10T15:10:46.537+0000 b37d 10 client.5138 did not get mds through better means, so chose random mds -1
2024-07-10T15:10:46.537+0000 b37d 20 client.5138 mds is -1
2024-07-10T15:10:46.537+0000 b37d 10 client.5138 target mds.-1 not active, waiting for new mdsmap
...
2024-07-10T15:10:50.213+0000 b388 20 mgr config_notify Calling volumes._config_notify...
...
2024-07-10T15:11:20.217+0000 b388 0 [volumes INFO mgr_util] possible deadlock acquiring <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:11:20.217+0000 b388 0 [volumes DEBUG mgr_util] locking <locked _thread.lock object at 0x7f2db0e0c720> with 5 timeout
...
2024-07-10T15:15:46.588+0000 b37d 1 client.5138 shutdown
...
2024-07-10T15:15:46.589+0000 b37d 0 [volumes ERROR volumes.fs.async_cloner] error fetching clone entry for volume 'cephfs' (-110 (error calling ceph_mount))
2024-07-10T15:15:46.589+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.1).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
...
2024-07-10T15:15:46.589+0000 b380 0 [volumes DEBUG volumes.fs.async_job] AsyncJobs(cloner.tick).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b380 0 [volumes DEBUG volumes.fs.async_job] AsyncJobs(cloner.tick).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.0).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_job] processing 1 volume entries
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_cloner] fetching clone entry for volume 'cephfs'
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] [get] volume: (cephfs) connection: ([])
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] [get] spawning new connection since no connection is unused and we still have room for more
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] Connecting to cephfs 'cephfs'
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] Setting user ID and group ID of CephFS mount as root...
...
2024-07-10T15:15:46.596+0000 b37c 0 [volumes DEBUG mgr_util] CephFS mounting...
...
2024-07-10T15:15:46.604+0000 b37c 0 [volumes DEBUG mgr_util] Connection to cephfs 'cephfs' complete
...
2024-07-10T15:15:46.604+0000 b37c 20 client.5725 (path is #0x1/volumes/_index/clone)
...
2024-07-10T15:15:46.605+0000 b37f 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.3).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
...
2024-07-10T15:15:46.607+0000 b388 0 [volumes INFO volumes.module] Starting _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-07-10T15:15:46.607+0000 b388 0 [volumes INFO volumes.module] Finishing _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
So the config_notify path (thread b37c) has a problem with receiving the mds_map, and the client connection eventually gets shut down.
It looks like most of the time is spent waiting for the mds_map ... which never arrives.
Why is the target mds chosen as -1 by the client?
Updated by Venky Shankar over 1 year ago
This issue happens when there is a race between the async threads in mgr/volumes, which have fetched a list of file systems to process for initiating clones, and an fs fail (followed by fs rm) command deleting the file system (probably done right after finishing a test). The clone thread finds the file system in the fs map and tries to establish a mount to it, but around the same time the file system is removed, so the mount call from libcephfs keeps trying to find an MDS to connect to. A new file system does get created, but with a different fscid.
For the fix, for now, the test can set client_mount_timeout to a lower value (say, 30 seconds) so that the thread does not get stuck for long - the connection would time out and the next try would successfully connect to the new file system.
We should also check if connecting to the file system (in mgr/volumes and mgr_util) can be done after dropping locks - that would be a nice enhancement. WDYT?
Updated by Milind Changire over 1 year ago · Edited
The actual connection to the cephfs via the python wrapper around libcephfs is done under nogil.
Is there any other lock that needs to be dropped?
Do you mean to drop the AsyncJobs.lock when attempting the connection to the fs?
Although the problem manifests via test_snap_schedules.py, the piece of code due to which we see the problem is common across all cephfs tests. So we'll need to set client_mount_timeout to a lower value for the entire cephfs test suite.
Updated by Venky Shankar over 1 year ago
Milind Changire wrote in #note-33:
The actual connection to the cephfs via the python wrapper around libcephfs is done under nogil.
Is there any other lock that needs to be dropped?
Do you mean to drop the AsyncJobs.lock when attempting the connection to the fs?
There is AsyncJobs.lock and then mgr_util.py::CephfsConnectionPool::lock. Once the code decides that it needs to initiate a file system connection, it can drop the lock, but we need to be careful since another thread might come in at this point and touch some structures that are incomplete. So, audit those parts carefully.
Although the problem is manifested via test_snap_schedules.py, the piece of code due to which we see the problem is common across all cephfs tests. So we'll need to set client_mount_timeout to a lower value for the entire cephfs test suite.
Agreed. Right now there is
qa/cephfs/conf/client.yaml: client mount timeout: 600
which was introduced by commit 73fa0efcbb15a47157615705043288f2913ff3bc - I think that needs to be lowered.
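The lock-dropping enhancement discussed above could look roughly like this (a sketch with assumed names, not the actual CephfsConnectionPool code): release the pool lock around the slow mount, then re-check under the lock when publishing the connection, in case another thread raced us.

```python
# Hypothetical sketch of the suggested enhancement: the pool lock is
# dropped while the (potentially slow) mount happens, so other threads
# aren't blocked behind a hung ceph_mount; the result is published under
# the lock with a re-check for a racing winner.
import threading

class ConnectionPool:
    def __init__(self, connect_fn):
        self.lock = threading.Lock()
        self.connections = {}
        self._connect = connect_fn  # slow; stands in for ceph_mount

    def get(self, volname):
        with self.lock:
            conn = self.connections.get(volname)
            if conn is not None:
                return conn
        # Lock dropped here: the slow connect no longer serializes
        # every other user of the pool.
        conn = self._connect(volname)
        with self.lock:
            # Another thread may have connected meanwhile; keep the
            # first published connection and discard ours if so.
            return self.connections.setdefault(volname, conn)
```

The care Venky mentions is the re-check: between dropping and re-taking the lock, a second thread may have created its own connection, so `setdefault` keeps exactly one winner.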
Updated by Milind Changire over 1 year ago
- 4 out of 6 jobs failed with
  cluster [WRN] evicting unresponsive client smithi111:x (7409), after 301.769 seconds
  this warning is mostly due to a mount being unresponsive from the mgr
- 2 out of 6 jobs failed with core dumps; this is due to a pthread_getname_np() failure
  [Current thread is 1 (Thread 0x7f2618676640 (LWP 48305))]
  (gdb) bt
  #0 0x00007f261da8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
  #1 0x00007f261da3e646 in raise () from /lib64/libc.so.6
  #2 0x000056458ed05ea7 in reraise_fatal(int) ()
  #3 0x000056458ed0fbd1 in handle_oneshot_fatal_signal(int) ()
  #4 <signal handler called>
  #5 0x00007f261da8b2b6 in pthread_getname_np@GLIBC_2.12 () from /lib64/libc.so.6
  #6 0x00007f261e705597 in ceph::logging::Log::dump_recent() () from /usr/lib64/ceph/libceph-common.so.2
  #7 0x000056458e8d15ef in MDSDaemon::respawn() ()
  #8 0x000056458e8c9b3e in MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&) ()
  #9 0x000056458e8cfdc7 in MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&) ()
  #10 0x000056458e8d0a06 in MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
  #11 0x00007f261e5967fd in Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&) () from /usr/lib64/ceph/libceph-common.so.2
  #12 0x00007f261e5970f1 in DispatchQueue::entry() () from /usr/lib64/ceph/libceph-common.so.2
  #13 0x00007f261e66e005 in DispatchQueue::DispatchThread::entry() () from /usr/lib64/ceph/libceph-common.so.2
  #14 0x00007f261e423cf7 in Thread::entry_wrapper() () from /usr/lib64/ceph/libceph-common.so.2
  #15 0x00007f261e423d13 in Thread::_entry_func(void*) () from /usr/lib64/ceph/libceph-common.so.2
  #16 0x00007f261da89c02 in start_thread () from /lib64/libc.so.6
  #17 0x00007f261db0ec40 in clone3 () from /lib64/libc.so.6
- restored `fs volume ls` command timeout to the default of 120 seconds
- reduced client_mount_timeout from 300 seconds to 30 seconds
Updated by Milind Changire over 1 year ago
- Pull request ID changed from 58547 to 58771
Updated by Venky Shankar over 1 year ago
- Status changed from Triaged to Pending Backport
- Source set to Development
- Backport set to quincy,reef,squid
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67826: squid: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67827: reef: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67828: quincy: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Venky Shankar over 1 year ago
- Pull request ID changed from 58771 to 58547
Updated by Konstantin Shalygin about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Venky Shankar 10 months ago
- Related to Bug #71516: qa: `volume ls` times out with test_snap_schedule_subvol_and_group_arguments_16 test added
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 52deba6b14a095eb1cf0d8eb397861dc3bc900c6
- Fixed In set to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp set to 2025-06-30T17:49:15+00:00
Updated by Upkeep Bot 9 months ago
- Merge Commit deleted (52deba6b14a095eb1cf0d8eb397861dc3bc900c6)
- Fixed In deleted (v19.3.0-4615-g52deba6b14a0)
- Upkeep Timestamp deleted (2025-06-30T17:49:15+00:00)
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 52deba6b14a095eb1cf0d8eb397861dc3bc900c6
- Fixed In set to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp set to 2025-06-30T17:55:37+00:00
Updated by Upkeep Bot 9 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a0 to v19.3.0-4615
- Upkeep Timestamp changed from 2025-06-30T17:55:37+00:00 to 2025-06-30T17:55:37
Updated by Upkeep Bot 9 months ago
- Fixed In changed from v19.3.0-4615 to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp changed from 2025-06-30T17:55:37 to 2025-06-30T18:02:43+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a0 to v19.3.0-4615-g52deba6b14a
- Upkeep Timestamp changed from 2025-06-30T18:02:43+00:00 to 2025-07-08T18:29:54+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a to v19.3.0-4615-g52deba6b14
- Upkeep Timestamp changed from 2025-07-08T18:29:54+00:00 to 2025-07-14T17:10:31+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2126
- Upkeep Timestamp changed from 2025-07-14T17:10:31+00:00 to 2025-11-01T01:27:50+00:00