Bug #66009
Closed
qa: `fs volume ls` command times out waiting for fs to come online
Description
- no stack trace in any of the mgr, mds, or mon logs
- in the mon logs, for 4 seconds after the timeout happened for `fs volume ls` there was no rank 0; then mds.b got rank 0
Updated by Milind Changire almost 2 years ago
- Subject changed from qa: `fs volume ls` command times out waiting for fs to come online to qa: `fs volume ls` command times out waiting for fs to come online
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Milind Changire
- Priority changed from Normal to High
- Target version set to v20.0.0
Updated by Venky Shankar almost 2 years ago
Updated by Venky Shankar almost 2 years ago
Discussed a bit with Milind - the `volume ls` command in mgr/volumes does not take the global lock - it just iterates the mdsmap and returns a list of file systems (volumes). So, it's not clear where and which thread is being blocked to cause the command to time out. I suggested that Milind increase the command timeout and then get a stack trace of all threads on ceph-mgr when the command hangs in a teuthology run (since the test failure is pretty much reproducible).
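For context, a minimal sketch of what such a lock-free listing looks like (hypothetical names, not the actual mgr/volumes code): the handler only reads a snapshot of the FS map, so nothing in it should block.

```python
# Hypothetical sketch of a lock-free "volume ls"-style handler: it only
# iterates an immutable snapshot of the FS map, so no global module lock
# is taken and no thread should block here.
import json

def list_volumes(fsmap):
    """Return one {'name': ...} entry per file system in the map snapshot."""
    return [{'name': fs['mdsmap']['fs_name']}
            for fs in fsmap.get('filesystems', [])]

fsmap_snapshot = {'filesystems': [{'mdsmap': {'fs_name': 'cephfs'}}]}
print(json.dumps(list_volumes(fsmap_snapshot)))  # [{"name": "cephfs"}]
```

Since the handler itself cannot block, any hang has to come from whatever sits in front of it (the command dispatch path), which is exactly what the later comments dig into.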
Updated by Venky Shankar almost 2 years ago
One recent commit in ceph-mgr is
commit f34b439d626621df3f67d8d20162d581836003cb
Merge: b71803b6702 190f37978f7
Author: Patrick Donnelly <pdonnell@redhat.com>
Date: Mon Apr 22 15:07:27 2024 -0400
Merge PR #56047 into main
* refs/pull/56047/head:
mgr: process map before notifying clients
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
I can't say for certain that it is causing this, but it seems to be related to processing cluster maps.
Updated by Patrick Donnelly almost 2 years ago
I'm not seeing the usual amount of "volumes" debugging in the mgr log:
2024-05-14T07:19:45.505+0000 7fd2c5e94640 0 [volumes INFO mgr_util] scanning for idle connections..
2024-05-14T07:19:45.505+0000 7fd2c5e94640 0 [volumes INFO mgr_util] cleaning up connections: []
...
2024-05-14T07:18:09.645+0000 7fd2dbf80640 1 -- [v2:172.21.15.69:6832/1346735391,v1:172.21.15.69:6833/1346735391] <== client.5796 172.21.15.19:0/2037185831 1 ==== mgr_command(tid 0: {"prefix": "fs volume ls", "target": ["mon-mgr", ""]}) ==== 77+0+0 (secure 0 0 0) 0x55e51587b1e0 con 0x55e515fdb400
2024-05-14T07:18:09.645+0000 7fd2dbf80640 10 mgr.server _handle_command decoded-size=2 prefix=fs volume ls
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 is_capable service=py module=volumes command=fs volume ls read addr - on cap allow *
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 allow so far , doing grant allow *
2024-05-14T07:18:09.646+0000 7fd2dbf80640 20 allow all
2024-05-14T07:18:09.646+0000 7fd2dbf80640 10 mgr.server _allowed_command client.admin capable
2024-05-14T07:18:09.646+0000 7fd2dbf80640 0 log_channel(audit) log [DBG] : from='client.5796 -' entity='client.admin' cmd=[{"prefix": "fs volume ls", "target": ["mon-mgr", ""]}]: dispatch
2024-05-14T07:18:09.646+0000 7fd2dbf80640 10 mgr.server _handle_command passing through command 'fs volume ls' size 2
...
From: /teuthology/mchangir-2024-05-14_06:44:38-fs-wip-mchangir-mgr-snap_schedule-correctly-fetch-mds_max_snaps_per_dir-from-mds-distro-default-smithi/7705653/remote/smithi069/log/ceph-mgr.x.log.gz
There's no indication why the command is never dispatched. I don't think this can be related to the notify change but it's possible. The volumes plugin doesn't even implement the notify call (and all we did was reorder things).
So, I would recommend:
- make sure we're getting full debugging on volumes/snap_schedule
- increase debug_finisher for the manager to 20
Hopefully that can shed some light on what's going on.
Updated by Venky Shankar almost 2 years ago
Milind mentioned that after increasing the command timeout (to 900s), the command hang wasn't seen; however, all snap-schedule jobs failed with an mds "evicted client XXX" warning. I'm seeing something similar in tasks/nfs. See: https://tracker.ceph.com/issues/66049
Updated by Venky Shankar almost 2 years ago
- Related to Bug #66049: qa, tasks/nfs: client.15263 isn't responding to mclientcaps(revoke), ino 0x1 pending pAsLsXs issued pAsLsXsFs added
Updated by Milind Changire almost 2 years ago
RCA of "Test failure: test_multi_snap_schedule": the test is stalled due to an ansible failure
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-9:
RCA of "Test failure: test_multi_snap_schedule": the test is stalled due to an ansible failure
The (ansible) fix is merged. Let's get the testing going.
Updated by Milind Changire almost 2 years ago
Will mgr client sessions go stale on the mds side if the mgr is restarted?
I think the client ID will be lost across a mgr restart.
If so, then
"2024-05-22T17:44:21.152633+0000 mds.d (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi150:y (8757), after 301.178 seconds" in cluster log
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
handle_mgr_map respawning because set of enabled modules changed!
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-11:
Will mgr client sessions go stale on mds side if mgr is restarted?
Yes, that's possible. The mgr periodically sends its rados addresses to ceph-mon for blocklisting; however, there is a window where it doesn't (racing with an mgr restart), in which case the MDS would blocklist and evict those stale libcephfs handles.
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
Updated by Milind Changire almost 2 years ago
Venky Shankar wrote in #note-12:
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
I'm not sure about other runs, but I got to see this warning when the job was declared as failed.
This warning showed up after bumping the timeout to 900 for the `fs volume ls` command.
fyi - I don't think the timeout has any relation to the mgr client eviction.
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-13:
Venky Shankar wrote in #note-12:
I think the client ID will be lost across a mgr restart.
If so, then
[...]
is the result of client session going stale on mds side due to mgr respawn and there's nothing to worry about
Also, the mgr respawn was due to change in the set of enabled modules
[...]
Fair enough. But have we started seeing this warning in our tests lately?
I'm not sure about other runs, but I got to see this warning when the job was declared as failed.
This warning showed up after bumping the timeout to 900 for the `fs volume ls` command.
fyi - I don't think the timeout has any relation to the mgr client eviction.
Yeh, those are two separate issues. The command timeout masks the eviction warning, I think.
BTW, have you tried reverting the patch mentioned in note-5 to see if the issue reproduces (with the default command timeout)?
Updated by Milind Changire almost 2 years ago
Venky Shankar wrote in #note-14:
BTW, have you tried reverting the patch mentioned in note-5 to see if the issue reproduces (with the default command timeout)?
I'll try that next.
Updated by Venky Shankar almost 2 years ago
Milind Changire wrote in #note-16:
The teuthology job failed with the default timeout even after reverting the PR mentioned in note-5
Oh dear. We are back to square one on this. The fact that the command (`volume ls`) hang isn't observed with a larger command timeout is now confusing.
Can you upload the thread backtraces you captured during the perceived hang to this ticket, @Milind Changire?
Updated by Milind Changire almost 2 years ago
Updated by Milind Changire almost 2 years ago
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr config_notify Calling volumes._config_notify...
...
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_delay
2024-05-15T10:35:47.760+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_delay not found
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0
2024-05-15T10:35:47.760+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/periodic_async_work
2024-05-15T10:35:47.760+0000 7ff6a02bc640 10 mgr get_typed_config periodic_async_work not found
2024-05-15T10:35:47.760+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option periodic_async_work = False
2024-05-15T10:35:47.760+0000 7ff6b031c640 20 mgr Gil Switched to new thread state 0x56234880c6c0
...
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_no_wait = True
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_level
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_level not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_level =
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_file
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_file not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_file = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/sqlite3_killpoint
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config sqlite3_killpoint not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option sqlite3_killpoint = 0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster_level
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster_level not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster_level = info
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread done with [0x56234816d7e0,0]
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread doing [0x56234816cdc0,0,0x5623489c6700,0]
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
...
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_delay
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_delay not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/periodic_async_work
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config periodic_async_work not found
2024-05-15T10:40:45.466+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option periodic_async_work = False
2024-05-15T10:40:45.466+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/snapshot_clone_no_wait
2024-05-15T10:40:45.466+0000 7ff6a02bc640 10 mgr get_typed_config snapshot_clone_no_wait not found
2024-05-15T10:40:45.466+0000 7ff6a4ac5640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a52c6640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a5ac7640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a62c8640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a32c2640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a1abf640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a2ac1640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.466+0000 7ff6a22c0640 0 [volumes DEBUG volumes.fs.async_job] processing 0 volume entries
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option snapshot_clone_no_wait = True
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_level
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_level not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_level =
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_file
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_file not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_file = False
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/sqlite3_killpoint
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config sqlite3_killpoint not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option sqlite3_killpoint = 0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster = False
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr get_config key: mgr/volumes/log_to_cluster_level
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr get_typed_config log_to_cluster_level not found
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes DEBUG root] mgr option log_to_cluster_level = info
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr.server operator() dispatching command 'fs volume ls' size 2
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr Gil Switched to new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes INFO volumes.module] Starting _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-05-15T10:40:45.467+0000 7ff6a02bc640 0 [volumes INFO volumes.module] Finishing _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr ~Gil Destroying new thread state 0x562347fc66c0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 20 mgr.server reply success
2024-05-15T10:40:45.467+0000 7ff6a02bc640 1 -- [v2:172.21.15.193:6835/3001227076,v1:172.21.15.193:6837/3001227076] --> 172.21.15.193:0/3903477197 -- mgr_command_reply(tid 0: 0 ) -- 0x562348d881a0 con 0x5623489c8000
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 mgr.server operator() command returned 0
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread done with [0x56234816cdc0,0,0x5623489c6700,0]
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread empty
2024-05-15T10:40:45.467+0000 7ff6a02bc640 10 finisher(0x562348173538) finisher_thread sleeping
The above sequence of mgr logs points to some contention on the Python GIL.
I'm not able to conclusively say if it is because of the pending mount operation, which fails with ETIMEDOUT while waiting for the mds to respond. The mount operation is initiated in the purgequeue job path.
Also, if the timeout is bumped to 900 seconds, then eventually the `fs volume ls` passes ... presumably because the mds finally comes up and responds to the fs operations.
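The queueing behavior described above can be modeled with a toy single-threaded finisher (this is not Ceph code, just an illustration): callbacks run strictly in order, so a slow `_config_notify` queued first delays a cheap `fs volume ls` handler queued behind it, no matter how fast the handler itself is.

```python
# Toy model of a single finisher thread: jobs execute strictly in queue
# order, so a slow job (the config_notify stand-in) delays every job
# queued behind it (the volume_ls stand-in).
import queue
import threading
import time

finisher_q = queue.Queue()
done = []

def finisher_thread():
    while True:
        job = finisher_q.get()
        if job is None:  # shutdown sentinel
            break
        job()

t = threading.Thread(target=finisher_thread)
t.start()

# A slow callback queued first...
finisher_q.put(lambda: (time.sleep(0.2), done.append('config_notify')))
# ...delays the cheap command handler queued behind it.
finisher_q.put(lambda: done.append('volume_ls'))
finisher_q.put(None)
t.join()

print(done)  # ['config_notify', 'volume_ls']
```

With a large enough command timeout the command still completes once the slow job finishes, which matches the observation that bumping the timeout to 900s makes `fs volume ls` pass.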
Updated by Venky Shankar almost 2 years ago
- Related to Bug #66483: qa: CommandFailedError in test_multi_snap_schedule added
Updated by Venky Shankar over 1 year ago
@Milind Changire bisected the commits, and it looks like https://github.com/ceph/ceph/pull/55823/commits/329edd3c56baff81cb76c5008ac2eaf7c110f9e1#diff-501f5d864f91e98b1c214da58065308a41554b87ce6366f2314a84c19c287ca7R239 is the culprit.
Updated by Milind Changire over 1 year ago
Git bisect of merges resulted in the following:
BAD  c2b21dbf694 Leonid S. Usov  2024-04-30 13:53:46 +0300 Merge pull request #55823 from ceph/wip-lusov-quiescer
GOOD 19c2c8cef15 Matan Breizman  2024-04-30 12:29:59 +0300 Merge pull request #56806 from xxhdx1985126/wip-65399
The aim is to revert patch:
commit 329edd3c56baff81cb76c5008ac2eaf7c110f9e1
Author: Leonid Usov <leonid.usov@ibm.com>
Date: Thu Apr 25 19:41:57 2024 +0300
qa/tasks/cephfs: don't create a new CephManager if there is one in the context
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
and run the --suite fs --filter-all snap-schedule job set for about 20 consecutive iterations and see if it behaves well, to form a conclusive opinion.
Updated by Milind Changire over 1 year ago
The `fs volume ls` command getting stuck is just a side-effect of `_config_notify` being called for the module. The `fs volume ls` command gets queued up behind the `_config_notify`.
Currently, I'm trying to set up a thread, parallel to the worker finisher thread, that watches for a deadline to elapse during the Context execution call before forcefully doing a `ceph_assert(false)`.
I hope to catch a sensible core dump for analysis, to see which mutex the `_config_notify` is blocking on.
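A minimal sketch of the watchdog idea (all names hypothetical, not the actual mgr change): a side thread flags a callback that overruns its deadline, standing in for the `ceph_assert(false)` that would force a core dump in the real experiment.

```python
# Hypothetical watchdog sketch: a side thread waits for the job to signal
# completion; if the deadline elapses first, it flags a timeout (the real
# version would do ceph_assert(false) to capture a core dump).
import threading
import time

def run_with_watchdog(job, deadline):
    finished = threading.Event()
    timed_out = []

    def watchdog():
        # Event.wait returns False if the timeout elapsed before set().
        if not finished.wait(deadline):
            timed_out.append(True)  # real version: ceph_assert(false)

    w = threading.Thread(target=watchdog)
    w.start()
    job()            # execute the Context-style callback
    finished.set()   # tell the watchdog we completed
    w.join()
    return bool(timed_out)

# A fast job beats the deadline; a slow one trips the watchdog.
assert run_with_watchdog(lambda: None, deadline=0.5) is False
assert run_with_watchdog(lambda: time.sleep(0.2), deadline=0.05) is True
```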
Updated by Venky Shankar over 1 year ago
Milind Changire wrote in #note-24:
The `fs volume ls` command getting stuck is just a side-effect of `_config_notify` being called for the module. The `fs volume ls` command gets queued up behind the `_config_notify`.
Do you have a backtrace for the stuck thread?
Updated by Milind Changire over 1 year ago
- File t-a-a-bt.txt added
Updated by Venky Shankar over 1 year ago
@Milind Changire pointed me at a failed run: /a/mchangir-2024-07-06_16:10:22-fs-wip-mchangir-log-use-gettid-instead-of-pthread_self-debug-distro-default-smithi/7789504
The backtrace points to a config_notify() call in mgr/volumes where the GIL is being held and the mgr volumes finisher is blocked. For that thread:
zgrep " b286 " ./remote/smithi117/log/ceph-mgr.x.log.gz 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread start 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread empty 2024-07-06T16:45:45.108+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread sleeping 2024-07-06T16:45:47.566+0000 b286 10 finisher(0x55e982ffd1b8) finisher_thread doing [0x55e982fc43a0,0] 2024-07-06T16:45:47.567+0000 b286 20 mgr Gil Switched to new thread state 0x55e982fb6900 2024-07-06T16:45:47.567+0000 b286 20 mgr config_notify Calling volumes._config_notify... 2024-07-06T16:45:47.567+0000 b286 10 ceph_option_get debug_mgr found: 20/20 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_level 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_level not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_cluster_level 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_cluster_level not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_file 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_file not found 2024-07-06T16:45:47.567+0000 b286 20 mgr get_config key: mgr/volumes/log_to_cluster 2024-07-06T16:45:47.567+0000 b286 10 mgr get_typed_config log_to_cluster not found 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/max_concurrent_clones 2024-07-06T16:45:47.568+0000 b286 10 mgr get_typed_config max_concurrent_clones not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option max_concurrent_clones = 4 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/snapshot_clone_delay 2024-07-06T16:45:47.568+0000 b286 10 mgr get_typed_config snapshot_clone_delay not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option snapshot_clone_delay = 0 2024-07-06T16:45:47.568+0000 b286 20 mgr get_config key: mgr/volumes/periodic_async_work 2024-07-06T16:45:47.568+0000 
b286 10 mgr get_typed_config periodic_async_work not found 2024-07-06T16:45:47.568+0000 b286 0 [volumes DEBUG root] mgr option periodic_async_work = False
(note: the `b286` tid in the log is due to a custom change Milind added to match the thread id with gdb)
Anyway, one can see the GIL is acquired
2024-07-06T16:45:47.567+0000 b286 20 mgr Gil Switched to new thread state 0x55e982fb6900
but there is no corresponding release of the GIL. So, my guess is that config_notify() in mgr/volumes threw a backtrace due to an unhandled exception (which isn't logged!) and the finisher thread is blocked, thereby queueing up the `fs volume ls` command.
If this gets validated, then I think we should relook at how unhandled exceptions are logged.
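If the unhandled-exception theory holds, one possible shape for the fix is a logging wrapper around the notify call (a sketch with hypothetical names, not the actual mgr code): the exception is caught and logged with a full traceback instead of silently leaving the finisher blocked.

```python
# Hypothetical sketch: wrap a module's notify callback so an unhandled
# exception is logged with a full traceback rather than vanishing.
import logging
import traceback

log = logging.getLogger('mgr')

def safe_notify(module_name, notify_fn):
    """Run notify_fn(); log (never propagate) any unhandled exception."""
    try:
        notify_fn()
    except Exception:
        log.error('unhandled exception in %s._config_notify:\n%s',
                  module_name, traceback.format_exc())
```

A caller would do `safe_notify('volumes', module._config_notify)` so the finisher thread always regains control, whatever the module's callback does.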
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Sorry, my bad. config_notify does use the per-module finisher thread, and it takes the GIL before calling the module's _config_notify:
void ActivePyModules::config_notify()
{
  std::lock_guard l(lock);
  for (auto& [name, module] : modules) {
    // Send all python calls down a Finisher to avoid blocking
    // C++ code, and avoid any potential lock cycles.
    dout(15) << "notify (config) " << name << dendl;
    Finisher& mod_finisher = py_module_registry.get_active_module_finisher(name);
    // workaround for https://bugs.llvm.org/show_bug.cgi?id=35984
    mod_finisher.queue(new LambdaContext([module=module](int r){
      module->config_notify();
    }));
  }
}

void ActivePyModule::config_notify()
{
  if (is_dead()) {
    dout(5) << "cancelling config_notify" << dendl;
    return;
  }
  Gil gil(py_module->pMyThreadState, true);
  dout(20) << "Calling " << py_module->get_name() << "._config_notify..."
           << dendl;
  auto remoteResult = PyObject_CallMethod(pClassInstance,
                                          const_cast<char*>("_config_notify"),
                                          (char*)NULL);
  if (remoteResult != nullptr) {
    Py_DECREF(remoteResult);
  }
}
Updated by Milind Changire over 1 year ago
Updated by Venky Shankar over 1 year ago · Edited
Milind Changire wrote in #note-29:
@Kotresh Hiremath Ravishankar fyi - latest teuthology job set
Milind mentioned that the cloner thread holds the lock for a long time (~5 minutes), which causes the config_notify() call via the volumes finisher to block waiting to acquire it. AFAICT, the cloner threads (and the purge threads) run without holding async_job::lock, but it's good to check if that still holds true (maybe some change got merged that affected it).
Updated by Milind Changire over 1 year ago
For this job, the following is an excerpt of the mgr logs:
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.1).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] processing 1 volume entries
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG volumes.fs.async_cloner] fetching clone entry for volume 'cephfs'
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] [get] volume: (cephfs) connection: ([])
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] [get] spawning new connection since no connection is unused and we still have room for more
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] Connecting to cephfs 'cephfs'
2024-07-10T15:10:46.530+0000 b37d 0 [volumes DEBUG mgr_util] Setting user ID and group ID of CephFS mount as root...
...
2024-07-10T15:10:46.535+0000 b37d 0 [volumes DEBUG mgr_util] CephFS mounting...
...
2024-07-10T15:10:46.537+0000 b37d 10 client.5138 did not get mds through better means, so chose random mds -1
2024-07-10T15:10:46.537+0000 b37d 20 client.5138 mds is -1
2024-07-10T15:10:46.537+0000 b37d 10 client.5138 target mds.-1 not active, waiting for new mdsmap
...
2024-07-10T15:10:50.213+0000 b388 20 mgr config_notify Calling volumes._config_notify...
...
2024-07-10T15:11:20.217+0000 b388 0 [volumes INFO mgr_util] possible deadlock acquiring <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:11:20.217+0000 b388 0 [volumes DEBUG mgr_util] locking <locked _thread.lock object at 0x7f2db0e0c720> with 5 timeout
...
2024-07-10T15:15:46.588+0000 b37d 1 client.5138 shutdown
...
2024-07-10T15:15:46.589+0000 b37d 0 [volumes ERROR volumes.fs.async_cloner] error fetching clone entry for volume 'cephfs' (-110 (error calling ceph_mount))
2024-07-10T15:15:46.589+0000 b37d 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.1).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
...
2024-07-10T15:15:46.589+0000 b380 0 [volumes DEBUG volumes.fs.async_job] AsyncJobs(cloner.tick).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b380 0 [volumes DEBUG volumes.fs.async_job] AsyncJobs(cloner.tick).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.0).locked after wait <locked _thread.lock object at 0x7f2db0e0c720>
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_job] processing 1 volume entries
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG volumes.fs.async_cloner] fetching clone entry for volume 'cephfs'
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] [get] volume: (cephfs) connection: ([])
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] [get] spawning new connection since no connection is unused and we still have room for more
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] Connecting to cephfs 'cephfs'
2024-07-10T15:15:46.590+0000 b37c 0 [volumes DEBUG mgr_util] Setting user ID and group ID of CephFS mount as root...
...
2024-07-10T15:15:46.596+0000 b37c 0 [volumes DEBUG mgr_util] CephFS mounting...
...
2024-07-10T15:15:46.604+0000 b37c 0 [volumes DEBUG mgr_util] Connection to cephfs 'cephfs' complete
...
2024-07-10T15:15:46.604+0000 b37c 20 client.5725 (path is #0x1/volumes/_index/clone)
...
2024-07-10T15:15:46.605+0000 b37f 0 [volumes DEBUG volumes.fs.async_job] JobThread(cloner.3).unlocking for wait <locked _thread.lock object at 0x7f2db0e0c720>
...
2024-07-10T15:15:46.607+0000 b388 0 [volumes INFO volumes.module] Starting _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
2024-07-10T15:15:46.607+0000 b388 0 [volumes INFO volumes.module] Finishing _cmd_fs_volume_ls(prefix:fs volume ls, target:['mon-mgr', '']) < ""
So the config_notify path (thread b37c) has a problem with receiving the mds_map, and the client connection eventually gets shut down.
It looks like most of the time is spent waiting for the mds_map ... which never arrives.
Why is the target mds chosen as -1 by the client?
Updated by Venky Shankar over 1 year ago
This issue happens when there is a race between the async threads in mgr/volumes, which have fetched a list of file systems to process for initiating clones, and an fs fail (followed by fs rm) command deleting the file system (probably done right after finishing a test). The clone thread finds the file system in the fs map and tries to establish a mount to it, but around the same time the file system is removed, so the mount call from libcephfs keeps trying to find an MDS to connect to. A new file system does get created, but with a different fscid.
For the fix, for now, the test can set client_mount_timeout to a lower value (say, 30 seconds) so that the thread does not get stuck for long - the connection would time out and the next try would successfully connect to the new file system.
We should also check if connecting to the file system (in mgr/volumes and mgr_util) can be done after dropping locks - that would be a nice enhancement. WDYT?
Updated by Milind Changire over 1 year ago · Edited
The actual connection to the cephfs via the python wrapper around libcephfs is done under nogil.
Is there any other lock that needs to be dropped?
Do you mean to drop the AsyncJobs.lock when attempting the connection to the fs?
Although the problem manifests via test_snap_schedules.py, the piece of code due to which we see the problem is common across all cephfs tests. So we'll need to set client_mount_timeout to a lower value for the entire cephfs test suite.
Updated by Venky Shankar over 1 year ago
Milind Changire wrote in #note-33:
The actual connection to the cephfs via the python wrapper around libcephfs is done under nogil.
Is there any other lock that needs to be dropped?
Do you mean to drop the AsyncJobs.lock when attempting the connection to the fs?
There is AsyncJobs.lock and then mgr_util.py::CephfsConnectionPool::lock. Once the code decides that it needs to initiate a file system connection, it can drop the lock, but we need to be careful since another thread might come in at this point and touch some structures that are incomplete. So, audit those parts carefully.
Although the problem is manifested via test_snap_schedules.py, the piece of code due to which we see the problem is common across all cephfs tests. So we'll need to set client_mount_timeout to a lower value for the entire cephfs test suite.
Agreed. Right now there is
qa/cephfs/conf/client.yaml: client mount timeout: 600
which was introduced by commit 73fa0efcbb15a47157615705043288f2913ff3bc - I think that needs to be lowered.
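The lock-dropping enhancement discussed above could look roughly like this (a sketch with assumed names, not the actual CephfsConnectionPool code): release the pool lock around the slow mount, then re-check under the lock when publishing the connection, in case another thread raced us.

```python
# Hypothetical sketch of the suggested enhancement: the pool lock is
# dropped while the (potentially slow) mount happens, so other threads
# aren't blocked behind a hung ceph_mount; the result is published under
# the lock with a re-check for a racing winner.
import threading

class ConnectionPool:
    def __init__(self, connect_fn):
        self.lock = threading.Lock()
        self.connections = {}
        self._connect = connect_fn  # slow; stands in for ceph_mount

    def get(self, volname):
        with self.lock:
            conn = self.connections.get(volname)
            if conn is not None:
                return conn
        # Lock dropped here: the slow connect no longer serializes
        # every other user of the pool.
        conn = self._connect(volname)
        with self.lock:
            # Another thread may have connected meanwhile; keep the
            # first published connection and discard ours if so.
            return self.connections.setdefault(volname, conn)
```

The care Venky mentions is the re-check: between dropping and re-taking the lock, a second thread may have created its own connection, so `setdefault` keeps exactly one winner.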
Updated by Milind Changire over 1 year ago
- 4 out of 6 jobs failed with
  cluster [WRN] evicting unresponsive client smithi111:x (7409), after 301.769 seconds
  this warning is mostly due to a mount being unresponsive from the mgr
- 2 out of 6 jobs failed with core dumps; this is due to a pthread_getname_np() failure
  [Current thread is 1 (Thread 0x7f2618676640 (LWP 48305))]
  (gdb) bt
  #0 0x00007f261da8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
  #1 0x00007f261da3e646 in raise () from /lib64/libc.so.6
  #2 0x000056458ed05ea7 in reraise_fatal(int) ()
  #3 0x000056458ed0fbd1 in handle_oneshot_fatal_signal(int) ()
  #4 <signal handler called>
  #5 0x00007f261da8b2b6 in pthread_getname_np@GLIBC_2.12 () from /lib64/libc.so.6
  #6 0x00007f261e705597 in ceph::logging::Log::dump_recent() () from /usr/lib64/ceph/libceph-common.so.2
  #7 0x000056458e8d15ef in MDSDaemon::respawn() ()
  #8 0x000056458e8c9b3e in MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&) ()
  #9 0x000056458e8cfdc7 in MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&) ()
  #10 0x000056458e8d0a06 in MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
  #11 0x00007f261e5967fd in Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&) () from /usr/lib64/ceph/libceph-common.so.2
  #12 0x00007f261e5970f1 in DispatchQueue::entry() () from /usr/lib64/ceph/libceph-common.so.2
  #13 0x00007f261e66e005 in DispatchQueue::DispatchThread::entry() () from /usr/lib64/ceph/libceph-common.so.2
  #14 0x00007f261e423cf7 in Thread::entry_wrapper() () from /usr/lib64/ceph/libceph-common.so.2
  #15 0x00007f261e423d13 in Thread::_entry_func(void*) () from /usr/lib64/ceph/libceph-common.so.2
  #16 0x00007f261da89c02 in start_thread () from /lib64/libc.so.6
  #17 0x00007f261db0ec40 in clone3 () from /lib64/libc.so.6
- restored `fs volume ls` command timeout to the default of 120 seconds
- reduced client_mount_timeout from 300 seconds to 30 seconds
Updated by Milind Changire over 1 year ago
- Pull request ID changed from 58547 to 58771
Updated by Venky Shankar over 1 year ago
- Status changed from Triaged to Pending Backport
- Source set to Development
- Backport set to quincy,reef,squid
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67826: squid: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67827: reef: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67828: quincy: qa: `fs volume ls` command times out waiting for fs to come online added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Venky Shankar over 1 year ago
- Pull request ID changed from 58771 to 58547
Updated by Konstantin Shalygin about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Venky Shankar 10 months ago
- Related to Bug #71516: qa: `volume ls` times out with test_snap_schedule_subvol_and_group_arguments_16 test added
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 52deba6b14a095eb1cf0d8eb397861dc3bc900c6
- Fixed In set to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp set to 2025-06-30T17:49:15+00:00
Updated by Upkeep Bot 9 months ago
- Merge Commit deleted (52deba6b14a095eb1cf0d8eb397861dc3bc900c6)
- Fixed In deleted (v19.3.0-4615-g52deba6b14a0)
- Upkeep Timestamp deleted (2025-06-30T17:49:15+00:00)
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 52deba6b14a095eb1cf0d8eb397861dc3bc900c6
- Fixed In set to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp set to 2025-06-30T17:55:37+00:00
Updated by Upkeep Bot 9 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a0 to v19.3.0-4615
- Upkeep Timestamp changed from 2025-06-30T17:55:37+00:00 to 2025-06-30T17:55:37
Updated by Upkeep Bot 9 months ago
- Fixed In changed from v19.3.0-4615 to v19.3.0-4615-g52deba6b14a0
- Upkeep Timestamp changed from 2025-06-30T17:55:37 to 2025-06-30T18:02:43+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a0 to v19.3.0-4615-g52deba6b14a
- Upkeep Timestamp changed from 2025-06-30T18:02:43+00:00 to 2025-07-08T18:29:54+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-4615-g52deba6b14a to v19.3.0-4615-g52deba6b14
- Upkeep Timestamp changed from 2025-07-08T18:29:54+00:00 to 2025-07-14T17:10:31+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2126
- Upkeep Timestamp changed from 2025-07-14T17:10:31+00:00 to 2025-11-01T01:27:50+00:00