qa: ignore cluster warning (evicting unresponsive ...) with tasks/mgr-osd-full #65693

Merged
vshankar merged 1 commit into ceph:main from vshankar:wip-73278
Nov 3, 2025

Conversation

@vshankar
Contributor

fs/full/subvolume_ls.sh restarts ceph-mgr periodically, and a restart does not clean up libcephfs handles.

Fixes: http://tracker.ceph.com/issues/73278

…-osd-full

fs/full/subvolume_ls.sh will restart ceph-mgr periodically and
that does not cleanup libcephfs handles.

Fixes: http://tracker.ceph.com/issues/73278
Signed-off-by: Venky Shankar <vshankar@redhat.com>
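
A qa change like this one typically amounts to adding the warning to the suite's log-ignorelist so teuthology does not flag it as a failure. A minimal sketch, assuming a hypothetical override fragment for the fs:full suite (the actual file path and exact warning string are not shown on this page):

```yaml
# hypothetical override fragment for the fs:full suite
overrides:
  ceph:
    log-ignorelist:
      - evicting unresponsive client
```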

@vshankar vshankar requested a review from a team October 11, 2025 17:54
@gregsfortytwo
Member

Why doesn’t restart clean up libcephfs? That seems…not great.
Also, doesn’t a mgr failover blocklist the previous instance, which should remove these clients?

@vshankar
Contributor Author

> Why doesn’t restart clean up libcephfs? That seems…not great.

This is nothing new afaik -- it has always been that way because cleanup of plugins has been problematic in the manager, so they are never cleaned up. @batrick did some work to blocklist the clients by including the client addrs in the manager beacon message sent to the monitor, but that isn't sufficient.

@batrick
Member

batrick commented Oct 26, 2025

> Why doesn’t restart clean up libcephfs? That seems…not great. Also, doesn’t a mgr failover blocklist the previous instance, which should remove these clients?

I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

There would still be a race however with waiting for the MgrMap to reflect the registered client instance. That is resolved by #51169 but it's unfortunately abandoned.
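
The ordering described above can be sketched against the public librados/libcephfs C APIs. This is an illustrative sketch only, not a working implementation: it needs a live cluster and mgr-internal plumbing, and `register_client_instance()` is a hypothetical stand-in for the mgr step that puts the addrs into the beacon.

```c
/* Illustrative sketch: know the client instance before touching the MDS. */
#include <rados/librados.h>
#include <cephfs/libcephfs.h>

int mount_with_known_instance(void)
{
    rados_t cluster;
    struct ceph_mount_info *cmount;
    char *addrs = NULL;

    rados_create(&cluster, "mgr");         /* 1. create a RADOS handle first */
    rados_conf_read_file(cluster, NULL);
    rados_connect(cluster);                /* the client instance now exists */

    rados_getaddrs(cluster, &addrs);       /* 2. learn our own addrs */
    /* register_client_instance(addrs);       hypothetical: include the addrs
                                              in the mgr beacon *before* the
                                              MDS session is established */

    ceph_create_from_rados(&cmount, cluster);  /* 3. reuse the same instance */
    return ceph_mount(cmount, "/");        /* 4. only now contact the MDS */
}
```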

@vshankar
Contributor Author

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done, and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

@batrick
Member

batrick commented Oct 27, 2025

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

@gregsfortytwo
Member

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

> Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

> Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

Seems like fixes we really need to get in. Right now a failed ceph-mgr could keep doing things to the fs/subvolumes and that seems real bad. 😮

Doesn't have to block making QA go, but I didn't realize we had this hole and we should prioritize it...

@vshankar
Contributor Author

vshankar commented Nov 3, 2025

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

> Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

> Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

> Seems like fixes we really need to get in. Right now a failed ceph-mgr could keep doing things to the fs/subvolumes and that seems real bad. 😮

> Doesn't have to block making QA go, but I didn't realize we had this hole and we should prioritize it...

I have reached out to @ajarr to get that PR moving, or the fs team can work on it if @ajarr agrees.

@vshankar vshankar merged commit 1530ebd into ceph:main Nov 3, 2025
13 of 14 checks passed
@ajarr
Contributor

ajarr commented Nov 5, 2025

> I have reached out to @ajarr to get the PR moving or the fs team can work on it if @ajarr agrees.

@vshankar please feel free to take over.

This is still to be addressed in PR #51169, #51169 (comment). Thanks!


Labels

cephfs Ceph File System tests
