qa: ignore cluster warning (evicting unresponsive ...) with tasks/mgr-osd-full #65693

Merged
vshankar merged 1 commit into ceph:main from vshankar:wip-73278
Nov 3, 2025

Conversation

@vshankar
Contributor

fs/full/subvolume_ls.sh restarts ceph-mgr periodically, and a restart does not clean up libcephfs handles.

Fixes: http://tracker.ceph.com/issues/73278

…-osd-full

fs/full/subvolume_ls.sh will restart ceph-mgr periodically and
that does not cleanup libcephfs handles.

Fixes: http://tracker.ceph.com/issues/73278
Signed-off-by: Venky Shankar <vshankar@redhat.com>
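
A qa change like this one typically amounts to adding the warning to the suite's log-ignorelist so teuthology does not flag it as a failure. A minimal sketch, assuming a hypothetical override fragment for the fs:full suite (the actual file path and exact warning string are not shown on this page):

```yaml
# hypothetical override fragment for the fs:full suite
overrides:
  ceph:
    log-ignorelist:
      - evicting unresponsive client
```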

@vshankar vshankar requested a review from a team October 11, 2025 17:54
@gregsfortytwo
Member

Why doesn’t restart clean up libcephfs? That seems…not great.
Also, doesn’t a mgr failover blocklist the previous instance, which should remove these clients?

@vshankar
Contributor Author

> Why doesn’t restart clean up libcephfs? That seems…not great.

This is nothing new afaik -- it has always been that way because cleanup of plugins has been problematic in the manager, so they are never cleaned up. @batrick did some work to blocklist the clients by including the client addrs in the manager beacon message sent to the monitor, but that isn't sufficient.

@batrick
Member

batrick commented Oct 26, 2025

> Why doesn’t restart clean up libcephfs? That seems…not great. Also, doesn’t a mgr failover blocklist the previous instance, which should remove these clients?

I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

There would still be a race however with waiting for the MgrMap to reflect the registered client instance. That is resolved by #51169 but it's unfortunately abandoned.
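
The ordering described above can be sketched against the public librados/libcephfs C APIs. This is an illustrative sketch only, not a working implementation: it needs a live cluster and mgr-internal plumbing, and `register_client_instance()` is a hypothetical stand-in for the mgr step that puts the addrs into the beacon.

```c
/* Illustrative sketch: know the client instance before touching the MDS. */
#include <rados/librados.h>
#include <cephfs/libcephfs.h>

int mount_with_known_instance(void)
{
    rados_t cluster;
    struct ceph_mount_info *cmount;
    char *addrs = NULL;

    rados_create(&cluster, "mgr");         /* 1. create a RADOS handle first */
    rados_conf_read_file(cluster, NULL);
    rados_connect(cluster);                /* the client instance now exists */

    rados_getaddrs(cluster, &addrs);       /* 2. learn our own addrs */
    /* register_client_instance(addrs);       hypothetical: include the addrs
                                              in the mgr beacon *before* the
                                              MDS session is established */

    ceph_create_from_rados(&cmount, cluster);  /* 3. reuse the same instance */
    return ceph_mount(cmount, "/");        /* 4. only now contact the MDS */
}
```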

@vshankar
Contributor Author

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done, and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

@batrick
Member

batrick commented Oct 27, 2025

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

@gregsfortytwo
Member

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

> Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

> Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

Seems like fixes we really need to get in. Right now a failed ceph-mgr could keep doing things to the fs/subvolumes and that seems real bad. 😮

Doesn't have to block making QA go, but I didn't realize we had this hole and we should prioritize it...

@vshankar
Contributor Author

vshankar commented Nov 3, 2025

> I haven't looked at the test in a while but there is a race between a mgr libcephfs handle mounting CephFS, registering the client, and then the beacon sent to the mons including the new client instance for blocklist. This could be a little better if the mgr created a Rados handle first, registered the client instance, and then passed that handle to Libcephfs.

> I'm surely missing a bit here since it's been a while since I looked at #51169, but the libcephfs client addrs are sent to the mon after the mount is done

> Yes, that's why I said we could create a Rados handle first which would let us know the client instance before establishing a session with the MDS.

> and there is still a race when the mgr restarts just before the addrs can be sent to the monitor in its beacon message.

> Right, so blocking the return of register_client on the updated MgrMap is what #51169 should fix. The two changes together should eliminate the problem.

> Seems like fixes we really need to get in. Right now a failed ceph-mgr could keep doing things to the fs/subvolumes and that seems real bad. 😮

> Doesn't have to block making QA go, but I didn't realize we had this hole and we should prioritize it...

I have reached out to @ajarr to get that PR moving, or the fs team can work on it if @ajarr agrees.

@vshankar vshankar merged commit 1530ebd into ceph:main Nov 3, 2025
13 of 14 checks passed
@ajarr
Contributor

ajarr commented Nov 5, 2025

> I have reached out to @ajarr to get the PR moving or the fs team can work on it if @ajarr agrees.

@vshankar please feel free to take over.

This is still to be addressed in PR #51169, #51169 (comment). Thanks!


Labels

cephfs Ceph File System tests
