squid: mds: use regular dispatch for processing metrics#57678
squid: mds: use regular dispatch for processing metrics#57678lxbsz merged 2 commits intoceph:squidfrom
Conversation
There have been cases where the MDS does an undesirable failover because it misses heartbeat resets after a long recovery in up:replay. It was observed that the MDS was processing a flood of metrics messages from all reconnecting clients. This likely caused undersiable MetricAggregator::lock contention in the messenger threads while fast dispatching client metrics. Instead, use the normal dispatch where acquiring locks is okay to do. See-also: linux.git/f7c2f4f6ce16fb58f7d024f3e1b40023c4b43ff9 Fixes: https://tracker.ceph.com/issues/65658 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> (cherry picked from commit ed1fe99)
Since these are no longer fast dispatched, we need to ensure they are processed in a timely fashion and ahead of any incoming requests. Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> (cherry picked from commit d56b502)
|
Tested in https://tracker.ceph.com/issues/66423 |
joscollin
left a comment
There was a problem hiding this comment.
@batrick @vshankar
I see a failure https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751718. Can you confirm if that's not related?
Have you checked the failure in |
@vshankar |
|
The test passes in Leonid's branch But then the test also fails in the upstream squid at 1373963: So I've checked in main too at e879ce8. The test passes there: |
@joscollin This is the last metrics that the test fetched: You should check what in metrics the test is checking for. Its very much possible that this PR could cause this test failure (and likely many more in this test source) due to MDS not processing the client metrics with a relatively lower priority. |
* refs/pull/57678/head: messages/MClientMetrics: increase priority ahead of regular requests mds: use regular dispatch for processing metrics
@vshankar Need to check why that happens. |
That's it. We need that fix here. |
Good job figuring that out 👍 |
|
@vshankar The fix is already merged to squid. So I'll take this PR for testing in my next squid batch. |
|
This PR is under test in https://tracker.ceph.com/issues/66762. |
lxbsz
left a comment
There was a problem hiding this comment.
Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.
@lxbsz Could you please merge this, if there are no related failures? |
@lxbsz you can link directly like: https://tracker.ceph.com/projects/cephfs/wiki/Squid#2024-07-09 |
backport tracker: https://tracker.ceph.com/issues/66188
backport of #57081
parent tracker: https://tracker.ceph.com/issues/65658
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh