squid: mds: use regular dispatch for processing metrics#57678

Merged
lxbsz merged 2 commits into ceph:squid from batrick:wip-66188-squid
Jul 17, 2024

Conversation

@batrick
Member

@batrick batrick commented May 23, 2024

backport tracker: https://tracker.ceph.com/issues/66188


backport of #57081
parent tracker: https://tracker.ceph.com/issues/65658

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

batrick added 2 commits May 23, 2024 14:00
There have been cases where the MDS does an undesirable failover because it
misses heartbeat resets after a long recovery in up:replay.  It was observed
that the MDS was processing a flood of metrics messages from all reconnecting
clients. This likely caused undesirable MetricAggregator::lock contention in
the messenger threads while fast dispatching client metrics.

Instead, use the normal dispatch where acquiring locks is okay to do.

See-also: linux.git/f7c2f4f6ce16fb58f7d024f3e1b40023c4b43ff9
Fixes: https://tracker.ceph.com/issues/65658
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit ed1fe99)
Since these are no longer fast dispatched, we need to ensure they are processed
in a timely fashion and ahead of any incoming requests.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit d56b502)
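The two commits above move metrics handling off the messenger's fast-dispatch path (where taking locks is problematic) onto the regular dispatch queue, and bump the priority of MClientMetrics ahead of ordinary client requests so metrics are still processed promptly. A minimal, hypothetical Python sketch of that prioritized-queue idea (the class and priority values below are illustrative, not Ceph's actual code):

```python
import heapq
from itertools import count

# Illustrative sketch (not Ceph code): messages handled by a regular
# dispatcher drain from a priority queue, so metrics can be bumped ahead
# of ordinary requests without fast-dispatching them on messenger threads.

class DispatchQueue:
    def __init__(self):
        self._heap = []
        self._seq = count()  # FIFO tie-break within the same priority

    def enqueue(self, priority, msg):
        # heapq is a min-heap; negate priority so higher values pop first
        heapq.heappush(self._heap, (-priority, next(self._seq), msg))

    def drain(self):
        while self._heap:
            _, _, msg = heapq.heappop(self._heap)
            yield msg

PRIO_CLIENT_REQUEST = 127   # hypothetical priority values
PRIO_CLIENT_METRICS = 130   # metrics bumped ahead of regular requests

q = DispatchQueue()
q.enqueue(PRIO_CLIENT_REQUEST, "client_request_1")
q.enqueue(PRIO_CLIENT_METRICS, "client_metrics")
q.enqueue(PRIO_CLIENT_REQUEST, "client_request_2")

# metrics dequeue first; requests keep their arrival order
order = list(q.drain())
```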
@batrick batrick added this to the squid milestone May 23, 2024
@batrick batrick added the cephfs Ceph File System label May 23, 2024
@joscollin
Member

Tested in https://tracker.ceph.com/issues/66423

Member

@joscollin joscollin left a comment

@vshankar
Contributor

> @batrick @vshankar I see a failure https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751718. Can you confirm if that's not related?

Have you checked the failure in test_client_metrics_and_metadata test?

@joscollin
Member

> @batrick @vshankar I see a failure https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751718. Can you confirm if that's not related?
>
> Have you checked the failure in test_client_metrics_and_metadata test?

@vshankar
It's not getting the metrics
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_00761a09b9e071898ab1cd2e913c1a0c41c97ab4/qa/tasks/cephfs/test_mds_metrics.py", line 529, in test_client_metrics_and_metadata
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    valid, metrics = self._get_metrics(
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_00761a09b9e071898ab1cd2e913c1a0c41c97ab4/qa/tasks/cephfs/test_mds_metrics.py", line 101, in _get_metrics
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    while proceed():
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_861a8dcf7aa816a26e13f039336f7ed0a9aec0fa/teuthology/contextutil.py", line 134, in __call__
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for metrics' reached maximum tries (31) after waiting for 30 seconds
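For context, the `proceed()` call in the traceback comes from teuthology's bounded-retry helper: the test polls for metrics until a maximum number of tries is exhausted, then raises `MaxWhileTries`. A rough, hypothetical Python sketch of that pattern (names and behavior approximated from the log, not teuthology's actual implementation):

```python
import time

# Hypothetical sketch of a teuthology-style bounded poll loop.
class MaxWhileTries(Exception):
    pass

def safe_while(sleep=0.0, tries=31, action="wait"):
    """Return a proceed() that yields True up to `tries` times,
    sleeping between attempts, then raises MaxWhileTries."""
    attempt = 0
    def proceed():
        nonlocal attempt
        attempt += 1
        if attempt > tries:
            raise MaxWhileTries(
                f"'{action}' reached maximum tries ({tries})")
        time.sleep(sleep)
        return True
    return proceed

# usage: poll until a condition holds or the retry budget runs out
proceed = safe_while(sleep=0.0, tries=3, action="wait for metrics")
results = []
try:
    while proceed():
        results.append("poll")  # stand-in for fetching metrics
except MaxWhileTries as e:
    err = str(e)  # the test surfaces this as the failure above
```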

@joscollin
Member

joscollin commented Jun 20, 2024

The test passes in Leonid's branch wip-lusov-testing-20240611.123850-squid, when this PR's merge commit is reverted:
https://pulpito.ceph.com/jcollin-2024-06-20_00:57:42-fs:functional-wip-jcollin-testing-distro-default-smithi/

But then the test also fails in the upstream squid at 1373963:
https://pulpito.ceph.com/jcollin-2024-06-20_03:07:50-fs:functional-squid-distro-default-smithi/, which obviously doesn't have this PR.

So I've checked in main too at e879ce8. The test passes there:
https://pulpito.ceph.com/jcollin-2024-06-19_13:29:35-fs:functional-main-distro-default-smithi/

@vshankar
Contributor

> It's not getting the metrics

@joscollin This is the last metrics that the test fetched:

2024-06-12T11:32:28.079 INFO:teuthology.orchestra.run.smithi012.stdout:{"version": 2, "global_counters": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease", "opened_files", "pinned_icaps", "opened_inodes", "read_io_sizes", "write_io_sizes", "avg_read_latency", "stdev_read_latency", "avg_write_latency", "stdev_write_latency", "avg_metadata_latency", "stdev_metadata_latency"], "counters": [], "client_metadata": {"fs2": {"client.5395": {"hostname": "smithi053", "root": "/", "mount_point": "N/A", "valid_metrics": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease", "opened_files", "pinned_icaps", "opened_inodes", "read_io_sizes", "write_io_sizes", "avg_read_latency", "stdev_read_latency", "avg_write_latency", "stdev_write_latency", "avg_metadata_latency", "stdev_metadata_latency"], "kernel_version": "6.9.0-gff88c41504f3", "IP": "192.168.0.1"}}}, "global_metrics": {"fs2": {"client.5395": [[27, 25], [0, 0], [0, 11080148], [0, 28083081], [1, 0], [0, 1], [1, 1], [0, 1], [0, 0], [1, 1048576], [0, 0], [0, 0], [0, 11080148], [0, 1], [0, 825976], [34415393216886, 34]]}}, "metrics": {"delayed_ranks": [], "mds.0": {"client.5395": []}}}

You should check which metrics the test is looking for. It's very much possible that this PR causes this test failure (and likely many more in this test source), since the MDS now processes client metrics at a relatively lower priority.

joscollin pushed a commit to joscollin/ceph that referenced this pull request Jun 27, 2024
* refs/pull/57678/head:
	messages/MClientMetrics: increase priority ahead of regular requests
	mds: use regular dispatch for processing metrics
@joscollin
Member

> 2024-06-12T11:32:28.079 INFO:teuthology.orchestra.run.smithi012.stdout:{"version": 2,

@vshankar
test_client_metrics_and_metadata creates fs1 and fs2 (it's a multi-fs run), but the metrics don't include fs1, so verify_mds_metrics returns False from here each time (30 tries) and the test gives up.

Need to check why that happens.

@joscollin
Member

2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:Exception in thread Thread-21:
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:Traceback (most recent call last):
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:    self.run()
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/lib/python3.10/threading.py", line 1378, in run
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:    self.function(*self.args, **self.kwargs)
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:    if self.mx_last_updated >= ua_last_updated:
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

That's it. We need that fix here.

@vshankar
Contributor

> That's it. We need that fix here.

Good job figuring that out 👍

@joscollin
Member

joscollin commented Jun 27, 2024

@vshankar The fix is already merged to squid. So I'll take this PR for testing in my next squid batch.

@joscollin joscollin added the wip-jcollin-testing-squid2 Assigned for review label Jul 1, 2024
@joscollin
Member

This PR is under test in https://tracker.ceph.com/issues/66762.

Member

@lxbsz lxbsz left a comment

Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@joscollin
Member

> Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@lxbsz Could you please merge this, if there are no related failures?

@lxbsz lxbsz merged commit 929494c into ceph:squid Jul 17, 2024
@joscollin joscollin removed the wip-jcollin-testing-squid2 Assigned for review label Jul 17, 2024
@batrick batrick deleted the wip-66188-squid branch July 17, 2024 12:38
@batrick
Member Author
batrick commented Jul 17, 2024

> Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@lxbsz you can link directly like: https://tracker.ceph.com/projects/cephfs/wiki/Squid#2024-07-09
