squid: mds: use regular dispatch for processing metrics#57678

Merged
lxbsz merged 2 commits into ceph:squid from batrick:wip-66188-squid
Jul 17, 2024

Conversation

@batrick
Member

@batrick batrick commented May 23, 2024

backport tracker: https://tracker.ceph.com/issues/66188


backport of #57081
parent tracker: https://tracker.ceph.com/issues/65658

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

batrick added 2 commits May 23, 2024 14:00
There have been cases where the MDS does an undesirable failover because it
misses heartbeat resets after a long recovery in up:replay.  It was observed
that the MDS was processing a flood of metrics messages from all reconnecting
clients. This likely caused undesirable MetricAggregator::lock contention in
the messenger threads while fast dispatching client metrics.

Instead, use the normal dispatch where acquiring locks is okay to do.

See-also: linux.git/f7c2f4f6ce16fb58f7d024f3e1b40023c4b43ff9
Fixes: https://tracker.ceph.com/issues/65658
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit ed1fe99)
Since these are no longer fast dispatched, we need to ensure they are processed
in a timely fashion and ahead of any incoming requests.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit d56b502)
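The two commits above move metrics handling off the messenger's fast-dispatch path (where taking locks is problematic) onto the regular dispatch queue, and bump the priority of MClientMetrics ahead of ordinary client requests so metrics are still processed promptly. A minimal, hypothetical Python sketch of that prioritized-queue idea (the class and priority values below are illustrative, not Ceph's actual code):

```python
import heapq
from itertools import count

# Illustrative sketch (not Ceph code): messages handled by a regular
# dispatcher drain from a priority queue, so metrics can be bumped ahead
# of ordinary requests without fast-dispatching them on messenger threads.

class DispatchQueue:
    def __init__(self):
        self._heap = []
        self._seq = count()  # FIFO tie-break within the same priority

    def enqueue(self, priority, msg):
        # heapq is a min-heap; negate priority so higher values pop first
        heapq.heappush(self._heap, (-priority, next(self._seq), msg))

    def drain(self):
        while self._heap:
            _, _, msg = heapq.heappop(self._heap)
            yield msg

PRIO_CLIENT_REQUEST = 127   # hypothetical priority values
PRIO_CLIENT_METRICS = 130   # metrics bumped ahead of regular requests

q = DispatchQueue()
q.enqueue(PRIO_CLIENT_REQUEST, "client_request_1")
q.enqueue(PRIO_CLIENT_METRICS, "client_metrics")
q.enqueue(PRIO_CLIENT_REQUEST, "client_request_2")

# metrics dequeue first; requests keep their arrival order
order = list(q.drain())
```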
@batrick batrick added this to the squid milestone May 23, 2024
@batrick batrick added the cephfs Ceph File System label May 23, 2024
@joscollin
Member

Tested in https://tracker.ceph.com/issues/66423

Member

@joscollin joscollin left a comment

@vshankar
Contributor

> @batrick @vshankar I see a failure https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751718. Can you confirm if that's not related?

Have you checked the failure in test_client_metrics_and_metadata test?

@joscollin
Member

> @batrick @vshankar I see a failure https://pulpito.ceph.com/leonidus-2024-06-12_09:41:32-fs-wip-lusov-testing-20240611.123850-squid-distro-default-smithi/7751718. Can you confirm if that's not related?
>
> Have you checked the failure in test_client_metrics_and_metadata test?

@vshankar
It's not getting the metrics
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_00761a09b9e071898ab1cd2e913c1a0c41c97ab4/qa/tasks/cephfs/test_mds_metrics.py", line 529, in test_client_metrics_and_metadata
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    valid, metrics = self._get_metrics(
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_00761a09b9e071898ab1cd2e913c1a0c41c97ab4/qa/tasks/cephfs/test_mds_metrics.py", line 101, in _get_metrics
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    while proceed():
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_861a8dcf7aa816a26e13f039336f7ed0a9aec0fa/teuthology/contextutil.py", line 134, in __call__
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2024-06-12T11:32:42.467 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for metrics' reached maximum tries (31) after waiting for 30 seconds
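For context, the `proceed()` call in the traceback comes from teuthology's bounded-retry helper: the test polls for metrics until a maximum number of tries is exhausted, then raises `MaxWhileTries`. A rough, hypothetical Python sketch of that pattern (names and behavior approximated from the log, not teuthology's actual implementation):

```python
import time

# Hypothetical sketch of a teuthology-style bounded poll loop.
class MaxWhileTries(Exception):
    pass

def safe_while(sleep=0.0, tries=31, action="wait"):
    """Return a proceed() that yields True up to `tries` times,
    sleeping between attempts, then raises MaxWhileTries."""
    attempt = 0
    def proceed():
        nonlocal attempt
        attempt += 1
        if attempt > tries:
            raise MaxWhileTries(
                f"'{action}' reached maximum tries ({tries})")
        time.sleep(sleep)
        return True
    return proceed

# usage: poll until a condition holds or the retry budget runs out
proceed = safe_while(sleep=0.0, tries=3, action="wait for metrics")
results = []
try:
    while proceed():
        results.append("poll")  # stand-in for fetching metrics
except MaxWhileTries as e:
    err = str(e)  # the test surfaces this as the failure above
```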

@joscollin
Member

joscollin commented Jun 20, 2024

The test passes in Leonid's branch wip-lusov-testing-20240611.123850-squid, when this PR's merge commit is reverted:
https://pulpito.ceph.com/jcollin-2024-06-20_00:57:42-fs:functional-wip-jcollin-testing-distro-default-smithi/

But then the test also fails in the upstream squid at 1373963:
https://pulpito.ceph.com/jcollin-2024-06-20_03:07:50-fs:functional-squid-distro-default-smithi/, which obviously doesn't have this PR.

So I've checked in main too at e879ce8. The test passes there:
https://pulpito.ceph.com/jcollin-2024-06-19_13:29:35-fs:functional-main-distro-default-smithi/

@vshankar
Contributor

> It's not getting the metrics

@joscollin This is the last metrics that the test fetched:

2024-06-12T11:32:28.079 INFO:teuthology.orchestra.run.smithi012.stdout:{"version": 2, "global_counters": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease", "opened_files", "pinned_icaps", "opened_inodes", "read_io_sizes", "write_io_sizes", "avg_read_latency", "stdev_read_latency", "avg_write_latency", "stdev_write_latency", "avg_metadata_latency", "stdev_metadata_latency"], "counters": [], "client_metadata": {"fs2": {"client.5395": {"hostname": "smithi053", "root": "/", "mount_point": "N/A", "valid_metrics": ["cap_hit", "read_latency", "write_latency", "metadata_latency", "dentry_lease", "opened_files", "pinned_icaps", "opened_inodes", "read_io_sizes", "write_io_sizes", "avg_read_latency", "stdev_read_latency", "avg_write_latency", "stdev_write_latency", "avg_metadata_latency", "stdev_metadata_latency"], "kernel_version": "6.9.0-gff88c41504f3", "IP": "192.168.0.1"}}}, "global_metrics": {"fs2": {"client.5395": [[27, 25], [0, 0], [0, 11080148], [0, 28083081], [1, 0], [0, 1], [1, 1], [0, 1], [0, 0], [1, 1048576], [0, 0], [0, 0], [0, 11080148], [0, 1], [0, 825976], [34415393216886, 34]]}}, "metrics": {"delayed_ranks": [], "mds.0": {"client.5395": []}}}

You should check which metrics the test is looking for. It's very much possible that this PR causes this test failure (and likely many more in this test source), since the MDS now processes client metrics at a relatively lower priority.

joscollin pushed a commit to joscollin/ceph that referenced this pull request Jun 27, 2024
* refs/pull/57678/head:
	messages/MClientMetrics: increase priority ahead of regular requests
	mds: use regular dispatch for processing metrics
@joscollin
Member

> 2024-06-12T11:32:28.079 INFO:teuthology.orchestra.run.smithi012.stdout:{"version": 2,

@vshankar
test_client_metrics_and_metadata creates fs1 and fs2 (it's a multi-fs run), but the metrics don't include fs1, so verify_mds_metrics returns False from here each time (30 tries) and the test gives up.

Need to check why that happens.

@joscollin
Member

2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:Exception in thread Thread-21:
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:Traceback (most recent call last):
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-06-12T11:31:34.772 INFO:tasks.ceph.mgr.y.smithi012.stderr:    self.run()
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/lib/python3.10/threading.py", line 1378, in run
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:    self.function(*self.args, **self.kwargs)
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:  File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:    if self.mx_last_updated >= ua_last_updated:
2024-06-12T11:31:34.773 INFO:tasks.ceph.mgr.y.smithi012.stderr:AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

That's it. We need that fix here.

@vshankar
Contributor

> That's it. We need that fix here.

Good job figuring that out 👍

@joscollin
Member

joscollin commented Jun 27, 2024

@vshankar The fix is already merged to squid. So I'll take this PR for testing in my next squid batch.

@joscollin joscollin added the wip-jcollin-testing-squid2 Assigned for review label Jul 1, 2024
@joscollin
Member

This PR is under test in https://tracker.ceph.com/issues/66762.

Member

@lxbsz lxbsz left a comment

Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@joscollin
Member

> Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@lxbsz Could you please merge this, if there are no related failures?

@lxbsz lxbsz merged commit 929494c into ceph:squid Jul 17, 2024
@joscollin joscollin removed the wip-jcollin-testing-squid2 Assigned for review label Jul 17, 2024
@batrick batrick deleted the wip-66188-squid branch July 17, 2024 12:38
@batrick
Member Author
batrick commented Jul 17, 2024

> Checked all the failures, they are all not related, please see 2024-07-09 in https://tracker.ceph.com/projects/cephfs/wiki/Squid.

@lxbsz you can link directly like: https://tracker.ceph.com/projects/cephfs/wiki/Squid#2024-07-09
