mds: record and forward client side metrics to manager#26004

Merged
batrick merged 8 commits intoceph:masterfrom
vshankar:wip-inter-mds-stats
Jun 2, 2020
Conversation

@vshankar
Contributor

This is a WIP and has basic messaging boilerplate for an MDS to forward metrics (client metrics, etc.) to rank 0. Follow-up PRs will stitch all of this together: clients will periodically send their metrics to their session MDS, and that MDS will forward them to rank 0 (which will aggregate all session metadata and update the mgr).

@vshankar vshankar added the cephfs Ceph File System label Jan 17, 2019
@batrick batrick self-requested a review January 17, 2019 15:19
@batrick
Member

batrick commented Jan 17, 2019

Partially-fixes: http://tracker.ceph.com/issues/36253

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from d8cdccf to 7d1df7f Compare February 4, 2019 14:13
@batrick
Member

batrick commented Mar 21, 2019

@vshankar please rebase this when you get a chance.

@vshankar
Contributor Author

> @vshankar please rebase this when you get a chance.

yep, on it w/ the changes to forwarding metrics to the manager.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 7d1df7f to 0e1e979 Compare March 27, 2019 12:37
@vshankar vshankar changed the title [WIP] mds: introduce intra-mds message for forwarding client metrics [WIP] mds: record and forward client side metrics to manager Mar 27, 2019
@vshankar
Contributor Author

Still keeping this as WIP -- I should probably include a detailed implementation writeup here (or in the commit message). I'll do that as a priority -- but this should be OK for a round of review if someone wants to delve into the details right now!

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 0e1e979 to ab02b01 Compare April 8, 2019 13:11
@vshankar vshankar changed the title [WIP] mds: record and forward client side metrics to manager mds: record and forward client side metrics to manager Apr 8, 2019
Member

@batrick batrick left a comment

please rebase

@vshankar
Contributor Author

> please rebase

I'll get to this next week.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch 2 times, most recently from bb63f30 to f2a96a4 Compare June 18, 2019 12:49
@vshankar
Contributor Author

@batrick -- rebased + updated. please take a look.

@vshankar vshankar closed this Jun 18, 2019
@vshankar vshankar reopened this Jun 18, 2019
@batrick batrick self-requested a review June 18, 2019 19:29
@batrick batrick self-assigned this Jun 18, 2019
@batrick batrick assigned vshankar and unassigned batrick Jul 30, 2019
@vshankar vshankar force-pushed the wip-inter-mds-stats branch 2 times, most recently from d483817 to 6e7b697 Compare August 13, 2019 04:45
@batrick
Member

batrick commented Aug 19, 2019

@vshankar what's the status of this PR? Is it blocking on my review?

@vshankar
Contributor Author

> @vshankar what's the status of this PR? Is it blocking on my review?

kind of -- although I am waiting for more comments on PR #29214 so I can rebase.

If you want, you can go ahead with a round of review. I have addressed most of your comments.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 6e7b697 to e288ff8 Compare August 26, 2019 12:57
@vshankar vshankar force-pushed the wip-inter-mds-stats branch from b2fe51d to 42ee568 Compare April 14, 2020 09:15
@vshankar vshankar requested review from a team as code owners April 14, 2020 09:15
Member

@gregsfortytwo gregsfortytwo left a comment

Okay, I went through the big parts of the MDS data structures and locking and have some comments and questions.

I eventually realized when looking at the spinlocks that you probably did that for fast dispatch? Literally using a spinlock is not really what those requirements mean, though — the important part is to not block. This code may be okay in that regard but it will certainly be better with a mutex, and I can tell you that one very important cycle to avoid is doing anything in fast dispatch which can block on sending a message back out. Looks like that’s okay now because the rank0 update function drops the local lock, but be careful of it when updating.
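The pattern being asked for here can be sketched as follows -- a minimal illustration with made-up names (`MetricsHandler`, `flush_to_rank0`, `send_message_mds` as used here are illustrative, not the actual Ceph code): snapshot the shared state under the lock, then send with the lock dropped.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct UpdateBatch {
  std::vector<std::string> metrics;
};

class MetricsHandler {
  std::mutex lock;  // protects client_metrics only
  std::map<int, std::string> client_metrics;

public:
  void record(int client_id, std::string metric) {
    std::scoped_lock l(lock);
    client_metrics[client_id] = std::move(metric);
  }

  // Called from the update timer: snapshot under the lock, send outside it.
  UpdateBatch flush_to_rank0() {
    UpdateBatch batch;
    {
      std::scoped_lock l(lock);
      for (auto &[id, m] : client_metrics)
        batch.metrics.push_back(m);
    }  // lock dropped here ...
    // ... so a (hypothetical) send_message_mds(0, batch) call placed here
    // cannot deadlock against a dispatch path that also takes `lock`.
    return batch;
  }
};
```

The key property is that no message is ever sent while the local lock is held, which is what makes this safe from fast dispatch regardless of whether the lock is a spinlock or a mutex.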

inline void operator()(const ClientMetricPayload &payload) const {
ClientMetricType metric_type = ClientMetricPayload::METRIC_TYPE;
m_formatter->dump_string("client_metric_type", stringify(metric_type));
payload.dump(m_formatter);
Member

Shouldn't we enclose the payload.dump call in a "section"?

m_formatter->open_object_section() and m_formatter->close_section()

Contributor Author

maybe -- I'll take a closer look.
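The suggestion above can be sketched with a toy stand-in for the Formatter class (the `ToyFormatter` type and the `CAP_INFO`/`cap_hits` names here are illustrative, not Ceph's actual `common/Formatter.h` API or metric names):

```cpp
#include <sstream>
#include <string>

// Toy stand-in for a formatter, just to show the suggested nesting.
struct ToyFormatter {
  std::ostringstream out;
  void open_object_section(const std::string &name) { out << name << "{"; }
  void close_section() { out << "}"; }
  void dump_string(const std::string &k, const std::string &v) {
    out << k << "=" << v << ";";
  }
};

// With the suggested change, the payload fields end up nested in their own
// section instead of being flattened into the enclosing object.
std::string dump_metric() {
  ToyFormatter f;
  f.dump_string("client_metric_type", "CAP_INFO");  // hypothetical type name
  f.open_object_section("payload");
  f.dump_string("cap_hits", "100");                 // hypothetical field
  f.close_section();
  return f.out.str();
}
```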

}

void print(ostream &out) const override {
out << "mds_metrics";
Member

This should include a little more to help track and debug issues, at minimum which MDS it's coming from and the number of embedded client metric updates. I guess we don't have any sequence numbers we can embed in this protocol? (Hmm that makes me nervous but I guess these numbers are okay to be lossy?)

Contributor Author

ok -- makes sense to add some info about the payload.

Regarding the absence of seq numbers: this does not have to be strict in terms of message ordering, so it's OK to be lossy.

Contributor Author

btw, just FYI: metrics_message_t (which is a member variable in this message class) has a sequence number which is used to identify lagginess, not for message ordering as such.

MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Spinlocks need to be VERY clear about what data they protect and why. At a quick skim this looks like it ought be a mutex instead.

// deadlock
ceph::spinlock lock;

// ISN sent by rank0 pinger is 1
Member

ISN? What is this for given that it's not related to last_updated_seq?

Contributor Author

MDSPinger sends a ping message with an initial sequence number of 1.

// locally to figure out if a session got added and removed
// within an update to rank 0. note that, this is initialized
// with the incoming sequence number sent by rank 0.
version_t last_updated_seq = 0;
Member

I don’t see this implemented as described, looks like it only sets the per-client seq in add_session, which is called on new sessions but not when updating them?

Contributor Author

you are right -- the comment is a bit outdated. last_updated_seq is used to determine whether a message carrying session information has been sent to rank 0. If a session got added and removed before the timer task flushes session updates/metrics to rank 0, the corresponding metrics entry can be removed locally and need not be sent to rank 0 (for removal).
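The behaviour described above can be sketched roughly like this (the `SessionTracker` type and its members are illustrative names, not the real MDS code):

```cpp
#include <cstdint>
#include <map>
#include <set>

using version_t = std::uint64_t;

struct SessionTracker {
  version_t last_updated_seq = 0;     // bumped on every flush to rank 0
  std::map<int, version_t> sessions;  // client id -> seq at which it was added
  std::set<int> pending_removals;     // removals rank 0 must be told about

  void add_session(int client) { sessions[client] = last_updated_seq; }

  void remove_session(int client) {
    auto it = sessions.find(client);
    if (it == sessions.end())
      return;
    // A flush has happened since the add, so rank 0 has seen this session
    // and must be told about the removal; otherwise purge it locally only.
    if (it->second < last_updated_seq)
      pending_removals.insert(client);
    sessions.erase(it);
  }

  void flushed_to_rank0() { ++last_updated_seq; }
};
```

A session that is added and removed within one flush window never reaches `pending_removals`, which is exactly the "removed locally, not sent to rank 0" case described in the comment.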

private:
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Again, need to document what this covers and I don’t see any reason for it to be a spinlock.

std::scoped_lock locker(lock);
auto it = ping_state_by_rank.find(rank);
if (it == ping_state_by_rank.end()) {
derr << ": rank=" << rank << " was never sent ping request." << dendl;
Member

Does detecting this matter? It seems like if it does we should assert (at least when testing), and if it doesn’t we shouldn’t bother to check.

Contributor Author

This handles a corner case of an mds becoming active, then going down before rank 0 can send a ping request.

Member

That doesn't seem like it's worth logging to derr though?

Contributor Author

ack

MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Another spinlock

dout(10) << dendl;
ceph_assert(ceph_mutex_is_locked_by_me(mds->mds_lock));

mds->mds_lock.unlock();
Member

Rather than do all this with the real mds_lock, why not just read out the address of mds 0 in notify_mdsmap and send the message using a non-rank-based interface? That seems a lot cleaner with the layers and makes locking simpler.

Contributor Author

IIRC, Patrick and I discussed this a while back: a message could be sent to an mds without acquiring mds_lock. I think the point where we decided to go ahead with the current implementation was related to sending the mdsmap before sending out the actual message. Thinking again on this -- does it really matter to send an (updated) mdsmap in this scenario?

Member

Ohhhh, is it trying to make sure the receiving MDS always has a current MDSMap before dealing with requests?
Hmm, that is a thing we don't want to break in the general case, but it looks like these messages already handle being delayed or misdirected. You'd have to make sure it doesn't erroneously start thinking it owns a metrics gathering server, and that if an MDS sends an update to the old mds0 it correctly sends all the same updates to the new one -- but since these are all "total" rather than "since last update" values I think it should be fine?

Contributor Author

right -- I guess it doesn't matter if we send some updates to an old active mds. This would simplify the locking bits.

@vshankar
Contributor Author

> I eventually realized when looking at the spinlocks that you probably did that for fast dispatch? Literally using a spinlock is not really what those requirements mean, though — the important part is to not block. This code may be okay in that regard but it will certainly be better with a mutex, and I can tell you that one very important cycle to avoid is doing anything in fast dispatch which can block on sending a message back out. Looks like that’s okay now because the rank0 update function drops the local lock, but be careful of it when updating.

Everything used mutexes initially, but then @batrick suggested using spinlocks and it made sense to me. I get the point about not blocking in fast dispatch. Mind explaining a bit more about why mutexes would be better here?

@gregsfortytwo
Member

> Everything used mutexes initially, but then @batrick suggested to use spinlocks and it made sense to me. I get the point about not blocking in fast dispatch. Mind explaining a bit more as in why mutexes would be better here?

The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.
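The contrast can be sketched in a few lines (a minimal illustration; `Spinlock` and `count_with_mutex` are made-up names, not Ceph code):

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// A userspace spinlock is just an atomic flag; the kernel cannot see it,
// so a thread holding it can be descheduled while waiters burn CPU spinning.
struct Spinlock {
  std::atomic_flag f = ATOMIC_FLAG_INIT;
  void lock() { while (f.test_and_set(std::memory_order_acquire)) {} }
  void unlock() { f.clear(std::memory_order_release); }
};

// A std::mutex parks contended waiters in the kernel (futex on Linux), so
// the scheduler can run something useful; uncontended cost is comparable.
long count_with_mutex(int iters) {
  std::mutex mx;
  long counter = 0;
  auto work = [&] {
    for (int i = 0; i < iters; ++i) {
      std::scoped_lock l(mx);
      ++counter;
    }
  };
  std::thread a(work), b(work);
  a.join();
  b.join();
  return counter;
}
```

Both give the same mutual exclusion; the difference is entirely in what happens to a waiter (and to a preempted holder) under contention.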

@vshankar
Contributor Author

vshankar commented May 4, 2020

> The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.

Given the changes to send metrics updates to mds0 via the non-rank interface, locking would only be required to track the updates. I think sticking to spinlocks probably makes sense now?

@gregsfortytwo
Member

> The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.

> Given that with the changes to send metrics updates to mds0 via non-rank interface, locking would only be required to track the updates. I think sticking to spinlocks probably makes sense now?

What I'm actually worried about here, aside from it not being a good habit to get into, is that I think there are some memory allocations happening underneath those map operations and that's something to avoid with spinlocks anywhere.

@vshankar
Contributor Author

vshankar commented May 5, 2020

> What I'm actually worried about here, aside from it not being a good habit to get into, is that I think there are some memory allocations happening underneath those map operations and that's something to avoid with spinlocks anywhere.

fair enough -- I'll do the changes.

@vshankar
Contributor Author

vshankar commented May 14, 2020

retest this please

@vshankar
Contributor Author

@batrick @gregsfortytwo please take a look

vshankar added 8 commits June 1, 2020 04:04
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
These types will be used to support user-defined queries
from ceph-mgr.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` class aggregates metrics from all active ranks
and places metrics appropriately as defined by user queries.

This is implemented as a separate dispatcher since metric update
messages from active MDSs are frequent; this avoids messages
getting stuck in the MDSRank queue (suggested by Patrick).

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Every MDS maintains a view of metrics that are forwarded to
it by clients. This is updated when clients forward metrics
via MClientMetrics message type. Periodically, each MDS
forwards its collected metrics to MDS rank 0 (which maintains
an aggregated view of metrics of all clients on all ranks).

Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` sets up a manager callback to forward metrics
data to ceph-mgr. Also, add interfaces for adding and removing
user queries and a simple interface to fetch MDS perf metrics.

Fixes: http://tracker.ceph.com/issues/36253
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Contributor Author

vshankar commented Jun 1, 2020

@batrick rebased and updated -- please take a look

@batrick
Member

batrick commented Jun 2, 2020

@vshankar some of the failures look new. Can you check, in particular, the causes of the evicted-client warnings?

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

> @vshankar some of the failures look new. Can you check especially the evicted client warning causes.

checking

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

For test: http://pulpito.ceph.com/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/

client.6362 did not reconnect to the MDS.

The MDS notes that the client should reconnect:

2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.server apply_blacklist: killed 0
2020-06-01T23:16:55.330+0000 7ffa98c38700  4 mds.0.73 reconnect_start: killed 0 blacklisted sessions (2 blacklist entries, 1)
2020-06-01T23:16:55.330+0000 7ffa98c38700  1 mds.0.server reconnect_clients -- 1 sessions
2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.sessionmap dump
2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.sessionmap client.6362 0x5578fc197180 state open completed {11=0x100000001f7} prealloc_inos [0x10000000003~0x1f4,0x100000001f8~0x3e8] delegated_inos [] used_inos []
2020-06-01T23:16:55.330+0000 7ffa98c38700  7 mds.0.tableserver(snaptable) finish_recovery

But client.6362 never reconnects -- the MDS times out the reconnect phase and evicts the client:

2020-06-01T23:17:42.339+0000 7ffa9b43d700  5 mds.beacon.a received beacon reply up:reconnect seq 14 rtt 0
2020-06-01T23:17:43.333+0000 7ffa94c30700 10 mds.0.cache cache not ready for trimming
2020-06-01T23:17:43.333+0000 7ffa94c30700 20 mds.0.cache upkeep thread waiting interval 0.999070644s
2020-06-01T23:17:43.880+0000 7ffa97435700  7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2020-06-01T23:17:43.880+0000 7ffa97435700  1 mds.0.server reconnect gives up on client.6362 172.21.15.205:0/1374455594
2020-06-01T23:17:43.880+0000 7ffa97435700  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi205:0 (6362), after waiting 48.5506 seconds during MDS startup
2020-06-01T23:17:43.880+0000 7ffa97435700  1 mds.0.73 Evicting (and blacklisting) client session 6362 (172.21.15.205:0/1374455594)
2020-06-01T23:17:43.880+0000 7ffa97435700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 6362 (172.21.15.205:0/1374455594) 

Doesn't look like this PR is causing client evictions.

@lxbsz
Member

lxbsz commented Jun 2, 2020

> For test: http://pulpito.ceph.com/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/
> [...]
> Doesn't look like this PR is causing client evictions.

This is a known issue, which has been fixed in https://tracker.ceph.com/issues/45665.

More detail logs please see http://qa-proxy.ceph.com/teuthology/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/remote/smithi205/log/ceph-client.0.65436.log.gz.

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

> This is a known issue, which has been fixed in https://tracker.ceph.com/issues/45665.

thx @lxbsz

@batrick
Member

batrick commented Jun 2, 2020

https://tracker.ceph.com/issues/45829

otherwise looks normal.


Labels

cephfs Ceph File System
