mds: record and forward client side metrics to manager#26004

Merged
batrick merged 8 commits intoceph:masterfrom
vshankar:wip-inter-mds-stats
Jun 2, 2020
Conversation

@vshankar
Contributor

This is a WIP and has basic messaging boilerplate for an MDS to forward metrics (client metrics, etc.) to rank 0. Follow-up PRs will stitch all of this together: clients will periodically send their metrics to their session MDS, and that MDS will forward them to rank 0 (which will aggregate all session metadata and update the mgr).

@vshankar vshankar added the cephfs Ceph File System label Jan 17, 2019
@batrick batrick self-requested a review January 17, 2019 15:19
@batrick
Member

batrick commented Jan 17, 2019

Partially-fixes: http://tracker.ceph.com/issues/36253

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from d8cdccf to 7d1df7f Compare February 4, 2019 14:13
@batrick
Member

batrick commented Mar 21, 2019

@vshankar please rebase this when you get a chance.

@vshankar
Contributor Author

> @vshankar please rebase this when you get a chance.

yep, on it w/ the changes to forwarding metrics to the manager.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 7d1df7f to 0e1e979 Compare March 27, 2019 12:37
@vshankar vshankar changed the title [WIP] mds: introduce intra-mds message for forwarding client metrics [WIP] mds: record and forward client side metrics to manager Mar 27, 2019
@vshankar
Contributor Author

Still keeping this as WIP -- I should probably include a detailed implementation writeup here (or in the commit message). I'll do that as a priority -- but this should be OK for a round of review if someone wants to delve into the details right now!

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 0e1e979 to ab02b01 Compare April 8, 2019 13:11
@vshankar vshankar changed the title [WIP] mds: record and forward client side metrics to manager mds: record and forward client side metrics to manager Apr 8, 2019
Member

@batrick batrick left a comment

please rebase

@vshankar
Contributor Author

> please rebase

I'll get to this next week.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch 2 times, most recently from bb63f30 to f2a96a4 Compare June 18, 2019 12:49
@vshankar
Contributor Author

@batrick -- rebased + updated. please take a look.

@vshankar vshankar closed this Jun 18, 2019
@vshankar vshankar reopened this Jun 18, 2019
@batrick batrick self-requested a review June 18, 2019 19:29
@batrick batrick self-assigned this Jun 18, 2019
@batrick batrick assigned vshankar and unassigned batrick Jul 30, 2019
@vshankar vshankar force-pushed the wip-inter-mds-stats branch 2 times, most recently from d483817 to 6e7b697 Compare August 13, 2019 04:45
@batrick
Member

batrick commented Aug 19, 2019

@vshankar what's the status of this PR? Is it blocking on my review?

@vshankar
Contributor Author

> @vshankar what's the status of this PR? Is it blocking on my review?

kind of -- although I am waiting for more comments on PR #29214 so I can rebase.

If you want, you can go ahead with a round of review. I have addressed most of your comments.

@vshankar vshankar force-pushed the wip-inter-mds-stats branch from 6e7b697 to e288ff8 Compare August 26, 2019 12:57
@vshankar vshankar force-pushed the wip-inter-mds-stats branch from b2fe51d to 42ee568 Compare April 14, 2020 09:15
@vshankar vshankar requested review from a team as code owners April 14, 2020 09:15
Member

@gregsfortytwo gregsfortytwo left a comment

Okay, I went through the big parts of the MDS data structures and locking and have some comments and questions.

I eventually realized when looking at the spinlocks that you probably did that for fast dispatch? Literally using a spinlock is not really what those requirements mean, though — the important part is to not block. This code may be okay in that regard but it will certainly be better with a mutex, and I can tell you that one very important cycle to avoid is doing anything in fast dispatch which can block on sending a message back out. Looks like that’s okay now because the rank0 update function drops the local lock, but be careful of it when updating.
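The pattern being asked for here can be sketched as follows -- a minimal illustration with made-up names (`MetricsHandler`, `flush_to_rank0`, `send_message_mds` as used here are illustrative, not the actual Ceph code): snapshot the shared state under the lock, then send with the lock dropped.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct UpdateBatch {
  std::vector<std::string> metrics;
};

class MetricsHandler {
  std::mutex lock;  // protects client_metrics only
  std::map<int, std::string> client_metrics;

public:
  void record(int client_id, std::string metric) {
    std::scoped_lock l(lock);
    client_metrics[client_id] = std::move(metric);
  }

  // Called from the update timer: snapshot under the lock, send outside it.
  UpdateBatch flush_to_rank0() {
    UpdateBatch batch;
    {
      std::scoped_lock l(lock);
      for (auto &[id, m] : client_metrics)
        batch.metrics.push_back(m);
    }  // lock dropped here ...
    // ... so a (hypothetical) send_message_mds(0, batch) call placed here
    // cannot deadlock against a dispatch path that also takes `lock`.
    return batch;
  }
};
```

The key property is that no message is ever sent while the local lock is held, which is what makes this safe from fast dispatch regardless of whether the lock is a spinlock or a mutex.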

inline void operator()(const ClientMetricPayload &payload) const {
ClientMetricType metric_type = ClientMetricPayload::METRIC_TYPE;
m_formatter->dump_string("client_metric_type", stringify(metric_type));
payload.dump(m_formatter);
Member

Shouldn't we enclose the payload.dump call in a "section"?

m_formatter->open_object_section() and m_formatter->close_section()

Contributor Author

maybe -- I'll take a closer look.
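The suggestion above can be sketched with a toy stand-in for the Formatter class (the `ToyFormatter` type and the `CAP_INFO`/`cap_hits` names here are illustrative, not Ceph's actual `common/Formatter.h` API or metric names):

```cpp
#include <sstream>
#include <string>

// Toy stand-in for a formatter, just to show the suggested nesting.
struct ToyFormatter {
  std::ostringstream out;
  void open_object_section(const std::string &name) { out << name << "{"; }
  void close_section() { out << "}"; }
  void dump_string(const std::string &k, const std::string &v) {
    out << k << "=" << v << ";";
  }
};

// With the suggested change, the payload fields end up nested in their own
// section instead of being flattened into the enclosing object.
std::string dump_metric() {
  ToyFormatter f;
  f.dump_string("client_metric_type", "CAP_INFO");  // hypothetical type name
  f.open_object_section("payload");
  f.dump_string("cap_hits", "100");                 // hypothetical field
  f.close_section();
  return f.out.str();
}
```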

}

void print(ostream &out) const override {
out << "mds_metrics";
Member

This should include a little more to help track and debug issues, at minimum which MDS it's coming from and the number of embedded client metric updates. I guess we don't have any sequence numbers we can embed in this protocol? (Hmm that makes me nervous but I guess these numbers are okay to be lossy?)

Contributor Author

ok -- makes sense to add some info about the payload.

Regarding the absence of seq numbers: this does not have to be strict in terms of message ordering, so it's OK to be lossy.

Contributor Author

btw, just FYI: metrics_message_t (which is a member variable in this message class) has a sequence number which is used to identify lagginess, not for message ordering as such.

MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Spinlocks need to be VERY clear about what data they protect and why. At a quick skim this looks like it ought be a mutex instead.

// deadlock
ceph::spinlock lock;

// ISN sent by rank0 pinger is 1
Member

ISN? What is this for given that it's not related to last_updated_seq?

Contributor Author

MDSPinger sends a ping message with an initial sequence number of 1.

// locally to figure out if a session got added and removed
// within an update to rank 0. note that, this is initialized
// with the incoming sequence number sent by rank 0.
version_t last_updated_seq = 0;
Member

I don’t see this implemented as described, looks like it only sets the per-client seq in add_session, which is called on new sessions but not when updating them?

Contributor Author

you are right -- the comment is a bit outdated. last_updated_seq is used to determine whether a message carrying session information has been sent to rank 0. If a session got added and removed before the timer task flushes session updates/metrics to rank 0, the corresponding metrics entry can be removed locally and need not be sent to rank 0 (for removal).
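The behaviour described above can be sketched roughly like this (the `SessionTracker` type and its members are illustrative names, not the real MDS code):

```cpp
#include <cstdint>
#include <map>
#include <set>

using version_t = std::uint64_t;

struct SessionTracker {
  version_t last_updated_seq = 0;     // bumped on every flush to rank 0
  std::map<int, version_t> sessions;  // client id -> seq at which it was added
  std::set<int> pending_removals;     // removals rank 0 must be told about

  void add_session(int client) { sessions[client] = last_updated_seq; }

  void remove_session(int client) {
    auto it = sessions.find(client);
    if (it == sessions.end())
      return;
    // A flush has happened since the add, so rank 0 has seen this session
    // and must be told about the removal; otherwise purge it locally only.
    if (it->second < last_updated_seq)
      pending_removals.insert(client);
    sessions.erase(it);
  }

  void flushed_to_rank0() { ++last_updated_seq; }
};
```

A session that is added and removed within one flush window never reaches `pending_removals`, which is exactly the "removed locally, not sent to rank 0" case described in the comment.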

private:
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Again, need to document what this covers and I don’t see any reason for it to be a spinlock.

std::scoped_lock locker(lock);
auto it = ping_state_by_rank.find(rank);
if (it == ping_state_by_rank.end()) {
derr << ": rank=" << rank << " was never sent ping request." << dendl;
Member

Does detecting this matter? It seems like if it does we should assert (at least when testing), and if it doesn’t we shouldn’t bother to check.

Contributor Author

This handles a corner case of an mds becoming active, then going down before rank 0 can send a ping request.

Member

That doesn't seem like it's worth logging to derr though?

Contributor Author

ack

MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Member

Another spinlock

dout(10) << dendl;
ceph_assert(ceph_mutex_is_locked_by_me(mds->mds_lock));

mds->mds_lock.unlock();
Member

Rather than do all this with the real mds_lock, why not just read out the address of mds 0 in notify_mdsmap and send the message using a non-rank-based interface? That seems a lot cleaner with the layers and makes locking simpler.

Contributor Author

IIRC, Patrick and I discussed this a while back: a message could be sent to an mds without acquiring mds_lock. I think the point where we decided to go ahead with the current implementation was related to sending the mdsmap before sending out the actual message. Thinking again on this -- does it really matter to send an (updated) mdsmap in this scenario?

Member

Ohhhh, is it trying to make sure the receiving MDS always has a current MDSMap before dealing with requests?
Hmm, that is a thing we don't want to break in the general case, but it looks like these messages already handle being delayed or misdirected. You'd have to make sure it doesn't erroneously start thinking it owns a metrics gathering server, and that if an MDS sends an update to the old mds0 it correctly sends all the same updates to the new one -- but since these are all "total" rather than "since last update" values I think it should be fine?

Contributor Author

right -- I guess it doesn't matter if we send some updates to an old active mds. This would simplify the locking bits.

@vshankar
Contributor Author

> I eventually realized when looking at the spinlocks that you probably did that for fast dispatch? Literally using a spinlock is not really what those requirements mean, though — the important part is to not block. This code may be okay in that regard but it will certainly be better with a mutex, and I can tell you that one very important cycle to avoid is doing anything in fast dispatch which can block on sending a message back out. Looks like that’s okay now because the rank0 update function drops the local lock, but be careful of it when updating.

Everything used mutexes initially, but then @batrick suggested using spinlocks and it made sense to me. I get the point about not blocking in fast dispatch. Mind explaining a bit more about why mutexes would be better here?

@gregsfortytwo
Member

> Everything used mutexes initially, but then @batrick suggested to use spinlocks and it made sense to me. I get the point about not blocking in fast dispatch. Mind explaining a bit more as in why mutexes would be better here?

The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.
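The contrast can be sketched in a few lines (a minimal illustration; `Spinlock` and `count_with_mutex` are made-up names, not Ceph code):

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// A userspace spinlock is just an atomic flag; the kernel cannot see it,
// so a thread holding it can be descheduled while waiters burn CPU spinning.
struct Spinlock {
  std::atomic_flag f = ATOMIC_FLAG_INIT;
  void lock() { while (f.test_and_set(std::memory_order_acquire)) {} }
  void unlock() { f.clear(std::memory_order_release); }
};

// A std::mutex parks contended waiters in the kernel (futex on Linux), so
// the scheduler can run something useful; uncontended cost is comparable.
long count_with_mutex(int iters) {
  std::mutex mx;
  long counter = 0;
  auto work = [&] {
    for (int i = 0; i < iters; ++i) {
      std::scoped_lock l(mx);
      ++counter;
    }
  };
  std::thread a(work), b(work);
  a.join();
  b.join();
  return counter;
}
```

Both give the same mutual exclusion; the difference is entirely in what happens to a waiter (and to a preempted holder) under contention.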

@vshankar
Contributor Author

vshankar commented May 4, 2020

> The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.

Given the changes to send metrics updates to mds0 via the non-rank interface, locking would only be required to track the updates. I think sticking to spinlocks probably makes sense now?

@gregsfortytwo
Member

> The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.

> Given that with the changes to send metrics updates to mds0 via non-rank interface, locking would only be required to track the updates. I think sticking to spinlocks probably makes sense now?

What I'm actually worried about here, aside from it not being a good habit to get into, is that I think there are some memory allocations happening underneath those map operations and that's something to avoid with spinlocks anywhere.

@vshankar
Contributor Author

vshankar commented May 5, 2020

> What I'm actually worried about here, aside from it not being a good habit to get into, is that I think there are some memory allocations happening underneath those map operations and that's something to avoid with spinlocks anywhere.

fair enough -- I'll do the changes.

@vshankar
Contributor Author

vshankar commented May 14, 2020

retest this please

@vshankar
Contributor Author

@batrick @gregsfortytwo please take a look

vshankar added 8 commits June 1, 2020 04:04
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
These types will be used to support user-defined queries
from ceph-mgr.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` class aggregates metrics from all active ranks
and places metrics appropriately as defined by user queries.

This is implemented as a separate dispatcher since metric update
messages from active MDSs are frequent; this avoids messages
getting stuck in the MDSRank queue (suggested by Patrick).

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Every MDS maintains a view of metrics that are forwarded to
it by clients. This is updated when clients forward metrics
via MClientMetrics message type. Periodically, each MDS
forwards its collected metrics to MDS rank 0 (which maintains
an aggregated view of metrics of all clients on all ranks).

Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` sets up a manager callback to forward metrics
data to ceph-mgr. Also, add interfaces for adding and removing
user queries and a simple interface to fetch MDS perf metrics.

Fixes: http://tracker.ceph.com/issues/36253
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Contributor Author

vshankar commented Jun 1, 2020

@batrick rebased and updated -- please take a look

@batrick
Member

batrick commented Jun 2, 2020

@vshankar some of the failures look new. Can you check, in particular, the causes of the evicted-client warnings?

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

> @vshankar some of the failures look new. Can you check especially the evicted client warning causes.

checking

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

For test: http://pulpito.ceph.com/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/

client.6362 did not reconnect to the MDS.

The MDS notes that the client should reconnect:

2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.server apply_blacklist: killed 0
2020-06-01T23:16:55.330+0000 7ffa98c38700  4 mds.0.73 reconnect_start: killed 0 blacklisted sessions (2 blacklist entries, 1)
2020-06-01T23:16:55.330+0000 7ffa98c38700  1 mds.0.server reconnect_clients -- 1 sessions
2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.sessionmap dump
2020-06-01T23:16:55.330+0000 7ffa98c38700 10 mds.0.sessionmap client.6362 0x5578fc197180 state open completed {11=0x100000001f7} prealloc_inos [0x10000000003~0x1f4,0x100000001f8~0x3e8] delegated_inos [] used_inos []
2020-06-01T23:16:55.330+0000 7ffa98c38700  7 mds.0.tableserver(snaptable) finish_recovery

But client.6362 never reconnects -- the MDS times out the reconnect phase and evicts the client:

2020-06-01T23:17:42.339+0000 7ffa9b43d700  5 mds.beacon.a received beacon reply up:reconnect seq 14 rtt 0
2020-06-01T23:17:43.333+0000 7ffa94c30700 10 mds.0.cache cache not ready for trimming
2020-06-01T23:17:43.333+0000 7ffa94c30700 20 mds.0.cache upkeep thread waiting interval 0.999070644s
2020-06-01T23:17:43.880+0000 7ffa97435700  7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2020-06-01T23:17:43.880+0000 7ffa97435700  1 mds.0.server reconnect gives up on client.6362 172.21.15.205:0/1374455594
2020-06-01T23:17:43.880+0000 7ffa97435700  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi205:0 (6362), after waiting 48.5506 seconds during MDS startup
2020-06-01T23:17:43.880+0000 7ffa97435700  1 mds.0.73 Evicting (and blacklisting) client session 6362 (172.21.15.205:0/1374455594)
2020-06-01T23:17:43.880+0000 7ffa97435700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 6362 (172.21.15.205:0/1374455594) 

Doesn't look like this PR is causing client evictions.

@lxbsz
Member

lxbsz commented Jun 2, 2020

> For test: http://pulpito.ceph.com/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/
> [...]
> Doesn't look like this PR is causing client evictions.

This is a known issue, which has been fixed in https://tracker.ceph.com/issues/45665.

More detail logs please see http://qa-proxy.ceph.com/teuthology/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/remote/smithi205/log/ceph-client.0.65436.log.gz.

@vshankar
Contributor Author

vshankar commented Jun 2, 2020

> This is a known issue, which has been fixed in https://tracker.ceph.com/issues/45665.

thx @lxbsz

@batrick
Member

batrick commented Jun 2, 2020

https://tracker.ceph.com/issues/45829

otherwise looks normal.


Labels

cephfs Ceph File System
