mds: record and forward client side metrics to manager #26004
batrick merged 8 commits into ceph:master
Conversation
@vshankar please rebase this when you get a chance.
yep, on it w/ the changes to forwarding metrics to the manager.
Still keeping this as WIP -- I should probably include a detailed implementation writeup here (or in the commit message). I'll do that on priority -- but this should be ok for a round of review if someone wants to delve into the details right now!
I'll get to this next week. |
@batrick -- rebased + updated. please take a look.
@vshankar what's the status of this PR? Is it blocking on my review?
Okay, I went through the big parts of the MDS data structures and locking and have some comments and questions.
I eventually realized when looking at the spinlocks that you probably did that for fast dispatch? Literally using a spinlock is not really what those requirements mean, though — the important part is to not block. This code may be okay in that regard but it will certainly be better with a mutex, and I can tell you that one very important cycle to avoid is doing anything in fast dispatch which can block on sending a message back out. Looks like that’s okay now because the rank0 update function drops the local lock, but be careful of it when updating.
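To illustrate the pattern the review is asking for (snapshot state under the local lock, drop the lock, then send), here is a minimal self-contained sketch. All names are illustrative assumptions, not the actual PR code, and std::mutex stands in for ceph::mutex.

```cpp
#include <mutex>
#include <map>
#include <cstdint>

// Hypothetical sketch: in fast dispatch, take the local lock only to
// snapshot state, then drop it before doing anything that could block
// (such as sending a message back out).
struct UpdateSketch {
  std::mutex lock;                       // protects metrics below
  std::map<int, uint64_t> metrics;       // guarded by lock

  std::map<int, uint64_t> snapshot_for_send() {
    std::map<int, uint64_t> copy;
    {
      std::scoped_lock l(lock);          // short critical section
      copy = metrics;
    }                                    // lock dropped here...
    return copy;                         // ...before any send happens
  }
};
```

The key point from the comment: the send path must never run while the local lock is held, which is why the snapshot is returned by copy.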
inline void operator()(const ClientMetricPayload &payload) const {
  ClientMetricType metric_type = ClientMetricPayload::METRIC_TYPE;
  m_formatter->dump_string("client_metric_type", stringify(metric_type));
  payload.dump(m_formatter);
Shouldn't we enclose the payload.dump call in a "section"?
m_formatter->open_object_section() and m_formatter->close_section()
maybe -- I'll take a closer look.
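For reference, the suggestion amounts to nesting the payload's keys inside their own object section. The sketch below uses a toy stand-in for Ceph's Formatter (the real ceph::Formatter API differs) just to show the before/after shape of the output.

```cpp
#include <sstream>
#include <string>

// Minimal JSON-ish stand-in for a formatter, to illustrate wrapping
// payload.dump() in its own object section.
struct ToyFormatter {
  std::ostringstream out;
  void open_object_section(const std::string &name) {
    out << '"' << name << "\":{";
  }
  void close_section() { out << '}'; }
  void dump_string(const std::string &k, const std::string &v) {
    out << '"' << k << "\":\"" << v << "\",";
  }
};

// the dump visitor would then look roughly like this (names assumed):
inline void dump_payload(ToyFormatter *f) {
  f->dump_string("client_metric_type", "READ_LATENCY");
  f->open_object_section("payload");   // <-- the suggested section
  f->dump_string("latency", "3ms");    // payload.dump(f) equivalent
  f->close_section();
}
```

Without the section, the payload's keys would be emitted at the same level as "client_metric_type", which is harder to parse.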
src/messages/MMDSMetrics.h
Outdated
}

void print(ostream &out) const override {
  out << "mds_metrics";
This should include a little more to help track and debug issues, at minimum which MDS it's coming from and the number of embedded client metric updates. I guess we don't have any sequence numbers we can embed in this protocol? (Hmm that makes me nervous but I guess these numbers are okay to be lossy?)
ok -- makes sense to add some info about the payload.
Regarding the absence of seq numbers: this does not have to be strict in terms of message ordering, so it's ok to be lossy.
btw, just fyi, metrics_message_t (which is a member variable in this message class) has a sequence number which is used to identify lagginess, not for ordering as such.
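A minimal sketch of that lagginess-detection idea: the sequence number is compared against the last acknowledged one, and only a large gap matters. Names and the threshold here are assumptions, not the actual metrics_message_t fields.

```cpp
#include <cstdint>

// Sketch: a sequence number used to detect lagginess rather than to
// enforce ordering. Dropped or reordered messages are tolerated.
struct LagSketch {
  uint64_t last_acked_seq = 0;   // last seq the peer acknowledged
  uint64_t current_seq = 0;      // seq of the latest update sent

  // a peer is "laggy" only if it has fallen too many updates behind
  bool is_laggy(uint64_t threshold) const {
    return current_seq - last_acked_seq > threshold;
  }
};
```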
src/mds/MetricsHandler.h
Outdated
MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Spinlocks need to be VERY clear about what data they protect and why. At a quick skim this looks like it ought be a mutex instead.
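A sketch of what that documentation style could look like, with a mutex in place of the spinlock. std::mutex stands in for ceph::mutex and the member names are illustrative assumptions.

```cpp
#include <mutex>
#include <map>
#include <cstdint>

// Sketch of the review suggestion: use a mutex and state exactly which
// members it guards.
class MetricsHandlerSketch {
  // lock guards client_metrics_map and last_updated_seq (and nothing
  // else); drop it before calling anything that can send a message.
  std::mutex lock;
  std::map<int64_t, uint64_t> client_metrics_map; // guarded by lock
  uint64_t last_updated_seq = 0;                  // guarded by lock

public:
  void update(int64_t client_id, uint64_t value, uint64_t seq) {
    std::scoped_lock l(lock);
    client_metrics_map[client_id] = value;
    last_updated_seq = seq;
  }
  uint64_t seq() {
    std::scoped_lock l(lock);
    return last_updated_seq;
  }
};
```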
// deadlock
ceph::spinlock lock;

// ISN sent by rank0 pinger is 1
ISN? What is this for given that it's not related to last_updated_seq?
MDSPinger sends a ping message with an initial sequence number (ISN) of 1.
// locally to figure out if a session got added and removed
// within an update to rank 0. note that, this is initialized
// with the incoming sequence number sent by rank 0.
version_t last_updated_seq = 0;
I don’t see this implemented as described, looks like it only sets the per-client seq in add_session, which is called on new sessions but not when updating them?
you are right -- the comment is a bit outdated. last_updated_seq is used to determine if a message carrying session information was sent to rank 0. If a session got added and removed before a timer task refreshes session updates/metrics to rank 0, the corresponding entry for metrics can be removed locally and need not be sent to rank 0 (for removal).
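A self-contained sketch of that bookkeeping, under assumed names (this is not the PR's actual code): a session whose entry was never flushed to rank 0 can be dropped purely locally.

```cpp
#include <map>
#include <cstdint>

// Sketch: track the seq at which each session was added; if a session
// is removed before the next flush to rank 0, the removal stays local.
struct SessionTracker {
  uint64_t flushed_seq = 0;                 // last seq flushed to rank 0
  uint64_t current_seq = 1;                 // seq of the next flush
  std::map<int64_t, uint64_t> session_seq;  // client -> seq when added

  void add_session(int64_t client) { session_seq[client] = current_seq; }

  // returns true if the removal must also be forwarded to rank 0
  bool remove_session(int64_t client) {
    auto it = session_seq.find(client);
    if (it == session_seq.end())
      return false;
    bool was_flushed = (it->second <= flushed_seq);
    session_seq.erase(it);
    return was_flushed;  // never flushed -> purely local removal
  }

  void flush() { flushed_seq = current_seq++; }
};
```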
src/mds/MetricAggregator.h
Outdated
private:
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
Again, need to document what this covers and I don’t see any reason for it to be a spinlock.
src/mds/MDSPinger.cc
Outdated
std::scoped_lock locker(lock);
auto it = ping_state_by_rank.find(rank);
if (it == ping_state_by_rank.end()) {
  derr << ": rank=" << rank << " was never sent ping request." << dendl;
Does detecting this matter? It seems like if it does we should assert (at least when testing), and if it doesn’t we shouldn’t bother to check.
This handles a corner case of an mds becoming active, then going down before rank 0 can send a ping request.
That doesn't seem like it's worth logging to derr though?
src/mds/MDSPinger.h
Outdated
MDSRank *mds;
// drop this lock when calling ->send_message_mds() else mds might
// deadlock
ceph::spinlock lock;
src/mds/MetricAggregator.cc
Outdated
dout(10) << dendl;
ceph_assert(ceph_mutex_is_locked_by_me(mds->mds_lock));

mds->mds_lock.unlock();
Rather than do all this with the real mds_lock, why not just read out the address of mds 0 in notify_mdsmap and send the message using a non-rank-based interface? That seems a lot cleaner with the layers and makes locking simpler.
IIRC, Patrick and I discussed this a while back -- a message could be sent to an mds without acquiring mds_lock. I think the point where we decided to go ahead with the current implementation was related to sending the mdsmap before sending out the actual message. Thinking again on this -- does it really matter to send an (updated) mdsmap in this scenario?
Ohhhh is it trying to make sure the receiving MDS always has a current MDSMap before dealing with requests?
Hmm, that is a thing we don't want to break in the general case, but it looks like these messages already handle being delayed or misdirected. You'd have to make sure it doesn't erroneously start thinking it owns a metrics gathering server, and that if an MDS sends an update to the old mds0 it correctly sends all the same updates to the new one. But since these are all "total" rather than "since last update" values, I think it should be fine?
right -- I guess it doesn't matter if we send some updates to an old active mds. this would simplify the locking bits.
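A rough sketch of the suggestion being agreed to here. All names are assumptions, not the real Ceph messenger interfaces: cache the rank-0 connection when the MDSMap changes, then send metric updates on it directly without taking mds_lock at send time.

```cpp
#include <memory>
#include <cstdint>

// Toy stand-in for a messenger connection.
struct Connection {
  uint64_t messages_sent = 0;
  void send(/* message */) { ++messages_sent; }
};

struct AggregatorSketch {
  std::shared_ptr<Connection> rank0_conn;  // refreshed by notify_mdsmap

  void notify_mdsmap(std::shared_ptr<Connection> new_rank0) {
    rank0_conn = new_rank0;  // called while mds_lock is already held
  }
  void send_metrics() {
    // no mds_lock needed here; a stale rank-0 target is tolerable
    // because the metrics are totals, so a new rank 0 receives full
    // state on the next update anyway.
    if (rank0_conn)
      rank0_conn->send();
  }
};
```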
Everything used mutexes initially, but then @batrick suggested to use spinlocks and it made sense to me. I get the point about not blocking in fast dispatch. Mind explaining a bit more about why mutexes would be better here?
The kernel knows about mutexes and that informs scheduling decisions; spinlocks are invisible so if you're unlucky you might get descheduled while holding one. Thanks to futex a mutex is also generally going to be about the same speed as a spinlock is; really the only time I know it's appropriate to use a userspace spinlock is when you need to change a few variables atomically.
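For the "change a few variables atomically" case mentioned above, a std::atomic often suffices and avoids a lock entirely. A minimal sketch (not from the PR):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: when the critical section is just bumping a counter, an atomic
// is fine; anything that can allocate or block belongs under a mutex.
struct Counter {
  std::atomic<uint64_t> updates{0};

  void record() {
    // no allocation, no blocking: safe without a mutex
    updates.fetch_add(1, std::memory_order_relaxed);
  }
  uint64_t count() const {
    return updates.load(std::memory_order_relaxed);
  }
};
```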
With the change to send metric updates to mds0 via the non-rank interface, locking would only be required to track the updates. I think sticking to spinlocks probably makes sense now?
What I'm actually worried about here, aside from it not being a good habit to get into, is that I think there are some memory allocations happening underneath those map operations and that's something to avoid with spinlocks anywhere.
fair enough -- I'll do the changes.
retest this please
@batrick @gregsfortytwo please take a look
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
These types will be used for supporting user-defined querying from ceph-mgr. Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` class aggregates metrics from all active ranks and places metrics appropriately as defined by user queries. This is implemented as a separate dispatcher since metric update messages from active MDSs are frequent; this avoids messages getting stuck in the MDSRank queue (suggested by Patrick). Signed-off-by: Venky Shankar <vshankar@redhat.com>
Every MDS maintains a view of metrics that are forwarded to it by clients. This is updated when clients forward metrics via MClientMetrics message type. Periodically, each MDS forwards its collected metrics to MDS rank 0 (which maintains an aggregated view of metrics of all clients on all ranks). Signed-off-by: Venky Shankar <vshankar@redhat.com>
`MetricAggregator` sets up manager callback to forward metrics data to ceph-mgr. Also, add querying interfaces for adding and removing user queries and a simple interface to fetch MDS perf metrics. Fixes: http://tracker.ceph.com/issues/36253 Signed-off-by: Venky Shankar <vshankar@redhat.com>
@batrick rebased and updated -- please take a look
@vshankar some of the failures look new. Can you check the causes of the evicted client warnings in particular?
checking
MDS notes that clients would reconnect, but it doesn't look like this PR is causing client evictions.
This is a known issue, which has been fixed in https://tracker.ceph.com/issues/45665. For more detailed logs, please see http://qa-proxy.ceph.com/teuthology/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110407/remote/smithi205/log/ceph-client.0.65436.log.gz.
thx @lxbsz
https://tracker.ceph.com/issues/45829 otherwise looks normal.
This is a WIP and has basic messaging boilerplate for an MDS to forward metrics (client metrics, etc.) to rank 0. Follow-up PRs would stitch all this together: clients periodically send their metrics to their session MDS, which forwards them to rank 0 (which would aggregate all session metadata and update the mgr).