mon: add connection scoring and a score-based leader election system #32336
gregsfortytwo wants to merge 65 commits into ceph:master
Conversation
jenkins test make check

Signed-off-by check is for the test TODOs commit, will fix in rebase.
mon->store->get(Monitor::MONITOR_NAME, "connectivity_scores", bl);
if (bl.length()) {
  bufferlist::const_iterator bi = bl.begin();
  peer_tracker.decode(bi);
Does anything bad happen if we persist really bad scores for a peer mon, go down for a long time, and then come back up with (old) bad scores when that peer has since been well behaved? Does it still converge quickly?
should these be marked dead on startup, just as if they had not acked us in time? it seems inconsistent to consider out of date scores, when we disregard them if we didn't restart.
If we have bad scores for a mon it might change the leader, but if our bad score is the margin of victory I don't think there's any way it will matter (we have N alive monitors out of M total; the maximum score is N and a local score only contributes 0-1). The local scores will converge to the right one anyway.
Option("mon_election_default_strategy", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(1)
.set_min_max(1, 3)
maybe make this an enum instead?
Is there a way to do that in Option? It's enums in the rest of the codebase.
see for example osd_pool_default_type. the underlying type is a string, but the values are constrained.
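For reference, the constrained-string pattern being suggested looks roughly like this (a sketch from memory of Ceph's options.cc; the exact call names may differ):

```cpp
Option("osd_pool_default_type", Option::TYPE_STR, Option::LEVEL_ADVANCED)
.set_default("replicated")
.set_enum_allowed({"replicated", "erasure"})
.set_description("default type of pool to create")
```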
Hmm that's cool but I'd have to add more string->int parsing code and I really don't anticipate anybody except a Ceph dev setting this option.
maybe these should all be dev options?
* 1 for CLASSIC
* 2 for DISALLOW
* 3 for CONNECTIVITY
lowercase would match other user-facing options better, e.g. balancer mode is upmap or weight-set
using the same commands as above in DISALLOW.
Examining connectivity scores
=============================
------ so it's a subheading under CONNECTIVITY
Actually don't want to do that since we provide scores even if you're not in the CONNECTIVITY mode.
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the best score. This mode is designed
to handle netsplits, which may happen if your cluster is stretched across
s/netsplits/asymmetrical netsplits/?
I don't think that's accurate. If the live connections are A-B, A-C, B-C (all bidirectional) you will still permanently cycle as both A and B send propose messages to C and it spawns an election to try and bring the out-of-quorum monitors in.
I've actually seen this happen once in the wild, although I think in that case it was the network card or a bug somewhere inside the server rather than in the network infrastructure itself.
qa/config/rados.yaml

ceph:
  conf:
    global:
      mon election default strategy: 3
we probably want a new facet to test all modes?
liewegas left a comment
This looks great!
- should add the initial election mode to the cephadm.conf file too
- do you plan to merge this before adding the netsplit tests?
jdurgin left a comment
The approach sounds good from the docs, will take a closer look at the implementation this week. Left a few comments on the ui/docs.
tchaikov left a comment
i am still reviewing this PR.
jdurgin left a comment
A few minor comments, my only major confusion is the score - I feel like I'm missing something about how this plays into the monitor state machine. Are rank changes/removals more common than I'm thinking, so the scores don't stick around long?
peer_tracker.report_dead_connection(peer, now - acked_ping);
acked_ping = now;
live_pinging.erase(peer);
return;
it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.
The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.
> it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.

Oh dear, I busted this! Good catch. The intent was to mark the connection dead every 2 seconds with that timer event a bit lower. o_0

> The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.

That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

Does the scoring make sense if I fix that dead-not-being-counted bug?
> The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.
>
> That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

I see, I misread that.

> Does the scoring make sense if I fix that dead-not-being-counted bug?

Yes, it makes sense to me now. Thanks for clearing that up!
This was radically broken and apparently doesn't get tested except by mon/misc.sh! Yikes! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously not handling this correctly for the movement of ranks, but now we go to a bit more effort to get it right. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I have absolutely no idea why it's counting features, but apparently it is and bumping the value to 7 makes it pass. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We removed StringIO for py3 compatibility going forward and this collided. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

ping @gregsfortytwo
…nElection We added score sharing in "mon: elector: spread current election scores in election messages" but neglected to actually parse them. This mostly isn't an issue because we also share scores in ping messages and restart election on score conflicts, but it's obviously better to use the data we have! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We saw a crash when we incorrectly assumed an MMonElection included a scoring_bl, apparently because there was a disagreement on strategy. This probably means the proposer was on an old MonMap (which we actually DO NOT filter on when receiving proposals, oddly enough...), so now we drop the message if there's a conflict. This will sort itself out quickly enough! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This can happen on startup when monitors are trying to guess their ranks because not all IPs are known yet, and will get sorted out as they finish probing each other. Maybe it might also happen if a monitor is down and doesn't witness removal of a peer from the monmap, but it should also get updated as it probes so having the clashing rank ignore it in elections shouldn't be a problem. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
7f8141c to 32e4fde
anything else left on this before running through qa (or do you already have a run?)
@gregsfortytwo pls rebase

Merged as a part of #35906, see #35906 (comment)
This PR moves us much closer to properly supporting stretched Ceph clusters which
will safely survive under netsplit conditions. It:
and -reliable monitor to be leader)
Reviewers, this is probably readable either all in one go, or by patch, EXCEPT that I rewrote the ConnectionTracker implementation and left it in the history. Developer documentation added near the end of the history may be a useful aid to understanding.
I also don't want to merge the last few test patches but have left them in so you can see what's been done.
This passed a monthrash test run with the CONNECTIVITY mode: http://pulpito.ceph.com/gregf-2019-12-18_14:37:57-rados:monthrash-wip-elector-1218-distro-basic-smithi/
Both CONNECTIVITY and CLASSIC modes passed the simple netsplit test, but checking the teuthology log you can see the CLASSIC mode generally took 6 seconds (i.e., an election cycle) to run every command during the netsplit, whereas CONNECTIVITY ran at normal speed. http://pulpito.ceph.com/gregf-2019-12-18_12:00:57-netsplit-wip-elector-1217-2-distro-basic-smithi/
I am now working on a (smaller) follow-on PR that will place monitors into the CRUSH map, and force any PG peering to have replicas across the appropriate CRUSH bucket (basically, different datacenters), unless the cluster decides it has been netsplit (in which case the in-quorum monitor group and its associated OSDs wins). That will include more extensive netsplit testing, since normal RADOS ops will be able to go through.
Todo: