mon: add connection scoring and a score-based leader election system #32336
gregsfortytwo wants to merge 65 commits into ceph:master
Conversation
jenkins test make check

Signed-off-by check is for the test TODOs commit, will fix in rebase.
mon->store->get(Monitor::MONITOR_NAME, "connectivity_scores", bl);
if (bl.length()) {
  bufferlist::const_iterator bi = bl.begin();
  peer_tracker.decode(bi);
Does anything bad happen if we persist really bad scores for a peer mon, go down for a long time, and then come back up with (old) bad scores when that peer has since been well behaved? Does it still converge quickly?
should these be marked dead on startup, just as if they had not acked us in time? it seems inconsistent to consider out of date scores, when we disregard them if we didn't restart.
If we have bad scores for a mon it might change the leader, but if our bad score is the margin of victory I don't think there's any way it will matter (we have N alive monitors out of M total; the maximum score is N and a local score only contributes 0-1). The local scores will converge to the right one anyway.
Option("mon_election_default_strategy", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(1)
.set_min_max(1, 3)
maybe make this an enum instead?
Is there a way to do that in Option? It's enums in the rest of the codebase.
see for example osd_pool_default_type. the underlying type is a string, but the values are constrained.
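For reference, the constrained-string pattern being suggested looks roughly like this (a sketch from memory of Ceph's options.cc; the exact call names may differ):

```cpp
Option("osd_pool_default_type", Option::TYPE_STR, Option::LEVEL_ADVANCED)
.set_default("replicated")
.set_enum_allowed({"replicated", "erasure"})
.set_description("default type of pool to create")
```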
Hmm that's cool but I'd have to add more string->int parsing code and I really don't anticipate anybody except a Ceph dev setting this option.
maybe these should all be dev options?
* 1 for CLASSIC
* 2 for DISALLOW
* 3 for CONNECTIVITY
lowercase would match other user-facing options better, e.g. balancer mode is upmap or weight-set
using the same commands as above in DISALLOW.
Examining connectivity scores
=============================
------ so it's a subheading under CONNECTIVITY
Actually don't want to do that since we provide scores even if you're not in the CONNECTIVITY mode.
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the best score. This mode is designed
to handle netsplits, which may happen if your cluster is stretched across
s/netsplits/asymmetrical netsplits/?
I don't think that's accurate. If the live connections are A-B, A-C, B-C (all bidirectional) you will still permanently cycle as both A and B send propose messages to C and it spawns an election to try and bring the out-of-quorum monitors in.
I've actually seen this happen once in the wild, although I think in that case it was the network card or a bug somewhere inside the server rather than in the network infrastructure itself.
qa/config/rados.yaml

ceph:
  conf:
    global:
      mon election default strategy: 3
we probably want a new facet to test all modes?
liewegas left a comment
This looks great!
- should add the initial election mode to the cephadm.conf file too
- do you plan to merge this before adding the netsplit tests?
jdurgin left a comment
The approach sounds good from the docs, will take a closer look at the implementation this week. Left a few comments on the ui/docs.
tchaikov left a comment
i am still reviewing this PR.
jdurgin left a comment
A few minor comments, my only major confusion is the score - I feel like I'm missing something about how this plays into the monitor state machine. Are rank changes/removals more common than I'm thinking, so the scores don't stick around long?
peer_tracker.report_dead_connection(peer, now - acked_ping);
acked_ping = now;
live_pinging.erase(peer);
return;
it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.
The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.
> it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.

Oh dear, I busted this! Good catch. The intent was to mark the connection dead every 2 seconds with that timer event a bit lower. o_0

> The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.

That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

Does the scoring make sense if I fix that dead-not-being-counted bug?
> The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.
>
> That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

I see, I misread that.

> Does the scoring make sense if I fix that dead-not-being-counted bug?

Yes, it makes sense to me now. Thanks for clearing that up!
This was radically broken and apparently doesn't get tested except by mon/misc.sh! Yikes! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously not handling this correctly for the movement of ranks, but now we go to a bit more effort to get it right. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I have absolutely no idea why it's counting features, but apparently it is and bumping the value to 7 makes it pass. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We removed StringIO for py3 compatibility going forward and this collided. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

ping @gregsfortytwo
…nElection We added score sharing in "mon: elector: spread current election scores in election messages" but neglected to actually parse them. This mostly isn't an issue because we also share scores in ping messages and restart election on score conflicts, but it's obviously better to use the data we have! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We saw a crash when we incorrectly assumed an MMonElection included a scoring_bl, apparently because there was a disagreement on strategy. This probably means the proposer was on an old MonMap (which we actually DO NOT filter on when receiving proposals, oddly enough...), so now we drop the message if there's a conflict. This will sort itself out quickly enough! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This can happen on startup when monitors are trying to guess their ranks because not all IPs are known yet, and will get sorted out as they finish probing each other. Maybe it might also happen if a monitor is down and doesn't witness removal of a peer from the monmap, but it should also get updated as it probes so having the clashing rank ignore it in elections shouldn't be a problem. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
7f8141c to 32e4fde
anything else left on this before running through qa (or do you already have a run?)
@gregsfortytwo pls rebase

Merged as a part of #35906, see #35906 (comment)
This PR moves us much closer to properly supporting stretched Ceph clusters which
will safely survive under netsplit conditions. It:
and -reliable monitor to be leader)
Reviewers, this is probably readable either all in one go, or by patch, EXCEPT that I rewrote the ConnectionTracker implementation and left it in the history. Developer documentation added near the end of the history may be a useful aid to understanding.
I also don't want to merge the last few test patches but have left them in so you can see what's been done.
This passed a monthrash test run with the CONNECTIVITY mode: http://pulpito.ceph.com/gregf-2019-12-18_14:37:57-rados:monthrash-wip-elector-1218-distro-basic-smithi/
Both CONNECTIVITY and CLASSIC modes passed the simple netsplit test, but checking the teuthology log you can see the CLASSIC mode generally took 6 seconds (i.e., an election cycle) to run every command during the netsplit, whereas CONNECTIVITY ran at normal speed. http://pulpito.ceph.com/gregf-2019-12-18_12:00:57-netsplit-wip-elector-1217-2-distro-basic-smithi/
I am now working on a (smaller) follow-on PR that will place monitors into the CRUSH map, and force any PG peering to have replicas across the appropriate CRUSH bucket (basically, different datacenters), unless the cluster decides it has been netsplit (in which case the in-quorum monitor group and its associated OSDs wins). That will include more extensive netsplit testing, since normal RADOS ops will be able to go through.
Todo: