
mon: add connection scoring and a score-based leader election system#32336

Closed
gregsfortytwo wants to merge 65 commits intoceph:masterfrom
gregsfortytwo:wip-elector

Conversation

@gregsfortytwo
Member

This PR moves us much closer to properly supporting stretched Ceph clusters that
will safely survive netsplit conditions. It:

  • adds a ConnectionTracker and MMonPing to the monitor (via the Elector module)
  • adds a new CONNECTIVITY election mode (which elects the most-connected
    and most-reliable monitor as leader)
  • adds a much simpler, bare-bones intermediate mode called DISALLOW
  • extends mon_unittest_election to test these modes
  • adds a set of user commands to configure and query these modes
  • adds some infrastructure in the MonMap to support it

Reviewers, this is probably readable either all in one go, or by patch, EXCEPT that I rewrote the ConnectionTracker implementation and left it in the history. Developer documentation added near the end of the history may be a useful aid to understanding.

I also don't want to merge the last few test patches but have left them in so you can see what's been done.

This passed a monthrash test run with the CONNECTIVITY mode: http://pulpito.ceph.com/gregf-2019-12-18_14:37:57-rados:monthrash-wip-elector-1218-distro-basic-smithi/

Both CONNECTIVITY and CLASSIC modes passed the simple netsplit test, but the teuthology logs show that the CLASSIC mode generally took 6 seconds (i.e., an election cycle) to run every command during the netsplit, whereas CONNECTIVITY ran at normal speed. http://pulpito.ceph.com/gregf-2019-12-18_12:00:57-netsplit-wip-elector-1217-2-distro-basic-smithi/

I am now working on a (smaller) follow-on PR that will place monitors into the CRUSH map, and force any PG peering to have replicas across the appropriate CRUSH bucket (basically, different datacenters), unless the cluster decides it has been netsplit (in which case the in-quorum monitor group and its associated OSDs win). That will include more extensive netsplit testing, since normal RADOS ops will be able to go through.

Todo:

  • clean up the QA bits at the end of the series once reviewers indicate they are happy
  • add a yaml frag to monthrash suite to test both CLASSIC and CONNECTIVITY modes

@gregsfortytwo
Member Author

@liewegas, you signed off on the previous PR and probably want to look at this.

@jdurgin, you asked me to tag you on a previous draft of this, so here's the new version.

@tchaikov and @jecluis, you may be interested in this monitor work.

@tchaikov
Contributor

jenkins test make check

@gregsfortytwo
Member Author

The Signed-off-by check failure is for the test TODOs commit; I will fix it in a rebase.
The make check failure looks to be a seastar issue, and I don't see any evidence my changes are involved; I can rebase, but I was leaving the branch alone so the sha1s match testing, to start with.
The docs build has warnings but no errors; I have local patches to fix them.

mon->store->get(Monitor::MONITOR_NAME, "connectivity_scores", bl);
if (bl.length()) {
  bufferlist::const_iterator bi = bl.begin();
  peer_tracker.decode(bi);
Member

Does anything bad happen if we persist really bad scores for a peer mon, go down for a long time, and then come back up with (old) bad scores when that peer has since been well behaved? Does it still converge quickly?

Member

should these be marked dead on startup, just as if they had not acked us in time? it seems inconsistent to consider out of date scores, when we disregard them if we didn't restart.

Member Author

If we have bad scores for a mon it might change the leader, but if our bad score is the margin of victory I don't think there's any way it will matter (we have N alive monitors out of M total; the maximum score is N and a local score only contributes 0-1). The local scores will converge to the right one anyway.


Option("mon_election_default_strategy", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(1)
.set_min_max(1, 3)
Member

maybe make this an enum instead?

Member Author

Is there a way to do that in Option? It's enums in the rest of the codebase.

Member

see for example osd_pool_default_type. the underlying type is a string, but the values are constrained.

Member Author

Hmm that's cool but I'd have to add more string->int parsing code and I really don't anticipate anybody except a Ceph dev setting this option.

Member

maybe these should all be dev options?


* 1 for CLASSIC
* 2 for DISALLOW
* 3 for CONNECTIVITY
Member

enum would be nicer here

Member

lowercase would match other user-facing options better, e.g. balancer mode is upmap or weight-set

Member Author

lowercase done

using the same commands as above in DISALLOW.

Examining connectivity scores
=============================
Member

------ so it's a subheading under CONNECTIVITY

Member Author

Actually don't want to do that since we provide scores even if you're not in the CONNECTIVITY mode.

=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the best score. This mode is designed
to handle netsplits, which may happen if your cluster is stretched across
Member

s/netsplits/asymmetrical netsplits/?

Member Author

I don't think that's accurate. If the live connections are A-B, A-C, B-C (all bidirectional) you will still permanently cycle, as both A and B send propose messages to C and it spawns an election to try to bring the out-of-quorum monitors in.

I've actually seen this happen once in the wild, although I think in that case it was the network card or a bug somewhere inside the server rather than in the network infrastructure itself.

ceph:
  conf:
    global:
      mon election default strategy: 3
Member

we probably want a new facet to test all modes?

Member

@liewegas left a comment

This looks great!

  • should add the initial election mode to the cephadm.conf file too
  • do you plan to merge this before adding the netsplit tests?

Member

@jdurgin left a comment

The approach sounds good from the docs, will take a closer look at the implementation this week. Left a few comments on the ui/docs.



Contributor

@tchaikov left a comment

i am still reviewing this PR.

Member

@jdurgin left a comment

A few minor comments, my only major confusion is the score - I feel like I'm missing something about how this plays into the monitor state machine. Are rank changes/removals more common than I'm thinking, so the scores don't stick around long?

peer_tracker.report_dead_connection(peer, now - acked_ping);
acked_ping = now;
live_pinging.erase(peer);
return;
Member

it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.

The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.

Member Author

it seems odd to stop pinging as soon as the peer is dead - it seems like this would cause the score to only decrease a small amount a single time when the timeout is reached. It's treated as 0 for the purposes of get_total_connection_scores(), but if the rank comes back the score will increase again.

Oh dear, I busted this! Good catch. The intent was to mark the connection dead every 2 seconds with that timer event a bit lower. o_0

The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.

That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

Does the scoring make sense if I fix that dead-not-being-counted bug?

Member

The first ping that's acked after it comes back will have a disproportionate effect on the score as well, increasing it very quickly since the prior acked ping time could be very old.

That shouldn't happen; when we begin_peer_ping() we set the peer_acked_ping to ceph_clock_now() so it will only add up to 2 seconds of liveness to the score. (And if the peer sends a PING_REPLY to an outdated ping, we invoke begin_peer_ping() to reset things.)

I see, I misread that.

Does the scoring make sense if I fix that dead-not-being-counted bug?

Yes, it makes sense to me now. Thanks for clearing that up!

Member Author

...and now done.

This was radically broken and apparently doesn't get tested except
by mon/misc.sh! Yikes!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously not handling this correctly for the movement of ranks,
but now we go to a bit more effort to get it right.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I have absolutely no idea why it's counting features, but
apparently it is and bumping the value to 7 makes it pass.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We removed StringIO for py3 compatibility going forward and this collided.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@stale

stale bot commented Jun 14, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jun 14, 2020
@neha-ojha
Member

ping @gregsfortytwo

@stale stale bot removed the stale label Jun 16, 2020
…nElection

We added score sharing in
"mon: elector: spread current election scores in election messages"
but neglected to actually parse them. This mostly isn't an issue because
we also share scores in ping messages and restart election on score conflicts,
but it's obviously better to use the data we have!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We saw a crash when we incorrectly assumed an MMonElection included
a scoring_bl, apparently because there was a disagreement on strategy.
This probably means the proposer was on an old MonMap (which we actually
DO NOT filter on when receiving proposals, oddly enough...), so now
we drop the message if there's a conflict. This will sort itself out
quickly enough!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This can happen on startup when monitors are trying to guess
their ranks because not all IPs are known yet, and will get
sorted out as they finish probing each other.
It might also happen if a monitor is down and doesn't witness the
removal of a peer from the monmap, but it should get updated as it
probes, so having the clashing rank ignore it in elections shouldn't
be a problem.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo gregsfortytwo requested a review from a team July 1, 2020 07:01
@gregsfortytwo gregsfortytwo requested a review from a team as a code owner July 1, 2020 07:01
Member

@jdurgin left a comment

latest bug fixes look good

@jdurgin
Member

jdurgin commented Jul 2, 2020

anything else left on this before running through qa (or do you already have a run?)

@stale

stale bot commented Sep 7, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Sep 7, 2020
@travisn travisn removed the stale label Sep 11, 2020
@yuriw
Contributor

yuriw commented Oct 23, 2020

@gregsfortytwo pls rebase

@neha-ojha
Member

Merged as a part of #35906, see #35906 (comment)

@neha-ojha neha-ojha closed this Oct 23, 2020
@gregsfortytwo gregsfortytwo deleted the wip-elector branch February 21, 2022 17:08