mon: MMonProbe: direct MMonJoin messages to the leader, instead of the first mon #40839

Merged
gregsfortytwo merged 1 commit into ceph:master from gregsfortytwo:wip-mon-quorum-leader on Apr 14, 2021
Conversation

@gregsfortytwo
Member

mon: MMonProbe: direct MMonJoin messages to the leader, instead of the first mon

When monitors are joining a cluster, they may send an MMonJoin message to place
themselves correctly in the map in either handle_probe_reply() or
finish_election(). These messages must be sent to the leader -- monitors do not
forward each other's messages.

Unfortunately, this scenario was missed when converting the monitors to support
connectivity-based elections, and they're sending these messages to
quorum.begin(). Fix this by including an explicit leader in MMonProbe (that the
new monitor may reference in handle_probe_reply) and using the leader
value in both locations.

Fixes: https://tracker.ceph.com/issues/50345

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@gregsfortytwo
Member Author

Reproduced the issue locally and will schedule suites as soon as packages are built.

@gregsfortytwo
Member Author

Test results are pretty messy, but nothing that looks related to this PR.

https://pulpito.ceph.com/gregf-2021-04-14_03:14:48-rados-wip-mon-quorum-leader-413-distro-basic-smithi/
In-progress run; 443 passed
13 failed
6044468: cls_cas dup_get test; there's a ticket for it
6044475: issues with cephadm and then selinux denials
6044484: looks like it tried to issue a "ceph tell" to a monitor which was electing and then the watchdog killed the test
6044592: rgw issue
6044655: test_cls_cas
6044710: issues with cephadm and then selinux denials
6044747: known OSD valgrind leak
6044756: TODO
6044792: test_cls_cas
6044827: rgw issue
6044836: test_cls_cas
6044842: looks to be https://tracker.ceph.com/issues/45647
6044843: PriorityCache.cc: 301: FAILED ceph_assert(mem_avail >= 0)
6044872: https://tracker.ceph.com/issues/45721
6044885: "wait_for_recovery: failed before timeout expired" on 1 pg
6044889: looks like https://tracker.ceph.com/issues/49868 is not quite fixed
6044930: test_cls_cas

1 dead
6044547: ansible setup failures (apt-cache issues?)
6044607 has died horribly; PG recovery failed to occur in time and then some kind of shutdown issue. It's not anything this PR could have done.
6044883: "wait_for_clean: failed before timeout expired" and then it didn't stop correctly?

https://pulpito.ceph.com/gregf-2021-04-14_03:25:22-upgrade:octopus-x-wip-mon-quorum-leader-413-distro-basic-smithi/
In-progress run
2 failed
6044938: rgw issue
6044947: rgw issue
These look stuck:
6044935: podman issue in logs and then test just stopped logging a bit later?
6044941: looks like it's finishing, but I also see a podman issue so who knows
6044943: podman issue in logs and then test just stopped logging a bit later?
6044948: same
6044949: same

