mon: MMonProbe: direct MMonJoin messages to the leader, instead of the first mon #40839

Merged
gregsfortytwo merged 1 commit into ceph:master from gregsfortytwo:wip-mon-quorum-leader on Apr 14, 2021
Conversation

@gregsfortytwo
Member

mon: MMonProbe: direct MMonJoin messages to the leader, instead of the first mon

When monitors are joining a cluster, they may send an MMonJoin message to place
themselves correctly in the map in either handle_probe_reply() or
finish_election(). These messages must be sent to the leader -- monitors do not
forward each other's messages.

Unfortunately, this scenario was missed when converting the monitors to support
connectivity-based elections, and they're sending these messages to
quorum.begin(). Fix this by including an explicit leader in MMonProbe (that the
new monitor may reference in handle_probe_reply) and using the leader
value in both locations.

Fixes: https://tracker.ceph.com/issues/50345

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@gregsfortytwo
Member Author

Reproduced the issue locally and will schedule suites as soon as packages are built.

@gregsfortytwo
Member Author

Test results are pretty messy, but nothing that looks related to this PR.

https://pulpito.ceph.com/gregf-2021-04-14_03:14:48-rados-wip-mon-quorum-leader-413-distro-basic-smithi/
In-progress run; 443 passed
13 failed
6044468: cls_cas dup_get test; there's a ticket for it
6044475: issues with cephadm and then selinux denials
6044484: looks like it tried to issue a "ceph tell" to a monitor which was electing and then the watchdog killed the test
6044592: rgw issue
6044655: test_cls_cas
6044710: issues with cephadm and then selinux denials
6044747: known OSD valgrind leak
6044756: TODO
6044792: test_cls_cas
6044827: rgw issue
6044836: test_cls_cas
6044842: looks to be https://tracker.ceph.com/issues/45647
6044843: PriorityCache.cc: 301: FAILED ceph_assert(mem_avail >= 0)
6044872: https://tracker.ceph.com/issues/45721
6044885: "wait_for_recovery: failed before timeout expired" on 1 pg
6044889: looks like https://tracker.ceph.com/issues/49868 is not quite fixed
6044930: test_cls_cas

1 dead
6044547: ansible setup failures (apt-cache issues?)
6044607 has died horribly; PG recovery failed to occur in time and then some kind of shutdown issue. It's not anything this PR could have done.
6044883: "wait_for_clean: failed before timeout expired" and then it didn't stop correctly?

https://pulpito.ceph.com/gregf-2021-04-14_03:25:22-upgrade:octopus-x-wip-mon-quorum-leader-413-distro-basic-smithi/
In-progress run
2 failed
6044938: rgw issue
6044947: rgw issue
These look stuck:
6044935: podman issue in logs and then test just stopped logging a bit later?
6044941: looks like it's finishing, but I also see a podman issue so who knows
6044943: podman issue in logs and then test just stopped logging a bit later?
6044948: same
6044949: same

