
mon/Elector: Change how we handle removed_ranks and notify_rank_removed() #48991

Merged
kamoltat merged 7 commits into ceph:main from kamoltat:wip-ksirivad-fix-bz-2121452
Dec 15, 2022

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Nov 21, 2022

Problem:

Currently, there is an issue when performing a DR test with a 2-site stretch cluster:
removing monitors and adding new ones to the cluster
causes incorrect ranks in the ConnectionTracker class.

This causes a monitor to think it is someone else
in its peer_tracker copy, so it never updates the
correct field for itself, causing a deadlock in the election process (Ceph becoming unresponsive)
when using election strategy 3 (connectivity mode).

Solution:

  1. It was really hard to debug the issue, so the first thing we did was add additional logging
    to the ConnectionTracker, Elector, and ElectionLogic classes.

  2. In ConnectionTracker::receive_peer_report we loop through ranks, which is bad when a notify_rank_removed happens before this and the ranks are not adjusted yet. When we rely on the rank in certain scenarios, we end up with an extra peer_report copy which we don't want. Therefore, instead of passing report.rank to the function
    ConnectionTracker::reports, we pass i.first so that we trim old ranks properly. We also added an assert in notify_rank_removed(), comparing the expected rank provided by the monmap against the rank that we adjust ourselves to, as a sanity check. We edited test/mon/test_election.cc to reflect the changes made in notify_rank_removed().

  3. MonMap::removed_ranks does not get cleared on every epoch update; this was the root cause of the problem. Therefore, we fix this by making MonMap clear removed_ranks on every update. Moreover, when there is a discontinuity between monmaps of more than 1 epoch, or the new monitor never joined the quorum before,
    we always reset peer_tracker.

  4. Added a way for us to manually reset peer_tracker.rank by executing the command ceph connection scores reset on each monitor. peer_tracker.rank will then match the current rank of the monitor.

  5. When upgrading the monitors (including booting up), we check whether peer_tracker is dirty; if so, we clear it. Added some functions in the Elector and ConnectionTracker classes to check for a clean peer_tracker. Moreover, there could be some cases where, due to startup weirdness or abnormal circumstances, we get a report from our own rank. Therefore, it doesn't hurt to add a sanity check in ConnectionTracker::report_live_connection and ConnectionTracker::report_dead_connection.

  6. --mon-initial-members does nothing but cause monmap
    to populate removed_ranks; therefore, remove all instances of --mon-initial-members in the standalone tests, as it has no impact on the nature of the tests themselves.

  7. Monitor::notify_new_monmap() skips removal of a non-existent rank.
    In RHCS the user can choose to manually remove a monitor's rank
    before shutting the monitor down, causing an inconsistency in the monmap.
    Therefore, in Monitor::notify_new_monmap() we prevent the function from removing our own rank or ranks that don't exist in the monmap.
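
To illustrate the rank-removal hazard behind items 2 and 3, here is a minimal sketch (hypothetical `ToyTracker` with simplified types and illustrative integer scores, not Ceph's real `ConnectionTracker` API): reports keyed by rank must be shifted down when a rank is removed, and the assert mirrors the monmap sanity check described above.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified model of rank bookkeeping: peer reports are
// keyed by monitor rank, so removing a rank must shift every higher
// rank down by one, or reports get attributed to the wrong monitor.
struct ToyTracker {
  int rank;                        // our own rank
  std::map<int, int> peer_report;  // peer rank -> illustrative score

  void notify_rank_removed(int removed, int expected_new_rank) {
    peer_report.erase(removed);
    std::map<int, int> shifted;
    for (const auto& [r, score] : peer_report)
      shifted[r > removed ? r - 1 : r] = score;
    peer_report.swap(shifted);
    if (rank > removed)
      --rank;
    // Sanity check in the spirit of item 2: the rank we adjusted
    // ourselves to must match the rank the monmap expects us to have.
    assert(rank == expected_new_rank);
  }
};
```

If the shift (or the clearing of removed_ranks) is skipped, the tracker keeps reports under stale keys, which is exactly the "thinks it is someone else" deadlock described above.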

Fixes: https://tracker.ceph.com/issues/58049
Fixes: https://tracker.ceph.com/issues/58132

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@kamoltat kamoltat requested a review from a team as a code owner November 21, 2022 17:17
@kamoltat kamoltat self-assigned this Nov 21, 2022
@rzarzynski
Contributor

@kamoltat: I'm assuming this is the fix for https://tracker.ceph.com/issues/58049. If so, please add the Fixes: https://tracker.ceph.com/issues/58049 tag to the commits' descriptions.

@kamoltat kamoltat force-pushed the wip-ksirivad-fix-bz-2121452 branch from ecf5aa7 to 491e638 on November 21, 2022 22:09
@kamoltat
Member Author

jenkins retest this please

@rzarzynski
Contributor

@kamoltat: the Fixes: ticket URL needs also to be added to the commits' descriptions, not only to the PR description.

@rzarzynski
Contributor

jenkins retest this please

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

Retargeting #48698 to this PR, since #48698 was opened as a hotfix for downstream Pacific and was successfully cherry-picked. However, we now follow the traditional process: upstream main first, then backport to quincy and pacific.

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test make check

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test windows

@kamoltat
Member Author

kamoltat commented Dec 9, 2022

jenkins test make check

…ElectionLogic & Elector

Problem:

Currently there are no ConnectionTracker logs,
so it is really hard to debug.

Solution:

Enable logging for most functions in ConnectionTracker.cc.
Most of the logs are at debug_mon = 30.

Also added some logs in Elector and ElectionLogic
so that debugging will be easier in the future.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
Problem:

In `ConnectionTracker::receive_peer_report`
we loop through ranks, which is bad when
a `notify_rank_removed` happens before this and
the ranks are not adjusted yet. When we rely
on the rank in certain scenarios, we end up
with an extra peer_report copy which we don't
want.

Solution:

In `ConnectionTracker::receive_peer_report`,
instead of passing `report.rank` to the function
`ConnectionTracker::reports`, we pass `i.first`
so that we trim old ranks properly.

We also added an assert in notify_rank_removed(),
comparing the expected rank provided by the monmap
against the rank that we adjust ourselves to, as
a sanity check.

We edited test/mon/test_election.cc
to reflect the changes made in notify_rank_removed().

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
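
A sketch of the indexing fix (hypothetical, simplified types; the real `ConnectionTracker::reports` signature differs): the incoming report's `rank` field can be stale after a rank removal, so the map key we iterate with (`i.first`) is the trustworthy index.

```cpp
#include <cassert>
#include <map>

// Simplified stand-ins for Ceph's types (illustrative only).
struct Report { int rank = -1; int epoch = 0; };

std::map<int, Report> my_reports;  // rank -> latest report seen

void receive_peer_report(const std::map<int, Report>& incoming) {
  for (const auto& i : incoming) {
    const Report& report = i.second;
    // The fix: index by i.first, the key we were handed, rather than
    // report.rank, which may not have been adjusted yet and would
    // leave an extra stale peer_report entry behind.
    Report& existing = my_reports[i.first];
    if (report.epoch > existing.epoch)
      existing = report;
  }
}
```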
When a new monitor joins, there is a chance that
it will receive a monmap that recently removed
a monitor, and ``removed_ranks`` will have some
content in it. A new monitor that joins
should never remove a rank in peer_tracker, but
rather call ``notify_clear_peer_state()``
to reset the `peer_report`.

In the case of a monitor that
has joined quorum before and is only 1
epoch behind the newest monmap provided
by the probe_replied monitor, we can
actually remove and adjust ranks in `peer_report`,
since we are sure that any content in
removed_ranks must be there because we are
removing a rank in the next epoch, given that on
every update of an epoch we always clear removed_ranks.

There is no point in keeping the content
of ``removed_ranks`` after the monmap gets updated
to the epoch.

Therefore, clear ``removed_ranks`` on every update.

When there is a discontinuity between
monmaps of more than 1 epoch, or the new monitor never joined quorum before,
we always reset `peer_tracker`.

Moreover, it is beneficial for the monitor log to also record
which rank has been removed at the current time
of the monmap, so add removed_ranks to `print_summary`
and `dump` in MonMap.cc.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
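
The epoch bookkeeping above can be sketched as follows (hypothetical `ToyMonMap` and `can_adjust_ranks`, simplified from the MonMap behavior): removed_ranks is cleared on every epoch update, so only a peer that has joined quorum before and is at most one epoch behind may apply it incrementally.

```cpp
#include <cassert>
#include <set>

// Hypothetical simplification of the MonMap change.
struct ToyMonMap {
  unsigned epoch = 0;
  std::set<unsigned> removed_ranks;

  void update_epoch(unsigned new_epoch) {
    epoch = new_epoch;
    // The fix: never carry removal records across epoch updates.
    removed_ranks.clear();
  }
};

// A monitor may remove/adjust ranks in peer_report only when it has
// joined quorum before and is at most 1 epoch behind; otherwise it
// must reset its peer_tracker.
bool can_adjust_ranks(unsigned my_epoch, unsigned map_epoch,
                      bool joined_quorum_before) {
  return joined_quorum_before && my_epoch + 1 >= map_epoch;
}
```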
In `notify_clear_peer_state()` we add another
mechanism for resetting our `peer_tracker.rank`
to match our own monitor rank.

This is added so there is a way for us
to recover from a scenario where `peer_tracker.rank`
is messed up by adjusting the ranks or removing
ranks.

`notify_clear_peer_state()` can be triggered
by using the command:

`ceph connection scores reset`

Also, in `clear_peer_reports`, besides
reassigning my_reports to an empty object,
we also have to set the `rank` in `my_reports`
from `peer_tracker`, so that we don't get
-1 as the rank in my_reports.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
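
The reset path can be sketched like this (hypothetical `ScoreTracker`/`ToyReport`, not the real classes): clearing my_reports alone leaves its rank at -1, so the fix copies the tracker's rank back in.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for ConnectionTracker state.
struct ToyReport { int rank = -1; };

struct ScoreTracker {
  int rank;             // kept in sync with the monitor's own rank
  ToyReport my_reports;

  // Triggered by `ceph connection scores reset` via
  // notify_clear_peer_state() in the real code.
  void clear_peer_reports() {
    my_reports = ToyReport{};  // reset to an empty object (rank == -1)
    my_reports.rank = rank;    // the fix: don't leave -1 behind
  }
};
```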
…d connection report

When upgrading the monitors (including booting up),
we check whether `peer_tracker` is dirty. If
so, we clear it. Added some functions in the `Elector` and
`ConnectionTracker` classes to
check for a clean `peer_tracker`.

Moreover, there could be some cases where due
to startup weirdness or abnormal circumstances,
we might get a report from our own rank. Therefore,
it doesn't hurt to add a sanity check in
`ConnectionTracker::report_live_connection` and
`ConnectionTracker::report_dead_connection`.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
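
A sketch of both checks (hypothetical `ScoreState`; the real Elector/ConnectionTracker helpers differ): a cleanliness probe used at boot/upgrade, and a guard that drops reports about our own rank.

```cpp
#include <cassert>

// Hypothetical, simplified model of the two sanity checks.
struct ScoreState {
  int rank = -1;
  int report_count = 0;

  // "Clean" here means untouched defaults or a rank that already
  // matches the monitor's live rank; anything else is dirty and the
  // caller clears the state.
  bool is_clean(int live_rank) const {
    return (rank == -1 && report_count == 0) || rank == live_rank;
  }

  void report_live_connection(int reporter_rank) {
    if (reporter_rank == rank)
      return;  // sanity check: ignore reports from our own rank
    ++report_count;
  }
};
```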
Problem:

--mon-initial-members does nothing but cause monmap
to populate ``removed_ranks``, because of the way we start
monitors in standalone tests: ``run_mon $dir $id ..``
on each mon. Regardless of --mon-initial-members=a,b,c, if
we set --mon-host=$MONA,$MONB,$MONC (which we do in every single test),
every time we run a monitor (e.g., run mon.b) it will pre-build
our monmap with

```
noname-a=mon.noname-a addrs v2:127.0.0.1:7127/0,
b=mon.b addrs v2:127.0.0.1:7128/0,
noname-c=mon.noname-c addrs v2:127.0.0.1:7129/0,
```

Now, with --mon-initial-members=a,b,c we are letting
monmap know that the initial member names should be
a,b,c, of which only `b` matches. So what
``MonMap::set_initial_members`` does is
remove noname-a and noname-c, which
populates `removed_ranks`.

Solution:

Remove all instances of --mon-initial-members
in the standalone tests, as it has no impact on
the nature of the tests themselves.

Fixes: https://tracker.ceph.com/issues/58132

Signed-off-by: Kamoltat <ksirivad@redhat.com>
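
The mechanism described above can be sketched as follows (hypothetical `set_initial_members`, simplified from MonMap::set_initial_members): pre-built entries whose names don't match the initial-members list are erased, and each erasure lands in removed_ranks.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical simplification: mons maps monitor name -> rank.
std::set<int> set_initial_members(std::map<std::string, int>& mons,
                                  const std::set<std::string>& initial) {
  std::set<int> removed_ranks;
  for (auto it = mons.begin(); it != mons.end();) {
    if (initial.count(it->first) == 0) {
      removed_ranks.insert(it->second);  // removal gets recorded
      it = mons.erase(it);
    } else {
      ++it;
    }
  }
  return removed_ranks;
}
```

With the pre-built map shown earlier, only `b` matches, so the noname ranks end up in removed_ranks even though no monitor was really removed.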
@kamoltat
Member Author

kamoltat commented Dec 9, 2022

jenkins test make check

Problem:
In RHCS the user can choose to manually remove a monitor's rank
before shutting the monitor down, causing an inconsistency in the monmap.
For example, if we remove mon.a from the monmap, there is a short period
where mon.a is still operational and will try to remove itself from
the monmap, but we will run into an assertion in
ConnectionTracker::notify_rank_removed().

Solution:
In Monitor::notify_new_monmap() we prevent the function
from removing our own rank, or
ranks that don't exist in the monmap.

FYI: this is an RHCS-only problem; in ODF
we never remove a monitor from the monmap
before shutting it down.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
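
The guard can be sketched like this (hypothetical `ranks_to_remove` helper; the real Monitor::notify_new_monmap() works in place): skip removals that name our own rank or a rank outside the current monmap.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical simplification of the guard added to
// Monitor::notify_new_monmap().
std::vector<int> ranks_to_remove(const std::set<int>& removed_ranks,
                                 int monmap_size, int my_rank) {
  std::vector<int> out;
  for (int r : removed_ranks) {
    if (r == my_rank)
      continue;  // we no longer exist in the monmap; drop the removal
    if (r < 0 || r >= monmap_size)
      continue;  // rank not present in the monmap; nothing to remove
    out.push_back(r);
  }
  return out;
}
```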
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-bz-2121452 branch from e0da0fd to 924e7ec on December 14, 2022 20:14
Member

@gregsfortytwo gregsfortytwo left a comment


The first six commits were previously reviewed and approved in #48698.

The last one also LGTM!

```
dout(5) << "We no longer exists in the monmap! ... dropping." << dendl;
continue;
}
elector.notify_rank_removed(remove_rank, new_rank);
```
Member


It's good to understand the use cases weird bugs like this can pop up in, but we normally don't drop downstream products into our commit messages. Just a note for future activities. :)

@kamoltat
Member Author

25 failures, mostly infra failures; all the other failures are known failures.


FAILED

jobid: [7117874]
description: rados/singleton-nomsgr/{all/osd_stale_reads mon_election/classic rados supported-random-distro$/{rhel_8}}
failure_reason: Server connection dropped:
traceback:
tracker:
created_tracker: https://tracker.ceph.com/issues/58283

jobid: [7117883]
description: rados/cephadm/osds/{0-distro/ubuntu_20.04 0-nvme-loop 1-start 2-ops/repave-all}
failure_reason: Command failed on smithi094 with status 100: 'sudo apt-get clean'
traceback:
tracker:
created_tracker:

jobid: [7117927]
description: rados/multimon/{clusters/9 mon_election/connectivity msgr-failures/many msgr/async no_pools objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8} tasks/mon_clock_no_skews}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117928]
description: rados/singleton/{all/lost-unfound-delete mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-hybrid rados supported-random-distro$/{centos_8}}
failure_reason: {'smithi093.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi093.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117939]
description:
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': "failed to transfer file to /home/teuthworker/.ansible/tmp/ansible-local-12547kwq9iumd/tmpb7be8wx6 /home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py:\n\nWarning: Permanently added 'smithi089.front.sepia.ceph.com,172.21.15.89' (ECDSA) to the list of known hosts.\r\ndd: failed to open '/home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py': No such file or directory\n"}}
traceback:
tracker:
created_tracker:

jobid: [7117948]
description: rados/thrash-erasure-code-shec/{ceph clusters/{fixed-4 openstack} mon_election/classic msgr-failures/osd-dispatch-delay objectstore/filestore-xfs rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/ec-rados-plugin=shec-k=4-m=3-c=2}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117950]
description: rados/cephadm/osds/{0-distro/centos_8.stream_container_tools_crun 0-nvme-loop 1-start 2-ops/rm-zap-flag}
failure_reason: {'smithi016.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi016.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117951] (Non-infra)
description: rados/singleton-nomsgr/{all/ceph-post-file mon_election/connectivity rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test post-file.sh) on smithi093 with status 255: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/post-file.sh'
traceback: connect to host drop.ceph.com port 22: Connection timed out
tracker: https://tracker.ceph.com/issues/58097
created_tracker:

jobid: [7117977]
description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-fast-read-overwrites}
failure_reason: [Errno None] Unable to connect to port 22 on 172.21.15.89
traceback:
tracker: https://tracker.ceph.com/issues/47343
created_tracker:

jobid: [7117979, 7118058, 7118195] (Non infra)
description: rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench cluster/3-node k8s/1.21 net/calico rook/1.7.2}
failure_reason: Command failed on smithi037 with status 1: 'sudo kubeadm init --node-name smithi037 --token abcdef.q9zcptfltqth0cc4 --pod-network-cidr 10.249.32.0/21'
traceback:
tracker: https://tracker.ceph.com/issues/58258
created_tracker:

jobid: [7117988] (Non infra)
description: rados/objectstore/{backends/objectstore-bluestore-a supported-random-distro$/{rhel_8}}
failure_reason: Command failed on smithi174 with status 1: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c 'mkdir $TESTDIR/archive/ostest && cd $TESTDIR/archive/ostest && ulimit -Sn 16384 && CEPH_ARGS="--no-log-to-stderr --log-file $TESTDIR/archive/ceph_test_objectstore.log --debug-bluestore 20" ceph_test_objectstore --gtest_filter=*/2:-SyntheticMatrixC --gtest_catch_exceptions=0''
traceback: [ FAILED ] ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2, where GetParam() = "bluestore"
tracker: https://tracker.ceph.com/issues/58256
created_tracker:

jobid: [7117990]
description: rados/perf/{ceph mon_election/classic objectstore/bluestore-basic-min-osd-mem-target openstack scheduler/dmclock_1Shard_16Threads settings/optimized ubuntu_latest workloads/radosbench_4M_write}
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi089.front.sepia.ceph.com port 22: No route to host'}}
traceback:
tracker:
created_tracker:

jobid: [7118010]
description: rados/cephadm/workunits/{0-distro/rhel_8.6_container_tools_3.0 agent/off mon_election/classic task/test_iscsi_pids_limit/{centos_8.stream_container_tools test_iscsi_pids_limit}}
failure_reason: package container-selinux-2:2.189.0-1.module_el8.7.0+1217+ea57d1f1.noarch conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1.module_el8.7.0+1217+ea57d1f1.noarch
tracker:
created_tracker:

jobid: [7118004] (Non infra)
description: rados/standalone/{supported-random-distro$/{ubuntu_latest} workloads/osd}
failure_reason: Command failed (workunit test osd/divergent-priors.sh) on smithi040 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh'
traceback: 2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:818: TEST_divergent_3: echo failure
2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:819: TEST_divergent_3: return 1
2022-12-15T08:34:31.501 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:46: run: return 1
tracker: https://tracker.ceph.com/issues/56034
created_tracker:

jobid: [7118114] (Non infra)
description: rados/singleton/{all/test-crash mon_election/classic msgr-failures/few msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test rados/test_crash.sh) on smithi055 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_crash.sh'
traceback: [ 0 = 4 ]
tracker: https://tracker.ceph.com/issues/58098
created_tracker:


DEAD

All dead jobs are infra-related behavior
