
mon/Elector: Change how we handle removed_ranks and notify_rank_removed() #48991

Merged
kamoltat merged 7 commits into ceph:main from kamoltat:wip-ksirivad-fix-bz-2121452
Dec 15, 2022

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Nov 21, 2022

Problem:

Currently, there is an issue when performing a DR test with a 2-site stretch cluster:
removing monitors and adding new ones to the cluster
causes incorrect ranks in the ConnectionTracker class.

This causes a monitor to think it is someone else
in its peer_tracker copy, so it never updates the
correct field for itself, causing a deadlock in the election process (Ceph becoming unresponsive)
when using election strategy 3 (connectivity mode).

Solution:

  1. It was really hard to debug the issue, so the first thing we did was add additional logging
    to the ConnectionTracker, Elector, and ElectionLogic classes.

  2. In ConnectionTracker::receive_peer_report we loop through ranks, which is bad when a notify_rank_removed happens before this and the ranks are not adjusted yet. When we rely on the rank in certain scenarios, we end up with an extra peer_report copy which we don't want. Therefore, instead of passing report.rank to the function
    ConnectionTracker::reports, we pass i.first so that we trim old ranks properly. We also added an assert in notify_rank_removed(), comparing the expected rank provided by the monmap against the rank that we adjust ourselves to, as a sanity check. We edited test/mon/test_election.cc to reflect the changes made in notify_rank_removed().

  3. MonMap::removed_ranks does not get cleared on every epoch update; this was the root cause of the problem. Therefore, we fix this by making MonMap clear removed_ranks on every update. Moreover, when there is a discontinuity between monmaps of more than 1 epoch, or the new monitor never joined the quorum before,
    we always reset peer_tracker.

  4. Added a way for us to manually reset peer_tracker.rank by executing the command ceph connection scores reset on each monitor. peer_tracker.rank will then match the current rank of the monitor.

  5. When upgrading the monitors (including booting up), we check whether peer_tracker is dirty; if so, we clear it. Added some functions in the Elector and ConnectionTracker classes to check for a clean peer_tracker. Moreover, there could be some cases where, due to startup weirdness or abnormal circumstances, we get a report from our own rank. Therefore, it doesn't hurt to add a sanity check in ConnectionTracker::report_live_connection and ConnectionTracker::report_dead_connection.

  6. --mon-initial-members does nothing but cause monmap
    to populate removed_ranks; therefore, remove all instances of --mon-initial-members in the standalone tests, as it has no impact on the nature of the tests themselves.

  7. Monitor::notify_new_monmap() skips removal of a non-existent rank.
    In RHCS the user can choose to manually remove a monitor's rank
    before shutting the monitor down, causing an inconsistency in the monmap.
    Therefore, in Monitor::notify_new_monmap() we prevent the function from removing our own rank or ranks that don't exist in the monmap.
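
To illustrate the rank-removal hazard behind items 2 and 3, here is a minimal sketch (hypothetical `ToyTracker` with simplified types and illustrative integer scores, not Ceph's real `ConnectionTracker` API): reports keyed by rank must be shifted down when a rank is removed, and the assert mirrors the monmap sanity check described above.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified model of rank bookkeeping: peer reports are
// keyed by monitor rank, so removing a rank must shift every higher
// rank down by one, or reports get attributed to the wrong monitor.
struct ToyTracker {
  int rank;                        // our own rank
  std::map<int, int> peer_report;  // peer rank -> illustrative score

  void notify_rank_removed(int removed, int expected_new_rank) {
    peer_report.erase(removed);
    std::map<int, int> shifted;
    for (const auto& [r, score] : peer_report)
      shifted[r > removed ? r - 1 : r] = score;
    peer_report.swap(shifted);
    if (rank > removed)
      --rank;
    // Sanity check in the spirit of item 2: the rank we adjusted
    // ourselves to must match the rank the monmap expects us to have.
    assert(rank == expected_new_rank);
  }
};
```

If the shift (or the clearing of removed_ranks) is skipped, the tracker keeps reports under stale keys, which is exactly the "thinks it is someone else" deadlock described above.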

Fixes: https://tracker.ceph.com/issues/58049
Fixes: https://tracker.ceph.com/issues/58132

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@kamoltat kamoltat requested a review from a team as a code owner November 21, 2022 17:17
@kamoltat kamoltat self-assigned this Nov 21, 2022
@rzarzynski
Contributor

@kamoltat: I'm assuming this is the fix for https://tracker.ceph.com/issues/58049. If so, please add the Fixes: https://tracker.ceph.com/issues/58049 tag to the commits' descriptions.

@kamoltat kamoltat force-pushed the wip-ksirivad-fix-bz-2121452 branch from ecf5aa7 to 491e638 on November 21, 2022 22:09
@kamoltat
Member Author

jenkins retest this please

@rzarzynski
Contributor

@kamoltat: the Fixes: ticket URL needs also to be added to the commits' descriptions, not only to the PR description.

@rzarzynski
Contributor

jenkins retest this please

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

Retargeting #48698 to this PR, since #48698 was opened as a hotfix for downstream Pacific and was successfully cherry-picked. However, we now follow the traditional process: upstream main first, then backport to quincy and pacific.

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

@kamoltat
Member Author

kamoltat commented Dec 7, 2022

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test make check

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Dec 8, 2022

jenkins test windows

@kamoltat
Member Author

kamoltat commented Dec 9, 2022

jenkins test make check

…ElectionLogic & Elector

Problem:

Currently there are no ConnectionTracker logs,
so it is really hard to debug.

Solution:

Enable logging for most functions in ConnectionTracker.cc.
Most of the logs are at debug_mon = 30.

Also added some logs in Elector and ElectionLogic
so that debugging will be easier in the future.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
Problem:

In `ConnectionTracker::receive_peer_report`
we loop through ranks, which is bad when
a `notify_rank_removed` happens before this and
the ranks are not adjusted yet. When we rely
on the rank in certain scenarios, we end up
with an extra peer_report copy which we don't
want.

Solution:

In `ConnectionTracker::receive_peer_report`,
instead of passing `report.rank` to the function
`ConnectionTracker::reports`, we pass `i.first`
so that we trim old ranks properly.

We also added an assert in notify_rank_removed(),
comparing the expected rank provided by the monmap
against the rank that we adjust ourselves to, as
a sanity check.

We edited test/mon/test_election.cc
to reflect the changes made in notify_rank_removed().

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
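
A sketch of the indexing fix (hypothetical, simplified types; the real `ConnectionTracker::reports` signature differs): the incoming report's `rank` field can be stale after a rank removal, so the map key we iterate with (`i.first`) is the trustworthy index.

```cpp
#include <cassert>
#include <map>

// Simplified stand-ins for Ceph's types (illustrative only).
struct Report { int rank = -1; int epoch = 0; };

std::map<int, Report> my_reports;  // rank -> latest report seen

void receive_peer_report(const std::map<int, Report>& incoming) {
  for (const auto& i : incoming) {
    const Report& report = i.second;
    // The fix: index by i.first, the key we were handed, rather than
    // report.rank, which may not have been adjusted yet and would
    // leave an extra stale peer_report entry behind.
    Report& existing = my_reports[i.first];
    if (report.epoch > existing.epoch)
      existing = report;
  }
}
```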
When a new monitor joins, there is a chance that
it will receive a monmap that recently removed
a monitor, and ``removed_ranks`` will have some
content in it. A new monitor that joins
should never remove a rank in peer_tracker, but
rather call ``notify_clear_peer_state()``
to reset the `peer_report`.

In the case of a monitor that
has joined quorum before and is only 1
epoch behind the newest monmap provided
by the probe_replied monitor, we can
actually remove and adjust ranks in `peer_report`,
since we are sure that any content in
removed_ranks must be there because we are
removing a rank in the next epoch, given that on
every update of an epoch we always clear removed_ranks.

There is no point in keeping the content
of ``removed_ranks`` after the monmap gets updated
to the epoch.

Therefore, clear ``removed_ranks`` on every update.

When there is a discontinuity between
monmaps of more than 1 epoch, or the new monitor never joined quorum before,
we always reset `peer_tracker`.

Moreover, it is beneficial for the monitor log to also record
which rank has been removed at the current time
of the monmap, so add removed_ranks to `print_summary`
and `dump` in MonMap.cc.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
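
The epoch bookkeeping above can be sketched as follows (hypothetical `ToyMonMap` and `can_adjust_ranks`, simplified from the MonMap behavior): removed_ranks is cleared on every epoch update, so only a peer that has joined quorum before and is at most one epoch behind may apply it incrementally.

```cpp
#include <cassert>
#include <set>

// Hypothetical simplification of the MonMap change.
struct ToyMonMap {
  unsigned epoch = 0;
  std::set<unsigned> removed_ranks;

  void update_epoch(unsigned new_epoch) {
    epoch = new_epoch;
    // The fix: never carry removal records across epoch updates.
    removed_ranks.clear();
  }
};

// A monitor may remove/adjust ranks in peer_report only when it has
// joined quorum before and is at most 1 epoch behind; otherwise it
// must reset its peer_tracker.
bool can_adjust_ranks(unsigned my_epoch, unsigned map_epoch,
                      bool joined_quorum_before) {
  return joined_quorum_before && my_epoch + 1 >= map_epoch;
}
```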
In `notify_clear_peer_state()` we add another
mechanism for resetting our `peer_tracker.rank`
to match our own monitor rank.

This is added so there is a way for us
to recover from a scenario where `peer_tracker.rank`
is messed up by adjusting the ranks or removing
ranks.

`notify_clear_peer_state()` can be triggered
by using the command:

`ceph connection scores reset`

Also, in `clear_peer_reports`, besides
reassigning my_reports to an empty object,
we also have to set the `rank` in `my_reports`
from `peer_tracker`, so that we don't get
-1 as the rank in my_reports.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
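
The reset path can be sketched like this (hypothetical `ScoreTracker`/`ToyReport`, not the real classes): clearing my_reports alone leaves its rank at -1, so the fix copies the tracker's rank back in.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for ConnectionTracker state.
struct ToyReport { int rank = -1; };

struct ScoreTracker {
  int rank;             // kept in sync with the monitor's own rank
  ToyReport my_reports;

  // Triggered by `ceph connection scores reset` via
  // notify_clear_peer_state() in the real code.
  void clear_peer_reports() {
    my_reports = ToyReport{};  // reset to an empty object (rank == -1)
    my_reports.rank = rank;    // the fix: don't leave -1 behind
  }
};
```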
…d connection report

When upgrading the monitors (including booting up),
we check whether `peer_tracker` is dirty. If
so, we clear it. Added some functions in the `Elector` and
`ConnectionTracker` classes to
check for a clean `peer_tracker`.

Moreover, there could be some cases where due
to startup weirdness or abnormal circumstances,
we might get a report from our own rank. Therefore,
it doesn't hurt to add a sanity check in
`ConnectionTracker::report_live_connection` and
`ConnectionTracker::report_dead_connection`.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
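
A sketch of both checks (hypothetical `ScoreState`; the real Elector/ConnectionTracker helpers differ): a cleanliness probe used at boot/upgrade, and a guard that drops reports about our own rank.

```cpp
#include <cassert>

// Hypothetical, simplified model of the two sanity checks.
struct ScoreState {
  int rank = -1;
  int report_count = 0;

  // "Clean" here means untouched defaults or a rank that already
  // matches the monitor's live rank; anything else is dirty and the
  // caller clears the state.
  bool is_clean(int live_rank) const {
    return (rank == -1 && report_count == 0) || rank == live_rank;
  }

  void report_live_connection(int reporter_rank) {
    if (reporter_rank == rank)
      return;  // sanity check: ignore reports from our own rank
    ++report_count;
  }
};
```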
Problem:

--mon-initial-members does nothing but cause monmap
to populate ``removed_ranks``, because of the way we start
monitors in standalone tests: ``run_mon $dir $id ..``
on each mon. Regardless of --mon-initial-members=a,b,c, if
we set --mon-host=$MONA,$MONB,$MONC (which we do in every single test),
every time we run a monitor (e.g., run mon.b) it will pre-build
our monmap with

```
noname-a=mon.noname-a addrs v2:127.0.0.1:7127/0,
b=mon.b addrs v2:127.0.0.1:7128/0,
noname-c=mon.noname-c addrs v2:127.0.0.1:7129/0,
```

Now, with --mon-initial-members=a,b,c we are letting
monmap know that the initial member names should be
a,b,c, of which only `b` matches. So what
``MonMap::set_initial_members`` does is
remove noname-a and noname-c, which
populates `removed_ranks`.

Solution:

Remove all instances of --mon-initial-members
in the standalone tests, as it has no impact on
the nature of the tests themselves.

Fixes: https://tracker.ceph.com/issues/58132

Signed-off-by: Kamoltat <ksirivad@redhat.com>
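
The mechanism described above can be sketched as follows (hypothetical `set_initial_members`, simplified from MonMap::set_initial_members): pre-built entries whose names don't match the initial-members list are erased, and each erasure lands in removed_ranks.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical simplification: mons maps monitor name -> rank.
std::set<int> set_initial_members(std::map<std::string, int>& mons,
                                  const std::set<std::string>& initial) {
  std::set<int> removed_ranks;
  for (auto it = mons.begin(); it != mons.end();) {
    if (initial.count(it->first) == 0) {
      removed_ranks.insert(it->second);  // removal gets recorded
      it = mons.erase(it);
    } else {
      ++it;
    }
  }
  return removed_ranks;
}
```

With the pre-built map shown earlier, only `b` matches, so the noname ranks end up in removed_ranks even though no monitor was really removed.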
@kamoltat
Member Author

kamoltat commented Dec 9, 2022

jenkins test make check

Problem:
In RHCS the user can choose to manually remove a monitor's rank
before shutting the monitor down, causing an inconsistency in the monmap.
For example, if we remove mon.a from the monmap, there is a short period
where mon.a is still operational and will try to remove itself from
the monmap, but we will run into an assertion in
ConnectionTracker::notify_rank_removed().

Solution:
In Monitor::notify_new_monmap() we prevent the function
from removing our own rank, or
ranks that don't exist in the monmap.

FYI: this is an RHCS-only problem; in ODF
we never remove a monitor from the monmap
before shutting it down.

Fixes: https://tracker.ceph.com/issues/58049

Signed-off-by: Kamoltat <ksirivad@redhat.com>
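
The guard can be sketched like this (hypothetical `ranks_to_remove` helper; the real Monitor::notify_new_monmap() works in place): skip removals that name our own rank or a rank outside the current monmap.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical simplification of the guard added to
// Monitor::notify_new_monmap().
std::vector<int> ranks_to_remove(const std::set<int>& removed_ranks,
                                 int monmap_size, int my_rank) {
  std::vector<int> out;
  for (int r : removed_ranks) {
    if (r == my_rank)
      continue;  // we no longer exist in the monmap; drop the removal
    if (r < 0 || r >= monmap_size)
      continue;  // rank not present in the monmap; nothing to remove
    out.push_back(r);
  }
  return out;
}
```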
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-bz-2121452 branch from e0da0fd to 924e7ec on December 14, 2022 20:14
Member

@gregsfortytwo gregsfortytwo left a comment


The first six commits were previously reviewed and approved in #48698.

The last one also LGTM!

```
dout(5) << "We no longer exists in the monmap! ... dropping." << dendl;
continue;
}
elector.notify_rank_removed(remove_rank, new_rank);
```
Member


It's good to understand the use cases weird bugs like this can pop up in, but we normally don't drop downstream products into our commit messages. Just a note for future activities. :)

@kamoltat
Member Author

25 failures, mostly infra failures; all the other failures are known failures.


FAILED

jobid: [7117874]
description: rados/singleton-nomsgr/{all/osd_stale_reads mon_election/classic rados supported-random-distro$/{rhel_8}}
failure_reason: Server connection dropped:
traceback:
tracker:
created_tracker: https://tracker.ceph.com/issues/58283

jobid: [7117883]
description: rados/cephadm/osds/{0-distro/ubuntu_20.04 0-nvme-loop 1-start 2-ops/repave-all}
failure_reason: Command failed on smithi094 with status 100: 'sudo apt-get clean'
traceback:
tracker:
created_tracker:

jobid: [7117927]
description: rados/multimon/{clusters/9 mon_election/connectivity msgr-failures/many msgr/async no_pools objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8} tasks/mon_clock_no_skews}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117928]
description: rados/singleton/{all/lost-unfound-delete mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-hybrid rados supported-random-distro$/{centos_8}}
failure_reason: {'smithi093.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi093.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117939]
description:
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': "failed to transfer file to /home/teuthworker/.ansible/tmp/ansible-local-12547kwq9iumd/tmpb7be8wx6 /home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py:\n\nWarning: Permanently added 'smithi089.front.sepia.ceph.com,172.21.15.89' (ECDSA) to the list of known hosts.\r\ndd: failed to open '/home/ubuntu/.ansible/tmp/ansible-tmp-1671089653.5063088-18352-262171494946209/AnsiballZ_file.py': No such file or directory\n"}}
traceback:
tracker:
created_tracker:

jobid: [7117948]
description: rados/thrash-erasure-code-shec/{ceph clusters/{fixed-4 openstack} mon_election/classic msgr-failures/osd-dispatch-delay objectstore/filestore-xfs rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/ec-rados-plugin=shec-k=4-m=3-c=2}
failure_reason: Stale jobs detected, aborting.
traceback:
tracker:
created_tracker:

jobid: [7117950]
description: rados/cephadm/osds/{0-distro/centos_8.stream_container_tools_crun 0-nvme-loop 1-start 2-ops/rm-zap-flag}
failure_reason: {'smithi016.front.sepia.ceph.com': {'changed': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi016.front.sepia.ceph.com port 22: No route to host', 'unreachable': True}}
traceback:
tracker:
created_tracker:

jobid: [7117951] (Non-infra)
description: rados/singleton-nomsgr/{all/ceph-post-file mon_election/connectivity rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test post-file.sh) on smithi093 with status 255: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/post-file.sh'
traceback: connect to host drop.ceph.com port 22: Connection timed out
tracker: https://tracker.ceph.com/issues/58097
created_tracker:

jobid: [7117977]
description: rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-fast-read-overwrites}
failure_reason: [Errno None] Unable to connect to port 22 on 172.21.15.89
traceback:
tracker: https://tracker.ceph.com/issues/47343
created_tracker:

jobid: [7117979, 7118058, 7118195] (Non infra)
description: rados/rook/smoke/{0-distro/ubuntu_20.04 0-kubeadm 0-nvme-loop 1-rook 2-workload/radosbench cluster/3-node k8s/1.21 net/calico rook/1.7.2}
failure_reason: Command failed on smithi037 with status 1: 'sudo kubeadm init --node-name smithi037 --token abcdef.q9zcptfltqth0cc4 --pod-network-cidr 10.249.32.0/21'
traceback:
tracker: https://tracker.ceph.com/issues/58258
created_tracker:

jobid: [7117988] (Non infra)
description: rados/objectstore/{backends/objectstore-bluestore-a supported-random-distro$/{rhel_8}}
failure_reason: Command failed on smithi174 with status 1: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c 'mkdir $TESTDIR/archive/ostest && cd $TESTDIR/archive/ostest && ulimit -Sn 16384 && CEPH_ARGS="--no-log-to-stderr --log-file $TESTDIR/archive/ceph_test_objectstore.log --debug-bluestore 20" ceph_test_objectstore --gtest_filter=*/2:-SyntheticMatrixC --gtest_catch_exceptions=0''
traceback: [ FAILED ] ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2, where GetParam() = "bluestore"
tracker: https://tracker.ceph.com/issues/58256
created_tracker:

jobid: [7117990]
description: rados/perf/{ceph mon_election/classic objectstore/bluestore-basic-min-osd-mem-target openstack scheduler/dmclock_1Shard_16Threads settings/optimized ubuntu_latest workloads/radosbench_4M_write}
failure_reason: {'smithi089.front.sepia.ceph.com': {'_ansible_no_log': False, 'msg': 'Failed to connect to the host via ssh: ssh: connect to host smithi089.front.sepia.ceph.com port 22: No route to host'}}
traceback:
tracker:
created_tracker:

jobid: [7118010]
description: rados/cephadm/workunits/{0-distro/rhel_8.6_container_tools_3.0 agent/off mon_election/classic task/test_iscsi_pids_limit/{centos_8.stream_container_tools test_iscsi_pids_limit}}
failure_reason: package container-selinux-2:2.189.0-1.module_el8.7.0+1217+ea57d1f1.noarch conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1.module_el8.7.0+1217+ea57d1f1.noarch
tracker:
created_tracker:

jobid: [7118004] (Non infra)
description: rados/standalone/{supported-random-distro$/{ubuntu_latest} workloads/osd}
failure_reason: Command failed (workunit test osd/divergent-priors.sh) on smithi040 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh'
traceback: 2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:818: TEST_divergent_3: echo failure
2022-12-15T08:34:31.500 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:819: TEST_divergent_3: return 1
2022-12-15T08:34:31.501 INFO:tasks.workunit.client.0.smithi040.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:46: run: return 1
tracker: https://tracker.ceph.com/issues/56034
created_tracker:

jobid: [7118114] (Non infra)
description: rados/singleton/{all/test-crash mon_election/classic msgr-failures/few msgr/async-v2only objectstore/filestore-xfs rados supported-random-distro$/{ubuntu_latest}}
failure_reason: Command failed (workunit test rados/test_crash.sh) on smithi055 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9d538d1d13343c5c2d41e40ab0ca09770646c22f TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_crash.sh'
traceback: [ 0 = 4 ]
tracker: https://tracker.ceph.com/issues/58098
created_tracker:


DEAD

All dead jobs are infra-related behavior
