mon: fix a race between `mgr fail` and MgrMonitor::prepare_beacon() by rzarzynski · Pull Request #46318 · ceph/ceph

rzarzynski · 2022-05-18T14:13:13Z

There is a race condition between the mgr fail handling and mgrbeacon.

diff --git a/src/mon/MgrMonitor.cc b/src/mon/MgrMonitor.cc
index 8ada44e2628..9000b2e0687 100644
--- a/src/mon/MgrMonitor.cc
+++ b/src/mon/MgrMonitor.cc
@@ -1203,7 +1203,9 @@ bool MgrMonitor::prepare_command(MonOpRequestRef op)
     }

     if (changed && pending_map.active_gid == 0) {
+      dout(5) << "========== changed and active_state == 0" << dendl;
       promote_standby();
+      dout(5) << "========== after promote_standby: " << pending_map.active_gid << dendl;
     }
   } else if (prefix == "mgr module enable") {
     string module;

2022-05-17T00:11:19.602+0200 7f6bd5769700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "mgr fail", "who": "x"} v 0) v1
...
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== changed and active_state == 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 prepare_command done, r=0
...
2022-05-17T00:11:19.630+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 )

bool MgrMonitor::prepare_beacon(MonOpRequestRef op)
  if (pending_map.active_gid == m->get_gid()) {
    // ...
  } else if (pending_map.active_gid == 0) {
    // There is no currently active daemon, select this one.
    if (pending_map.standbys.count(m->get_gid())) {
      drop_standby(m->get_gid(), false);
    }
    dout(4) << "selecting new active " << m->get_gid()
            << " " << m->get_name()
            << " (was " << pending_map.active_gid << " "
            << pending_map.active_name << ")" << dendl;
    pending_map.active_gid = m->get_gid();
    pending_map.active_name = m->get_name();
    pending_map.active_change = ceph_clock_now()

The 25th version of MgrMap, when handled at mgr.x, doesn't trigger the respawn() path (while e.g. cephadm assumes it does):

2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr ms_dispatch2 active mgrmap(e 25) v1
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map received map epoch 25
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map active in map: 1 active is 4210
2022-05-17T00:10:11.197+0200 7fa3d6613700 10 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_read_frame_dispatch tag=17
2022-05-17T00:10:11.197+0200 7fa3d6613700  5 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_message got 43089 + 0 + 0 byte message. envelope type=1796 src mon.0 off 0
2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr handle_mgr_map I was already active

Signed-off-by: Radosław Zarzyński rzarzyns@redhat.com

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

gregsfortytwo · 2022-05-26T02:31:01Z

I'm not understanding what race condition you're saying is here, nor how this fixes anything. propose_pending() will do an immediate propose, but without that the command still waits for a commit before replying to the client which invoked "mgr fail". And this PR doesn't change anything at all about the locking sequence you've displayed in the logs.

I do note that if there are no standbys available in pending_map.standbys, promote_standby() won't make a change and just returns false. I'd guess you're seeing that?

rzarzynski · 2022-05-26T09:40:49Z

Hello Greg! Thanks for taking a look!

I think the problem is not about synchronization between the pair:

mgr fail issuer and
monitor

but it involves the third entity too: a manager which takes a different path than the MgrStandby::respawn() (full restart) supposed by the client issuing mgr fail.

Manager misses it because active_gid = 0 imposed by mgr fail gets overridden by MgrMonitor::prepare_beacon() without bumping the epoch up. This is visible here:

2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0
...
2022-05-17T00:11:19.630+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 )

IIUC propose_pending() flushes the pending_map, and thus allows the manager to observe active_gid transitions through 0.

gregsfortytwo · 2022-05-26T14:20:33Z

Ah, that makes more sense. I see a few different solutions, though I don't know what the balance of issues is for the mgr dependencies:

We can make prepare_beacon() notice this case and not set a mgr to active if it's been marked failed (eg, by comparing the current to pending_map and seeing that it got removed).
We can do this, but as written it's bad because forcing immediate paxos commits is something we try to avoid until it's necessary (batching is good! it's very good on an iop-limited centralized system!)
we can decide that failing live mgrs is something you can't do.

Should probably discuss that with people who know more about the mgr. I'm guessing they do want the ability to fail the mgr though. So perhaps adjust this PR so that prepare_command() only proposes when handling a fail command? There's a PaxosService::force_immediate_propose() function which will do a propose after returning out of prepare_command() you can invoke, instead of invoking propose_pending() every time inside of it.

rzarzynski · 2022-05-26T22:33:38Z

My impression was the mgr-related admin commands are supposed to be executed infrequently, usually in a reaction to an operator's request (even when it happens indirectly – via the orchestrator). The other commands we would flush the pending map for are about turning on/off manager's modules. I believe they are rarely happening too while (I don't have any proof, just gut feelings) are vulnerable to similar races as mgr fail.

When investigating possible approaches I was considering a pin-point fix around prepare_beacon(). However, the complexity there leaned me towards a really dumb solution: flush pending_map after every mgr-related command. I agree we have a trade-off here: performance vs simplicity and I hope we can favor simplicity.

A way to restart mgrs is already neeeded by cephadm / orchestrator modules. The Sage's auth rotation is another user, so I'm afraid we won't be able to squeeze out this feature :-(.

gregsfortytwo · 2022-05-26T22:51:18Z

I just don't want to force commits by default on commands. It may not cause a problem immediately, but defaulting to a forced paxos commit is a bad smell in monitor code and can lead to hard-to-detect scaling issues if we expand these commands in the future. New mgr commands are most likely to be involved with upgrades or recovery situations where we're likely already loading the monitors down and needing to invoke the commands frequently!

rzarzynski · 2022-05-27T08:56:22Z

Ah, I see: future-proofness. You're worried it would be too easy to miss this flush when adding a new, possibly not so innocent and infrequent command. Makes sense. I will limit the flush to mgr fail and push a new revision. Thanks for pointing this out!

rzarzynski · 2022-06-07T11:32:16Z

@gregsfortytwo: would you mind taking another look?

gregsfortytwo

Good enough for now. In the other monitor implementations we have a "should_propose" boolean that gets set in each prefix when appropriate, but that's a simple and obvious enough change if it comes up in future.

sseshasa · 2022-06-29T14:43:30Z

Teuthology Test Result
https://pulpito.ceph.com/yuriw-2022-06-27_15:15:17-rados-wip-yuri2-testing-2022-06-24-1331-distro-default-smithi

Related Failure
@rzarzynski The following failed job appears to be related to this PR. Can you please take a look?
JobID: https://pulpito.ceph.com/yuriw-2022-06-27_15:15:17-rados-wip-yuri2-testing-2022-06-24-1331-distro-default-smithi/6900866
Failure Reason: "2022-06-27T16:40:14.344915+0000 mon.b (mon.1) 89 : cluster [WRN] Health check failed: 1/3 mons down, quorum b,c (MON_DOWN)" in cluster log
Assertion failure on mon.a: /src/mon/PaxosService.cc: 193: FAILED ceph_assert(have_pending)

There is a race condition between the `mgr fail` handling and `mgrbeacon`. ```diff diff --git a/src/mon/MgrMonitor.cc b/src/mon/MgrMonitor.cc index 8ada44e..9000b2e0687 100644 --- a/src/mon/MgrMonitor.cc +++ b/src/mon/MgrMonitor.cc @@ -1203,7 +1203,9 @@ bool MgrMonitor::prepare_command(MonOpRequestRef op) } if (changed && pending_map.active_gid == 0) { + dout(5) << "========== changed and active_state == 0" << dendl; promote_standby(); + dout(5) << "========== after promote_standby: " << pending_map.active_gid << dendl; } } else if (prefix == "mgr module enable") { string module; ``` ``` 2022-05-17T00:11:19.602+0200 7f6bd5769700 0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "mgr fail", "who": "x"} v 0) v1 ... 2022-05-17T00:11:19.614+0200 7f6bd5769700 5 mon.a@0(leader).mgr e25 ========== changed and active_state == 0 2022-05-17T00:11:19.614+0200 7f6bd5769700 5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0 2022-05-17T00:11:19.614+0200 7f6bd5769700 4 mon.a@0(leader).mgr e25 prepare_command done, r=0 ... 2022-05-17T00:11:19.630+0200 7f6bd5769700 4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 ) ``` ```cpp bool MgrMonitor::prepare_beacon(MonOpRequestRef op) if (pending_map.active_gid == m->get_gid()) { // ... } else if (pending_map.active_gid == 0) { // There is no currently active daemon, select this one. if (pending_map.standbys.count(m->get_gid())) { drop_standby(m->get_gid(), false); } dout(4) << "selecting new active " << m->get_gid() << " " << m->get_name() << " (was " << pending_map.active_gid << " " << pending_map.active_name << ")" << dendl; pending_map.active_gid = m->get_gid(); pending_map.active_name = m->get_name(); pending_map.active_change = ceph_clock_now() ``` The `25` version of `MgrMap`, when handled at `mgr.x`, doesn't trigger the `respawn()` path: ``` 2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr ms_dispatch2 active mgrmap(e 25) v1 2022-05-17T00:10:11.197+0200 7fa3d1e0a700 4 mgr handle_mgr_map received map epoch 25 2022-05-17T00:10:11.197+0200 7fa3d1e0a700 4 mgr handle_mgr_map active in map: 1 active is 4210 2022-05-17T00:10:11.197+0200 7fa3d6613700 10 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_read_frame_dispatch tag=17 2022-05-17T00:10:11.197+0200 7fa3d6613700 5 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_message got 43089 + 0 + 0 byte message. envelope type=1796 src mon.0 off 0 2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr handle_mgr_map I was already active ``` Fixes: https://tracker.ceph.com/issues/55711 Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>

tchaikov · 2022-08-09T16:12:01Z

the sudo yum -y install ceph-radosgw failures are tracked by https://tracker.ceph.com/issues/57051

tchaikov · 2022-08-12T14:01:57Z

@rzarzynski hi Radek, did you manage to fix the test failure reported at #46318 (comment) ?

if yes, i think this change is good to merge.

rzarzynski · 2022-08-12T14:16:03Z

Hi Kefu! Yup, I added the is_writeable to the condition. Thanks for your testing! :-)

batrick · 2023-03-08T20:01:34Z

Hello @rzarzynski, I think I have a more robust solution to this in #50404. Please see commit mgr: force propose whenever the active changes in that series.

cc @tchaikov @gregsfortytwo

github-actions bot added core mon labels May 18, 2022

rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 3620da6 to 2abdfaf Compare May 19, 2022 13:54

rzarzynski mentioned this pull request May 19, 2022

mon,auth,cephadm: support auth key rotation #43655

Merged

5 tasks

jdurgin requested a review from kamoltat May 19, 2022 17:52

rzarzynski requested review from athanatos, gregsfortytwo and sseshasa May 19, 2022 18:19

djgalloway changed the base branch from master to main May 25, 2022 19:57

rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 2abdfaf to 15d3dbc Compare May 27, 2022 11:22

neha-ojha approved these changes Jun 15, 2022

View reviewed changes

neha-ojha added the needs-qa label Jun 15, 2022

athanatos removed their request for review June 21, 2022 22:39

yuriw added the wip-yuri2-testing label Jun 24, 2022

gregsfortytwo approved these changes Jun 28, 2022

View reviewed changes

yuriw removed the wip-yuri2-testing label Jun 29, 2022

neha-ojha removed the needs-qa label Jul 21, 2022

rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 15d3dbc to 23c3f76 Compare July 22, 2022 15:20

rzarzynski requested a review from a team as a code owner July 22, 2022 15:20

rzarzynski added the needs-qa label Jul 27, 2022

tchaikov added the wip-kefu-testing label Aug 7, 2022

tchaikov assigned rzarzynski Aug 12, 2022

tchaikov merged commit 6c94665 into ceph:main Aug 12, 2022

rzarzynski mentioned this pull request Mar 6, 2023

quincy: mon: fix a race between mgr fail and MgrMonitor::prepare_beacon() #50408

Closed

idryomov mentioned this pull request Apr 14, 2023

quincy: MgrMonitor: batch commit OSDMap and MgrMap mutations #50979

Merged

batrick mentioned this pull request May 10, 2023

pacific: MgrMonitor: batch commit OSDMap and MgrMap mutations #50980

Merged

Conversation

rzarzynski commented May 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Checklist

Uh oh!

gregsfortytwo commented May 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzarzynski commented May 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gregsfortytwo commented May 26, 2022

Uh oh!

rzarzynski commented May 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gregsfortytwo commented May 26, 2022

Uh oh!

rzarzynski commented May 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzarzynski commented Jun 7, 2022

Uh oh!

gregsfortytwo left a comment

Choose a reason for hiding this comment

Uh oh!

sseshasa commented Jun 29, 2022

Uh oh!

tchaikov commented Aug 9, 2022

Uh oh!

tchaikov commented Aug 12, 2022

Uh oh!

rzarzynski commented Aug 12, 2022

Uh oh!

batrick commented Mar 8, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

rzarzynski commented May 18, 2022 •

edited

Loading

gregsfortytwo commented May 26, 2022 •

edited

Loading

rzarzynski commented May 26, 2022 •

edited

Loading

rzarzynski commented May 26, 2022 •

edited

Loading

rzarzynski commented May 27, 2022 •

edited

Loading