Skip to content

mon: fix a race between mgr fail and MgrMonitor::prepare_beacon()#46318

Merged
tchaikov merged 1 commit intoceph:mainfrom
rzarzynski:wip-mon-fix-mgrmap-rc
Aug 12, 2022
Merged

mon: fix a race between mgr fail and MgrMonitor::prepare_beacon()#46318
tchaikov merged 1 commit intoceph:mainfrom
rzarzynski:wip-mon-fix-mgrmap-rc

Conversation

@rzarzynski
Copy link
Contributor

@rzarzynski rzarzynski commented May 18, 2022

There is a race condition between the mgr fail handling and mgrbeacon.

diff --git a/src/mon/MgrMonitor.cc b/src/mon/MgrMonitor.cc
index 8ada44e2628..9000b2e0687 100644
--- a/src/mon/MgrMonitor.cc
+++ b/src/mon/MgrMonitor.cc
@@ -1203,7 +1203,9 @@ bool MgrMonitor::prepare_command(MonOpRequestRef op)
     }

     if (changed && pending_map.active_gid == 0) {
+      dout(5) << "========== changed and active_state == 0" << dendl;
       promote_standby();
+      dout(5) << "========== after promote_standby: " << pending_map.active_gid << dendl;
     }
   } else if (prefix == "mgr module enable") {
     string module;
2022-05-17T00:11:19.602+0200 7f6bd5769700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "mgr fail", "who": "x"} v 0) v1
...
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== changed and active_state == 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 prepare_command done, r=0
...
2022-05-17T00:11:19.630+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 )
bool MgrMonitor::prepare_beacon(MonOpRequestRef op)
  if (pending_map.active_gid == m->get_gid()) {
    // ...
  } else if (pending_map.active_gid == 0) {
    // There is no currently active daemon, select this one.
    if (pending_map.standbys.count(m->get_gid())) {
      drop_standby(m->get_gid(), false);
    }
    dout(4) << "selecting new active " << m->get_gid()
            << " " << m->get_name()
            << " (was " << pending_map.active_gid << " "
            << pending_map.active_name << ")" << dendl;
    pending_map.active_gid = m->get_gid();
    pending_map.active_name = m->get_name();
    pending_map.active_change = ceph_clock_now()

The 25th version of MgrMap, when handled at mgr.x, doesn't trigger the respawn() path (while e.g. cephadm assumes it does):

2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr ms_dispatch2 active mgrmap(e 25) v1
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map received map epoch 25
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map active in map: 1 active is 4210
2022-05-17T00:10:11.197+0200 7fa3d6613700 10 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_read_frame_dispatch tag=17
2022-05-17T00:10:11.197+0200 7fa3d6613700  5 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_message got 43089 + 0 + 0 byte message. envelope type=1796 src mon.0 off 0
2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr handle_mgr_map I was already active

Signed-off-by: Radosław Zarzyński rzarzyns@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rzarzynski rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 3620da6 to 2abdfaf Compare May 19, 2022 13:54
@jdurgin jdurgin requested a review from kamoltat May 19, 2022 17:52
@djgalloway djgalloway changed the base branch from master to main May 25, 2022 19:57
@gregsfortytwo
Copy link
Member

gregsfortytwo commented May 26, 2022

I'm not understanding what race condition you're saying is here, nor how this fixes anything. propose_pending() will do an immediate propose, but without that the command still waits for a commit before replying to the client which invoked "mgr fail". And this PR doesn't change anything at all about the locking sequence you've displayed in the logs.

I do note that if there are no standbys available in pending_map.standbys, promote_standby() won't make a change and just returns false. I'd guess you're seeing that?

@rzarzynski
Copy link
Contributor Author

rzarzynski commented May 26, 2022

Hello Greg! Thanks for taking a look!

I think the problem is not about synchronization between the pair:

  1. mgr fail issuer and
  2. monitor

but it involves the third entity too: a manager which takes a different path than the MgrStandby::respawn() (full restart) supposed by the client issuing mgr fail.

Manager misses it because active_gid = 0 imposed by mgr fail gets overridden by MgrMonitor::prepare_beacon() without bumping the epoch up. This is visible here:

2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0
...
2022-05-17T00:11:19.630+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 )

IIUC propose_pending() flushes the pending_map, and thus allows the manager to observe active_gid transitions through 0.

@gregsfortytwo
Copy link
Member

Ah, that makes more sense. I see a few different solutions, though I don't know what the balance of issues is for the mgr dependencies:

  1. We can make prepare_beacon() notice this case and not set a mgr to active if it's been marked failed (eg, by comparing the current to pending_map and seeing that it got removed).
  2. We can do this, but as written it's bad because forcing immediate paxos commits is something we try to avoid until it's necessary (batching is good! it's very good on an iop-limited centralized system!)
  3. we can decide that failing live mgrs is something you can't do.

Should probably discuss that with people who know more about the mgr. I'm guessing they do want the ability to fail the mgr though. So perhaps adjust this PR so that prepare_command() only proposes when handling a fail command? There's a PaxosService::force_immediate_propose() function which will do a propose after returning out of prepare_command() you can invoke, instead of invoking propose_pending() every time inside of it.

@rzarzynski
Copy link
Contributor Author

rzarzynski commented May 26, 2022

My impression was the mgr-related admin commands are supposed to be executed infrequently, usually in a reaction to an operator's request (even when it happens indirectly – via the orchestrator). The other commands we would flush the pending map for are about turning on/off manager's modules. I believe they are rarely happening too while (I don't have any proof, just gut feelings) are vulnerable to similar races as mgr fail.

When investigating possible approaches I was considering a pin-point fix around prepare_beacon(). However, the complexity there leaned me towards a really dumb solution: flush pending_map after every mgr-related command. I agree we have a trade-off here: performance vs simplicity and I hope we can favor simplicity.

A way to restart mgrs is already neeeded by cephadm / orchestrator modules. The Sage's auth rotation is another user, so I'm afraid we won't be able to squeeze out this feature :-(.

@gregsfortytwo
Copy link
Member

I just don't want to force commits by default on commands. It may not cause a problem immediately, but defaulting to a forced paxos commit is a bad smell in monitor code and can lead to hard-to-detect scaling issues if we expand these commands in the future. New mgr commands are most likely to be involved with upgrades or recovery situations where we're likely already loading the monitors down and needing to invoke the commands frequently!

@rzarzynski
Copy link
Contributor Author

rzarzynski commented May 27, 2022

Ah, I see: future-proofness. You're worried it would be too easy to miss this flush when adding a new, possibly not so innocent and infrequent command. Makes sense. I will limit the flush to mgr fail and push a new revision. Thanks for pointing this out!

@rzarzynski rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 2abdfaf to 15d3dbc Compare May 27, 2022 11:22
@rzarzynski
Copy link
Contributor Author

@gregsfortytwo: would you mind taking another look?

@athanatos athanatos removed their request for review June 21, 2022 22:39
Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough for now. In the other monitor implementations we have a "should_propose" boolean that gets set in each prefix when appropriate, but that's a simple and obvious enough change if it comes up in future.

@sseshasa
Copy link
Contributor

Teuthology Test Result
https://pulpito.ceph.com/yuriw-2022-06-27_15:15:17-rados-wip-yuri2-testing-2022-06-24-1331-distro-default-smithi

Related Failure
@rzarzynski The following failed job appears to be related to this PR. Can you please take a look?
JobID: https://pulpito.ceph.com/yuriw-2022-06-27_15:15:17-rados-wip-yuri2-testing-2022-06-24-1331-distro-default-smithi/6900866
Failure Reason: "2022-06-27T16:40:14.344915+0000 mon.b (mon.1) 89 : cluster [WRN] Health check failed: 1/3 mons down, quorum b,c (MON_DOWN)" in cluster log
Assertion failure on mon.a: /src/mon/PaxosService.cc: 193: FAILED ceph_assert(have_pending)

There is a race condition between the `mgr fail` handling  and `mgrbeacon`.

```diff
diff --git a/src/mon/MgrMonitor.cc b/src/mon/MgrMonitor.cc
index 8ada44e..9000b2e0687 100644
--- a/src/mon/MgrMonitor.cc
+++ b/src/mon/MgrMonitor.cc
@@ -1203,7 +1203,9 @@ bool MgrMonitor::prepare_command(MonOpRequestRef op)
     }

     if (changed && pending_map.active_gid == 0) {
+      dout(5) << "========== changed and active_state == 0" << dendl;
       promote_standby();
+      dout(5) << "========== after promote_standby: " << pending_map.active_gid << dendl;
     }
   } else if (prefix == "mgr module enable") {
     string module;
```

```
2022-05-17T00:11:19.602+0200 7f6bd5769700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "mgr fail", "who": "x"} v 0) v1
...
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== changed and active_state == 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  5 mon.a@0(leader).mgr e25 ========== after promote_standby: 0
2022-05-17T00:11:19.614+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 prepare_command done, r=0
...
2022-05-17T00:11:19.630+0200 7f6bd5769700  4 mon.a@0(leader).mgr e25 selecting new active 4210 x (was 0 )
```

```cpp
bool MgrMonitor::prepare_beacon(MonOpRequestRef op)
  if (pending_map.active_gid == m->get_gid()) {
    // ...
  } else if (pending_map.active_gid == 0) {
    // There is no currently active daemon, select this one.
    if (pending_map.standbys.count(m->get_gid())) {
      drop_standby(m->get_gid(), false);
    }
    dout(4) << "selecting new active " << m->get_gid()
            << " " << m->get_name()
            << " (was " << pending_map.active_gid << " "
            << pending_map.active_name << ")" << dendl;
    pending_map.active_gid = m->get_gid();
    pending_map.active_name = m->get_name();
    pending_map.active_change = ceph_clock_now()
```

The `25` version of `MgrMap`, when handled at `mgr.x`, doesn't trigger the `respawn()` path:

```
2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr ms_dispatch2 active mgrmap(e 25) v1
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map received map epoch 25
2022-05-17T00:10:11.197+0200 7fa3d1e0a700  4 mgr handle_mgr_map active in map: 1 active is 4210
2022-05-17T00:10:11.197+0200 7fa3d6613700 10 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_read_frame_dispatch tag=17
2022-05-17T00:10:11.197+0200 7fa3d6613700  5 --2- 127.0.0.1:0/743576734 >> [v2:127.0.0.1:40929/0,v1:127.0.0.1:40930/0] conn(0x5592635ef400 0x5592635f6580 secure :-1 s=THROTTLE_DONE pgs=130 cs=0 l=1 rev1=1 crypto rx=0x55926362e810 tx=0x559263563b60 comp rx=0 tx=0).handle_message got 43089 + 0 + 0 byte message. envelope type=1796 src mon.0 off 0
2022-05-17T00:10:11.197+0200 7fa3d1e0a700 10 mgr handle_mgr_map I was already active
```

Fixes: https://tracker.ceph.com/issues/55711
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-mon-fix-mgrmap-rc branch from 15d3dbc to 23c3f76 Compare July 22, 2022 15:20
@rzarzynski rzarzynski requested a review from a team as a code owner July 22, 2022 15:20
@tchaikov
Copy link
Contributor

@rzarzynski hi Radek, did you manage to fix the test failure reported at #46318 (comment) ?

if yes, i think this change is good to merge.

@rzarzynski
Copy link
Contributor Author

Hi Kefu! Yup, I added the is_writeable to the condition. Thanks for your testing! :-)

@batrick
Copy link
Member

batrick commented Mar 8, 2023

Hello @rzarzynski, I think I have a more robust solution to this in #50404. Please see commit mgr: force propose whenever the active changes in that series.

cc @tchaikov @gregsfortytwo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants