mgr: refresh registered client list to monitor on mgr shutdown/respawn#33272

Closed
vshankar wants to merge 5 commits into ceph:master from vshankar:wip-43943

Conversation

@vshankar
Contributor

No description provided.

self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
log.debug("Connection to cephfs '{0}' complete".format(self.fs_name))
self.mgr._ceph_register_client(self.fs.get_addrs())
r = self.mgr._ceph_register_client(self.fs.get_addrs())
Contributor Author


@batrick commit 969b660 adds a guard in _ceph_register_client() to deny registering a client while the manager is waiting for an ack from the mon for the last sent beacon. Between handling this failure and calling disconnect, the ack could arrive and the manager would either shut down (exit(0)) or respawn (active -> standby). For the shutdown case it's not a problem, since sending the last beacon (and waiting for an ack) is executed from signal-handler context; but when respawning, manager modules could still be executing while the manager waits for the last beacon ack.
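The guard described above can be sketched as follows. This is a minimal, hypothetical Python illustration (the real guard lives in the C++ side of ceph-mgr; `ClientRegistry`, `begin_final_beacon`, and `register` are invented names): once the final beacon is being sent, new registrations are refused so the last beacon's client list stays authoritative.

```python
import threading

class ClientRegistry:
    """Hypothetical sketch: once the manager starts waiting for its final
    beacon ack, new client registrations are refused; the caller must
    disconnect the freshly created handle instead."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clients = []
        self._finalizing = False  # set when the last beacon is sent

    def begin_final_beacon(self):
        """Freeze the registry and return the addrs for the last beacon."""
        with self._lock:
            self._finalizing = True
            return list(self._clients)

    def register(self, addrs):
        """Mirrors the guarded _ceph_register_client(): returns False
        (registration denied) while the final beacon is in flight."""
        with self._lock:
            if self._finalizing:
                return False
            self._clients.append(addrs)
            return True
```

The race in the comment above is exactly the window between `register()` returning False and the caller disconnecting: the beacon ack may arrive in that window.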

Contributor Author


There is an issue even when shutting down from a signal handler -- a module could instantiate a cephfs connection (init, mount) and, before it registers the handle's addrs, a termination signal could arrive; the final beacon would then never include the just-instantiated client addrs, resulting in the same problem. As a workaround/fix, the whole fs init/mount and register sequence should happen with signals blocked.

For the respawn case (comment above) -- mgr could send a SIGTERM to itself and we could execv() from the signal handler (execv is an async-signal-safe syscall).
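The "block signals around init/mount/register" idea can be sketched in Python with `signal.pthread_sigmask`. This is illustrative only (the actual mgr module plumbing differs); the context manager simply defers SIGTERM/SIGINT delivery until the critical section is done, so a connection cannot be created after the final client list has already been sent.

```python
import signal
from contextlib import contextmanager

@contextmanager
def signals_blocked(sigs=(signal.SIGTERM, signal.SIGINT)):
    """Block delivery of termination signals for the duration of the
    critical section; pending signals are delivered once the original
    mask is restored."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, sigs)
    try:
        yield
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

# Hypothetical usage inside a mgr module (fs/mgr names are stand-ins):
#
# with signals_blocked():
#     fs.init()
#     fs.mount(filesystem_name=fs_name.encode('utf-8'))
#     mgr._ceph_register_client(fs.get_addrs())
```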

Member


I don't think closing this race is possible without looking at this a different way. I see in 4bfa4fc that you removed a lot of the tear down code in MgrStandby::shutdown. This was de facto dead code due to 3363a10, by @liewegas?

Using the shutdown hook is the most attractive way to resolve this race because modules should stop all operations so the Mgr can then fire off the final beacon including all of the latest clients. Unfortunately I think that bridge has burned down with all of the havoc it caused with the dashboard. In particular, tearing down the Python interpreter "correctly" is just not worth the effort.

So, I would suggest we add another hook named stop which generically just says: "stop the module and tear down all your state". Or alternatively, we could call all the shutdown hooks but not do any other cleanup. What do you think @liewegas @tchaikov @LenzGr
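The proposed `stop` hook contract might look like the following sketch. All names here are hypothetical (this is not the actual MgrModule API): `stop` only asks the module to halt operations and release its state, without the full interpreter teardown that `shutdown` implied.

```python
class MgrModuleSketch:
    """Hypothetical base class showing the proposed `stop` contract:
    'stop the module and tear down all your state', nothing more."""

    def __init__(self):
        self.stopping = False

    def stop(self):
        self.stopping = True


class VolumesModuleSketch(MgrModuleSketch):
    """Illustrative subclass: a module with helper-thread work queues
    (e.g. purge/cloner) drains them when stopped."""

    def __init__(self):
        super().__init__()
        self.worker_queues = ["purge", "cloner"]

    def stop(self):
        super().stop()
        # Helper threads observe self.stopping and exit; queued work
        # is dropped rather than processed.
        self.worker_queues.clear()
```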

Member

@batrick batrick left a comment


Thought I posted this review but found it pending. Sorry for the delay...


@vshankar
Contributor Author

vshankar commented Mar 2, 2020

@batrick -- apologies for delay -- I'll get to this in a day or two...

@gregsfortytwo
Member

@vshankar looks like this still needs some work?

@vshankar
Contributor Author

@vshankar looks like this still needs some work?

right -- testing now -- will push when done.

@vshankar vshankar changed the title [WIP] mgr: refresh registered client list to monitor on mgr shutdown/respawn mgr: refresh registered client list to monitor on mgr shutdown/respawn Mar 30, 2020
@vshankar
Contributor Author

vshankar commented Apr 6, 2020

@gregsfortytwo @tchaikov please take a look

@vshankar
Contributor Author

vshankar commented Apr 8, 2020

A minor update addressing Patrick's comment to reorganize the header file.

vshankar added 5 commits May 5, 2020 02:02
The `stop` service request will be invoked by the manager during termination
or respawn. It is currently designed to stop dispatching commands
to modules, returning the -ESHUTDOWN error code instead.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Clean up helper threads (purge and async cloner) when the stop service
request is invoked.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Beacon messages from the manager to the monitor carry a sequence
number which the monitor piggybacks in the beacon reply
messages. This machinery is required to send and wait
on a sequence number when the manager restarts or respawns and
has to transmit an updated client list to the monitor for
blacklisting.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
.. and wait for beacon ack from monitor. The last beacon message
sends the updated client list to monitor for potential blacklisting.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
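The beacon sequence-number machinery described in the commits above can be sketched as follows. This is an illustrative Python model (the real implementation is C++ in ceph-mgr; `BeaconSeq` and its methods are invented names): the sender stamps each beacon, the monitor echoes the number back, and shutdown/respawn waits until the last stamped beacon is acked.

```python
import threading

class BeaconSeq:
    """Sketch of the seq-number handshake: send_beacon() stamps a beacon,
    handle_ack() records the seq echoed back by the monitor, and
    wait_for_ack() blocks until a given beacon has been acknowledged."""

    def __init__(self):
        self._cond = threading.Condition()
        self._last_sent = 0
        self._last_acked = 0

    def send_beacon(self):
        with self._cond:
            self._last_sent += 1
            return self._last_sent  # seq carried in the beacon message

    def handle_ack(self, seq):
        with self._cond:
            self._last_acked = max(self._last_acked, seq)
            self._cond.notify_all()

    def wait_for_ack(self, seq, timeout=None):
        """Returns True once seq is acked, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._last_acked >= seq,
                                       timeout=timeout)
```

On shutdown/respawn the manager would send the final beacon (carrying the updated client list) and call `wait_for_ack()` on its sequence number before proceeding.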
@vshankar
Contributor Author

vshankar commented May 5, 2020

@batrick please take a look

EDIT: update is just a rebase

@batrick batrick self-requested a review May 5, 2020 21:49
@vshankar
Contributor Author

Need to resolve the test failures.

taking a look

@batrick
Member

batrick commented May 22, 2020

this pr went through the cephfs suites without issue

@vshankar
Contributor Author

vshankar commented May 25, 2020

this pr went through the cephfs suites without issue

The issue is seen when running the rados:mgr suite, which loads the selftest mgr plugin, leading to a mgr restart. ceph-mgr does not restart as expected due to the following sequence (seen on a vstart cluster via vstart_runner):

  • vstart_runner kills ceph mds
  • vstart_runner triggers mgr shutdown
  • mgr/volumes threads (purge/cloner) try to fetch entries for processing (requires instantiating cephfs filehandle)
  • Client::mount() tries to initiate connection with MDS
2020-05-25T09:28:33.560-0400 7f6d3a4f4700  0 [volumes DEBUG mgr_util] CephFS mounting...
2020-05-25T09:28:33.560-0400 7f6d334e6700 10 client.0 ms_handle_connect on v2:192.168.0.5:40653/0
2020-05-25T09:28:33.564-0400 7f6d3a4f4700 10 client.464142 fetch_fsmap learned FSMap version 247
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 fetch_fsmap finished waiting for FSMap version 247
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 Subscribing to map 'mdsmap.1'
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 20 client.464142 trim_cache size 0 max 16384
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 20 client.464142 populate_metadata read hostname 'cephalopod-octo'
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 did not get mds through better means, so chose random mds -1
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 20 client.464142 mds is -1
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142  target mds.-1 not active, waiting for new mdsmap
2020-05-25T09:28:33.568-0400 7f6d334e6700  1 client.464142 handle_mds_map epoch 247
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 did not get mds through better means, so chose random mds 0
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 20 client.464142 mds is 0
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 _open_mds_session mds.0
2020-05-25T09:28:33.568-0400 7f6d3a4f4700 10 client.464142 waiting for session to mds.0 to open

  • At a later point, mount() times out (note the "make_request=" debug log). For this test, client_mount_timeout was set to a lower value (~5 seconds); the default for this config is 300 seconds.
2020-05-25T09:28:37.568-0400 7f6d34ce9700 20 client.464142 trim_cache size 0 max 16384
2020-05-25T09:28:38.568-0400 7f6d34ce9700 20 client.464142 trim_cache size 0 max 16384
2020-05-25T09:28:38.568-0400 7f6d3a4f4700 10 client.464142 make_request=-110
2020-05-25T09:28:38.568-0400 7f6d3a4f4700  1 client.464142 shutdown
2020-05-25T09:28:38.568-0400 7f6d3a4f4700  2 client.464142 _close_mds_session mds.0 seq 0
2020-05-25T09:28:38.568-0400 7f6d3a4f4700  2 client.464142 waiting for 1 mds sessions to close
  • Client::shutdown initiates an MDS session close which probably does not have a timeout (unless there is a msgr timeout that kicks in, but the test starts failing at this point).
2020-05-25T09:28:38.568-0400 7f6d3a4f4700  2 client.464142 _close_mds_session mds.0 seq 0
2020-05-25T09:28:38.568-0400 7f6d3a4f4700  2 client.464142 waiting for 1 mds sessions to close
  • The mgr soft shutdown kicks in, but it is blocked on a lock that is held while mounting.
2020-05-25T09:28:38.644-0400 7f6d5cfea700  0 [volumes INFO volumes.fs.volume] shutting down
2020-05-25T09:28:38.644-0400 7f6d5cfea700  0 [volumes INFO mgr_util] shutting down

The shutdown part in Client makes me recall this tracker: https://tracker.ceph.com/issues/44276

Need to see if Client::shutdown can be effectively handled. Will post when I get something.

EDIT: Note that Client::shutdown() gets called after Client::mount() times out (i.e., the call is not initiated by mgr/volumes).

@batrick
Member

batrick commented May 26, 2020

Need to see if Client::shutdown can be effectively handled. Will post when I get something.

Is it as simple as just adding timeouts for the Client shutdown?

@vshankar
Contributor Author

Is it as simple as just adding timeouts for the Client shutdown?

That's probably the safest way to go about this, or alternatively some messenger config option that times out when no reply is received.
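A deadline-based session close can be sketched like this. It is a hypothetical model of the idea, not Client.cc code (`sessions` and `cond` are illustrative stand-ins for the Client's session map and its wait condition): instead of waiting indefinitely for every MDS session to close, give up after a timeout so a stuck MDS cannot wedge mgr shutdown.

```python
import threading

def close_sessions_with_timeout(sessions, cond, timeout):
    """Wait until all MDS sessions have closed, but only up to `timeout`
    seconds. Returns True if all sessions closed in time, False if the
    deadline expired (caller proceeds with forced teardown)."""
    with cond:
        closed_in_time = cond.wait_for(lambda: len(sessions) == 0,
                                       timeout=timeout)
    return closed_in_time
```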

@sebastian-philipp
Contributor

(just note that the MGR modules won't be cleaned up when the MGR is shutting down. They'll be killed instead)

@vshankar
Contributor Author

vshankar commented Jun 5, 2020

(just note that the MGR modules won't be cleaned up when the MGR is shutting down. They'll be killed instead)

right -- this pr implements "soft shutdown" for plugins to perform minimal cleanup. Right now, even soft shutdown is pretty delicate. I'll post an update w/ the fixes.

@vshankar
Contributor Author

vshankar commented Jun 8, 2020

A couple of issues with the current pr:

  • "soft shutdown" (calling ->stop() for each plugin) was leading to each module logging "failed to load " in the cluster log. Fixing this is straightforward and involves not setting a couple of variables that hint to the manager that the module has failed (note that the shutdown call sets these variables; soft shutdown, however, need not set them).

  • mgr respawns when a module is enabled -- the respawn happens while handling the mgrmap (dispatcher). Before respawning we need to send the updated client list to the monitor (for blacklisting). The client list is sent via a beacon to the monitor. However, the mgr waits for the beacon reply before it can respawn, but the beacon reply message sits in the dispatch queue since the earlier message (mgrmap) is still being processed. We could have the beacon handled by a separate dispatcher, but I guess that would not fix this either since there is a single dispatcher thread. We could respawn asynchronously (via a finisher), but that might just open up races elsewhere.

@batrick @badone -- any thoughts?
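The "respawn asynchronously via a finisher" option amounts to handing the respawn off to a separate completion thread so the dispatcher is free to process the beacon reply. A minimal Python model of a finisher queue (illustrative only; Ceph's Finisher is a C++ class):

```python
import queue
import threading

class Finisher:
    """Toy finisher: a single worker thread that runs queued callbacks,
    so work deferred from the dispatch thread (e.g. a respawn) does not
    block message processing."""

    def __init__(self):
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            fn = self._q.get()
            if fn is None:  # sentinel: stop the worker
                break
            fn()

    def queue(self, fn):
        self._q.put(fn)

    def stop(self):
        self._q.put(None)
        self._thread.join()
```

In this model the mgrmap handler would `queue()` the respawn and return, letting the dispatcher deliver the beacon ack; as the comment above notes, this may simply move the races elsewhere rather than eliminate them.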

@batrick
Member

batrick commented Jun 8, 2020

* mgr respawns when a module is enabled -- the respawn happens while handling the mgrmap (dispatcher). Before respawning we need to send the updated client list to the monitor (for blacklisting). The client list is sent via a beacon to the monitor. However, the mgr waits for the beacon reply before it can respawn, but the beacon reply message sits in the dispatch queue since the earlier message (mgrmap) is still being processed. We could have the beacon handled by a separate dispatcher, but I guess that would not fix this either since there is a single dispatcher thread. We could respawn asynchronously (via a finisher), but that might just open up races elsewhere.

Ya, I don't think this is a situation we can resolve in a satisfying way. We have a similar but unresolvable issue with the mgr getting removed via ceph mgr fail. The mgr cannot send a beacon the mon will read because it's been kicked out. I think it's acceptable to just fall back to normal timeout-based eviction for these scenarios.

However, that means we'll still see these warnings in QA. Maybe it's time to throw in the towel and whitelist the warning.

@vshankar
Contributor Author

vshankar commented Jun 9, 2020

Right -- I'll close this pr for now and whitelist the warning. There are some (general) fixes in mgr/volumes though. I'll push a separate pr for that.

@vshankar vshankar closed this Jun 9, 2020
@badone
Contributor

badone commented Jun 9, 2020

Maybe we could open these issues up to a broader discussion so we have maximum chance of getting helpful feedback? Could one of you send an email to the dev list outlining the issues and the problems with currently known solutions? At the least this will help to make the issue more broadly known and that, in itself, may lead to a more satisfying resolution sooner.

@vshankar
Contributor Author

vshankar commented Jun 9, 2020

Maybe we could open these issues up to a broader discussion so we have maximum chance of getting helpful feedback? Could one of you send an email to the dev list outlining the issues and the problems with currently known solutions? At the least this will help to make the issue more broadly known and that, in itself, may lead to a more satisfying resolution sooner.

ACK -- I'll initiate a discussion in devel list.

vshankar added a commit to vshankar/ceph that referenced this pull request Jun 11, 2020
…mgr client getting backlisted""

This reverts commit a7994a0.

Failed attempt at solving the issue is in PR ceph#33272. Until we
find a clean solution for this, whitelisting the warning is
probably the best thing for now.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Jul 11, 2020
…mgr client getting backlisted""

This reverts commit a7994a0.

Failed attempt at solving the issue is in PR ceph#33272. Until we
find a clean solution for this, whitelisting the warning is
probably the best thing for now.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 60b8f7a)
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Jul 18, 2020
…mgr client getting backlisted""

This reverts commit a7994a0.

Failed attempt at solving the issue is in PR ceph#33272. Until we
find a clean solution for this, whitelisting the warning is
probably the best thing for now.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 60b8f7a)
ideepika pushed a commit to ideepika/ceph that referenced this pull request Aug 7, 2020
…mgr client getting backlisted""

This reverts commit a7994a0.

Failed attempt at solving the issue is in PR ceph#33272. Until we
find a clean solution for this, whitelisting the warning is
probably the best thing for now.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
ideepika pushed a commit to ideepika/ceph that referenced this pull request Sep 21, 2020
…mgr client getting backlisted""

This reverts commit a7994a0.

Failed attempt at solving the issue is in PR ceph#33272. Until we
find a clean solution for this, whitelisting the warning is
probably the best thing for now.

Fixes: http://tracker.ceph.com/issues/43943
Signed-off-by: Venky Shankar <vshankar@redhat.com>

Labels

bug-fix cephfs Ceph File System mgr

6 participants