pybind/mgr: reopen database handle on blocklist#50291
Conversation
1ec5311 to
b563a6d
Compare
|
jenkins test api |
|
Waiting for #50065 dependency to be merged. |
There was a problem hiding this comment.
We're trying to solve a similar problem in the rbd_support module where the RADOS client used by the module gets blocklisted. And we want the module to recover without manual intervention, i.e., without requiring the admin to execute ceph mgr fail.
Initially, I tried to implement something similar to what you've done here. When blocklist exception is raised by the module's client, shutdown the the blocklisted client connection, unregister the client, initiate a new connection and register the new client. Then I realized I may not be handling the scenario where mgr fail is executed. When mgr fail is executed, initially the MgrMonitor blocklists the registered clients of the failed mgr, then the failed mgr respawns and the standby mgr becomes active based on the MgrMap updates. Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirtably? The recovered client may not get blocklisted in case the mgr respawns before getting registered. Would this PR face the same problem?
Since we expect the rbd_support module's client to get blocklisted in rare cases, we're thinking of having the rbd_support module indicate to the monitors/mgr that mgr needs to respawn. The mgr respawn already happens now when the list of mgr modules changes. So you can think of this solution as rbd_support module being disabled and enabled when it's RADOS client gets blocklisted. @batrick what do you think?
src/libcephsqlite.cc
Outdated
| _disconnect(); | ||
| return _connect(); | ||
| } | ||
| int unblocklist(rsptr _cluster) { |
There was a problem hiding this comment.
maybe rename this as reconnect() ? I find this more intuitive than unblocklist. I expected unblocklist to remove the client from the list of blocklisted addresses. The reconnect() member function defined above is never used elsewhere.
There was a problem hiding this comment.
Sure, I'll make that change in a new spin of the commits.
There was a problem hiding this comment.
Did you missed incorporating the suggested change https://github.com/ceph/ceph/pull/50291/files#r1120824761?
There was a problem hiding this comment.
Yes, I'm going with maybe_reconnect as reconnect implies unconditional reconnect.
Yes, that's a potential problem. I think the only way to address that is to have the register_client method block until it sees a new MgrMap with the new addrvec. I think disabling/enabling the module is too heavy-weight. It should be possible to recover without any race conditions if we make register_client block on the new MgrMap. |
Whose "new addrvec" are you referring to? I don't quite follow how this would fix the issue. Can you please explain? Maybe introducing a time delay before a mgr_module's or DB's rados client tries to reconnect might help? The time delay should be longer than the time taken for the failed mgr to respawn.
|
Once creating a RADOS client, the next thing the module/mgr should do is register it in the MgrMap so it can be blocklisted, yes? So block completion of register_client until the mgr receives an updated MgrMap that includes the new registered address. That should resolve any races where the module tries to manipulate data.
No, that's racy tool. Just wait for an updated MgrMap. We do this all over Ceph, including the MDS. e.g. look at |
When In addition to your suggestion that we wait for the client registration to be reflected in the MgrMap, I think we need to blocklist the registered clients when a mgr starts up (Mgr::init) instead of when MgrMonitor is trying to drop the active mgr ( MgrMonitor::drop_active).
|
It can reconnect but it will block on the new MgrMap which includes its addrvec. It's not a problem that the mgr creates a new "untracked" RADOS client if that client has not done anything yet.
I don't see why. |
Can the following happen (see below), where the MgrMap changes to register the reconnected client are proposed before the MgrMap changes to promote standby as active and drop standby? |
|
The "race" you've identified between 6 and 7 has always existed. Your timeline omits that the MgrMonitor in 6 also blocklists the "new" RADOS client created in 2. So the OSDs should stop talking to the old mgr in ~7 (of course, this is a distributed system so not all entities receive maps simultaneously). The fact that the new active can start doing things before the old instance is completely blocklisted by all OSDs is not something we can feasibly avoid really. We have strategies for scoping the issue:
|
This is the race I thought I could happen: where in
Got it. Thanks for the explanation below.
|
It won't be able to register the new RADOS client as the mgr no longer exists in the (pending or not) MgrMap. So the old mgr should never complete the "register_client" call (with the proposed change to have it block on the new MgrMap). Something that should be done but not, related to your concern, is to "plug" PAXOS whenever we are manipulating the OSDMap (blocklisting) and the MgrMap (failover). You may write up a ticket and trivial fix for that. I don't think it's a significant issue but it would be appropriate. |
|
| clients.emplace(std::string(name), std::move(addrs)); | ||
| auto n = std::string(name); | ||
| if (replace) { | ||
| clients.erase(n); |
There was a problem hiding this comment.
Since multimap clients can have identical keys (client name), I don't think this interface is safe until we make the change suggested here #51169 (comment) . So we drop the "replace" functionality for now?
There was a problem hiding this comment.
I think the intention after #51169 is merged is that register_client will always replace?
There was a problem hiding this comment.
I confess I don't like the "replace" interface but I don't see what about it is unsafe.
There was a problem hiding this comment.
The "replace" interface can also be removed when/if we switch from multimap to map. It's an internal API so it's not a big deal.
src/libcephsqlite.cc
Outdated
| _disconnect(); | ||
| return _connect(); | ||
| } | ||
| int unblocklist(rsptr _cluster) { |
There was a problem hiding this comment.
Did you missed incorporating the suggested change https://github.com/ceph/ceph/pull/50291/files#r1120824761?
| (addrv,) = cur.fetchone() | ||
| assert addrv is not None | ||
| self.log.debug(f"new libcephsqlite addrv = {addrv}") | ||
| self._ceph_register_client("libcephsqlite", addrv, True) |
There was a problem hiding this comment.
Since the replace functionality of _ceph_register_client() is not safe, just deregister and register the new address?
There was a problem hiding this comment.
The changes look good to me. My remaining question, #50291 (comment)
@mchangir since you worked on libcephsqlite, can you go through this PR as well?
| clients.emplace(std::string(name), std::move(addrs)); | ||
| auto n = std::string(name); | ||
| if (replace) { | ||
| clients.erase(n); |
ajarr
left a comment
There was a problem hiding this comment.
I see this in the make check failure ,
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/__init__.py", line 60, in <module>
from .module import Module, StandbyModule # noqa: F401
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/module.py", line 24, in <module>
from mgr_module import CLIReadCommand, CLIWriteCommand, HandleCommandResult, \
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/mgr_module.py", line 518, in <module>
def MgrModuleRecoverDB(func: abc.Callable) -> abc.Callable:
NameError: name 'abc' is not defined. Did you mean: 'abs'?
.. weird I didn't notice that before. I removed |
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Presently the libcephsqlite library becomes unusable (by design, at the time) when the RADOS instance it uses is blocklisted. Unfortunately, it's fairly common for this to happen during disruptive events. For the ceph-mgr, it's unacceptable to require manual intervention via the `ceph mgr fail` command. So, this commit reworks libcephsqlite to reconnect to RADOS when its cluster handle is blocklisted. This is safe as any open "file" by sqlite3 has its own ptr to the open cluster handle. So those open files will all continually fail until sqlite3 closes the handle (on database connection shutdown). The application can then reopen the database if desired using the fresh RADOS handle. Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Two new parameters: - "name" to allow setting the name associated with the addrv. Previously this was hard-coded to the mgr module name. However, we want mgr modules to be able to update the libcephsqlite cluster handle. - "replace" to replace all addrv for the given "name". libcephsqlite only ever has one handle, remove the old ones. Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Fixes: https://tracker.ceph.com/issues/55606 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
|
rebase to fix conflict |
|
This is ready to merge when RADOS QA is complete. cc @ljflores @rzarzynski |
|
From @kamoltat ref: https://trello.com/c/7y6uj4bo RADOS approved, all failures/dead jobs are unrelated and known failures. Failures: 7332480 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard Deads: 7332357 -> Bug #61164: Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds - Infrastructure - Ceph -> Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds |
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows