pybind/mgr: reopen database handle on blocklist #50291

Merged
yuriw merged 6 commits into ceph:main from batrick:i55606
Jul 14, 2023

Conversation


@batrick batrick commented Feb 27, 2023

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)


batrick commented Feb 28, 2023

jenkins test api


batrick commented Feb 28, 2023

Waiting for #50065 dependency to be merged.


@ajarr ajarr left a comment


We're trying to solve a similar problem in the rbd_support module where the RADOS client used by the module gets blocklisted. And we want the module to recover without manual intervention, i.e., without requiring the admin to execute ceph mgr fail.

Initially, I tried to implement something similar to what you've done here. When a blocklist exception is raised by the module's client, shut down the blocklisted client connection, unregister the client, initiate a new connection, and register the new client. Then I realized I may not be handling the scenario where mgr fail is executed. When mgr fail is executed, initially the MgrMonitor blocklists the registered clients of the failed mgr, then the failed mgr respawns and the standby mgr becomes active based on the MgrMap updates. Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirably? The recovered client may not get blocklisted if the mgr respawns before it gets registered. Would this PR face the same problem?

Since we expect the rbd_support module's client to get blocklisted in rare cases, we're thinking of having the rbd_support module indicate to the monitors/mgr that the mgr needs to respawn. The mgr respawn already happens now when the list of mgr modules changes. So you can think of this solution as the rbd_support module being disabled and enabled when its RADOS client gets blocklisted. @batrick what do you think?

_disconnect();
return _connect();
}
int unblocklist(rsptr _cluster) {
Contributor

Maybe rename this to reconnect()? I find that more intuitive than unblocklist; I expected unblocklist to remove the client from the list of blocklisted addresses. Also, the reconnect() member function defined above is never used elsewhere.

Member Author

Sure, I'll make that change in a new spin of the commits.

@ajarr ajarr Jun 6, 2023

Did you miss incorporating the suggested change https://github.com/ceph/ceph/pull/50291/files#r1120824761?

Member Author

Yes, I'm going with maybe_reconnect as reconnect implies unconditional reconnect.
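The naming distinction can be sketched like this (illustrative Python only; the PR's actual code is C++ inside libcephsqlite, and these names are placeholders): maybe_reconnect acts only when the handle is actually blocklisted, whereas a plain reconnect would be unconditional.

```python
class Handle:
    """Toy stand-in for a cluster handle; not the libcephsqlite API."""

    def __init__(self):
        self.blocklisted = False
        self.generation = 0  # counts how many times we (re)connected

    def _connect(self):
        self.blocklisted = False
        self.generation += 1

    def maybe_reconnect(self):
        # An unconditional reconnect() would always tear down and rebuild
        # the handle; this variant is a no-op while the handle is healthy.
        if self.blocklisted:
            self._connect()
```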


batrick commented Mar 1, 2023

Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirably? The recovered client may not get blocklisted if the mgr respawns before it gets registered. Would this PR face the same problem?

Yes, that's a potential problem. I think the only way to address that is to have the register_client method block until it sees a new MgrMap with the new addrvec.

I think disabling/enabling the module is too heavy-weight. It should be possible to recover without any race conditions if we make register_client block on the new MgrMap.
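The proposal can be sketched as follows (a hedged illustration: MapWatcher, notify_new_mgrmap, and the rest are hypothetical names, not the mgr's actual interface). register_client blocks until a subsequently received MgrMap contains the client's address, so a concurrent failover cannot miss blocklisting it.

```python
import threading

class MapWatcher:
    def __init__(self):
        self._cond = threading.Condition()
        self._addrs_in_map = set()  # addrvecs seen in the latest MgrMap

    def notify_new_mgrmap(self, addrvecs):
        # Would be called from the MgrMap subscription callback.
        with self._cond:
            self._addrs_in_map = set(addrvecs)
            self._cond.notify_all()

    def register_client(self, addrv, timeout=None):
        # Send the registration beacon (elided), then block until a new
        # MgrMap reflects the address before doing any work.
        with self._cond:
            return self._cond.wait_for(
                lambda: addrv in self._addrs_in_map, timeout=timeout)
```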


ajarr commented Mar 2, 2023

Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirably? The recovered client may not get blocklisted if the mgr respawns before it gets registered. Would this PR face the same problem?

Yes, that's a potential problem. I think the only way to address that is to have the register_client method block until it sees a new MgrMap with the new addrvec.

Whose "new addrvec" are you referring to? I don't quite follow how this would fix the issue. Can you please explain?

Maybe introducing a time delay before a mgr_module's or DB's rados client tries to reconnect might help? The time delay should be longer than the time taken for the failed mgr to respawn.

I think disabling/enabling the module is too heavy-weight. It should be possible to recover without any race conditions if we make register_client block on the new MgrMap.


batrick commented Mar 2, 2023

Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirably? The recovered client may not get blocklisted if the mgr respawns before it gets registered. Would this PR face the same problem?

Yes, that's a potential problem. I think the only way to address that is to have the register_client method block until it sees a new MgrMap with the new addrvec.

Whose "new addrvec" are you referring to? I don't quite follow how this would fix the issue. Can you please explain?

After creating a RADOS client, the next thing the module/mgr should do is register it in the MgrMap so it can be blocklisted, yes? So block completion of register_client until the mgr receives an updated MgrMap that includes the newly registered address. That should resolve any races where the module tries to manipulate data.

Maybe introducing a time delay before a mgr_module's or DB's rados client tries to reconnect might help? The time delay should be longer than the time taken for the failed mgr to respawn.

No, that's racy too. Just wait for an updated MgrMap. We do this all over Ceph, including the MDS; e.g. look at wait_for_mdsmap in the MDS.


ajarr commented Mar 2, 2023

Since the client blocklisting happens before the failed mgr respawns, wouldn't the failed mgr modules' rados clients (DB's client in your case) recover undesirably? The recovered client may not get blocklisted if the mgr respawns before it gets registered. Would this PR face the same problem?

Yes, that's a potential problem. I think the only way to address that is to have the register_client method block until it sees a new MgrMap with the new addrvec.

Whose "new addrvec" are you referring to? I don't quite follow how this would fix the issue. Can you please explain?

After creating a RADOS client, the next thing the module/mgr should do is register it in the MgrMap so it can be blocklisted, yes? So block completion of register_client until the mgr receives an updated MgrMap that includes the newly registered address. That should resolve any races where the module tries to manipulate data.

When mgr fail is executed, first the registered clients get blocklisted, and then the MgrMap changes are proposed where the standby mgr is set as the active. Upon seeing the MgrMap changes, the standby daemon becomes active, and the failed active mgr respawns. Even with your suggestion, is it possible that the failed mgr-module's blocklisted client reconnects and registers itself before the pending MgrMap changes, where the standby mgr is set as active, are proposed?

In addition to your suggestion that we wait for the client registration to be reflected in the MgrMap, I think we need to blocklist the registered clients when a mgr starts up (Mgr::init) instead of when MgrMonitor is trying to drop the active mgr ( MgrMonitor::drop_active).

Maybe introducing a time delay before a mgr_module's or DB's rados client tries to reconnect might help? The time delay should be longer than the time taken for the failed mgr to respawn.

No, that's racy too. Just wait for an updated MgrMap. We do this all over Ceph, including the MDS; e.g. look at wait_for_mdsmap in the MDS.


batrick commented Mar 3, 2023

Even with your suggestion, is it possible that the failed mgr-module's blocklisted client reconnects and registers itself before the pending MgrMap changes, where the standby mgr is set as active, are proposed?

It can reconnect but it will block on the new MgrMap which includes its addrvec. It's not a problem that the mgr creates a new "untracked" RADOS client if that client has not done anything yet.

In addition to your suggestion that we wait for the client registration to be reflected in the MgrMap, I think we need to blocklist the registered clients when a mgr starts up (Mgr::init) instead of when MgrMonitor is trying to drop the active mgr ( MgrMonitor::drop_active).

I don't see why.


ajarr commented Mar 5, 2023

Even with your suggestion, is it possible that the failed mgr-module's blocklisted client reconnects and registers itself before the pending MgrMap changes, where the standby mgr is set as active, are proposed?

It can reconnect but it will block on the new MgrMap which includes its addrvec. It's not a problem that the mgr creates a new "untracked" RADOS client if that client has not done anything yet.

In addition to your suggestion that we wait for the client registration to be reflected in the MgrMap, I think we need to blocklist the registered clients when a mgr starts up (Mgr::init) instead of when MgrMonitor is trying to drop the active mgr ( MgrMonitor::drop_active).

I don't see why.

Can the following happen (see below), where the MgrMap changes to register the reconnected client are proposed before the MgrMap changes to promote the standby as active and drop the failed active mgr?

Timeline (actors: MgrMonitor; active mgr to be failed; standby mgr to be promoted):

1. MgrMonitor: registered clients get blocklisted.
2. Active mgr: the mgr module's RADOS client reconnects to recover from blocklisting.
3. Active mgr: the mgr module's RADOS client tries to register the new client.
4. MgrMonitor: MgrMap changes to register the reconnected client's address are proposed.
5. Active mgr: the mgr module's RADOS client registration completes upon seeing its address in the MgrMap.
6. MgrMonitor: MgrMap changes to promote the standby mgr as active and drop the active mgr are proposed. Active mgr: the mgr module's RADOS client starts handling requests.
7. Standby mgr: sees the MgrMap changes, becomes active, and starts loading python modules.
8. Active mgr: the mgr module's RADOS client is running. Standby mgr: the newly loaded module's RADOS client is also running after registering.
9. Active mgr: sees the MgrMap change and respawns, which cleans up the RADOS client.



batrick commented Mar 5, 2023

The "race" you've identified between 6 and 7 has always existed. Your timeline omits that the MgrMonitor in 6 also blocklists the "new" RADOS client created in 2. So the OSDs should stop talking to the old mgr in ~7 (of course, this is a distributed system so not all entities receive maps simultaneously).

The fact that the new active can start doing things before the old instance is completely blocklisted by all OSDs is not something we can feasibly avoid. We do have strategies for scoping the issue:

  • In CephFS, the new MDS won't do anything until it sees the old instance is blocklisted in the OSDMap (via the last_failure_osd_epoch). That way the MDS can communicate the latest OSDMap epoch when talking to OSDs to ensure that, for the OSDs it does talk to, the old MDS cannot be simultaneously doing things too since it's blocklisted.
  • In libcephsqlite, the database lock resides on one object. No client of libcephsqlite will use the database without that lock. So, the single OSD that receives the new OSDMap with the blocklisted address is sufficient for preventing erroneous access. (As long as libcephsqlite waits for the grace period of the lock to expire without breaking it, which it does.)
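The libcephsqlite point can be modeled with a toy lock object (names are purely illustrative; the real mechanism is a RADOS exclusive lock on one object, with a grace period): because every database access funnels through a single lock, fencing that one point is enough to fence the old client.

```python
class LockObject:
    """Toy single-object lock; stands in for the OSD holding the lock."""

    def __init__(self):
        self.owner = None
        self.blocklist = set()

    def try_lock(self, client):
        if client in self.blocklist:
            return False  # this OSD saw the blocklist entry
        if self.owner is None:
            self.owner = client
            return True
        return False

    def expire(self, client):
        # Models the lock's grace period running out for a dead owner.
        if self.owner == client:
            self.owner = None

lock = LockObject()
assert lock.try_lock("old-mgr")
lock.blocklist.add("old-mgr")  # one OSD learns of the blocklisting
lock.expire("old-mgr")         # grace period passes; lock is free again
```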


ajarr commented Mar 6, 2023

Your timeline omits that the MgrMonitor in 6 also blocklists the "new" RADOS client created in 2.

This is the race I thought could happen: in MgrMonitor::prepare_command, between the failed mgr's client getting blocklisted and the proposal of the MgrMap changes setting the standby mgr as active, the failed mgr's client reconnects from being blocklisted and gets itself registered in the MgrMap by MgrMonitor::prepare_beacon.

So the OSDs should stop talking to the old mgr in ~7 (of course, this is a distributed system so not all entities receive maps simultaneously).

The fact that the new active can start doing things before the old instance is completely blocklisted by all OSDs is not something we can feasibly avoid really.

Got it. Thanks for the explanation below.

We have strategies for scoping the issue:

  • In CephFS, the new MDS won't do anything until it sees the old instance is blocklisted in the OSDMap (via the last_failure_osd_epoch). That way the MDS can communicate the latest OSDMap epoch when talking to OSDs to ensure that, for the OSDs it does talk to, the old MDS cannot be simultaneously doing things too since it's blocklisted.
  • In libcephsqlite, the database lock resides on one object. No client of libcephsqlite will use the database without that lock. So, the single OSD that receives the new OSDMap with the blocklisted address is sufficient for preventing erroneous access. (As long as libcephsqlite waits for the grace period of the lock to expire without breaking it, which it does.)


batrick commented Mar 6, 2023

Your timeline omits that the MgrMonitor in 6 also blocklists the "new" RADOS client created in 2.

This is the race I thought could happen: in MgrMonitor::prepare_command, between the failed mgr's client getting blocklisted and the proposal of the MgrMap changes setting the standby mgr as active, the failed mgr's client reconnects from being blocklisted and gets itself registered in the MgrMap by MgrMonitor::prepare_beacon.

It won't be able to register the new RADOS client as the mgr no longer exists in the (pending or not) MgrMap. So the old mgr should never complete the "register_client" call (with the proposed change to have it block on the new MgrMap).

Something that should be done but isn't, related to your concern, is to "plug" PAXOS whenever we are manipulating the OSDMap (blocklisting) and the MgrMap (failover). You may write up a ticket and trivial fix for that. I don't think it's a significant issue but it would be appropriate.


batrick commented Mar 6, 2023

Something that should be done but isn't, related to your concern, is to "plug" PAXOS whenever we are manipulating the OSDMap (blocklisting) and the MgrMap (failover). You may write up a ticket and trivial fix for that. I don't think it's a significant issue but it would be appropriate.

https://tracker.ceph.com/issues/58923

@batrick batrick marked this pull request as ready for review March 9, 2023 19:20
@batrick batrick requested a review from a team as a code owner March 9, 2023 19:20
clients.emplace(std::string(name), std::move(addrs));
auto n = std::string(name);
if (replace) {
clients.erase(n);
Contributor

Since multimap clients can have identical keys (the client name), I don't think this interface is safe until we make the change suggested here #51169 (comment). Should we drop the "replace" functionality for now?

Member Author

I think the intention after #51169 is merged is that register_client will always replace?

Contributor

#51169 will not include the change of the clients data structure from multimap to map. Do you feel that even if the data type of clients is not changed by #51169, this interface is still safe to have?

Member Author

I confess I don't like the "replace" interface but I don't see what about it is unsafe.

Member Author

The "replace" interface can also be removed when/if we switch from multimap to map. It's an internal API so it's not a big deal.
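The intended replace semantics can be sketched with a dict of lists standing in for the C++ multimap (illustrative only; not the actual ActivePyModules code): with replace set, every old addrv registered under the name is dropped before the new one is recorded.

```python
clients = {}  # name -> list of addrvs (multimap stand-in)

def register_client(name, addrv, replace=False):
    if replace:
        clients.pop(name, None)  # drop every old addrv for this name
    clients.setdefault(name, []).append(addrv)

register_client("libcephsqlite", "v2:10.0.0.1:0/1")
register_client("libcephsqlite", "v2:10.0.0.1:0/2", replace=True)
# libcephsqlite only ever has one handle, so only the new addrv remains
```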

(addrv,) = cur.fetchone()
assert addrv is not None
self.log.debug(f"new libcephsqlite addrv = {addrv}")
self._ceph_register_client("libcephsqlite", addrv, True)
Contributor

Since the replace functionality of _ceph_register_client() is not safe, just deregister and register the new address?

@ajarr ajarr left a comment

The changes look good to me. My remaining question: #50291 (comment)

@mchangir since you worked on libcephsqlite, can you go through this PR as well?


@ajarr ajarr requested a review from mchangir June 12, 2023 12:50
@ajarr ajarr left a comment

I see this in the make check failure:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/__init__.py", line 60, in <module>
    from .module import Module, StandbyModule  # noqa: F401
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/module.py", line 24, in <module>
    from mgr_module import CLIReadCommand, CLIWriteCommand, HandleCommandResult, \
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/mgr_module.py", line 518, in <module>
    def MgrModuleRecoverDB(func: abc.Callable) -> abc.Callable:
NameError: name 'abc' is not defined. Did you mean: 'abs'?


batrick commented Jun 13, 2023

I see this in the make check failure:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/__init__.py", line 60, in <module>
    from .module import Module, StandbyModule  # noqa: F401
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/module.py", line 24, in <module>
    from mgr_module import CLIReadCommand, CLIWriteCommand, HandleCommandResult, \
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/mgr_module.py", line 518, in <module>
    def MgrModuleRecoverDB(func: abc.Callable) -> abc.Callable:
NameError: name 'abc' is not defined. Did you mean: 'abs'?

Weird, I didn't notice that before. I removed abc.
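For context, the NameError arises from annotating the decorator with abc.Callable without importing abc; annotating with typing.Callable instead (sketched below; the PR's actual fix may differ) avoids it while keeping the type hints.

```python
from typing import Any, Callable

def MgrModuleRecoverDB(func: Callable) -> Callable:
    # Placeholder body: the real decorator wraps DB access with
    # blocklist-recovery logic; here we only show the annotation fix.
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return func(*args, **kwargs)
    return wrapper
```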

@ajarr ajarr left a comment

@mchangir can you take a look?

batrick added 6 commits June 27, 2023 14:18
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Presently the libcephsqlite library becomes unusable (by design, at the
time) when the RADOS instance it uses is blocklisted. Unfortunately,
it's fairly common for this to happen during disruptive events. For the
ceph-mgr, it's unacceptable to require manual intervention via the
`ceph mgr fail` command.

So, this commit reworks libcephsqlite to reconnect to RADOS when its
cluster handle is blocklisted. This is safe as any open "file" by
sqlite3 has its own ptr to the open cluster handle. So those open files
will all continually fail until sqlite3 closes the handle (on database
connection shutdown). The application can then reopen the database if
desired using the fresh RADOS handle.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Two new parameters:

- "name" to allow setting the name associated with the addrv. Previously
  this was hard-coded to the mgr module name. However, we want mgr
  modules to be able to update the libcephsqlite cluster handle.

- "replace" to replace all addrv for the given "name". libcephsqlite
  only ever has one handle, remove the old ones.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Fixes: https://tracker.ceph.com/issues/55606
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
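The safety argument in the libcephsqlite commit message above (each open "file" pins the cluster handle it was opened with, so swapping in a fresh handle cannot disturb already-open files) can be sketched as (illustrative Python; the real code is the C++ SQLite VFS):

```python
class Cluster:
    def __init__(self, gen):
        self.gen = gen
        self.blocklisted = False

class VFS:
    def __init__(self):
        self.cluster = Cluster(gen=0)

    def open_file(self):
        # Each open "file" captures the handle current at open time.
        return {"cluster": self.cluster}

    def maybe_reconnect(self):
        if self.cluster.blocklisted:
            self.cluster = Cluster(gen=self.cluster.gen + 1)

vfs = VFS()
old_file = vfs.open_file()
vfs.cluster.blocklisted = True  # old handle gets blocklisted
vfs.maybe_reconnect()
new_file = vfs.open_file()
# old_file keeps failing against the blocklisted handle until the
# application closes and reopens the database with the fresh handle
```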

batrick commented Jun 27, 2023

rebase to fix conflict


batrick commented Jul 12, 2023

This is ready to merge when RADOS QA is complete. cc @ljflores @rzarzynski


yuriw commented Jul 14, 2023

From @kamoltat ref: https://trello.com/c/7y6uj4bo

RADOS approved, all failures/dead jobs are unrelated and known failures.

Failures:

7332480 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
7332558 -> Bug #57302: ERROR: test_get_status (tasks.mgr.dashboard.test_cluster.ClusterTest) - Dashboard - Ceph -> Test failure: test_create_access_permissions (tasks.mgr.dashboard.test_pool.PoolTest)
7332565 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure - Ceph -> test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332612 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure - Ceph -> test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332613 -> Bug #55347: SELinux Denials during cephadm/workunits/test_cephadm - Infrastructure - Ceph -> SELinux Denials during cephadm/workunits/test_cephadm
7332636 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard

Deads:

7332357 -> Bug #61164: Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds - Infrastructure - Ceph -> Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds
7332405 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph -> rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
7332559 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph -> rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
