mgr: ensure that all modules have started before advertising active mgr #63859
Conversation
jenkins test windows

jenkins test submodules
athanatos
left a comment
LGTM, but someone more familiar with the manager should probably have a look as well.
Changed the commit description; code is the same.
Further context for testing: With the current solution, I used the following reproducer, which injects a time-complex bottleneck into the balancer module to make its startup take longer:

```diff
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index 476304275c1..d2dfa01e99a 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -771,6 +771,11 @@ class Module(MgrModule):
             if pg_upmap_activity:
                 self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
             self.optimizing = False
+            # causing a bottleneck
+            for i in range(0, 1000):
+                for j in range(0, 1000):
+                    x = i + j
+                    self.log.debug("hitting the bottleneck in the balancer module")
             self.log.debug('Sleeping for %d', sleep_interval)
             self.event.wait(sleep_interval)
             self.event.clear()
```

After injecting the bottleneck, I ran this command on a vstart cluster, which shows that the mgr declared availability too early. In the mgr log, we can see that the mgr declares availability before the modules have finished loading. In my patch, the mgr declares availability only after all modules have started. In the mgr log, we can see that the mgr is sending a "starting" beacon to indicate that it is still starting, and it only sends the "active" beacon after the modules have finished loading.
batrick
left a comment
I haven't studied this code in detail yet. I have a question on the overall approach here: why not just have the ceph-mgr sit on MMgrCommands and retry all tabled commands whenever a module becomes active. If a module never becomes active after some timeout, return -ENOSYS (not -ENOTSUP).
That way the ceph-mgr is available immediately for modules which load quickly and others just delay the command delivery until a module that understands the command successfully loads.
I see what you’re saying and how that might make mgr command handling more resilient. However, your suggestion assumes that mgr modules load independently and in parallel, but that’s not the case. Mgr modules load sequentially, and the load time of one can directly impact the next. For instance, if the orchestrator module takes longer to load, the telemetry module won’t have loaded yet, since modules load in alphabetical order and start one by one. So retrying tabled commands on a per-module basis introduces unnecessary complexity, especially since we know that if one earlier module is still pending, all subsequent ones will be as well. If we did want to check for availability on a per-module basis, we could add logic to the DaemonServer code where mgr commands are handled. However, I’ve already implemented that type of approach here: #61325

Another important detail is that this startup issue only applies to enabled modules, not disabled ones. If a module is enabled, it’s reasonable to assume it’s part of the cluster’s intended functionality. In my opinion, the mgr simply should not be considered “available” until all enabled modules have fully started. If users didn’t intend to use those modules, they’d disable them. Marking the mgr as active before those modules are ready signals false availability to the cluster. Since the cluster expects enabled modules to be usable, it’s cleaner to wait for them to finish starting up than to mark the mgr active early and try to recover later on a per-module basis.

If you're interested in reviewing the module startup sequence, the relevant logic can be found in:

Here, you can see how modules are loaded one by one, where each is dependent on the completion of the previous. One aspect I’m unsure about is what the mgr’s behavior should be if the loading somehow gets stalled indefinitely. My solution is currently implemented so if
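To illustrate the sequential-loading point above, here is a simplified Python sketch. The module names and startup costs are made up for illustration (they are not real measurements, and this is not the actual registry code): because modules start one by one in alphabetical order, a slow early module pushes back the readiness of every later one.

```python
# Hypothetical module names and startup costs in seconds (illustrative only).
modules = [("balancer", 3.0), ("cephadm", 0.1), ("telemetry", 0.1)]

def load_sequentially(mods):
    """Return the simulated time at which each module finishes loading."""
    finished_at = {}
    clock = 0.0
    for name, startup_cost in mods:  # alphabetical order, one at a time
        clock += startup_cost        # a slow module delays everything after it
        finished_at[name] = clock
    return finished_at

done = load_sequentially(modules)
# telemetry cannot be ready before balancer's 3.0 s startup completes
assert done["balancer"] == 3.0
assert done["telemetry"] > 3.0
```

This is why "is everything before module X loaded?" is a good proxy for "is module X loaded?" in the current design.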
jenkins test api
I don't see any real advantage to allowing the manager to be "active" without the modules being loaded -- seems a lot more complicated for little advantage. Yes, it allows a slightly faster startup, but that hardly seems critical on the manager.
My thought was that you'd keep a vector of MMgrCommands to retry whenever a module becomes active. There would be no need to sleep the messenger thread.
The larger problem is that the ceph-mgr doesn't actually decide when it's active. The monitors tell the standby it's active. If the ceph-mgr doesn't immediately start sending beacons then the mons will assume it's dead and replace it. This seems to be the reason why the "available" flag was invented, but its meaning is controversial and recently discussed in e.g.

I'd argue the "available" flag is half-baked and we keep churning on this continually. I think a complete solution requires adding recovery states to the ceph-mgr, starting with something like "STATE_LOADING_MODULES". The ceph-mgr eventually should tell the mons when it's ready to switch to active in the same way that the MDS do. What do you think?
I still don’t think this would offer any significant benefit. Mgr modules load sequentially and depend on each other, so we can already infer a module’s availability based on whether the ones before it have loaded. Retrying tabled commands individually adds extra complexity, especially when we know every enabled module is expected to be loaded and functional in the cluster.
I disagree with you on this. The monitor AND the active mgr it chooses both have a hand in determining the mgr’s availability. In
The chosen active mgr ultimately tells the mon when it’s done initializing, and it is actually an expectation that the python modules be loaded per the comment written under the “if (available)” condition. The code that I’m referencing is here:
That’s not quite what’s happening here. In my solution, the mgr does keep sending beacons, just not the final “active” beacon until all modules have started. While it's waiting for initialization to complete, it sends a beacon indicating that it’s “active (starting)”. So the mon is being notified continuously, and doesn’t consider the mgr dead. Even if a module is stalled loading forever, the mon doesn’t replace the chosen active mgr. It waits for the mgr to signal that initialization is complete. See this comment for a clear demonstration of that: #63859 (comment)
Looking at the MgrStandby::send_beacon code, it seems straightforward to me that the “available” condition is set by 1) the mon choosing a new active_mgr and 2) the new active mgr signaling initialization. Introducing module-by-module availability would again break these expectations and make communication with the mon and mgr command handler much more difficult. That said, I'm open to understanding more about why the condition might be half-baked and how we can make it clearer.
Can you please elaborate on the active sequence in the MDS and how it differs from the mgr?

I'm not actually sure this is a big deal. I haven't actually seen evidence that loading modules takes a long time, merely that there's a possible race condition if we send the beacon before it's actually happened. Perhaps we try this solution and see if the delay is actually a problem?
I know you and I have gone over this scenario many times, but I want to point out the original example that motivated this fix, which is documented in https://bugzilla.redhat.com/show_bug.cgi?id=2314146. In this report, the QE team was testing upgrades from Reef/Quincy to Squid, and their scripts were erroring out as soon as the mgrs upgraded to Squid. We had not seen this race condition occur before Squid, and after digging into it, I found that it is likely due to us gradually adding more load on the various `serve()` functions.

One might suggest optimizing these `serve()` functions. In the example from https://bugzilla.redhat.com/show_bug.cgi?id=2314146, the QE team is still experiencing the error after we fixed the bottleneck in the balancer, although to a lesser extent. I experimented with disabling cephadm to remove it from the loading sequence, and then the error "went away". This indicates that cephadm might have a heavy `serve()` function as well.

Ultimately, we likely didn't see this problem before Squid since we got lucky. But, with the introduction of the finisher thread framework, which changed the module loading sequence to an async background task, and by us gradually adding more tasks to the `serve()` functions, the race has become much easier to hit.
Thanks for patiently explaining and reminding me of the details Laura. I think you're right. I would also ask @idryomov to chime in here as I think this PR is doing what he previously asked we do.
No sorry, I didn't mean that the ceph-mgr should indicate via a state change for each individual module. It would just be a recovery state. The "available" flag approximates that to a degree. What I don't like is that operators won't understand why the mgr may not be working when it's clearly "active". Perhaps that's something that can be included in this PR: the mons can log a health warning in MgrMonitor::encode_pending that the mgr is active but not yet available. (This is normal, as we log similar warnings when the MDS is in a degraded state.) In any case, any necessary redesign with recovery states can be done in the future.
Sure, I'll request Ilya as a reviewer.
FWIW, the ceph status does already indicate that the new active mgr is still starting. This snippet was taken during the same "initialization sequence delay via balancer bottleneck" that I illustrated in #63859 (comment): The MgrStandby code defines the three possible states: "standby", "active", and "active (starting)":
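In rough terms, the state selection can be sketched like this (a Python toy model of the C++ logic in MgrStandby; the parameter names here are my own, not Ceph's):

```python
def mgr_state(chosen_active: bool, initialized: bool) -> str:
    """Approximate the mgr state string shown in `ceph status`.

    A sketch, not the real C++: the mon chooses which mgr is active,
    and the mgr itself reports whether initialization has finished.
    """
    if not chosen_active:
        return "standby"            # mon has not picked this mgr as active
    if not initialized:
        return "active (starting)"  # picked as active, modules still loading
    return "active"                 # fully initialized: available

assert mgr_state(False, False) == "standby"
assert mgr_state(True, False) == "active (starting)"
assert mgr_state(True, True) == "active"
```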
Sure, it would be helpful to have a design proposal in the form of an Enhancement tracker so all the criteria and intended benefits are clearly stated.

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
Reopening since this is something we want to make available.

Thanks, Venky! I'm retesting this one more time and then going to seek final reviews.
----------------- Explanation of Problem ----------------

When the mgr is restarted or failed over via `ceph mgr fail` or during an upgrade, mgr modules sometimes take longer to start up (this includes loading their class, commands, and module options, and being removed from the `pending_modules` map structure). This startup delay can happen due to a cluster's specific hardware or if a code bottleneck is triggered in a module’s `serve()` function (each mgr module has a `serve()` function that performs initialization tasks right when the module is loaded).

When this startup delay occurs, any mgr module command issued against the cluster around the same time fails with an error saying that the command is not supported:

```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```

We should try to lighten any bottlenecks in the mgr module `serve()` functions wherever possible, but the root cause of this failure is that the mgr sends a beacon to the mon too early, indicating that it is active before module loading has completed. Specifically, some of the mgr modules have loaded their class but have not yet been deleted from the `pending_modules` structure, indicating that they have not finished starting up.

--------------------- Explanation of Fix --------------------

This commit improves the criteria for sending the “active” beacon to the mon so the mgr does not signal that it’s active too early. We do this through the following additions:

1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules have finished startup.

2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are still pending startup; if all have started up (`pending_modules` is empty), then we send the beacon right away. But if some are still pending, we pass the beacon task on to the new recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.

3. Logic in `ActivePyModules::start_one()` that only gets triggered if the modules did not all finish startup the first time we checked. We know this is the case if the new recheck context `recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is confirmed to be empty, which means that all the modules have started up and are ready to support commands.

4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in `MgrStandby::send_beacon()` when we check that the active mgr has been initialized (and thus is available). We only send the beacon when `initialized` is set. Currently, we set these booleans at the end of `Mgr::init()`, which means they get set early, before `pending_modules` is clear. With this adjustment, the bools are set only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on every `Mgr::tick()`, which occurs by default every two seconds. If we don’t adjust when these bools are set, we only fix the mgr failover part, but the mgr still sends the beacon too early via `Mgr::tick()`. Below is the relevant code from `MgrStandby::send_beacon()`, which is triggered in `Mgr::background_init()` AND in `Mgr::tick()`:

```
  // Whether I think I am available (request MgrMonitor to set me
  // as available in the map)
  bool available = active_mgr != nullptr && active_mgr->is_initialized();
  auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
  dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;
```

---------------------- Reproducing the Bug ----------------------

At face value, this issue is indeterministically reproducible since it can depend on environmental factors or specific cluster workloads. However, I was able to deterministically reproduce it by injecting a bottleneck into the balancer module:

```diff
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
             self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
         self.optimizing = False
+        # causing a bottleneck
+        for i in range(0, 1000):
+            for j in range(0, 1000):
+                x = i + j
+                self.log.debug("hitting the bottleneck in the balancer module")
         self.log.debug('Sleeping for %d', sleep_interval)
         self.event.wait(sleep_interval)
         self.event.clear()
```

Then, the error reproduces every time by running:

```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```

With this fix, the active mgr is marked as "initialized" only after all the modules have started up, and this error goes away. The command may take a bit longer to execute depending on the extent of the delay.

---------------------- Integration Testing ---------------------

This commit adds a dev-only config that can inject a longer loading time into the mgr module loading sequence so we can simulate this scenario in a test. The config is 0 ms by default since we do not add any delay outside of testing scenarios. The config can be adjusted with the following command:

`ceph config set mgr mgr_module_load_delay <ms>`

A second dev-only config allows you to specify which module you want to delay. You may change this with the following command:

`ceph config set mgr mgr_module_load_delay_name <module name>`

The workunit added here tests a simulated slow loading module scenario to ensure that this case is properly handled.

--------------------- Documentation --------------------

The new documentation describes the three existing mgr states so Ceph operators can better interpret their Ceph status output.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
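The deferred-beacon logic described in items 1-3 above can be sketched in Python (a toy model only: the class, the callback plumbing, and the method bodies here are stand-ins, not the real C++ `ActivePyModules`):

```python
class ModuleRegistrySketch:
    """Toy model of the deferred "active" beacon (not the real Ceph code)."""

    def __init__(self, module_names, send_beacon):
        self.pending_modules = set(module_names)
        self.send_beacon = send_beacon       # fired once all modules started
        self.recheck_modules_start = None    # set when the beacon was deferred

    def check_all_modules_started(self):
        # Mirrors the idea of ActivePyModules::check_all_modules_started():
        # send the beacon now if nothing is pending, else remember to send later.
        if not self.pending_modules:
            self.send_beacon()
        else:
            self.recheck_modules_start = self.send_beacon

    def start_one(self, name):
        # Mirrors the added logic in ActivePyModules::start_one(): once the
        # last pending module finishes, fire the deferred beacon.
        self.pending_modules.discard(name)
        if self.recheck_modules_start is not None and not self.pending_modules:
            self.recheck_modules_start()
            self.recheck_modules_start = None

beacons = []
reg = ModuleRegistrySketch(["balancer", "telemetry"],
                           lambda: beacons.append("active"))
reg.check_all_modules_started()   # modules still pending: beacon deferred
assert beacons == []
reg.start_one("balancer")         # one module left, still nothing sent
reg.start_one("telemetry")        # last module done: beacon goes out now
assert beacons == ["active"]
```

The key property is that the beacon fires exactly once, and only after `pending_modules` drains.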
…eded

----------------- Enhancement to the Original Fix -----------------

During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized

Now that we've improved the criteria for sending the "active" beacon by enforcing it to retry initializing mgr modules, we need to account for extreme cases in which the modules are stuck loading for a very long time, or even indefinitely. In these extreme cases where the modules might never initialize, we don't want to delay sending the "active" beacon for too long. That would block other important mgr functionality, such as reporting PG availability in the health status, and we want to avoid sending warnings about PGs being unknown when that's not ultimately the problem.

To account for an exceptionally long module loading time, I added a new configurable `mgr_module_load_expiration`. If we exceed this maximum amount of time (in ms) allotted for the active mgr to load the mgr modules before declaring availability, the standby will proceed to mark itself "available" and send the "active" beacon to the mon, unblocking other critical mgr functionality. If this happens, a health error will be issued at this time, indicating which mgr modules got stuck initializing (see src/mgr/PyModuleRegistry.cc). The idea is to unblock the rest of the mgr's critical functionality while making it clear to Ceph operators that some modules are unusable.

--------------------- Integration Testing --------------------

The workunit was rewritten so it tests for these scenarios:
1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be issued)
3. Unacceptable delay in module loading behavior (a health error should be issued)

--------------------- Documentation --------------------

This documentation explains the "Module failed to initialize" cluster error. Users are advised to try failing over the mgr to restart the module initialization process, and then, if the error persists, to file a bug report. I decided to write it this way instead of providing more complex debugging tips, such as advising users to disable some mgr modules, since every case will be different depending on which modules failed to initialize. In the bug report, developers can ask for the health detail output to narrow down which module is causing a bottleneck, and then ask the user to try disabling certain modules until the mgr is able to fully initialize.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
Now, the command `ceph tell mgr mgr_status` will show a "pending_modules" field. This is another way for Ceph operators to check which modules haven't been initialized yet (in addition to the health error). This command was also added to testing scenarios in the workunit.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
The current check groups a module not being enabled together with a module failing to initialize. In this commit, we reorder the checks:

1. Screen for whether a module is enabled. If it's not, issue an EOPNOTSUPP with instructions on how to enable it.
2. Screen for whether a module is active. If a module is enabled, then the cluster expects it to be active to support commands. If the module took too long to initialize, though, we catch this and issue an ETIMEDOUT error with a link for troubleshooting.

Now, these two separate issues are no longer grouped together, and they are checked in the right order.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
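The reordered screening can be sketched like this (a toy model; the real checks live in the mgr's C++ command handling, and the function and message texts here are illustrative):

```python
import errno

def screen_module_command(module, enabled, active):
    """Toy model of the reordered checks: enablement first, then liveness."""
    if not enabled:
        # Not enabled: the operator never asked for this module.
        return (-errno.EOPNOTSUPP,
                f"Module '{module}' is not enabled: use "
                f"`ceph mgr module enable {module}` to enable it")
    if not active:
        # Enabled but stuck initializing: a different problem entirely.
        return (-errno.ETIMEDOUT,
                f"Module '{module}' is enabled but has not finished "
                "initializing; see the troubleshooting documentation")
    return (0, f"dispatching command to {module}")

assert screen_module_command("telemetry", False, False)[0] == -errno.EOPNOTSUPP
assert screen_module_command("telemetry", True, False)[0] == -errno.ETIMEDOUT
assert screen_module_command("telemetry", True, True)[0] == 0
```

Checking enablement first matters: a disabled module should never produce the "failed to initialize" timeout error, since there was nothing to initialize.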
Squashed all the commits. Everything is now in order and ready for final reviews. I also ran some QA tests on the workunit, and confirmed it passed:

/config check ok
NitzanMordhai
left a comment
I reviewed all the changes, and it looks good!
Adding the rocky10 label to make testing easier; however, this fix is not specific to Rocky 10.
```
fi

echo "Check mgr_status to ensure 'pending_modules' is populated with modules we expect..."
expected='["balancer","cephadm","crash","devicehealth","iostat","nfs","orchestrator","pg_autoscaler","progress","rbd_support","status","telemetry","volumes"]'
```
@ljflores we need to add nvmeof as well; it was added as an "always on" module.
Post the merge of this: ceph#67641

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
jenkins test make check
Part 2 -- Adding a "max expiration" mechanism + health warning to address extreme loading cases
During a mgr failover, the active mgr is marked available if:
1. The mon has chosen a standby to be active
2. The chosen active mgr has all of its modules initialized
Now that we've improved the criteria for sending the "active" beacon by enforcing it to retry initializing mgr modules, we need to account for extreme cases in which the modules are stuck loading for a very long time, or even indefinitely. In these extreme cases where the modules might never initialize, we don't want to delay sending the "active" beacon for too long. This can result in blocking other important mgr functionality, such as reporting PG availability in the health status. We want to avoid sending warnings about PGs being unknown in the health status when that's not ultimately the problem.
To account for an exceptionally long module loading time, I added a new configurable `mgr_module_load_max_expiration`. After a certain expiration (by default, 20 seconds), if the mgr modules still haven't finished initializing, the standby will then proceed to mark itself "available" and send the "active" beacon to the mon, which unblocks other critical mgr functionality. For added clarity, a health error will be issued at this time, indicating which mgr modules got stuck initializing (see src/mgr/PyModuleRegistry.cc). The idea is to unblock the rest of the mgr's critical functionality while making it clear to Ceph operators that some modules are unusable.
This is what the health warning might look like:
Links to rendered documentation:
Note: All of the new configurations added in this PR are marked as "dev" level, and are to be used only in testing scenarios or in out-of-the-norm troubleshooting scenarios. I have chosen not to document them for this reason.
Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>