
mgr: add retry logic for module loading#61325

Closed
ljflores wants to merge 2 commits into ceph:main from
ljflores:wip-_handle_command-returns-ENOTSUP-prematurely

Conversation

@ljflores
Member

@ljflores ljflores commented Jan 10, 2025

This PR accomplishes several things:

  1. Reorder "active and enabled" checks & add retry logic for module loading

    The current "active and enabled" check order can result in ENOTSUP being
    returned when a module is enabled but needs more time to load. In this case,
    the command should be supported; the module simply timed out while loading.
    This PR adds internal retry logic so that the mgr retries loading the module
    several times before giving up. With that in place, ETIMEDOUT is the more
    appropriate error.

    The default max time that a module can take to load is now 5 seconds. To
    adjust this time, the following command can be run:
    ceph config set mgr mgr_module_load_timeout <uint>

    The default cadence at which the mgr retries is every 1 second. To
    adjust this time, the following command can be run:
    ceph config set mgr mgr_module_load_interval <uint>

  2. Add test coverage for slow loading module

    To test the above modifications, this PR adds a dev-only config
    that can inject a longer loading time into the mgr module loading
    sequence so we can simulate this scenario in a test.

    The config is 0 secs by default since we do not add any delay
    outside of testing scenarios. The config can be adjusted
    with the following command:
    ceph config set mgr mgr_module_load_delay <uint>

    A second dev-only config also allows you to specify which
    module's loading should be delayed. You may change
    this with the following command:
    ceph config set mgr mgr_module_load_delay_name <module name>

    The workunit added here tests a simulated slow-loading module
    scenario and also checks for any problems in the actual
    module loading time.
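
The retry flow described in (1) can be sketched roughly like this. This is a minimal illustration only, not the actual ceph-mgr code; `wait_for_module` and its parameters are hypothetical names:

```cpp
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hedged sketch of the described behavior: retry every
// mgr_module_load_interval seconds until mgr_module_load_timeout
// elapses, then give up with -ETIMEDOUT instead of -ENOTSUP.
// Assumes interval_secs > 0.
int wait_for_module(const std::function<bool()>& module_loaded,
                    uint64_t interval_secs, uint64_t timeout_secs) {
  uint64_t retries = timeout_secs / interval_secs;
  while (!module_loaded()) {
    if (retries == 0) {
      return -ETIMEDOUT;  // module never came up within the timeout
    }
    std::this_thread::sleep_for(std::chrono::seconds(interval_secs));
    --retries;
  }
  return 0;  // module is loaded; the command can proceed
}
```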

Latest teuthology results for the new workunit are green:
https://pulpito.ceph.com/lflores-2025-01-10_16:53:51-rados:mgr-wip-_handle_command-returns-ENOTSUP-prematurely-distro-default-smithi/


badone and others added 2 commits January 10, 2025 18:06
The current "active and enabled" check order can result in an ENOTSUP being
returned when a module is enabled, but needs more time to load. In this case,
the command should be supported, but the module timed out loading. With the internal
retry logic that is added (the mgr retries loading the module several
times before giving up), ETIMEDOUT is more appropriate.

The default max time that a module can take to load is now 5 seconds. To
adjust this time, the following command can be run:
  `ceph config set mgr mgr_module_load_timeout <uint>`

The default cadence at which the mgr retries is every 1 second. To
adjust this time, the following command can be run:
  `ceph config set mgr mgr_module_load_interval <uint>`

Fixes: https://tracker.ceph.com/issues/69012
Signed-off-by: Laura Flores <lflores@ibm.com>
Co-authored-by: Brad Hubbard <bhubbard@redhat.com>
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.

The config is 0 secs by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
  `ceph config set mgr mgr_module_load_delay <uint>`

A second dev-only config also allows you to specify which
module's loading should be delayed. You may change
this with the following command:
  `ceph config set mgr mgr_module_load_delay_name <module name>`

The workunit added here tests a simulated slow-loading module
scenario and also checks for any problems in the actual
module loading time.

Signed-off-by: Laura Flores <lflores@ibm.com>
@ljflores ljflores requested a review from a team as a code owner January 10, 2025 18:12
@ljflores ljflores changed the title from "Wip handle command returns enotsup prematurely" to "mgr: add retry logic for module loading" Jan 10, 2025
@ljflores
Member Author

check-black: FAIL ✖ in 7.51 seconds

Make check errors seem unrelated.

@ljflores
Member Author

jenkins test make check

<< prefix << "'): retrying in " << interval << " secs (" << retries
<< " retries left)." << dendl;
// Sleep for the retry interval
std::this_thread::sleep_for(std::chrono::seconds(interval));
Contributor

OK, so the mgr starts taking care of delaying requests until the module gets initialized. Seems sane.

Member Author

@ljflores ljflores Jan 15, 2025

Yeah, we basically give the mgr a little more time to retry loading the module. It's not a guarantee that the module loads in time, but if it doesn't load in the 5 seconds we give it, there is something bigger going on, and we will issue an ETIMEDOUT. But this will help account for slower hardware and any kind of latency in the cluster.

Member

This blocks the messenger thread. I don't think we can structure it this way. Note also

https://github.com/ceph/ceph/pull/61325/files#diff-2de410c19e44c5a7ca3ca756e0464f8101e0218002090e36c55b3b68c40040e6R2601-R2619

which also has checks for whether the module is loaded properly. That should be consolidated.

Only the mod_finisher should block waiting for the module to load and there should be a condition variable that wakes the finisher thread rather than unconditionally wait interval seconds.
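
A rough sketch of the condition-variable shape suggested here (illustrative names, not ceph code): the loader signals as soon as the module is up, and the waiting thread blocks with a deadline instead of polling on a fixed interval.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: the module loading thread calls notify_loaded()
// once, and the waiting thread (e.g. the mod_finisher) wakes immediately
// instead of sleeping through the remainder of the retry interval.
struct ModuleLoadWaiter {
  std::mutex lock;
  std::condition_variable cond;
  bool loaded = false;

  // Called by the loader when the module finishes initializing.
  void notify_loaded() {
    {
      std::lock_guard l(lock);
      loaded = true;
    }
    cond.notify_all();
  }

  // Blocks until the module loads or the overall timeout expires;
  // returns true if the module loaded in time.
  bool wait_for_load(std::chrono::seconds timeout) {
    std::unique_lock l(lock);
    return cond.wait_for(l, timeout, [this] { return loaded; });
  }
};
```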

@rzarzynski
Contributor

@batrick: ?

Contributor

@badone badone left a comment

LGTM, but since I wrote some of it, an additional approval would be advisable, I believe :)

@ljflores ljflores requested a review from rzarzynski January 22, 2025 20:16
echo "Testing with module load delay of 6 seconds..."
ceph config set mgr mgr_module_load_delay 6

output=$(ceph mgr fail; ceph orch status 2>&1)
Contributor

ACK.

@rzarzynski
Contributor

Would be good to get a review from @batrick as well. However, let's not wait on this too long. Would several days be fine, @ljflores?

@ljflores
Member Author

@rzarzynski yeah, I will check with Patrick though.

@ljflores
Member Author

Going to add this in a QA batch since we have several approvals. However, if @batrick has any feedback, still feel free to add it.

- cluster_create
# retry every N seconds
- name: mgr_module_load_interval
type: uint
Member

Suggested change:
-  type: uint
+  type: millisecs

Would be better. See for example mds_log_trim_upkeep_interval.

with_legacy: true
# fail if the module fails to load in time
- name: mgr_module_load_timeout
type: uint
Member

ditto

- mgr
with_legacy: true
- name: mgr_module_load_delay
type: uint
Member

ditto

desc: Choose which mgr module to inject a load delay. For testing purposes only.
flags:
- runtime
with_legacy: true
Member

with_legacy is deprecated.

Member Author

Do you mean just for this config option, or for each new config option?

Member

any new config should not have with_legacy: true.

auto& mod_name = py_command.module_name;
uint64_t interval = cct->_conf->mgr_module_load_interval;
uint64_t timeout = cct->_conf->mgr_module_load_timeout;
uint64_t retries = floor(timeout / interval);
Member

floor is unnecessary for integers.
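
A tiny illustration of the point (hypothetical helper name):

```cpp
#include <cstdint>

// For unsigned integer operands, division already truncates toward
// zero, so wrapping the quotient in floor() adds nothing (and forces
// a needless round-trip through double).
uint64_t compute_retries(uint64_t timeout, uint64_t interval) {
  return timeout / interval;  // identical to floor(timeout / interval)
}
```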

"command '" << prefix << "').";
dout(4) << ss.str() << dendl;
cmdctx->reply(-EOPNOTSUPP, ss);
cmdctx->reply(-ETIMEDOUT, ss);
Member

Suggested change:
-  cmdctx->reply(-ETIMEDOUT, ss);
+  cmdctx->reply(-EAGAIN, ss);

For applications using the ceph-mgr provided CLI, we want retries prompted by returning EAGAIN. See for example:

// We do not expect to be called before active modules is up, but
// it's straightforward to handle this case so let's do it.
return -EAGAIN;

There is also a PR relating to that: #60194

Since this code has a built-in expectation that a module may eventually load, we should tell the caller to try again later.
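
A hedged sketch of the caller-side benefit (illustrative names, not ceph code): a generic client loop can key off EAGAIN and retry, whereas ETIMEDOUT reads as a final failure.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical caller-side helper: retry a command only while the mgr
// reports -EAGAIN ("module not loaded yet, try again"), and surface
// any other result immediately.
int run_with_retries(const std::function<int()>& send_command,
                     int max_attempts) {
  int ret = -EAGAIN;
  for (int i = 0; i < max_attempts && ret == -EAGAIN; ++i) {
    ret = send_command();  // stop as soon as we get a non-EAGAIN result
  }
  return ret;
}
```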

Member

Suggest also issuing a cluster log error that a module is failing to load to satisfy commands.

<< py_handler_name << "` to enable it";
dout(4) << ss.str() << dendl;
cmdctx->reply(-EOPNOTSUPP, ss);
return true;
Member

It may also be good to issue a cluster log warning that API calls are being made to the ceph-mgr which require a module to be loaded.


@ronen-fr
Contributor

@ljflores - the QA run that contained this PR should be recreated and retried. My question:
should this PR still be part of it (i.e., was @batrick's comment addressed)?

@ljflores
Member Author

@ronen-fr no, dropping it from the batch

@mchangir
Contributor

Here are my comments as per Brad's request:
Currently, the MGR starts loading modules and simultaneously declares availability. The Mgr::background_init() method only queues a lambda which does Mgr::init() (which queues the module loading lambdas to the finisher) and follows up with sending a beacon to the MON to declare availability. Since module loading takes time, the MGR may not be available to serve all different types of commands by the time the availability is declared. This is the reason for most test/command failures that wait for MGR availability and then immediately issue MGR commands.

My PR #59089 just defers the declaration of MGR availability until after an attempt has been made to load all active modules.

FYI - My PR neither reattempts to load modules that have failed loading once nor does it add any timeout logic to declare eventual failure.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 31, 2025
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
