
mgr: add retry logic for module loading#61325

Closed
ljflores wants to merge 2 commits into ceph:main from
ljflores:wip-_handle_command-returns-ENOTSUP-prematurely

Conversation

@ljflores
Member

@ljflores ljflores commented Jan 10, 2025

This PR accomplishes several things:

  1. Reorder "active and enabled" checks & add retry logic for module loading

    The current "active and enabled" check order can result in ENOTSUP being
    returned when a module is enabled but needs more time to load. In this case,
    the command should be supported; the module simply timed out while loading.
    This PR adds internal retry logic so that the mgr retries loading the module
    several times before giving up. With that in place, ETIMEDOUT is the more
    appropriate error.

    The default max time that a module can take to load is now 5 seconds. To
    adjust this time, the following command can be run:
    ceph config set mgr mgr_module_load_timeout <uint>

    The default cadence at which the mgr retries is every 1 second. To
    adjust this time, the following command can be run:
    ceph config set mgr mgr_module_load_interval <uint>

  2. Add test coverage for slow loading module

    To test the above modifications, this PR adds a dev-only config
    that can inject a longer loading time into the mgr module loading
    sequence so we can simulate this scenario in a test.

    The config is 0 secs by default since we do not add any delay
    outside of testing scenarios. The config can be adjusted
    with the following command:
    ceph config set mgr mgr_module_load_delay <uint>

    A second dev-only config also allows you to specify which
    module's loading should be delayed. You may change
    this with the following command:
    ceph config set mgr mgr_module_load_delay_name <module name>

    The workunit added here tests a simulated slow-loading module
    scenario and also checks for any problems in the actual
    module loading time.
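
The retry flow described in (1) can be sketched roughly like this. This is a minimal illustration only, not the actual ceph-mgr code; `wait_for_module` and its parameters are hypothetical names:

```cpp
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hedged sketch of the described behavior: retry every
// mgr_module_load_interval seconds until mgr_module_load_timeout
// elapses, then give up with -ETIMEDOUT instead of -ENOTSUP.
// Assumes interval_secs > 0.
int wait_for_module(const std::function<bool()>& module_loaded,
                    uint64_t interval_secs, uint64_t timeout_secs) {
  uint64_t retries = timeout_secs / interval_secs;
  while (!module_loaded()) {
    if (retries == 0) {
      return -ETIMEDOUT;  // module never came up within the timeout
    }
    std::this_thread::sleep_for(std::chrono::seconds(interval_secs));
    --retries;
  }
  return 0;  // module is loaded; the command can proceed
}
```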

Latest teuthology results for the new workunit are green:
https://pulpito.ceph.com/lflores-2025-01-10_16:53:51-rados:mgr-wip-_handle_command-returns-ENOTSUP-prematurely-distro-default-smithi/


badone and others added 2 commits January 10, 2025 18:06
The current "active and enabled" check order can result in an ENOTSUP being
returned when a module is enabled, but needs more time to load. In this case,
the command should be supported, but the module timed out loading. With the internal
retry logic that is added (the mgr retries loading the module several
times before giving up), ETIMEDOUT is more appropriate.

The default max time that a module can take to load is now 5 seconds. To
adjust this time, the following command can be run:
  `ceph config set mgr mgr_module_load_timeout <uint>`

The default cadence at which the mgr retries is every 1 second. To
adjust this time, the following command can be run:
  `ceph config set mgr mgr_module_load_interval <uint>`

Fixes: https://tracker.ceph.com/issues/69012
Signed-off-by: Laura Flores <lflores@ibm.com>
Co-authored-by: Brad Hubbard <bhubbard@redhat.com>
This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.

The config is 0 secs by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
  `ceph config set mgr mgr_module_load_delay <uint>`

A second dev-only config also allows you to specify which
module's loading should be delayed. You may change
this with the following command:
  `ceph config set mgr mgr_module_load_delay_name <module name>`

The workunit added here tests a simulated slow-loading module
scenario and also checks for any problems in the actual
module loading time.

Signed-off-by: Laura Flores <lflores@ibm.com>
@ljflores ljflores requested a review from a team as a code owner January 10, 2025 18:12
@ljflores ljflores changed the title from "Wip handle command returns enotsup prematurely" to "mgr: add retry logic for module loading" Jan 10, 2025
@ljflores
Member Author

check-black: FAIL ✖ in 7.51 seconds

Make check errors seem unrelated.

@ljflores
Member Author

jenkins test make check

<< prefix << "'): retrying in " << interval << " secs (" << retries
<< " retries left)." << dendl;
// Sleep for the retry interval
std::this_thread::sleep_for(std::chrono::seconds(interval));
Contributor

OK, so the mgr starts taking care of delaying requests until the module gets initialized. Seems sane.

Member Author

@ljflores ljflores Jan 15, 2025

Yeah, we basically give the mgr a little more time to retry loading the module. It's not a guarantee that the module loads in time, but if it doesn't load in the 5 seconds we give it, there is something bigger going on, and we will issue an ETIMEDOUT. But this will help account for slower hardware and any kind of latency in the cluster.

Member

This blocks the messenger thread. I don't think we can structure it this way. Note also

https://github.com/ceph/ceph/pull/61325/files#diff-2de410c19e44c5a7ca3ca756e0464f8101e0218002090e36c55b3b68c40040e6R2601-R2619

which also has checks for whether the module is loaded properly. That should be consolidated.

Only the mod_finisher should block waiting for the module to load and there should be a condition variable that wakes the finisher thread rather than unconditionally wait interval seconds.
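
A rough sketch of the condition-variable shape suggested here (illustrative names, not ceph code): the loader signals as soon as the module is up, and the waiting thread blocks with a deadline instead of polling on a fixed interval.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: the module loading thread calls notify_loaded()
// once, and the waiting thread (e.g. the mod_finisher) wakes immediately
// instead of sleeping through the remainder of the retry interval.
struct ModuleLoadWaiter {
  std::mutex lock;
  std::condition_variable cond;
  bool loaded = false;

  // Called by the loader when the module finishes initializing.
  void notify_loaded() {
    {
      std::lock_guard l(lock);
      loaded = true;
    }
    cond.notify_all();
  }

  // Blocks until the module loads or the overall timeout expires;
  // returns true if the module loaded in time.
  bool wait_for_load(std::chrono::seconds timeout) {
    std::unique_lock l(lock);
    return cond.wait_for(l, timeout, [this] { return loaded; });
  }
};
```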

@rzarzynski
Contributor

@batrick: ?

Contributor

@badone badone left a comment

LGTM, but since I wrote some of it, an additional approval would be advisable, I believe :)

@ljflores ljflores requested a review from rzarzynski January 22, 2025 20:16
echo "Testing with module load delay of 6 seconds..."
ceph config set mgr mgr_module_load_delay 6

output=$(ceph mgr fail; ceph orch status 2>&1)
Contributor

ACK.

@rzarzynski
Contributor

Would be good to get a review from @batrick as well. However, let's not wait on this too long. Would several days be fine, @ljflores?

@ljflores
Member Author

@rzarzynski yeah, I will check with Patrick though.

@ljflores
Member Author

Going to add this in a QA batch since we have several approvals. However, if @batrick has any feedback, still feel free to add it.

- cluster_create
# retry every N seconds
- name: mgr_module_load_interval
type: uint
Member

Suggested change:
-  type: uint
+  type: millisecs

Would be better. See for example mds_log_trim_upkeep_interval.

with_legacy: true
# fail if the module fails to load in time
- name: mgr_module_load_timeout
type: uint
Member

ditto

- mgr
with_legacy: true
- name: mgr_module_load_delay
type: uint
Member

ditto

desc: Choose which mgr module to inject a load delay. For testing purposes only.
flags:
- runtime
with_legacy: true
Member

with_legacy is deprecated.

Member Author

Do you mean just for this config option, or for each new config option?

Member

any new config should not have with_legacy: true.

auto& mod_name = py_command.module_name;
uint64_t interval = cct->_conf->mgr_module_load_interval;
uint64_t timeout = cct->_conf->mgr_module_load_timeout;
uint64_t retries = floor(timeout / interval);
Member

floor is unnecessary for integers.
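
A tiny illustration of the point (hypothetical helper name):

```cpp
#include <cstdint>

// For unsigned integer operands, division already truncates toward
// zero, so wrapping the quotient in floor() adds nothing (and forces
// a needless round-trip through double).
uint64_t compute_retries(uint64_t timeout, uint64_t interval) {
  return timeout / interval;  // identical to floor(timeout / interval)
}
```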

"command '" << prefix << "').";
dout(4) << ss.str() << dendl;
cmdctx->reply(-EOPNOTSUPP, ss);
cmdctx->reply(-ETIMEDOUT, ss);
Member

Suggested change:
-  cmdctx->reply(-ETIMEDOUT, ss);
+  cmdctx->reply(-EAGAIN, ss);

For applications using the ceph-mgr provided CLI, we want retries prompted by returning EAGAIN. See for example:

// We do not expect to be called before active modules is up, but
// it's straightforward to handle this case so let's do it.
return -EAGAIN;

There is also a PR relating to that: #60194

Since this code has a built-in expectation that a module may eventually load, we should tell the caller to try again later.
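
A hedged sketch of the caller-side benefit (illustrative names, not ceph code): a generic client loop can key off EAGAIN and retry, whereas ETIMEDOUT reads as a final failure.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical caller-side helper: retry a command only while the mgr
// reports -EAGAIN ("module not loaded yet, try again"), and surface
// any other result immediately.
int run_with_retries(const std::function<int()>& send_command,
                     int max_attempts) {
  int ret = -EAGAIN;
  for (int i = 0; i < max_attempts && ret == -EAGAIN; ++i) {
    ret = send_command();  // stop as soon as we get a non-EAGAIN result
  }
  return ret;
}
```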

Member

Suggest also issuing a cluster log error that a module is failing to load to satisfy commands.

<< py_handler_name << "` to enable it";
dout(4) << ss.str() << dendl;
cmdctx->reply(-EOPNOTSUPP, ss);
return true;
Member

It may also be good to issue a cluster log warning that API calls are being made to the ceph-mgr which require a module to be loaded.


@ronen-fr
Contributor

@ljflores - the QA run that contained this PR should be recreated and retried. My question:
should this PR still be part of it (i.e., was @batrick's comment addressed)?

@ljflores
Member Author

@ronen-fr no, dropping it from the batch

@mchangir
Contributor

Here are my comments as per Brad's request:
Currently, the MGR starts loading modules and simultaneously declares availability. The Mgr::background_init() method only queues a lambda which does Mgr::init() (which queues the module loading lambdas to the finisher) and follows up with sending a beacon to the MON to declare availability. Since module loading takes time, the MGR may not be available to serve all different types of commands by the time the availability is declared. This is the reason for most test/command failures that wait for MGR availability and then immediately issue MGR commands.

My PR #59089 just defers the declaration of MGR availability until after an attempt has been made to load all active modules.

FYI - My PR neither reattempts to load modules that have failed loading once nor does it add any timeout logic to declare eventual failure.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Mar 31, 2025
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
