mgr: declare availability only after module loading #59089
Conversation
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

reopening for merge

Bumping this PR up to seek reviews.

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@vshankar fyi - this PR helps with the …

I've replied here: #61325 (comment)

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Force-pushed from 69bdc83 to 8c461ff
trivial rebase to take PR out of stale status
@mchangir I have been evaluating this solution against the slow-module-loading issue I've been working on here (https://tracker.ceph.com/issues/71631), and I think some pieces are missing from this solution.
PTAL at #63859, which has some similar components to your solution, but I've added some extra logic to ensure that all modules have started up before we declare availability. Let me know if you have any thoughts, suggestions, etc. I'm curious about your opinion on how this solution interacts with yours, i.e. whether these solutions are solving the same problem and one can be closed, or whether we somehow need both.
When I tested your solution, I used the following reproducer, which injects a time-consuming bottleneck into the balancer module so that its startup takes longer:
```diff
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index 476304275c1..d2dfa01e99a 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -771,6 +771,11 @@ class Module(MgrModule):
             if pg_upmap_activity:
                 self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
             self.optimizing = False
+            # causing a bottleneck
+            for i in range(0, 1000):
+                for j in range(0, 1000):
+                    x = i + j
+                    self.log.debug("hitting the bottleneck in the balancer module")
             self.log.debug('Sleeping for %d', sleep_interval)
             self.event.wait(sleep_interval)
             self.event.clear()
```

After injecting the bottleneck, I ran this command on a vstart cluster, which shows that the mgr still declared availability too early with this patch:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show --debug_mgrc=20
2025-06-11T18:44:49.771+0000 7f2ed6ffd640 20 mgrc handle_mgr_map mgrmap(e 6)
2025-06-11T18:44:49.772+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Got map version 6
2025-06-11T18:44:49.772+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Active mgr is now
2025-06-11T18:44:49.772+0000 7f2ed6ffd640 4 mgrc reconnect No active mgr available yet
2025-06-11T18:44:50.117+0000 7f2ef4805640 20 mgrc start_command cmd: [{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]
2025-06-11T18:44:50.117+0000 7f2ef4805640 5 mgrc start_command no mgr session (no running mgr daemon?), waiting
2025-06-11T18:44:54.116+0000 7f2ed6ffd640 20 mgrc handle_mgr_map mgrmap(e 7)
2025-06-11T18:44:54.116+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Got map version 7
2025-06-11T18:44:54.116+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Active mgr is now
2025-06-11T18:44:54.116+0000 7f2ed6ffd640 4 mgrc reconnect No active mgr available yet
2025-06-11T18:44:56.132+0000 7f2ed6ffd640 20 mgrc handle_mgr_map mgrmap(e 8)
2025-06-11T18:44:56.132+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Got map version 8
2025-06-11T18:44:56.132+0000 7f2ed6ffd640 4 mgrc handle_mgr_map Active mgr is now [v2:127.0.0.1:6800/3720421852,v1:127.0.0.1:6801/3720421852]
2025-06-11T18:44:56.132+0000 7f2ed6ffd640 4 mgrc reconnect Starting new session with [v2:127.0.0.1:6800/3720421852,v1:127.0.0.1:6801/3720421852]
2025-06-11T18:44:56.132+0000 7f2ed6ffd640 10 mgrc reconnect resending 0 (cli)
2025-06-11T18:44:56.135+0000 7f2ed6ffd640 20 mgrc handle_command_reply tid 0 r -95
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```
In the mgr log, we can see that `mgr send_beacon active` occurred before the telemetry module had finished loading, which resulted in the ENOTSUP error:
```
$ cat out/mgr.x.log | grep "send_beacon active\|I am now activating\|telemetry show"
2025-06-11T18:46:11.817+0000 7f8f5fc20200 20 mgr[py] loaded command telemetry show name=channels,req=false,n=N,type=CephString
2025-06-11T18:46:11.817+0000 7f8f5fc20200 20 mgr[py] loaded command telemetry show-device
2025-06-11T18:46:11.817+0000 7f8f5fc20200 20 mgr[py] loaded command telemetry show-all
2025-06-11T18:46:15.905+0000 7f8f5adb8640 1 mgr handle_mgr_map I am now activating
2025-06-11T18:46:17.845+0000 7f8f57db2640 20 mgr send_beacon active
2025-06-11T18:46:17.921+0000 7f8cbf60d640 1 -- [v2:127.0.0.1:6800/976509187,v1:127.0.0.1:6801/976509187] <== client.4402 127.0.0.1:0/273425155 1 ==== mgr_command(tid 0: {"prefix": "telemetry show", "target": ["mon-mgr", ""]}) ==== 79+0+0 (secure 0 0 0) 0x55cd1423a820 con 0x55cd14711400
2025-06-11T18:46:17.921+0000 7f8cbf60d640 10 mgr.server _handle_command decoded-size=2 prefix=telemetry show
2025-06-11T18:46:17.922+0000 7f8cbf60d640 20 is_capable service=py module=telemetry command=telemetry show read addr - on cap allow *
2025-06-11T18:46:17.922+0000 7f8cbf60d640 0 log_channel(audit) log [DBG] : from='client.4402 -' entity='client.admin' cmd=[{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]: dispatch
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it) -- 0x55cd14311380 con 0x55cd14711400
2025-06-11T18:46:18.503+0000 7f8f58db4640 10 log_client will send 2025-06-11T18:46:17.923865+0000 mgr.x (mgr.4396) 1 : audit [DBG] from='client.4402 -' entity='client.admin' cmd=[{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]: dispatch
2025-06-11T18:46:19.847+0000 7f8f57db2640 20 mgr send_beacon active
2025-06-11T18:46:21.847+0000 7f8f57db2640 20 mgr send_beacon active
```
In my patch, I adjusted the solution so the mgr declares availability only after `pending_modules` is empty. Here, the mgr waits to declare availability properly, and the command succeeds:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show --debug_mgrc=20
2025-06-11T18:48:32.969+0000 7f67277fe640 20 mgrc handle_mgr_map mgrmap(e 19)
2025-06-11T18:48:32.969+0000 7f67277fe640 4 mgrc handle_mgr_map Got map version 19
2025-06-11T18:48:32.969+0000 7f67277fe640 4 mgrc handle_mgr_map Active mgr is now
2025-06-11T18:48:32.969+0000 7f67277fe640 4 mgrc reconnect No active mgr available yet
2025-06-11T18:48:33.317+0000 7f674d6b9640 20 mgrc start_command cmd: [{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]
2025-06-11T18:48:33.317+0000 7f674d6b9640 5 mgrc start_command no mgr session (no running mgr daemon?), waiting
2025-06-11T18:48:37.368+0000 7f67277fe640 20 mgrc handle_mgr_map mgrmap(e 20)
2025-06-11T18:48:37.368+0000 7f67277fe640 4 mgrc handle_mgr_map Got map version 20
2025-06-11T18:48:37.368+0000 7f67277fe640 4 mgrc handle_mgr_map Active mgr is now
2025-06-11T18:48:37.368+0000 7f67277fe640 4 mgrc reconnect No active mgr available yet
2025-06-11T18:48:39.378+0000 7f67277fe640 20 mgrc handle_mgr_map mgrmap(e 21)
2025-06-11T18:48:39.378+0000 7f67277fe640 4 mgrc handle_mgr_map Got map version 21
2025-06-11T18:48:39.378+0000 7f67277fe640 4 mgrc handle_mgr_map Active mgr is now
2025-06-11T18:48:39.378+0000 7f67277fe640 4 mgrc reconnect No active mgr available yet
2025-06-11T18:48:50.818+0000 7f67277fe640 20 mgrc handle_mgr_map mgrmap(e 22)
2025-06-11T18:48:50.818+0000 7f67277fe640 4 mgrc handle_mgr_map Got map version 22
2025-06-11T18:48:50.818+0000 7f67277fe640 4 mgrc handle_mgr_map Active mgr is now [v2:127.0.0.1:6800/3826094627,v1:127.0.0.1:6801/3826094627]
2025-06-11T18:48:50.818+0000 7f67277fe640 4 mgrc reconnect Starting new session with [v2:127.0.0.1:6800/3826094627,v1:127.0.0.1:6801/3826094627]
2025-06-11T18:48:50.818+0000 7f67277fe640 10 mgrc reconnect resending 0 (cli)
2025-06-11T18:48:50.821+0000 7f67277fe640 20 mgrc handle_command_reply tid 0 r 0
Telemetry is off. Please consider opting-in with `ceph telemetry on`.
Preview sample reports with `ceph telemetry preview`.
```
In the mgr log, we can see that the mgr sends a "starting" beacon to indicate that it is still starting, and it only sends the plain "active" beacon after `pending_modules` is empty. By then, the command is supported:
```
$ cat out/mgr.x.log | grep "send_beacon active\|I am now activating\|telemetry show"
2025-06-11T18:48:33.278+0000 7f8f6f753200 20 mgr[py] loaded command telemetry show name=channels,req=false,n=N,type=CephString
2025-06-11T18:48:33.278+0000 7f8f6f753200 20 mgr[py] loaded command telemetry show-device
2025-06-11T18:48:33.278+0000 7f8f6f753200 20 mgr[py] loaded command telemetry show-all
2025-06-11T18:48:37.368+0000 7f8f6aed3640 1 mgr handle_mgr_map I am now activating
2025-06-11T18:48:39.312+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:41.313+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:43.314+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:45.315+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:47.316+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:49.316+0000 7f8f67ecd640 20 mgr send_beacon active (starting)
2025-06-11T18:48:50.756+0000 7f8ccff3c640 20 mgr send_beacon active
2025-06-11T18:48:50.819+0000 7f8ccf73b640 1 -- [v2:127.0.0.1:6800/3826094627,v1:127.0.0.1:6801/3826094627] <== client.4469 127.0.0.1:0/3847426830 1 ==== mgr_command(tid 0: {"prefix": "telemetry show", "target": ["mon-mgr", ""]}) ==== 79+0+0 (secure 0 0 0) 0x5645b30ecb60 con 0x5645b3936400
2025-06-11T18:48:50.819+0000 7f8ccf73b640 10 mgr.server _handle_command decoded-size=2 prefix=telemetry show
2025-06-11T18:48:50.821+0000 7f8ccf73b640 20 is_capable service=py module=telemetry command=telemetry show read addr - on cap allow *
2025-06-11T18:48:50.821+0000 7f8ccf73b640 0 log_channel(audit) log [DBG] : from='client.4469 -' entity='client.admin' cmd=[{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]: dispatch
2025-06-11T18:48:50.821+0000 7f8ccf73b640 10 mgr.server _handle_command passing through command 'telemetry show' size 2
2025-06-11T18:48:50.821+0000 7f8cba611640 10 mgr.server operator() dispatching command 'telemetry show' size 2
2025-06-11T18:48:50.964+0000 7f8f68ecf640 10 log_client will send 2025-06-11T18:48:50.822638+0000 mgr.x (mgr.4461) 1 : audit [DBG] from='client.4469 -' entity='client.admin' cmd=[{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]: dispatch
2025-06-11T18:48:51.317+0000 7f8f67ecd640 20 mgr send_beacon active
2025-06-11T18:48:51.821+0000 7f8f6aed3640 10 log_client logged 2025-06-11T18:48:50.822638+0000 mgr.x (mgr.4461) 1 : audit [DBG] from='client.4469 -' entity='client.admin' cmd=[{"prefix": "telemetry show", "target": ["mon-mgr", ""]}]: dispatch
2025-06-11T18:48:53.318+0000 7f8f67ecd640 20 mgr send_beacon active
```
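The gating behavior visible in the beacons above (report "active (starting)" until the set of pending modules drains, then plain "active") can be sketched in Python. This is a toy model with illustrative names only; the real ceph-mgr beacon logic is implemented in C++:

```python
import threading

class MgrAvailability:
    """Toy model: declare plain 'active' only once every module has
    finished loading. All names here are illustrative, not the actual
    ceph-mgr API."""

    def __init__(self, modules):
        self.pending_modules = set(modules)  # modules still starting up
        self.lock = threading.Lock()

    def module_finished_loading(self, name):
        # Called (conceptually) when a module's startup completes.
        with self.lock:
            self.pending_modules.discard(name)

    def beacon_state(self):
        # Mirrors the log lines above: "active (starting)" while any
        # module is still pending, plain "active" afterwards.
        with self.lock:
            return "active" if not self.pending_modules else "active (starting)"

mgr = MgrAvailability({"balancer", "telemetry"})
print(mgr.beacon_state())                 # active (starting)
mgr.module_finished_loading("balancer")
print(mgr.beacon_state())                 # active (starting)
mgr.module_finished_loading("telemetry")
print(mgr.beacon_state())                 # active
```

With this gating, a client command like `telemetry show` only ever reaches a mgr whose command table already includes the module's commands.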
@ljflores Thanks for validating my PR. From what I can remember, I came across a situation where a module may fail to load for some reason, and we have no guarantee over that. A module may also be deadlocked, but I guess we haven't seen something like that so far. So, we could drop my PR in favor of yours.
@ljflores I'm also not aware of the sync/async nature of module loading. I'd presumed that module loading is synchronous, with the list of active modules being iterated over during loading, i.e. the modules are loaded one after the other, each only after the result of loading the previous one is known to the loader.
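The sync/async distinction matters here: if each module's startup runs on its own thread, the loader loop returning says nothing about whether slow modules are ready. A hedged sketch (hypothetical code, not the actual mgr implementation) of why the synchronous assumption can fail:

```python
import threading

def load_modules_concurrently(module_inits):
    """Hypothetical sketch: start each module on its own thread. The
    loader loop finishes immediately, while slow modules keep starting
    in the background."""
    ready = set()
    threads = []
    for name, startup in module_inits.items():
        t = threading.Thread(target=lambda n=name, s=startup: (s(), ready.add(n)))
        t.start()
        threads.append(t)
    return ready, threads

# An Event stands in for a slow startup (e.g. the balancer bottleneck above).
gate = threading.Event()
inits = {"fast": lambda: None, "slow": gate.wait}
ready, threads = load_modules_concurrently(inits)

threads[0].join()           # the fast module is done
print("slow" in ready)      # False: loader loop returned, slow module still starting
gate.set()                  # slow startup finishes
for t in threads:
    t.join()
print(sorted(ready))        # ['fast', 'slow']
```

This is why a solution that tracks per-module completion (an explicit pending set) is more robust than inferring readiness from the loader loop having iterated over all modules.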
Fixes: https://tracker.ceph.com/issues/67230
Signed-off-by: Milind Changire <mchangir@redhat.com>
Force-pushed from 8c461ff to a97cd6f
Moved …
@ljflores I have now moved setting … The only other major difference in my PR is that it doesn't confirm if … If you are up to it, you could give it a try.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!