fix: Add waitready command to verify cluster ready#683
johnramsden merged 8 commits into canonical:main
Conversation
When an operator attempts to do something before the cluster is up, they can receive unexpected failures because bootstrap is not finished or microcluster is not yet available. This can be particularly problematic in CI or scripting.

Add an additional subcommand (similar to `lxd waitready`): https://manpages.debian.org/unstable/lxd/lxd.waitready.1

To confirm the cluster is up, we check that the microcluster daemon is ready and that Ceph is ready (`ceph -s`).

On failure — for example, if we haven't bootstrapped — we get a message like the following:

```
microceph waitready --timeout 30
Error: ceph not ready: timed out waiting for Ceph to become ready: context deadline exceeded
```

Running the following, you should expect it to wait before running status, and status should succeed:

```
sudo microceph cluster bootstrap &
sudo microceph waitready
sudo microceph status

[1] 35966
MicroCeph deployment summary:
- microceph (10.56.203.112)
  Services: mds, mgr, mon
  Disks: 0
```

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
One note I have is I'm not sure if …
sabaini
left a comment
Hey @johnramsden thank you, lgtm in general, two comments inline
Ensures that we do not wait on ceph -s indefinitely Signed-off-by: John Ramsden <john.ramsden@canonical.com>
The value should be non-negative, so using an unsigned value is more correct and gives us the expected error.

Set a custom parsing error:

```
Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax: timeout must be a positive number of seconds
```

Rather than the default:

```
Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax
```

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
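Wrapping `strconv.ParseUint`'s error reproduces the message shape shown above. A minimal sketch, assuming a hypothetical `parseTimeout` helper (the real code likely hooks this into the flag parser rather than a free function):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTimeout rejects negative or malformed --timeout values with a clearer
// message than strconv's default. Illustrative only; not microceph's code.
func parseTimeout(s string) (uint64, error) {
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		// Wrap the strconv error so both messages appear, as in the PR output.
		return 0, fmt.Errorf("%w: timeout must be a positive number of seconds", err)
	}
	return v, nil
}

func main() {
	if _, err := parseTimeout("-1"); err != nil {
		fmt.Println(err)
	}
	v, _ := parseTimeout("30")
	fmt.Println(v)
}
```

`ParseUint` fails on `"-1"` with `invalid syntax`, so the unsigned type alone rejects negatives; the wrap just makes the reason explicit to the operator.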
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden I can see an operator wanting to wait for two things: 1. Ceph cluster bootstrap (`ceph -s` works), and 2. Storage ready (OSDs enrolled; at least 1 for rf==1, or 3 otherwise).
UtkarshBhatthere
left a comment
Overall lgtm. That being said, imo a --storage flag would make this even better. Since microceph has opinions on what a minimum storage setup should be, it should wait until enough OSDs are available to spawn a storage pool (i.e. one OSD if rf==1, and 3 OSDs if rf==3). IDK what the criteria should be for EC pools.
sabaini
left a comment
Hey @johnramsden thanks for the update, some nits below
- Leave a better comment mentioning 'or zero'
- Connect all the relevant interfaces
- Leave a comment regarding bootstrap not having happened

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
When --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements. The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.

Update GetOSDPools to accept a context, allowing us to reuse functionality.

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
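The required-OSD calculation described in this commit can be sketched as a pure function. The `Pool` type and `requiredOSDCount` name here are stand-ins for illustration, not microceph's actual types:

```go
package main

import "fmt"

// Pool is a minimal stand-in for microceph's OSD pool type.
type Pool struct {
	Name string
	Size int // replication size of the pool
}

// requiredOSDCount returns how many OSDs must be up before --storage
// readiness is satisfied: the maximum replication size across existing
// pools, falling back to the osd_pool_default_size value when no pools
// exist yet. Sketch of the behaviour described in the commit message.
func requiredOSDCount(pools []Pool, defaultSize int) int {
	if len(pools) == 0 {
		return defaultSize
	}
	maxSize := 0
	for _, p := range pools {
		if p.Size > maxSize {
			maxSize = p.Size
		}
	}
	return maxSize
}

func main() {
	// No pools yet: fall back to osd_pool_default_size.
	fmt.Println(requiredOSDCount(nil, 3))
	// Mixed pool sizes: the largest replication size wins.
	fmt.Println(requiredOSDCount([]Pool{{".mgr", 1}, {"rbd", 3}}, 2))
}
```

Because this is re-evaluated inside the polling loop (see the follow-up commit), pools created while waiting raise the requirement on the next iteration.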
Thanks, this is a great suggestion. I added it; I would appreciate it if you could review. I did touch a different part of the codebase that I thought I could reuse, just adding a context. I do not believe there should be any implications other than a timeout potentially happening if a client disconnects.
sabaini
left a comment
I'm +1, interested to hear @UtkarshBhatthere thoughts
Move getRequiredOSDCount() inside the polling loop so it retries Signed-off-by: John Ramsden <john.ramsden@canonical.com>
UtkarshBhatthere
left a comment
I am happy with the overall feature; left a few comments on testing and code organization.
Move WaitForCephReady to a dedicated waitready.go (not monitor-specific); move the OSD wait functions (WaitForOSDsReady, getRequiredOSDCount, getUpOSDCount, getDefaultPoolSize) to osd.go where they belong.

Split install_microceph into install/bootstrap phases and combine the pre/post-bootstrap waitready tests into a single test_waitready function that exercises both scenarios without a purge/reinstall cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Thanks for the review @UtkarshBhatthere, going forward I will be more careful about organization. Let me know how you feel about the changes.
UtkarshBhatthere
left a comment
LGTM, the PR is gtg. Thanks @johnramsden
# Description

When an operator attempts to do something before the cluster is up, they can receive unexpected failures because bootstrap is not finished or microcluster is not yet available. This can be particularly problematic in CI or scripting.

Add an additional subcommand (similar to `lxd waitready`): https://manpages.debian.org/unstable/lxd/lxd.waitready.1

To confirm the cluster is up, we check that the microcluster daemon is ready and that Ceph is ready (`ceph -s`).

On failure — for example, if we haven't bootstrapped — we get a message like the following:

```
microceph waitready --timeout 30
Error: ceph not ready: timed out waiting for Ceph to become ready: context deadline exceeded
```

Running the following, you should expect it to wait before running status, and status should succeed:

```
sudo microceph cluster bootstrap &
sudo microceph waitready
sudo microceph status

[1] 35966
MicroCeph deployment summary:
- microceph (10.56.203.112)
  Services: mds, mgr, mon
  Disks: 0
```

Also add a --storage flag: when --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements. The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.

Update GetOSDPools to accept a context, allowing us to reuse functionality.

Fixes canonical#653
Fixes: canonical#683

## Type of change

- Bug fix (non-breaking change which fixes an issue)

## How has this been tested?

Added tests demonstrating waiting and timeout prior to bootstrap, and waiting succeeding post bootstrap.

## Contributor checklist

Please check that you have:

- [x] self-reviewed the code in this PR
- [x] added code comments, particularly in less straightforward areas
- [x] checked and added or updated relevant documentation
- [x] checked and added or updated relevant release notes
- [x] added tests to verify effectiveness of this change

---------

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>