fix: Add waitready command to verify cluster ready#683

Merged
johnramsden merged 8 commits into canonical:main from johnramsden:john/CEPH-1590-wait-ready
Mar 18, 2026

Conversation

@johnramsden
Member

@johnramsden johnramsden commented Feb 27, 2026

Description

When an operator attempts to do something before the cluster is up, they can receive unexpected failures because bootstrap has not finished or microcluster is not yet available. This is particularly problematic in CI or scripting.

Add an additional subcommand (similar to lxd waitready): https://manpages.debian.org/unstable/lxd/lxd.waitready.1

To confirm the cluster is up, we check that the microcluster daemon is ready and that Ceph is ready (ceph -s).

On failure, for example if we haven't bootstrapped yet, we get a message like the following:

```
microceph waitready --timeout 30
Error: ceph not ready: timed out waiting for Ceph to become ready: context deadline exceeded
```

Running the following, you should expect waitready to block until bootstrap completes, and status should then succeed:

```
sudo microceph cluster bootstrap &
[1] 35966
sudo microceph waitready
sudo microceph status
MicroCeph deployment summary:
- microceph (10.56.203.112)
  Services: mds, mgr, mon
  Disks: 0
```

Also add --storage flag:

When --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements.

The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.

Update GetOSDPools to accept a context, allowing us to reuse this functionality.
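The required-count rule can be sketched as follows; the Pool type and function name here are hypothetical stand-ins for the types GetOSDPools actually returns:

```go
package main

import "fmt"

// Pool mirrors just the field this rule needs from a Ceph OSD pool listing.
type Pool struct {
	Name string
	Size int // replication size of the pool
}

// requiredOSDCount sketches the --storage readiness rule: the largest
// replication size across all pools, falling back to osd_pool_default_size
// when no pools exist yet.
func requiredOSDCount(pools []Pool, defaultPoolSize int) int {
	if len(pools) == 0 {
		return defaultPoolSize
	}
	maxSize := 0
	for _, p := range pools {
		if p.Size > maxSize {
			maxSize = p.Size
		}
	}
	return maxSize
}

func main() {
	pools := []Pool{{Name: ".mgr", Size: 3}, {Name: "rbd", Size: 2}}
	fmt.Println(requiredOSDCount(pools, 3)) // 3 (largest pool size)
	fmt.Println(requiredOSDCount(nil, 3))   // 3 (falls back to the default)
}
```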

Fixes #653
Fixes: #683

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

Added tests demonstrating waiting and timeout prior to bootstrap, and waiting succeeding post bootstrap.

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@johnramsden
Member Author

One note I have is that I'm not sure if ceph -s is completely sufficient, or whether there's anything else we want to wait on.

Collaborator

@sabaini sabaini left a comment

Hey @johnramsden thank you, lgtm in general, two comments inline

Ensures that we do not wait on ceph -s indefinitely

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
The value should be non-negative, so using an unsigned type is more correct and gives us the expected error.

Set a custom parsing error:

```
Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax: timeout must be a positive number of seconds
```

Rather than the default:

```
Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax
```
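The wrapped error above can be produced by parsing the flag value as an unsigned integer and appending a hint. parseTimeout is a hypothetical sketch of that idea, not the actual flag code:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTimeout sketches the flag parsing described above: the value is
// parsed as an unsigned integer so negatives are rejected, and the default
// strconv error is wrapped with a friendlier hint.
func parseTimeout(s string) (uint64, error) {
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("%w: timeout must be a positive number of seconds", err)
	}
	return v, nil
}

func main() {
	if _, err := parseTimeout("-1"); err != nil {
		// strconv.ParseUint: parsing "-1": invalid syntax: timeout must be a positive number of seconds
		fmt.Println(err)
	}
	v, _ := parseTimeout("30")
	fmt.Println(v) // 30
}
```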

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden requested a review from sabaini March 2, 2026 21:44
@UtkarshBhatthere
Contributor

One note I have is that I'm not sure if ceph -s is completely sufficient, or whether there's anything else we want to wait on.

@johnramsden I can see an operator wanting to wait for two things: 1. Ceph cluster bootstrap (ceph -s works), and 2. storage ready (OSDs enrolled, at least 1 for rf==1, or 3 otherwise).

Contributor

@UtkarshBhatthere UtkarshBhatthere left a comment

overall lgtm, that being said, imo a --storage flag would make this even better. Since microceph has opinions on what a minimum storage setup should be, it should wait until enough OSDs to spawn a storage pool are available (i.e. one OSD if rf==1, and 3 OSDs if rf==3). IDK what the criteria should be for EC pools.

Collaborator

@sabaini sabaini left a comment

Hey @johnramsden thanks for the update, some nits below

johnramsden and others added 2 commits March 3, 2026 10:32
Leave better comment mentioning 'or zero'
Connect all the relevant interfaces
Leave a comment regarding bootstrap has not happened

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
When --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements.

The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.

Update GetOSDPools to accept a context allowing us to reuse functionality

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@johnramsden
Member Author

overall lgtm, that being said, imo a --storage flag would make this even better. Since microceph has opinions on what a minimum storage setup should be, it should wait until enough OSDs to spawn a storage pool are available (i.e. one OSD if rf==1, and 3 OSDs if rf==3). IDK what the criteria should be for EC pools.

Thanks, this is a great suggestion. I added it; I would appreciate it if you could review.

I did touch a different part of the codebase that I thought I could reuse, just adding a context. I do not believe there are any implications other than a timeout potentially happening if a client disconnects.

Collaborator

@sabaini sabaini left a comment

I'm +1, interested to hear @UtkarshBhatthere thoughts

Move getRequiredOSDCount() inside the polling loop so it retries

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Contributor

@UtkarshBhatthere UtkarshBhatthere left a comment

I am happy with the overall feature; left a few comments on testing and code organization.

@johnramsden johnramsden linked an issue Mar 17, 2026 that may be closed by this pull request
johnramsden added a commit to johnramsden/microceph that referenced this pull request Mar 17, 2026
Move WaitForCephReady to dedicated waitready.go (not monitor-specific),
move OSD wait functions (WaitForOSDsReady, getRequiredOSDCount,
getUpOSDCount, getDefaultPoolSize) to osd.go where they belong.

Split install_microceph into install/bootstrap phases and combine
pre/post bootstrap waitready tests into a single test_waitready function
that exercises both scenarios without a purge/reinstall cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden
Member Author

Thanks for the review @UtkarshBhatthere, going forward I will be more careful about organization.

Let me know how you feel about the changes.

  • microceph/ceph/waitready.go (new) — WaitForCephReady moved here from monitor.go
  • microceph/ceph/waitready_test.go (new) — WaitForCephReady tests moved here
  • microceph/ceph/osd.go — added WaitForOSDsReady, getRequiredOSDCount, getUpOSDCount,
    getDefaultPoolSize
  • microceph/ceph/osd_test.go — added OSD wait tests
  • microceph/ceph/monitor.go — removed waitready/OSD functions
  • microceph/ceph/monitor_test.go — deleted
  • tests/scripts/actionutils.sh — split install_microceph into install_microceph +
    bootstrap_microceph + install_and_bootstrap_microceph; replaced separate pre/post bootstrap
    tests with combined test_waitready
  • .github/workflows/tests.yml — consolidated waitready test steps, removed purge/reinstall
    block, renamed install_microceph -> install_and_bootstrap_microceph for all other callers

Contributor

@UtkarshBhatthere UtkarshBhatthere left a comment

LGTM, the PR is gtg. Thanks @johnramsden

@johnramsden johnramsden merged commit 941640f into canonical:main Mar 18, 2026
44 of 45 checks passed
@johnramsden johnramsden deleted the john/CEPH-1590-wait-ready branch March 18, 2026 17:08
johnramsden added a commit to johnramsden/microceph that referenced this pull request Mar 21, 2026
Development

Successfully merging this pull request may close these issues:

  • MicroCeph Quality Improvements
  • microceph wait-ready: a way to wait for the microceph cluster untill it's up

3 participants