
Service: Wait for LXD members to be ready after join #1246

Merged: roosterfish merged 3 commits into canonical:main from roosterfish:prevent_sql_race on Mar 10, 2026

@roosterfish (Contributor) commented Mar 2, 2026

We sometimes observe the following error in the pipeline, and it appears to be caused by a race.
This change prevents it when resources are deployed right after the MicroCloud is created:

Error: Failed instance creation: Fetch project database object: Failed to fetch from projects table: Failed to fetch from projects table: Failed to fetch from projects table: sql: transaction has already been committed or rolled back

An equivalent fix was previously added to lxd-ci; see https://github.com/canonical/lxd-ci/pull/577/files.

In addition, the forced start of LXD in the pipeline is moved from reset_system to restore_system so that it always runs.


Copilot AI left a comment


Pull request overview

This PR reduces post-cluster-join race conditions by ensuring LXD is responsive before returning from the join workflow, and adjusts the test harness so LXD is force-started after snapshot restores.

Changes:

  • Add a post-Join readiness wait (via internal/ready) before returning from LXDService.Join.
  • Move the CI “force LXD to start” step from reset_system to restore_system so it runs after snapshot restore.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/includes/microcloud.sh | Moves the “force LXD startup” step to the snapshot restore path to avoid missing it when SNAPSHOT_RESTORE=1. |
| service/lxd.go | Adds a bounded wait for LXD readiness after joining a cluster to avoid early follow-up operations hitting an unready member. |


This allows reusing the same const inside the service package too.

Signed-off-by: Julian Pelizäus <julian.pelizaeus@canonical.com>
This prevents the following error when resources are deployed right after creating the MicroCloud:
Error: Failed instance creation: Fetch project database object: Failed to fetch from projects table: Failed to fetch from projects table: Failed to fetch from projects table: sql: transaction has already been committed or rolled back

Signed-off-by: Julian Pelizäus <julian.pelizaeus@canonical.com>
When SNAPSHOT_RESTORE=1, the reset_systems func returns early without running reset_system, so the forced start
of LXD is never triggered.
Perform this action in restore_system instead so it always runs, regardless of whether SNAPSHOT_RESTORE is set.

Signed-off-by: Julian Pelizäus <julian.pelizaeus@canonical.com>
@roosterfish roosterfish marked this pull request as ready for review March 10, 2026 08:34
@simondeziel (Member) left a comment


LGTM, thanks

@roosterfish roosterfish merged commit b2058bd into canonical:main Mar 10, 2026
56 of 57 checks passed
@roosterfish roosterfish deleted the prevent_sql_race branch March 10, 2026 12:48