Cluster Toolkit - new module for creating Artifact Registries#3639

Merged
tpdownes merged 15 commits into GoogleCloudPlatform:develop from nagconsulting:develop
May 1, 2025

Conversation

@scott-nag scott-nag commented Feb 5, 2025

Collaborator

Cluster Toolkit Updates

Introduction of a module to handle Artifact Registry operations: README

OFE Updates

  • Re-enabled the latest Terraform (1.4 was being used temporarily due to a cluster deletion bug)
  • Added a Private Service Access option to the VPC creation page
  • Added views for displaying Artifact Registry repositories that are created alongside clusters, as well as placeholder views that will soon be used to handle container operations
  • Updated the OFE cluster creation page to include a section for creating Artifact Registries per the TF module
  • Fixed an issue with extraction of the Slurm controller location, as it had changed in the TFState file
  • Updated the cluster page to remove the deprecated enable_smt field and replace it with advanced_machine_features.threads_per_core
  • Updated Python from 3.8 to 3.12
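
As a rough illustration of the enable_smt replacement described above, and not the OFE's actual code, the mapping could look something like the sketch below. The function name `migrate_enable_smt` and the dictionary layout are hypothetical, and it assumes that 1 thread per core corresponds to SMT disabled and 2 to SMT enabled.

```python
from typing import Any, Dict


def migrate_enable_smt(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of settings with the deprecated enable_smt flag
    expressed as advanced_machine_features.threads_per_core.

    Hypothetical helper: the key names mirror the fields named in this PR,
    but the surrounding data layout is an assumption for illustration.
    """
    migrated = dict(settings)
    if "enable_smt" not in migrated:
        # Nothing to migrate; leave the settings untouched.
        return migrated

    enable_smt = migrated.pop("enable_smt")
    # Copy any existing advanced features so the input dict is not mutated.
    features = dict(migrated.get("advanced_machine_features") or {})
    # Assumption: SMT off means 1 thread per core, SMT on means 2.
    # An explicitly set threads_per_core takes precedence over the old flag.
    features.setdefault("threads_per_core", 2 if enable_smt else 1)
    migrated["advanced_machine_features"] = features
    return migrated
```

Letting an existing threads_per_core value win over the old flag is a design choice in this sketch; the real form handling may resolve conflicts differently.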

I have one question about pre-commit and Terraform. The variables file here contains validation rules that I have commented out because pre-commit rejected them with errors like this: The condition for variable "repo_password" can only refer to the variable itself, using var.repo_password. The rules worked when I tested them locally. Is there a different or preferred way to re-add them?

Any questions please feel free to give me a shout. Many thanks

@scott-nag scott-nag changed the title from "Develop" to "Cluster Toolkit - new module for creating Artifact Registries" Feb 5, 2025
@scott-nag scott-nag added the release-new-modules Added to release notes under the "New Modules" heading. label Feb 17, 2025
@scott-nag scott-nag requested review from a team and samskillman as code owners March 27, 2025 18:28
@scott-nag

Collaborator Author

Hello - bumping this PR for visibility as I have just added a number of updates:

  • Update: Added a new ‘Containers’ section to OFE. This is primarily used to build containers in ‘standard’ repositories using Cloud Build. Builds are handled by an update to the existing c2 / PubSub functionality
  • Update: Added new Ansible role to configure clusters to support these containerised Enroot / Pyxis jobs
  • Update: Added option to ‘Enable Private Google Access’ when creating a VPC (required for Artifact Registry / container features)
  • Update: Added functionality to create a container-based ‘Application’, and a new section to facilitate these containerised Slurm ‘Jobs’ too
  • Update: Simplified the ‘Applications’ form by using Django ‘crispy forms’ package. This means that a lot of the manual Bootstrap HTML/CSS styling can be removed from the front end templates
  • Fixed: The Ansible role used to configure clusters would sometimes fail to retrieve a package, causing cluster deployment to fail. Added ‘retries’ to the task, which resolved the issue
  • Fixed: A delay in the propagation of a freshly created Service Account would occasionally cause a failure when applying roles. Added a mechanism to retry on failure here too
  • Fixed: Blueprint generation issue where the ‘compute_startup_scripts’ field in the v6-controller module was deprecated in favour of a similar ‘startup-script’ field in the ‘v6-nodeset’ module
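
The two retry fixes above follow a common pattern for operations that fail transiently, such as applying roles before a new Service Account has propagated. The following is a minimal sketch of that pattern with exponential backoff, not the code from this PR. The name `retry_with_backoff` and its parameters are hypothetical, and the Ansible fix itself uses the task-level ‘retries’ setting rather than Python.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts are used up.

    Hypothetical helper for illustration. The delay doubles after each
    failure: base_delay, 2 * base_delay, 4 * base_delay, and so on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Re-raise the original error once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

    # Not reachable: the loop either returns a result or raises.
    raise AssertionError("unreachable")
```

A caller would wrap the role-binding call in a zero-argument function or lambda and pass it as `operation`. Catching every exception keeps the sketch short; real code would more likely retry only on the specific error that indicates the account is not yet visible.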

Comment thread community/modules/container/artifact-registry/variables.tf Outdated
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/variables.tf
mr0re1 previously approved these changes Apr 3, 2025
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/variables.tf Outdated
Comment thread community/modules/scheduler/schedmd-slurm-gcp-v6-controller/slurm_files.tf Outdated
Comment thread community/modules/container/artifact-registry/variables.tf
@tpdownes tpdownes self-requested a review May 1, 2025 22:57
@tpdownes tpdownes self-assigned this May 1, 2025

@tpdownes tpdownes left a comment

Contributor

We don't yet have a passing test, but the existing implementation has known problems and manual testing by NAG suggests that the problem may be in our test rather than in the solution itself. We believe customer supportability is best served by merging the solution and doing further evaluation of the test.

@tpdownes tpdownes enabled auto-merge May 1, 2025 22:59
@RachaelSTamakloe RachaelSTamakloe self-requested a review May 1, 2025 23:01
@tpdownes tpdownes merged commit a1de85f into GoogleCloudPlatform:develop May 1, 2025
11 of 68 checks passed
Labels

release-new-modules Added to release notes under the "New Modules" heading.

4 participants