feat: add ML Diagnostics module and integration for GKE TPU blueprints #5350
Summary of Changes (Gemini Code Assist)
This pull request significantly enhances the Cluster Toolkit by introducing a new blueprint that streamlines the deployment of GKE clusters tailored for high-performance AI/ML tasks using TPU v6e accelerators. It simplifies the complex setup of networking, IAM, and Kubernetes scheduling components, while also pre-integrating Google Cloud ML Diagnostics to ensure workloads are immediately observable and diagnosable. This change aims to reduce operational overhead and accelerate the development and deployment of machine learning applications on Google Cloud.
Code Review
This pull request introduces a new blueprint for ML Diagnostics on GKE with TPUs. It adds a new example, a new mldiagnostics Terraform module, and supporting changes to the kubectl-apply and gke-cluster modules. The changes are well-structured and the new wait output in kubectl-apply is a good pattern for explicit dependencies. However, I've identified a few issues, primarily in the new mldiagnostics module related to incorrect namespace handling and dependency definitions which could cause deployment failures. I've also found some inconsistencies and a typo in the new example's documentation and sample job. My review includes detailed comments and code suggestions to address these points.
Please add a PR description
/gcbrun
LAVEEN left a comment
LGTM.
Thanks for addressing the review comments. Please make sure to run test-gke-ml-diagnostics.yml before merging if it has not been run already.
Test Failures:
None of the above test failures are related to the code changes in this PR.
0dc011a into GoogleCloudPlatform:develop
Summary
This PR introduces a new `mldiagnostics` module to automate the installation and configuration of Google Cloud ML Diagnostics (Diagon++) on GKE clusters. It also integrates this capability into the `gke-tpu-v6e` and `gke-tpu-7x` blueprints and adds a new Ansible playbook for `gke-tpu-v6e` and `gke-tpu-7x` integration tests.
Problem Statement
Setting up Google Cloud ML Diagnostics (Diagon++) on GKE typically involves multiple manual steps, including provisioning IAM permissions, installing Helm charts, and configuring workload namespaces. Automating this setup within the Cluster Toolkit ensures a repeatable, best-practice deployment for profiling, logging, and monitoring AI/ML workloads.
Changes Made
- `modules/management/mldiagnostics`: installs ML Diagnostics components (Injection Webhook and Connection Operator) and automatically labels the workload namespace for profiling.
- `gke-tpu-v6e`, `gke-tpu-7x`, and their advanced blueprint examples: updated to support optional ML Diagnostics enablement.
- `gke-cluster`: added the `namespace` input to support creating Workload Identity resources in a dedicated user workload namespace.
- `kubectl-apply`: added `cert_manager` installation support and templatized the namespace in the Kueue config to create the Kueue `LocalQueue` in the user namespace.
- `modules/management/mldiagnostics/sample-workload-test`.
- `test-gke-ml-diagnostics.yml`: added to the daily integration test suite for both v6e and 7x blueprints.
- `gke-tpu-v6e` integration test: uses the `default` namespace for `user_namespace`, while the `gke-tpu-7x` test uses a custom namespace (`ai-workloads`), so that both use cases are validated.
Documentation
Usage Example
To enable ML Diagnostics on a GKE deployment, include the new module in your blueprint and route the `user_namespace` from the `gke-cluster` module: