feat: Add job submission capability by introducing "gcluster job" command #5431
Summary of Changes (Gemini Code Assist)
This pull request introduces a new experimental job submission and management framework for the HPC Toolkit. It enables users to deploy containerized workloads to GKE clusters, manage job lifecycles, and introspect cluster resources. The changes include a command-line interface for job submission, automated prerequisite verification, and integration with Kubernetes JobSet and Kueue for resource scheduling.
Code Review
This pull request introduces a new job submission and management feature for GKE clusters, including commands for submitting, listing, canceling, and fetching logs for workloads. It also adds infrastructure management for JobSet and Kueue, along with utility functions for image building and telemetry. I have provided feedback on improving the maintainability of template loading, optimizing file system traversal, caching telemetry checks, and enhancing the robustness of HTTP requests and command execution.
1acc796 to 0468803
6169800 to 4c0c2b8
/gemini review
Code Review
This pull request introduces experimental cluster and job command groups to the gcluster CLI, providing tools to list, describe, and manage GKE environments and JobSet-based workloads. Key features include a GKE orchestrator, a Crane-based image builder, and automated prerequisite checks. Review feedback identifies critical bugs in GKE location handling where gcloud commands incorrectly use --location or hardcode --zone, which will fail for regional clusters. Additionally, the reviewer recommends removing the mandatory requirement for the --accelerator flag to support auto-discovery, renaming flags for consistency, fixing script errors in the documentation, and eliminating code duplication in the random string generator.
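The location-handling issue raised in this review can be sketched as follows. This is a heuristic illustration, not the fix applied in the PR: it assumes a zonal location ends in a single-letter segment (for example `us-central1-a`) and treats everything else as a region. The function names `locationFlag` and `describeArgs` are hypothetical.

```go
package main

import "strings"

// locationFlag (hypothetical helper) picks the gcloud flag for a GKE
// location. Assumption: zones look like "<region>-<letter>", e.g.
// "us-central1-a"; any other shape is treated as a region.
func locationFlag(location string) string {
	parts := strings.Split(location, "-")
	last := parts[len(parts)-1]
	if len(parts) >= 3 && len(last) == 1 && last[0] >= 'a' && last[0] <= 'z' {
		return "--zone"
	}
	return "--region"
}

// describeArgs (hypothetical helper) builds the argument list for
// "gcloud container clusters describe" without hardcoding --zone.
func describeArgs(cluster, location string) []string {
	return []string{
		"container", "clusters", "describe", cluster,
		locationFlag(location), location,
	}
}
```

Passing the arguments as a slice (rather than a formatted shell string) also avoids quoting problems when the command is executed.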
e680ed4 to cc5d02c
/gemini review
Code Review
This pull request introduces a significant set of experimental features for managing GKE clusters and submitting containerized workloads via the "gcluster cluster" and "gcluster job" commands. It includes automated prerequisite handling, on-the-fly image building using Crane, and integration with JobSet and Kueue for orchestration. Feedback focuses on several critical improvements: removing hardcoded 'default' namespaces to support multi-tenancy, tightening file and directory permissions for sensitive state data, transitioning from legacy gcr.io to Artifact Registry, and adopting cryptographically secure random string generation. Additionally, the destructive re-installation of the JobSet controller should be replaced with idempotent updates to avoid disrupting active workloads.
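Two of the recommendations above, secure random strings and tighter permissions for state data, might look roughly like the sketch below. The names `randomSuffix` and `writeState`, the alphabet, and the exact permission modes are assumptions chosen for illustration; the PR's actual implementation may differ.

```go
package main

import (
	"crypto/rand"
	"math/big"
	"os"
	"path/filepath"
)

// alphabet is an assumed character set suitable for Kubernetes-style names.
const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomSuffix (hypothetical helper) returns n characters drawn uniformly
// from alphabet using crypto/rand instead of math/rand.
func randomSuffix(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(alphabet)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

// writeState (hypothetical helper) stores state so that only the owning
// user can access it: directory mode 0700, file mode 0600.
func writeState(dir, name string, data []byte) error {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, name), data, 0o600)
}
```

Note that `os.MkdirAll` and `os.WriteFile` apply the mode only when they create the directory or file, so an existing path with looser permissions would need a separate `os.Chmod`.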
e8c705f to da7e98a
6ec0ed4 to 942b120
…ch all namespaces, remove destructive delete in infra_manager
rchestrator/telemetry_test.go pkg/orchestrator/jobtelemetry_test.go wget -r -np -nH --cut-dirs=6 "$FTP_URL" -P "$LOCAL_DIR"
Co-authored-by: Neelabh94 <neelgoyal@google.com>
Co-authored-by: kvenkatachala333 <kvenkatachala@google.com>
Co-authored-by: SikaGrr <isikiric@google.com>
…sion scope and migrate to artifact-registry from gcr.io
…g the prereq checks
bff22d0 to d26b1bc
d26b1bc to b75035c
6c0efcd into GoogleCloudPlatform:develop
This PR adds two new gcluster command groups that make it easier for customers to interact with the environment and with jobs (workloads) on a GKE cluster.
New gcluster job CLI Domain: Added commands for job submission, listing, cancellation, and log retrieval (submit, list, cancel, logs).
New gcluster cluster CLI Domain: Added commands for cluster inspection and management, including list, info, describe, and volume.
Scope and Compatibility: All changes currently support GKE clusters only and do not impact or alter any existing cluster creation or deployment steps.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.