feat: Add job submission capability by introducing "gcluster job" command #5431

Merged
Neelabh94 merged 27 commits into GoogleCloudPlatform:develop from Neelabh94:workload_submission
Apr 29, 2026
Conversation

@Neelabh94

@Neelabh94 Neelabh94 commented Mar 31, 2026

Contributor

This PR adds two new gcluster commands that make it easier for customers to interact with the environment and with jobs (workloads) on a GKE cluster.

  • New gcluster job CLI Domain: Added commands for job submission, listing, cancellation, and log retrieval (submit, list, cancel, logs).

    • GKE Orchestrator Implementation: Built a robust backend using Kubernetes JobSet and Kueue for managing multi-slice workloads.
    • Automated Image Building: Integrated Crane for on-the-fly container image building and registry pushing without requiring local Docker.
    • Prerequisite Automation: Implemented comprehensive pre-flight checks and auto-installation of necessary components (JobSet CRD, Kueue).
    • Documentation: Added a detailed user guide (docs/gcluster_job_guide.md).
  • New gcluster cluster CLI Domain: Added commands for cluster inspection and management, including list, info, describe, and volume.

  • Scope and Compatibility: All changes currently support GKE clusters only and do not impact or alter any existing cluster creation or deployment steps.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new experimental job submission and management framework for the HPC Toolkit. It enables users to deploy containerized workloads to GKE clusters, manage job lifecycles, and introspect cluster resources. The changes include a robust command-line interface for job submission, automated prerequisite verification, and integration with Kubernetes JobSet and Kueue for efficient resource scheduling.

Highlights

  • New Job Submission Command: Added an experimental 'gcluster job submit' command to allow users to submit containerized workloads to GKE clusters using JobSet and Kueue.
  • Cluster Management Commands: Introduced new cluster-related commands including 'list', 'info', 'describe', and 'volume' to improve cluster introspection.
  • Job Lifecycle Management: Added commands to manage job lifecycles, specifically 'list', 'cancel', and 'logs' for workloads running on GKE.
  • Automated Prerequisite Checks: Implemented an automated prerequisite check system to ensure necessary tools (gcloud, kubectl) and configurations are in place before job submission.
  • Container Image Building: Integrated Crane for on-the-fly container image building, allowing users to build and push images directly from a build context.
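
The prerequisite check described in the highlights above could look roughly like the following. This is a minimal sketch, not the PR's implementation: `missingTools` and its `found` callback are hypothetical names introduced for illustration, and only the presence of the `gcloud` and `kubectl` binaries on `PATH` is checked.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns every tool for which found reports false.
// Injecting the lookup keeps the check testable without touching PATH.
// (Hypothetical helper; the PR's actual prereq code may differ.)
func missingTools(tools []string, found func(string) bool) []string {
	var missing []string
	for _, t := range tools {
		if !found(t) {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// onPath reports whether a binary can be resolved via PATH.
	onPath := func(name string) bool {
		_, err := exec.LookPath(name)
		return err == nil
	}
	if missing := missingTools([]string{"gcloud", "kubectl"}, onPath); len(missing) > 0 {
		fmt.Println("missing prerequisites:", missing)
		return
	}
	fmt.Println("all prerequisites found")
}
```

Collecting all missing tools before reporting lets a user fix everything in one pass instead of rerunning the command once per missing binary.
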
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new job submission and management feature for GKE clusters, including commands for submitting, listing, canceling, and fetching logs for workloads. It also adds infrastructure management for JobSet and Kueue, along with utility functions for image building and telemetry. I have provided feedback on improving the maintainability of template loading, optimizing file system traversal, caching telemetry checks, and enhancing the robustness of HTTP requests and command execution.
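
One common way to cache a check such as the telemetry one mentioned above is `sync.Once`. The sketch below is an assumption about how that could be done, not the code in pkg/orchestrator/jobtelemetry.go; `telemetryCache` and its `check` field are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// telemetryCache memoizes the result of a potentially expensive check
// so that it runs at most once per process. (Hypothetical type.)
type telemetryCache struct {
	once    sync.Once
	enabled bool
	check   func() bool // the underlying probe, injected by the caller
}

// Enabled runs check on the first call and returns the stored result afterwards.
// It is safe to call from multiple goroutines.
func (c *telemetryCache) Enabled() bool {
	c.once.Do(func() { c.enabled = c.check() })
	return c.enabled
}

func main() {
	calls := 0
	cache := &telemetryCache{check: func() bool { calls++; return true }}
	cache.Enabled()
	cache.Enabled()
	fmt.Println("probe invocations:", calls)
}
```
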

Comment thread pkg/orchestrator/gke/infra_manager.go Outdated
Comment thread pkg/shell/common.go Outdated
Comment thread pkg/imagebuilder/crane_builder.go Outdated
Comment thread pkg/orchestrator/jobtelemetry.go
Comment thread pkg/orchestrator/gke/manifest_generator.go Outdated
Comment thread pkg/orchestrator/gke/infra_manager.go Outdated
@Neelabh94 Neelabh94 force-pushed the workload_submission branch from 1acc796 to 0468803 Compare March 31, 2026 06:52
@Neelabh94 Neelabh94 added enhancement New feature or request release-key-new-features Added to release notes under the "Key New Features" heading. labels Mar 31, 2026
@Neelabh94 Neelabh94 force-pushed the workload_submission branch 4 times, most recently from 6169800 to 4c0c2b8 Compare March 31, 2026 13:46
@Neelabh94

Contributor Author

/gemini review

@Neelabh94 Neelabh94 changed the title from "gcluster job" to "feat: Add job submission capability by introducing "gcluster job" command" Mar 31, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces experimental cluster and job command groups to the gcluster CLI, providing tools to list, describe, and manage GKE environments and JobSet-based workloads. Key features include a GKE orchestrator, a Crane-based image builder, and automated prerequisite checks. Review feedback identifies critical bugs in GKE location handling where gcloud commands incorrectly use --location or hardcode --zone, which will fail for regional clusters. Additionally, the reviewer recommends removing the mandatory requirement for the --accelerator flag to support auto-discovery, renaming flags for consistency, fixing script errors in the documentation, and eliminating code duplication in the random string generator.
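
To illustrate the location-handling concern, here is a minimal sketch of choosing between `--region` and `--zone` for `gcloud container clusters get-credentials`. The helper names are hypothetical, and the hyphen-counting heuristic is an assumption (zones such as `us-central1-a` carry one more hyphen-separated suffix than regions such as `us-central1`); it is not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// locationFlag returns the gcloud flag that matches a GKE location string.
// Assumption: a zone is a region plus a "-<letter>" suffix, so it contains
// at least two hyphens, while a region contains exactly one.
func locationFlag(location string) string {
	if strings.Count(location, "-") >= 2 {
		return "--zone"
	}
	return "--region"
}

// getCredentialsArgs builds the argument list for
// `gcloud container clusters get-credentials`. (Hypothetical helper.)
func getCredentialsArgs(cluster, location, project string) []string {
	return []string{
		"container", "clusters", "get-credentials", cluster,
		locationFlag(location), location,
		"--project", project,
	}
}

func main() {
	// Placeholder cluster and project names, for illustration only.
	fmt.Println(getCredentialsArgs("my-cluster", "us-central1", "my-project"))
	fmt.Println(getCredentialsArgs("my-cluster", "us-central1-a", "my-project"))
}
```

Deriving the flag from the location value avoids hardcoding `--zone`, which is the failure mode the review flags for regional clusters.
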

Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread cmd/job/submit.go
Comment thread pkg/orchestrator/gke/gke_cluster_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_cluster_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread cmd/cluster/describe_test.go
Comment thread cmd/job/job.go Outdated
Comment thread docs/gcluster_job_guide.md
Comment thread docs/gcluster_job_guide.md Outdated
Comment thread pkg/imagebuilder/crane_builder.go Outdated
@Neelabh94 Neelabh94 force-pushed the workload_submission branch 2 times, most recently from e680ed4 to cc5d02c Compare April 1, 2026 01:53
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a significant set of experimental features for managing GKE clusters and submitting containerized workloads via gcluster cluster and gcluster job. It includes automated prerequisite handling, on-the-fly image building using Crane, and integration with JobSet and Kueue for orchestration. Feedback focuses on several critical improvements: removing hardcoded 'default' namespaces to support multi-tenancy, tightening file and directory permissions for sensitive state data, transitioning from legacy gcr.io to Artifact Registry, and adopting cryptographically secure random string generation. Additionally, the destructive re-installation of the JobSet controller should be replaced with idempotent updates to avoid disrupting active workloads.
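
For the random string point, a generator backed by `crypto/rand` might look like the sketch below. It is an illustration under stated assumptions rather than the code merged in pkg/shell/common.go: `randomSuffix` and the lowercase alphanumeric alphabet are hypothetical choices.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// alphabet is an assumed character set suitable for Kubernetes-style name suffixes.
const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// randomSuffix returns n characters drawn uniformly from alphabet using the
// operating system's cryptographically secure random source. (Hypothetical helper.)
func randomSuffix(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(alphabet)))
	for i := range out {
		// rand.Int returns a uniform value in [0, limit), avoiding modulo bias.
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

func main() {
	s, err := randomSuffix(8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("job suffix:", s)
}
```
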

Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/infra_manager.go Outdated
Comment thread cmd/job/submit.go Outdated
Comment thread cmd/job/prereq.go Outdated
Comment thread cmd/job/prereq.go
Comment thread pkg/imagebuilder/crane_builder.go Outdated
Comment thread pkg/orchestrator/gke/infra_manager.go Outdated
Comment thread pkg/shell/common.go Outdated
@Neelabh94 Neelabh94 force-pushed the workload_submission branch from e8c705f to da7e98a Compare April 1, 2026 07:20
@Neelabh94 Neelabh94 marked this pull request as ready for review April 1, 2026 11:23
@Neelabh94 Neelabh94 requested review from a team and samskillman as code owners April 1, 2026 11:23
@Neelabh94 Neelabh94 force-pushed the workload_submission branch 9 times, most recently from 6ec0ed4 to 942b120 Compare April 7, 2026 09:25
Comment thread cmd/cluster/cluster.go Outdated
Comment thread cmd/cluster/cluster.go Outdated
Comment thread cmd/cluster/cluster.go
Comment thread cmd/cluster/describe.go Outdated
Comment thread cmd/cluster/describe.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Neelabh94 and others added 22 commits April 29, 2026 08:55
…ch all namespaces, remove destructive delete in infra_manager
rchestrator/telemetry_test.go pkg/orchestrator/jobtelemetry_test.go

wget -r -np -nH --cut-dirs=6 "$FTP_URL" -P "$LOCAL_DIR"

Co-authored-by: Neelabh94 <neelgoyal@google.com>
Co-authored-by: kvenkatachala333 <kvenkatachala@google.com>
Co-authored-by: SikaGrr <isikiric@google.com>
…sion scope and migrate to artifact-registry from gcr.io
@Neelabh94 Neelabh94 force-pushed the workload_submission branch from bff22d0 to d26b1bc Compare April 29, 2026 08:57
@Neelabh94 Neelabh94 force-pushed the workload_submission branch from d26b1bc to b75035c Compare April 29, 2026 09:04
@Neelabh94 Neelabh94 dismissed jamOne-’s stale review April 29, 2026 11:38

All concerns addressed.

@Neelabh94 Neelabh94 removed the request for review from jamOne- April 29, 2026 11:46
@Neelabh94 Neelabh94 merged commit 6c0efcd into GoogleCloudPlatform:develop Apr 29, 2026
17 of 76 checks passed
Labels

enhancement New feature or request release-key-new-features Added to release notes under the "Key New Features" heading.

6 participants