GPUs require special drivers and software that are not pre-installed on Dataproc clusters by default. This initialization action installs the GPU driver for NVIDIA GPUs on the master and worker nodes of a Dataproc cluster.
A default version will be selected from NVIDIA's guidance, similar to the NVIDIA Deep Learning Frameworks Support Matrix, for CUDA, the NVIDIA kernel driver, cuDNN, and NCCL.
Specifying a supported value for the cuda-version metadata variable
will select compatible values for Driver, cuDNN, and NCCL from the script's
internal matrix. Default CUDA versions are typically:
- Dataproc 2.0: 12.1.1
- Dataproc 2.1: 12.4.1
- Dataproc 2.2 & 2.3: 12.6.3
(Note: The script supports a wider range of specific versions.
Refer to internal arrays in install_gpu_driver.sh for the full matrix.)
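The typical defaults listed above can be summarized as a small lookup. This is only a sketch for reference, not the selection logic in install_gpu_driver.sh; the function name default_cuda_for_image is hypothetical, and the values are copied from the list above.

```shell
# Hypothetical helper: maps a Dataproc image version (e.g. 2.2-debian12)
# to the typical default CUDA version listed in this README.
default_cuda_for_image() {
  case "$1" in
    2.0*) echo "12.1.1" ;;
    2.1*) echo "12.4.1" ;;
    2.2*|2.3*) echo "12.6.3" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

default_cuda_for_image "2.2-debian12"
```

Unrecognized image versions print "unknown" and return a non-zero status, so a caller can decide whether to fall back or stop.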
Example Tested Configurations (Illustrative):
| CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions |
|---|---|---|---|---|---|
| 11.8 | 11.8.0 | 525.147.05 | 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Ubuntu 22.04) |
| 12.0 | 12.0.1 | 525.147.05 | 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian/Ubuntu/Rocky); 2.2 (Rocky 9, Ubuntu 22.04) |
| 12.4 | 12.4.1 | 590.48.01 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+ |
| 12.6 | 12.6.3 | 590.48.01 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+ |
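To pin one of the CUDA versions from the table, pass it through the cuda-version metadata key. The sketch below only prints the command so that it can be reviewed before running; the function name and the region, cluster name, and image version values are placeholders chosen for illustration.

```shell
# Hypothetical helper: prints (does not run) a cluster-create command that
# pins the CUDA version through the cuda-version metadata key.
print_create_command() {
  region="$1"; cluster="$2"; cuda_version="$3"
  echo gcloud dataproc clusters create "${cluster}" \
    --region "${region}" \
    --image-version 2.2-debian12 \
    --worker-accelerator type=nvidia-tesla-t4,count=1 \
    --initialization-actions "gs://goog-dataproc-initialization-actions-${region}/gpu/install_gpu_driver.sh" \
    --metadata "cuda-version=${cuda_version}"
}

# Placeholder values; replace with your own region and cluster name.
print_create_command us-central1 my-gpu-cluster 12.4
```

Copy the printed command (or drop the echo) once the values look right.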
Supported Operating Systems:
- Debian 10, 11, 12
- Ubuntu 18.04, 20.04, 22.04 LTS
- Rocky Linux 8, 9
This initialization action will install NVIDIA GPU drivers and the CUDA toolkit. Optional components like cuDNN, NCCL, and PyTorch can be included via metadata.
- Use the gcloud command to create a new cluster with this initialization action. The following command creates a new cluster named <CLUSTER_NAME> and installs the default GPU drivers (the GPU agent is enabled by default).
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-debian12
  gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --image-version ${DATAPROC_IMAGE_VERSION} \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --worker-accelerator type=nvidia-tesla-t4,count=2 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
    --scopes https://www.googleapis.com/auth/monitoring.write # For GPU agent
- Use the gcloud command to create a new cluster specifying a custom CUDA version and providing direct HTTP/HTTPS URLs for the driver and CUDA .run files. This example also disables the GPU agent.
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-ubuntu22
  MY_DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run"
  MY_CUDA_URL="https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run"
  gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --image-version ${DATAPROC_IMAGE_VERSION} \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --worker-accelerator type=nvidia-tesla-t4,count=2 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
    --metadata gpu-driver-url=${MY_DRIVER_URL},cuda-url=${MY_CUDA_URL},install-gpu-agent=false
- To create a cluster with Multi-Instance GPU (MIG) enabled (e.g., for NVIDIA A100 GPUs), you must use this install_gpu_driver.sh script for the base driver installation and additionally specify gpu/mig.sh as a startup script.
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  DATAPROC_IMAGE_VERSION=<image_version> # e.g., 2.2-rocky9
  gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --image-version ${DATAPROC_IMAGE_VERSION} \
    --worker-machine-type a2-highgpu-1g \
    --worker-accelerator type=nvidia-tesla-a100,count=1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
    --properties "dataproc:startup.script.uri=gs://goog-dataproc-initialization-actions-${REGION}/gpu/mig.sh" \
    --metadata MIG_CGI='1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb' # Example MIG profiles
When this install_gpu_driver.sh script is used as a customization-script
for building custom Dataproc images (e.g., with tools from the
GoogleCloudDataproc/custom-images repository like generate_custom_image.py),
some configurations need to be deferred.
- The image building tool should pass the metadata --metadata invocation-type=custom-images to the temporary instance used during image creation.
- This instructs install_gpu_driver.sh to install drivers and tools but defer Hadoop/Spark-specific configurations to the first boot of an instance created from this custom image. This is handled via a systemd service (dataproc-gpu-config.service).
- End-users creating clusters from such a custom image do not set the invocation-type metadata.
Example command for generate_custom_image.py (simplified):
# Add your other generate_custom_image.py arguments alongside these.
python generate_custom_image.py \
  --customization-script gs://<your-bucket>/gpu/install_gpu_driver.sh \
  --metadata invocation-type=custom-images,cuda-version=12.6 # Plus other desired metadata

This script configures YARN, Dataproc's default Resource Manager, for GPU awareness.
- It sets yarn.io/gpu as a resource type.
- It configures the LinuxContainerExecutor and cgroups for GPU isolation.
- It installs a GPU discovery script (getGpusResources.sh) for Spark, which caches results to minimize nvidia-smi calls.
- Spark default configurations in /etc/spark/conf/spark-defaults.conf are updated with GPU-related properties (e.g., spark.executor.resource.gpu.amount), and the RAPIDS Spark plugin (com.nvidia.spark.SQLPlugin) is commonly configured.
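One way to confirm the Spark side of this configuration on a node is to look for the GPU-related keys in spark-defaults.conf. The following is a minimal sketch; the function name is hypothetical, and it checks only the two settings named above rather than everything the script may write.

```shell
# Hypothetical check: reports whether the GPU settings named in this README
# appear in a Spark defaults file (default path taken from the text above).
check_spark_gpu_conf() {
  conf="${1:-/etc/spark/conf/spark-defaults.conf}"
  if [ ! -r "${conf}" ]; then
    echo "cannot read ${conf}"
    return 2
  fi
  rc=0
  for key in spark.executor.resource.gpu.amount com.nvidia.spark.SQLPlugin; do
    if grep -F -q -- "${key}" "${conf}"; then
      echo "found: ${key}"
    else
      echo "missing: ${key}"
      rc=1
    fi
  done
  return "${rc}"
}

# On a cluster node, run against the default path.
check_spark_gpu_conf || true
```

Because the RAPIDS plugin is described as commonly (not always) configured, a "missing" result for com.nvidia.spark.SQLPlugin is not necessarily an error.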
This script can install NVIDIA cuDNN, a GPU-accelerated library for deep neural networks.
- If include-pytorch=yes is specified or cudnn-version is provided, a compatible version of cuDNN will be selected and installed based on the determined CUDA version.
- To install a specific version of cuDNN, use the cudnn-version metadata parameter (e.g., --metadata cudnn-version=8.9.7.29). Please consult the cuDNN Archive and your deep learning framework's documentation for CUDA compatibility. The script may use libcudnn packages or tarball installations.
Example cuDNN Version Mapping (Illustrative):
| cuDNN Major.Minor | Example Full Version | Compatible CUDA Versions (General) |
|---|---|---|
| 8.6 | 8.6.0.163 | 10.2, 11.x |
| 8.9 | 8.9.7.29 | 11.x, 12.x |
| 9.x | e.g., 9.6.0.74 | 12.x |
This script accepts the following metadata parameters:
- install-gpu-agent: true|false. Default: true. Installs GPU monitoring agent. Requires the https://www.googleapis.com/auth/monitoring.write scope.
- cuda-version: (Optional) Specify desired CUDA version (e.g., 11.8, 12.4.1). Overrides default CUDA selection.
- cuda-url: (Optional) HTTP/HTTPS URL to a specific CUDA toolkit .run file (e.g., https://developer.download.nvidia.com/.../cuda_12.4.1_..._linux.run). Fetched using curl. Overrides cuda-version and default selection.
- gpu-driver-version: (Optional) Specify NVIDIA driver version (e.g., 550.90.07). Overrides default compatible driver selection.
- gpu-driver-url: (Optional) HTTP/HTTPS URL to a specific NVIDIA driver .run file (e.g., https://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-...run). Fetched using curl. Overrides gpu-driver-version.
- gpu-driver-provider: (Optional) OS|NVIDIA. Default: NVIDIA. Determines preference for OS-provided vs. NVIDIA-direct drivers. The script often prioritizes .run files or source builds for reliability.
- cudnn-version: (Optional) Specify cuDNN version (e.g., 8.9.7.29).
- cudnn-install-source: (Optional) tarball|package. Default: package (except for 2.0-rocky8 and 2.1-rocky8, where it defaults to tarball to bypass CDN flakes). Determines whether cuDNN is installed via the OS package manager or extracted from the standalone NVIDIA tarball cached in GCS.
- nccl-version: (Optional) Specify NCCL version.
- include-pytorch: (Optional) yes|no. Default: no. If yes, installs PyTorch, TensorFlow, RAPIDS, and PySpark in a Conda environment.
- gpu-conda-env: (Optional) Name for the PyTorch Conda environment. Default: dpgce.
- container-runtime: (Optional) E.g., docker, containerd, crio. For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
- http-proxy: (Optional) URL of an HTTP proxy for downloads.
- http-proxy-pem-uri: (Optional) A gs:// path to the PEM-encoded certificate file used by the proxy specified in http-proxy. This is needed if the proxy uses TLS and its certificate is not already trusted by the cluster's default trust store (e.g., if it's a self-signed certificate or signed by an internal CA). The script will install this certificate into the system and Java trust stores.
- invocation-type: (For Custom Images) Set to custom-images by image building tools. Not typically set by end-users creating clusters.
- Secure Boot Signing Parameters: Used if Secure Boot is enabled and you need to sign kernel modules built from source.
  private_secret_name=<your-private-key-secret-name>
  public_secret_name=<your-public-cert-secret-name>
  secret_project=<your-gcp-project-id>
  secret_version=<your-secret-version>
  modulus_md5sum=<md5sum-of-your-mok-key-modulus>
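Values passed with --metadata are exposed to each node as instance attributes on the Compute Engine metadata server. As a rough sketch of how a script could read one of the parameters above with a fallback default (the helper name get_metadata_attribute is hypothetical, and this is not necessarily how install_gpu_driver.sh reads its settings):

```shell
# Hypothetical helper: read an instance metadata attribute, or print a default
# when the attribute is unset or the metadata server is unreachable.
get_metadata_attribute() {
  key="$1"; default_value="$2"
  url="http://metadata.google.internal/computeMetadata/v1/instance/attributes/${key}"
  if value="$(curl -sf --max-time 5 -H 'Metadata-Flavor: Google' "${url}")"; then
    printf '%s\n' "${value}"
  else
    printf '%s\n' "${default_value}"
  fi
}

# Example (run on a cluster node): the documented default for
# install-gpu-agent is "true".
# install_agent="$(get_metadata_attribute install-gpu-agent true)"
```

The -f flag makes curl fail on an HTTP 404, which is what the metadata server returns for an attribute that was never set, so the default is used in that case.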
When the script needs to build NVIDIA kernel modules from source (e.g., using NVIDIA's open-gpu-kernel-modules repository, or if pre-built OS packages are not suitable), special considerations apply if Secure Boot is enabled.
- Secure Boot Active: Locally compiled modules must be signed with a key trusted by the system's UEFI firmware.
- MOK Key Signing: Provide the Secure Boot signing metadata parameters (listed above) to use keys stored in GCP Secret Manager. The public MOK certificate must be enrolled in your base image's UEFI keystore. See GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh for guidance on key creation and management.
- Disabling Secure Boot (Unsecured Workaround): You can pass the --no-shielded-secure-boot flag to gcloud dataproc clusters create. This allows unsigned modules but disables Secure Boot's protections.
- Error Indication: If a kernel module fails to load due to signature issues while Secure Boot is active, check /var/log/nvidia-installer.log or dmesg output for errors like "Operation not permitted" or messages related to signature verification failure.
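When a module fails to load, a quick first pass is to check the Secure Boot state and scan the logs mentioned above for signature-related messages. This is a sketch under assumptions: the helper name is hypothetical, and the search patterns are guesses based on the error wording above plus a common kernel message, not an exhaustive list.

```shell
# Hypothetical helper: print log lines that suggest a module signature problem.
# Reads log text on stdin; the patterns are assumptions, not a complete list.
filter_signature_errors() {
  grep -i -E 'operation not permitted|signature|required key not available' || true
}

# Report the Secure Boot state if mokutil is installed.
if command -v mokutil >/dev/null 2>&1; then
  mokutil --sb-state || true
fi

# Scan the kernel log and the installer log when they are readable.
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | filter_signature_errors
fi
if [ -r /var/log/nvidia-installer.log ]; then
  filter_signature_errors < /var/log/nvidia-installer.log
fi
```

Reading dmesg may require root on some images; run the snippet with sudo if it prints nothing unexpectedly.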
- Once the cluster has been created, you can access the Dataproc cluster and verify that the NVIDIA drivers are installed successfully.
  sudo nvidia-smi
- If the CUDA toolkit was installed, verify the compiler:
  /usr/local/cuda/bin/nvcc --version
- If you install the GPU collection service (install-gpu-agent=true, default), verify installation by using the following command:
  sudo systemctl status gpu-utilization-agent.service
  (The service should be active (running)).
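The three checks above can be combined into one pass that reports a result per component. This is a minimal sketch; the run_check wrapper is a hypothetical helper, and it skips any check whose binary is not present instead of failing.

```shell
# Hypothetical wrapper: run a check command if its binary exists and report the result.
run_check() {
  label="$1"; shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "SKIP ${label}: $1 not found"
    return 0
  fi
  if "$@" >/dev/null 2>&1; then
    echo "PASS ${label}"
  else
    echo "FAIL ${label}"
    return 1
  fi
}

run_check "driver" nvidia-smi || true
run_check "cuda compiler" /usr/local/cuda/bin/nvcc --version || true
run_check "gpu agent" systemctl is-active --quiet gpu-utilization-agent.service || true
```

Run it as root (or prefix the commands with sudo) if your image restricts access to the GPU devices or to systemctl.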
For more information about GPU support, take a look at Dataproc documentation.
The GPU monitoring agent (installed when install-gpu-agent=true) automatically
collects and sends GPU utilization and memory usage metrics to Cloud Monitoring.
The agent is based on code from the
ml-on-gcp/gcp-gpu-utilization-metrics
repository. The create_gpu_metrics.py script mentioned in older
documentation is no longer used by this initialization action, as the agent
handles metric creation and reporting.
- Installation Failures: Examine the initialization action log on the affected node, typically /var/log/dataproc-initialization-script-0.log (or a similar name if multiple init actions are used).
- GPU Agent Issues: If the agent was installed (install-gpu-agent=true), check its service logs using sudo journalctl -u gpu-utilization-agent.service.
- Driver Load or Secure Boot Problems: Review dmesg output and /var/log/nvidia-installer.log for errors related to module loading or signature verification.
- "Points written too frequently" (GPU Agent): This was a known issue with older versions of the report_gpu_metrics.py service. The current script and agent versions aim to mitigate this. If encountered, check agent logs.
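Since the log file name depends on how many initialization actions are configured, it can help to tail every matching log at once. A small sketch, assuming the dataproc-initialization-script-N.log naming shown above; the function name is hypothetical.

```shell
# Hypothetical helper: print the last lines of each initialization action log
# found in a directory (default /var/log, as referenced above).
tail_init_logs() {
  log_dir="${1:-/var/log}"
  lines="${2:-50}"
  found=0
  for f in "${log_dir}"/dataproc-initialization-script-*.log; do
    [ -e "${f}" ] || continue
    found=1
    echo "==> ${f}"
    tail -n "${lines}" "${f}"
  done
  if [ "${found}" -eq 0 ]; then
    echo "no initialization action logs found in ${log_dir}"
    return 1
  fi
}

tail_init_logs /var/log 50 || true
```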
For instructions on how to manually test changes to this initialization action, including iterative development on a live cluster, please see the TESTING.md guide.
If you are modifying this initialization action, you can use the provided test infrastructure to validate your changes locally before deploying them to production.
Before pushing any changes to GitHub, you must run the integration tests locally to validate your modifications against the full test matrix (test_gpu.py). These tests use absl.testing.parameterized and the integration_tests.dataproc_test_case framework to spin up ephemeral Dataproc clusters and validate GPU functionality (SINGLE, STANDARD, KERBEROS, MIG, etc.).
We provide a Podman wrapper to execute the Bazel test suite locally in an environment that mirrors the remote CI sandbox.
- Credentials: Ensure you have your Google Cloud Application Default Credentials (ADC) saved locally, typically at ~/.config/gcloud/application_default_credentials.json, and copy the file to initialization-actions/key.json.
- Environment: You must have a configured env.json in the gpu/ directory.
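The two prerequisites can be checked and put in place with a short helper before starting a test run. This is a sketch only: the function name is hypothetical, and restricting key.json to mode 600 is an added precaution rather than something the test scripts are known to require.

```shell
# Hypothetical helper: copy ADC into place as key.json and confirm gpu/env.json exists.
# Arguments: path to the ADC file, path to the initialization-actions checkout.
prepare_test_prereqs() {
  adc="$1"; repo="$2"
  if [ ! -f "${adc}" ]; then
    echo "ADC file not found: ${adc}" >&2
    return 1
  fi
  if [ ! -f "${repo}/gpu/env.json" ]; then
    echo "missing ${repo}/gpu/env.json" >&2
    return 1
  fi
  cp "${adc}" "${repo}/key.json"
  chmod 600 "${repo}/key.json"  # precaution: keep the credential private
  echo "prerequisites ready in ${repo}"
}

# Typical paths from the steps above (adjust to your checkout):
# prepare_test_prereqs "${HOME}/.config/gcloud/application_default_credentials.json" initialization-actions
```

If the ADC file does not exist yet, gcloud auth application-default login creates it at the default location.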
To run the full suite in the Podman container (Unfiltered):
⚠️ WARNING: HIGH RESOURCE CONSUMPTION. An unfiltered run executes the entire test matrix (currently ~12 shards). Because the script is configured to run up to 10 jobs in parallel, this will concurrently provision up to 10 separate Dataproc clusters. This requires massive GCP quota (e.g., ~900 vCPUs and ~30 GPUs simultaneously if using n1-standard-32 profiles) and will take 60-90 minutes.
cd initialization-actions
# Test a specific Dataproc image version against the full suite
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22"

To run a specific test filter to iterate quickly on a failure (Recommended):
cd initialization-actions
# Filter by a specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_gpu_allocation"
# Filter by another specific test function
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=test_install_gpu_cuda_nvidia_with_spark_job"
# Filter by the entire class
./gpu/run-bazel-tests-with-podman.sh "2.2-ubuntu22" "--test_filter=NvidiaGpuDriverTestCase"

If you have already provisioned a Dataproc cluster (e.g., my-cluster) and want to verify its GPU configuration without running the full Bazel test suite, you can use the standalone verification scripts.
# Verify using the local Python script
python3 gpu/verify_external_cluster.py \
--cluster=my-cluster \
--region=us-east4 \
--zone=us-east4-b \
--project=my-project \
--tests smi agent spark torch tf numa
# Or using the bash equivalent
export CLUSTER_NAME=my-cluster PROJECT_ID=my-project REGION=us-east4 ZONE=us-east4-b
./gpu/verify_external_gpu_cluster.sh

For comprehensive validation of Spark RAPIDS, PyTorch, and TensorFlow on a running cluster, an external testing script is available in the associated cloud-dataproc/gcloud repository.
# Configure the gcloud test environment
cd ../cloud-dataproc/gcloud
source lib/env.sh # Populates environment variables from env.json
# Execute the comprehensive Spark GPU test suite against the configured cluster
./t/spark-gpu-test.sh

This script will remotely execute SSH commands to validate NUMA configurations, run PyTorch/TensorFlow isolated in their Conda environments, verify NVCC/cuDNN, and submit SparkPi and JavaIndexToStringExample Spark jobs configured to use the RAPIDS accelerator plugin.
- This initialization script will install NVIDIA GPU drivers on all nodes on which a GPU is detected. If no GPUs are present on a node, most GPU-specific installation steps are skipped.
- Performance & Caching:
  - The script extensively caches downloaded artifacts (drivers, CUDA .run files) and compiled components (kernel modules, NCCL, Conda environments) to a GCS bucket. This bucket is typically specified by the dataproc-temp-bucket cluster property or metadata.
  - First Run / Cache Warming: Initial runs on new configurations (OS, kernel, or driver version combinations) that require source compilation (e.g., for NCCL or kernel modules when no pre-compiled version is available or suitable) can be time-consuming.
    - On small instances (e.g., 2-core nodes), this process can take up to 150 minutes.
    - To optimize and avoid long startup times on production clusters, it is highly recommended to "pre-warm" the GCS cache. This can be done by running the script once on a temporary, larger instance (e.g., a single-node, 32-core machine) with your target OS and desired GPU configuration. This will build and cache the necessary components. Subsequent cluster creations using the same cache bucket will be significantly faster (e.g., the init action might take 12-20 minutes on a large instance for the initial build, and then run much faster on subsequent nodes that use the cache).
  - Security Benefit of Caching: When the script successfully finds and uses cached, pre-built artifacts, it often bypasses the need to install build tools (e.g., gcc, kernel-devel, make) on the cluster nodes. This reduces the attack surface of the resulting cluster instances.
- SSHD configuration is hardened by default by the script.
- The script includes logic to manage APT sources and GPG keys for Debian-based systems, including handling of archived backports repositories to ensure dependencies can be met.
- Tested primarily with Dataproc 2.0+ images. Support for older Dataproc 1.5 images is limited.
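The cache pre-warming step described above could look roughly like the following. This sketch only prints the command for review; the function name, the cluster name gpu-cache-warmer, the T4 accelerator, and the 180m timeout are all assumptions chosen for illustration, and the bucket should be the same one your production clusters use as their temp bucket.

```shell
# Hypothetical helper: prints (does not run) a command that creates a temporary
# single-node cluster to build and cache GPU components in a given bucket.
print_prewarm_command() {
  region="$1"; bucket="$2"; image_version="$3"
  # A first build can outlast the default initialization action timeout,
  # so a longer one is requested here (180m is an assumed value).
  echo gcloud dataproc clusters create gpu-cache-warmer \
    --region "${region}" \
    --single-node \
    --image-version "${image_version}" \
    --master-machine-type n1-standard-32 \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --temp-bucket "${bucket}" \
    --initialization-actions "gs://goog-dataproc-initialization-actions-${region}/gpu/install_gpu_driver.sh" \
    --initialization-action-timeout 180m
}

# Placeholder values; replace with your region, bucket, and image version.
print_prewarm_command us-central1 my-temp-bucket 2.2-debian12
```

Once the initialization action has finished and the cache is populated, the temporary cluster can be removed with gcloud dataproc clusters delete.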