[launcher] add gpu driver installation b200 #663

Merged
jkl73 merged 1 commit into google:main from jkl73:gpudriver
Feb 19, 2026
@jkl73 jkl73 commented Feb 14, 2026

Contributor

[launcher] add gpu driver installation b200

Add driver installation logic

Co-authored-by: meetrajvala <160713120+meetrajvala@users.noreply.github.com>

Original PR: #638

@jkl73 jkl73 force-pushed the gpudriver branch 3 times, most recently from f9732f9 to eee26f5 Compare February 18, 2026 00:43
@jkl73 jkl73 requested a review from alexmwu February 18, 2026 00:46
@jkl73 jkl73 requested review from yawangwang February 18, 2026 01:05
@jkl73 jkl73 force-pushed the gpudriver branch 2 times, most recently from ec979c0 to 600c4c0 Compare February 18, 2026 03:25
Comment thread launcher/image/vgexperiment.json Outdated
Comment thread launcher/internal/experiments/experiments_test.go
Comment thread launcher/internal/gpu/config.go
Comment thread launcher/launcher/main.go Outdated
Comment thread launcher/container_runner.go
Comment thread launcher/internal/gpu/driverinstaller.go Outdated
Comment thread launcher/internal/gpu/driverinstaller.go
	}
	// Explicitly need to set the GPU state to READY for GPUs with confidential compute mode ON.
	if ccEnabled == attest.GPUDeviceCCMode_ON {
		setGPUStateCmd := NvidiaSmiOutputFunc("conf-compute", "-srs", "1")

Collaborator

Setting the GPU to the ready state signals that the GPU is ready to run workloads. We should defer this step until after the GPU attestation is measured into the RTMR, because a malicious workload loaded early could alter the GPU attestation.

Contributor Author

The measurement is not there yet; we can handle this once the cGPU attestation is added.

@yawangwang yawangwang Feb 18, 2026

Collaborator

Yeah, I'm wondering if we can remove this step here, since it will be removed eventually.
Alternatively, we can keep this step as long as we add GPU workload tests: https://github.com/google/go-tpm-tools/blob/cs_cgpu_h100/launcher/image/test/scripts/gpu/test_gpu_workload.sh

Contributor

I think it's fine to leave the measurements for another PR. That said, this PR should include those GPU workload tests.

Contributor Author

The dev GCP project currently doesn't have B200 machines, so I can only run this manually in staging.

Collaborator

We don't have to run against B200 machines. Can we run against H100 machines to verify the GPU driver installation flow? Our dev GCP project is allowlisted for the H100DriverInstallation experiment.

Contributor Author

> We don't have to run against B200 machines. Can we run against H100 machines to verify the GPU driver installation flow? Our dev GCP project is allowlisted for the H100DriverInstallation experiment.

We could, though that's controlled by the H100 flag, for which the experiment binary is still being rolled out...

Comment thread launcher/spec/launch_spec.go Outdated
	}
	s.GpuDriverVersion = unmarshaledMap[gpuDriverVersion]
	if s.GpuDriverVersion == "" {
		s.GpuDriverVersion = "DEFAULT"

Collaborator

@jkl73 jkl73 Feb 18, 2026

Contributor Author

Yes, it forces the user to set the GPU driver explicitly for now. Later, a default driver may be qualified, and then it won't need to be set explicitly.
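
A rough sketch of that behavior, assuming the `"DEFAULT"` sentinel from the diff above is rejected by a later check until a default driver is qualified. The metadata key string, `driverVersionFromMetadata`, `requireExplicitVersion`, and the error value are all hypothetical; only the sentinel value comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// gpuDriverVersionKey is a hypothetical metadata key name.
	gpuDriverVersionKey = "gpu-driver-version"
	// defaultSentinel mirrors the "DEFAULT" value assigned in the diff.
	defaultSentinel = "DEFAULT"
)

// errDriverVersionUnset is a hypothetical error for an unset version.
var errDriverVersionUnset = errors.New("gpu driver version must be set explicitly")

// driverVersionFromMetadata returns the configured version, or the
// sentinel when the key is missing or empty.
func driverVersionFromMetadata(md map[string]string) string {
	if v := md[gpuDriverVersionKey]; v != "" {
		return v
	}
	return defaultSentinel
}

// requireExplicitVersion rejects the sentinel. This check is an
// assumption about how "forcing the user to set it" could be enforced.
func requireExplicitVersion(v string) error {
	if v == defaultSentinel {
		return errDriverVersionUnset
	}
	return nil
}

func main() {
	inputs := []map[string]string{
		{},
		{gpuDriverVersionKey: "example-version"},
	}
	for _, md := range inputs {
		v := driverVersionFromMetadata(md)
		fmt.Printf("%q -> %v\n", v, requireExplicitVersion(v))
	}
}
```

Keeping the sentinel separate from the validation means the check can be dropped later, once a default driver is qualified, without touching the parsing.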

Collaborator

Well, I'm thinking of the alternative: can we not introduce this gpuDriverVersion launch spec for the initial release? Since only one GPU driver version will be supported per CS image, introducing extra flags may confuse customers. We can add this launch spec flag later if the CS image supports multiple GPU driver versions. WDYT?

Contributor Author

Sure, will remove the version flag.

Comment thread launcher/image/vgexperiment.json
Add driver installation logic

Co-authored-by: meetrajvala <160713120+meetrajvala@users.noreply.github.com>
@jkl73 jkl73 merged commit 1c301ef into google:main Feb 19, 2026
11 checks passed