[launcher] add gpu driver installation b200 #663
    }
    // Explicitly need to set the GPU state to READY for GPUs with confidential compute mode ON.
    if ccEnabled == attest.GPUDeviceCCMode_ON {
        setGPUStateCmd := NvidiaSmiOutputFunc("conf-compute", "-srs", "1")
Setting the GPU to the ready state signals that the GPU is ready to run workloads. We should defer this step until after the GPU attestation is measured into the RTMR, because a malicious workload loaded early could alter the GPU attestation.
Measurement is not there yet; we can handle this once the cgpu attestation is added.
Yeah, I'm wondering if we can remove this step here, since it will be removed eventually.
Alternatively, we can keep this step as long as we add GPU workload tests: https://github.com/google/go-tpm-tools/blob/cs_cgpu_h100/launcher/image/test/scripts/gpu/test_gpu_workload.sh
I think it's fine to leave the measurements for another PR. That said, this PR should include those GPU workload tests
The dev GCP project currently doesn't have B200 machines, so this can only be run manually in staging.
We don't have to run against B200 machines. Can we run against H100 machines to verify the GPU driver installation flow? Our dev GCP project is allowlisted for the H100DriverInstallation experiment.
We could, though that's controlled by the h100 flag, for which the experiment binary is still being rolled out...
    }
    s.GpuDriverVersion = unmarshaledMap[gpuDriverVersion]
    if s.GpuDriverVersion == "" {
        s.GpuDriverVersion = "DEFAULT"
So the "DEFAULT" GPU driver version will fail the version check anyway (https://github.com/google/go-tpm-tools/pull/663/changes#diff-510e0533f5e990d7ab16bd8aadd7f435571b8ad71cd5b2ef326da34accb572cfR63-R67)?
Yes, it forces the user to set the GPU driver version explicitly for now. Later, a default driver may be qualified, and then it won't need to be set explicitly.
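For reference, the behavior described here can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: the metadata key string, `resolveDriverVersion`, `validateDriverVersion`, and the allowlist entry `example-1.0` are all hypothetical placeholders. Only the empty-value fallback to "DEFAULT" comes from the diff.

```go
package main

import "fmt"

const (
	// gpuDriverVersionKey is a hypothetical launch spec key name.
	gpuDriverVersionKey = "example-gpu-driver-version"
	// defaultDriverVersion mirrors the fallback value shown in the diff.
	defaultDriverVersion = "DEFAULT"
)

// supportedDriverVersions is a hypothetical allowlist. "example-1.0" is a
// placeholder, not a real qualified driver version.
var supportedDriverVersions = map[string]bool{
	"example-1.0": true,
}

// resolveDriverVersion returns the user-supplied version, or "DEFAULT" when
// the launch spec does not set one.
func resolveDriverVersion(spec map[string]string) string {
	if v := spec[gpuDriverVersionKey]; v != "" {
		return v
	}
	return defaultDriverVersion
}

// validateDriverVersion rejects anything outside the allowlist. Because
// "DEFAULT" is not in the allowlist, an unset version fails here, which is
// what forces the user to pick a version explicitly.
func validateDriverVersion(v string) error {
	if !supportedDriverVersions[v] {
		return fmt.Errorf("unsupported GPU driver version %q", v)
	}
	return nil
}
```

Under this sketch, qualifying a default driver later would only require adding "DEFAULT" (or its mapped version) to the allowlist.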
Well, I'm thinking about the alternative: can we not introduce this gpuDriverVersion launch spec for the initial release? Since only one GPU driver version will be supported per CS image, introducing extra flags may confuse customers. We can add this launch spec flag later if the CS image supports multiple GPU driver versions. WDYT?
sure, will remove the version flag
Add driver installation logic
Co-authored-by: meetrajvala <160713120+meetrajvala@users.noreply.github.com>
Original PR: #638