Use kubelet-device-plugin API #132

Merged
arnaldo2792 merged 3 commits into bottlerocket-os:develop from arnaldo2792:nvidia-settings-api
Sep 11, 2024


@arnaldo2792 commented Sep 6, 2024

Issue number:

Related: bottlerocket-os/bottlerocket-settings-sdk#60

Description of changes:

Per bottlerocket-os/bottlerocket-settings-sdk#57 (comment), we are moving away from kubernetes.device-plugin to kubelet-device-plugin.

Testing done:

Tested as part of bottlerocket-os/bottlerocket-settings-sdk#60 and bottlerocket-os/bottlerocket#4182:

  1. Instance joined the cluster:
NAME                                           STATUS   ROLES    AGE    VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   22s    v1.30.1-eks-e564799
  2. Files were generated using the new values:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugin": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "pass-device-specs": true
      }
    }
  }
}

bash-5.1# cat /etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-device-plugin --config-file=/etc/nvidia-k8s-device-plugin/settings.yaml
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
bash-5.1#
  3. A container created after the settings were changed has access to all GPUs without requesting any:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: safe-defaults
spec:
  selector:
    matchLabels:
      name: safe-defaults
  template:
    metadata:
      labels:
        name: safe-defaults
    spec:
      # No GPUs requested
      containers:
        - name: safe-defaults
          image: nvidia/cuda:12.4.1-cudnn-devel-rockylinux8
          command: ['sh', '-c', 'sleep infinity']
bash-5.1# apiclient set kubelet-device-plugin.nvidia.device-list-strategy=envvar
bash-5.1# apiclient set nvidia-container-runtime.visible-devices-as-volume-mounts=false
bash-5.1# apiclient set nvidia-container-runtime.visible-devices-envvar-when-unprivileged=true
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
└─> ❯ k exec safe-defaults-cnsqn -- nvidia-smi
Mon Sep  9 15:31:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8               8W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
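Why does a pod that requests no GPUs still see the A10G? A simplified model, not the NVIDIA container runtime's actual code: once `accept-nvidia-visible-devices-envvar-when-unprivileged` is `true`, an unprivileged container's `NVIDIA_VISIBLE_DEVICES` variable is honored. The sketch assumes the `nvidia/cuda` image sets that variable to `all`; the function name `visible_devices` is hypothetical.

```python
from typing import Optional


def visible_devices(
    env: dict[str, str],
    privileged: bool,
    accept_envvar_when_unprivileged: bool,
) -> Optional[str]:
    """Simplified model of whether NVIDIA_VISIBLE_DEVICES is honored for a container.

    Returns the requested device list, or None when the variable is unset or ignored.
    """
    requested = env.get("NVIDIA_VISIBLE_DEVICES")
    if not requested:
        return None
    # Privileged containers are always allowed to use the variable; unprivileged
    # ones only when the runtime config opts in.
    if privileged or accept_envvar_when_unprivileged:
        return requested
    return None


# Assumption: the CUDA image ships with NVIDIA_VISIBLE_DEVICES=all in its environment.
image_env = {"NVIDIA_VISIBLE_DEVICES": "all"}

# Settings after the three `apiclient set` calls above (envvar honored).
print(visible_devices(image_env, privileged=False, accept_envvar_when_unprivileged=True))

# Default settings from the first config.toml (envvar ignored when unprivileged).
print(visible_devices(image_env, privileged=False, accept_envvar_when_unprivileged=False))
```

Under this model the first call yields `all` and the second yields `None`, which matches the intent of the defaults: with `volume-mounts` and the envvar disabled, a pod has to request GPUs through the device plugin to see any.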

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

git = "https://github.com/bottlerocket-os/bottlerocket-settings-sdk"
tag = "bottlerocket-settings-models-v0.4.0"
version = "0.4.0"
git = "https://github.com/arnaldo2792/bottlerocket-settings-sdk.git"

I'll drop this once the new Settings SDK is released.

@arnaldo2792 force-pushed the nvidia-settings-api branch 7 times, most recently from aade1bc to ab5cd8e on September 9, 2024 16:00
@arnaldo2792

Forced push includes:

  • Remove unnecessary dependency updates in Cargo.lock
  • Skip CI and wait until the Settings SDK is released

@arnaldo2792

(forced push removes hack commit and uses the latest Settings SDK release)

The API shape was changed to kubelet-device-plugin

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
@arnaldo2792

(forced push fixes conflicts and re-enables the CI)

@arnaldo2792 marked this pull request as ready for review September 10, 2024 22:19
@arnaldo2792 merged commit 557a7e5 into bottlerocket-os:develop Sep 11, 2024