Environment
● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
Issue description
No process is using the GPUs on my node. Output of nvidia-smi:
# nvidia-smi
Tue Apr 26 15:29:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:54:00.0 Off | 0 |
| N/A 33C P0 67W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:5A:00.0 Off | 0 |
| N/A 32C P0 65W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:6B:00.0 Off | 0 |
| N/A 32C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:70:00.0 Off | 0 |
| N/A 34C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:BE:00.0 Off | 0 |
| N/A 33C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:C3:00.0 Off | 0 |
| N/A 30C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:DA:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E0:00.0 Off | 0 |
| N/A 33C P0 66W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, the DCGM_FI_DEV_FB_USED metric reports 850 MiB for every GPU:
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-6fdb3b1a-3e7f-abc1-3e6a-7378a3cb2778",device="nvidia3",gpu="3",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-8ded7a54-6143-7898-563a-eb598623d740",device="nvidia0",gpu="0",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-a4f2126d-7321-144c-aa01-cdb6e7c8022a",device="nvidia7",gpu="7",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-bfbd4faf-d197-bd20-e054-2735bdd0c49e",device="nvidia4",gpu="4",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-c4d991e3-a1a7-0966-056b-a26d078c2f67",device="nvidia5",gpu="5",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-f7b1a978-75f0-a263-4206-8d7cd53641e0",device="nvidia2",gpu="2",modelName="NVIDIA A100-SXM4-80GB"} 850
At the same time, querying NVML directly reports 0 used GPU memory. Why does dcgm-exporter report 850 MiB?
I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03. On all of them, DCGM_FI_DEV_FB_USED is nonzero even when no process is using the GPUs.
Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
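
To make the comparison reproducible, here is a minimal Python sketch of how the exporter output could be checked against NVML readings. It only parses the text exposed by dcgm-exporter. The function names `parse_fb_used` and `mismatches` are illustrative, and the `nvml_used` dictionary is a hypothetical input standing in for whatever values NVML returns on the node (0 MiB in my case).

```python
import re

# One Prometheus sample line for the metric, e.g.
#   DCGM_FI_DEV_FB_USED{...,gpu="6",...} 850
SAMPLE_RE = re.compile(r'^DCGM_FI_DEV_FB_USED\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_fb_used(metrics_text):
    """Return {gpu index: framebuffer used in MiB} from exporter text output."""
    used = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips "# HELP", "# TYPE", and any other metric
        labels = dict(LABEL_RE.findall(match.group("labels")))
        used[int(labels["gpu"])] = float(match.group("value"))
    return used


def mismatches(dcgm_used, nvml_used):
    """Return {gpu: (dcgm MiB, nvml MiB)} for GPUs where the two values differ."""
    return {
        gpu: (dcgm_used[gpu], nvml_used.get(gpu))
        for gpu in sorted(dcgm_used)
        if dcgm_used[gpu] != nvml_used.get(gpu)
    }


# Two sample lines copied from the exporter output above.
sample = '''# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
'''

if __name__ == "__main__":
    dcgm = parse_fb_used(sample)
    # Hypothetical NVML readings (MiB); replace with real values from the node.
    nvml_used = {gpu: 0.0 for gpu in dcgm}
    for gpu, (d, n) in mismatches(dcgm, nvml_used).items():
        print(f"gpu {gpu}: DCGM reports {d} MiB, NVML reports {n} MiB")
```

With the full exporter output pasted in place of `sample`, this would list every GPU where the two sources disagree.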