dcgm-exporter collects metrics incorrectly? #348

@happy2048

Environment

● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04

Issue description

No process is using the GPUs on my node. Output of nvidia-smi:

# nvidia-smi

Tue Apr 26 15:29:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:54:00.0 Off |                    0 |
| N/A   33C    P0    67W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:5A:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:6B:00.0 Off |                    0 |
| N/A   32C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:70:00.0 Off |                    0 |
| N/A   34C    P0    71W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   32C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   33C    P0    66W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, the DCGM_FI_DEV_FB_USED metric reports 850 MiB for every GPU:

# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-6fdb3b1a-3e7f-abc1-3e6a-7378a3cb2778",device="nvidia3",gpu="3",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-8ded7a54-6143-7898-563a-eb598623d740",device="nvidia0",gpu="0",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-a4f2126d-7321-144c-aa01-cdb6e7c8022a",device="nvidia7",gpu="7",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-bfbd4faf-d197-bd20-e054-2735bdd0c49e",device="nvidia4",gpu="4",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-c4d991e3-a1a7-0966-056b-a26d078c2f67",device="nvidia5",gpu="5",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-f7b1a978-75f0-a263-4206-8d7cd53641e0",device="nvidia2",gpu="2",modelName="NVIDIA A100-SXM4-80GB"} 850

At the same time, I queried NVML directly and it reports 0 MiB of used GPU memory. Why does dcgm-exporter report 850 MiB?
I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03; DCGM_FI_DEV_FB_USED is nonzero on all of them even when no process is using the GPUs.
Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
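
To make the comparison reproducible, here is a minimal sketch that extracts the per-GPU DCGM_FI_DEV_FB_USED values from the exporter's text output and lists the GPUs whose reported value exceeds what is expected on an idle node. The function names (`fb_used_by_gpu`, `gpus_above`) and the `expected_mib` parameter are hypothetical helpers written for this illustration, not part of dcgm-exporter or DCGM; the metrics text is assumed to have been fetched from the exporter's metrics endpoint beforehand.

```python
import re

# One Prometheus sample line with labels: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair inside the braces, e.g. gpu="6"
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def fb_used_by_gpu(metrics_text, metric="DCGM_FI_DEV_FB_USED"):
    """Return {gpu_index: value_in_MiB} for the given metric.

    Assumes every sample of the metric carries a numeric `gpu` label,
    as in the exporter output shown above.
    """
    result = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        result[int(labels["gpu"])] = float(match.group("value"))
    return result


def gpus_above(fb_used, expected_mib=0.0):
    """Return the sorted GPU indices whose reported value exceeds expected_mib."""
    return sorted(gpu for gpu, value in fb_used.items() if value > expected_mib)


if __name__ == "__main__":
    sample = (
        '# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).\n'
        '# TYPE DCGM_FI_DEV_FB_USED gauge\n'
        'DCGM_FI_DEV_FB_USED{device="nvidia6",gpu="6",'
        'modelName="NVIDIA A100-SXM4-80GB"} 850\n'
        'DCGM_FI_DEV_FB_USED{device="nvidia1",gpu="1",'
        'modelName="NVIDIA A100-SXM4-80GB"} 850\n'
    )
    used = fb_used_by_gpu(sample)
    print(used)
    print(gpus_above(used))
```

Setting `expected_mib` to the value NVML reports for an idle GPU (0 in my case) flags every GPU on this node.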
