
infiniband: The rate function returned value is inconsistent with the system file query results #16570

@xiaoxlm

What did you do?

I need to monitor traffic on the InfiniBand ports of a node. However, I noticed that the value returned by the rate function differs from the value I calculate from the system (sysfs) counter files.

The two values (system data vs. PromQL data) differ by a factor of about 5.

I believe the system data is correct.

Could you please help me understand the reason for this? Or am I perhaps using the wrong formula?

What did you expect to see?

The value obtained with PromQL should be close to the value computed from the system data.

What did you see instead? Under which circumstances?

First, let's look at the current counter values:

# cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
5996930595687

Counter value obtained with PromQL:
Image

5996930595687 * 4 ≈ 23987722386204, so the two counter values are very close.
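The unit conversion behind this comparison can be checked with a short shell snippet. It assumes, as in the script further down, that port_rcv_data counts 4-byte words; the variable names are only illustrative.

```shell
# Value read from /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
words=5996930595687

# Assumption: the sysfs counter is in 4-byte words, so multiply by 4 for bytes
bytes=$(( words * 4 ))
echo "$bytes"
```

This prints 23987722382748, which is within a few thousand bytes of the PromQL value quoted above.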

But when I use Prometheus functions, the results are very different.

Data obtained using the rate function in the Prometheus web UI:

Image

Script used to fetch the system data:

I ran the script below on my node 10.10.1.84:

while true; do
    RX1=$(cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data)
    TX1=$(cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data)
    sleep 1
    RX2=$(cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data)
    TX2=$(cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data)
    RX_RATE=$(( (RX2 - RX1) * 4 ))   # 1 word = 4 bytes
    TX_RATE=$(( (TX2 - TX1) * 4 ))
    echo "RX: $((RX_RATE * 8 / (1024*1024*1024))) Gbps, TX: $((TX_RATE * 8 / (1024*1024*1024))) Gbps"
done

RX: 18 Gbps, TX: 18 Gbps
RX: 18 Gbps, TX: 18 Gbps
RX: 20 Gbps, TX: 20 Gbps
RX: 22 Gbps, TX: 22 Gbps
RX: 20 Gbps, TX: 20 Gbps
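As a cross-check that avoids integer rounding, the same per-second calculation could be sketched as a small helper. This is only a sketch under stated assumptions: `delta_to_gbps` is a hypothetical name, it assumes 4-byte words as in the script, and it reports decimal Gbps (10^9 bits per second) with fractions, whereas the script above divides by 1024^3 using integer arithmetic, so the two will not print identical numbers.

```shell
# Hypothetical helper: convert two counter readings (in 4-byte words) taken
# `seconds` apart into decimal Gbps, keeping fractional precision via awk.
# Usage: delta_to_gbps <first_reading> <second_reading> <seconds>
delta_to_gbps() {
    awk -v w1="$1" -v w2="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (w2 - w1) * 4 * 8 / s / 1e9 }'
}

# Example: 625000000 words received in 1 second
delta_to_gbps 0 625000000 1   # prints 20.00
```

Passing a longer interval as the third argument (with readings taken that far apart) gives an average over a window that is easier to compare with the range used in a rate query.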

I found that the two values (system data vs. PromQL) differ by a factor of about 5.

System information

PRETTY_NAME="Ubuntu 22.04.4 LTS"

Prometheus version

I use prometheus-operator and got the version info from the Prometheus web UI:

version	3.0.1
revision	1f56e8492c31a558ccea833027db4bd7f8b6d0e9
branch	HEAD
buildUser	root@9c13055ffc3c
buildDate	20241128-17:20:55
goVersion	go1.23.3
platform: linux/amd64

Prometheus configuration file

open-telemetry-collector config file:

...
scrape_configs:
        - job_name: node-exporter
          scrape_interval: 2s
          static_configs:
            - targets: [ 10.10.1.84:9100 ] # modified here
              labels:
                host_ip: '10.10.1.84'
...

ServiceMonitor:

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: open-telemetry-collector
  namespace: monitoring
  labels:
    app: open-telemetry-collector
    release: prometheus
spec:
  selector:
    matchLabels:            # Service selector
      app: open-telemetry-collector
  namespaceSelector:        # Namespace selector
    matchNames:
      - monitoring
  endpoints:
  - port: metrics           # scrape port (as defined in the Service)
    interval: 5s            # scrape interval, set as needed; Prometheus defaults to 15s
    path: /metrics          # default path is /metrics
