ethtool collector produces too many metrics (hundreds per device) #2096

@bdrung

Description

Host operating system: output of uname -a

In-house build of linux 5.4.

node_exporter version: output of node_exporter --version

node_exporter, version 1.2.0+ds (branch: debian/sid, revision: 1.2.0+ds-0ionos2~deb10)
build user: team+pkg-go@tracker.debian.org
build date: 20210721-19:27:22
go version: go1.15.9
platform: linux/amd64

This is an in-house build of the 1.2.0 release plus the patches from the following merge requests:

node_exporter command line flags

Relevant flags:

--collector.ethtool --collector.ethtool.ignored-devices=^(lo|dhcp-s|ovs-system|(b|bn|c|d|mb|n|pn|pub|r|t)[0-9a-f]+(n[0-9]+)?)$

Are you running node_exporter in Docker?

No

What did you do that produced an error?

After rolling out the new version and enabling the ethtool collector, the number of metrics increased by around 25%, which put too much additional load on our Prometheus servers. One system had 17738 metrics, of which 10196 (57%) were node_ethtool_ ones. I saw Ethernet devices with 674 entries in their ethtool -S output. Example:

NIC statistics:
     rx_packets: 100095693
     tx_packets: 26774178
     rx_bytes: 7702229176
     tx_bytes: 21227737647
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     collisions: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     rx_unicast: 13594191
     tx_unicast: 26500801
     rx_multicast: 3
     tx_multicast: 273368
     rx_broadcast: 86501474
     tx_broadcast: 7
     rx_unknown_protocol: 0
     tx_linearize: 0
     tx_force_wb: 0
     tx_busy: 0
     rx_alloc_fail: 0
     rx_pg_alloc_fail: 0
     tx-0.packets: 7335276
     tx-0.bytes: 8216031580
     rx-0.packets: 3009930
     rx-0.bytes: 481779221
     tx-1.packets: 37690
     tx-1.bytes: 7069713
     rx-1.packets: 7196
     rx-1.bytes: 5842519
[...]
     tx-126.packets: 0
     tx-126.bytes: 0
     rx-126.packets: 0
     rx-126.bytes: 0
     tx-127.packets: 0
     tx-127.bytes: 0
     rx-127.packets: 0
     rx-127.bytes: 0
     veb.rx_bytes: 0
     veb.tx_bytes: 0
     veb.rx_unicast: 0
     veb.tx_unicast: 0
     veb.rx_multicast: 0
     veb.tx_multicast: 0
     veb.rx_broadcast: 0
     veb.tx_broadcast: 0
     veb.rx_discards: 0
     veb.tx_discards: 0
     veb.tx_errors: 0
     veb.rx_unknown_protocol: 0
     veb.tc_0_tx_packets: 0
     veb.tc_0_tx_bytes: 0
     veb.tc_0_rx_packets: 0
     veb.tc_0_rx_bytes: 0
     veb.tc_1_tx_packets: 0
[...]
     veb.tc_7_rx_bytes: 0
     port.rx_bytes: 23320709815
     port.tx_bytes: 24611734876
     port.rx_unicast: 28825506
     port.tx_unicast: 42010865
     port.rx_multicast: 142587000
     port.tx_multicast: 446671
     port.rx_broadcast: 86501552
     port.tx_broadcast: 8
     port.tx_errors: 0
     port.rx_dropped: 0
     port.tx_dropped_link_down: 0
     port.rx_crc_errors: 0
     port.illegal_bytes: 0
     port.mac_local_faults: 0
     port.mac_remote_faults: 0
     port.tx_timeout: 0
     port.rx_csum_bad: 0
     port.rx_length_errors: 0
     port.link_xon_rx: 0
     port.link_xoff_rx: 0
     port.link_xon_tx: 0
     port.link_xoff_tx: 0
     port.rx_size_64: 90091895
     port.rx_size_127: 162957382
     port.rx_size_255: 1927681
     port.rx_size_511: 1581791
     port.rx_size_1023: 740555
     port.rx_size_1522: 614754
     port.rx_size_big: 0
     port.tx_size_64: 1185001
     port.tx_size_127: 20725706
     port.tx_size_255: 4051046
     port.tx_size_511: 1283653
     port.tx_size_1023: 581012
     port.tx_size_1522: 14631126
     port.tx_size_big: 0
     port.rx_undersize: 0
     port.rx_fragments: 0
     port.rx_oversize: 0
     port.rx_jabber: 0
     port.VF_admin_queue_requests: 0
     port.arq_overflows: 0
     port.tx_hwtstamp_timeouts: 0
     port.rx_hwtstamp_cleared: 0
     port.tx_hwtstamp_skipped: 0
     port.fdir_flush_cnt: 0
     port.fdir_atr_match: 13146280
     port.fdir_atr_tunnel_match: 0
     port.fdir_atr_status: 1
     port.fdir_sb_match: 0
     port.fdir_sb_status: 1
     port.tx_lpi_status: 0
     port.rx_lpi_status: 0
     port.tx_lpi_count: 0
     port.rx_lpi_count: 0
     port.tx_priority_0_xon_tx: 0
     port.tx_priority_0_xoff_tx: 0
     port.rx_priority_0_xon_rx: 0
     port.rx_priority_0_xoff_rx: 0
     port.rx_priority_0_xon_2_xoff: 0
     port.tx_priority_1_xon_tx: 0
[...]
     port.rx_priority_7_xoff_rx: 0
     port.rx_priority_7_xon_2_xoff: 0

What did you expect to see?

I expect the ethtool collector to produce only a manageable number of metrics when monitoring just four network devices. If you can advise how best to solve this issue, I can prepare and test a merge request for it. Ideas:

  • Exclude all metrics with numbers in their names.
  • Add an option to set an exclusion regex for metric names.
  • Use a whitelist of ethtool metrics to export.

What did you see instead?

There are per-CPU tx/rx queue statistics, which add up on systems with 104 cores or more.
