Description
Host operating system: output of uname -a
In-house build of Linux 5.4.
node_exporter version: output of node_exporter --version
node_exporter, version 1.2.0+ds (branch: debian/sid, revision: 1.2.0+ds-0ionos2~deb10)
build user: team+pkg-go@tracker.debian.org
build date: 20210721-19:27:22
go version: go1.15.9
platform: linux/amd64
This is an in-house build of the 1.2.0 release plus the patches from the following merge requests:
- Add --collector.ethtool.ignored-devices #2085
- ethtool: Expose node_ethtool_info metric #2080
- ethtool: Sanitize metric names #2093
node_exporter command line flags
Relevant flags:
--collector.ethtool --collector.ethtool.ignored-devices=^(lo|dhcp-s|ovs-system|(b|bn|c|d|mb|n|pn|pub|r|t)[0-9a-f]+(n[0-9]+)?)$
Are you running node_exporter in Docker?
No
What did you do that produced an error?
After rolling out the new version and enabling the ethtool collector, the number of metrics increased by around 25%, which put too much additional load on our Prometheus servers. One system exposed 17738 metrics, of which 10196 (57%) were node_ethtool_ ones. I saw Ethernet devices with 674 entries in the ethtool -S output. Example:
NIC statistics:
rx_packets: 100095693
tx_packets: 26774178
rx_bytes: 7702229176
tx_bytes: 21227737647
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
collisions: 0
rx_length_errors: 0
rx_crc_errors: 0
rx_unicast: 13594191
tx_unicast: 26500801
rx_multicast: 3
tx_multicast: 273368
rx_broadcast: 86501474
tx_broadcast: 7
rx_unknown_protocol: 0
tx_linearize: 0
tx_force_wb: 0
tx_busy: 0
rx_alloc_fail: 0
rx_pg_alloc_fail: 0
tx-0.packets: 7335276
tx-0.bytes: 8216031580
rx-0.packets: 3009930
rx-0.bytes: 481779221
tx-1.packets: 37690
tx-1.bytes: 7069713
rx-1.packets: 7196
rx-1.bytes: 5842519
[...]
tx-126.packets: 0
tx-126.bytes: 0
rx-126.packets: 0
rx-126.bytes: 0
tx-127.packets: 0
tx-127.bytes: 0
rx-127.packets: 0
rx-127.bytes: 0
veb.rx_bytes: 0
veb.tx_bytes: 0
veb.rx_unicast: 0
veb.tx_unicast: 0
veb.rx_multicast: 0
veb.tx_multicast: 0
veb.rx_broadcast: 0
veb.tx_broadcast: 0
veb.rx_discards: 0
veb.tx_discards: 0
veb.tx_errors: 0
veb.rx_unknown_protocol: 0
veb.tc_0_tx_packets: 0
veb.tc_0_tx_bytes: 0
veb.tc_0_rx_packets: 0
veb.tc_0_rx_bytes: 0
veb.tc_1_tx_packets: 0
[...]
veb.tc_7_rx_bytes: 0
port.rx_bytes: 23320709815
port.tx_bytes: 24611734876
port.rx_unicast: 28825506
port.tx_unicast: 42010865
port.rx_multicast: 142587000
port.tx_multicast: 446671
port.rx_broadcast: 86501552
port.tx_broadcast: 8
port.tx_errors: 0
port.rx_dropped: 0
port.tx_dropped_link_down: 0
port.rx_crc_errors: 0
port.illegal_bytes: 0
port.mac_local_faults: 0
port.mac_remote_faults: 0
port.tx_timeout: 0
port.rx_csum_bad: 0
port.rx_length_errors: 0
port.link_xon_rx: 0
port.link_xoff_rx: 0
port.link_xon_tx: 0
port.link_xoff_tx: 0
port.rx_size_64: 90091895
port.rx_size_127: 162957382
port.rx_size_255: 1927681
port.rx_size_511: 1581791
port.rx_size_1023: 740555
port.rx_size_1522: 614754
port.rx_size_big: 0
port.tx_size_64: 1185001
port.tx_size_127: 20725706
port.tx_size_255: 4051046
port.tx_size_511: 1283653
port.tx_size_1023: 581012
port.tx_size_1522: 14631126
port.tx_size_big: 0
port.rx_undersize: 0
port.rx_fragments: 0
port.rx_oversize: 0
port.rx_jabber: 0
port.VF_admin_queue_requests: 0
port.arq_overflows: 0
port.tx_hwtstamp_timeouts: 0
port.rx_hwtstamp_cleared: 0
port.tx_hwtstamp_skipped: 0
port.fdir_flush_cnt: 0
port.fdir_atr_match: 13146280
port.fdir_atr_tunnel_match: 0
port.fdir_atr_status: 1
port.fdir_sb_match: 0
port.fdir_sb_status: 1
port.tx_lpi_status: 0
port.rx_lpi_status: 0
port.tx_lpi_count: 0
port.rx_lpi_count: 0
port.tx_priority_0_xon_tx: 0
port.tx_priority_0_xoff_tx: 0
port.rx_priority_0_xon_rx: 0
port.rx_priority_0_xoff_rx: 0
port.rx_priority_0_xon_2_xoff: 0
port.tx_priority_1_xon_tx: 0
[...]
port.rx_priority_7_xoff_rx: 0
port.rx_priority_7_xon_2_xoff: 0
What did you expect to see?
I expect the ethtool collector to produce a manageable number of metrics when only four network devices are monitored. If you advise on how best to solve this issue, I can prepare and test a merge request for it. Ideas:
- Exclude all metrics with numbers in their names
- Add an option to set an exclusion regex for metric names
- Use a whitelist of ethtool metrics to export
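A minimal sketch of the second idea, assuming a hypothetical --collector.ethtool.ignored-metrics flag (the flag name, the helper, and the sample regex are illustrative, not an existing node_exporter option):

```go
package main

import (
	"fmt"
	"regexp"
)

// metricsIgnoreRegex stands in for the value of a hypothetical
// --collector.ethtool.ignored-metrics flag. This sample pattern drops
// per-queue counters such as "tx-42.bytes" and per-priority counters
// such as "port.rx_priority_7_xoff_rx".
var metricsIgnoreRegex = regexp.MustCompile(`^(tx|rx)-[0-9]+\.|_priority_[0-9]+_`)

// filterStats removes ethtool stat entries whose names match the
// exclusion regex before they would be turned into Prometheus metrics.
func filterStats(stats map[string]uint64) map[string]uint64 {
	kept := make(map[string]uint64)
	for name, value := range stats {
		if metricsIgnoreRegex.MatchString(name) {
			continue // excluded by the ignore regex
		}
		kept[name] = value
	}
	return kept
}

func main() {
	stats := map[string]uint64{
		"rx_packets":                 100095693,
		"tx-126.bytes":               0,
		"port.rx_priority_7_xoff_rx": 0,
	}
	fmt.Println(filterStats(stats))
}
```

With this sample regex, only rx_packets survives; the per-queue and per-priority counters, which dominate the metric count on the systems described above, are dropped.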
What did you see instead?
There are per-CPU tx/rx queue counters (tx-0 through tx-127 and rx-0 through rx-127 above, with packets and bytes each, i.e. 512 per-queue metrics per device), which adds up on systems with 104 or more cores.