
pressure collector in version 1.1.0 prints errors on kernels that don't support /proc/pressure #1961

@siebenmann


The new pressure collector is enabled by default but logs errors (at the 'error' log level) on every scrape on kernels that do not support pressure stall information ('PSI'). Since PSI is a relatively new kernel feature, a default 1.1.0 setup logs copious errors on the standard kernels of many still-supported Linux distributions, such as CentOS 7 or Ubuntu 18.04 (with normal server kernels). Because these messages are logged at level 'error', you cannot suppress them without basically discarding all logs from the node exporter.

My view is that the pressure collector should not log error messages in the default configuration on kernels without PSI; at most it should report a collector failure in the node exporter's metrics. As a sysadmin, I would find it even better if the collector did not report a collector failure at all on systems without PSI, because I would prefer collector failures to be reported only where the collector should be working in the first place; if a collector is enabled on a system that doesn't support what it does at all, I would like that to show up as a different metric.

(The current situation makes it impossible to distinguish in general between 'the system does not seem to have this feature' and 'the system theoretically has this feature but we encountered some error collecting stats'. The latter is a much more serious issue than the former, since plenty of systems lack features that are checked by default-enabled collectors (many of which currently report collector failures).)

It's possible that these error messages are also printed in some ordinary configurations even on PSI-enabled kernels, but I can't investigate that; version 1.1.0 is silent (so far) on the systems I have run it on that do have /proc/pressure.

Host operating system: output of uname -a

Linux compsbk1 4.15.0-117-generic #118-Ubuntu SMP Fri Sep 4 20:02:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.1.0 (branch: HEAD, revision: 0e74fbcd5fe3b98246292829a8e81e3133e17033)
  build user:       root@c81c7415c0ee
  build date:       20210205-22:54:09
  go version:       go1.15.8
  platform:         linux/amd64

node_exporter command line flags

node_exporter --no-collector.wifi --collector.netdev.device-exclude='^lo$' --collector.systemd --collector.systemd.unit-include='^.+\.service$' --collector.systemd.unit-exclude='^(user|ifup)@.*$' --no-collector.zfs --collector.processes --collector.textfile.directory=/var/local/prometheus/node-exporter --collector.filesystem.ignored-mount-points='^/(sys|proc|dev)($|/)' --collector.filesystem.ignored-fs-types='^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs|rootfs|selinuxfs|nfs)$'

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Ran version 1.1.0 on a stock Ubuntu 18.04 LTS server, with the stock Ubuntu 18.04 kernel (well, a slightly older version).

What did you expect to see?

Quiet logs in a standard default configuration.

What did you see instead?

A constant report (one per scrape) of:
level=error ts=2021-02-08T19:42:48.048Z caller=collector.go:161 msg="collector failed" name=pressure duration_seconds=0.073142059 err="failed to retrieve pressure stats: psi_stats: unavailable for cpu"
