thermal_zone with ENODATA causes thermal collection to drop all data #1703

@candlerb

Host operating system

Linux brian-kit 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Hardware is an Intel Skull Canyon NUC

node_exporter version

node_exporter, version 1.0.0-rc.1 (branch: HEAD, revision: 3cedd344fd4ea8c1e6e1fb0854e824d5d8b2f24a)
  build user:       root@1e01740d5299
  build date:       20200514-15:02:31
  go version:       go1.14.2

node_exporter command line flags

--collector.textfile.directory=/var/lib/node_exporter --collector.systemd

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Scrape node_exporter looking for thermal metrics

What did you expect to see?

Thermal metrics such as:

node_thermal_zone_temp{type="XXX",zone="YYY"} 43
node_cooling_device_cur_state{name="0",type="Processor"} 0

What did you see instead?

No thermal metrics, and node_scrape_collector_success for the thermal_zone collector is 0.

# curl -s localhost:9100/metrics | egrep 'therm|cool'
node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.001691836
node_scrape_collector_success{collector="thermal_zone"} 0
node_systemd_unit_state{name="thermald.service",state="activating",type="dbus"} 0
node_systemd_unit_state{name="thermald.service",state="active",type="dbus"} 1
node_systemd_unit_state{name="thermald.service",state="deactivating",type="dbus"} 0
node_systemd_unit_state{name="thermald.service",state="failed",type="dbus"} 0
node_systemd_unit_state{name="thermald.service",state="inactive",type="dbus"} 0

Debug information

Each scrape causes node_exporter to log a single error-level line:

May 14 17:32:13 brian-kit node_exporter[6434]: level=error ts=2020-05-14T16:32:13.754Z caller=collector.go:161 msg="collector failed" name=thermal_zone duration_seconds=0.001691836 err="read /sys/class/thermal/thermal_zone4/temp: no data available"

The device does expose thermal information, but reading the temperature of the last zone (thermal_zone4) fails:

# ls /sys/class/thermal/
cooling_device0   cooling_device11  cooling_device14  cooling_device4  cooling_device7  thermal_zone0  thermal_zone3
cooling_device1   cooling_device12  cooling_device2   cooling_device5  cooling_device8  thermal_zone1  thermal_zone4
cooling_device10  cooling_device13  cooling_device3   cooling_device6  cooling_device9  thermal_zone2

# cat /sys/class/thermal/thermal_zone*/temp
27800
29800
41000
53000
cat: /sys/class/thermal/thermal_zone4/temp: No data available

# cat /sys/class/thermal/thermal_zone4/type
iwlwifi_1
# cat /sys/class/thermal/thermal_zone4/policy
step_wise
# cat /sys/class/thermal/thermal_zone4/temp
cat: /sys/class/thermal/thermal_zone4/temp: No data available

strace of node_exporter shows:

[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone4/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] <... openat resumed> )      = 18
[pid  6469] epoll_ctl(4, EPOLL_CTL_ADD, 18, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1146761160, u64=139935476232136}} <unfinished ...>
[pid  6469] <... epoll_ctl resumed> )   = 0
[pid  6469] fcntl(18, F_GETFL <unfinished ...>
[pid  6469] <... fcntl resumed> )       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid  6469] fcntl(18, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE <unfinished ...>
[pid  6469] <... fcntl resumed> )       = 0
[pid  6469] fstat(18,  <unfinished ...>
[pid  6469] <... fstat resumed> {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
[pid  6469] read(18,  <unfinished ...>
[pid  6469] <... read resumed> 0xc000420600, 4608) = -1 ENODATA (No data available)
[pid  6469] epoll_ctl(4, EPOLL_CTL_DEL, 18, 0xc00027a75c <unfinished ...>
[pid  6469] <... epoll_ctl resumed> )   = 0
[pid  6469] close(18 <unfinished ...>
[pid  6469] <... close resumed> )       = 0
[pid  6469] write(2, "level=error ts=2020-05-14T16:46:08.184Z caller=collector.go:161 msg=\"collector failed\" name=thermal_zone duration_seconds=0.0080"..., 202 <unfinished ...>

The openat on thermal_zone4/temp succeeds and returns a valid fd (18), but the subsequent read fails with ENODATA.

My working hypothesis is therefore that:

  1. This is a wifi card (zone type iwlwifi_1) with a thermal zone but no temperature sensor.
  2. This one read error causes the entire collector to give up and discard the results it had already collected before that point.

Aside: there's cooling info available too, and no errors:

# cat /sys/class/thermal/cooling_device*/cur_state
0
0
0
0
-1
0
0
0
0
0
0
0
0
0
0

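However, strace shows no attempt to read them. Grepping the strace output for /sys/class/thermal shows only the following. Note that the trace stops at thermal_zone4/temp: the mode and passive entries of that zone are never opened, and neither is any cooling_device: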

[pid  6469] newfstatat(AT_FDCWD, "/sys/class/thermal",  <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone0/type", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone0/policy", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone0/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone0/mode", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone0/passive", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone1/type", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone1/policy", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone1/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone1/mode", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone1/passive", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone2/type", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone2/policy", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone2/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone2/mode", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone2/passive", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone3/type", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone3/policy", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone3/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone3/mode", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone3/passive", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone4/type", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone4/policy", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  6469] openat(AT_FDCWD, "/sys/class/thermal/thermal_zone4/temp", O_RDONLY|O_CLOEXEC <unfinished ...>
