
fix: decode SMC float sensors as little-endian on Apple Silicon #150

Merged: inureyes merged 1 commit into main from fix/macos-temperature-readings on Apr 8, 2026


@inureyes (Member) commented Apr 8, 2026

Summary

Fixes broken temperature readings on macOS local mode. Apple Silicon's SMC stores flt (IEEE 754 single-precision) sensor values in little-endian byte order, but convert_value() was decoding them as big-endian. Every Tg*/Tp*/Te* read returned a randomly varying garbage float, the (10..=120) sanity filter rejected almost everything, and downstream the dashboard showed 0 °C / 6 °C / "Nominal" depending on which fallback path triggered.

Switching SMC_TYPE_FLT to f32::from_le_bytes() restores real die temperatures (~50–60 °C idle on M1 Ultra). Verified by dumping the raw SMC response buffer: bytes always landed at offset 48 with consistent LE-encoded floats matching expected die temps.
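A minimal Rust illustration of the byte-order difference, assuming the 4-byte flt payload has already been copied out of the SMC response buffer. The helper name decode_flt is hypothetical, not the project's code. Decoding the same bytes big-endian, as the old code did, yields a value nowhere near a plausible temperature:

```rust
/// Hypothetical helper: decode a 4-byte `flt ` payload.
/// Apple Silicon SMC stores these as IEEE 754 single-precision, little-endian.
fn decode_flt(bytes: [u8; 4]) -> f32 {
    f32::from_le_bytes(bytes)
}

fn main() {
    // 50.75 is exactly representable as f32 (bit pattern 0x424B0000),
    // so its little-endian encoding is [0x00, 0x00, 0x4B, 0x42].
    let raw: [u8; 4] = [0x00, 0x00, 0x4B, 0x42];

    let fixed = decode_flt(raw); // 50.75
    let buggy = f32::from_be_bytes(raw); // old behaviour: a tiny subnormal value

    println!("LE decode: {}", fixed);
    println!("BE decode: {:e}", buggy);
}
```

A big-endian decode treats the low-order bytes of the real reading as the sign and exponent, so small changes in the actual temperature swing the decoded value across many orders of magnitude. That is consistent with the randomly varying garbage described above.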

Root Cause

src/device/macos_native/smc.rs:373 decoded SMC_TYPE_FLT with f32::from_be_bytes(), but Apple Silicon's SMC actually stores floats little-endian (mactop, asitop, and macmon all decode them this way). The legacy fixed-point types (SP78, FP*) remain big-endian, so the bug only affects the flt type; on Apple Silicon, however, every modern temperature, voltage, and current sensor uses flt.
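To make the contrast concrete, here is a simplified, hypothetical dispatch in the spirit of convert_value(): flt is read little-endian, while the legacy sp78 type (signed fixed point with 8 fractional bits) stays big-endian. The function signature, the Option return type, and the subset of types handled are assumptions, not the code in smc.rs.

```rust
/// Hypothetical, simplified stand-in for convert_value() in smc.rs.
/// Returns None for unhandled types or payloads that are too short.
fn convert_value(data_type: &[u8; 4], bytes: &[u8]) -> Option<f32> {
    match data_type {
        // Apple Silicon float sensors: IEEE 754 single-precision, little-endian.
        b"flt " if bytes.len() >= 4 => Some(f32::from_le_bytes([
            bytes[0], bytes[1], bytes[2], bytes[3],
        ])),
        // Legacy sp78: signed 16-bit, big-endian, 8 fractional bits.
        b"sp78" if bytes.len() >= 2 => {
            Some(i16::from_be_bytes([bytes[0], bytes[1]]) as f32 / 256.0)
        }
        _ => None,
    }
}

fn main() {
    // 50.5 °C as sp78: 50.5 * 256 = 12928 = 0x3280, stored big-endian.
    println!("{:?}", convert_value(b"sp78", &[0x32, 0x80]));
    // 50.75 °C as flt, stored little-endian.
    println!("{:?}", convert_value(b"flt ", &[0x00, 0x00, 0x4B, 0x42]));
}
```

Keeping the byte order next to each type code in one match makes it harder for a single type's endianness to be changed by accident.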

Symptoms:

  • all_smi_gpu_temperature_celsius = 0 in API mode
  • all_smi_cpu_temperature_celsius not exported (because cpu.temperature was always None)
  • View mode "GPU Temp." gauge oscillating between 0 and ~7 °C
  • Dashboard "Avg. Temp" cell showing only thermal pressure text, because the numeric path produced nonsense
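The chain from a bad decode to a 0 °C reading can be sketched as follows. The 10..=120 range is the sanity filter named in the Summary; treating a filtered-out reading as 0 °C via unwrap_or(0.0) is an assumption standing in for whatever fallback the exporter and gauges actually use.

```rust
/// The PR's sanity filter: keep only plausible die temperatures (°C).
/// NaN and infinities are rejected too, since contains() is false for them.
fn sanitize(reading: f32) -> Option<f32> {
    if (10.0..=120.0).contains(&reading) {
        Some(reading)
    } else {
        None
    }
}

fn main() {
    let raw = 52.3_f32.to_le_bytes(); // bytes as the SMC returns them (LE)

    let buggy = sanitize(f32::from_be_bytes(raw)); // decodes far out of range
    let fixed = sanitize(f32::from_le_bytes(raw)); // passes the filter

    // Hypothetical downstream fallback: a missing reading is shown as 0 °C.
    println!("buggy: {} °C", buggy.unwrap_or(0.0));
    println!("fixed: {} °C", fixed.unwrap_or(0.0));
}
```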

Additional Fixes

While investigating the temperature display, several related correctness issues were uncovered and fixed in the same PR:

  1. architecture detail key case mismatch. apple_silicon_native and the prometheus exporter wrote the GPU detail key as lowercase "architecture", but ui/dashboard.rs and view/data_collection/aggregator.rs looked it up as "Architecture". The mismatch silently disabled the entire is_apple_silicon special-case path. Standardised on the lowercase form everywhere, including the NVIDIA and Jetson readers.

  2. cpu_macos::get_cpu_temperature was a stub. Always returned None, so the live "CPU Temp." gauge was permanently 0 °C even though SMC was already collecting the value. Wired it through the cached NativeMetricsData already fetched in get_apple_silicon_cpu_info.

  3. GPU temperature fallback. apple_silicon_native now falls back to the SMC CPU die temperature when the GPU sensor is unavailable. CPU and GPU share the same SoC die, so the readings track each other closely; this is far more meaningful than reporting 0 °C.

  4. Unified dashboard temperature display. "Avg. Temp" cell now shows numeric °C on every platform. On Apple Silicon the second-row "Temp. Stdev" cell becomes a "Thermal" cell carrying the qualitative thermal pressure level (single-die std dev is meaningless), so both numeric and qualitative information remain visible.

  5. Per-GPU list view. Used to display thermal pressure text on Apple Silicon for the same reason; now shows the real numeric die temp, consistent with every other platform.
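Fixes 3 and 4 are small pieces of selection logic. The sketch below shows one way to express them; the struct, field names, and function signatures are hypothetical, and the one-decimal formatting of the stdev cell is an assumption.

```rust
/// Hypothetical snapshot of the SMC die readings for one sample.
struct SmcTemps {
    gpu_die: Option<f32>,
    cpu_die: Option<f32>,
}

/// Fix 3: prefer the GPU sensor, fall back to the CPU die temperature
/// (same SoC die). None only if neither sensor produced a reading.
fn gpu_temperature(t: &SmcTemps) -> Option<f32> {
    t.gpu_die.or(t.cpu_die)
}

/// Fix 4: label and value for the second-row dashboard cell.
/// On Apple Silicon a single-die stdev is meaningless, so show thermal pressure.
fn second_row_cell(is_apple_silicon: bool, thermal_pressure: &str, temp_stdev: f32) -> (String, String) {
    if is_apple_silicon {
        ("Thermal".to_string(), thermal_pressure.to_string())
    } else {
        ("Temp. Stdev".to_string(), format!("{:.1}", temp_stdev))
    }
}

fn main() {
    let temps = SmcTemps { gpu_die: None, cpu_die: Some(60.0) };
    println!("GPU temp: {:?}", gpu_temperature(&temps));
    println!("{:?}", second_row_cell(true, "Nominal", 0.0));
}
```

Option::or keeps the fallback explicit: a real GPU reading always wins, and None still propagates when no sensor is available rather than being silently turned into 0 °C.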

Verification (M1 Ultra)

Before:

all_smi_gpu_temperature_celsius{...} 0
# (no all_smi_cpu_temperature_celsius metric exported)

After:

all_smi_gpu_temperature_celsius{...} 51
all_smi_cpu_temperature_celsius{...} 60
gpu_info{cpu_temperature="60.0", gpu_temperature="50.7", thermal_pressure="Nominal", ...}

Added test_flt_little_endian_decoding regression unit test so the endianness can't silently flip back.
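A regression test in this spirit pins the exact byte order with a hand-written fixture. The sketch below is not the test added in the PR: decode_flt is a hypothetical stand-in for the flt branch of convert_value(), and the fixture uses 50.75 because it is exactly representable as f32 (bit pattern 0x424B0000), so its little-endian bytes can be written out by hand.

```rust
/// Hypothetical stand-in for the `flt ` branch of convert_value().
fn decode_flt(bytes: &[u8]) -> Option<f32> {
    if bytes.len() < 4 {
        return None;
    }
    Some(f32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_flt_little_endian_decoding() {
        // 50.75_f32 is 0x424B0000; little-endian order puts the low byte first.
        let raw: [u8; 4] = [0x00, 0x00, 0x4B, 0x42];
        // If decode_flt ever switches to from_be_bytes, these bytes decode
        // to a tiny subnormal and this assertion fails.
        assert_eq!(decode_flt(&raw), Some(50.75));
    }

    #[test]
    fn short_payload_is_rejected() {
        assert_eq!(decode_flt(&[0x00, 0x00]), None);
    }
}

fn main() {
    println!("{:?}", decode_flt(&[0x00, 0x00, 0x4B, 0x42]));
}
```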

Test plan

  • cargo build succeeds
  • cargo clippy clean
  • cargo test passes (5 + 17 + 19 tests, including new endianness test)
  • M1 Ultra: API mode exports real die temperatures
  • M1 Ultra: View mode dashboard shows consistent numeric °C across top row, live stats, and per-GPU list

Apple Silicon's SMC stores `flt ` (IEEE 754 single-precision) sensor
values in little-endian byte order, unlike the legacy SP78/FP* fixed
point types which remain big-endian. The convert_value() helper was
using f32::from_be_bytes() for FLT, so every Tg*/Tp*/Te* temperature
read returned a randomly varying garbage float (e.g. 1e-32, 3e36) and
the (10..=120) sanity filter rejected almost everything. Symptoms in
local view mode: GPU temperature showed 0 °C or sporadic 6/7 °C, CPU
temperature showed 0 °C, and the dashboard "Avg. Temp" cell showed only
thermal pressure text, because the numeric path produced nonsense.

Switching the FLT branch to from_le_bytes() restores real die
temperatures (~50–60 °C idle on M1 Ultra). Verified by dumping the raw
SMC response buffer: bytes always landed at offset 48 with consistent
LE-encoded floats matching expected die temps.

While here, several related correctness issues were also addressed:

* Dashboard and aggregator looked up the GPU detail key as
  "Architecture" but apple_silicon_native (and the prometheus exporter)
  write it lowercase as "architecture". The mismatch silently disabled
  the entire is_apple_silicon special-case path. Standardise on the
  lowercase form everywhere, including the NVIDIA and Jetson readers.
* cpu_macos::get_cpu_temperature was a stub that always returned None,
  so the live "CPU Temp." gauge was permanently 0 °C even though SMC
  was already collecting the value. Wire it through the cached
  NativeMetricsData fetched in get_apple_silicon_cpu_info.
* apple_silicon_native now falls back to the SMC CPU die temperature
  when the GPU sensor is unavailable. CPU and GPU share the same SoC
  die so the readings track each other closely; this is far more
  meaningful than reporting 0 °C.
* Dashboard "Avg. Temp" cell now shows numeric °C on every platform.
  On Apple Silicon the second-row "Temp. Stdev" cell becomes a
  "Thermal" cell carrying the qualitative thermal pressure level
  (single-die std dev is meaningless), so both pieces of information
  remain visible.
* The per-GPU list view used to display thermal pressure text on Apple
  Silicon for the same reason; it now shows the real numeric die temp,
  consistent with every other platform.

Added a regression unit test (test_flt_little_endian_decoding) so that
the endianness can't silently flip back.
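The architecture-key bug comes down to HashMap lookups being case-sensitive. One way to keep writers and readers from drifting apart is a single shared constant. The constant name ARCHITECTURE_KEY, the map type, and the "Apple Silicon" value below are assumptions for illustration, not the project's code.

```rust
use std::collections::HashMap;

/// Hypothetical shared key; the PR standardises on lowercase "architecture".
const ARCHITECTURE_KEY: &str = "architecture";

/// Reader side: uses the same constant as the writer, so the case always matches.
/// The "Apple Silicon" value is an assumed example, not the real detail value.
fn is_apple_silicon(details: &HashMap<String, String>) -> bool {
    details
        .get(ARCHITECTURE_KEY)
        .map_or(false, |arch| arch == "Apple Silicon")
}

fn main() {
    let mut details = HashMap::new();
    details.insert(ARCHITECTURE_KEY.to_string(), "Apple Silicon".to_string());

    // The old readers used "Architecture"; keys are case-sensitive, so it misses.
    println!("capitalised lookup: {:?}", details.get("Architecture"));
    println!("apple silicon? {}", is_apple_silicon(&details));
}
```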
@inureyes added the labels type:bug, device:apple-silicon, mode:view, mode:api, and priority:high on Apr 8, 2026
@inureyes merged commit 3cd6b76 into main on Apr 8, 2026 (1 check passed)
@inureyes deleted the fix/macos-temperature-readings branch on April 8, 2026 at 13:10
@inureyes self-assigned this on Apr 17, 2026