fix: decode SMC float sensors as little-endian on Apple Silicon #150
Merged
Apple Silicon's SMC stores `flt ` (IEEE 754 single-precision) sensor values in little-endian byte order, unlike the legacy SP78/FP* fixed point types which remain big-endian. The convert_value() helper was using f32::from_be_bytes() for FLT, so every Tg*/Tp*/Te* temperature read returned a randomly varying garbage float (e.g. 1e-32, 3e36) and the (10..=120) sanity filter rejected almost everything.

Symptoms in local view mode: GPU temperature showed 0 °C or sporadic 6/7 °C, CPU temperature showed 0 °C, dashboard "Avg. Temp" showed thermal pressure text only because the numeric path produced nonsense.

Switching the FLT branch to from_le_bytes() restores real die temperatures (~50–60 °C idle on M1 Ultra). Verified by dumping the raw SMC response buffer: bytes always landed at offset 48 with consistent LE-encoded floats matching expected die temps.

While here, several related correctness issues were also addressed:

* Dashboard and aggregator looked up the GPU detail key as "Architecture" but apple_silicon_native (and the prometheus exporter) write it lowercase as "architecture". The mismatch silently disabled the entire is_apple_silicon special-case path. Standardise on the lowercase form everywhere, including the NVIDIA and Jetson readers.
* cpu_macos::get_cpu_temperature was a stub that always returned None, so the live "CPU Temp." gauge was permanently 0 °C even though SMC was already collecting the value. Wire it through the cached NativeMetricsData fetched in get_apple_silicon_cpu_info.
* apple_silicon_native now falls back to the SMC CPU die temperature when the GPU sensor is unavailable. CPU and GPU share the same SoC die so the readings track each other closely; this is far more meaningful than reporting 0 °C.
* Dashboard "Avg. Temp" cell now shows numeric °C on every platform. On Apple Silicon the second-row "Temp. Stdev" cell becomes a "Thermal" cell carrying the qualitative thermal pressure level (single-die std dev is meaningless), so both pieces of information remain visible.
* The per-GPU list view used to display thermal pressure text on Apple Silicon for the same reason; it now shows the real numeric die temp, consistent with every other platform.

Added a regression unit test (test_flt_little_endian_decoding) so that the endianness can't silently flip back.
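The byte-order split described in this commit message can be illustrated with a minimal sketch. The `SmcType` enum and the `decode` and `plausible` functions are hypothetical names, not the project's actual `convert_value()` code, and the 8.8 scaling used for SP78 is an assumption based on the usual reading of that type tag.

```rust
/// Hypothetical type tags; the real constants live in the project's SMC module.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SmcType {
    /// IEEE 754 single precision, little-endian on Apple Silicon.
    Flt,
    /// Signed fixed point (assumed 8 integer + 8 fractional bits), big-endian.
    Sp78,
}

/// Decode a raw SMC payload into a float.
/// Returns None when the payload is shorter than the type requires.
fn decode(kind: SmcType, bytes: &[u8]) -> Option<f32> {
    match (kind, bytes) {
        (SmcType::Flt, [a, b, c, d, ..]) => {
            // The fix: little-endian, not big-endian.
            Some(f32::from_le_bytes([*a, *b, *c, *d]))
        }
        (SmcType::Sp78, [hi, lo, ..]) => {
            // Legacy fixed-point values stay big-endian.
            Some(f32::from(i16::from_be_bytes([*hi, *lo])) / 256.0)
        }
        _ => None,
    }
}

/// The sanity window quoted above: only 10..=120 °C is accepted.
fn plausible(temp_c: f32) -> bool {
    (10.0..=120.0).contains(&temp_c)
}
```

With this sketch, `decode(SmcType::Flt, &55.0_f32.to_le_bytes())` yields `Some(55.0)`. Reading the same four bytes with `f32::from_be_bytes` instead produces a subnormal value close to zero, which `plausible` rejects, matching the "filter rejected almost everything" symptom.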
Summary
Fixes broken temperature readings on macOS local mode. Apple Silicon's SMC stores `flt ` (IEEE 754 single-precision) sensor values in little-endian byte order, but `convert_value()` was decoding them as big-endian. Every Tg*/Tp*/Te* read returned a randomly varying garbage float, the (10..=120) sanity filter rejected almost everything, and downstream the dashboard showed 0 °C / 6 °C / "Nominal" depending on which fallback path triggered.

Switching `SMC_TYPE_FLT` to `f32::from_le_bytes()` restores real die temperatures (~50–60 °C idle on M1 Ultra). Verified by dumping the raw SMC response buffer: bytes always landed at offset 48 with consistent LE-encoded floats matching expected die temps.

Root Cause
`src/device/macos_native/smc.rs:373` decoded `SMC_TYPE_FLT` with `f32::from_be_bytes()`. Apple Silicon SMC actually stores floats little-endian (mactop, asitop, macmon all do this). The legacy fixed-point types (SP78, FP*) remain big-endian, so the bug only affects the `flt` variant, but on Apple Silicon every modern temperature/voltage/current sensor uses `flt`.

Symptoms:
* `all_smi_gpu_temperature_celsius` = 0 in API mode
* `all_smi_cpu_temperature_celsius` not exported (because `cpu.temperature` was always None)

Additional Fixes
While investigating the temperature display, several related correctness issues were uncovered and fixed in the same PR:
* `architecture` detail key case mismatch. `apple_silicon_native` and the prometheus exporter wrote the GPU detail key as lowercase `"architecture"`, but `ui/dashboard.rs` and `view/data_collection/aggregator.rs` looked it up as `"Architecture"`. The mismatch silently disabled the entire `is_apple_silicon` special-case path. Standardised on the lowercase form everywhere, including the NVIDIA and Jetson readers.
* `cpu_macos::get_cpu_temperature` was a stub. Always returned `None`, so the live "CPU Temp." gauge was permanently 0 °C even though SMC was already collecting the value. Wired it through the cached `NativeMetricsData` already fetched in `get_apple_silicon_cpu_info`.
* GPU temperature fallback. `apple_silicon_native` now falls back to the SMC CPU die temperature when the GPU sensor is unavailable. CPU and GPU share the same SoC die so the readings track each other closely; far more meaningful than reporting 0 °C.
* Unified dashboard temperature display. "Avg. Temp" cell now shows numeric °C on every platform. On Apple Silicon the second-row "Temp. Stdev" cell becomes a "Thermal" cell carrying the qualitative thermal pressure level (single-die std dev is meaningless), so both numeric and qualitative information remain visible.
* Per-GPU list view. Used to display thermal pressure text on Apple Silicon for the same reason; now shows the real numeric die temp, consistent with every other platform.
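Two of the fixes above, the detail-key casing and the GPU temperature fallback, can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `DETAIL_ARCHITECTURE` constant, the `"Apple Silicon"` value, and both function signatures are hypothetical. Sharing one constant between writers and readers is just one possible way to keep the casing from drifting again.

```rust
use std::collections::HashMap;

/// Hypothetical shared constant so writers and readers agree on the key casing.
const DETAIL_ARCHITECTURE: &str = "architecture";

/// Hypothetical check; the value stored under the key is an assumption.
/// HashMap keys are case-sensitive, so a lookup under "Architecture" would miss.
fn is_apple_silicon(detail: &HashMap<String, String>) -> bool {
    detail
        .get(DETAIL_ARCHITECTURE)
        .map_or(false, |value| value == "Apple Silicon")
}

/// Prefer the GPU sensor, fall back to the CPU die temperature,
/// and report None (rather than 0 °C) when neither is available.
fn gpu_temperature(gpu_c: Option<f32>, cpu_c: Option<f32>) -> Option<f32> {
    gpu_c.or(cpu_c)
}
```

Returning `Option<f32>` keeps "no reading" distinct from a real 0 °C value, so the display layer can decide how to render a missing sensor.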
Verification (M1 Ultra)
Before:
After:
Added `test_flt_little_endian_decoding` regression unit test so the endianness can't silently flip back.

Test plan
* `cargo build` succeeds
* `cargo clippy` clean
* `cargo test` passes (5 + 17 + 19 tests, including new endianness test)
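For readers who want a sense of what an endianness regression test can look like, here is a minimal sketch. It is not the `test_flt_little_endian_decoding` test from this PR: `decode_flt` is a hypothetical stand-in for the FLT branch of `convert_value()`, and the sample temperatures are arbitrary.

```rust
/// Hypothetical stand-in for the FLT branch of the project's convert_value().
fn decode_flt(bytes: [u8; 4]) -> f32 {
    f32::from_le_bytes(bytes)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn flt_round_trips_as_little_endian() {
        // Arbitrary sample temperatures, all exactly representable in f32.
        for expected in [35.5_f32, 52.25, 60.0, 98.75] {
            let decoded = decode_flt(expected.to_le_bytes());
            assert_eq!(decoded, expected);
            assert!((10.0..=120.0).contains(&decoded));
        }
    }

    #[test]
    fn big_endian_misread_fails_the_sanity_filter() {
        // Reading little-endian bytes as big-endian must not look plausible.
        let misread = f32::from_be_bytes(60.0_f32.to_le_bytes());
        assert!(!(10.0..=120.0).contains(&misread));
    }
}
```

The second test pins down the failure mode itself, so a future switch back to big-endian decoding would fail loudly rather than only showing up as 0 °C on the dashboard.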