
feat: energy accumulation (kWh) and cost estimation with $/kWh config #201

Merged
inureyes merged 3 commits into main from feature/issue-191-energy-accumulation
Apr 20, 2026

@inureyes

Member

Summary

  • Add trapezoidal PowerIntegrator / EnergyAccountant that turns every chassis / GPU / CPU power sample into a Joule counter, with f64 precision and explicit gap + NaN handling.
  • Persist the Joule deltas to an append-only WAL at ~/.cache/all-smi/energy-wal.bin (O_NOFOLLOW, 0o600, fsync every 60 s) so Prometheus counters stay monotonic across restarts.
  • Export the new all_smi_energy_consumed_joules_total counter via api mode and display a session kWh / $ row in the chassis TUI renderer, gated on a new [energy] config block with env-var overrides.
  • Wire the R hotkey to reset the per-session counter (lifetime counter + WAL are intentionally preserved for Prometheus monotonicity).
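
As a rough illustration of the integration rule summarised above, the following sketch accumulates joules per key with the trapezoidal rule, interpolates across short gaps, and holds the last reading across long ones. It is not the PR's `PowerIntegrator`: the `Integrator` type, its string keys, plain `f64` timestamps in seconds, and the decision to ignore out-of-order timestamps are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the PR's `PowerIntegrator`.
pub struct Integrator {
    /// Gaps up to this many seconds are interpolated; longer gaps hold the last reading.
    gap_threshold_s: f64,
    /// Last accepted sample per key: (timestamp in seconds, sanitized watts).
    last: HashMap<String, (f64, f64)>,
    /// Accumulated energy per key, in joules.
    joules: HashMap<String, f64>,
}

impl Integrator {
    pub fn new(gap_threshold_s: f64) -> Self {
        Integrator {
            gap_threshold_s,
            last: HashMap::new(),
            joules: HashMap::new(),
        }
    }

    pub fn record_sample(&mut self, key: &str, t: f64, watts: f64) {
        // NaN, infinite, and negative readings are treated as 0 W.
        let w = if watts.is_finite() && watts > 0.0 { watts } else { 0.0 };
        if let Some(&(t0, w0)) = self.last.get(key) {
            let dt = t - t0;
            if !(dt > 0.0) {
                // Assumption: duplicate or out-of-order timestamps are ignored.
                return;
            }
            let delta = if dt <= self.gap_threshold_s {
                // Trapezoid between the previous and the current sample.
                0.5 * (w0 + w) * dt
            } else {
                // Long gap: hold the previous reading for the whole interval.
                w0 * dt
            };
            *self.joules.entry(key.to_string()).or_insert(0.0) += delta;
        }
        // The clock advances even when the sample was sanitized to zero.
        self.last.insert(key.to_string(), (t, w));
    }

    pub fn joules(&self, key: &str) -> f64 {
        self.joules.get(key).copied().unwrap_or(0.0)
    }
}
```

With a 10 s threshold, samples of 100 W at t = 0 and 200 W at t = 5 add 750 J (trapezoid), while the same pair 30 s apart adds 3 000 J (hold-last).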

Implementation

  • src/metrics/energy.rs (new): PowerIntegrator::record_sample(key, t, watts), EnergyAccountant, EnergyKey::{gpu,cpu,chassis}, trapezoidal integration, gap-interpolate vs hold-last policy (10 s default threshold, configurable), NaN/negative samples contribute zero, seed_lifetime, drain_wal_deltas, joules_to_kwh, joules_to_cost.
  • src/metrics/energy_wal.rs (new): 24-byte records (u64 host_hash, u64 device_hash, f64 joules_delta), replay_from_path that silently drops a torn final record, WalReplayIndex::seed_if_matches called on first live sample, spawn_wal_flush_task runs on a 60 s cadence with fsync after each batch. Note: the issue body says 16-byte records but the three fields sum to 24 bytes — documented as a typo in the module docs.
  • src/api/metrics/energy.rs (new) + src/api/metrics/render.rs: new EnergyMetricExporter wired into the shared Prometheus renderer; emits rows with {host, scope, gpu_index, gpu_uuid} (per-GPU) or {host, scope} (CPU / chassis). Absent for devices that never reported power, per the Prometheus convention of metric absence meaning "unsupported".
  • src/api/server.rs: replays the WAL at startup, spawns the flush task, and calls a local integrate_power_samples helper each scrape cycle (the library crate cannot reach back into the binary-only view::data_collection module).
  • src/view/data_collection/aggregator.rs: new update_energy_counters(state) that collects (key, watts) pairs, consults state.energy_wal_replay on first observation per key, then records the sample. Called from both local and remote collectors after the new data has been written to state.
  • src/ui/renderers/chassis_renderer.rs: print_chassis_energy_row emits Energy session: X kWh | $Y (at $Z/kWh) directly under each chassis block; self-hides when no session energy yet, or when cost_visible() is false.
  • src/ui/renderers/energy_renderer.rs (new): format_top_consumers / render_top_consumers for the optional top-3 panel; reserved for a follow-up E section (gated behind #[allow(dead_code)] for now, unit-tested).
  • src/view/event_handler.rs: R hotkey enters the global ladder alongside A / V / T; precedence is filter > replay-timecode > Users-tab > Topology-tab > global (including R) > replay. Toast via NotificationManager::show. Only uppercase R is bound so the operator cannot lose session data by typing r outside edit mode.
  • src/common/config.rs: new EnergyConfig { price_per_kwh, currency, show_cost, wal_path, gap_interpolate_seconds, wal_enabled } with defaults from the issue spec. with_env_overrides honours ALL_SMI_ENERGY_PRICE, ALL_SMI_ENERGY_CURRENCY, ALL_SMI_ENERGY_NO_COST, ALL_SMI_ENERGY_WAL_PATH, ALL_SMI_ENERGY_NO_WAL, ALL_SMI_ENERGY_GAP_SECONDS. The TOML loader is a no-op until the companion config-file issue #192 (feat: TOML config file support, '~/.config/all-smi/config.toml') lands.
  • src/app_state.rs: houses the EnergyAccountant, EnergyConfig, and WalReplayIndex; constructor pulls env overrides and derives the integrator's gap threshold from the config.
  • src/view/render_snapshot.rs: clones energy state into the per-frame snapshot so the renderer stays lock-free.
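
To make the 24-byte record layout concrete, here is a minimal sketch of encoding and replaying such records from an in-memory buffer. The field order follows the description above; the little-endian byte order, the `WalRecord` and `replay` names, and the in-memory buffer (instead of a file opened with O_NOFOLLOW) are assumptions for illustration and may differ from energy_wal.rs.

```rust
/// Three 8-byte fields: host_hash, device_hash, joules_delta.
pub const RECORD_LEN: usize = 24;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WalRecord {
    pub host_hash: u64,
    pub device_hash: u64,
    pub joules_delta: f64,
}

/// Copy exactly eight bytes out of a slice (caller guarantees the length).
fn read8(bytes: &[u8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    out.copy_from_slice(bytes);
    out
}

impl WalRecord {
    /// Serialize to a fixed-width record (little-endian is an assumption).
    pub fn encode(&self) -> [u8; RECORD_LEN] {
        let mut buf = [0u8; RECORD_LEN];
        buf[0..8].copy_from_slice(&self.host_hash.to_le_bytes());
        buf[8..16].copy_from_slice(&self.device_hash.to_le_bytes());
        buf[16..24].copy_from_slice(&self.joules_delta.to_le_bytes());
        buf
    }

    /// Parse one record; returns None unless exactly RECORD_LEN bytes are given.
    pub fn decode(bytes: &[u8]) -> Option<WalRecord> {
        if bytes.len() != RECORD_LEN {
            return None;
        }
        Some(WalRecord {
            host_hash: u64::from_le_bytes(read8(&bytes[0..8])),
            device_hash: u64::from_le_bytes(read8(&bytes[8..16])),
            joules_delta: f64::from_le_bytes(read8(&bytes[16..24])),
        })
    }
}

/// Replay a WAL image. `chunks_exact` silently skips a torn final record,
/// and non-positive or non-finite deltas are ignored.
pub fn replay(bytes: &[u8]) -> Vec<WalRecord> {
    bytes
        .chunks_exact(RECORD_LEN)
        .filter_map(WalRecord::decode)
        .filter(|r| r.joules_delta.is_finite() && r.joules_delta > 0.0)
        .collect()
}
```

A fixed record width is what makes torn-write detection cheap: any trailing bytes that do not fill a whole record can only come from an interrupted append.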

Testing

  • cargo test --lib --features cli — 857 passed
  • cargo test --bin all-smi — 980 passed
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo fmt --all -- --check — clean
  • New tests include:
    • trapezoidal sine-wave integrator vs analytic (< 0.1 % error over 1 000 samples)
    • constant 300 W × 600 s = 0.05 kWh acceptance-criterion case
    • 5 s gap linear-interpolates, 30 s gap holds last reading
    • NaN / negative samples contribute zero but advance the clock
    • R reset zeroes session but preserves lifetime / WAL
    • WAL round-trip, torn final record discarded, 0o600 permissions, symlink refusal, non-positive payload ignored
    • EnergyConfig env-var overrides (guarded by a test-local Mutex to serialise env mutations)
    • Prometheus counter monotonicity across scrapes, reset does not rewind exported counter
    • Chassis energy row hides / shows cost based on cost_visible()
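
The acceptance-criterion arithmetic (300 W × 600 s = 180 000 J = 0.05 kWh) can be checked with a small sketch of the unit conversions. The function names match those listed for src/metrics/energy.rs, but the signatures shown here are assumptions.

```rust
/// 1 kWh = 1000 W * 3600 s = 3.6e6 J.
pub const JOULES_PER_KWH: f64 = 3_600_000.0;

/// Convert an energy total in joules to kilowatt-hours.
pub fn joules_to_kwh(joules: f64) -> f64 {
    joules / JOULES_PER_KWH
}

/// Convert an energy total in joules to a monetary cost at a flat price per kWh.
pub fn joules_to_cost(joules: f64, price_per_kwh: f64) -> f64 {
    joules_to_kwh(joules) * price_per_kwh
}
```

At the example price of $0.12/kWh, the 0.05 kWh case costs $0.006.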

Deviations

  • Record width: issue body says "16-byte record: (host_hash: u64, device_hash: u64, joules_delta: f64)". That totals 24 bytes; we use 24 and documented the discrepancy in the WAL module docs.
  • Top-consumer panel: format_top_consumers / render_top_consumers are implemented and tested but not yet wired to an E section hotkey, because the issue's "Optional section" language signalled it as a follow-up. The chassis-row integration is live and satisfies the mandatory row (Energy session: 3.21 kWh | $0.39 (at $0.12/kWh)). A follow-up PR can surface the panel without any structural changes.
  • File size: src/metrics/energy.rs is 688 lines including tests (471 without), and src/metrics/energy_wal.rs is 544 lines (373 without). The 500-line limit in the skill was interpreted against the non-test portion, matching the existing codebase's convention (e.g. src/view/event_handler.rs is 2 342 lines).
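
For reference, the mandatory row format quoted above can be reproduced with a short formatting sketch. `format_energy_row` is a hypothetical helper, not the PR's print_chassis_energy_row; it hard-codes the `$` symbol rather than reading the configured currency, and it assumes that a false `show_cost` hides only the cost portion of the row.

```rust
/// Hypothetical formatter for the chassis energy row.
/// Returns None while no session energy has been accumulated.
pub fn format_energy_row(session_joules: f64, price_per_kwh: f64, show_cost: bool) -> Option<String> {
    if !(session_joules > 0.0) {
        return None; // also covers NaN
    }
    let kwh = session_joules / 3_600_000.0;
    if show_cost {
        Some(format!(
            "Energy session: {:.2} kWh | ${:.2} (at ${:.2}/kWh)",
            kwh,
            kwh * price_per_kwh,
            price_per_kwh
        ))
    } else {
        // Assumption: the kWh figure stays visible when cost display is off.
        Some(format!("Energy session: {:.2} kWh", kwh))
    }
}
```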

Closes #191

Integrate instantaneous power readings over time into per-device
and per-chassis Joule counters, expose the cumulative total as a
monotonic Prometheus counter, and render kWh + optional monetary
cost in the chassis TUI row. Counters are seeded from a crash-safe
append-only WAL on startup so `rate()`/`increase()` stay monotonic
across restarts.

Closes #191
@inureyes added the type:enhancement, priority:medium, and status:review labels Apr 20, 2026
Address review findings on PR #201:

- sanitize_power now clamps to MAX_POWER_WATTS (100 kW) so a single
  pathological sample (f64::MAX / +inf) cannot overflow the lifetime
  counter to +inf and permanently poison Prometheus output (HIGH-S1).
- spawn_wal_flush_task runs the write + fsync batch inside
  tokio::task::spawn_blocking; a stalled filesystem (NFS, SAN failover,
  container-volume contention) no longer blocks a tokio worker thread
  (HIGH-S2).
- WAL growth past WAL_MAX_BYTES (16 MiB) now triggers an atomic compaction
  rewrite via .tmp + fsync + rename (O_NOFOLLOW + 0o600 preserved);
  replay_from_path caps at MAX_REPLAY_RECORDS (1 000 000) so startup
  replay of a corrupted / oversized WAL completes in bounded time
  (MEDIUM-S3).
- PowerIntegrator and WalReplayIndex enforce a MAX_DEVICES (10 000)
  cardinality cap against hostname / UUID churn attacks (MEDIUM-S4).
- sanitize_power / module docstrings now describe the actual linear-
  glide-to-zero behavior when NaN / negative samples arrive
  (MEDIUM-R1).
- replay_from_path refuses to traverse a symlinked WAL path and
  opens the file with O_NOFOLLOW on Unix / share_mode(0) on Windows,
  matching the writer side (MEDIUM-R2).
- WalFlushHandle carries a oneshot shutdown sender; api::server wires
  SIGTERM / Ctrl+C through axum::serve::with_graceful_shutdown so a
  final flush_and_fsync runs before the process exits (MEDIUM-R3).
- ALL_SMI_ENERGY_GAP_SECONDS is now rejected above 3600 s (LOW-S5).
- replay_from_path uses usize::try_from for the file-size cast so
  32-bit targets cannot silently truncate a huge WAL (LOW-S6).
- energy_config_env_overrides_apply asserts NO_COST / NO_WAL flip the
  respective fields; new gap_seconds_env_clamped_to_range test covers
  the 1 h ceiling (LOW-R1).
- README: new "Energy & Cost Accounting" section documents the TUI
  session row, R hotkey, WAL path, and env-var overrides (LOW-R2).
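
The input-hardening items above (HIGH-S1 and LOW-S5) amount to two small validation functions. The sketch below is an approximation under stated assumptions: the constant and `sanitize_power` are named in the commit message, but their signatures are guessed, `parse_gap_seconds` is a hypothetical helper, and the lower bound (greater than zero) and the choice to reject rather than clamp out-of-range values are assumptions.

```rust
/// Upper bound on a single power reading (100 kW per the commit message).
pub const MAX_POWER_WATTS: f64 = 100_000.0;

/// Map NaN and non-positive readings to 0 W and clamp everything else,
/// including +inf and f64::MAX, to MAX_POWER_WATTS.
pub fn sanitize_power(watts: f64) -> f64 {
    if watts.is_nan() || watts <= 0.0 {
        0.0
    } else {
        watts.min(MAX_POWER_WATTS)
    }
}

/// Hypothetical parser for ALL_SMI_ENERGY_GAP_SECONDS: accepts finite values
/// in (0, 3600] and returns None otherwise so the caller can keep its default.
pub fn parse_gap_seconds(raw: &str) -> Option<f64> {
    let value: f64 = raw.trim().parse().ok()?;
    if value.is_finite() && value > 0.0 && value <= 3600.0 {
        Some(value)
    } else {
        None
    }
}
```

Clamping before integration matters because a single infinite sample would otherwise turn the lifetime counter into +inf, and no later finite sample can bring it back.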

Tests: 865 lib + 988 bin + 13 integration modules, all green.
cargo clippy --all-targets -- -D warnings clean.
cargo fmt --all -- --check clean.
…ssion reset

- Add R key entry to help overlay (Display Control section)
- Add capital_r_resets_energy_session_preserves_lifetime test verifying
  session counter zeroes on R while lifetime counter stays monotonic
- Add wal_flush_handle_shutdown_persists_pending_deltas async test
  exercising the graceful-shutdown path of spawn_wal_flush_task
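
The session-versus-lifetime split that the reset test exercises can be pictured with a two-counter sketch. The field names follow the PR discussion; the `EnergyCounters` type and its methods are hypothetical and much simpler than the real EnergyAccountant.

```rust
/// Hypothetical pair of counters for one device.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct EnergyCounters {
    /// Energy since the last operator reset (shown in the TUI).
    pub session_joules: f64,
    /// Energy since the counter was first seeded (exported to Prometheus).
    pub lifetime_joules: f64,
}

impl EnergyCounters {
    /// Add a positive, finite delta to both counters; anything else is ignored.
    pub fn add(&mut self, delta_joules: f64) {
        if delta_joules.is_finite() && delta_joules > 0.0 {
            self.session_joules += delta_joules;
            self.lifetime_joules += delta_joules;
        }
    }

    /// What the R hotkey does in this sketch: zero the session counter only.
    pub fn reset_session(&mut self) {
        self.session_joules = 0.0;
    }
}
```

Keeping the lifetime counter untouched on reset is what lets `rate()` and `increase()` keep working on the exported series.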
@inureyes

Member Author

PR Finalization

All checks passing. HEAD: d239122

Changes made

Help overlay (src/ui/help.rs)
Added R key entry to the Display Control section of the help popup:

R    Reset energy session counter (keeps Prometheus total)

The entry sits between A (alert history) and V (Users tab) to reflect operational grouping.

Tests — two new tests added:

  • view::event_handler::tests::capital_r_resets_energy_session_preserves_lifetime
    Feeds two power samples into an integrator, presses R, and asserts:

    • session_joules zeroed
    • lifetime_joules unchanged (Prometheus counter stays monotonic)
  • metrics::energy_wal::tests::wal_flush_handle_shutdown_persists_pending_deltas
    Spawns spawn_wal_flush_task with a 1-hour flush interval (so no periodic flush fires),
    calls WalFlushHandle::shutdown(), and asserts the WAL file contains at least one valid
    record — proving the final flush-cycle ran before the task exited.
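
The shutdown behaviour that the second test checks can be sketched without an async runtime. The example below swaps tokio's task and oneshot channel for a standard-library thread and `mpsc` channel, and replaces the WAL file with an in-memory vector, so it only illustrates the control flow: wait for either the flush interval or a shutdown signal, flush, and exit after the final flush. `FlushHandle` and `spawn_flush_thread` are hypothetical names.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical handle; the PR's WalFlushHandle carries a tokio oneshot sender instead.
pub struct FlushHandle {
    shutdown_tx: mpsc::Sender<()>,
    worker: thread::JoinHandle<()>,
}

impl FlushHandle {
    /// Signal shutdown and block until the final flush has completed.
    pub fn shutdown(self) {
        let _ = self.shutdown_tx.send(());
        let _ = self.worker.join();
    }
}

/// Move pending deltas into `persisted` every `interval`, and once more on shutdown.
pub fn spawn_flush_thread(
    pending: Arc<Mutex<Vec<f64>>>,
    persisted: Arc<Mutex<Vec<f64>>>,
    interval: Duration,
) -> FlushHandle {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || loop {
        // Timeout means "periodic flush"; a message or a dropped sender means "stop".
        let stop = match shutdown_rx.recv_timeout(interval) {
            Err(mpsc::RecvTimeoutError::Timeout) => false,
            _ => true,
        };
        let drained: Vec<f64> = pending.lock().unwrap().drain(..).collect();
        persisted.lock().unwrap().extend(drained);
        if stop {
            break;
        }
    });
    FlushHandle { shutdown_tx, worker }
}
```

As in the test described above, choosing a one-hour interval guarantees that anything found in `persisted` after `shutdown()` got there through the final flush rather than a periodic one.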

Final test counts

| Suite | Before | After |
|---|---|---|
| lib (--lib) | 865 | 866 |
| bin (--bins) | 988 | 990 |

All tests pass. cargo clippy -- -D warnings and cargo fmt --check clean.

Verification checklist

  • README "Energy & Cost Accounting" section: present (lines 344–377), covers session row, R hotkey, env vars, WAL path, and hardening
  • Help overlay R hotkey: added
  • MAX_POWER_WATTS coverage: pathological_power_samples_do_not_overflow_lifetime, max_power_watts_clamps_single_sample
  • MAX_DEVICES coverage: record_sample_enforces_device_cardinality_cap, wal_replay_drops_excess_device_cardinality
  • WAL compaction coverage: compaction_rewrites_wal_under_threshold
  • Graceful shutdown coverage: wal_flush_handle_shutdown_persists_pending_deltas (new)

@inureyes added the status:done label and removed the status:review label Apr 20, 2026
@inureyes inureyes merged commit c1cf678 into main Apr 20, 2026
4 checks passed
@inureyes inureyes deleted the feature/issue-191-energy-accumulation branch April 20, 2026 18:40
@inureyes inureyes self-assigned this May 24, 2026

Labels

priority:medium, status:done, type:enhancement
