feat: energy accumulation (kWh) and cost estimation with $/kWh config #191

Description

@inureyes

Summary

Integrate instantaneous power readings over time into persistent energy counters (Joules / kWh) and, when a $/kWh price is configured, display running cost in the TUI. Expose the counter as a Prometheus metric (all_smi_energy_consumed_joules_total) that works with rate() and increase(). Add an energy-focused summary section surfacing top consumers.

Motivation

The project already collects real-time chassis/GPU power every few seconds. Integrating those samples into an energy counter unlocks two operator-grade use cases:

  1. Session reporting — "this cluster consumed 27.4 kWh over the last 6 hours, equivalent to $3.29 at $0.12/kWh". Currently there is no in-TUI way to see this; users must integrate externally in PromQL.
  2. Carbon / sustainability reporting — total Joules is the raw input most sustainability dashboards want.

The computation is a simple running integration (sum += P_avg * dt), so the benefit clearly outweighs the implementation cost.

Current state

  • Chassis / GPU / CPU power readings are collected every collection tick (AppConfig::MIN_RENDER_INTERVAL_MS drives rendering cadence; readers poll at the mode-configured interval).
  • AppConfig::HISTORY_MAX_ENTRIES = 100 provides short-term ring history but no integral.
  • No Prometheus counter for energy exists; only gauges.
  • No config path for $/kWh.

Proposed design

Integrator

Trapezoidal integration per device and per chassis, maintained as f64 Joules. For a sample stream (t_0, p_0), (t_1, p_1), ... the increment is ((p_{i-1} + p_i) / 2) * (t_i - t_{i-1}).

  • Per-GPU counter: Joules accumulator keyed by (host, gpu_uuid).
  • Per-chassis counter: keyed by host.
  • Per-CPU counter where a CPU power reading exists (Apple Silicon, some Intel/AMD chipsets).

Missing samples:

  • Gap ≤ 10 s: linearly interpolate the power across the gap.
  • Gap > 10 s: hold the last reading across the gap (explicit rationale: a dropped sample is likelier than an instant doubling).
  • NaN / negative: treat as zero for the integration window.
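A minimal sketch of the integrator and gap rules above, not the final implementation. The `PowerIntegrator` name and `record_sample(device_key, t, watts)` shape come from the implementation plan; `DeviceState`, the `joules` accessor, the use of `f64` seconds for timestamps, and the handling of non-finite or out-of-order timestamps are assumptions made for illustration. "Treat as zero" is interpreted here as substituting 0 W for the bad sample.

```rust
use std::collections::HashMap;

/// Gap threshold in seconds (the proposed `gap_interpolate_seconds` value).
const GAP_INTERPOLATE_SECS: f64 = 10.0;

/// Per-device running integral (hypothetical helper type).
#[derive(Debug, Default, Clone, Copy)]
struct DeviceState {
    joules: f64,
    /// Last accepted sample as (timestamp in seconds, sanitized watts).
    last: Option<(f64, f64)>,
}

#[derive(Debug, Default)]
pub struct PowerIntegrator {
    devices: HashMap<String, DeviceState>,
}

impl PowerIntegrator {
    /// Feed one power sample for `device_key` at time `t` (seconds).
    pub fn record_sample(&mut self, device_key: &str, t: f64, watts: f64) {
        // Assumption: samples with unusable timestamps are dropped entirely.
        if !t.is_finite() {
            return;
        }
        // NaN, infinite, or negative readings are treated as 0 W.
        let p = if watts.is_finite() && watts > 0.0 { watts } else { 0.0 };
        let state = self.devices.entry(device_key.to_string()).or_default();
        if let Some((t_prev, p_prev)) = state.last {
            let dt = t - t_prev;
            // Assumption: duplicate or out-of-order timestamps are ignored.
            if dt <= 0.0 {
                return;
            }
            state.joules += if dt <= GAP_INTERPOLATE_SECS {
                // Short gap: trapezoid, i.e. linear interpolation between samples.
                0.5 * (p_prev + p) * dt
            } else {
                // Long gap: hold the last reading across the whole gap.
                p_prev * dt
            };
        }
        state.last = Some((t, p));
    }

    /// Accumulated Joules, or `None` if the device never reported power.
    pub fn joules(&self, device_key: &str) -> Option<f64> {
        self.devices.get(device_key).map(|s| s.joules)
    }
}
```

Returning `None` for an unknown device mirrors the edge case listed later: a device without power readings emits no energy value rather than zero.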

Persistence

  • In-memory during a live session. Reset on R hotkey (with confirmation toast).
  • Optional disk-backed WAL at ~/.cache/all-smi/energy-wal.bin so Prometheus counters survive restart:
    • Append a 24-byte record (host_hash: u64, device_hash: u64, joules_delta: f64) every minute.
    • On startup, replay.
    • Crash-safe because entries are independent; a torn final record is just discarded.
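One possible encoding of the WAL record, as a sketch only. The three fields listed above (two u64 values and one f64) occupy 24 bytes when packed. Little-endian byte order, the `WalRecord` type, and the `replay` helper are assumptions, not part of the proposal.

```rust
use std::convert::TryInto;

/// Packed size of one record: u64 + u64 + f64.
pub const RECORD_LEN: usize = 24;

/// Hypothetical in-memory form of one WAL entry.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WalRecord {
    pub host_hash: u64,
    pub device_hash: u64,
    pub joules_delta: f64,
}

impl WalRecord {
    /// Serialize to a fixed-size little-endian record (byte order is an assumption).
    pub fn encode(&self) -> [u8; RECORD_LEN] {
        let mut buf = [0u8; RECORD_LEN];
        buf[0..8].copy_from_slice(&self.host_hash.to_le_bytes());
        buf[8..16].copy_from_slice(&self.device_hash.to_le_bytes());
        buf[16..24].copy_from_slice(&self.joules_delta.to_le_bytes());
        buf
    }

    /// Parse exactly one record; returns `None` for a slice of the wrong length.
    pub fn decode(chunk: &[u8]) -> Option<Self> {
        if chunk.len() != RECORD_LEN {
            return None;
        }
        Some(Self {
            host_hash: u64::from_le_bytes(chunk[0..8].try_into().ok()?),
            device_hash: u64::from_le_bytes(chunk[8..16].try_into().ok()?),
            joules_delta: f64::from_le_bytes(chunk[16..24].try_into().ok()?),
        })
    }
}

/// Replay a WAL image. `chunks_exact` skips a trailing partial chunk,
/// which is how a torn final record gets discarded.
pub fn replay(bytes: &[u8]) -> Vec<WalRecord> {
    bytes
        .chunks_exact(RECORD_LEN)
        .filter_map(WalRecord::decode)
        .collect()
}
```

This layout carries no checksum, so it only detects a final record that is short. Whether a per-record checksum is worth the extra bytes is left open here.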

Prometheus metric

New counter metric:

# HELP all_smi_energy_consumed_joules_total Cumulative energy consumption in Joules.
# TYPE all_smi_energy_consumed_joules_total counter
all_smi_energy_consumed_joules_total{host="dgx-01", gpu_index="0", gpu_uuid="..."} 8.43e6
all_smi_energy_consumed_joules_total{host="dgx-01", scope="chassis"} 6.13e7

This is a monotonic counter suitable for rate() and increase() queries.
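To illustrate the exposition shape, here is a hedged sketch that renders the counter as text. `EnergyScope`, `EnergySample`, `escape_label`, and `render_energy_metric` are hypothetical names; the real exporter under src/api/metrics/ may be structured quite differently.

```rust
use std::fmt::Write;

const METRIC: &str = "all_smi_energy_consumed_joules_total";

/// Which counter a sample belongs to (hypothetical type).
pub enum EnergyScope<'a> {
    Gpu { index: u32, uuid: &'a str },
    Chassis,
}

/// One cumulative reading to export (hypothetical type).
pub struct EnergySample<'a> {
    pub host: &'a str,
    pub scope: EnergyScope<'a>,
    pub joules: f64,
}

/// Escape backslash, double quote, and newline inside a label value.
fn escape_label(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Render HELP/TYPE headers followed by one line per sample.
pub fn render_energy_metric(samples: &[EnergySample<'_>]) -> String {
    let mut out = String::new();
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {} Cumulative energy consumption in Joules.", METRIC);
    let _ = writeln!(out, "# TYPE {} counter", METRIC);
    for s in samples {
        let host = escape_label(s.host);
        let _ = match &s.scope {
            EnergyScope::Gpu { index, uuid } => writeln!(
                out,
                "{}{{host=\"{}\",gpu_index=\"{}\",gpu_uuid=\"{}\"}} {}",
                METRIC, host, index, escape_label(uuid), s.joules
            ),
            EnergyScope::Chassis => writeln!(
                out,
                "{}{{host=\"{}\",scope=\"chassis\"}} {}",
                METRIC, host, s.joules
            ),
        };
    }
    out
}
```

Because the exported value is cumulative, callers would pass the WAL-backed total rather than the resettable session counter.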

TUI display

  • Chassis renderer gains a row: Energy session: 3.21 kWh | $0.39 (at $0.12/kWh).
  • New local/remote section E (or an expandable row inside the existing Chassis panel — pick whichever fits the redesigned local TUI layout):
    • Top-3 consumers by device.
    • Cumulative chassis energy, elapsed time, average power.
    • Per-tab "session-kWh" chip next to the existing utilization chip.
  • R hotkey resets in-memory session counters (WAL is not rewound — it continues to accumulate across sessions; Prometheus users rely on monotonicity).
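The Joules → kWh and cost helpers behind the chassis row could look roughly like this. The function names are hypothetical, and the hard-coded "$" is a simplification that ignores the configured currency.

```rust
/// 1 kWh = 3.6e6 J.
pub const JOULES_PER_KWH: f64 = 3.6e6;

pub fn joules_to_kwh(joules: f64) -> f64 {
    joules / JOULES_PER_KWH
}

/// Running cost, or `None` when cost display is off or no positive price is set.
pub fn session_cost(joules: f64, price_per_kwh: f64, show_cost: bool) -> Option<f64> {
    if show_cost && price_per_kwh.is_finite() && price_per_kwh > 0.0 {
        Some(joules_to_kwh(joules) * price_per_kwh)
    } else {
        None
    }
}

/// Format the chassis row; the cost part is omitted when `session_cost` is `None`.
/// Assumption: "$" is used regardless of the configured currency.
pub fn format_energy_row(joules: f64, price_per_kwh: f64, show_cost: bool) -> String {
    let kwh = joules_to_kwh(joules);
    match session_cost(joules, price_per_kwh, show_cost) {
        Some(cost) => format!(
            "Energy session: {:.2} kWh | ${:.2} (at ${:.2}/kWh)",
            kwh, cost, price_per_kwh
        ),
        None => format!("Energy session: {:.2} kWh", kwh),
    }
}
```

For 11,556,000 J (3.21 kWh) at $0.12/kWh this yields the row shown above: 3.21 × 0.12 = 0.3852, displayed as $0.39.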

Config

Under the [energy] section of the config file (see companion config issue):

[energy]
price_per_kwh = 0.12
currency = "USD"
show_cost = true
wal_path = "~/.cache/all-smi/energy-wal.bin"
gap_interpolate_seconds = 10

Env var overrides: ALL_SMI_ENERGY_PRICE, ALL_SMI_ENERGY_CURRENCY, ALL_SMI_ENERGY_NO_COST (unsets show_cost).
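A sketch of how the env-var overrides might layer on top of the file values. `EnergyConfig` and `apply_env_overrides` are hypothetical; the defaults simply reuse the example values above for illustration, and treating any set value of ALL_SMI_ENERGY_NO_COST as "hide cost" is an assumption. The lookup is passed in as a closure so the logic can be tested without touching the process environment.

```rust
/// Hypothetical in-memory form of the [energy] section.
#[derive(Debug, Clone, PartialEq)]
pub struct EnergyConfig {
    pub price_per_kwh: f64,
    pub currency: String,
    pub show_cost: bool,
    pub wal_path: String,
    pub gap_interpolate_seconds: u64,
}

impl Default for EnergyConfig {
    /// Illustration only: these are the example values, not decided defaults.
    fn default() -> Self {
        Self {
            price_per_kwh: 0.12,
            currency: "USD".to_string(),
            show_cost: true,
            wal_path: "~/.cache/all-smi/energy-wal.bin".to_string(),
            gap_interpolate_seconds: 10,
        }
    }
}

/// Apply env-var overrides; `lookup` returns the value of a variable if set.
pub fn apply_env_overrides<F>(mut cfg: EnergyConfig, lookup: F) -> EnergyConfig
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(raw) = lookup("ALL_SMI_ENERGY_PRICE") {
        // Unparseable, negative, or non-finite prices are ignored (assumption).
        if let Ok(price) = raw.trim().parse::<f64>() {
            if price.is_finite() && price >= 0.0 {
                cfg.price_per_kwh = price;
            }
        }
    }
    if let Some(cur) = lookup("ALL_SMI_ENERGY_CURRENCY") {
        let cur = cur.trim();
        if !cur.is_empty() {
            cfg.currency = cur.to_string();
        }
    }
    // Assumption: the variable being set at all disables the cost display.
    if lookup("ALL_SMI_ENERGY_NO_COST").is_some() {
        cfg.show_cost = false;
    }
    cfg
}
```

In the application this would be called as `apply_env_overrides(cfg, |k| std::env::var(k).ok())`.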

Implementation plan

Files to add / modify:

  • New src/metrics/energy.rs:
    • PowerIntegrator with record_sample(device_key, t, watts).
    • EnergyAccountant with per-device and per-chassis views.
    • Trapezoidal integration, gap handling, reset, Joules → kWh and cost helpers.
  • src/metrics/aggregator.rs — wire incoming power samples into the integrator each collection cycle.
  • New src/metrics/energy_wal.rs — append-only WAL with fsync cadence (60s).
  • src/api/metrics/ — export the new all_smi_energy_consumed_joules_total counter.
  • src/ui/renderers/chassis_renderer.rs — render the new row.
  • New src/ui/renderers/energy_renderer.rs — top-consumer summary section.
  • src/view/event_handler.rs: R hotkey.
  • src/app_state.rs — house the EnergyAccountant and reset timestamp.
  • src/common/config.rs — read [energy]; reasonable defaults.
  • src/mock/generator.rs — mock emits power consistently so mock mode can exercise energy counters meaningfully.

Acceptance criteria

  • After holding 300 W draw for 10 minutes, accumulator reads ~0.05 kWh ( = 300 W × 600 s / 3.6e6 ).
  • Cost row shows the configured price; hides when show_cost = false or price is 0.
  • Prometheus counter is monotonic across scrapes.
  • R resets the session; Prometheus counter is not affected (continues monotonic).
  • Restarting api mode with WAL replays the counter (monotonicity preserved).
  • Remote aggregation: in view mode, the cluster-level energy panel sums across hosts.
  • Gaps: a 5-second sample gap linearly interpolates; a 30-second gap holds last value (documented).
  • Unit tests cover: integrator correctness on synthetic sine-wave input (trapezoidal vs analytic, < 0.1 % error over 1000 samples), gap handling, NaN handling, reset semantics, WAL round-trip.
  • README gains an "Energy & Cost" section.
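The sine-wave criterion could be checked along these lines. This is a standalone sketch: `trapezoid_joules`, `sine_samples`, and `sine_energy_analytic` are hypothetical helpers, and the signal parameters (300 W base, 100 W amplitude, 600 s period, 300 s window) are arbitrary choices. The analytic reference is the closed form of ∫(B + A sin(ωt)) dt from 0 to D, namely B·D + (A/ω)(1 − cos(ωD)).

```rust
use std::f64::consts::PI;

/// Trapezoidal integral over (t seconds, watts) samples, in Joules.
pub fn trapezoid_joules(samples: &[(f64, f64)]) -> f64 {
    samples
        .windows(2)
        .map(|w| 0.5 * (w[0].1 + w[1].1) * (w[1].0 - w[0].0))
        .sum()
}

/// `n` evenly spaced samples of P(t) = base + amp * sin(2*pi*t / period) on [0, duration].
pub fn sine_samples(n: usize, duration_s: f64, base_w: f64, amp_w: f64, period_s: f64) -> Vec<(f64, f64)> {
    assert!(n >= 2, "need at least two samples to form an interval");
    (0..n)
        .map(|i| {
            let t = duration_s * i as f64 / (n - 1) as f64;
            (t, base_w + amp_w * (2.0 * PI * t / period_s).sin())
        })
        .collect()
}

/// Closed-form energy of the same signal over [0, duration].
pub fn sine_energy_analytic(duration_s: f64, base_w: f64, amp_w: f64, period_s: f64) -> f64 {
    let omega = 2.0 * PI / period_s;
    base_w * duration_s + amp_w / omega * (1.0 - (omega * duration_s).cos())
}

/// Relative error of the trapezoidal estimate against the closed form.
pub fn sine_relative_error(n: usize) -> f64 {
    let (duration, base, amp, period) = (300.0, 300.0, 100.0, 600.0);
    let numeric = trapezoid_joules(&sine_samples(n, duration, base, amp, period));
    let exact = sine_energy_analytic(duration, base, amp, period);
    ((numeric - exact) / exact).abs()
}
```

A unit test would then assert `sine_relative_error(1000) < 1e-3`. The constant-load criterion falls out of the same helper: `trapezoid_joules(&[(0.0, 300.0), (600.0, 300.0)])` is 180,000 J, i.e. 0.05 kWh.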

Edge cases & non-goals

  • Power readings available on some platforms only (Apple Silicon — ANE + package; NVIDIA — per-GPU; AMD — per-GPU on most SKUs). If no power is available for a device, energy for that device is not emitted (not zero).
  • Cost is explicitly an approximation — document that idle/base load power and PSU efficiency are ignored.
  • Currency is display-only; no FX conversion.
  • Non-goal: historical data store. That's Prometheus' job; we only keep a WAL big enough for counter continuity.
  • Non-goal: carbon intensity mapping (gCO2/kWh). Out of scope for v1; the Joule counter is enough for external tooling.

Soft dependency

  • Config file issue — without it, ship with env-var only (ALL_SMI_ENERGY_PRICE=0.12) and add config-file support when that lands.
