Skip to content

feat: add ALL_SMI_MOCK_HARDWARE_DETAILS env-var gate for extended NVIDIA mock metrics#182

Merged
inureyes merged 2 commits into
mainfrom
feature/issue-178-mock-hardware-details-gate
Apr 17, 2026
Merged

feat: add ALL_SMI_MOCK_HARDWARE_DETAILS env-var gate for extended NVIDIA mock metrics#182
inureyes merged 2 commits into
mainfrom
feature/issue-178-mock-hardware-details-gate

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Closes #178.

Add ALL_SMI_MOCK_HARDWARE_DETAILS environment variable to gate extended NVIDIA hardware detail metrics in the mock server, consistent with the existing ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG patterns.

  • When ALL_SMI_MOCK_HARDWARE_DETAILS is set to any non-empty value, the mock emits the full set of extended metrics: NUMA node ID, GSP firmware mode and version, NvLink remote endpoint classification, GPM SM occupancy and memory bandwidth utilization, thermal thresholds (slowdown, shutdown, max-operating, acoustic), legacy all_smi_gpu_pstate, and canonical all_smi_gpu_performance_state.
  • When unset (the default), only basic GPU metrics are emitted, simulating an older NVIDIA driver that does not expose these extended NVML APIs.

Thermal thresholds and P-state gating decision

Thermal thresholds and P-state are gated under the same ALL_SMI_MOCK_HARDWARE_DETAILS flag rather than a separate one. Rationale: these metrics all originate from the same family of extended NVML calls. In practice a host either has access to all of them (modern driver) or none of them (older driver without GSP/NvLink/thermal-threshold support). A separate per-feature flag would add API surface and flag-management complexity without a meaningful use case that requires mixed presence.

Changes

  • src/mock/templates/nvidia.rs: add HARDWARE_DETAILS_ENV_VAR constant and is_hardware_details_enabled() function; gate add_pstate_metrics, add_thermal_threshold_metrics, and add_hardware_detail_metrics behind the env var check in build_nvidia_template; add with_hardware_details_enabled() test helper; wrap existing hardware-detail tests inside the helper; add mock_template_omits_hardware_details_by_default and is_hardware_details_enabled_reflects_env_var tests.
  • README.md: document ALL_SMI_MOCK_HARDWARE_DETAILS alongside ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG in the mock server section.

Test plan

  • All existing tests pass with ALL_SMI_MOCK_HARDWARE_DETAILS unset (default CI environment).
  • mock_template_omits_hardware_details_by_default verifies that none of the 12 extended metric families appear in the default output.
  • is_hardware_details_enabled_reflects_env_var verifies the truthy/falsy/absent logic.
  • All prior hardware-detail tests continue to pass when run with the env var set via with_hardware_details_enabled().

…DIA mock metrics

Gate thermal thresholds, P-state, NUMA node ID, GSP firmware mode/version,
NvLink remote endpoint types, and GPM SM occupancy/memory bandwidth utilization
behind ALL_SMI_MOCK_HARDWARE_DETAILS. When unset (the default) the mock emits
only basic GPU metrics, simulating an older NVIDIA driver that does not expose
the extended NVML APIs.

Thermal thresholds and P-state are gated under the same flag as NUMA/GSP/
NvLink/GPM because they originate from the same family of extended NVML calls;
a host either has access to all of them or none. Splitting into per-feature
flags would add complexity without a clear use case.

Pattern mirrors ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG: any non-empty value
enables the feature; absent or empty disables it.

Updated tests to call with_hardware_details_enabled() for tests that verify
extended metric presence. Added new test (mock_template_omits_hardware_details_by_default)
and is_hardware_details_enabled_reflects_env_var to cover the default-off path.

Closes #178.
@inureyes inureyes added type:enhancement New feature or request priority:low Low priority issue device:nvidia-gpu NVIDIA GPU related status:review Under review labels Apr 17, 2026
cargo test runs tests in parallel threads within a single process, and
std::env::set_var/remove_var mutate shared process state. With 9 tests
in src/mock/templates/nvidia.rs concurrently mutating
ALL_SMI_MOCK_HARDWARE_DETAILS, sibling tests raced and failed
intermittently (observed: mock_template_includes_threshold_metrics,
mock_template_numa_alternates_between_zero_and_one, and
mock_template_omits_hardware_details_by_default each failed on
different runs).

Introduce a module-wide Mutex and an RAII EnvVarGuard that acquires the
lock, snapshots the prior env value, and restores it on drop. Drop
runs on unwind, so the guard is also panic-safe: a failing assertion
no longer leaks the env var into later tests. Poisoned locks are
recovered from so a previously panicking test does not block
subsequent tests.

Also mirror the README mock-server documentation into API.md, where
ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG are already documented, so the
new flag appears in the same reference surfaces as the existing ones.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Gate the extended NVIDIA hardware detail mock metrics (NUMA, GSP firmware, NvLink, GPM, thermal thresholds, P-state) behind ALL_SMI_MOCK_HARDWARE_DETAILS, matching the ALL_SMI_MOCK_VGPU / ALL_SMI_MOCK_MIG pattern.

Findings Addressed (fixed in 2bc6f81)

  • CRITICAL: Test race condition on ALL_SMI_MOCK_HARDWARE_DETAILS. 9 tests in src/mock/templates/nvidia.rs concurrently mutated the process-global env var. cargo test default parallelism caused different tests to fail on different runs (reproduced: mock_template_includes_threshold_metrics, mock_template_numa_alternates_between_zero_and_one, mock_template_omits_hardware_details_by_default). Fixed by introducing a module-wide std::sync::Mutex and an RAII EnvVarGuard that acquires the lock, snapshots the prior env value, and restores it on drop — also recovering from poisoned locks.
  • HIGH: Helper not panic-safe. Previously, a panicking assertion inside with_hardware_details_enabled would skip the env restore block and leak the set value into sibling tests. The new EnvVarGuard restores via Drop, which runs on unwind.
  • HIGH: SAFETY comment misrepresented thread safety. Original comment claimed "called from single-threaded test functions that do not spawn additional threads", but sibling tests themselves run concurrently. SAFETY comments now correctly document that _lock serialises all mutations in the module.
  • LOW: API.md parity missing. ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG are documented in both README.md and API.md (Notes and numbered operations-notes list). The new flag is now mirrored in the same two locations in API.md.

Verification

  • All stated requirements implemented (env var gate, default-off behaviour, README doc)
  • No placeholder/mock code remaining
  • Integrated into project code flow — gate is applied inside build_nvidia_template, reachable from generate() and generate_template()
  • Project conventions followed — HARDWARE_DETAILS_ENV_VAR / is_hardware_details_enabled() signatures mirror VGPU_ENV_VAR / is_vgpu_enabled() and MIG_ENV_VAR / is_mig_enabled()
  • Existing modules reused where applicable — maybe_add_vgpu_template and maybe_add_mig_template unchanged; the new gate uses the same truthy/empty/absent semantics as the sibling helpers
  • No unintended structural changes — gated families (NUMA, GSP, NvLink, GPM, thermal thresholds, P-state) correctly wrap add_pstate_metrics, add_thermal_threshold_metrics, and add_hardware_detail_metrics; basic / process / driver / system / chassis / vGPU / MIG metrics remain ungated as before
  • Tests pass — all 1159 mock-feature tests pass; the previously flaky 9 nvidia tests pass 5/5 consecutive runs under default parallelism
  • cargo clippy --features mock --all-targets -- -D warnings clean
  • cargo fmt --check clean

Notes for reviewers

  • The decision to gate thermal thresholds, P-state, and extended hardware details together under one flag (rather than separate flags per family) is correct and matches the PR description's rationale — the corresponding tests in src/mock/templates/nvidia.rs assert all 12 gated metric families are absent when the env var is unset.
  • No Korean README exists in the project, so bilingual docs parity was not applicable.
  • Pre-existing NVML extended metric families in the real reader (api/metrics/*.rs) are unchanged — the gate is purely a mock-side simulation of an older driver.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 17, 2026
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Summary

All checks passing. Ready for merge.

@inureyes inureyes merged commit 4c989da into main Apr 17, 2026
4 checks passed
@inureyes inureyes deleted the feature/issue-178-mock-hardware-details-gate branch April 17, 2026 07:02
@inureyes inureyes self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

device:nvidia-gpu NVIDIA GPU related priority:low Low priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

enhancement: Add env-var gate for hardware detail mock data (ALL_SMI_MOCK_HARDWARE_DETAILS)

1 participant