feat: add ALL_SMI_MOCK_HARDWARE_DETAILS env-var gate for extended NVIDIA mock metrics#182
Merged
Merged
Conversation
…DIA mock metrics Gate thermal thresholds, P-state, NUMA node ID, GSP firmware mode/version, NvLink remote endpoint types, and GPM SM occupancy/memory bandwidth utilization behind ALL_SMI_MOCK_HARDWARE_DETAILS. When unset (the default) the mock emits only basic GPU metrics, simulating an older NVIDIA driver that does not expose the extended NVML APIs. Thermal thresholds and P-state are gated under the same flag as NUMA/GSP/ NvLink/GPM because they originate from the same family of extended NVML calls; a host either has access to all of them or none. Splitting into per-feature flags would add complexity without a clear use case. Pattern mirrors ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG: any non-empty value enables the feature; absent or empty disables it. Updated tests to call with_hardware_details_enabled() for tests that verify extended metric presence. Added new test (mock_template_omits_hardware_details_by_default) and is_hardware_details_enabled_reflects_env_var to cover the default-off path. Closes #178.
cargo test runs tests in parallel threads within a single process, and std::env::set_var/remove_var mutate shared process state. With 9 tests in src/mock/templates/nvidia.rs concurrently mutating ALL_SMI_MOCK_HARDWARE_DETAILS, sibling tests raced and failed intermittently (observed: mock_template_includes_threshold_metrics, mock_template_numa_alternates_between_zero_and_one, and mock_template_omits_hardware_details_by_default each failed on different runs). Introduce a module-wide Mutex and an RAII EnvVarGuard that acquires the lock, snapshots the prior env value, and restores it on drop. Drop runs on unwind, so the guard is also panic-safe: a failing assertion no longer leaks the env var into later tests. Poisoned locks are recovered from so a previously panicking test does not block subsequent tests. Also mirror the README mock-server documentation into API.md, where ALL_SMI_MOCK_VGPU and ALL_SMI_MOCK_MIG are already documented, so the new flag appears in the same reference surfaces as the existing ones.
Member
Author
Implementation Review SummaryIntent
Findings Addressed (fixed in 2bc6f81)
Verification
Notes for reviewers
|
Member
Author
PR Finalization CompleteSummary
All checks passing. Ready for merge. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #178.
Add
ALL_SMI_MOCK_HARDWARE_DETAILSenvironment variable to gate extended NVIDIA hardware detail metrics in the mock server, consistent with the existingALL_SMI_MOCK_VGPUandALL_SMI_MOCK_MIGpatterns.ALL_SMI_MOCK_HARDWARE_DETAILSis set to any non-empty value, the mock emits the full set of extended metrics: NUMA node ID, GSP firmware mode and version, NvLink remote endpoint classification, GPM SM occupancy and memory bandwidth utilization, thermal thresholds (slowdown, shutdown, max-operating, acoustic), legacyall_smi_gpu_pstate, and canonicalall_smi_gpu_performance_state.Thermal thresholds and P-state gating decision
Thermal thresholds and P-state are gated under the same
ALL_SMI_MOCK_HARDWARE_DETAILSflag rather than a separate one. Rationale: these metrics all originate from the same family of extended NVML calls. In practice a host either has access to all of them (modern driver) or none of them (older driver without GSP/NvLink/thermal-threshold support). A separate per-feature flag would add API surface and flag-management complexity without a meaningful use case that requires mixed presence.Changes
src/mock/templates/nvidia.rs: addHARDWARE_DETAILS_ENV_VARconstant andis_hardware_details_enabled()function; gateadd_pstate_metrics,add_thermal_threshold_metrics, andadd_hardware_detail_metricsbehind the env var check inbuild_nvidia_template; addwith_hardware_details_enabled()test helper; wrap existing hardware-detail tests inside the helper; addmock_template_omits_hardware_details_by_defaultandis_hardware_details_enabled_reflects_env_vartests.README.md: documentALL_SMI_MOCK_HARDWARE_DETAILSalongsideALL_SMI_MOCK_VGPUandALL_SMI_MOCK_MIGin the mock server section.Test plan
ALL_SMI_MOCK_HARDWARE_DETAILSunset (default CI environment).mock_template_omits_hardware_details_by_defaultverifies that none of the 12 extended metric families appear in the default output.is_hardware_details_enabled_reflects_env_varverifies the truthy/falsy/absent logic.with_hardware_details_enabled().