Convert GPU Driver installation to Tool, Add amd-smi #4080

Merged
LiliDeng merged 27 commits into main from
aditya/cleanup_gpu_installation
Nov 11, 2025

Conversation

@adityagesh
Collaborator

  1. Convert GPU Driver installation to a Tool. GPU Feature in Azure is overloaded and has multiple responsibilities.
  2. Add support for AMD GPU Driver installation
  3. Add basic amd-smi tool
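
To illustrate the split described in item 1, here is a minimal, self-contained sketch of separating a capability-only feature from a driver-installation tool. It does not use the LISA API: `GpuFeature`, `GpuDriverInstaller`, the `run` callback, and the command strings are all hypothetical names chosen for this illustration, not the classes added by this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative placeholder commands only; these are NOT real install steps.
INSTALL_COMMANDS: Dict[str, str] = {
    "nvidia": "install-nvidia-driver",
    "amd": "install-amd-driver",
}


@dataclass
class GpuFeature:
    """Hypothetical capability-only feature: answers "is a GPU supported?"."""

    vendor: str

    def is_supported(self) -> bool:
        return self.vendor in INSTALL_COMMANDS


@dataclass
class GpuDriverInstaller:
    """Hypothetical tool: owns installation and nothing else."""

    run: Callable[[str], int]  # command runner that returns an exit code
    installed: List[str] = field(default_factory=list)

    def install(self, vendor: str) -> None:
        command = INSTALL_COMMANDS.get(vendor)
        if command is None:
            raise ValueError(f"no driver installer for vendor: {vendor}")
        if self.run(command) != 0:
            raise RuntimeError(f"driver installation failed for {vendor}")
        self.installed.append(vendor)
```

With this shape, a test can check capability through the feature and request installation through the tool, so neither class needs to know about the other.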

@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch from 7ef7268 to 09513d7 Compare October 28, 2025 13:01
GPU feature was responsible for both capability
and driver installation. This change is refactoring
without any functional changes.

The driver installation responsibility is now
converted to a tool
_is_nvidia is initialized, but never used
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch 7 times, most recently from 0081041 to 4636902 Compare October 29, 2025 08:35
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch from 4636902 to 9b0fa7a Compare October 29, 2025 10:05
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch 3 times, most recently from 760b044 to ccc7398 Compare November 1, 2025 12:28
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch from ccc7398 to 86355c1 Compare November 1, 2025 12:31
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch 3 times, most recently from 203c42c to 73e3f98 Compare November 5, 2025 09:18
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch 2 times, most recently from 20d47d0 to 7a950e6 Compare November 5, 2025 15:37
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch from 7a950e6 to 315488f Compare November 5, 2025 17:54
@adityagesh adityagesh force-pushed the aditya/cleanup_gpu_installation branch from 175d62f to 347778f Compare November 6, 2025 06:23
@adityagesh
Collaborator Author

@squirrelsc @LiliDeng any more comments?

@squirrelsc
Member

@LiliDeng LGTM

@LiliDeng LiliDeng merged commit a9d3470 into main Nov 11, 2025
58 checks passed
@LiliDeng LiliDeng deleted the aditya/cleanup_gpu_installation branch November 11, 2025 04:17
LiliDeng added a commit that referenced this pull request Nov 18, 2025
…#4058)

* Initial plan

* Add unified message support for iperf3 TCP and UDP performance metrics

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* Refactor unified message methods to use parsed fields and add connections_num as parameter

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* Merge main and use Parameter relativity for connections_num and buffer_size metrics

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* Remove conn_suffix from metric names

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* Revert "Move examples and microsoft directories into the Python package (#4023)" (#4063)

This reverts commit 89e7b53.

* Reapply "Move examples and microsoft directories into the Python package (#4023)" (#4063)

This reverts commit efe1cd3.

* runbook: fix path for legacy layout

* Add UnifiedMessage support for NetworkLatencyPerformanceMessage

* kdump: Replace CvmDisabled with before_case SecurityProfile check (#4032)

* kdump: Replace CvmDisabled with before_case SecurityProfile check

* kdump: Fix SecurityProfile check to skip only CVM and Stateless VMs

- Remove empty simple_requirement() calls (unnecessary)

- Optimize f-string usage (only use f-prefix where needed)

- Remove unused simple_requirement import

* Add detailed panic categorization and error code extraction

* enrich SerialConsole.check_panic() to return detailed panic

* Added tests for network related components (#4009)

* notifier: remove pytest-html dependency

Replace pytest-html dependency with custom HTML
report generator using string.Template. This
change provides better control over report
formatting and reduces external dependencies.
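
A minimal sketch of the approach this commit message describes, rendering an HTML report with `string.Template` from the standard library. The template text, the `render_report` function, and the (name, status) result shape are assumptions made for illustration; the actual notifier's templates and fields are not shown in this log.

```python
from html import escape
from string import Template
from typing import Iterable, Tuple

# Hypothetical templates; the real report layout may differ.
ROW = Template("<tr><td>$name</td><td>$status</td></tr>")
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><table>$rows</table></body></html>"
)


def render_report(title: str, results: Iterable[Tuple[str, str]]) -> str:
    """Render (test name, status) pairs into a single HTML page."""
    # Escape every value so test names cannot inject markup.
    rows = "".join(
        ROW.substitute(name=escape(name), status=escape(status))
        for name, status in results
    )
    return PAGE.substitute(title=escape(title), rows=rows)
```

`Template.substitute` raises `KeyError` when a placeholder is missing, which surfaces template mistakes early instead of emitting a half-filled report.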

* runbook: fix microsoft package name for new paths.

The new path can still be written as
"microsoft/testsuites", so the package name must be
"microsoft" rather than "testsuites".

* Remove watchdog pattern from serial console panic detection (#4075)

* fix verify_cpu_count and improve PowerShell

- Implement calculate_vcpu_count() method in
  WindowsLscpu class to fix verify_cpu_count test
  failure on Windows
- Add null check for stderr in
  PowerShell.wait_result() to prevent errors when
  PowerShell is used to run cmd commands with no
  stderr output

* iDRAC: Handle HTTP 500 internal errors with service reset

* Fix Hyper-V Stop-VM to use TurnOff on timeout/failure

* Remove overly broad stall regex pattern causing false positive panic detections (#4082)

* Initial plan

* Remove overly broad stall regex pattern to prevent false alarms

Co-authored-by: lesscodingmorehappiness <81588170+lesscodingmorehappiness@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lesscodingmorehappiness <81588170+lesscodingmorehappiness@users.noreply.github.com>

* Revert "skip test if hv_netvsc driver is not used"

This reverts commit f6fdcf7.

* change kselftest required /tmp/ size to 1GB for Overlake SoC limited space

* Add enabled switch for environments and nodes

This change introduces an `enabled` boolean field
at both the environment and node levels, allowing
selective loading of configurations through
runbook variables.

Example:
  environment:
    - name: my_env
      enabled: $(use_first_env)  # Variable-controlled
      nodes:
        - type: local
          name: node1
          enabled: true
        - type: local
          name: node2
          enabled: false  # Skip this node
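
The selection that the example above implies can be sketched in a few lines. This is an illustration on plain dictionaries, not the runbook loader itself: `filter_enabled` is a hypothetical name, variables such as `$(use_first_env)` are assumed to be already resolved to booleans, and a missing `enabled` key is assumed to mean enabled.

```python
from typing import Any, Dict, List


def filter_enabled(environments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop disabled environments, then drop disabled nodes in the rest."""
    result = []
    for env in environments:
        # Assumption: an omitted "enabled" key counts as enabled.
        if not env.get("enabled", True):
            continue
        nodes = [n for n in env.get("nodes", []) if n.get("enabled", True)]
        # Copy the environment so the caller's input is left unchanged.
        result.append({**env, "nodes": nodes})
    return result
```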

* Process: Raise exception on timeout. (#4077)

* Skip tests on L1VH Nodes (#4078)

* mshv: skip checking logfile size on l1vh

L1VH parents by default don't have any entries in mshvlog file. Skip
checking logfile size on these nodes.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>

* mshv: skip mshvtrace test on l1vh Nodes

L1VH nodes cannot collect performance traces. Skip the related test
on the L1VH nodes.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>

---------

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>

* Set minimum TLS setting 1.2 for storage accounts

Support for TLS 1.0 and 1.1 will be discontinued for all Azure Storage
accounts. The guidance is to migrate to minimum TLS version 1.2.

https://learn.microsoft.com/en-us/azure/storage/common/transport-layer-security-configure-migrate-to-tls2#why-use-tls-12

* Fix IPTable Test (#4088)

* Add virtualization feature

* doc: fix doc path after test code moved.

* doc: fix some build warnings.

* doc: allow duplicate test case names in different test suites.

* Fix VHD schema documentation to show nested hyperv_generation field (#4100)

* changes to install xxhash tool before building kernel

* Modprobe command update for when verbose is false

* Document resource_group_tags parameter for Azure runbook (#4101)

* Add Host version tracking for baremetal and HyperV platforms

* Convert GPU Driver installation to Tool, Add amd-smi (#4080)

* ch perf: Implement comprehensive performance stabilization framework

* Classify /bin/true redirections in kernel modules as not loaded

Previously, `is_module_loaded` returned True (loaded) when `modprobe -nv`
produced a blacklist directive like 'install /bin/true', causing test
cases like verify_floppy_module_is_blacklisted to fail although the module
was not actually loaded.

Added a minimal check for the install /bin/true pattern and now treat it
as not loaded, returning False.
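
A minimal sketch of the kind of check described here, assuming the dry-run output is available as a string. The function name `is_redirected_to_true` and the exact regular expression are illustrative assumptions, not the code from the commit.

```python
import re

# Matches "install /bin/true" and the modprobe.d form
# "install <module> /bin/true", each on a line of its own.
_INSTALL_TRUE = re.compile(
    r"^\s*install\s+(?:\S+\s+)?/bin/true\s*$", re.MULTILINE
)


def is_redirected_to_true(modprobe_dry_run_output: str) -> bool:
    """Return True when the output redirects the module to /bin/true."""
    return _INSTALL_TRUE.search(modprobe_dry_run_output) is not None
```

A caller in the spirit of this commit would report the module as not loaded whenever this returns True, and fall back to its normal detection otherwise.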

* Kdump: Enhance error log for incomplete dump file

* Update Nested Feature Supported list in Azure

* Create dm-cache test (#4093)

* Fix nvme device path fetch logic

* DPDK: add netvsc rescind tests (#4076)

* Remove squirrelsc from CODEOWNERS file

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* UnifiedPerfMessage: add metric_str_value to store string value (#4107)

* UnifiedPerfMessage: add str_value to store string value

* Rename str_value to metric_str_value in UnifiedPerfMessage (#4108)

* Initial plan

* Rename str_value to metric_str_value for consistency

Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>

* Pass through MIGRATABLE_VERSION from pipeline environment

* Add UnifiedMessage support for NetworkPPSPerformanceMessage (#4057)

* Initial plan

* Rebase on latest main branch

* Initial plan

* Initial plan

* Rebase on latest main branch

* Sync latest code from main branch

* Clean commit history - single commit for PR changes

* Add connections_num and buffer_size to metric names as suffix

- Remove separate connections_num and buffer_size_bytes metrics
- Add suffix format: _conn_{connections_num}_buffer_{buffer_size}
- Apply suffix to all TCP metrics: rx/tx_throughput_in_gbps, congestion_windowsize_kb, retransmitted_segments
- Apply suffix to all UDP metrics: rx/tx_throughput_in_gbps, data_loss
- This allows distinguishing results by connection count and buffer size
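
The suffix format listed above can be sketched as a small helper. `metric_name` is a hypothetical function for illustration; only the `_conn_{connections_num}_buffer_{buffer_size}` pattern and the base metric names come from this commit message.

```python
def metric_name(base: str, connections_num: int, buffer_size: int) -> str:
    """Append the connection-count and buffer-size suffix to a metric name."""
    return f"{base}_conn_{connections_num}_buffer_{buffer_size}"


# Example with one of the TCP metrics named above; the values are made up.
example = metric_name("tx_throughput_in_gbps", 4, 1024)
```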

Co-authored-by: LiliDeng <10083705+LiliDeng@users.noreply.github.com>

* Fix flake8 errors: remove trailing whitespace from blank lines

- Remove trailing whitespace from line 492 in send_iperf3_tcp_unified_perf_messages
- Remove trailing whitespace from line 534 in send_iperf3_udp_unified_perf_messages
- Fixes W293 flake8 warnings and BLK100 black formatting issue

Co-authored-by: LiliDeng <10083705+LiliDeng@users.noreply.github.com>

---------

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: squirrelsc <27178119+squirrelsc@users.noreply.github.com>
Co-authored-by: LiliDeng <lildeng@microsoft.com>
Co-authored-by: Chi Song (from Dev Box) <chisong@microsoft.com>
Co-authored-by: Vivek Yadav <vyadav@microsoft.com>
Co-authored-by: Balashivaram Ganesan <71939272+Balashivaram@users.noreply.github.com>
Co-authored-by: lesscodingmorehappiness <81588170+lesscodingmorehappiness@users.noreply.github.com>
Co-authored-by: Panfeng Xue <paxue@microsoft.com>
Co-authored-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-authored-by: Sebastian Heid <8442432+s4heid@users.noreply.github.com>
Co-authored-by: Umang Francis <umfranci@microsoft.com>
Co-authored-by: rabdulfaizy <rabdulfaizy@microsoft.com>
Co-authored-by: Aditya Nagesh <adityanagesh@microsoft.com>
Co-authored-by: Rachel Menge <rachelmenge@microsoft.com>
Co-authored-by: Kanchan Sen Laskar <kasenlaskar@microsoft.com>
Co-authored-by: mcgov <mamcgove@microsoft.com>
Co-authored-by: LiliDeng <10083705+LiliDeng@users.noreply.github.com>
3 participants