[ci]: add gpu monitoring #2718
Conversation
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the CI system by introducing robust GPU monitoring. The primary goal is to prevent CI job failures caused by stale GPU processes: the monitor proactively identifies and reports them, and a pre-flight health check verifies that GPU resources are available before a job fully commences. This makes CI execution more reliable and efficient, reducing compute time wasted on jobs destined to fail due to resource contention.
Code Review
This pull request introduces a robust GPU monitoring system for the CI environment, aiming to prevent job failures due to stale processes hogging GPU resources. The implementation includes a host-level cron job and a pre-flight check in CI jobs. However, a critical security vulnerability has been identified in .buildkite/k3_harness/gpu-monitor.sh due to the use of a predictable temporary file in /tmp, which could lead to a symbolic link attack and sensitive file overwrites. It is recommended to use mktemp for secure temporary file handling. Additionally, while the scripts are well-structured, there are areas for improvement concerning script robustness, efficiency, and safety, particularly around temporary file handling (beyond the security concern), command-line utilities, and documented commands.
```shell
    fi
  done
  shopt -u nullglob
} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true
```
The script uses a hardcoded, predictable temporary file path /tmp/.gpu_monitor_k8s_pids. This is vulnerable to a symbolic link attack. An attacker with local access could create a symbolic link at this location pointing to a sensitive system file. When the script runs (especially if run with root privileges as suggested in the installation instructions), it will overwrite the target file with the list of PIDs, potentially leading to a denial of service or system instability.
To remediate this, use mktemp to create a secure, unpredictable temporary file and ensure it is cleaned up on exit.
```diff
-} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true
+K8S_PIDS_FILE=$(mktemp)
+trap 'rm -f "$K8S_PIDS_FILE"' EXIT
+# ... later in the script ...
+} | sort -u > "$K8S_PIDS_FILE" 2>/dev/null || true
+# ... and replace all other occurrences of /tmp/.gpu_monitor_k8s_pids with "$K8S_PIDS_FILE"
```
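The mktemp-plus-trap pattern suggested above can be exercised in isolation. This is a minimal sketch (the PID values are made up for the demo): the inner script creates its PID list with `mktemp`, and the `EXIT` trap removes the file even if the script errors out, so the predictable `/tmp` path never exists.

```shell
# Run the critical section in a child shell so cleanup can be verified after it exits.
pids_path=$(bash <<'EOS'
set -euo pipefail
K8S_PIDS_FILE=$(mktemp)                 # unpredictable name, created with mode 0600
trap 'rm -f "$K8S_PIDS_FILE"' EXIT      # removed even if the script fails midway
printf '%s\n' 4321 1234 1234 | sort -u > "$K8S_PIDS_FILE"
echo "$K8S_PIDS_FILE"                   # report the path so the caller can check it
EOS
)
# The child shell has exited, so its EXIT trap has already fired:
[ ! -e "$pids_path" ] && echo "temp file was cleaned up on exit"
```

Because the trap fires on any exit path, no manual `rm` at the end of the script is needed.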
```shell
tail -f /var/log/gpu-monitor.log

# Remove the cron job
crontab -l | grep -v gpu-monitor.sh | crontab -
```
This command to remove the cron job can be risky. If crontab -l fails (e.g., if no crontab exists for the user), its error message leaks into the pipeline and, depending on shell options such as pipefail, can break the command; the resulting empty pipe into crontab - would then install an empty crontab, wiping the user's entries. A safer pattern is to redirect stderr from crontab -l to /dev/null so the failure cannot break the pipe and cause unintended side effects.
```diff
-crontab -l | grep -v gpu-monitor.sh | crontab -
+crontab -l 2>/dev/null | grep -v gpu-monitor.sh | crontab -
```
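The filtering half of that pipeline can be tested without touching a real crontab by extracting it into a function (the cron entries below are hypothetical, and `filter_cron` is a name introduced for this sketch). The `|| true` guard also covers the edge case where `grep -v` drops every line and returns nonzero, which would otherwise trip `pipefail`:

```shell
# Pure filtering step: drop gpu-monitor.sh entries, keep everything else.
filter_cron() { grep -v 'gpu-monitor\.sh' || true; }

kept=$(printf '%s\n' \
  '0 * * * * /usr/local/bin/gpu-monitor.sh' \
  '5 * * * * /usr/bin/backup' | filter_cron)
echo "$kept"   # only the backup entry survives
```

In the real script, the output of `filter_cron` would be inspected (or at least known to be non-empty) before being piped back into `crontab -`.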
|
|
||
```shell
set -euo pipefail

LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
```
To avoid potential race conditions and permission issues with temporary files, it's a best practice to use mktemp to create a uniquely named temporary file for storing K8s PIDs. Additionally, using a trap ensures that this temporary file is cleaned up automatically when the script exits, even in case of an error.
You can then replace all instances of /tmp/.gpu_monitor_k8s_pids with $K8S_PIDS_FILE and remove the manual rm command at the end of the script.
```diff
 LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
+K8S_PIDS_FILE=$(mktemp)
+trap 'rm -f "$K8S_PIDS_FILE"' EXIT
```
```shell
gpu_mem=$(nvidia-smi --query-compute-apps=pid,gpu_name,used_memory --format=csv,noheader -i 2>/dev/null \
    | grep "^${pid}," || echo "$pid, unknown, unknown")
```
```shell
# Parse GPU info into a temp file to avoid subshell variable scoping issues
local gpu_info
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)
```
Instead of calling xargs multiple times within the loop to trim whitespace, it's more efficient to process the output of nvidia-smi once. You can pipe the output to sed 's/ //g' to remove all spaces, which simplifies the loop body and improves performance.
```diff
-gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)
+gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
```
```shell
idx=$(echo "$idx" | xargs)
total=$(echo "$total" | xargs)
used=$(echo "$used" | xargs)
free=$(echo "$free" | xargs)
```
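To see the single-pass trim in action without a GPU, here is a sketch using a sample line in the same shape as `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output (the memory values are hypothetical). One `sed` pass replaces the four per-field `xargs` calls, after which a plain `read` splits the fields:

```shell
# Sample nvidia-smi CSV line: index, memory.total, memory.used, memory.free
sample="0, 81920, 1024, 80896"

# Strip all spaces once instead of trimming each field separately.
trimmed=$(echo "$sample" | sed 's/ //g')
echo "$trimmed"   # 0,81920,1024,80896

# Fields are now clean, so a comma-split read needs no further trimming.
IFS=',' read -r idx total used free <<< "$trimmed"
echo "gpu $idx: $used MiB used of $total MiB ($free MiB free)"
```

Note that stripping every space is safe here only because the `nounits` query yields purely numeric fields; it would mangle columns that legitimately contain spaces, such as `gpu_name`.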
* add gpu monitoring
* fix the nightly vllm installation
* address gemini comments
* fix lint

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
Avoid stale processes disrupting CI jobs; clean up non-CI GPU processes every 10 minutes.
other bugs: