
[ci]: add gpu monitoring #2718

Merged
sammshen merged 6 commits into LMCache:dev from sammshen:ci-health-check
Mar 11, 2026

Conversation

@sammshen
Contributor

@sammshen commented Mar 9, 2026

Avoid stale processes disrupting CI jobs. Clean up non-CI jobs every 10 minutes.

Other bugs:

  • Fix the nightly wheel link from vLLM

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the CI system by introducing robust GPU monitoring capabilities. The primary goal is to prevent CI job failures caused by stale GPU processes by proactively identifying and reporting them, as well as performing an early health check to ensure GPU resources are available before a job fully commences. This ensures more reliable and efficient CI execution, reducing wasted compute time on jobs destined to fail due to resource contention.

Highlights

  • Host-level GPU Monitoring: Introduced a new script, gpu-monitor.sh, which identifies and logs non-Kubernetes processes that are consuming GPU memory. This helps detect stale processes that might disrupt CI jobs (see the sketch after this list).
  • Automated GPU Monitor Installation: Added setup-gpu-monitor.sh to automate the installation of gpu-monitor.sh as a cron job, scheduled to run every 10 minutes. This script also configures log rotation for the monitor's output.
  • Early GPU Health Check in CI: Integrated a check_gpu_health function into setup-env.sh, which runs at the start of every CI job. This pre-check ensures that GPUs have at least 80% free memory, failing early if stale processes are detected and preventing cryptic CUDA OOM errors.
  • Documentation Updates: Updated the README.md to include comprehensive documentation for the new GPU monitoring features, explaining their purpose, installation, and how to view/remove the cron job.
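To make the detection approach concrete, here is a minimal sketch of the idea behind the host-level monitor, assuming the common pattern of pairing nvidia-smi compute-process output with a /proc cgroup check; the actual gpu-monitor.sh in this PR additionally cross-references crictl and logs bus ID, memory usage, and process age.

#!/usr/bin/env bash
# Illustrative sketch only -- not the gpu-monitor.sh added by this PR.
# List processes holding GPU memory, then flag any PID whose cgroup path
# does not belong to a Kubernetes pod (i.e. a stale host process).
set -euo pipefail

nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits 2>/dev/null \
  | while IFS=',' read -r pid mem; do
      pid=$(echo "$pid" | xargs)   # trim whitespace around the CSV fields
      mem=$(echo "$mem" | xargs)
      # Processes started by kubelet/containerd carry "kubepods" in their cgroup path.
      if ! grep -q kubepods "/proc/${pid}/cgroup" 2>/dev/null; then
          echo "stale host process: PID=${pid}, ${mem} MiB of GPU memory"
      fi
    done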


Changelog
  • .buildkite/k3_harness/README.md
    • Updated the file structure list to include gpu-monitor.sh and setup-gpu-monitor.sh.
    • Added a new section 'GPU monitoring' detailing the purpose and usage of the new scripts, including installation and log viewing instructions.
    • Modified the description of setup-env.sh to reflect the inclusion of a GPU health check.
  • .buildkite/k3_harness/gpu-monitor.sh
    • Added a new script to detect and log non-Kubernetes processes that are using GPU memory.
    • Implemented logic to distinguish K8s pod processes from stale host processes using cgroups and crictl.
    • Provided detailed logging of stale processes, including PID, command, GPU bus ID, memory usage, and process age.
    • Added a summary of GPU memory usage and an error exit code when stale processes are found.
  • .buildkite/k3_harness/setup-env.sh
    • Integrated a GPU health pre-check by sourcing helpers.sh and calling check_gpu_health 80 at the beginning of the script.
    • Added comments to clearly delineate the GPU health pre-check section.
  • .buildkite/k3_harness/setup-gpu-monitor.sh
    • Added a new script to install gpu-monitor.sh as a cron job, running every 10 minutes.
    • Configured log rotation for /var/log/gpu-monitor.log to manage log file size.
    • Ensured idempotency for cron job installation, preventing duplicate entries.
    • Provided clear output confirming the installation details and instructions for viewing logs or removing the cron job.
  • .buildkite/k3_tests/common_scripts/helpers.sh
    • Added a new function check_gpu_health to verify that all visible GPUs have a minimum percentage of free memory (a minimal sketch follows this list).
    • The function reports GPU memory status (total, used, free, percentage free) for each GPU.
    • If a GPU has insufficient free memory, it lists the processes consuming GPU resources and causes the script to exit with an error.
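For orientation, a pre-check in the spirit of check_gpu_health 80 might look roughly like the sketch below. This is an illustration under the assumption that the helper parses per-GPU memory figures from nvidia-smi; it is not the exact code merged into helpers.sh.

# Hedged sketch of a per-GPU free-memory pre-check (illustrative, not the
# exact helpers.sh implementation). Fails if any visible GPU has less than
# the requested percentage of memory free.
check_gpu_health() {
    local min_free_pct="${1:-80}"
    local idx total used free pct
    while IFS=',' read -r idx total used free; do
        idx=$(echo "$idx" | xargs); total=$(echo "$total" | xargs)
        used=$(echo "$used" | xargs); free=$(echo "$free" | xargs)
        pct=$(( free * 100 / total ))
        echo "GPU ${idx}: total=${total} MiB, used=${used} MiB, free=${free} MiB (${pct}% free)"
        if [ "$pct" -lt "$min_free_pct" ]; then
            echo "GPU ${idx} has only ${pct}% free memory (< ${min_free_pct}%); likely a stale process:"
            nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader 2>/dev/null || true
            return 1
        fi
    done < <(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits)
}

Calling check_gpu_health 80 at the top of setup-env.sh fails the job immediately with a readable message instead of surfacing a cryptic CUDA OOM error later in the run.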


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a robust GPU monitoring system for the CI environment, aiming to prevent job failures due to stale processes hogging GPU resources. The implementation includes a host-level cron job and a pre-flight check in CI jobs. However, a critical security vulnerability has been identified in .buildkite/k3_harness/gpu-monitor.sh due to the use of a predictable temporary file in /tmp, which could lead to a symbolic link attack and sensitive file overwrites. It is recommended to use mktemp for secure temporary file handling. Additionally, while the scripts are well-structured, there are areas for improvement concerning script robustness, efficiency, and safety, particularly around temporary file handling (beyond the security concern), command-line utilities, and documented commands.

Comment thread .buildkite/k3_harness/gpu-monitor.sh Outdated
fi
done
shopt -u nullglob
} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true

Severity: high (security)

The script uses a hardcoded, predictable temporary file path /tmp/.gpu_monitor_k8s_pids. This is vulnerable to a symbolic link attack. An attacker with local access could create a symbolic link at this location pointing to a sensitive system file. When the script runs (especially if run with root privileges as suggested in the installation instructions), it will overwrite the target file with the list of PIDs, potentially leading to a denial of service or system instability.

To remediate this, use mktemp to create a secure, unpredictable temporary file and ensure it is cleaned up on exit.

Suggested change
} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true
K8S_PIDS_FILE=$(mktemp)
trap 'rm -f "$K8S_PIDS_FILE"' EXIT
# ... later in the script ...
} | sort -u > "$K8S_PIDS_FILE" 2>/dev/null || true
# ... and replace all other occurrences of /tmp/.gpu_monitor_k8s_pids with "$K8S_PIDS_FILE"

Comment thread .buildkite/k3_harness/README.md Outdated
tail -f /var/log/gpu-monitor.log

# Remove the cron job
crontab -l | grep -v gpu-monitor.sh | crontab -

Severity: medium

This command to remove the cron job can be risky. If crontab -l fails (e.g., if no crontab exists for the user), it can result in an empty pipe to crontab -, which would wipe the user's crontab. A safer pattern is to redirect stderr from crontab -l to /dev/null to prevent errors from breaking the pipe and causing unintended side effects.

Suggested change
crontab -l | grep -v gpu-monitor.sh | crontab -
crontab -l 2>/dev/null | grep -v gpu-monitor.sh | crontab -


set -euo pipefail

LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"

Severity: medium

To avoid potential race conditions and permission issues with temporary files, it's a best practice to use mktemp to create a uniquely named temporary file for storing K8s PIDs. Additionally, using a trap ensures that this temporary file is cleaned up automatically when the script exits, even in case of an error.

You can then replace all instances of /tmp/.gpu_monitor_k8s_pids with $K8S_PIDS_FILE and remove the manual rm command at the end of the script.

Suggested change
LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
K8S_PIDS_FILE=$(mktemp)
trap 'rm -f "$K8S_PIDS_FILE"' EXIT

Comment thread .buildkite/k3_harness/gpu-monitor.sh Outdated
Comment on lines +73 to +74
gpu_mem=$(nvidia-smi --query-compute-apps=pid,gpu_name,used_memory --format=csv,noheader -i 2>/dev/null \
| grep "^${pid}," || echo "$pid, unknown, unknown")

Severity: medium

The gpu_mem variable is assigned a value but is never used later in the script. It can be removed to improve code clarity.


# Parse GPU info into a temp file to avoid subshell variable scoping issues
local gpu_info
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)

Severity: medium

Instead of calling xargs multiple times within the loop to trim whitespace, it's more efficient to process the output of nvidia-smi once. You can pipe the output to sed 's/ //g' to remove all spaces, which simplifies the loop body and improves performance.

Suggested change
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')

Comment on lines +33 to +36
idx=$(echo "$idx" | xargs)
total=$(echo "$total" | xargs)
used=$(echo "$used" | xargs)
free=$(echo "$free" | xargs)

Severity: medium

Following the suggestion to use sed to process the nvidia-smi output, these lines for trimming whitespace are no longer necessary and can be removed.
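To make the simplification concrete, the per-GPU loop could then read the pre-trimmed fields directly (illustrative sketch, reusing the query from the snippet above):

# Whitespace is already stripped by sed, so no per-field xargs calls are needed.
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
while IFS=',' read -r idx total used free; do
    [ -z "$idx" ] && continue   # skip blank lines (e.g. if nvidia-smi produced no output)
    echo "GPU ${idx}: ${used}/${total} MiB used, ${free} MiB free"
done <<< "$gpu_info"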

Samuel Shen and others added 5 commits March 9, 2026 20:53
@sammshen enabled auto-merge (squash) March 10, 2026
@github-actions bot added the full (Run comprehensive tests on this PR) label Mar 10, 2026
@sammshen merged commit d9b6580 into LMCache:dev Mar 11, 2026
28 of 29 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
* add gpu monitoring

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix the nightly vllm installation

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* address gemini comments

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix lint

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
* add gpu monitoring

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix the nightly vllm installation

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* address gemini comments

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix lint

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
