
[ci]: add gpu monitoring #2718

Merged
sammshen merged 6 commits into LMCache:dev from sammshen:ci-health-check
Mar 11, 2026

Conversation

@sammshen
Contributor

@sammshen commented Mar 9, 2026

Avoid stale processes disrupting CI jobs. Clean up non-CI jobs every 10 minutes.

Other bugs:

  • Fix the nightly wheel link from vLLM

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the CI system by introducing robust GPU monitoring capabilities. The primary goal is to prevent CI job failures caused by stale GPU processes by proactively identifying and reporting them, as well as performing an early health check to ensure GPU resources are available before a job fully commences. This ensures more reliable and efficient CI execution, reducing wasted compute time on jobs destined to fail due to resource contention.

Highlights

  • Host-level GPU Monitoring: Introduced a new script, gpu-monitor.sh, which identifies and logs non-Kubernetes processes that are consuming GPU memory. This helps detect stale processes that might disrupt CI jobs (see the sketch after this list).
  • Automated GPU Monitor Installation: Added setup-gpu-monitor.sh to automate the installation of gpu-monitor.sh as a cron job, scheduled to run every 10 minutes. This script also configures log rotation for the monitor's output.
  • Early GPU Health Check in CI: Integrated a check_gpu_health function into setup-env.sh, which runs at the start of every CI job. This pre-check ensures that GPUs have at least 80% free memory, failing early if stale processes are detected and preventing cryptic CUDA OOM errors.
  • Documentation Updates: Updated the README.md to include comprehensive documentation for the new GPU monitoring features, explaining their purpose, installation, and how to view/remove the cron job.
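To make the detection approach concrete, here is a minimal sketch of the idea behind the host-level monitor, assuming the common pattern of pairing nvidia-smi compute-process output with a /proc cgroup check; the actual gpu-monitor.sh in this PR additionally cross-references crictl and logs bus ID, memory usage, and process age.

#!/usr/bin/env bash
# Illustrative sketch only -- not the gpu-monitor.sh added by this PR.
# List processes holding GPU memory, then flag any PID whose cgroup path
# does not belong to a Kubernetes pod (i.e. a stale host process).
set -euo pipefail

nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits 2>/dev/null \
  | while IFS=',' read -r pid mem; do
      pid=$(echo "$pid" | xargs)   # trim whitespace around the CSV fields
      mem=$(echo "$mem" | xargs)
      # Processes started by kubelet/containerd carry "kubepods" in their cgroup path.
      if ! grep -q kubepods "/proc/${pid}/cgroup" 2>/dev/null; then
          echo "stale host process: PID=${pid}, ${mem} MiB of GPU memory"
      fi
    done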


Changelog
  • .buildkite/k3_harness/README.md
    • Updated the file structure list to include gpu-monitor.sh and setup-gpu-monitor.sh.
    • Added a new section 'GPU monitoring' detailing the purpose and usage of the new scripts, including installation and log viewing instructions.
    • Modified the description of setup-env.sh to reflect the inclusion of a GPU health check.
  • .buildkite/k3_harness/gpu-monitor.sh
    • Added a new script to detect and log non-Kubernetes processes that are using GPU memory.
    • Implemented logic to distinguish K8s pod processes from stale host processes using cgroups and crictl.
    • Provided detailed logging of stale processes, including PID, command, GPU bus ID, memory usage, and process age.
    • Added a summary of GPU memory usage and an error exit code when stale processes are found.
  • .buildkite/k3_harness/setup-env.sh
    • Integrated a GPU health pre-check by sourcing helpers.sh and calling check_gpu_health 80 at the beginning of the script.
    • Added comments to clearly delineate the GPU health pre-check section.
  • .buildkite/k3_harness/setup-gpu-monitor.sh
    • Added a new script to install gpu-monitor.sh as a cron job, running every 10 minutes.
    • Configured log rotation for /var/log/gpu-monitor.log to manage log file size.
    • Ensured idempotency for cron job installation, preventing duplicate entries.
    • Provided clear output confirming the installation details and instructions for viewing logs or removing the cron job.
  • .buildkite/k3_tests/common_scripts/helpers.sh
    • Added a new function check_gpu_health to verify that all visible GPUs have a minimum percentage of free memory (a minimal sketch follows this list).
    • The function reports GPU memory status (total, used, free, percentage free) for each GPU.
    • If a GPU has insufficient free memory, it lists the processes consuming GPU resources and causes the script to exit with an error.
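For orientation, a pre-check in the spirit of check_gpu_health 80 might look roughly like the sketch below. This is an illustration under the assumption that the helper parses per-GPU memory figures from nvidia-smi; it is not the exact code merged into helpers.sh.

# Hedged sketch of a per-GPU free-memory pre-check (illustrative, not the
# exact helpers.sh implementation). Fails if any visible GPU has less than
# the requested percentage of memory free.
check_gpu_health() {
    local min_free_pct="${1:-80}"
    local idx total used free pct
    while IFS=',' read -r idx total used free; do
        idx=$(echo "$idx" | xargs); total=$(echo "$total" | xargs)
        used=$(echo "$used" | xargs); free=$(echo "$free" | xargs)
        pct=$(( free * 100 / total ))
        echo "GPU ${idx}: total=${total} MiB, used=${used} MiB, free=${free} MiB (${pct}% free)"
        if [ "$pct" -lt "$min_free_pct" ]; then
            echo "GPU ${idx} has only ${pct}% free memory (< ${min_free_pct}%); likely a stale process:"
            nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader 2>/dev/null || true
            return 1
        fi
    done < <(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits)
}

Calling check_gpu_health 80 at the top of setup-env.sh fails the job immediately with a readable message instead of surfacing a cryptic CUDA OOM error later in the run.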


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a robust GPU monitoring system for the CI environment, aiming to prevent job failures due to stale processes hogging GPU resources. The implementation includes a host-level cron job and a pre-flight check in CI jobs. However, a critical security vulnerability has been identified in .buildkite/k3_harness/gpu-monitor.sh due to the use of a predictable temporary file in /tmp, which could lead to a symbolic link attack and sensitive file overwrites. It is recommended to use mktemp for secure temporary file handling. Additionally, while the scripts are well-structured, there are areas for improvement concerning script robustness, efficiency, and safety, particularly around temporary file handling (beyond the security concern), command-line utilities, and documented commands.

Comment thread .buildkite/k3_harness/gpu-monitor.sh Outdated
fi
done
shopt -u nullglob
} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true

Severity: high (security)

The script uses a hardcoded, predictable temporary file path /tmp/.gpu_monitor_k8s_pids. This is vulnerable to a symbolic link attack. An attacker with local access could create a symbolic link at this location pointing to a sensitive system file. When the script runs (especially if run with root privileges as suggested in the installation instructions), it will overwrite the target file with the list of PIDs, potentially leading to a denial of service or system instability.

To remediate this, use mktemp to create a secure, unpredictable temporary file and ensure it is cleaned up on exit.

Suggested change
} | sort -u > /tmp/.gpu_monitor_k8s_pids 2>/dev/null || true
K8S_PIDS_FILE=$(mktemp)
trap 'rm -f "$K8S_PIDS_FILE"' EXIT
# ... later in the script ...
} | sort -u > "$K8S_PIDS_FILE" 2>/dev/null || true
# ... and replace all other occurrences of /tmp/.gpu_monitor_k8s_pids with "$K8S_PIDS_FILE"

Comment thread .buildkite/k3_harness/README.md Outdated
tail -f /var/log/gpu-monitor.log

# Remove the cron job
crontab -l | grep -v gpu-monitor.sh | crontab -

Severity: medium

This command to remove the cron job can be risky. If crontab -l fails (e.g., if no crontab exists for the user), it can result in an empty pipe to crontab -, which would wipe the user's crontab. A safer pattern is to redirect stderr from crontab -l to /dev/null to prevent errors from breaking the pipe and causing unintended side effects.

Suggested change
crontab -l | grep -v gpu-monitor.sh | crontab -
crontab -l 2>/dev/null | grep -v gpu-monitor.sh | crontab -


set -euo pipefail

LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"

Severity: medium

To avoid potential race conditions and permission issues with temporary files, it's a best practice to use mktemp to create a uniquely named temporary file for storing K8s PIDs. Additionally, using a trap ensures that this temporary file is cleaned up automatically when the script exits, even in case of an error.

You can then replace all instances of /tmp/.gpu_monitor_k8s_pids with $K8S_PIDS_FILE and remove the manual rm command at the end of the script.

Suggested change
LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
LOG_PREFIX="[gpu-monitor $(date '+%Y-%m-%d %H:%M:%S')]"
K8S_PIDS_FILE=$(mktemp)
trap 'rm -f "$K8S_PIDS_FILE"' EXIT

Comment thread .buildkite/k3_harness/gpu-monitor.sh Outdated
Comment on lines +73 to +74
gpu_mem=$(nvidia-smi --query-compute-apps=pid,gpu_name,used_memory --format=csv,noheader -i 2>/dev/null \
| grep "^${pid}," || echo "$pid, unknown, unknown")

Severity: medium

The gpu_mem variable is assigned a value but is never used later in the script. It can be removed to improve code clarity.


# Parse GPU info into a temp file to avoid subshell variable scoping issues
local gpu_info
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)

Severity: medium

Instead of calling xargs multiple times within the loop to trim whitespace, it's more efficient to process the output of nvidia-smi once. You can pipe the output to sed 's/ //g' to remove all spaces, which simplifies the loop body and improves performance.

Suggested change
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null)
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')

Comment on lines +33 to +36
idx=$(echo "$idx" | xargs)
total=$(echo "$total" | xargs)
used=$(echo "$used" | xargs)
free=$(echo "$free" | xargs)

Severity: medium

Following the suggestion to use sed to process the nvidia-smi output, these lines for trimming whitespace are no longer necessary and can be removed.
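To make the simplification concrete, the per-GPU loop could then read the pre-trimmed fields directly (illustrative sketch, reusing the query from the snippet above):

# Whitespace is already stripped by sed, so no per-field xargs calls are needed.
gpu_info=$(nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
while IFS=',' read -r idx total used free; do
    [ -z "$idx" ] && continue   # skip blank lines (e.g. if nvidia-smi produced no output)
    echo "GPU ${idx}: ${used}/${total} MiB used, ${free} MiB free"
done <<< "$gpu_info"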

Samuel Shen and others added 5 commits March 9, 2026 20:53
@sammshen enabled auto-merge (squash) March 10, 2026
@github-actions bot added the full (Run comprehensive tests on this PR) label Mar 10, 2026
@sammshen merged commit d9b6580 into LMCache:dev Mar 11, 2026
28 of 29 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
* add gpu monitoring

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix the nightly vllm installation

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* address gemini comments

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix lint

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
* add gpu monitoring

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix the nightly vllm installation

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* address gemini comments

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* fix lint

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
