K3 CI Refactor #2663
K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s:
- k3_harness/: cluster setup, env setup, base image, teardown scripts
- k3_tests/: comprehensive, correctness, integration, multiprocess pipelines
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
… socket
- Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window
- upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main
- Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push
- Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first
- Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping)
- Fix correctness: replace col -b with sed for man page formatting
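The rolling-baseline comparison described above could be sketched roughly as follows. This is an illustration only: the function names `baselines_in_window` and `worst_case`, and the one-number-per-line input to `worst_case`, are assumptions and are not taken from upload-baselines.sh. How the metric is read out of each JSON file is left out on purpose.

```shell
#!/usr/bin/env bash
# Sketch only: names and input format are hypothetical, not from the PR.

# List baseline files for FEATURE in DIR stamped on or after CUTOFF (YYYYMMDD).
baselines_in_window() {
  local dir="$1" feature="$2" cutoff="$3" f stamp
  for f in "$dir/$feature"-????????.json; do
    [ -e "$f" ] || continue          # glob matched nothing
    stamp="${f##*-}"                 # drop everything up to the last '-'
    stamp="${stamp%.json}"           # drop the extension
    [[ "$stamp" =~ ^[0-9]{8}$ ]] || continue
    if [ "$stamp" -ge "$cutoff" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Read one number per line on stdin and print the largest (the worst case).
worst_case() {
  awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { if (NR > 0) print max }'
}
```

With GNU date, a cutoff for a 5-day window could be computed as `date -d '5 days ago' +%Y%m%d`; whether the real window is inclusive of that day is not stated in the PR.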
Summary of Changes (Gemini Code Assist)
This pull request significantly refactors the continuous integration system by migrating existing tests to a new K3s-based infrastructure. This change aims to enhance the efficiency and reliability of CI pipelines through improved resource management, standardized environment provisioning, and robust cleanup mechanisms. The new setup provides a more scalable and maintainable foundation for running various test types, from correctness checks to performance benchmarks, by leveraging Kubernetes capabilities for isolated and ephemeral test environments.
Code Review
This pull request refactors the CI infrastructure to a K3s-based system, aiming for improved resource scheduling, a unified environment, and more reliable cleanup. However, critical security vulnerabilities were identified in the newly added shell scripts, including shell command injection via unsanitized variables in yq and bash -c commands, and potential secret exposure by passing tokens through command-line arguments and URLs. Addressing these security concerns is paramount, especially since these scripts process configuration files that could be manipulated in a pull request. Additionally, general issues were found, such as missing system dependencies (yq, jq) in the base Docker image and an incorrect Helm release name in the teardown script.
source .buildkite/k3_harness/setup-env.sh

# Install test utilities (yq for YAML parsing, jq for JSON, openai/pandas/matplotlib for benchmarks)
uv pip install yq jq openai pandas matplotlib 2>/dev/null || true
This command attempts to install yq and jq using uv pip, but they are not Python packages and cannot be installed this way. This will fail (though the error is suppressed by || true), and the subsequent test script will fail because yq and jq are not found. These dependencies should be installed in the base Docker image via apt-get.
Suggested change:
- uv pip install yq jq openai pandas matplotlib 2>/dev/null || true
+ uv pip install openai pandas matplotlib 2>/dev/null || true
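Because the install line suppresses errors with `|| true`, a missing tool only surfaces later in the test script. One way to fail early is a small pre-flight check. This is a sketch, and the helper name `require_cmds` is hypothetical, not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch only: require_cmds is a hypothetical helper.

# Return non-zero (and name the culprits on stderr) if any command is missing.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use at the top of a test script:
#   require_cmds yq jq || exit 1
```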
alloc=$(yq -er '.["docker-decoder"]["alloc-port"]' "$cfg_file" 2>/dev/null || echo "7400")

# Inject PD-specific env vars into docker sections
prefiller_docker=$(echo "$prefiller_docker" | yq -y ". + {\"env\": (.env + [\"LMCACHE_PD_PROXY_PORT=$proxy\"])}")
The proxy variable, which is extracted from a configuration file, is interpolated directly into a shell command string for yq. This allows for shell command injection if the configuration file contains malicious values. An attacker could exploit this by submitting a pull request with a modified configuration file.
Suggested change:
- prefiller_docker=$(echo "$prefiller_docker" | yq -y ". + {\"env\": (.env + [\"LMCACHE_PD_PROXY_PORT=$proxy\"])}")
+ prefiller_docker=$(echo "$prefiller_docker" | yq -y --arg proxy "$proxy" '. + {"env": (.env + ["LMCACHE_PD_PROXY_PORT=" + $proxy])}')
# Inject PD-specific env vars into docker sections
prefiller_docker=$(echo "$prefiller_docker" | yq -y ". + {\"env\": (.env + [\"LMCACHE_PD_PROXY_PORT=$proxy\"])}")
decoder_docker=$(echo "$decoder_docker" | yq -y ". + {\"env\": (.env + [\"LMCACHE_PD_PEER_INIT_PORT=$init\", \"LMCACHE_PD_PEER_ALLOC_PORT=$alloc\"])}")
The init and alloc variables are interpolated directly into a shell command string for yq, leading to a potential shell command injection vulnerability similar to the one found on line 131.
Suggested change:
- decoder_docker=$(echo "$decoder_docker" | yq -y ". + {\"env\": (.env + [\"LMCACHE_PD_PEER_INIT_PORT=$init\", \"LMCACHE_PD_PEER_ALLOC_PORT=$alloc\"])}")
+ decoder_docker=$(echo "$decoder_docker" | yq -y --arg init "$init" --arg alloc "$alloc" '. + {"env": (.env + ["LMCACHE_PD_PEER_INIT_PORT=" + $init, "LMCACHE_PD_PEER_ALLOC_PORT=" + $alloc])}')
reply=$(yq -er '.docker1["reply-port"]' "$cfg_file" 2>/dev/null || echo "8400")

# Inject controller URLs
docker1=$(echo "$docker1" | yq -y ". + {\"env\": (.env + [\"LMCACHE_CONTROLLER_PULL_URL=localhost:$pull\", \"LMCACHE_CONTROLLER_REPLY_URL=localhost:$reply\", \"UCX_TLS=tcp\"])}")
The pull and reply variables are interpolated into a yq command string, creating a shell command injection vulnerability.
Suggested change:
- docker1=$(echo "$docker1" | yq -y ". + {\"env\": (.env + [\"LMCACHE_CONTROLLER_PULL_URL=localhost:$pull\", \"LMCACHE_CONTROLLER_REPLY_URL=localhost:$reply\", \"UCX_TLS=tcp\"])}")
+ docker1=$(echo "$docker1" | yq -y --arg pull "$pull" --arg reply "$reply" '. + {"env": (.env + ["LMCACHE_CONTROLLER_PULL_URL=localhost:" + $pull, "LMCACHE_CONTROLLER_REPLY_URL=localhost:" + $reply, "UCX_TLS=tcp"])}')
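Since every value flagged in these threads (proxy, init, alloc, pull, reply) is expected to be a port number, a complementary defense is to reject anything that is not a plain port before it reaches a filter at all. The following is a sketch under that assumption; `is_valid_port` is a hypothetical helper and does not replace the `--arg` fix.

```shell
#!/usr/bin/env bash
# Sketch only: is_valid_port is a hypothetical helper, not part of the PR.

# Succeed only for a decimal integer in the range 1..65535.
is_valid_port() {
  local p="$1"
  [[ "$p" =~ ^[0-9]{1,5}$ ]] || return 1
  # 10# forces base 10 so a leading zero is not read as octal.
  (( 10#$p >= 1 && 10#$p <= 65535 ))
}

# Example use after reading a value from the config file:
#   is_valid_port "$proxy" || { echo "bad proxy port: $proxy" >&2; exit 1; }
```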
# Build workload JSON (merge workload section with model, strip non-CLI fields)
# Fields like expected-latency-gain are used by the checking logic, not long_doc_qa.py.
# "completion" -> "completions" rename to match the argparse flag.
workload_yaml="$(yq "(.workload * {\"model\": \"$model\"}) | del(.type) | del(.[\"expected-latency-gain\"]) | if .completion then .completions = .completion | del(.completion) else . end" "$cfg_file")"
The model variable is interpolated into a yq command string, which can lead to command injection if the model name in the configuration file contains shell metacharacters or yq filter delimiters.
Suggested change:
- workload_yaml="$(yq "(.workload * {\"model\": \"$model\"}) | del(.type) | del(.[\"expected-latency-gain\"]) | if .completion then .completions = .completion | del(.completion) else . end" "$cfg_file")"
+ workload_yaml="$(yq --arg model "$model" '(.workload * {"model": $model}) | del(.type) | del(.["expected-latency-gain"]) | if .completion then .completions = .completion | del(.completion) else . end' "$cfg_file")"
&& apt-get install -y --no-install-recommends \
ccache software-properties-common git curl sudo \
python3 python3-dev python3-venv python3-pip tzdata libxcb1-dev \
The comprehensive tests rely on yq and jq for parsing YAML and JSON files. These tools are not installed in this base image, which will cause test failures. They should be added to the apt-get install command.
&& apt-get install -y --no-install-recommends \
ccache software-properties-common git curl sudo yq jq \
python3 python3-dev python3-venv python3-pip tzdata libxcb1-dev \
if helm status buildkite-agent -n buildkite &>/dev/null; then
  echo "→ Removing agent-stack-k8s..."
  helm uninstall buildkite-agent -n buildkite --wait
fi
The Helm release for the Buildkite agent is named agent-stack-k8s in install-agent-stack.sh, but this script uses buildkite-agent to check the status and uninstall it. This will cause the teardown for the agent to fail. The release name should be consistent.
Suggested change:
- if helm status buildkite-agent -n buildkite &>/dev/null; then
-   echo "→ Removing agent-stack-k8s..."
-   helm uninstall buildkite-agent -n buildkite --wait
- fi
+ if helm status agent-stack-k8s -n buildkite &>/dev/null; then
+   echo "→ Removing agent-stack-k8s..."
+   helm uninstall agent-stack-k8s -n buildkite --wait
+ fi
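One way to keep the install and teardown scripts from drifting apart again is to define the release name and namespace once and reference them everywhere. This is a sketch: the variable names `AGENT_RELEASE` and `AGENT_NAMESPACE` and the function `remove_agent_stack` are hypothetical, while the `helm status` and `helm uninstall` invocations mirror the ones in the snippet above.

```shell
#!/usr/bin/env bash
# Sketch only: variable and function names are hypothetical.
# These two values could live in a small file sourced by both scripts.
AGENT_RELEASE="agent-stack-k8s"
AGENT_NAMESPACE="buildkite"

# Uninstall the agent release only if it is currently installed.
remove_agent_stack() {
  if helm status "$AGENT_RELEASE" -n "$AGENT_NAMESPACE" &>/dev/null; then
    echo "→ Removing ${AGENT_RELEASE}..."
    helm uninstall "$AGENT_RELEASE" -n "$AGENT_NAMESPACE" --wait
  fi
}
```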
# - pod-spec-patch: injects GITHUB_TOKEN into job containers for push operations
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
  --namespace buildkite --create-namespace \
  --set agentToken="${TOKEN}" \
The Buildkite agent token is passed to helm upgrade using the --set flag, which can expose the token in the process list (e.g., via ps aux) to other users on the system. It is more secure to pass sensitive values using environment variables, secret files, or by referencing an existing Kubernetes secret. Additionally, for better readability and maintainability, consider moving the JSON configuration passed to --set-json into a temporary YAML file and using helm upgrade --values <file>.
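The values-file approach mentioned in this comment could look roughly like the sketch below. The helper name `write_agent_values` is hypothetical, and the sketch assumes the token contains no double quotes or backslashes, since it is written into a quoted YAML string without escaping. The `agentToken` key is the one the original `--set` flag uses.

```shell
#!/usr/bin/env bash
# Sketch only: write_agent_values is a hypothetical helper.
# Assumes the token has no double quotes or backslashes.

# Write the token to a private temp file and print the file's path.
write_agent_values() {
  local token="$1" file
  file="$(mktemp)" || return 1
  chmod 600 "$file"
  # printf is a shell builtin, so the token is not passed to a child process.
  printf 'agentToken: "%s"\n' "$token" > "$file"
  printf '%s\n' "$file"
}

# Example use (remaining flags as in the original command):
#   values_file="$(write_agent_values "$TOKEN")"
#   helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
#     --namespace buildkite --create-namespace --values "$values_file"
#   rm -f "$values_file"
```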
local port="${1:-8000}"
while [ "$port" -lt 65536 ]; do
  if ! lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1 &&
     ! timeout 1 bash -c "</dev/tcp/127.0.0.1/${port}" 2>/dev/null; then
The find_free_port function is vulnerable to command injection because the port variable is interpolated directly into a bash -c command string. Although currently called with hardcoded values, this utility function is inherently unsafe if used with any external input.
Suggested change:
- ! timeout 1 bash -c "</dev/tcp/127.0.0.1/${port}" 2>/dev/null; then
+ if ! lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1 &&
+    ! timeout 1 bash -c "</dev/tcp/127.0.0.1/$((port))" 2>/dev/null; then
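An alternative to arithmetic expansion is to validate the input up front and pass the port to the inner shell as a positional argument, so it is never parsed as part of the command string. The sketch below is a simplified variant, not the PR's function: it drops the `lsof` check, relies only on the `/dev/tcp` probe, and assumes `timeout` from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# Sketch only: simplified variant of find_free_port (no lsof check).
# Assumes bash with /dev/tcp support and GNU coreutils `timeout`.

# Print the first port >= START (default 8000) that refuses a local connection.
find_free_port() {
  local port="${1:-8000}"
  if ! [[ "$port" =~ ^[0-9]{1,5}$ ]]; then
    echo "invalid port: $port" >&2
    return 2
  fi
  port=$((10#$port))   # force base 10 in case of leading zeros
  while [ "$port" -lt 65536 ]; do
    # The single-quoted script sees the port only as "$1".
    if ! timeout 1 bash -c '</dev/tcp/127.0.0.1/$1' _ "$port" 2>/dev/null; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}
```

As with the original, a port reported as free can still be taken by another process before the caller binds to it.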
if [[ -n "${GITHUB_TOKEN:-}" ]]; then
  # Extract owner/repo from any URL format (SSH or HTTPS)
  REPO_PATH="$(echo "$ORIGIN_URL" | sed -E 's|.*github\.com[:/]||' | sed 's/\.git$//')"
  PUSH_URL="https://x-access-token:${GITHUB_TOKEN}@github.com/${REPO_PATH}.git"
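To make the owner/repo extraction in this snippet easier to check, the same two `sed` expressions can be wrapped in a function and run against sample URLs. This is a sketch; the function name `repo_path_from_url` and the `owner/repo` URLs are placeholders, not values from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: repo_path_from_url is a hypothetical wrapper around the
# two sed expressions used in the snippet above.

# Print "owner/repo" for an SSH or HTTPS GitHub remote URL.
repo_path_from_url() {
  printf '%s\n' "$1" | sed -E 's|.*github\.com[:/]||' | sed 's/\.git$//'
}

# Example inputs that all reduce to "owner/repo":
#   repo_path_from_url "git@github.com:owner/repo.git"
#   repo_path_from_url "https://github.com/owner/repo"
```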
GPU_MEMORY_GB=$((GPU_MEMORY_MB / 1024))
echo "Detected GPU memory: ${GPU_MEMORY_GB}GB (${GPU_MEMORY_MB}MB)"

if [ "$GPU_MEMORY_GB" -gt 100 ]; then
ApostaC left a comment:
LGTM! Let's put it online for a few days and see what happens.
echo "Detected GPU memory: ${GPU_MEMORY_GB}GB (${GPU_MEMORY_MB}MB)"

Suggested change:
- if [ "$GPU_MEMORY_GB" -gt 100 ]; then
+ if [ "$GPU_MEMORY_GB" -gt 90 ]; then
to stay consistent with the new RTX 6000 96 GB in the k3_tests
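The effect of lowering the threshold can be checked with the same integer arithmetic the script uses. In this sketch the helper name `gpu_exceeds_gb` is hypothetical, and 98304 MB (exactly 96 × 1024) is an illustrative value; the figure a real 96 GB card reports may be somewhat lower.

```shell
#!/usr/bin/env bash
# Sketch only: gpu_exceeds_gb is a hypothetical helper; 98304 MB is an
# illustrative value, not a measured one.

# Succeed if MEMORY_MB, converted to whole GB, is strictly above THRESHOLD_GB.
gpu_exceeds_gb() {
  local memory_mb="$1" threshold_gb="$2"
  local memory_gb=$((memory_mb / 1024))   # integer division, rounds down
  [ "$memory_gb" -gt "$threshold_gb" ]
}

# A 96 GB card clears the new threshold but not the old one:
#   gpu_exceeds_gb 98304 90    # succeeds
#   gpu_exceeds_gb 98304 100   # fails
```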
I understand the high-level design (migrating to k3s) but I am not sure about the concrete details. Do we have any designs for CI that I can refer to?
@KuntaiDu thanks for the suggestion! Just added a pretty concise ARCHITECTURE.md
* Add smoke test for new yotta-lab queues
* Add K3s CI harness and test pipelines
  K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s:
  - k3_harness/: cluster setup, env setup, base image, teardown scripts
  - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines
* forgot to add target queue
* fix README
* fix container name
* change image pull policy
* fix relative paths
* non-container integration run script
* rewrite scripts
* fix correctness
* rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket
  - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window
  - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main
  - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push
  - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first
  - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping)
  - Fix correctness: replace col -b with sed for man page formatting
* fix mp ci
* change priority
* change installation back to editable
* fix p2p mem check
* speed up mp
* relative thresholds for mp long_doc_qa, parallel vllm startup
* set both thresholds to 10%
* skip mem leak check for p2p
* revert to sequential vllm startup, --master-port doesnt help
* fix parallel vllm startup by unsetting VLLM_PORT env var
* add integration nightly docs, fix teardown helm name, use yq --arg
* Remove health checks
* parallellize integration tests
* loosen MP test
* reduce gpu util
* fix correctness in MP
* add ARCHITECTURE.md
* add local_cpu_mla.yaml test back
---------
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Ofer Kiselov Nahman <ofer.kiselovnahman@weka.io>
* Add smoke test for new yotta-lab queues * Add K3s CI harness and test pipelines K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s: - k3_harness/: cluster setup, env setup, base image, teardown scripts - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines * forgot to add target queue Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix README Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix container name Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change image pull policy Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix relative paths Signed-off-by: Samuel Shen <slshen@uchciago.edu> * non-container integration run script Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rewrite scripts Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping) - Fix correctness: replace col -b with sed for man page formatting * fix mp ci Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change priority Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change installation back to editable Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix p2p mem check Signed-off-by: Samuel Shen <slshen@uchciago.edu> * speed up mp Signed-off-by: Samuel Shen <slshen@uchciago.edu> * relative thresholds for mp long_doc_qa, parallel vllm startup * set both 
thresholds to 10% * skip mem leak check for p2p * revert to sequential vllm startup, --master-port doesnt help * fix parallel vllm startup by unsetting VLLM_PORT env var * add integration nightly docs, fix teardown helm name, use yq --arg * Remove health checks Signed-off-by: Samuel Shen <slshen@uchciago.edu> * parallellize integration tests Signed-off-by: Samuel Shen <slshen@uchciago.edu> * loosen MP test Signed-off-by: Samuel Shen <slshen@uchciago.edu> * reduce gpu util Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness in MP Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add ARCHITECTURE.md Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add local_cpu_mla.yaml test back Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* Add smoke test for new yotta-lab queues * Add K3s CI harness and test pipelines K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s: - k3_harness/: cluster setup, env setup, base image, teardown scripts - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines * forgot to add target queue Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix README Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix container name Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change image pull policy Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix relative paths Signed-off-by: Samuel Shen <slshen@uchciago.edu> * non-container integration run script Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rewrite scripts Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping) - Fix correctness: replace col -b with sed for man page formatting * fix mp ci Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change priority Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change installation back to editable Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix p2p mem check Signed-off-by: Samuel Shen <slshen@uchciago.edu> * speed up mp Signed-off-by: Samuel Shen <slshen@uchciago.edu> * relative thresholds for mp long_doc_qa, parallel vllm startup * set both 
thresholds to 10% * skip mem leak check for p2p * revert to sequential vllm startup, --master-port doesnt help * fix parallel vllm startup by unsetting VLLM_PORT env var * add integration nightly docs, fix teardown helm name, use yq --arg * Remove health checks Signed-off-by: Samuel Shen <slshen@uchciago.edu> * parallellize integration tests Signed-off-by: Samuel Shen <slshen@uchciago.edu> * loosen MP test Signed-off-by: Samuel Shen <slshen@uchciago.edu> * reduce gpu util Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness in MP Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add ARCHITECTURE.md Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add local_cpu_mla.yaml test back Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* Add smoke test for new yotta-lab queues * Add K3s CI harness and test pipelines K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s: - k3_harness/: cluster setup, env setup, base image, teardown scripts - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines * forgot to add target queue Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix README Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix container name Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change image pull policy Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix relative paths Signed-off-by: Samuel Shen <slshen@uchciago.edu> * non-container integration run script Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rewrite scripts Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping) - Fix correctness: replace col -b with sed for man page formatting * fix mp ci Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change priority Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change installation back to editable Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix p2p mem check Signed-off-by: Samuel Shen <slshen@uchciago.edu> * speed up mp Signed-off-by: Samuel Shen <slshen@uchciago.edu> * relative thresholds for mp long_doc_qa, parallel vllm startup * set both 
thresholds to 10% * skip mem leak check for p2p * revert to sequential vllm startup, --master-port doesnt help * fix parallel vllm startup by unsetting VLLM_PORT env var * add integration nightly docs, fix teardown helm name, use yq --arg * Remove health checks Signed-off-by: Samuel Shen <slshen@uchciago.edu> * parallellize integration tests Signed-off-by: Samuel Shen <slshen@uchciago.edu> * loosen MP test Signed-off-by: Samuel Shen <slshen@uchciago.edu> * reduce gpu util Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness in MP Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add ARCHITECTURE.md Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add local_cpu_mla.yaml test back Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* Add smoke test for new yotta-lab queues * Add K3s CI harness and test pipelines K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s: - k3_harness/: cluster setup, env setup, base image, teardown scripts - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines * forgot to add target queue Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix README Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix container name Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change image pull policy Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix relative paths Signed-off-by: Samuel Shen <slshen@uchciago.edu> * non-container integration run script Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rewrite scripts Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping) - Fix correctness: replace col -b with sed for man page formatting * fix mp ci Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change priority Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change installation back to editable Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix p2p mem check Signed-off-by: Samuel Shen <slshen@uchciago.edu> * speed up mp Signed-off-by: Samuel Shen <slshen@uchciago.edu> * relative thresholds for mp long_doc_qa, parallel vllm startup * set both 
thresholds to 10% * skip mem leak check for p2p * revert to sequential vllm startup, --master-port doesnt help * fix parallel vllm startup by unsetting VLLM_PORT env var * add integration nightly docs, fix teardown helm name, use yq --arg * Remove health checks Signed-off-by: Samuel Shen <slshen@uchciago.edu> * parallellize integration tests Signed-off-by: Samuel Shen <slshen@uchciago.edu> * loosen MP test Signed-off-by: Samuel Shen <slshen@uchciago.edu> * reduce gpu util Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness in MP Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add ARCHITECTURE.md Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add local_cpu_mla.yaml test back Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu> Signed-off-by: shaoxiawjc <wjc2800@163.com>
* Add smoke test for new yotta-lab queues * Add K3s CI harness and test pipelines K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s: - k3_harness/: cluster setup, env setup, base image, teardown scripts - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines * forgot to add target queue Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix README Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix container name Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change image pull policy Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix relative paths Signed-off-by: Samuel Shen <slshen@uchciago.edu> * non-container integration run script Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rewrite scripts Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness Signed-off-by: Samuel Shen <slshen@uchciago.edu> * rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping) - Fix correctness: replace col -b with sed for man page formatting * fix mp ci Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change priority Signed-off-by: Samuel Shen <slshen@uchciago.edu> * change installation back to editable Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix p2p mem check Signed-off-by: Samuel Shen <slshen@uchciago.edu> * speed up mp Signed-off-by: Samuel Shen <slshen@uchciago.edu> * relative thresholds for mp long_doc_qa, parallel vllm startup * set both 
thresholds to 10% * skip mem leak check for p2p * revert to sequential vllm startup, --master-port doesnt help * fix parallel vllm startup by unsetting VLLM_PORT env var * add integration nightly docs, fix teardown helm name, use yq --arg * Remove health checks Signed-off-by: Samuel Shen <slshen@uchciago.edu> * parallellize integration tests Signed-off-by: Samuel Shen <slshen@uchciago.edu> * loosen MP test Signed-off-by: Samuel Shen <slshen@uchciago.edu> * reduce gpu util Signed-off-by: Samuel Shen <slshen@uchciago.edu> * fix correctness in MP Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add ARCHITECTURE.md Signed-off-by: Samuel Shen <slshen@uchciago.edu> * add local_cpu_mla.yaml test back Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
* Add smoke test for new yotta-lab queues
* Add K3s CI harness and test pipelines

  K8s-based CI infrastructure using K3s + NVIDIA GPU Operator + agent-stack-k8s:
  - k3_harness/: cluster setup, env setup, base image, teardown scripts
  - k3_tests/: comprehensive, correctness, integration, multiprocess pipelines
* forgot to add target queue
* fix README
* fix container name
* change image pull policy
* fix relative paths
* non-container integration run script
* rewrite scripts
* fix correctness
* rolling 5-day baseline, HTTPS auth, priority scheduling, fix mem leak socket
  - Rolling baselines: nightly writes date-stamped <feature>-YYYYMMDD.json, PR builds compare against worst-case (max) across 5-day window
  - upload-baselines.sh finalize step collects artifacts, prunes old files, single commit to benchmarks-main
  - Switch from SSH key to GITHUB_TOKEN (HTTPS) for repo checkout and push
  - Priority 1 for 2-GPU steps (pd, p2p, multiprocess) so they schedule first
  - Fix memory leak check: override LMCACHE_INTERNAL_API_SERVER_SOCKET_PATH_PREFIX to include port (replicates old Docker volume mount path mapping)
  - Fix correctness: replace col -b with sed for man page formatting
* fix mp ci
* change priority
* change installation back to editable
* fix p2p mem check
* speed up mp
* relative thresholds for mp long_doc_qa, parallel vllm startup
* set both thresholds to 10%
* skip mem leak check for p2p
* revert to sequential vllm startup, --master-port doesn't help
* fix parallel vllm startup by unsetting VLLM_PORT env var
* add integration nightly docs, fix teardown helm name, use yq --arg
* Remove health checks
* parallelize integration tests
* loosen MP test
* reduce gpu util
* fix correctness in MP
* add ARCHITECTURE.md
* add local_cpu_mla.yaml test back

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
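The rolling-baseline comparison described in the commit log can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the flat `{"metric": number}` JSON layout, and the lower-is-better reading of "worst-case (max)" are all assumptions, and the 10% default is only borrowed from the threshold mentioned in the log.

```python
import json
import re
from datetime import date, datetime, timedelta
from pathlib import Path

# Assumed layout: <feature>-YYYYMMDD.json, each a flat {"metric": number} map.
BASELINE_RE = re.compile(r"^(?P<feature>.+)-(?P<stamp>\d{8})\.json$")


def worst_case_baseline(baseline_dir, feature, today, window_days=5):
    """Return the per-metric max over baselines dated within the window."""
    cutoff = today - timedelta(days=window_days)
    worst = {}
    for path in Path(baseline_dir).glob(f"{feature}-*.json"):
        match = BASELINE_RE.match(path.name)
        # Skip files that belong to a longer feature name sharing this prefix.
        if not match or match.group("feature") != feature:
            continue
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
        # Keep only the last `window_days` days, today included.
        if not (cutoff < stamp <= today):
            continue
        for metric, value in json.loads(path.read_text()).items():
            worst[metric] = max(worst.get(metric, value), value)
    return worst


def regressions(pr_metrics, worst, rel_threshold=0.10):
    """List metrics whose PR value exceeds the worst case by more than the threshold."""
    return sorted(
        metric
        for metric, value in pr_metrics.items()
        if metric in worst and value > worst[metric] * (1 + rel_threshold)
    )
```

Comparing against the window maximum rather than the latest nightly means a single unusually fast nightly run does not make later PR builds fail on ordinary noise.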
Rewrites four tests (.buildkite/k3_tests/) with a new K3s-based infra (.buildkite/k3_harness/).
Benefits: