Add model cache for AMD tests#10966
Conversation
Summary of ChangesHello @Kangyan-Zhou, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust model caching system for AMD CI tests to address the recurring issue of hitting Hugging Face API rate limits. By establishing a persistent cache directory on the host and integrating it with the CI Docker containers, models downloaded during tests will be reused, significantly reducing redundant network requests. The changes also include a comprehensive management script to facilitate cache initialization, status monitoring, cleanup of stale files, and preloading of frequently used models, thereby enhancing the efficiency and reliability of the CI pipeline. Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a caching mechanism for Hugging Face models in the AMD CI environment to prevent hitting API rate limits. This is achieved by creating a persistent cache directory on the CI host, mounting it into the test container, and adding a new script amd_manage_cache.sh to manage this cache (initialize, preload, clean, etc.). The changes are logical and address the motivation. My review includes suggestions to improve the new cache management script for better maintainability and readability.
| if [[ "$EUID" -eq 0 ]]; then | ||
| # Running as root | ||
| mkdir -p "${CACHE_DIR}/huggingface" | ||
| chmod -R 777 "${CACHE_DIR}" | ||
| else | ||
| # Running as non-root, use sudo | ||
| sudo mkdir -p "${CACHE_DIR}/huggingface" | ||
| sudo chmod -R 777 "${CACHE_DIR}" | ||
| fi |
There was a problem hiding this comment.
This if/else block to handle root and non-root permissions is also used in the clean-old command. To reduce code duplication, you can define a local variable for the sudo command. For an even cleaner solution, you could define this variable at the top of the script and reuse it in both init and clean-old commands.
| if [[ "$EUID" -eq 0 ]]; then | |
| # Running as root | |
| mkdir -p "${CACHE_DIR}/huggingface" | |
| chmod -R 777 "${CACHE_DIR}" | |
| else | |
| # Running as non-root, use sudo | |
| sudo mkdir -p "${CACHE_DIR}/huggingface" | |
| sudo chmod -R 777 "${CACHE_DIR}" | |
| fi | |
| local SUDO="" | |
| if [[ "$EUID" -ne 0 ]]; then | |
| SUDO="sudo" | |
| fi | |
| ${SUDO} mkdir -p "${CACHE_DIR}/huggingface" | |
| ${SUDO} chmod -R 777 "${CACHE_DIR}" |
| if [[ "$EUID" -eq 0 ]]; then | ||
| find "${CACHE_DIR}" -type f -atime +30 -delete 2>/dev/null || true | ||
| else | ||
| sudo find "${CACHE_DIR}" -type f -atime +30 -delete 2>/dev/null || true | ||
| fi |
There was a problem hiding this comment.
Similar to the init command, this if/else block for root/non-root execution can be simplified by using a variable for the sudo command. This improves readability and maintainability by avoiding repeated logic.
| if [[ "$EUID" -eq 0 ]]; then | |
| find "${CACHE_DIR}" -type f -atime +30 -delete 2>/dev/null || true | |
| else | |
| sudo find "${CACHE_DIR}" -type f -atime +30 -delete 2>/dev/null || true | |
| fi | |
| local SUDO="" | |
| if [[ "$EUID" -ne 0 ]]; then | |
| SUDO="sudo" | |
| fi | |
| ${SUDO} find "${CACHE_DIR}" -type f -atime +30 -delete 2>/dev/null || true |
| python:3.10-slim bash -c " | ||
| pip install huggingface_hub transformers torch --quiet | ||
| python3 -c \" | ||
| from huggingface_hub import snapshot_download | ||
| import os | ||
|
|
||
| models = [ | ||
| 'meta-llama/Llama-3.1-8B-Instruct', | ||
| 'meta-llama/Llama-3.2-1B-Instruct', | ||
| 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', | ||
| 'mistralai/Mixtral-8x7B-Instruct-v0.1', | ||
| ] | ||
|
|
||
| for model in models: | ||
| print(f'Downloading {model}...') | ||
| try: | ||
| snapshot_download(model, cache_dir=os.environ['HF_HOME']) | ||
| print(f'✓ {model} cached successfully') | ||
| except Exception as e: | ||
| print(f'✗ Failed to cache {model}: {e}') | ||
| continue | ||
| \" | ||
| " |
There was a problem hiding this comment.
The embedded Python script inside the bash -c command has complex quoting (\"). This can be hard to read and maintain. Using a heredoc to pass the Python script to python3's standard input would make the code much cleaner and avoid quoting issues.
python:3.10-slim bash -c "
pip install huggingface_hub transformers torch --quiet
python3 - <<'EOF'
from huggingface_hub import snapshot_download
import os
models = [
'meta-llama/Llama-3.1-8B-Instruct',
'meta-llama/Llama-3.2-1B-Instruct',
'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct',
'mistralai/Mixtral-8x7B-Instruct-v0.1',
]
for model in models:
print(f'Downloading {model}...')
try:
snapshot_download(model, cache_dir=os.environ['HF_HOME'])
print(f'✓ {model} cached successfully')
except Exception as e:
print(f'✗ Failed to cache {model}: {e}')
continue
EOF
"|
This PR looks so cool! As it seems not AMD only issue, could we also extend to all CI? |
Yes the plan is to extend to all CI |
Motivation
AMD CI tests seem to hit huggingface API limit very often.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist