[MP] Fault Tolerance CI#2764
Conversation
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
…branch) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the robustness of the LMCache system by introducing a dedicated fault tolerance test within the CI pipeline. It also refines the vLLM integration to ensure that request completion is accurately managed, even in the event of LMCache server disruptions, thereby enhancing the overall reliability and stability of the caching mechanism. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a fault tolerance test to the CI pipeline, which simulates an LMCache server failure and verifies that vLLM requests can still complete. The changes include a new test script, modifications to the main test runner to incorporate this new test, and fixes in the Python adapter to correctly handle this fault tolerance scenario. My review focuses on improving the robustness and maintainability of the new test script and ensuring consistent error handling. I've identified a potential bug in how JSON payloads are constructed in the shell script and suggested improvements for simplification and consistency. The Python changes for fault tolerance appear correct and well-reasoned.
| if ! curl -sf --max-time 120 \ | ||
| "http://localhost:${VLLM_PORT}/v1/completions" \ | ||
| -H "Content-Type: application/json" \ | ||
| -d "{ | ||
| \"model\": \"$MODEL\", | ||
| \"prompt\": \"Question: What is $i + $i?\\nAnswer:\", | ||
| \"max_tokens\": 32, | ||
| \"temperature\": 0 | ||
| }" > /dev/null 2>&1; then | ||
| echo "Request $i failed - vLLM became unresponsive" | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Constructing JSON with string concatenation is fragile and can lead to invalid JSON if variables like $MODEL contain special characters (e.g., quotes). Using a heredoc to create the JSON payload is a more robust and readable approach.
| if ! curl -sf --max-time 120 \ | |
| "http://localhost:${VLLM_PORT}/v1/completions" \ | |
| -H "Content-Type: application/json" \ | |
| -d "{ | |
| \"model\": \"$MODEL\", | |
| \"prompt\": \"Question: What is $i + $i?\\nAnswer:\", | |
| \"max_tokens\": 32, | |
| \"temperature\": 0 | |
| }" > /dev/null 2>&1; then | |
| echo "Request $i failed - vLLM became unresponsive" | |
| exit 1 | |
| fi | |
| json_payload=$(cat <<EOF | |
| { | |
| "model": "$MODEL", | |
| "prompt": "Question: What is $i + $i?\\nAnswer:", | |
| "max_tokens": 32, | |
| "temperature": 0 | |
| } | |
| EOF | |
| ) | |
| if ! curl -sf --max-time 120 \ | |
| "http://localhost:${VLLM_PORT}/v1/completions" \ | |
| -H "Content-Type: application/json" \ | |
| -d "$json_payload" > /dev/null 2>&1; then | |
| echo "Request $i failed - vLLM became unresponsive" | |
| exit 1 | |
| fi |
| docker kill "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true | ||
| docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true |
There was a problem hiding this comment.
The docker rm -f command forcefully removes a running container, which includes stopping it. Therefore, the preceding docker kill command is redundant. You can simplify this to a single docker rm -f command.
| docker kill "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true | |
| docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true | |
| docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true |
| if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then | ||
| echo "❌ fault tolerance test failed" | ||
| TEST_RESULT=1 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
The script exits immediately with exit 1 upon failure of the fault tolerance test. This is inconsistent with how other test failures are handled in this script (e.g., the run-long-doc-qa.sh test), which only set TEST_RESULT=1 and allow the script to continue. While this is the last test, exiting immediately could prevent any future cleanup steps from running. For consistency and to ensure any final cleanup logic is executed, consider removing exit 1.
| if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then | |
| echo "❌ fault tolerance test failed" | |
| TEST_RESULT=1 | |
| exit 1 | |
| fi | |
| if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then | |
| echo "❌ fault tolerance test failed" | |
| TEST_RESULT=1 | |
| fi |
ApostaC
left a comment
There was a problem hiding this comment.
Need to update k3s as well
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
|
@sammshen Please also take a look at this PR! |
* health check Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * dev Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: yuweia <ayw.sirius19@gmail.com> * add comment Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Remove fault tolerance CI step (will be added in separate fault-t-ci branch) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Rename HEALTH_CHECK to PING, add timeout params, extract helper Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add k3s test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
* health check Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * dev Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: yuweia <ayw.sirius19@gmail.com> * add comment Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Remove fault tolerance CI step (will be added in separate fault-t-ci branch) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Rename HEALTH_CHECK to PING, add timeout params, extract helper Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add k3s test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
* health check Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * dev Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: yuweia <ayw.sirius19@gmail.com> * add comment Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Remove fault tolerance CI step (will be added in separate fault-t-ci branch) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Rename HEALTH_CHECK to PING, add timeout params, extract helper Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add k3s test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
* health check Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * dev Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: yuweia <ayw.sirius19@gmail.com> * add comment Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Remove fault tolerance CI step (will be added in separate fault-t-ci branch) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * Rename HEALTH_CHECK to PING, add timeout params, extract helper Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ut Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix timeout Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * add k3s test Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
What this PR does / why we need it:
Special notes for your reviewers:
If applicable: