[MP] Fault Tolerance CI #2764

Merged
sammshen merged 29 commits into LMCache:dev from Oasis-Git:fault-t
Mar 24, 2026

@Oasis-Git
Member

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: yuweia <ayw.sirius19@gmail.com>
…branch)

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of the LMCache system by introducing a dedicated fault tolerance test within the CI pipeline. It also refines the vLLM integration to ensure that request completion is accurately managed, even in the event of LMCache server disruptions, thereby enhancing the overall reliability and stability of the caching mechanism.

Highlights

  • New Fault Tolerance CI Test: A new Buildkite CI script has been introduced to test the fault tolerance of LMCache. This test simulates LMCache server failure during a vLLM benchmark and verifies that vLLM requests still complete successfully.
  • vLLM Adapter Robustness Improvements: The vLLM multi-process adapter has been enhanced to prevent re-reporting of finished request IDs and to correctly handle scenarios where LMCache becomes unavailable mid-request, ensuring accurate tracking of request completion.
  • CI Script Integration: The main multiprocessing test script now incorporates the new fault tolerance test as a critical step, replacing a previous LMCache status check, and removes an unused environment variable.

Changelog
  • .buildkite/scripts/multiprocessing-test/run-fault-tolerance.sh
    • Added a new shell script to implement a fault tolerance test for LMCache, which involves killing the LMCache container mid-benchmark.
  • .buildkite/scripts/multiprocessing-test/run-mp-test.sh
    • Removed the LMCACHE_HTTP_PORT environment variable export as it is no longer used.
    • Replaced the LMCache server status check with the execution of the new fault tolerance test script.
  • lmcache/integration/vllm/vllm_multi_process_adapter.py
    • Added a new _returned_finished set to track request IDs already reported as finished, preventing duplicate reporting.
    • Modified _process_finished_stores to utilize the _returned_finished set to avoid re-processing already reported finished requests.
    • Updated get_finished to ensure requests with pending retrieves are not also reported as finished sending if they completed without LMCache after the server died.

Oasis-Git added the full label (Run comprehensive tests on this PR) on Mar 13, 2026

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a fault tolerance test to the CI pipeline, which simulates an LMCache server failure and verifies that vLLM requests can still complete. The changes include a new test script, modifications to the main test runner to incorporate this new test, and fixes in the Python adapter to correctly handle this fault tolerance scenario. My review focuses on improving the robustness and maintainability of the new test script and ensuring consistent error handling. I've identified a potential bug in how JSON payloads are constructed in the shell script and suggested improvements for simplification and consistency. The Python changes for fault tolerance appear correct and well-reasoned.

Comment on lines +123 to +134
if ! curl -sf --max-time 120 \
    "http://localhost:${VLLM_PORT}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"$MODEL\",
        \"prompt\": \"Question: What is $i + $i?\\nAnswer:\",
        \"max_tokens\": 32,
        \"temperature\": 0
    }" > /dev/null 2>&1; then
    echo "Request $i failed - vLLM became unresponsive"
    exit 1
fi

Priority: high

Constructing JSON with string concatenation is fragile and can lead to invalid JSON if variables like $MODEL contain special characters (e.g., quotes). Using a heredoc to create the JSON payload is a more robust and readable approach.

Suggested change
-if ! curl -sf --max-time 120 \
-    "http://localhost:${VLLM_PORT}/v1/completions" \
-    -H "Content-Type: application/json" \
-    -d "{
-        \"model\": \"$MODEL\",
-        \"prompt\": \"Question: What is $i + $i?\\nAnswer:\",
-        \"max_tokens\": 32,
-        \"temperature\": 0
-    }" > /dev/null 2>&1; then
-    echo "Request $i failed - vLLM became unresponsive"
-    exit 1
-fi
+json_payload=$(cat <<EOF
+{
+"model": "$MODEL",
+"prompt": "Question: What is $i + $i?\\nAnswer:",
+"max_tokens": 32,
+"temperature": 0
+}
+EOF
+)
+if ! curl -sf --max-time 120 \
+    "http://localhost:${VLLM_PORT}/v1/completions" \
+    -H "Content-Type: application/json" \
+    -d "$json_payload" > /dev/null 2>&1; then
+    echo "Request $i failed - vLLM became unresponsive"
+    exit 1
+fi
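
One caveat on this suggestion: an unquoted heredoc still expands $MODEL verbatim, so a double quote or backslash in the value would break the JSON just as it does in the concatenated string. Below is a minimal sketch of one way to close that gap in plain bash. json_escape and build_payload are hypothetical helpers, not part of this PR, and the escaping covers only backslash, double quote, newline, and tab; a dedicated JSON tool would be needed to handle every control character.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): escape a string so it can be
# placed inside a JSON string literal. Handles only \, ", newline, tab.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}        # backslash    -> \\  (must run first)
    s=${s//\"/\\\"}        # double quote -> \"
    s=${s//$'\n'/\\n}      # newline      -> \n
    s=${s//$'\t'/\\t}      # tab          -> \t
    printf '%s' "$s"
}

# Hypothetical helper: build the completions payload used by the test,
# escaping the model name before it is interpolated.
build_payload() {
    local model=$1 i=$2
    printf '{"model": "%s", "prompt": "Question: What is %s + %s?\\nAnswer:", "max_tokens": 32, "temperature": 0}' \
        "$(json_escape "$model")" "$i" "$i"
}
```

With these in place, the curl call could pass -d "$(build_payload "$MODEL" "$i")" instead of an inline string or heredoc.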

Comment on lines +105 to +106
docker kill "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true
docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true

Priority: medium

The docker rm -f command forcefully removes a running container, which includes stopping it. Therefore, the preceding docker kill command is redundant. You can simplify this to a single docker rm -f command.

Suggested change
-docker kill "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true
-docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true
+docker rm -f "$LMCACHE_CONTAINER_NAME" 2>/dev/null || true

Comment on lines +116 to +120
if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then
    echo "❌ fault tolerance test failed"
    TEST_RESULT=1
    exit 1
fi

Priority: medium

The script exits immediately with exit 1 upon failure of the fault tolerance test. This is inconsistent with how other test failures are handled in this script (e.g., the run-long-doc-qa.sh test), which only set TEST_RESULT=1 and allow the script to continue. While this is the last test, exiting immediately could prevent any future cleanup steps from running. For consistency and to ensure any final cleanup logic is executed, consider removing exit 1.

Suggested change
-if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then
-    echo "❌ fault tolerance test failed"
-    TEST_RESULT=1
-    exit 1
-fi
+if ! "$SCRIPT_DIR/run-fault-tolerance.sh"; then
+    echo "❌ fault tolerance test failed"
+    TEST_RESULT=1
+fi
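
The pattern described in this comment, recording a failure and letting the script run to its end, can be sketched as below. run_step, STEPS_RUN, and the step names are hypothetical and not taken from run-mp-test.sh; the stand-in commands true and false take the place of the real test scripts.

```shell
#!/usr/bin/env bash
# Sketch only: names here are illustrative, not from run-mp-test.sh.
TEST_RESULT=0
STEPS_RUN=0

# Run one test command; on failure, record it and keep going.
run_step() {
    local name=$1
    shift
    STEPS_RUN=$((STEPS_RUN + 1))
    if ! "$@"; then
        echo "❌ $name failed"
        TEST_RESULT=1
    fi
}

run_step "first test" true
run_step "fault tolerance test" false   # simulated failure
run_step "later step" true              # still runs after the failure

echo "steps run: $STEPS_RUN, TEST_RESULT=$TEST_RESULT"
# A real script would finish with: exit "$TEST_RESULT"
```

If cleanup has to run even when a step calls exit 1 directly, registering it with trap cleanup EXIT is the usual bash mechanism, since an EXIT trap fires whenever the shell exits.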

Oasis-Git changed the title from "Fault Tolerance CI" to "[MP] Fault Tolerance CI" on Mar 13, 2026

ApostaC (Contributor) left a comment

Need to update k3s as well

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
ApostaC (Contributor) commented on Mar 14, 2026

@sammshen Please also take a look at this PR!

sammshen (Contributor) left a comment

LGTM!

sammshen enabled auto-merge (squash) on March 16, 2026 07:05

ApostaC (Contributor) left a comment

LGTM!

ApostaC and others added 3 commits March 23, 2026 15:31
sammshen merged commit cbc9dd8 into LMCache:dev on Mar 24, 2026
24 of 25 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 26, 2026
* health check

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* lint

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix ut

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* dev

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* timeout

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix ut

Signed-off-by: yuweia <ayw.sirius19@gmail.com>

* add comment

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* add test

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* Remove fault tolerance CI step (will be added in separate fault-t-ci branch)

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* Rename HEALTH_CHECK to PING, add timeout params, extract helper

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix ut

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix timeout

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* ci

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* add k3s test

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

---------

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: yuweia <ayw.sirius19@gmail.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026