[pd-router] Add Configurable Retry Logic for reduce backend pressure#8744
[pd-router] Add Configurable Retry Logic for reduce backend pressure#8744
Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @slin1237, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the robustness of the PD router by introducing a configurable retry mechanism to gracefully handle transient backend failures under heavy load. Alongside this critical reliability improvement, I've also integrated a powerful new benchmarking tool to facilitate ongoing performance validation and ensure the router maintains its high-performance characteristics. The changes also include general code refactoring to improve maintainability and error handling consistency.
Highlights
- Configurable Retry Logic: I've implemented configurable retry logic with exponential backoff and jitter for the PD router. This solution specifically targets transient backend
KVTransferErrorfailures under high load, ensuring robust error handling without compromising Rust's performance advantages. It retries server errors (5xx, 502, 504) and is configurable viaRetryConfig. - New Benchmark Tool: A new Rust-based benchmark tool (
router-benchmark) has been added. This tool, developed during debugging, is capable of simulating various load scenarios (basic, realistic, progressive, sustained) and generating detailed performance reports, proving invaluable for testing and validation. - Code Refactoring and Error Handling: The
pd_router.rsfile has been refactored to improve code organization, extract common helper methods for error handling (e.g., server selection, bootstrap injection, serialization), and enhance consistency in error responses. This makes the codebase cleaner and more maintainable. - Configuration Updates: The
RouterConfignow includes aretryfield to hold the newRetryConfigparameters. Additionally, the defaultrequest_timeout_secsinRouterConfighas been increased from 600 seconds to 3600 seconds (1 hour) to better align with the Python mini load balancer's behavior.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
The pull request introduces configurable retry logic to the PD router, refactors pd_router.rs for improved clarity, and adds a new benchmark tool. I've suggested improvements to the retry logic, benchmark tool accuracy, and consistency across router implementations.
0ab01ac to
5be904b
Compare
5d56337 to
0ea246b
Compare
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Summary
This PR adds configurable retry logic to the PD router to handle backend
KVTransferErrorfailures under load. The solution ensures error handling without sacrificing the performance advantages of the Rust implementation. Additionally, a performance benchmark tool was developed during debugging and is included as it provides valuable testing capabilities.Problem Investigation
Under load (1000+ concurrent requests), both the Rust PD router and Python mini load balancer experienced
KVTransferErrorfailures from backend servers:Initial Hypothesis: Connection Pool Mismatch
We initially suspected a connection pooling issue:
Configuration Comparison:
pool_idle_timeout: 50 secondspool_max_idle_per_host: 100 connectionstimeout_keep_alive: 5 secondsFailed Attempts to Match Python Behavior
We tried multiple approaches to make Rust behave like Python:
None of these approaches worked because Rust's inherent performance characteristics still made it faster than Python, even with artificial constraints.
Root Cause Discovery
The real issue wasn't connection pooling. The fundamental problem: both implementations hit capacity limits under high concurrent load
Solution: Configurable Retry Logic
Instead of trying to make Rust slower or ignore errors like mini-lb, we implemented a proper solution: configurable retry logic with exponential backoff.
Implementation Details
Key features:
Code Improvements
While implementing retry logic, we also refactored
pd_router.rs:Benchmark Tool
During debugging, we developed a Rust benchmark tool that proved invaluable for testing. We're including it in this PR as it provides ongoing value:
Usage:
Results
With retry logic:
Files Changed
src/routers/pd_router.rs: Retry logic and refactoringsrc/routers/router.rs: Retry configuration supportsrc/config/types.rs: RetryConfig structsrc/routers/factory.rs: Pass retry configurationbenches/router_benchmark.rs: New benchmark toolbenches/README.md: Benchmark documentationKey Insights
Breaking Changes
None. Default retry configuration maintains existing behavior.
Accuracy Test
Benchmark & Profiling
Checklist