Fix: Timeout Precision in Remote Shard Queries#5475
varshith257 wants to merge into `qdrant:dev`
Conversation
---
Hi @varshith257, thank you for your contribution 👋
I would propose a slightly different approach: the changes should happen only in the remote shard operations, not across the whole code base.
I do not believe those new abstractions are useful, TBH. Instead I propose to apply two changes for the 6 ops in
First, start by checking if the duration is zero; in that case we do not need to call the remote shards and can shortcut the request with a timeout error directly, to save on network traffic. Secondly, regarding the flooring issue, I would simply replace the conversion so that the validation is not triggered on the server side. WDYT? Also, please test your changes locally before submitting a PR :)
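The zero-duration shortcut could look roughly like the following sketch. The helper name and signature are hypothetical, not Qdrant's actual API; it only illustrates the idea of failing fast instead of sending a doomed request to the remote shard:

```rust
use std::time::Duration;

// Hypothetical helper (illustrative, not Qdrant's real code): if the
// remaining budget is zero, return None so the caller can short-circuit
// with a timeout error instead of making a network call.
fn remaining_shard_timeout(total: Duration, elapsed: Duration) -> Option<Duration> {
    let remaining = total.saturating_sub(elapsed);
    if remaining.is_zero() {
        None // budget exhausted: skip the remote call entirely
    } else {
        Some(remaining)
    }
}

fn main() {
    // 700 ms of the 1 s budget remain, so the call may proceed.
    assert_eq!(
        remaining_shard_timeout(Duration::from_secs(1), Duration::from_millis(300)),
        Some(Duration::from_millis(700))
    );
    // Elapsed time already exceeds the budget: shortcut with an error.
    assert_eq!(
        remaining_shard_timeout(Duration::from_secs(1), Duration::from_secs(2)),
        None
    );
    println!("ok");
}
```

Note `saturating_sub` keeps sub-second precision and never underflows, which is what makes the explicit zero check meaningful.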
---
I believe these changes partially address the issue:
What about handling timeouts in milliseconds (ms) between nodes?
I'd prefer seconds as floating point because it's an SI base unit. Either way, it'd require breaking API changes in some places, so it's not so trivial to implement.
---
Sorry, I wasn’t clear! I’m totally fine with using seconds as a floating-point value. My point was more “we cannot floor or ceil to the nearest second; we need greater precision than an int.”
---
Thanks for reviewing! I was just a bit confused, so I raised the PR with what I understood 😀 Will push changes as requested.
---
I'm actually seeing this issue with 5-second timeouts as well.
---
The issue can happen for other timeout values where the pre-processing steps prior to reaching the remote shard decrease the value to less than 1 second. I believe the course of action I described is good enough to also tackle the precision concern, given that we cannot introduce breaking changes. We do set the original timeout on the gRPC connection for each remote shard call with:

```rust
if let Some(timeout) = timeout {
    request.set_timeout(timeout);
}
```

In that case the connection will be closed, which in turn will drop the future object being processed.
---
@varshith257 Hi! Would you mind bumping your PR to apply the other suggestion? 😄 Basically, just update the calculation mentioned here to something like the suggested form. I think that'll be good enough.
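The exact snippet the maintainer suggested is not shown in the thread; one possible reading (purely hypothetical, names illustrative) is to round the remaining timeout up to whole seconds instead of flooring it, so a sub-second remainder never collapses to 0 and trips server-side validation:

```rust
use std::time::Duration;

// Hypothetical reading of the suggested calculation: ceil instead of floor
// when converting the remaining timeout to whole seconds.
fn timeout_secs_ceil(remaining: Duration) -> u64 {
    remaining.as_secs_f64().ceil() as u64
}

fn main() {
    assert_eq!(timeout_secs_ceil(Duration::from_millis(300)), 1); // floor would give 0
    assert_eq!(timeout_secs_ceil(Duration::from_millis(2500)), 3);
    assert_eq!(timeout_secs_ceil(Duration::from_secs(2)), 2);
    println!("ok");
}
```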
Force-pushed fbdeb93 to 520c879, then 520c879 to 373659a.
---
It just got messy when rebasing, sorry about that! I have opened a new PR.
---
FYI I have implemented my suggestions in #5495 |
All Submissions:
- Did you create your branch from `dev`?

New Feature Submissions:

- Have you run the `cargo +nightly fmt --all` command prior to submission?
- Have you checked your code with the `cargo clippy --all --all-features` command?

Changes to Core Features:
This PR addresses #5463: Timeout errors when querying distributed Qdrant deployments with a 1-second timeout. The issue stems from timeout calculations during query distribution where time subtraction and conversion to integers result in 0-second timeouts for remote shards, causing validation errors.
Root Cause
- Timeout calculations use `Duration::as_secs()` or equivalent, truncating sub-second timeouts to 0 seconds after subtracting elapsed time.
- The resulting 0-second timeouts fail validation on remote shards and surface as `Deadline Exceeded` errors.

Changes

- Updated the `calculate_timeout` function to handle precise floating-point timeout values, ensuring the timeout is clamped to a minimum of 1 second.
- Works with both `std::time::Instant` and `tokio::time::Instant` for flexibility across the codebase.

Closes #5463.
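A minimal sketch of the described approach (the signature is hypothetical; the PR's actual `calculate_timeout` is not shown here): subtract the elapsed time with sub-second precision, then clamp the result so remote shards never receive a 0-second timeout.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: keep sub-second precision when computing the
// remaining budget, then clamp it to a minimum of 1 second.
fn calculate_timeout(total: Duration, started: Instant) -> Duration {
    let remaining = total.saturating_sub(started.elapsed());
    remaining.max(Duration::from_secs(1))
}

fn main() {
    let started = Instant::now();
    // A nearly exhausted budget is clamped up to 1 second instead of 0.
    assert_eq!(
        calculate_timeout(Duration::from_millis(1), started),
        Duration::from_secs(1)
    );
    // A healthy budget passes through (minus the tiny elapsed time).
    assert!(calculate_timeout(Duration::from_secs(5), started) > Duration::from_secs(4));
    println!("ok");
}
```

Because both `std::time::Instant` and `tokio::time::Instant` expose `elapsed()`, this shape works unchanged with either clock, which matches the flexibility goal stated above.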
/claim #5463