
Fix: Timeout Precision in Remote Shard Queries#5475

Closed
varshith257 wants to merge 0 commits into qdrant:dev from varshith257:fix/distributed-query-timeout

Conversation

@varshith257
Contributor

@varshith257 varshith257 commented Nov 19, 2024

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

This PR addresses #5463: Timeout errors when querying distributed Qdrant deployments with a 1-second timeout. The issue stems from timeout calculations during query distribution where time subtraction and conversion to integers result in 0-second timeouts for remote shards, causing validation errors.

Root Cause

  • Timeout calculations used Duration::as_secs() or equivalent, truncating sub-second timeouts to 0 seconds after subtracting elapsed time.
  • Remote shard queries received timeouts below the minimum required threshold (1 second), leading to Deadline Exceeded errors.
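The truncation described above can be reproduced with a few lines of std Rust (a minimal standalone sketch, not the actual qdrant code path):

```rust
use std::time::Duration;

fn main() {
    // A 1-second user timeout, minus elapsed pre-processing time,
    // leaves a sub-second remainder for the remote shard call.
    let user_timeout = Duration::from_secs(1);
    let elapsed = Duration::from_millis(300);
    let remaining = user_timeout.saturating_sub(elapsed); // 700 ms

    // as_secs() truncates toward zero, so anything under 1 s becomes 0,
    // which then fails the server-side timeout validation.
    assert_eq!(remaining.as_secs(), 0);
}
```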

Changes

  • Introduced the calculate_timeout function to compute precise floating-point timeout values and clamp them to a minimum of 1 second.
  • The function supports both std::time::Instant and tokio::time::Instant for flexibility across the codebase.
  • Enforced the 1-second minimum timeout for distributed queries via clamping.
  • Added validation checks to avoid passing invalid timeouts to remote shards.

Closes #5463.
/claim #5463

@agourlay
Member

agourlay commented Nov 19, 2024

Hi @varshith257, thank you for your contribution 👋

Introduced the calculate_timeout function to handle precise floating-point timeout values, ensuring timeout clamping to a minimum of 1 second.

I would propose a slightly different approach: the changes should happen only in the remote shard operations, not across the code base.

The function now supports both std::time::Instant and tokio::time::Instant for flexibility across codebases.

I do not believe those new abstractions are useful TBH.

Instead I propose to apply two changes for the 6 ops in remote_shards:

  • scroll_by
  • core_search
  • count
  • retrieve
  • query_batch
  • facet

First, start by checking if the duration is zero; in that case we do not need to call the remote shards and can short-circuit the request with a timeout error directly, saving network traffic.
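That shortcut could look roughly like the sketch below (the function and error names are illustrative, not the actual remote_shards signatures):

```rust
use std::time::Duration;

// Illustrative error type standing in for qdrant's real operation error.
#[derive(Debug, PartialEq)]
enum OpError {
    Timeout,
}

// Hypothetical guard: if the remaining budget is already zero,
// fail fast locally instead of issuing a doomed network call.
fn check_remaining_budget(timeout: Option<Duration>) -> Result<(), OpError> {
    if let Some(t) = timeout {
        if t.is_zero() {
            return Err(OpError::Timeout);
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(check_remaining_budget(Some(Duration::ZERO)), Err(OpError::Timeout));
    assert!(check_remaining_budget(Some(Duration::from_millis(500))).is_ok());
    // No timeout configured: nothing to short-circuit on.
    assert!(check_remaining_budget(None).is_ok());
}
```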

Secondly, to handle the flooring issue of timeout_duration.as_secs, which turns anything under 1 second into zero:

I would simply replace
timeout.map(|t| t.as_secs())
by
timeout.map(|t| t.as_secs().max(1)

In order to ensure the validation is not triggered on the server side.
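In context, the proposed one-liner (with its closing parenthesis restored) behaves like this:

```rust
use std::time::Duration;

fn main() {
    let sub_second: Option<Duration> = Some(Duration::from_millis(700));
    let multi_second: Option<Duration> = Some(Duration::from_secs(5));

    // as_secs() floors 700 ms to 0; max(1) lifts it back to the
    // minimum that the server-side validation accepts.
    assert_eq!(sub_second.map(|t| t.as_secs().max(1)), Some(1));

    // Timeouts of at least 1 s pass through unchanged.
    assert_eq!(multi_second.map(|t| t.as_secs().max(1)), Some(5));
}
```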

WDYT?

Also please test your changes locally before submitting a PR :)

@timvisee
Member

@agourlay I agree. I'd rather keep using the std Duration API everywhere possible. It already has sufficient arithmetic functions.

timeout.map(|t| t.as_secs().max(1)

You suggest lower-capping at a minimum of 1.

How about using t.as_secs_f32().ceil() as usize instead?
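The difference between the two candidates shows up for values just above a whole second (a standalone std-only sketch):

```rust
use std::time::Duration;

fn main() {
    let t = Duration::from_millis(2500); // 2.5 s remaining

    // Lower-capping keeps the floor: 2.5 s is truncated to 2 s.
    assert_eq!(t.as_secs().max(1), 2);

    // Ceiling rounds up instead, so the remote shard is granted
    // at least the full remaining budget: 2.5 s becomes 3 s.
    assert_eq!(t.as_secs_f32().ceil() as u64, 3);

    // Both agree on the sub-second case that triggered the bug.
    let sub = Duration::from_millis(700);
    assert_eq!(sub.as_secs().max(1), 1);
    assert_eq!(sub.as_secs_f32().ceil() as u64, 1);
}
```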

@mautini

mautini commented Nov 19, 2024

I believe these changes partially address the issue:

  • When the user specifies a timeout below 1 second, it is always rounded up to 1 second, introducing errors in the estimation.
  • If the user provides a timeout greater than 1 second, the flooring issue remains. For example, if the user specifies 2 seconds, a “deadline exceeded” error could still be returned after only 1 second.

What about handling timeouts in milliseconds (ms) between nodes?
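The precision loss being described can be made concrete (a sketch of the arithmetic only; the inter-node API itself is untouched here):

```rust
use std::time::Duration;

fn main() {
    let remaining = Duration::from_millis(1700); // 1.7 s of budget left

    // Whole seconds lose the fractional 700 ms either way:
    assert_eq!(remaining.as_secs(), 1); // floor under-grants by 700 ms
    assert_eq!(remaining.as_secs_f32().ceil() as u64, 2); // ceil over-grants by 300 ms

    // Milliseconds carry the value exactly.
    assert_eq!(remaining.as_millis(), 1700);
}
```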

@timvisee
Member

timvisee commented Nov 19, 2024

What about handling timeouts in milliseconds (ms) between nodes?

I'd prefer seconds as floating point because it's an SI base unit.

Either way it'd require breaking API changes in some places. So it's not so trivial to implement.

@mautini

mautini commented Nov 19, 2024

Sorry, I wasn’t clear! I’m totally fine with using seconds as a floating-point value. My point was more “we cannot floor or ceil to the nearest second; we need greater precision than an int.”

@varshith257
Contributor Author

Thanks for reviewing! I was just a bit confused, so I raised the PR with what I understood 😀

Will push the changes as requested

@zeyaddeeb

I'm actually seeing this issue with 5 second timeouts as well

@agourlay
Member

The issue can happen for other timeout values where the pre-processing steps prior to reaching the remote shard decrease the value to less than 1 second.

I believe the course of action I described is good enough to also tackle the precision concern given that we cannot introduce breaking changes.

We do set the original timeout on the gRPC connection for each remote shard call with

if let Some(timeout) = timeout {
  request.set_timeout(timeout);
}

In that case the connection will be closed which in turn will drop the future object being processed.

@timvisee
Member

timvisee commented Nov 20, 2024

@varshith257 Hi! Would you mind bumping your PR to apply the other suggestion? 😄

Basically just update the calculation mentioned here. And update it to something like t.as_secs_f32().ceil() as usize.

I think that'll be good enough.

@varshith257 varshith257 changed the title Fix Timeout Handling for 1-Second Queries in Distributed Deployments Fix: Timeout Precision in Remote Shard Queries Nov 20, 2024
@timvisee timvisee requested a review from agourlay November 20, 2024 15:04
Member

@timvisee timvisee left a comment


Please do one more round of reformatting, and then it should be good 😄

cargo +nightly fmt --all -- --check

Edit: whoops it appears I missed a bracket in my suggestion above.

@varshith257 varshith257 force-pushed the fix/distributed-query-timeout branch from fbdeb93 to 520c879 Compare November 20, 2024 17:07
@varshith257 varshith257 force-pushed the fix/distributed-query-timeout branch from 520c879 to 373659a Compare November 20, 2024 17:10
@varshith257
Contributor Author

varshith257 commented Nov 20, 2024

It just got messy when rebasing, sorry for that! I have opened a new PR

@agourlay
Member

FYI I have implemented my suggestions in #5495
