
Fix: Timeout Precision in Remote Shard Queries#5475

Closed
varshith257 wants to merge 0 commits into qdrant:dev from varshith257:fix/distributed-query-timeout

Conversation

@varshith257
Contributor

@varshith257 varshith257 commented Nov 19, 2024

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

This PR addresses #5463: Timeout errors when querying distributed Qdrant deployments with a 1-second timeout. The issue stems from timeout calculations during query distribution where time subtraction and conversion to integers result in 0-second timeouts for remote shards, causing validation errors.

Root Cause

  • Timeout calculations used Duration::as_secs() or equivalent, truncating sub-second timeouts to 0 seconds after subtracting elapsed time.
  • Remote shard queries received timeouts below the minimum required threshold (1 second), leading to Deadline Exceeded errors.
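The truncation described above can be reproduced with a few lines of std Rust (a minimal standalone sketch, not the actual qdrant code path):

```rust
use std::time::Duration;

fn main() {
    // A 1-second user timeout, minus elapsed pre-processing time,
    // leaves a sub-second remainder for the remote shard call.
    let user_timeout = Duration::from_secs(1);
    let elapsed = Duration::from_millis(300);
    let remaining = user_timeout.saturating_sub(elapsed); // 700 ms

    // as_secs() truncates toward zero, so anything under 1 s becomes 0,
    // which then fails the server-side timeout validation.
    assert_eq!(remaining.as_secs(), 0);
}
```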

Changes

  • Introduced the calculate_timeout function to compute precise floating-point timeout values and clamp them to a minimum of 1 second.
  • The function supports both std::time::Instant and tokio::time::Instant for flexibility across the codebase.
  • Enforced the 1-second minimum timeout for distributed queries via clamping.
  • Added validation checks to avoid passing invalid timeouts to remote shards.

Closes #5463.
/claim #5463

@agourlay
Member

agourlay commented Nov 19, 2024

Hi @varshith257, thank you for your contribution 👋

Introduced the calculate_timeout function to handle precise floating-point timeout values, ensuring timeout clamping to a minimum of 1 second.

I would propose a slightly different approach: the changes should happen only in the remote shard operations, not across the code base.

The function now supports both std::time::Instant and tokio::time::Instant for flexibility across codebases.

I do not believe those new abstractions are useful TBH.

Instead I propose to apply two changes for the 6 ops in remote_shards:

  • scroll_by
  • core_search
  • count
  • retrieve
  • query_batch
  • facet

First, start by checking if the duration is zero; in that case we do not need to call the remote shards and can short-circuit the request with a timeout error directly, saving network traffic.
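That shortcut could look roughly like the sketch below (the function and error names are illustrative, not the actual remote_shards signatures):

```rust
use std::time::Duration;

// Illustrative error type standing in for qdrant's real operation error.
#[derive(Debug, PartialEq)]
enum OpError {
    Timeout,
}

// Hypothetical guard: if the remaining budget is already zero,
// fail fast locally instead of issuing a doomed network call.
fn check_remaining_budget(timeout: Option<Duration>) -> Result<(), OpError> {
    if let Some(t) = timeout {
        if t.is_zero() {
            return Err(OpError::Timeout);
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(check_remaining_budget(Some(Duration::ZERO)), Err(OpError::Timeout));
    assert!(check_remaining_budget(Some(Duration::from_millis(500))).is_ok());
    // No timeout configured: nothing to short-circuit on.
    assert!(check_remaining_budget(None).is_ok());
}
```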

Secondly, to handle the flooring issue of timeout_duration.as_secs, which turns anything under 1 second into zero:

I would simply replace
timeout.map(|t| t.as_secs())
by
timeout.map(|t| t.as_secs().max(1)

In order to ensure the validation is not triggered on the server side.
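In context, the proposed one-liner (with its closing parenthesis restored) behaves like this:

```rust
use std::time::Duration;

fn main() {
    let sub_second: Option<Duration> = Some(Duration::from_millis(700));
    let multi_second: Option<Duration> = Some(Duration::from_secs(5));

    // as_secs() floors 700 ms to 0; max(1) lifts it back to the
    // minimum that the server-side validation accepts.
    assert_eq!(sub_second.map(|t| t.as_secs().max(1)), Some(1));

    // Timeouts of at least 1 s pass through unchanged.
    assert_eq!(multi_second.map(|t| t.as_secs().max(1)), Some(5));
}
```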

WDYT?

Also please test your changes locally before submitting a PR :)

@timvisee
Member

@agourlay I agree. I'd rather keep using the std Duration API everywhere possible. It already has sufficient arithmetic functions.

timeout.map(|t| t.as_secs().max(1)

You suggest lower-capping at a minimum of 1.

How about using t.as_secs_f32().ceil() as usize instead?
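The difference between the two candidates shows up for values just above a whole second (a standalone std-only sketch):

```rust
use std::time::Duration;

fn main() {
    let t = Duration::from_millis(2500); // 2.5 s remaining

    // Lower-capping keeps the floor: 2.5 s is truncated to 2 s.
    assert_eq!(t.as_secs().max(1), 2);

    // Ceiling rounds up instead, so the remote shard is granted
    // at least the full remaining budget: 2.5 s becomes 3 s.
    assert_eq!(t.as_secs_f32().ceil() as u64, 3);

    // Both agree on the sub-second case that triggered the bug.
    let sub = Duration::from_millis(700);
    assert_eq!(sub.as_secs().max(1), 1);
    assert_eq!(sub.as_secs_f32().ceil() as u64, 1);
}
```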

@mautini

mautini commented Nov 19, 2024

I believe these changes partially address the issue:

  • When the user specifies a timeout below 1 second, it is always rounded up to 1 second, introducing errors in the estimation.
  • If the user provides a timeout greater than 1 second, the flooring issue remains. For example, if the user specifies 2 seconds, a “deadline exceeded” error could still be returned after only 1 second.

What about handling timeouts in milliseconds (ms) between nodes?
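The precision loss being described can be made concrete (a sketch of the arithmetic only; the inter-node API itself is untouched here):

```rust
use std::time::Duration;

fn main() {
    let remaining = Duration::from_millis(1700); // 1.7 s of budget left

    // Whole seconds lose the fractional 700 ms either way:
    assert_eq!(remaining.as_secs(), 1); // floor under-grants by 700 ms
    assert_eq!(remaining.as_secs_f32().ceil() as u64, 2); // ceil over-grants by 300 ms

    // Milliseconds carry the value exactly.
    assert_eq!(remaining.as_millis(), 1700);
}
```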

@timvisee
Member

timvisee commented Nov 19, 2024

What about handling timeouts in milliseconds (ms) between nodes?

I'd prefer seconds as floating point because it's an SI base unit.

Either way it'd require breaking API changes in some places. So it's not so trivial to implement.

@mautini

mautini commented Nov 19, 2024

Sorry, I wasn’t clear! I’m totally fine with using seconds as a floating-point value. My point was more “we cannot floor or ceil to the nearest second; we need greater precision than an int.”

@varshith257
Contributor Author

Thanks for reviewing! I was just a bit confused, so I raised the PR with what I understood 😀

Will push the changes as requested

@zeyaddeeb

I'm actually seeing this issue with 5 second timeouts as well

@agourlay
Member

The issue can happen for other timeout values where the pre-processing steps prior to reaching the remote shard decrease the value to less than 1 second.

I believe the course of action I described is good enough to also tackle the precision concern given that we cannot introduce breaking changes.

We do set the original timeout on the gRPC connection for each remote shard call with

if let Some(timeout) = timeout {
  request.set_timeout(timeout);
}

In that case the connection will be closed which in turn will drop the future object being processed.

@timvisee
Member

timvisee commented Nov 20, 2024

@varshith257 Hi! Would you mind bumping your PR to apply the other suggestion? 😄

Basically just update the calculation mentioned here. And update it to something like t.as_secs_f32().ceil() as usize.

I think that'll be good enough.

@varshith257 varshith257 changed the title Fix Timeout Handling for 1-Second Queries in Distributed Deployments Fix: Timeout Precision in Remote Shard Queries Nov 20, 2024
@timvisee timvisee requested a review from agourlay November 20, 2024 15:04
Member

@timvisee timvisee left a comment


Please do one more round of reformatting, and then it should be good 😄

cargo +nightly fmt --all -- --check

Edit: whoops it appears I missed a bracket in my suggestion above.

@varshith257 varshith257 force-pushed the fix/distributed-query-timeout branch from fbdeb93 to 520c879 Compare November 20, 2024 17:07
@varshith257 varshith257 force-pushed the fix/distributed-query-timeout branch from 520c879 to 373659a Compare November 20, 2024 17:10
@varshith257
Contributor Author

varshith257 commented Nov 20, 2024

It just got messy when rebasing, sorry for that! I have opened a new PR

@agourlay
Member

FYI I have implemented my suggestions in #5495
