Timeout error on query if timeout=1 in a distributed deployment #5463
Closed
Current Behavior
When sending a recommend query to Qdrant against a collection distributed across multiple shards, with the timeout set to 1 second, Qdrant may return a timeout error while distributing the request to the shards, even though the same request completes in less than 1 second. Without a timeout, or with timeout=2, the request succeeds.
Steps to Reproduce
- Create a Qdrant cluster with multiple nodes using the official Helm Chart.
- Create a collection with multiple shards.
- Execute a simple recommend query like the one below (with timeout=1):
POST /collections/<collection_name>/points/query?timeout=1
{
  "query": {
    "recommend": {
      "positive": [
        "c026bde1-8399-488d-954a-c777a42ab74a"
      ]
    }
  },
  "with_payload": false,
  "with_vector": false
}
- Qdrant returns the following error:
{
"error": "Service internal error: 2 of 2 read operations failed:\n Timeout error: Deadline Exceeded: status: DeadlineExceeded, message: \"Timeout: Timeout error: Operation 'Search' timed out after 0 seconds\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Mon, 18 Nov 2024 17:28:13 GMT\", \"content-length\": \"0\"} }\n Timeout error: Deadline Exceeded: status: DeadlineExceeded, message: \"Timeout: Timeout error: Operation 'Search' timed out after 0 seconds\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Mon, 18 Nov 2024 17:28:13 GMT\", \"content-length\": \"0\"} }"
}
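To compare the behavior with and without the timeout outside of curl, a minimal Python sketch using only the standard library could build and send the same request. It assumes a node reachable at http://localhost:6333 (Qdrant's default REST port); BASE_URL, build_recommend_query and run are illustrative names, not part of any Qdrant client.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple

# Assumption: a Qdrant node reachable on localhost via the default REST port.
BASE_URL = "http://localhost:6333"


def build_recommend_query(collection: str, positive_id: str,
                          timeout_s: Optional[int]) -> urllib.request.Request:
    """Build the POST .../points/query request from the report (hypothetical helper)."""
    url = f"{BASE_URL}/collections/{collection}/points/query"
    if timeout_s is not None:
        url += f"?timeout={timeout_s}"
    body = {
        "query": {"recommend": {"positive": [positive_id]}},
        "with_payload": False,
        "with_vector": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(request: urllib.request.Request) -> Tuple[float, int, str]:
    """Send the request and return (elapsed seconds, HTTP status, response body)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(request) as resp:
            status, text = resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # An error response (such as the timeout error) still carries a JSON body.
        status, text = err.code, err.read().decode("utf-8")
    return time.perf_counter() - start, status, text
```

For example, calling run(build_recommend_query("<collection_name>", "c026bde1-8399-488d-954a-c777a42ab74a", t)) for t in None, 1 and 2 and printing the elapsed time and status would, on an affected version, be expected to show the error only for t = 1, even though the elapsed time stays under 1 second.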
Expected Behavior
The request should return a result without any errors. (Running the same request without a timeout returns a response in less than 1 second.)
Possible Solution
The issue seems to stem from how the timeout is passed between nodes (shards), as the message "timed out after 0 seconds" suggests. It might be caused by subtracting the elapsed time as an integer (1) instead of as a float (less than 1), which would leave a timeout of 0 seconds for the remote shards. The issue could be related to the following commits:
- Non blocking retrieve with timeout and cancellation support #4844
- Non blocking exact count with timeout and cancellation support #4849
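The suspected arithmetic can be illustrated with a small Python sketch. It is not Qdrant's code (Qdrant is written in Rust); both functions are hypothetical and only model the hypothesis above: charging the elapsed time as a whole second versus subtracting it as a float.

```python
import math


def remaining_budget_whole_seconds(timeout_s: int, elapsed_s: float) -> int:
    """Hypothetical faulty variant: any non-zero elapsed time is charged as 1 second."""
    # With timeout_s = 1, even a few milliseconds of elapsed time leave 0 seconds.
    return max(timeout_s - math.ceil(elapsed_s), 0)


def remaining_budget_fractional(timeout_s: float, elapsed_s: float) -> float:
    """Expected variant: subtract the elapsed time as a float."""
    # With timeout_s = 1 and 20 ms elapsed, about 0.98 seconds remain.
    return max(timeout_s - elapsed_s, 0.0)
```

Under this model, timeout=1 leaves a budget of 0 seconds (matching "timed out after 0 seconds"), while timeout=2 still leaves 1 second, which would be consistent with the report that timeout=2 succeeds. Truncating a fractional remainder to whole seconds (for example int(1 - 0.02)) would produce the same symptom.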
Context (Environment)
- Deployment: Qdrant deployed in a private Kubernetes cluster using the official Helm Chart.
- Affected Versions: At least Qdrant 1.11.5 and 1.12.3; it worked fine with 1.10.1.
- Reproducibility: The issue is not related to any specific client, as it can be reproduced using Python or curl.
- Scope: The issue occurs only with distributed clusters (multiple shards).