Timeout on point retrieve when acquiring segment holder #7588
Conversation
```rust
    assert_eq!(records.len(), 3);
}

#[test]
```
06880cb to 5b66afb (Compare)
```rust
#[error("Operation cancelled: {description}")]
Cancelled { description: String },
#[error("Timeout error: {description}")]
Timeout { description: String },
```
A new variant was required to propagate the error.
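The variants above come from a thiserror-derived enum in the real code; a minimal std-only sketch of the same shape (hypothetical, simplified names, with `Display` written by hand instead of derived) is:

```rust
use std::fmt;

// Hypothetical, simplified error enum with the new Timeout variant; the real
// code derives the Display impls via thiserror's #[error(...)] attributes.
#[derive(Debug)]
enum OperationError {
    Cancelled { description: String },
    Timeout { description: String },
}

impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OperationError::Cancelled { description } => {
                write!(f, "Operation cancelled: {description}")
            }
            OperationError::Timeout { description } => {
                write!(f, "Timeout error: {description}")
            }
        }
    }
}

impl OperationError {
    // Constructor mirroring the `timeout` helper discussed later in the review.
    fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
        Self::Timeout {
            description: format!(
                "Operation '{}' timed out after {timeout_sec} seconds",
                operation.into()
            ),
        }
    }
}

fn main() {
    let err = OperationError::timeout(10, "retrieve points");
    assert_eq!(
        err.to_string(),
        "Timeout error: Operation 'retrieve points' timed out after 10 seconds"
    );
    let cancelled = OperationError::Cancelled {
        description: "user abort".to_string(),
    };
    assert_eq!(cancelled.to_string(), "Operation cancelled: user abort");
    println!("{err}");
}
```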
```rust
use crate::operations::types::VectorsConfig;
use crate::operations::vector_params_builder::VectorParamsBuilder;

pub const TEST_TIMEOUT: Duration = Duration::from_hours(1);
```
Wow!
Since this is in tests, shall we do a much lower timeout (like 10 seconds?) so we catch any operations that are stuck for some time.
lib/edge/src/query.rs
Outdated
```rust
&point_ids,
&WithPayload::from(with_payload),
&with_vector,
Duration::from_hours(1),
```
Move this to a constant (as well as the other usages in edge)?
5b66afb to e318cce (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/edge/src/lib.rs (1)
11-11: Centralizing the edge timeout is good; reconsider visibility and default choice

Defining a single `DEFAULT_EDGE_TIMEOUT` constant is a nice improvement over scattering literals, and using `Duration` here is straightforward. Two small design thoughts:

Visibility (`pub` vs internal)

If this timeout is only meant to be used inside the `edge` crate, you might not want to expose it as part of the public API:
- Using `const DEFAULT_EDGE_TIMEOUT: Duration = …;` (or `pub(crate)`) keeps it internal and avoids committing to this exact value as an external contract.
- That gives you more freedom to tweak the default later without worrying about downstream users relying on it.

Documenting the 1-hour default

Since this is now the de facto timeout for edge retrieval/scroll paths, consider adding a short comment explaining why 1 hour was chosen and under what assumptions it's safe (e.g., "matches external request deadline", "intended to be conservative until we plumb per-request timeouts", etc.). That will help future changes align with your intent.

Also applies to: 185-186
lib/edge/src/retrieve.rs (1)
9-9: Retrieve now correctly participates in the shared timeout policy

Importing `DEFAULT_EDGE_TIMEOUT` and passing it into `retrieve_blocking` makes the plain `Shard::retrieve` path consistent with the other edge read paths and ensures it no longer waits indefinitely on the segment holder.

If you later introduce per-request or configurable timeouts (as hinted in the PR description), this method should be a straightforward place to thread such a parameter through instead of always using the default.

Also applies to: 22-30
lib/edge/src/scroll.rs (1)
108-116: Timeout propagation is consistent, but consider configurability and initial lock acquisition

Passing `DEFAULT_EDGE_TIMEOUT` into `retrieve_blocking` in all three scroll helpers keeps scroll behavior aligned with other edge retrievals and ensures these calls now respect the same bounded wait policy.

Two follow-ups you may want to consider (can be done in a later PR):

Configurable/request-scoped timeout

Right now the scroll timeout is controlled only via a crate-level default. Given this is on the query/scroll path, you might eventually want to derive the timeout from:
- a higher-level request deadline (HTTP/gRPC or client-supplied), or
- a collection/cluster configuration value,

instead of a fixed default baked into the edge crate.

Unbounded `self.segments.read()` before `retrieve_blocking`

Each scroll helper still does an initial `self.segments.read().split_segments()` without a timeout. If a writer holds the segment-holder write lock for a long time, these reads can still block indefinitely before you ever reach `retrieve_blocking`, so the new timeout won't fire in that scenario. If your goal is to avoid any potentially unbounded waits on the segment holder in the read path, it may be worth introducing a helper that acquires this first read lock with the same timeout semantics as `retrieve_blocking`.
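A std-only sketch of such a helper (hypothetical name; the actual segment holder uses parking_lot, whose `RwLock::try_read_for` provides a timed read acquisition natively without the polling loop shown here):

```rust
use std::sync::{Arc, RwLock, RwLockReadGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical helper: try to acquire a read lock within `timeout`, polling
// std's non-blocking try_read until the deadline passes.
fn read_with_timeout<T>(
    lock: &RwLock<T>,
    timeout: Duration,
) -> Option<RwLockReadGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_read() {
            Ok(guard) => return Some(guard),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None; // caller maps this to a timeout error
                }
                thread::sleep(Duration::from_millis(1));
            }
            Err(TryLockError::Poisoned(_)) => return None,
        }
    }
}

fn main() {
    let segments = Arc::new(RwLock::new(vec![1, 2, 3]));

    // Uncontended: the read lock is acquired immediately.
    assert!(read_with_timeout(&segments, Duration::from_millis(50)).is_some());

    // Contended: a writer holds the lock longer than the reader's timeout.
    let writer_lock = Arc::clone(&segments);
    let writer = thread::spawn(move || {
        let _guard = writer_lock.write().unwrap();
        thread::sleep(Duration::from_millis(500));
    });
    thread::sleep(Duration::from_millis(100)); // let the writer grab the lock
    assert!(read_with_timeout(&segments, Duration::from_millis(100)).is_none());
    writer.join().unwrap();
    println!("timeout fired while the writer held the lock");
}
```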
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- lib/collection/src/collection_manager/fixtures.rs (2 hunks)
- lib/edge/src/lib.rs (2 hunks)
- lib/edge/src/query.rs (2 hunks)
- lib/edge/src/retrieve.rs (2 hunks)
- lib/edge/src/scroll.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/collection/src/collection_manager/fixtures.rs
- lib/edge/src/query.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/edge/src/scroll.rs
- lib/edge/src/retrieve.rs
- lib/edge/src/lib.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings
Applied to files:
lib/edge/src/scroll.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: lint
- GitHub Check: e2e-tests
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: Build Qdrant Edge Python bindings
🔇 Additional comments (1)
lib/edge/src/scroll.rs (1)
16-18: Use of shared DEFAULT_EDGE_TIMEOUT in scroll paths looks good

Importing `DEFAULT_EDGE_TIMEOUT` here cleanly shares the same default timeout policy between scroll and other edge retrieval paths, avoiding per-call hard-coding.
e6dab72 to 5f2c37d (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (1)
lib/edge/src/scroll.rs (1)
266-274: LGTM! `scroll_randomly` now uses centralized timeout constant.

Note: The previous review comment about `Duration::from_hours(1)` MSRV compatibility is addressed by using the `DEFAULT_EDGE_TIMEOUT` constant - the fix should be applied in `lib/edge/src/lib.rs` where the constant is defined.
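For context on the MSRV note: `Duration::from_hours` only landed in recent Rust releases, so on older toolchains the same constant can be spelled with `from_secs`. An illustrative sketch (not the actual `lib.rs`):

```rust
use std::time::Duration;

// MSRV-friendly spelling of the 1-hour default: Duration::from_hours is only
// available on newer toolchains, while from_secs works everywhere.
pub const DEFAULT_EDGE_TIMEOUT: Duration = Duration::from_secs(60 * 60);

fn main() {
    assert_eq!(DEFAULT_EDGE_TIMEOUT.as_secs(), 3600);
    println!("default edge timeout: {}s", DEFAULT_EDGE_TIMEOUT.as_secs());
}
```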
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- lib/collection/src/collection_manager/collection_updater.rs (9 hunks)
- lib/collection/src/collection_manager/fixtures.rs (2 hunks)
- lib/collection/src/collection_manager/segments_searcher.rs (4 hunks)
- lib/collection/src/collection_manager/tests/mod.rs (5 hunks)
- lib/collection/src/operations/types.rs (1 hunks)
- lib/collection/src/shards/local_shard/query.rs (1 hunks)
- lib/collection/src/shards/local_shard/scroll.rs (3 hunks)
- lib/collection/src/shards/local_shard/shard_ops.rs (1 hunks)
- lib/edge/src/lib.rs (2 hunks)
- lib/edge/src/query.rs (2 hunks)
- lib/edge/src/retrieve.rs (2 hunks)
- lib/edge/src/scroll.rs (4 hunks)
- lib/segment/src/common/operation_error.rs (2 hunks)
- lib/shard/src/retrieve/retrieve_blocking.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- lib/collection/src/collection_manager/fixtures.rs
- lib/collection/src/collection_manager/collection_updater.rs
- lib/collection/src/shards/local_shard/scroll.rs
- lib/edge/src/query.rs
- lib/collection/src/collection_manager/segments_searcher.rs
- lib/collection/src/operations/types.rs
- lib/collection/src/shards/local_shard/shard_ops.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/edge/src/scroll.rs
- lib/edge/src/retrieve.rs
- lib/shard/src/retrieve/retrieve_blocking.rs
- lib/edge/src/lib.rs
- lib/collection/src/collection_manager/tests/mod.rs
- lib/collection/src/shards/local_shard/query.rs
- lib/segment/src/common/operation_error.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings
Applied to files:
lib/edge/src/scroll.rs
🧬 Code graph analysis (3)
lib/shard/src/retrieve/retrieve_blocking.rs (2)
lib/collection/src/operations/types.rs (2)
timeout (935-942), service_error (944-949)
lib/segment/src/common/operation_error.rs (2)
timeout (119-126), service_error (86-91)
lib/collection/src/collection_manager/tests/mod.rs (2)
lib/collection/src/collection_manager/fixtures.rs (3)
build_segment_1 (121-171), build_segment_2 (173-211), empty_segment (29-31)
lib/segment/tests/integration/fixtures/segment.rs (3)
build_segment_1 (26-76), build_segment_2 (78-128), empty_segment (19-21)
lib/collection/src/shards/local_shard/query.rs (3)
lib/collection/src/operations/types.rs (1)
timeout (935-942)
lib/segment/src/common/operation_error.rs (1)
timeout (119-126)
src/actix/api/service_api.rs (2)
timeout (38-40), timeout (80-82)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: e2e-tests
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: Build Qdrant Edge Python bindings
- GitHub Check: lint
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (13)
lib/collection/src/shards/local_shard/query.rs (1)
128-141: LGTM! Timeout correctly propagated to both wrapper and retrieve call.
The `timeout` is appropriately passed to both the outer `tokio::time::timeout` wrapper and the inner `SegmentsSearcher::retrieve` call, ensuring consistent timeout behavior at both the async and segment-holder lock acquisition levels.

lib/collection/src/collection_manager/tests/mod.rs (5)
20-22: LGTM! Import updated correctly.
The import now includes `TEST_TIMEOUT` from fixtures, aligning with the new timeout parameter requirement.

290-299: LGTM! Test updated with timeout parameter.
The `retrieve_blocking` call in `test_delete_all_point_versions` correctly includes `TEST_TIMEOUT` as the new timeout argument.

336-345: LGTM! Second `retrieve_blocking` call updated consistently.

453-462: LGTM! Test `test_proxy_shared_updates` updated with timeout.

589-598: LGTM! Test `test_proxy_shared_updates_same_version` updated with timeout.

lib/segment/src/common/operation_error.rs (3)

51-52: LGTM! New `Timeout` variant appropriately added.
The `Timeout` variant follows the established pattern of other error variants with a description field and consistent error message formatting.

119-126: LGTM! The `timeout` constructor is well-structured.
The method signature and implementation align with the similar `CollectionError::timeout` pattern shown in the relevant code snippets.

86-117: LGTM! Refactored to return `Self` for consistency.
The constructors now use a `Self` return type, which is more idiomatic and consistent with the new `timeout` method.

lib/edge/src/scroll.rs (3)

17-17: LGTM! Import added for centralized timeout constant.
Using a shared constant is better than hardcoding timeout values at each call site.

108-116: LGTM! `scroll_by_id` now uses centralized timeout constant.

162-170: LGTM! `scroll_by_field` now uses centralized timeout constant.

lib/shard/src/retrieve/retrieve_blocking.rs (1)

36-87: LGTM!
The retrieval logic correctly preserves the existing behavior while operating under the timeout-guarded lock. The optimization to keep the latest point version and the conditional payload/vector assembly are maintained appropriately.
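The bounded-wait pattern reviewed here (an outer timeout wrapping a blocking retrieval, as in the `tokio::time::timeout` call sites above) can be illustrated with a std-only analog that bounds the wait with a channel's `recv_timeout`; all names below are hypothetical stand-ins, not the actual Qdrant APIs:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for a blocking retrieve that may take a while,
// e.g. because it waits on the segment-holder lock internally.
fn retrieve_blocking(delay: Duration) -> Vec<u64> {
    thread::sleep(delay);
    vec![1, 2, 3]
}

// Std-only analog of wrapping the blocking call in an outer timeout:
// run it on a worker thread and bound the wait with recv_timeout.
fn retrieve_with_timeout(delay: Duration, timeout: Duration) -> Result<Vec<u64>, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(retrieve_blocking(delay));
    });
    rx.recv_timeout(timeout).map_err(|_| {
        format!(
            "Operation 'retrieve' timed out after {} seconds",
            timeout.as_secs()
        )
    })
}

fn main() {
    // Fast retrieval completes within the deadline.
    assert_eq!(
        retrieve_with_timeout(Duration::from_millis(10), Duration::from_secs(2)).unwrap(),
        vec![1, 2, 3]
    );
    // Slow retrieval trips the outer timeout instead of blocking forever.
    assert!(retrieve_with_timeout(Duration::from_millis(500), Duration::from_millis(50)).is_err());
    println!("bounded both the fast and the slow path");
}
```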
```rust
point_ids,
&with_payload,
&with_vector,
DEFAULT_EDGE_TIMEOUT,
```
Reconsider the 1-hour default timeout for user-facing operations.
A 1-hour timeout for point retrieval operations is excessively long for user-facing read paths. This could:
- Mask performance issues or deadlocks that should fail faster
- Lead to poor user experience if operations hang for extended periods
- Tie up resources unnecessarily
Consider making the timeout configurable (e.g., via constructor parameter or config) with a more conservative default (e.g., 30-60 seconds for edge operations).
I see you noted dissatisfaction with this approach in the PR description. Would you like help exploring alternatives, such as:
- Adding a configurable timeout field to the `Shard` struct
- Accepting timeout as a parameter in the public `retrieve` method
- Using a more conservative default (e.g., 30-60 seconds) with override capability
```rust
let Some(segments_guard) = segments.try_read_for(timeout) else {
    return Err(OperationError::timeout(
        timeout.as_secs() as usize,
        "retrieve points",
    ));
};
```
Sub-second timeout precision is lost in error message.
Converting the timeout to seconds via timeout.as_secs() truncates sub-second durations. If a caller passes a timeout less than 1 second (e.g., Duration::from_millis(500)), the error message will read "timed out after 0 seconds", which is misleading.
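The truncation is easy to demonstrate with std alone:

```rust
use std::time::Duration;

fn main() {
    let timeout = Duration::from_millis(500);

    // as_secs() truncates toward zero, so a 500 ms timeout reports 0 seconds.
    assert_eq!(timeout.as_secs(), 0);

    // as_millis() keeps the precision ...
    assert_eq!(timeout.as_millis(), 500);

    // ... and as_secs().max(1) at least avoids the misleading "0 seconds".
    assert_eq!(timeout.as_secs().max(1), 1);

    println!("secs={} millis={}", timeout.as_secs(), timeout.as_millis());
}
```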
Consider using milliseconds for better precision:
```diff
-        return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
-            "retrieve points",
-        ));
+        return Err(OperationError::timeout(
+            timeout.as_millis() as usize,
+            "retrieve points (ms)",
+        ));
```

Alternatively, if the `timeout()` helper must accept seconds, you could round up:

```diff
         return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
+            timeout.as_secs().max(1) as usize,
             "retrieve points",
         ));
```
));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let Some(segments_guard) = segments.try_read_for(timeout) else {
    return Err(OperationError::timeout(
        timeout.as_millis() as usize,
        "retrieve points (ms)",
    ));
};
```
Good point, I will take care of all instances of that problem in a different PR.
@agourlay, sounds good! Would you like me to create a GitHub issue to track this timeout precision improvement across all instances?
🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
Yeah sure go ahead and do that.
Please list precisely in the issue all the instances of this in the code base to make my life easier.
Also assign the issue to myself so I don't forget.
Do not disappoint me.
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find all instances of OperationError::timeout calls with timeout.as_secs()
echo "=== Searching for OperationError::timeout calls with timeout.as_secs() ==="
rg -nP 'OperationError::timeout\s*\(' -A 3 -B 1
echo -e "\n=== Searching for timeout.as_secs() usage ==="
rg -nP 'timeout\.as_secs\(\)' -C 2
echo -e "\n=== Finding all files that contain timeout precision conversion ==="
rg -l 'timeout\.as_secs\(\)'
```
Length of output: 13938
🏁 Script executed:
#!/bin/bash
# Create GitHub issue with comprehensive list of all timeout precision instances
gh issue create \
--title "Improve timeout precision in error messages (sub-second durations truncated)" \
--assignee agourlay \
--body "## Problem
When creating timeout errors, we use \`timeout.as_secs()\` which truncates sub-second durations. If a timeout is less than 1 second (e.g., \`Duration::from_millis(500)\`), the error message will read \"timed out after 0 seconds\", which is misleading.
## Related PR and Discussion
- PR: #7588
- Comment: https://github.com/qdrant/qdrant/pull/7588#discussion_r2589088088
## Solution
Use \`timeout.as_millis()\` or \`timeout.as_secs().max(1)\` to preserve precision or ensure non-zero values.
## All Instances Requiring Fix
### Critical: Error Creation (23 instances)
These directly affect error messages and should be fixed:
1. **lib/shard/src/retrieve/retrieve_blocking.rs:31**
\`\`\`rust
OperationError::timeout(timeout.as_secs() as usize, \"retrieve points\")
\`\`\`
2. **src/common/telemetry.rs:117**
\`\`\`rust
StorageError::timeout(timeout.as_secs() as usize, \"collections telemetry\")
\`\`\`
3. **lib/collection/src/shards/local_shard/query.rs:141**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
\`\`\`
4. **lib/collection/src/shards/local_shard/scroll.rs:59**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"Query scroll\")
\`\`\`
5. **lib/collection/src/shards/local_shard/scroll.rs:186**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_id\")
\`\`\`
6. **lib/collection/src/shards/local_shard/scroll.rs:212**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
\`\`\`
7. **lib/collection/src/shards/local_shard/scroll.rs:275**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_field\")
\`\`\`
8. **lib/collection/src/shards/local_shard/scroll.rs:308**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
\`\`\`
9. **lib/collection/src/shards/local_shard/scroll.rs:378**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"scroll_randomly\")
\`\`\`
10. **lib/collection/src/shards/local_shard/scroll.rs:448**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
\`\`\`
11. **lib/collection/src/shards/local_shard/shard_ops.rs:263**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"count\")
\`\`\`
12. **lib/collection/src/shards/local_shard/shard_ops.rs:304**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
\`\`\`
13. **lib/collection/src/shards/local_shard/telemetry.rs:38**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
\`\`\`
14. **lib/collection/src/shards/local_shard/telemetry.rs:51**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
\`\`\`
15. **lib/collection/src/shards/local_shard/telemetry.rs:130**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"optimization status\")
\`\`\`
16. **lib/collection/src/shards/local_shard/formula_rescore.rs:44**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"rescore_with_formula\")
\`\`\`
17. **lib/collection/src/shards/local_shard/facet.rs:63**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
\`\`\`
18. **lib/collection/src/shards/local_shard/facet.rs:142**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
\`\`\`
19. **lib/collection/src/shards/local_shard/facet.rs:191**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
\`\`\`
20. **lib/collection/src/shards/local_shard/search.rs:143**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"Search\")
\`\`\`
21. **lib/collection/src/shards/replica_set/mod.rs:592**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"wait for replica set state\")
\`\`\`
22. **lib/collection/src/grouping/builder.rs:77**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"GroupBy\")
\`\`\`
23. **lib/collection/src/collection/mmr.rs:42**
\`\`\`rust
CollectionError::timeout(timeout.as_secs() as usize, \"mmr\")
\`\`\`
### Lower Priority: Logging and Formatting (7 instances)
These are in log messages or error string formatting and have less impact:
24. **lib/collection/src/shards/local_shard/scroll.rs:57** (log message)
25. **lib/collection/src/shards/local_shard/search.rs:141** (log message)
26. **lib/collection/src/grouping/builder.rs:76** (log message)
27. **lib/storage/src/content_manager/consensus_manager.rs:771** (error message formatting)
28. **lib/collection/src/shards/replica_set/update.rs:493** (error message formatting)
29. **lib/collection/src/shards/proxy_shard.rs:124** (retry log message)
30. **lib/collection/src/shards/channel_service.rs:119** (gRPC field - may need different handling)
## Recommended Approach
1. Fix all 23 critical error creation instances first
2. Consider fixing logging instances for consistency
3. Evaluate gRPC field separately (may require protocol consideration)"
Length of output: 44
@agourlay, done! I've created issue #7688 with a comprehensive list of all timeout precision instances in the codebase:
🔗 #7688
The issue includes:
- 23 critical instances in error creation that directly affect error messages
- 7 lower-priority instances in logging and formatting
- Precise file paths and line numbers for each instance
- Code snippets showing the problematic patterns
- Recommended fix approaches
The issue has been assigned to you. All instances are categorized by priority to help you tackle the critical ones first, then decide on the logging instances.
/)/)
( •.•) Mission accomplished!
/>📋 30 instances documented
🧬 Code Graph Analysis Results
- File: lib/collection/src/collection/point_ops.rs
  Snippet (lines 398-472):

```rust
pub async fn retrieve(
    &self,
    request: PointRequestInternal,
    read_consistency: Option<ReadConsistency>,
    shard_selection: &ShardSelectorInternal,
    timeout: Option<Duration>,
    hw_measurement_acc: HwMeasurementAcc,
) -> CollectionResult<Vec<RecordInternal>> {
    if request.ids.is_empty() {
        return Ok(Vec::new());
    }
    let with_payload_interface = request
        .with_payload
        .as_ref()
        .unwrap_or(&WithPayloadInterface::Bool(false));
    let with_payload = WithPayload::from(with_payload_interface);
    let ids_len = request.ids.len();
    let request = Arc::new(request);
    let shard_holder = self.shards_holder.read().await;
    let target_shards = shard_holder.select_shards(shard_selection)?;
    let mut all_shard_collection_requests = target_shards
        .into_iter()
        .map(|(shard, shard_key)| {
            // Explicitly borrow `request` and `with_payload`, so we can use them in `async move`
            // block below without unnecessarily cloning anything
            let request = &request;
            let with_payload = &with_payload;
            let hw_acc = hw_measurement_acc.clone();
            async move {
                let mut records = shard
                    .retrieve(
                        request.clone(),
                        with_payload,
                        &request.with_vector,
                        read_consistency,
                        timeout,
                        shard_selection.is_shard_id(),
                        hw_acc,
                    )
                    .await?;
                if shard_key.is_none() {
                    return Ok(records);
                }
                for point in &mut records {
                    point.shard_key.clone_from(&shard_key.cloned());
                }
                CollectionResult::Ok(records)
            }
        })
        .collect::<FuturesUnordered<_>>();
    // pre-allocate hashmap with capped capacity to protect from malevolent input
    let mut covered_point_ids = HashMap::with_capacity(ids_len.min(1024));
    while let Some(response) = all_shard_collection_requests.try_next().await? {
        for point in response {
            // Add each point only once, deduplicate point IDs
            covered_point_ids.insert(point.id, point);
        }
    }
    // Collect points in the same order as they were requested
    let points = request
        .ids
        .iter()
        .filter_map(|id| covered_point_ids.remove(id))
        .collect();
    Ok(points)
}
```
- File: lib/collection/src/shards/local_shard/shard_ops.rs
  Snippet (lines 277-316):

```rust
async fn retrieve(
    &self,
    request: Arc<PointRequestInternal>,
    with_payload: &WithPayload,
    with_vector: &WithVector,
    search_runtime_handle: &Handle,
    timeout: Option<Duration>,
    hw_measurement_acc: HwMeasurementAcc,
) -> CollectionResult<Vec<RecordInternal>> {
    // Check read rate limiter before proceeding
    self.check_read_rate_limiter(&hw_measurement_acc, "retrieve", || request.ids.len())?;
    let timeout = timeout.unwrap_or(self.shared_storage_config.search_timeout);
    let start_time = Instant::now();
    let records_map = tokio::time::timeout(
        timeout,
        SegmentsSearcher::retrieve(
            self.segments.clone(),
            &request.ids,
            with_payload,
            with_vector,
            search_runtime_handle,
            timeout,
            hw_measurement_acc,
        ),
    )
    .await
    .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
    let ordered_records = request
        .ids
        .iter()
        .filter_map(|point| records_map.get(point).cloned())
        .collect();
    let elapsed = start_time.elapsed();
    log_request_to_collector(&self.collection_name, elapsed, || request);
    Ok(ordered_records)
}
```
- File: lib/collection/src/collection_manager/segments_searcher.rs
  Snippet (lines 360-390):

```rust
pub async fn retrieve(
    segments: LockedSegmentHolder,
    points: &[PointIdType],
    with_payload: &WithPayload,
    with_vector: &WithVector,
    runtime_handle: &Handle,
    timeout: Duration,
    hw_measurement_acc: HwMeasurementAcc,
) -> CollectionResult<AHashMap<PointIdType, RecordInternal>> {
    let stopping_guard = StoppingGuard::new();
    let points = runtime_handle.spawn_blocking({
        let segments = segments.clone();
        let points = points.to_vec();
        let with_payload = with_payload.clone();
        let with_vector = with_vector.clone();
        let is_stopped = stopping_guard.get_is_stopped();
        // TODO create one Task per segment level retrieve
        move || {
            retrieve_blocking(
                segments,
                &points,
                &with_payload,
                &with_vector,
                timeout,
                &is_stopped,
                hw_measurement_acc,
            )
        }
    });
    Ok(AbortOnDropHandle::new(points).await??)
}
```
- File: lib/segment/src/common/operation_error.rs
  Snippet (lines 119-126):

```rust
pub fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
    Self::Timeout {
        description: format!(
            "Operation '{}' timed out after {timeout_sec} seconds",
            operation.into()
        ),
    }
}
```
- File: lib/edge/src/retrieve.rs
  Snippet (lines 12-38):

```rust
pub fn retrieve(
    &self,
    point_ids: &[ExtendedPointId],
    with_payload: Option<WithPayloadInterface>,
    with_vector: Option<WithVector>,
) -> OperationResult<Vec<RecordInternal>> {
    let with_payload =
        WithPayload::from(with_payload.unwrap_or(WithPayloadInterface::Bool(true)));
    let with_vector = with_vector.unwrap_or(WithVector::Bool(false));
    let mut points = retrieve_blocking(
        self.segments.clone(),
        point_ids,
        &with_payload,
        &with_vector,
        DEFAULT_EDGE_TIMEOUT,
        &AtomicBool::new(false),
        HwMeasurementAcc::disposable(),
    )?;
    let points: Vec<_> = point_ids
        .iter()
        .filter_map(|id| points.remove(id))
        .collect();
    Ok(points)
}
```
🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
@agourlay, thank you! Happy to help! 🐰✨
Feel free to ping me if you need anything else while working through those instances.
* Timeout on point retrieve when acquiring segment holder
* review: less radical and cleaner
* highly embarassing I hope no one sees this!
* coderabbit
This PR adds a timeout on `retrieve_blocking` to make sure it does not wait indefinitely while trying to acquire the segment holder lock. A unit test has been added showing that the timeout kicks in if a writer holds onto the lock for a long period of time.

The function `retrieve_blocking` is used mostly on the read path when searching/querying or scrolling points. I decided to hard-code a few timeouts for convenience:

TBH I do not like the approach for the edge crate, happy to hear better suggestions.