
Timeout on point retrieve when acquiring segment holder #7588

Merged
agourlay merged 4 commits into dev from timeout-segment-holder-on-retrieve on Dec 4, 2025

Conversation

@agourlay (Member) commented Nov 24, 2025

This PR adds a timeout to retrieve_blocking to make sure it does not wait indefinitely while trying to acquire the segment holder lock.

A unit test has been added showing that the timeout kicks in if a writer holds onto the lock for a long period of time.

The function retrieve_blocking is used mostly on the read path when searching, querying, or scrolling points.

I decided to hard-code a few timeouts for convenience:

  • in tests
  • in the edge crate

TBH I do not like the approach for the edge crate, happy to hear better suggestions.
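For readers unfamiliar with the mechanism: the fix boils down to replacing a blocking read-lock acquisition on the segment holder with a timed one. Below is a minimal stdlib-only sketch; qdrant itself uses parking_lot's try_read_for, and the polling loop here is only a stand-in with the same observable behavior. The scenario in main mirrors the unit test described above: a writer holds the lock longer than the reader is willing to wait.

```rust
use std::sync::{Arc, RwLock, RwLockReadGuard};
use std::thread;
use std::time::{Duration, Instant};

// Poll `try_read` until a deadline. parking_lot's `try_read_for` does this
// without busy-waiting; the loop below is a stdlib-only approximation.
fn try_read_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockReadGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_read() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

fn main() {
    let segments = Arc::new(RwLock::new(vec![1, 2, 3]));

    // A writer grabs the lock and holds it for 300 ms.
    let writer = {
        let segments = Arc::clone(&segments);
        thread::spawn(move || {
            let _guard = segments.write().unwrap();
            thread::sleep(Duration::from_millis(300));
        })
    };
    thread::sleep(Duration::from_millis(50)); // let the writer acquire first

    // The read path now fails fast instead of blocking indefinitely.
    assert!(try_read_for(&segments, Duration::from_millis(100)).is_none());

    // Once the writer releases the lock, reads succeed again.
    writer.join().unwrap();
    assert!(try_read_for(&segments, Duration::from_millis(100)).is_some());
}
```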

assert_eq!(records.len(), 3);
}

#[test]
Member Author
test 👇

@agourlay agourlay added this to the Better read timeouts milestone Nov 24, 2025
@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch 4 times, most recently from 06880cb to 5b66afb on December 3, 2025 10:49
#[error("Operation cancelled: {description}")]
Cancelled { description: String },
#[error("Timeout error: {description}")]
Timeout { description: String },
Member Author

a new variant was required to propagate the error
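A minimal sketch of the new variant and how it propagates: the real code derives Display with thiserror's #[error(...)] attributes as shown in the diff above; the manual Display impl and the exact description wording in the `timeout` constructor below are assumptions based on the snippets quoted later in this review.

```rust
use std::fmt;

// Sketch of the error variants from the diff above.
#[derive(Debug)]
enum OperationError {
    Cancelled { description: String },
    Timeout { description: String },
}

impl OperationError {
    // Mirrors the `OperationError::timeout(secs, what)` calls seen in
    // retrieve_blocking; the description format is an assumption.
    fn timeout(secs: usize, what: &str) -> Self {
        OperationError::Timeout {
            description: format!("{what} timed out after {secs} seconds"),
        }
    }
}

// Stands in for thiserror's derived Display, using the same format strings.
impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OperationError::Cancelled { description } => {
                write!(f, "Operation cancelled: {description}")
            }
            OperationError::Timeout { description } => {
                write!(f, "Timeout error: {description}")
            }
        }
    }
}

fn main() {
    let err = OperationError::timeout(5, "retrieve points");
    assert_eq!(
        err.to_string(),
        "Timeout error: retrieve points timed out after 5 seconds"
    );
}
```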

@agourlay agourlay marked this pull request as ready for review December 3, 2025 11:22
@agourlay agourlay requested review from ffuugoo and timvisee December 3, 2025 11:22
coderabbitai[bot]: This comment was marked as resolved.

use crate::operations::types::VectorsConfig;
use crate::operations::vector_params_builder::VectorParamsBuilder;

pub const TEST_TIMEOUT: Duration = Duration::from_hours(1);
Member

Wow!

Since this is in tests, shall we do a much lower timeout (like 10 seconds?) so we catch any operations that are stuck for some time.

&point_ids,
&WithPayload::from(with_payload),
&with_vector,
Duration::from_hours(1),
Member

Move this to a constant (as well as the other usages in edge)?

@qdrant qdrant deleted a comment from coderabbitai bot Dec 3, 2025
@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch from 5b66afb to e318cce on December 4, 2025 10:41
coderabbitai[bot]: This comment was marked as resolved.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/edge/src/lib.rs (1)

11-11: Centralizing the edge timeout is good; reconsider visibility and default choice

Defining a single DEFAULT_EDGE_TIMEOUT constant is a nice improvement over scattering literals, and using Duration here is straightforward.

Two small design thoughts:

  1. Visibility (pub vs internal)
    If this timeout is only meant to be used inside the edge crate, you might not want to expose it as part of the public API:

    • Using const DEFAULT_EDGE_TIMEOUT: Duration = …; (or pub(crate)) keeps it internal and avoids committing to this exact value as an external contract.
    • That gives you more freedom to tweak the default later without worrying about downstream users relying on it.
  2. Documenting the 1‑hour default
    Since this is now the de facto timeout for edge retrieval/scroll paths, consider adding a short comment explaining why 1 hour was chosen and under what assumptions it’s safe (e.g., “matches external request deadline”, “intended to be conservative until we plumb per-request timeouts”, etc.). That will help future changes align with your intent.

Also applies to: 185-186

lib/edge/src/retrieve.rs (1)

9-9: Retrieve now correctly participates in the shared timeout policy

Importing DEFAULT_EDGE_TIMEOUT and passing it into retrieve_blocking makes the plain Shard::retrieve path consistent with the other edge read paths and ensures it no longer waits indefinitely on the segment holder.

If you later introduce per-request or configurable timeouts (as hinted in the PR description), this method should be a straightforward place to thread such a parameter through instead of always using the default.

Also applies to: 22-30

lib/edge/src/scroll.rs (1)

108-116: Timeout propagation is consistent, but consider configurability and initial lock acquisition

Passing DEFAULT_EDGE_TIMEOUT into retrieve_blocking in all three scroll helpers keeps scroll behavior aligned with other edge retrievals and ensures these calls now respect the same bounded wait policy.

Two follow-ups you may want to consider (can be done in a later PR):

  1. Configurable/request-scoped timeout
    Right now the scroll timeout is controlled only via a crate-level default. Given this is on the query/scroll path, you might eventually want to derive the timeout from:

    • a higher-level request deadline (HTTP/gRPC or client-supplied), or
    • a collection/cluster configuration value,
      instead of a fixed default baked into the edge crate.
  2. Unbounded self.segments.read() before retrieve_blocking
    Each scroll helper still does an initial self.segments.read().split_segments() without a timeout. If a writer holds the segment-holder write lock for a long time, these reads can still block indefinitely before you ever reach retrieve_blocking, so the new timeout won't fire in that scenario. If your goal is to avoid any potentially unbounded waits on the segment holder in the read path, it may be worth introducing a helper that acquires this first read lock with the same timeout semantics as retrieve_blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e318cce and 5f4b76e.

📒 Files selected for processing (5)
  • lib/collection/src/collection_manager/fixtures.rs (2 hunks)
  • lib/edge/src/lib.rs (2 hunks)
  • lib/edge/src/query.rs (2 hunks)
  • lib/edge/src/retrieve.rs (2 hunks)
  • lib/edge/src/scroll.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/collection/src/collection_manager/fixtures.rs
  • lib/edge/src/query.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/edge/src/scroll.rs
  • lib/edge/src/retrieve.rs
  • lib/edge/src/lib.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings

Applied to files:

  • lib/edge/src/scroll.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: lint
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: Build Qdrant Edge Python bindings
🔇 Additional comments (1)
lib/edge/src/scroll.rs (1)

16-18: Use of shared DEFAULT_EDGE_TIMEOUT in scroll paths looks good

Importing DEFAULT_EDGE_TIMEOUT here cleanly shares the same default timeout policy between scroll and other edge retrieval paths, avoiding per-call hard-coding.

@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch from e6dab72 to 5f2c37d on December 4, 2025 13:25
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
lib/edge/src/scroll.rs (1)

266-274: LGTM! scroll_randomly now uses centralized timeout constant.

Note: The previous review comment about Duration::from_hours(1) MSRV compatibility is addressed by using the DEFAULT_EDGE_TIMEOUT constant - the fix should be applied in lib/edge/src/lib.rs where the constant is defined.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f4b76e and 5f2c37d.

📒 Files selected for processing (14)
  • lib/collection/src/collection_manager/collection_updater.rs (9 hunks)
  • lib/collection/src/collection_manager/fixtures.rs (2 hunks)
  • lib/collection/src/collection_manager/segments_searcher.rs (4 hunks)
  • lib/collection/src/collection_manager/tests/mod.rs (5 hunks)
  • lib/collection/src/operations/types.rs (1 hunks)
  • lib/collection/src/shards/local_shard/query.rs (1 hunks)
  • lib/collection/src/shards/local_shard/scroll.rs (3 hunks)
  • lib/collection/src/shards/local_shard/shard_ops.rs (1 hunks)
  • lib/edge/src/lib.rs (2 hunks)
  • lib/edge/src/query.rs (2 hunks)
  • lib/edge/src/retrieve.rs (2 hunks)
  • lib/edge/src/scroll.rs (4 hunks)
  • lib/segment/src/common/operation_error.rs (2 hunks)
  • lib/shard/src/retrieve/retrieve_blocking.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • lib/collection/src/collection_manager/fixtures.rs
  • lib/collection/src/collection_manager/collection_updater.rs
  • lib/collection/src/shards/local_shard/scroll.rs
  • lib/edge/src/query.rs
  • lib/collection/src/collection_manager/segments_searcher.rs
  • lib/collection/src/operations/types.rs
  • lib/collection/src/shards/local_shard/shard_ops.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/edge/src/scroll.rs
  • lib/edge/src/retrieve.rs
  • lib/shard/src/retrieve/retrieve_blocking.rs
  • lib/edge/src/lib.rs
  • lib/collection/src/collection_manager/tests/mod.rs
  • lib/collection/src/shards/local_shard/query.rs
  • lib/segment/src/common/operation_error.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings

Applied to files:

  • lib/edge/src/scroll.rs
🧬 Code graph analysis (3)
lib/shard/src/retrieve/retrieve_blocking.rs (2)
lib/collection/src/operations/types.rs (2)
  • timeout (935-942)
  • service_error (944-949)
lib/segment/src/common/operation_error.rs (2)
  • timeout (119-126)
  • service_error (86-91)
lib/collection/src/collection_manager/tests/mod.rs (2)
lib/collection/src/collection_manager/fixtures.rs (3)
  • build_segment_1 (121-171)
  • build_segment_2 (173-211)
  • empty_segment (29-31)
lib/segment/tests/integration/fixtures/segment.rs (3)
  • build_segment_1 (26-76)
  • build_segment_2 (78-128)
  • empty_segment (19-21)
lib/collection/src/shards/local_shard/query.rs (3)
lib/collection/src/operations/types.rs (1)
  • timeout (935-942)
lib/segment/src/common/operation_error.rs (1)
  • timeout (119-126)
src/actix/api/service_api.rs (2)
  • timeout (38-40)
  • timeout (80-82)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: Build Qdrant Edge Python bindings
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (13)
lib/collection/src/shards/local_shard/query.rs (1)

128-141: LGTM! Timeout correctly propagated to both wrapper and retrieve call.

The timeout is appropriately passed to both the outer tokio::time::timeout wrapper and the inner SegmentsSearcher::retrieve call, ensuring consistent timeout behavior at both the async and segment-holder lock acquisition levels.

lib/collection/src/collection_manager/tests/mod.rs (5)

20-22: LGTM! Import updated correctly.

The import now includes TEST_TIMEOUT from fixtures, aligning with the new timeout parameter requirement.


290-299: LGTM! Test updated with timeout parameter.

The retrieve_blocking call in test_delete_all_point_versions correctly includes TEST_TIMEOUT as the new timeout argument.


336-345: LGTM! Second retrieve_blocking call updated consistently.


453-462: LGTM! Test test_proxy_shared_updates updated with timeout.


589-598: LGTM! Test test_proxy_shared_updates_same_version updated with timeout.

lib/segment/src/common/operation_error.rs (3)

51-52: LGTM! New Timeout variant appropriately added.

The Timeout variant follows the established pattern of other error variants with a description field and consistent error message formatting.


119-126: LGTM! The timeout constructor is well-structured.

The method signature and implementation align with the similar CollectionError::timeout pattern shown in the relevant code snippets.


86-117: LGTM! Refactored to return Self for consistency.

The constructors now use Self return type, which is more idiomatic and consistent with the new timeout method.

lib/edge/src/scroll.rs (3)

17-17: LGTM! Import added for centralized timeout constant.

Using a shared constant is better than hardcoding timeout values at each call site.


108-116: LGTM! scroll_by_id now uses centralized timeout constant.


162-170: LGTM! scroll_by_field now uses centralized timeout constant.

lib/shard/src/retrieve/retrieve_blocking.rs (1)

36-87: LGTM!

The retrieval logic correctly preserves the existing behavior while operating under the timeout-guarded lock. The optimization to keep the latest point version and the conditional payload/vector assembly are maintained appropriately.

point_ids,
&with_payload,
&with_vector,
DEFAULT_EDGE_TIMEOUT,
Contributor

⚠️ Potential issue | 🟠 Major

Reconsider the 1-hour default timeout for user-facing operations.

A 1-hour timeout for point retrieval operations is excessively long for user-facing read paths. This could:

  • Mask performance issues or deadlocks that should fail faster
  • Lead to poor user experience if operations hang for extended periods
  • Tie up resources unnecessarily

Consider making the timeout configurable (e.g., via constructor parameter or config) with a more conservative default (e.g., 30-60 seconds for edge operations).

I see you noted dissatisfaction with this approach in the PR description. Would you like help exploring alternatives, such as:

  • Adding a configurable timeout field to the Shard struct
  • Accepting timeout as a parameter in the public retrieve method
  • Using a more conservative default (e.g., 30-60 seconds) with override capability
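The second and third alternatives above could be combined along these lines. This is a hypothetical sketch; names like EdgeShard and retrieve_timeout are illustrative and not qdrant's actual API.

```rust
use std::time::Duration;

// Hypothetical shard with a construction-time default timeout that a
// per-request deadline can override.
struct EdgeShard {
    retrieve_timeout: Duration,
}

impl EdgeShard {
    fn new(retrieve_timeout: Option<Duration>) -> Self {
        // A conservative default in the 30-60 s range, as suggested above.
        EdgeShard {
            retrieve_timeout: retrieve_timeout.unwrap_or(Duration::from_secs(60)),
        }
    }

    // A per-request deadline wins over the shard default when supplied.
    fn effective_timeout(&self, per_request: Option<Duration>) -> Duration {
        per_request.unwrap_or(self.retrieve_timeout)
    }
}

fn main() {
    let shard = EdgeShard::new(None);
    assert_eq!(shard.effective_timeout(None), Duration::from_secs(60));
    assert_eq!(
        shard.effective_timeout(Some(Duration::from_secs(5))),
        Duration::from_secs(5)
    );
}
```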

Comment on lines +29 to +34
let Some(segments_guard) = segments.try_read_for(timeout) else {
    return Err(OperationError::timeout(
        timeout.as_secs() as usize,
        "retrieve points",
    ));
};
Contributor

@coderabbitai bot commented Dec 4, 2025

⚠️ Potential issue | 🟡 Minor

Sub-second timeout precision is lost in error message.

Converting the timeout to seconds via timeout.as_secs() truncates sub-second durations. If a caller passes a timeout less than 1 second (e.g., Duration::from_millis(500)), the error message will read "timed out after 0 seconds", which is misleading.

Consider using milliseconds for better precision:

-        return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
-            "retrieve points",
-        ));
+        return Err(OperationError::timeout(
+            timeout.as_millis() as usize,
+            "retrieve points (ms)",
+        ));

Alternatively, if the timeout() helper must accept seconds, you could round up:

         return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
+            timeout.as_secs().max(1) as usize,
             "retrieve points",
         ));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-let Some(segments_guard) = segments.try_read_for(timeout) else {
-    return Err(OperationError::timeout(
-        timeout.as_secs() as usize,
-        "retrieve points",
-    ));
-};
+let Some(segments_guard) = segments.try_read_for(timeout) else {
+    return Err(OperationError::timeout(
+        timeout.as_millis() as usize,
+        "retrieve points (ms)",
+    ));
+};

Member Author

Good point, I will take care of all instances of that problem in a different PR.

Contributor

@agourlay, sounds good! Would you like me to create a GitHub issue to track this timeout precision improvement across all instances?


🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.

Member Author

Yeah sure go ahead and do that.

Please list precisely in the issue all the instances of this in the code base to make my life easier.

Also assign the issue to myself so I don't forget.

Do not disappoint me.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all instances of OperationError::timeout calls with timeout.as_secs()
echo "=== Searching for OperationError::timeout calls with timeout.as_secs() ==="
rg -nP 'OperationError::timeout\s*\(' -A 3 -B 1

echo -e "\n=== Searching for timeout.as_secs() usage ==="
rg -nP 'timeout\.as_secs\(\)' -C 2

echo -e "\n=== Finding all files that contain timeout precision conversion ==="
rg -l 'timeout\.as_secs\(\)'

Length of output: 13938


🏁 Script executed:

#!/bin/bash
# Create GitHub issue with comprehensive list of all timeout precision instances

gh issue create \
  --title "Improve timeout precision in error messages (sub-second durations truncated)" \
  --assignee agourlay \
  --body "## Problem

When creating timeout errors, we use \`timeout.as_secs()\` which truncates sub-second durations. If a timeout is less than 1 second (e.g., \`Duration::from_millis(500)\`), the error message will read \"timed out after 0 seconds\", which is misleading.

## Related PR and Discussion

- PR: #7588
- Comment: https://github.com/qdrant/qdrant/pull/7588#discussion_r2589088088

## Solution

Use \`timeout.as_millis()\` or \`timeout.as_secs().max(1)\` to preserve precision or ensure non-zero values.

## All Instances Requiring Fix

### Critical: Error Creation (23 instances)

These directly affect error messages and should be fixed:

1. **lib/shard/src/retrieve/retrieve_blocking.rs:31**
   \`\`\`rust
   OperationError::timeout(timeout.as_secs() as usize, \"retrieve points\")
   \`\`\`

2. **src/common/telemetry.rs:117**
   \`\`\`rust
   StorageError::timeout(timeout.as_secs() as usize, \"collections telemetry\")
   \`\`\`

3. **lib/collection/src/shards/local_shard/query.rs:141**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

4. **lib/collection/src/shards/local_shard/scroll.rs:59**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"Query scroll\")
   \`\`\`

5. **lib/collection/src/shards/local_shard/scroll.rs:186**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_id\")
   \`\`\`

6. **lib/collection/src/shards/local_shard/scroll.rs:212**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

7. **lib/collection/src/shards/local_shard/scroll.rs:275**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_field\")
   \`\`\`

8. **lib/collection/src/shards/local_shard/scroll.rs:308**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

9. **lib/collection/src/shards/local_shard/scroll.rs:378**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_randomly\")
   \`\`\`

10. **lib/collection/src/shards/local_shard/scroll.rs:448**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
    \`\`\`

11. **lib/collection/src/shards/local_shard/shard_ops.rs:263**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"count\")
    \`\`\`

12. **lib/collection/src/shards/local_shard/shard_ops.rs:304**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
    \`\`\`

13. **lib/collection/src/shards/local_shard/telemetry.rs:38**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
    \`\`\`

14. **lib/collection/src/shards/local_shard/telemetry.rs:51**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
    \`\`\`

15. **lib/collection/src/shards/local_shard/telemetry.rs:130**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"optimization status\")
    \`\`\`

16. **lib/collection/src/shards/local_shard/formula_rescore.rs:44**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"rescore_with_formula\")
    \`\`\`

17. **lib/collection/src/shards/local_shard/facet.rs:63**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

18. **lib/collection/src/shards/local_shard/facet.rs:142**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

19. **lib/collection/src/shards/local_shard/facet.rs:191**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

20. **lib/collection/src/shards/local_shard/search.rs:143**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"Search\")
    \`\`\`

21. **lib/collection/src/shards/replica_set/mod.rs:592**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"wait for replica set state\")
    \`\`\`

22. **lib/collection/src/grouping/builder.rs:77**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"GroupBy\")
    \`\`\`

23. **lib/collection/src/collection/mmr.rs:42**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"mmr\")
    \`\`\`

### Lower Priority: Logging and Formatting (7 instances)

These are in log messages or error string formatting and have less impact:

24. **lib/collection/src/shards/local_shard/scroll.rs:57** (log message)
25. **lib/collection/src/shards/local_shard/search.rs:141** (log message)
26. **lib/collection/src/grouping/builder.rs:76** (log message)
27. **lib/storage/src/content_manager/consensus_manager.rs:771** (error message formatting)
28. **lib/collection/src/shards/replica_set/update.rs:493** (error message formatting)
29. **lib/collection/src/shards/proxy_shard.rs:124** (retry log message)
30. **lib/collection/src/shards/channel_service.rs:119** (gRPC field - may need different handling)

## Recommended Approach

1. Fix all 23 critical error creation instances first
2. Consider fixing logging instances for consistency
3. Evaluate gRPC field separately (may require protocol consideration)"

Length of output: 44


@agourlay, done! I've created issue #7688 with a comprehensive list of all timeout precision instances in the codebase:

🔗 #7688

The issue includes:

  • 23 critical instances in error creation that directly affect error messages
  • 7 lower-priority instances in logging and formatting
  • Precise file paths and line numbers for each instance
  • Code snippets showing the problematic patterns
  • Recommended fix approaches

The issue has been assigned to you. All instances are categorized by priority to help you tackle the critical ones first, then decide on the logging instances.

    /)/)
   ( •.•)  Mission accomplished!
   />📋  30 instances documented

🧬 Code Graph Analysis Results
  • File: lib/collection/src/collection/point_ops.rs

    • Snippet (lines 398-472)
      pub async fn retrieve(
              &self,
              request: PointRequestInternal,
              read_consistency: Option<ReadConsistency>,
              shard_selection: &ShardSelectorInternal,
              timeout: Option<Duration>,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<Vec<RecordInternal>> {
              if request.ids.is_empty() {
                  return Ok(Vec::new());
              }
              let with_payload_interface = request
                  .with_payload
                  .as_ref()
                  .unwrap_or(&WithPayloadInterface::Bool(false));
              let with_payload = WithPayload::from(with_payload_interface);
              let ids_len = request.ids.len();
              let request = Arc::new(request);
      
              let shard_holder = self.shards_holder.read().await;
              let target_shards = shard_holder.select_shards(shard_selection)?;
              let mut all_shard_collection_requests = target_shards
                  .into_iter()
                  .map(|(shard, shard_key)| {
                      // Explicitly borrow `request` and `with_payload`, so we can use them in `async move`
                      // block below without unnecessarily cloning anything
                      let request = &request;
                      let with_payload = &with_payload;
      
                      let hw_acc = hw_measurement_acc.clone();
      
                      async move {
                          let mut records = shard
                              .retrieve(
                                  request.clone(),
                                  with_payload,
                                  &request.with_vector,
                                  read_consistency,
                                  timeout,
                                  shard_selection.is_shard_id(),
                                  hw_acc,
                              )
                              .await?;
      
                          if shard_key.is_none() {
                              return Ok(records);
                          }
      
                          for point in &mut records {
                              point.shard_key.clone_from(&shard_key.cloned());
                          }
      
                          CollectionResult::Ok(records)
                      }
                  })
                  .collect::<FuturesUnordered<_>>();
      
              // pre-allocate hashmap with capped capacity to protect from malevolent input
              let mut covered_point_ids = HashMap::with_capacity(ids_len.min(1024));
              while let Some(response) = all_shard_collection_requests.try_next().await? {
                  for point in response {
                      // Add each point only once, deduplicate point IDs
                      covered_point_ids.insert(point.id, point);
                  }
              }
      
              // Collect points in the same order as they were requested
              let points = request
                  .ids
                  .iter()
                  .filter_map(|id| covered_point_ids.remove(id))
                  .collect();
      
              Ok(points)
          }
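The dedup-then-reorder step above (insert each shard's records into a map keyed by point id, then re-emit them in request order) can be sketched in isolation. A stdlib-only sketch with illustrative types — `u64`/`String` stand in for `PointIdType`/`RecordInternal`, not the actual Qdrant API:

```rust
use std::collections::HashMap;

// Deduplicate records by id across shard responses, then return them in the
// order the ids were requested, silently dropping ids no shard returned.
fn dedup_and_order(requested: &[u64], responses: Vec<Vec<(u64, String)>>) -> Vec<String> {
    let mut covered: HashMap<u64, String> = HashMap::new();
    for response in responses {
        for (id, record) in response {
            // Later responses overwrite earlier duplicates, as with the
            // `covered_point_ids.insert` call in the snippet above.
            covered.insert(id, record);
        }
    }
    requested
        .iter()
        .filter_map(|id| covered.remove(id))
        .collect()
}
```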
      
  • File: lib/collection/src/shards/local_shard/shard_ops.rs

    • Snippet (lines 277-316)
      async fn retrieve(
              &self,
              request: Arc<PointRequestInternal>,
              with_payload: &WithPayload,
              with_vector: &WithVector,
              search_runtime_handle: &Handle,
              timeout: Option<Duration>,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<Vec<RecordInternal>> {
              // Check read rate limiter before proceeding
              self.check_read_rate_limiter(&hw_measurement_acc, "retrieve", || request.ids.len())?;
              let timeout = timeout.unwrap_or(self.shared_storage_config.search_timeout);
      
              let start_time = Instant::now();
              let records_map = tokio::time::timeout(
                  timeout,
                  SegmentsSearcher::retrieve(
                      self.segments.clone(),
                      &request.ids,
                      with_payload,
                      with_vector,
                      search_runtime_handle,
                      timeout,
                      hw_measurement_acc,
                  ),
              )
              .await
              .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
      
              let ordered_records = request
                  .ids
                  .iter()
                  .filter_map(|point| records_map.get(point).cloned())
                  .collect();
      
              let elapsed = start_time.elapsed();
              log_request_to_collector(&self.collection_name, elapsed, || request);
      
              Ok(ordered_records)
          }
      
  • File: lib/collection/src/collection_manager/segments_searcher.rs

    • Snippet (lines 360-390)
      pub async fn retrieve(
              segments: LockedSegmentHolder,
              points: &[PointIdType],
              with_payload: &WithPayload,
              with_vector: &WithVector,
              runtime_handle: &Handle,
              timeout: Duration,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<AHashMap<PointIdType, RecordInternal>> {
              let stopping_guard = StoppingGuard::new();
              let points = runtime_handle.spawn_blocking({
                  let segments = segments.clone();
                  let points = points.to_vec();
                  let with_payload = with_payload.clone();
                  let with_vector = with_vector.clone();
                  let is_stopped = stopping_guard.get_is_stopped();
                  // TODO create one Task per segment level retrieve
                  move || {
                      retrieve_blocking(
                          segments,
                          &points,
                          &with_payload,
                          &with_vector,
                          timeout,
                          &is_stopped,
                          hw_measurement_acc,
                      )
                  }
              });
              Ok(AbortOnDropHandle::new(points).await??)
          }
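The core of the change is that `retrieve_blocking` no longer waits indefinitely to acquire the segment holder lock. A stdlib-only sketch of the acquire-with-deadline idea — `std::sync::RwLock` stands in for the segment holder's lock type, and this polling loop is illustrative only; a lock with a native timed acquire (e.g. parking_lot's `try_read_for`) does this without sleeping:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Poll `try_read` until a deadline instead of blocking forever on `read`,
// so a long-held write lock turns into a timeout error rather than a hang.
fn read_with_timeout<T>(
    lock: &RwLock<T>,
    timeout: Duration,
) -> Result<std::sync::RwLockReadGuard<'_, T>, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_read() {
            return Ok(guard);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {timeout:?} acquiring read lock"));
        }
        std::thread::sleep(Duration::from_millis(1));
    }
}
```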
      
  • File: lib/segment/src/common/operation_error.rs

    • Snippet (lines 119-126)
      pub fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
              Self::Timeout {
                  description: format!(
                      "Operation '{}' timed out after {timeout_sec} seconds",
                      operation.into()
                  ),
              }
          }
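Combined with the `#[error("Timeout error: {description}")]` attribute from the diff above, the new variant formats like this. A stdlib-only sketch with a hand-written `Display` standing in for thiserror's derive (`OpError` is an illustrative name, not the real `CollectionError`):

```rust
use std::fmt;

// Minimal error enum mirroring the Cancelled/Timeout variants and the
// `timeout` constructor helper shown in the snippet above.
#[allow(dead_code)]
#[derive(Debug)]
enum OpError {
    Cancelled { description: String },
    Timeout { description: String },
}

impl OpError {
    fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
        OpError::Timeout {
            description: format!(
                "Operation '{}' timed out after {timeout_sec} seconds",
                operation.into()
            ),
        }
    }
}

impl fmt::Display for OpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OpError::Cancelled { description } => write!(f, "Operation cancelled: {description}"),
            OpError::Timeout { description } => write!(f, "Timeout error: {description}"),
        }
    }
}
```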
      
  • File: lib/edge/src/retrieve.rs

    • Snippet (lines 12-38)
      pub fn retrieve(
              &self,
              point_ids: &[ExtendedPointId],
              with_payload: Option<WithPayloadInterface>,
              with_vector: Option<WithVector>,
          ) -> OperationResult<Vec<RecordInternal>> {
              let with_payload =
                  WithPayload::from(with_payload.unwrap_or(WithPayloadInterface::Bool(true)));
              let with_vector = with_vector.unwrap_or(WithVector::Bool(false));
      
              let mut points = retrieve_blocking(
                  self.segments.clone(),
                  point_ids,
                  &with_payload,
                  &with_vector,
                  DEFAULT_EDGE_TIMEOUT,
                  &AtomicBool::new(false),
                  HwMeasurementAcc::disposable(),
              )?;
      
              let points: Vec<_> = point_ids
                  .iter()
                  .filter_map(|id| points.remove(id))
                  .collect();
      
              Ok(points)
          }
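The `&AtomicBool::new(false)` above is the cooperative-stop flag that `retrieve_blocking` checks as it works; the edge path simply passes a flag that never flips. A stdlib-only sketch of that pattern with illustrative names (in the async path the flag is wired up by `StoppingGuard` instead):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Cooperative cancellation: the blocking worker checks the shared flag
// between items and bails out early once it flips.
fn process_until_stopped(items: &[u32], is_stopped: &AtomicBool) -> Vec<u32> {
    let mut out = Vec::new();
    for &item in items {
        if is_stopped.load(Ordering::Relaxed) {
            break;
        }
        out.push(item * 2);
    }
    out
}
```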
      

🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.


Member Author

Good rabbit! 👏

Contributor

@agourlay, thank you! Happy to help! 🐰✨

Feel free to ping me if you need anything else while working through those instances.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 4, 2025
@agourlay agourlay merged commit da8ccdd into dev Dec 4, 2025
15 checks passed
@agourlay agourlay deleted the timeout-segment-holder-on-retrieve branch December 4, 2025 14:26
timvisee pushed a commit that referenced this pull request Dec 18, 2025
* Timeout on point retrieve when acquiring segment holder

* review: less radical and cleaner

* highly embarassing I hope no one sees this!

* coderabbit
@timvisee timvisee mentioned this pull request Dec 19, 2025