
Timeout on point retrieve when acquiring segment holder #7588

Merged
agourlay merged 4 commits into dev from timeout-segment-holder-on-retrieve on Dec 4, 2025

Conversation

@agourlay (Member) commented Nov 24, 2025

This PR adds a timeout to retrieve_blocking to make sure it does not wait indefinitely while trying to acquire the segment holder lock.

A unit test has been added showing that the timeout kicks in if a writer holds onto the lock for a long period of time.

The function retrieve_blocking is used mostly on the read path when searching, querying, or scrolling points.

I decided to hard-code a few timeouts for convenience:

  • in tests
  • in the edge crate

TBH I do not like the approach for the edge crate, happy to hear better suggestions.
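For readers unfamiliar with the mechanism: the fix boils down to replacing a blocking read-lock acquisition on the segment holder with a timed one. Below is a minimal stdlib-only sketch; qdrant itself uses parking_lot's try_read_for, and the polling loop here is only a stand-in with the same observable behavior. The scenario in main mirrors the unit test described above: a writer holds the lock longer than the reader is willing to wait.

```rust
use std::sync::{Arc, RwLock, RwLockReadGuard};
use std::thread;
use std::time::{Duration, Instant};

// Poll `try_read` until a deadline. parking_lot's `try_read_for` does this
// without busy-waiting; the loop below is a stdlib-only approximation.
fn try_read_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockReadGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_read() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

fn main() {
    let segments = Arc::new(RwLock::new(vec![1, 2, 3]));

    // A writer grabs the lock and holds it for 300 ms.
    let writer = {
        let segments = Arc::clone(&segments);
        thread::spawn(move || {
            let _guard = segments.write().unwrap();
            thread::sleep(Duration::from_millis(300));
        })
    };
    thread::sleep(Duration::from_millis(50)); // let the writer acquire first

    // The read path now fails fast instead of blocking indefinitely.
    assert!(try_read_for(&segments, Duration::from_millis(100)).is_none());

    // Once the writer releases the lock, reads succeed again.
    writer.join().unwrap();
    assert!(try_read_for(&segments, Duration::from_millis(100)).is_some());
}
```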

assert_eq!(records.len(), 3);
}

#[test]
Member Author
test 👇

@agourlay agourlay added this to the Better read timeouts milestone Nov 24, 2025
@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch 4 times, most recently from 06880cb to 5b66afb on December 3, 2025 10:49
#[error("Operation cancelled: {description}")]
Cancelled { description: String },
#[error("Timeout error: {description}")]
Timeout { description: String },
Member Author

a new variant was required to propagate the error
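A minimal sketch of the new variant and how it propagates: the real code derives Display with thiserror's #[error(...)] attributes as shown in the diff above; the manual Display impl and the exact description wording in the `timeout` constructor below are assumptions based on the snippets quoted later in this review.

```rust
use std::fmt;

// Sketch of the error variants from the diff above.
#[derive(Debug)]
enum OperationError {
    Cancelled { description: String },
    Timeout { description: String },
}

impl OperationError {
    // Mirrors the `OperationError::timeout(secs, what)` calls seen in
    // retrieve_blocking; the description format is an assumption.
    fn timeout(secs: usize, what: &str) -> Self {
        OperationError::Timeout {
            description: format!("{what} timed out after {secs} seconds"),
        }
    }
}

// Stands in for thiserror's derived Display, using the same format strings.
impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OperationError::Cancelled { description } => {
                write!(f, "Operation cancelled: {description}")
            }
            OperationError::Timeout { description } => {
                write!(f, "Timeout error: {description}")
            }
        }
    }
}

fn main() {
    let err = OperationError::timeout(5, "retrieve points");
    assert_eq!(
        err.to_string(),
        "Timeout error: retrieve points timed out after 5 seconds"
    );
}
```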

@agourlay agourlay marked this pull request as ready for review December 3, 2025 11:22
@agourlay agourlay requested review from ffuugoo and timvisee December 3, 2025 11:22
coderabbitai[bot]: This comment was marked as resolved.

use crate::operations::types::VectorsConfig;
use crate::operations::vector_params_builder::VectorParamsBuilder;

pub const TEST_TIMEOUT: Duration = Duration::from_hours(1);
Member

Wow!

Since this is in tests, shall we do a much lower timeout (like 10 seconds?) so we catch any operations that are stuck for some time.

&point_ids,
&WithPayload::from(with_payload),
&with_vector,
Duration::from_hours(1),
Member

Move this to a constant (as well as the other usages in edge)?

@qdrant qdrant deleted a comment from coderabbitai bot Dec 3, 2025
@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch from 5b66afb to e318cce on December 4, 2025 10:41
coderabbitai[bot]: This comment was marked as resolved.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/edge/src/lib.rs (1)

11-11: Centralizing the edge timeout is good; reconsider visibility and default choice

Defining a single DEFAULT_EDGE_TIMEOUT constant is a nice improvement over scattering literals, and using Duration here is straightforward.

Two small design thoughts:

  1. Visibility (pub vs internal)
    If this timeout is only meant to be used inside the edge crate, you might not want to expose it as part of the public API:

    • Using const DEFAULT_EDGE_TIMEOUT: Duration = …; (or pub(crate)) keeps it internal and avoids committing to this exact value as an external contract.
    • That gives you more freedom to tweak the default later without worrying about downstream users relying on it.
  2. Documenting the 1‑hour default
    Since this is now the de facto timeout for edge retrieval/scroll paths, consider adding a short comment explaining why 1 hour was chosen and under what assumptions it’s safe (e.g., “matches external request deadline”, “intended to be conservative until we plumb per-request timeouts”, etc.). That will help future changes align with your intent.

Also applies to: 185-186

lib/edge/src/retrieve.rs (1)

9-9: Retrieve now correctly participates in the shared timeout policy

Importing DEFAULT_EDGE_TIMEOUT and passing it into retrieve_blocking makes the plain Shard::retrieve path consistent with the other edge read paths and ensures it no longer waits indefinitely on the segment holder.

If you later introduce per-request or configurable timeouts (as hinted in the PR description), this method should be a straightforward place to thread such a parameter through instead of always using the default.

Also applies to: 22-30

lib/edge/src/scroll.rs (1)

108-116: Timeout propagation is consistent, but consider configurability and initial lock acquisition

Passing DEFAULT_EDGE_TIMEOUT into retrieve_blocking in all three scroll helpers keeps scroll behavior aligned with other edge retrievals and ensures these calls now respect the same bounded wait policy.

Two follow-ups you may want to consider (can be done in a later PR):

  1. Configurable/request-scoped timeout
    Right now the scroll timeout is controlled only via a crate-level default. Given this is on the query/scroll path, you might eventually want to derive the timeout from:

    • a higher-level request deadline (HTTP/gRPC or client-supplied), or
    • a collection/cluster configuration value,
      instead of a fixed default baked into the edge crate.
  2. Unbounded self.segments.read() before retrieve_blocking
    Each scroll helper still does an initial self.segments.read().split_segments() without a timeout. If a writer holds the segment-holder write lock for a long time, these reads can still block indefinitely before you ever reach retrieve_blocking, so the new timeout won't fire in that scenario. If your goal is to avoid any potentially unbounded waits on the segment holder in the read path, it may be worth introducing a helper that acquires this first read lock with the same timeout semantics as retrieve_blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e318cce and 5f4b76e.

📒 Files selected for processing (5)
  • lib/collection/src/collection_manager/fixtures.rs (2 hunks)
  • lib/edge/src/lib.rs (2 hunks)
  • lib/edge/src/query.rs (2 hunks)
  • lib/edge/src/retrieve.rs (2 hunks)
  • lib/edge/src/scroll.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/collection/src/collection_manager/fixtures.rs
  • lib/edge/src/query.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/edge/src/scroll.rs
  • lib/edge/src/retrieve.rs
  • lib/edge/src/lib.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings

Applied to files:

  • lib/edge/src/scroll.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: lint
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: Build Qdrant Edge Python bindings
🔇 Additional comments (1)
lib/edge/src/scroll.rs (1)

16-18: Use of shared DEFAULT_EDGE_TIMEOUT in scroll paths looks good

Importing DEFAULT_EDGE_TIMEOUT here cleanly shares the same default timeout policy between scroll and other edge retrieval paths, avoiding per-call hard-coding.

@agourlay agourlay force-pushed the timeout-segment-holder-on-retrieve branch from e6dab72 to 5f2c37d on December 4, 2025 13:25
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
lib/edge/src/scroll.rs (1)

266-274: LGTM! scroll_randomly now uses centralized timeout constant.

Note: The previous review comment about Duration::from_hours(1) MSRV compatibility is addressed by using the DEFAULT_EDGE_TIMEOUT constant - the fix should be applied in lib/edge/src/lib.rs where the constant is defined.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f4b76e and 5f2c37d.

📒 Files selected for processing (14)
  • lib/collection/src/collection_manager/collection_updater.rs (9 hunks)
  • lib/collection/src/collection_manager/fixtures.rs (2 hunks)
  • lib/collection/src/collection_manager/segments_searcher.rs (4 hunks)
  • lib/collection/src/collection_manager/tests/mod.rs (5 hunks)
  • lib/collection/src/operations/types.rs (1 hunks)
  • lib/collection/src/shards/local_shard/query.rs (1 hunks)
  • lib/collection/src/shards/local_shard/scroll.rs (3 hunks)
  • lib/collection/src/shards/local_shard/shard_ops.rs (1 hunks)
  • lib/edge/src/lib.rs (2 hunks)
  • lib/edge/src/query.rs (2 hunks)
  • lib/edge/src/retrieve.rs (2 hunks)
  • lib/edge/src/scroll.rs (4 hunks)
  • lib/segment/src/common/operation_error.rs (2 hunks)
  • lib/shard/src/retrieve/retrieve_blocking.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • lib/collection/src/collection_manager/fixtures.rs
  • lib/collection/src/collection_manager/collection_updater.rs
  • lib/collection/src/shards/local_shard/scroll.rs
  • lib/edge/src/query.rs
  • lib/collection/src/collection_manager/segments_searcher.rs
  • lib/collection/src/operations/types.rs
  • lib/collection/src/shards/local_shard/shard_ops.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/edge/src/scroll.rs
  • lib/edge/src/retrieve.rs
  • lib/shard/src/retrieve/retrieve_blocking.rs
  • lib/edge/src/lib.rs
  • lib/collection/src/collection_manager/tests/mod.rs
  • lib/collection/src/shards/local_shard/query.rs
  • lib/segment/src/common/operation_error.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
📚 Learning: 2025-11-25T11:56:33.176Z
Learnt from: CR
Repo: qdrant/qdrant PR: 0
File: .github/review-rules.md:0-0
Timestamp: 2025-11-25T11:56:33.176Z
Learning: For Rust code review, checking whether code compiles is out of scope, including: missing imports for `size_of` or `align_of` (included in Rust 1.80+ prelude), methods not implemented for types, different method names across `rand` crate versions (e.g., `random_range` vs `gen_range`), and other compiler errors or warnings

Applied to files:

  • lib/edge/src/scroll.rs
🧬 Code graph analysis (3)
lib/shard/src/retrieve/retrieve_blocking.rs (2)
lib/collection/src/operations/types.rs (2)
  • timeout (935-942)
  • service_error (944-949)
lib/segment/src/common/operation_error.rs (2)
  • timeout (119-126)
  • service_error (86-91)
lib/collection/src/collection_manager/tests/mod.rs (2)
lib/collection/src/collection_manager/fixtures.rs (3)
  • build_segment_1 (121-171)
  • build_segment_2 (173-211)
  • empty_segment (29-31)
lib/segment/tests/integration/fixtures/segment.rs (3)
  • build_segment_1 (26-76)
  • build_segment_2 (78-128)
  • empty_segment (19-21)
lib/collection/src/shards/local_shard/query.rs (3)
lib/collection/src/operations/types.rs (1)
  • timeout (935-942)
lib/segment/src/common/operation_error.rs (1)
  • timeout (119-126)
src/actix/api/service_api.rs (2)
  • timeout (38-40)
  • timeout (80-82)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: Build Qdrant Edge Python bindings
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (13)
lib/collection/src/shards/local_shard/query.rs (1)

128-141: LGTM! Timeout correctly propagated to both wrapper and retrieve call.

The timeout is appropriately passed to both the outer tokio::time::timeout wrapper and the inner SegmentsSearcher::retrieve call, ensuring consistent timeout behavior at both the async and segment-holder lock acquisition levels.

lib/collection/src/collection_manager/tests/mod.rs (5)

20-22: LGTM! Import updated correctly.

The import now includes TEST_TIMEOUT from fixtures, aligning with the new timeout parameter requirement.


290-299: LGTM! Test updated with timeout parameter.

The retrieve_blocking call in test_delete_all_point_versions correctly includes TEST_TIMEOUT as the new timeout argument.


336-345: LGTM! Second retrieve_blocking call updated consistently.


453-462: LGTM! Test test_proxy_shared_updates updated with timeout.


589-598: LGTM! Test test_proxy_shared_updates_same_version updated with timeout.

lib/segment/src/common/operation_error.rs (3)

51-52: LGTM! New Timeout variant appropriately added.

The Timeout variant follows the established pattern of other error variants with a description field and consistent error message formatting.


119-126: LGTM! The timeout constructor is well-structured.

The method signature and implementation align with the similar CollectionError::timeout pattern shown in the relevant code snippets.


86-117: LGTM! Refactored to return Self for consistency.

The constructors now use Self return type, which is more idiomatic and consistent with the new timeout method.

lib/edge/src/scroll.rs (3)

17-17: LGTM! Import added for centralized timeout constant.

Using a shared constant is better than hardcoding timeout values at each call site.


108-116: LGTM! scroll_by_id now uses centralized timeout constant.


162-170: LGTM! scroll_by_field now uses centralized timeout constant.

lib/shard/src/retrieve/retrieve_blocking.rs (1)

36-87: LGTM!

The retrieval logic correctly preserves the existing behavior while operating under the timeout-guarded lock. The optimization to keep the latest point version and the conditional payload/vector assembly are maintained appropriately.

point_ids,
&with_payload,
&with_vector,
DEFAULT_EDGE_TIMEOUT,
Contributor

⚠️ Potential issue | 🟠 Major

Reconsider the 1-hour default timeout for user-facing operations.

A 1-hour timeout for point retrieval operations is excessively long for user-facing read paths. This could:

  • Mask performance issues or deadlocks that should fail faster
  • Lead to poor user experience if operations hang for extended periods
  • Tie up resources unnecessarily

Consider making the timeout configurable (e.g., via constructor parameter or config) with a more conservative default (e.g., 30-60 seconds for edge operations).

I see you noted dissatisfaction with this approach in the PR description. Would you like help exploring alternatives, such as:

  • Adding a configurable timeout field to the Shard struct
  • Accepting timeout as a parameter in the public retrieve method
  • Using a more conservative default (e.g., 30-60 seconds) with override capability
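The second and third alternatives above could be combined along these lines. This is a hypothetical sketch; names like EdgeShard and retrieve_timeout are illustrative and not qdrant's actual API.

```rust
use std::time::Duration;

// Hypothetical shard with a construction-time default timeout that a
// per-request deadline can override.
struct EdgeShard {
    retrieve_timeout: Duration,
}

impl EdgeShard {
    fn new(retrieve_timeout: Option<Duration>) -> Self {
        // A conservative default in the 30-60 s range, as suggested above.
        EdgeShard {
            retrieve_timeout: retrieve_timeout.unwrap_or(Duration::from_secs(60)),
        }
    }

    // A per-request deadline wins over the shard default when supplied.
    fn effective_timeout(&self, per_request: Option<Duration>) -> Duration {
        per_request.unwrap_or(self.retrieve_timeout)
    }
}

fn main() {
    let shard = EdgeShard::new(None);
    assert_eq!(shard.effective_timeout(None), Duration::from_secs(60));
    assert_eq!(
        shard.effective_timeout(Some(Duration::from_secs(5))),
        Duration::from_secs(5)
    );
}
```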

Comment on lines +29 to +34
let Some(segments_guard) = segments.try_read_for(timeout) else {
    return Err(OperationError::timeout(
        timeout.as_secs() as usize,
        "retrieve points",
    ));
};
Contributor

@coderabbitai bot commented Dec 4, 2025

⚠️ Potential issue | 🟡 Minor

Sub-second timeout precision is lost in error message.

Converting the timeout to seconds via timeout.as_secs() truncates sub-second durations. If a caller passes a timeout less than 1 second (e.g., Duration::from_millis(500)), the error message will read "timed out after 0 seconds", which is misleading.

Consider using milliseconds for better precision:

-        return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
-            "retrieve points",
-        ));
+        return Err(OperationError::timeout(
+            timeout.as_millis() as usize,
+            "retrieve points (ms)",
+        ));

Alternatively, if the timeout() helper must accept seconds, you could round up:

         return Err(OperationError::timeout(
-            timeout.as_secs() as usize,
+            timeout.as_secs().max(1) as usize,
             "retrieve points",
         ));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-let Some(segments_guard) = segments.try_read_for(timeout) else {
-    return Err(OperationError::timeout(
-        timeout.as_secs() as usize,
-        "retrieve points",
-    ));
-};
+let Some(segments_guard) = segments.try_read_for(timeout) else {
+    return Err(OperationError::timeout(
+        timeout.as_millis() as usize,
+        "retrieve points (ms)",
+    ));
+};

Member Author

Good point, I will take care of all instances of that problem in a different PR.

Contributor

@agourlay, sounds good! Would you like me to create a GitHub issue to track this timeout precision improvement across all instances?


🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.

Member Author

Yeah sure go ahead and do that.

Please list precisely in the issue all the instances of this in the code base to make my life easier.

Also assign the issue to myself so I don't forget.

Do not disappoint me.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all instances of OperationError::timeout calls with timeout.as_secs()
echo "=== Searching for OperationError::timeout calls with timeout.as_secs() ==="
rg -nP 'OperationError::timeout\s*\(' -A 3 -B 1

echo -e "\n=== Searching for timeout.as_secs() usage ==="
rg -nP 'timeout\.as_secs\(\)' -C 2

echo -e "\n=== Finding all files that contain timeout precision conversion ==="
rg -l 'timeout\.as_secs\(\)'

Length of output: 13938


🏁 Script executed:

#!/bin/bash
# Create GitHub issue with comprehensive list of all timeout precision instances

gh issue create \
  --title "Improve timeout precision in error messages (sub-second durations truncated)" \
  --assignee agourlay \
  --body "## Problem

When creating timeout errors, we use \`timeout.as_secs()\` which truncates sub-second durations. If a timeout is less than 1 second (e.g., \`Duration::from_millis(500)\`), the error message will read \"timed out after 0 seconds\", which is misleading.

## Related PR and Discussion

- PR: #7588
- Comment: https://github.com/qdrant/qdrant/pull/7588#discussion_r2589088088

## Solution

Use \`timeout.as_millis()\` or \`timeout.as_secs().max(1)\` to preserve precision or ensure non-zero values.

## All Instances Requiring Fix

### Critical: Error Creation (23 instances)

These directly affect error messages and should be fixed:

1. **lib/shard/src/retrieve/retrieve_blocking.rs:31**
   \`\`\`rust
   OperationError::timeout(timeout.as_secs() as usize, \"retrieve points\")
   \`\`\`

2. **src/common/telemetry.rs:117**
   \`\`\`rust
   StorageError::timeout(timeout.as_secs() as usize, \"collections telemetry\")
   \`\`\`

3. **lib/collection/src/shards/local_shard/query.rs:141**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

4. **lib/collection/src/shards/local_shard/scroll.rs:59**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"Query scroll\")
   \`\`\`

5. **lib/collection/src/shards/local_shard/scroll.rs:186**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_id\")
   \`\`\`

6. **lib/collection/src/shards/local_shard/scroll.rs:212**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

7. **lib/collection/src/shards/local_shard/scroll.rs:275**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_by_field\")
   \`\`\`

8. **lib/collection/src/shards/local_shard/scroll.rs:308**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
   \`\`\`

9. **lib/collection/src/shards/local_shard/scroll.rs:378**
   \`\`\`rust
   CollectionError::timeout(timeout.as_secs() as usize, \"scroll_randomly\")
   \`\`\`

10. **lib/collection/src/shards/local_shard/scroll.rs:448**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
    \`\`\`

11. **lib/collection/src/shards/local_shard/shard_ops.rs:263**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"count\")
    \`\`\`

12. **lib/collection/src/shards/local_shard/shard_ops.rs:304**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"retrieve\")
    \`\`\`

13. **lib/collection/src/shards/local_shard/telemetry.rs:38**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
    \`\`\`

14. **lib/collection/src/shards/local_shard/telemetry.rs:51**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"shard telemetry\")
    \`\`\`

15. **lib/collection/src/shards/local_shard/telemetry.rs:130**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"optimization status\")
    \`\`\`

16. **lib/collection/src/shards/local_shard/formula_rescore.rs:44**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"rescore_with_formula\")
    \`\`\`

17. **lib/collection/src/shards/local_shard/facet.rs:63**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

18. **lib/collection/src/shards/local_shard/facet.rs:142**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

19. **lib/collection/src/shards/local_shard/facet.rs:191**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"facet\")
    \`\`\`

20. **lib/collection/src/shards/local_shard/search.rs:143**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"Search\")
    \`\`\`

21. **lib/collection/src/shards/replica_set/mod.rs:592**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"wait for replica set state\")
    \`\`\`

22. **lib/collection/src/grouping/builder.rs:77**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"GroupBy\")
    \`\`\`

23. **lib/collection/src/collection/mmr.rs:42**
    \`\`\`rust
    CollectionError::timeout(timeout.as_secs() as usize, \"mmr\")
    \`\`\`

### Lower Priority: Logging and Formatting (7 instances)

These are in log messages or error string formatting and have less impact:

24. **lib/collection/src/shards/local_shard/scroll.rs:57** (log message)
25. **lib/collection/src/shards/local_shard/search.rs:141** (log message)
26. **lib/collection/src/grouping/builder.rs:76** (log message)
27. **lib/storage/src/content_manager/consensus_manager.rs:771** (error message formatting)
28. **lib/collection/src/shards/replica_set/update.rs:493** (error message formatting)
29. **lib/collection/src/shards/proxy_shard.rs:124** (retry log message)
30. **lib/collection/src/shards/channel_service.rs:119** (gRPC field - may need different handling)

## Recommended Approach

1. Fix all 23 critical error creation instances first
2. Consider fixing logging instances for consistency
3. Evaluate gRPC field separately (may require protocol consideration)"

Length of output: 44


@agourlay, done! I've created issue #7688 with a comprehensive list of all timeout precision instances in the codebase:

🔗 #7688

The issue includes:

  • 23 critical instances in error creation that directly affect error messages
  • 7 lower-priority instances in logging and formatting
  • Precise file paths and line numbers for each instance
  • Code snippets showing the problematic patterns
  • Recommended fix approaches

The issue has been assigned to you. All instances are categorized by priority to help you tackle the critical ones first, then decide on the logging instances.

    /)/)
   ( •.•)  Mission accomplished!
   />📋  30 instances documented

🧬 Code Graph Analysis Results
  • File: lib/collection/src/collection/point_ops.rs

    • Snippet (lines 398-472)
      pub async fn retrieve(
              &self,
              request: PointRequestInternal,
              read_consistency: Option<ReadConsistency>,
              shard_selection: &ShardSelectorInternal,
              timeout: Option<Duration>,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<Vec<RecordInternal>> {
              if request.ids.is_empty() {
                  return Ok(Vec::new());
              }
              let with_payload_interface = request
                  .with_payload
                  .as_ref()
                  .unwrap_or(&WithPayloadInterface::Bool(false));
              let with_payload = WithPayload::from(with_payload_interface);
              let ids_len = request.ids.len();
              let request = Arc::new(request);
      
              let shard_holder = self.shards_holder.read().await;
              let target_shards = shard_holder.select_shards(shard_selection)?;
              let mut all_shard_collection_requests = target_shards
                  .into_iter()
                  .map(|(shard, shard_key)| {
                      // Explicitly borrow `request` and `with_payload`, so we can use them in `async move`
                      // block below without unnecessarily cloning anything
                      let request = &request;
                      let with_payload = &with_payload;
      
                      let hw_acc = hw_measurement_acc.clone();
      
                      async move {
                          let mut records = shard
                              .retrieve(
                                  request.clone(),
                                  with_payload,
                                  &request.with_vector,
                                  read_consistency,
                                  timeout,
                                  shard_selection.is_shard_id(),
                                  hw_acc,
                              )
                              .await?;
      
                          if shard_key.is_none() {
                              return Ok(records);
                          }
      
                          for point in &mut records {
                              point.shard_key.clone_from(&shard_key.cloned());
                          }
      
                          CollectionResult::Ok(records)
                      }
                  })
                  .collect::<FuturesUnordered<_>>();
      
              // pre-allocate hashmap with capped capacity to protect from malevolent input
              let mut covered_point_ids = HashMap::with_capacity(ids_len.min(1024));
              while let Some(response) = all_shard_collection_requests.try_next().await? {
                  for point in response {
                      // Add each point only once, deduplicate point IDs
                      covered_point_ids.insert(point.id, point);
                  }
              }
      
              // Collect points in the same order as they were requested
              let points = request
                  .ids
                  .iter()
                  .filter_map(|id| covered_point_ids.remove(id))
                  .collect();
      
              Ok(points)
          }
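The dedup-then-reorder step above (insert each shard's records into a map keyed by point id, then re-emit them in request order) can be sketched in isolation. A stdlib-only sketch with illustrative types — `u64`/`String` stand in for `PointIdType`/`RecordInternal`, not the actual Qdrant API:

```rust
use std::collections::HashMap;

// Deduplicate records by id across shard responses, then return them in the
// order the ids were requested, silently dropping ids no shard returned.
fn dedup_and_order(requested: &[u64], responses: Vec<Vec<(u64, String)>>) -> Vec<String> {
    let mut covered: HashMap<u64, String> = HashMap::new();
    for response in responses {
        for (id, record) in response {
            // Later responses overwrite earlier duplicates, as with the
            // `covered_point_ids.insert` call in the snippet above.
            covered.insert(id, record);
        }
    }
    requested
        .iter()
        .filter_map(|id| covered.remove(id))
        .collect()
}
```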
      
  • File: lib/collection/src/shards/local_shard/shard_ops.rs

    • Snippet (lines 277-316)
      async fn retrieve(
              &self,
              request: Arc<PointRequestInternal>,
              with_payload: &WithPayload,
              with_vector: &WithVector,
              search_runtime_handle: &Handle,
              timeout: Option<Duration>,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<Vec<RecordInternal>> {
              // Check read rate limiter before proceeding
              self.check_read_rate_limiter(&hw_measurement_acc, "retrieve", || request.ids.len())?;
              let timeout = timeout.unwrap_or(self.shared_storage_config.search_timeout);
      
              let start_time = Instant::now();
              let records_map = tokio::time::timeout(
                  timeout,
                  SegmentsSearcher::retrieve(
                      self.segments.clone(),
                      &request.ids,
                      with_payload,
                      with_vector,
                      search_runtime_handle,
                      timeout,
                      hw_measurement_acc,
                  ),
              )
              .await
              .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
      
              let ordered_records = request
                  .ids
                  .iter()
                  .filter_map(|point| records_map.get(point).cloned())
                  .collect();
      
              let elapsed = start_time.elapsed();
              log_request_to_collector(&self.collection_name, elapsed, || request);
      
              Ok(ordered_records)
          }
      
  • File: lib/collection/src/collection_manager/segments_searcher.rs

    • Snippet (lines 360-390)
      pub async fn retrieve(
              segments: LockedSegmentHolder,
              points: &[PointIdType],
              with_payload: &WithPayload,
              with_vector: &WithVector,
              runtime_handle: &Handle,
              timeout: Duration,
              hw_measurement_acc: HwMeasurementAcc,
          ) -> CollectionResult<AHashMap<PointIdType, RecordInternal>> {
              let stopping_guard = StoppingGuard::new();
              let points = runtime_handle.spawn_blocking({
                  let segments = segments.clone();
                  let points = points.to_vec();
                  let with_payload = with_payload.clone();
                  let with_vector = with_vector.clone();
                  let is_stopped = stopping_guard.get_is_stopped();
                  // TODO create one Task per segment level retrieve
                  move || {
                      retrieve_blocking(
                          segments,
                          &points,
                          &with_payload,
                          &with_vector,
                          timeout,
                          &is_stopped,
                          hw_measurement_acc,
                      )
                  }
              });
              Ok(AbortOnDropHandle::new(points).await??)
          }
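The core of the change is that `retrieve_blocking` no longer waits indefinitely to acquire the segment holder lock. A stdlib-only sketch of the acquire-with-deadline idea — `std::sync::RwLock` stands in for the segment holder's lock type, and this polling loop is illustrative only; a lock with a native timed acquire (e.g. parking_lot's `try_read_for`) does this without sleeping:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Poll `try_read` until a deadline instead of blocking forever on `read`,
// so a long-held write lock turns into a timeout error rather than a hang.
fn read_with_timeout<T>(
    lock: &RwLock<T>,
    timeout: Duration,
) -> Result<std::sync::RwLockReadGuard<'_, T>, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_read() {
            return Ok(guard);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {timeout:?} acquiring read lock"));
        }
        std::thread::sleep(Duration::from_millis(1));
    }
}
```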
      
  • File: lib/segment/src/common/operation_error.rs

    • Snippet (lines 119-126)
      pub fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
              Self::Timeout {
                  description: format!(
                      "Operation '{}' timed out after {timeout_sec} seconds",
                      operation.into()
                  ),
              }
          }
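Combined with the `#[error("Timeout error: {description}")]` attribute from the diff above, the new variant formats like this. A stdlib-only sketch with a hand-written `Display` standing in for thiserror's derive (`OpError` is an illustrative name, not the real `CollectionError`):

```rust
use std::fmt;

// Minimal error enum mirroring the Cancelled/Timeout variants and the
// `timeout` constructor helper shown in the snippet above.
#[allow(dead_code)]
#[derive(Debug)]
enum OpError {
    Cancelled { description: String },
    Timeout { description: String },
}

impl OpError {
    fn timeout(timeout_sec: usize, operation: impl Into<String>) -> Self {
        OpError::Timeout {
            description: format!(
                "Operation '{}' timed out after {timeout_sec} seconds",
                operation.into()
            ),
        }
    }
}

impl fmt::Display for OpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OpError::Cancelled { description } => write!(f, "Operation cancelled: {description}"),
            OpError::Timeout { description } => write!(f, "Timeout error: {description}"),
        }
    }
}
```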
      
  • File: lib/edge/src/retrieve.rs

    • Snippet (lines 12-38)
      pub fn retrieve(
              &self,
              point_ids: &[ExtendedPointId],
              with_payload: Option<WithPayloadInterface>,
              with_vector: Option<WithVector>,
          ) -> OperationResult<Vec<RecordInternal>> {
              let with_payload =
                  WithPayload::from(with_payload.unwrap_or(WithPayloadInterface::Bool(true)));
              let with_vector = with_vector.unwrap_or(WithVector::Bool(false));
      
              let mut points = retrieve_blocking(
                  self.segments.clone(),
                  point_ids,
                  &with_payload,
                  &with_vector,
                  DEFAULT_EDGE_TIMEOUT,
                  &AtomicBool::new(false),
                  HwMeasurementAcc::disposable(),
              )?;
      
              let points: Vec<_> = point_ids
                  .iter()
                  .filter_map(|id| points.remove(id))
                  .collect();
      
              Ok(points)
          }
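The `&AtomicBool::new(false)` above is the cooperative-stop flag that `retrieve_blocking` checks as it works; the edge path simply passes a flag that never flips. A stdlib-only sketch of that pattern with illustrative names (in the async path the flag is wired up by `StoppingGuard` instead):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Cooperative cancellation: the blocking worker checks the shared flag
// between items and bails out early once it flips.
fn process_until_stopped(items: &[u32], is_stopped: &AtomicBool) -> Vec<u32> {
    let mut out = Vec::new();
    for &item in items {
        if is_stopped.load(Ordering::Relaxed) {
            break;
        }
        out.push(item * 2);
    }
    out
}
```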
      

🧠 Learnings used
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.


Member Author

Good rabbit! 👏

Contributor

@agourlay, thank you! Happy to help! 🐰✨

Feel free to ping me if you need anything else while working through those instances.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 4, 2025
@agourlay agourlay merged commit da8ccdd into dev Dec 4, 2025
15 checks passed
@agourlay agourlay deleted the timeout-segment-holder-on-retrieve branch December 4, 2025 14:26
timvisee pushed a commit that referenced this pull request Dec 18, 2025
* Timeout on point retrieve when acquiring segment holder

* review: less radical and cleaner

* highly embarassing I hope no one sees this!

* coderabbit
@timvisee timvisee mentioned this pull request Dec 19, 2025