Skip to content

Fix potential deadlock in ProxySegment all_vectors#7172

Merged
agourlay merged 2 commits intodevfrom
fix-potential-deadlock-proxy-vector
Aug 29, 2025
Merged

Fix potential deadlock in ProxySegment all_vectors#7172
agourlay merged 2 commits intodevfrom
fix-potential-deadlock-proxy-vector

Conversation

@agourlay
Copy link
Member

@agourlay agourlay commented Aug 28, 2025

Fetching a vector on a proxy segment reacquires a read lock on the wrapped segment under the hood.

This only deadlock if a different thread tries to acquire a write lock in between.

2025-08-28T19:46:07.204032Z ERROR qdrant::startup: Panic backtrace:
   0: {closure#0}
             at ./src/startup.rs:19:25
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/alloc/src/boxed.rs:1980:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:841:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:699:13
   4: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:168:18
   5: __rustc::rust_begin_unwind
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
   6: core::panicking::panic_fmt
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
   7: on_unpark
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:1220:13
   8: parking_lot_core::parking_lot::deadlock::on_unpark
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:1144:9
   9: parking_lot_core::parking_lot::park::{{closure}}
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:637:17
  10: with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>>>
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:207:5
  11: park<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>>
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:600:5
  12: parking_lot::raw_rwlock::RawRwLock::lock_common
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:1123:17
  13: parking_lot::raw_rwlock::RawRwLock::lock_shared_slow
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:723:14
  14: lock_shared
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:109:31
  15: lock_api::rwlock::RwLock<R,T>::read
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.13/src/rwlock.rs:468:18
  16: shard::proxy_segment::segment_entry::<impl segment::entry::entry_point::SegmentEntry for shard::proxy_segment::ProxySegment>::vector
             at ./lib/shard/src/proxy_segment/segment_entry.rs:351:18
  17: shard::proxy_segment::segment_entry::<impl segment::entry::entry_point::SegmentEntry for shard::proxy_segment::ProxySegment>::all_vectors
             at ./lib/shard/src/proxy_segment/segment_entry.rs:370:40
  18: collection::collection_manager::segments_searcher::SegmentsSearcher::retrieve_blocking::{{closure}}
             at ./lib/collection/src/collection_manager/segments_searcher.rs:448:59
  19: read_points<collection::collection_manager::segments_searcher::{impl#0}::retrieve_blocking::{closure_env#0}>
             at ./lib/shard/src/segment_holder/mod.rs:761:29
  20: collection::collection_manager::segments_searcher::SegmentsSearcher::retrieve_blocking
             at ./lib/collection/src/collection_manager/segments_searcher.rs:421:14
  21: collection::collection_manager::segments_searcher::SegmentsSearcher::retrieve::{{closure}}::{{closure}}
             at ./lib/collection/src/collection_manager/segments_searcher.rs:393:21
  22: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/blocking/task.rs:42:21
  23: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/core.rs:365:24
  24: with_mut<tokio::runtime::task::core::Stage<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>>, core::task::poll::Poll<core::result::Result<ahash::hash_map::AHashMap<segment::types::ExtendedPointId, collection::operations::types::RecordInternal, ahash::random_state::RandomState>, collection::operations::types::CollectionError>>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/loom/std/unsafe_cell.rs:16:9
  25: poll<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/core.rs:354:30
  26: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/harness.rs:535:30
  27: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
  28: do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<core::result::Result<ahash::hash_map::AHashMap<segment::types::ExtendedPointId, collection::operations::types::RecordInternal, ahash::random_state::RandomState>, collection::operations::types::CollectionError>>>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
  29: __rust_try
  30: catch_unwind<core::task::poll::Poll<core::result::Result<ahash::hash_map::AHashMap<segment::types::ExtendedPointId, collection::operations::types::RecordInternal, ahash::random_state::RandomState>, collection::operations::types::CollectionError>>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
  31: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<collection::collection_manager::segments_searcher::{impl#0}::retrieve::{async_fn#0}::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<core::result::Result<ahash::hash_map::AHashMap<segment::types::ExtendedPointId, collection::operations::types::RecordInternal, ahash::random_state::RandomState>, collection::operations::types::CollectionError>>>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
  32: tokio::runtime::task::harness::poll_future
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/harness.rs:523:18
  33: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/harness.rs:210:27
  34: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/harness.rs:155:20
  35: tokio::runtime::task::raw::poll
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/raw.rs:325:13
  36: tokio::runtime::task::raw::RawTask::poll
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/raw.rs:255:18
  37: tokio::runtime::task::UnownedTask<S>::run
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/task/mod.rs:546:13
  38: tokio::runtime::blocking::pool::Task::run
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/blocking/pool.rs:161:19
  39: run
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/blocking/pool.rs:516:22
  40: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/blocking/pool.rs:474:47
  41: __rust_begin_short_backtrace<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/backtrace.rs:152:18
  42: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:559:17
  43: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
  44: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
  45: __rust_try
  46: catch_unwind<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
  47: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
  48: {closure#1}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:557:30
  49: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
  50: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/alloc/src/boxed.rs:1966:9
  51: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/pal/unix/thread.rs:107:17
  52: start_thread
             at ./nptl/pthread_create.c:448:8
  53: __GI___clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0

2025-08-28T19:46:07.204114Z ERROR qdrant::startup: Panic occurred in file /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs at line 1220: internal error: entered unreachable code: unparked deadlocked thread!

coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Aug 29, 2025
Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. I like that this is still sound because we currently guarantee the set of vector names doesn't change.

Here we locked wrapped first and then write. If this deadlocks it means we must lock these in reverse order somewhere else. Do we know where that happens?

@timvisee
Copy link
Member

There are more places where we have this lock pattern. This also locks wrapped first and then write:

let wrapped_segment_guard = wrapped_segment.read();
// Since `deleted_points` are shared between multiple ProxySegments,
// It is possible that some other Proxy moved its point with different version already
// If this is the case, there are multiple scenarios:
// - Local point doesn't exist or already removed locally -> do nothing
// - Already moved version is higher than the current one -> mark local as removed
// - Already moved version is less than what we have in current proxy -> overwrite
// Point doesn't exist in wrapped segment - do nothing
let Some(local_version) = wrapped_segment_guard.point_version(point_id) else {
return Ok(false);
};
// Equal or higher point version is already moved into write segment - delete from
// wrapped segment and do not move it again
if deleted_points_guard
.get(&point_id)
.is_some_and(|&deleted| deleted.local_version >= local_version)
{
drop(deleted_points_guard);
self.set_deleted_offset(point_offset);
return Ok(false);
}
let (all_vectors, payload) = (
wrapped_segment_guard.all_vectors(point_id, hw_counter)?,
wrapped_segment_guard.payload(point_id, hw_counter)?,
);
{
let segment_arc = self.write_segment.get();
let mut write_segment = segment_arc.write();
write_segment.upsert_point(op_num, point_id, all_vectors, hw_counter)?;
if !payload.is_empty() {
write_segment.set_full_payload(op_num, point_id, &payload, hw_counter)?;
}
}

I'm afraid we might suffer from the same deadlock there if we don't find (and fix?) the place where we lock in reverse order.

@timvisee
Copy link
Member

No! I see now. This is a self-deadlock.

We were holding a lock on wrapped_segment before, and had the potential to lock it again in the same thread here.

read_recursive is also an option here, but I like your approach better.

Nice find!

coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Aug 29, 2025
@agourlay
Copy link
Member Author

Thanks for adding a comment, I apologize for my poor PR description 🙇

@timvisee
Copy link
Member

Now that I read it again, your description was perfectly fine:

Fetching a vector on a proxy segment reacquires a read lock on the wrapped segment under the hood.

@agourlay agourlay merged commit 71a8d96 into dev Aug 29, 2025
16 checks passed
@agourlay agourlay deleted the fix-potential-deadlock-proxy-vector branch August 29, 2025 09:06
timvisee added a commit that referenced this pull request Sep 29, 2025
* Fix potential deadlock in ProxySegment all_vectors

* Add comment

---------

Co-authored-by: timvisee <tim@visee.me>
@timvisee timvisee mentioned this pull request Sep 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants