Fix deadlock on Actix RT in Cluster Info #7267

Merged: timvisee merged 1 commit into dev from deadlock-actix-rt-cluster-info on Sep 17, 2025
@agourlay agourlay commented Sep 17, 2025

Follow up on #7265

When the Actix runtime runs on a single thread, blocking that thread on a synchronous lock starves the whole runtime.

This happens when the cluster info API is called while all of the following hold:

  • a long-running streaming snapshot is going out and holds a read lock
  • that lock is a segment lock (not the segment holder lock)
  • a write to the same segment is pending
  • the cluster info request takes a blocking read lock on that segment, which blocks the Actix runtime
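
To make the lock interplay concrete, here is a toy model of the sequence above. It assumes the segment lock behaves like parking_lot's RwLock (the traces below show parking_lot frames), whose task-fair policy makes new readers wait once a writer is queued. `FairRwLockModel`, `Outcome`, and `replay_scenario` are hypothetical names for illustration only; the model just tracks state and never blocks.

```rust
/// Toy state model of a task-fair (writer-preferring) read-write lock.
/// Hypothetical illustration, not the real lock implementation.
#[derive(Default)]
struct FairRwLockModel {
    active_readers: usize,
    writer_active: bool,
    waiting_writers: usize,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Acquired,
    WouldBlock,
}

impl FairRwLockModel {
    /// A reader must wait if a writer holds the lock or is already queued.
    fn read(&mut self) -> Outcome {
        if self.writer_active || self.waiting_writers > 0 {
            Outcome::WouldBlock
        } else {
            self.active_readers += 1;
            Outcome::Acquired
        }
    }

    /// A writer must wait for all readers (and any active writer) to leave.
    fn write(&mut self) -> Outcome {
        if self.writer_active || self.active_readers > 0 {
            self.waiting_writers += 1;
            Outcome::WouldBlock
        } else {
            self.writer_active = true;
            Outcome::Acquired
        }
    }
}

/// Replays the three lock requests from the list above, in order.
fn replay_scenario() -> [Outcome; 3] {
    let mut segment_lock = FairRwLockModel::default();
    let snapshot = segment_lock.read(); // streaming snapshot holds a read lock
    let update = segment_lock.write(); // pending write queues behind the reader
    let cluster_info = segment_lock.read(); // new reader queues behind the writer
    [snapshot, update, cluster_info]
}
```

In this model the third request is the one that would park the caller. If that caller is the only Actix runtime thread, nothing else scheduled on it can make progress.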

I was able to reproduce the issue by adapting the existing deadlock test to interleave calls to the cluster info API.

Thread 8 (Thread 0x7b408ebff6c0 (LWP 313205) "actix-rt|system"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000060f10badea19 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:112
#2  0x000060f10bade84c in <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:66
#3  0x000060f10badaf2d in parking_lot_core::parking_lot::park::{{closure}} () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:635
#4  0x000060f10bada2af in with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>>> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:207
#5  park<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:600
#6  0x000060f10bae575b in parking_lot::raw_rwlock::RawRwLock::lock_common () at src/raw_rwlock.rs:1123
#7  0x000060f10bae445e in parking_lot::raw_rwlock::RawRwLock::lock_shared_slow () at src/raw_rwlock.rs:723
#8  0x000060f107975872 in lock_shared () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:109
#9  0x000060f1078bb2a3 in lock_api::rwlock::RwLock<R,T>::read () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.13/src/rwlock.rs:468
#10 0x000060f1071d101b in collection::shards::local_shard::LocalShard::estimate_cardinality::{{closure}} () at lib/collection/src/shards/local_shard/mod.rs:1004
#11 0x000060f10731baff in core::iter::adapters::map::map_fold::{{closure}} () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:88
#12 0x000060f107305baf in core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:294
#13 0x000060f106e9da2d in core::iter::traits::iterator::Iterator::fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2602
#14 0x000060f106a3156a in <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/chain.rs:123
#15 0x000060f1072fee52 in <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:128
#16 0x000060f10729229a in collection::shards::local_shard::LocalShard::estimate_cardinality () at lib/collection/src/shards/local_shard/mod.rs:1007
#17 0x000060f106c935ea in {async_block#0} () at lib/collection/src/shards/local_shard/shard_ops.rs:218
#18 0x000060f106b8afd9 in poll<alloc::boxed::Box<(dyn core::future::future::Future<Output=core::result::Result<collection::operations::types::CountResult, collection::operations::types::CollectionError>> + core::marker::Send), alloc::alloc::Global>> () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124
#19 0x000060f10600dbe0 in {async_fn#0} () at lib/collection/src/shards/replica_set/read_ops.rs:186
#20 0x000060f10462376b in {async_fn#0} () at lib/collection/src/collection/collection_ops.rs:368
#21 0x000060f1049f08aa in {async_fn#0} () at src/common/collections.rs:195
#22 0x000060f1053d6e3c in {async_block#0}<collection::operations::types::CollectionClusterInfo, qdrant::common::collections::do_get_collection_cluster::{async_fn_env#0}> () at src/actix/helpers.rs:146
#23 0x000060f1053e2fea in {async_fn#0}<collection::operations::types::CollectionClusterInfo, qdrant::actix::helpers::time::{async_fn#0}::{async_block_env#0}<collection::operations::types::CollectionClusterInfo, qdrant::common::collections::do_get_collection_cluster::{async_fn_env#0}>> () at src/actix/helpers.rs:186
#24 0x000060f1053d2a38 in {async_fn#0}<collection::operations::types::CollectionClusterInfo, qdrant::common::collections::do_get_collection_cluster::{async_fn_env#0}> () at src/actix/helpers.rs:146
#25 0x000060f104d17799 in {async_fn#0} () at src/actix/api/collections_api.rs:209

Thread 4 (Thread 0x7b408bfff6c0 (LWP 313219) "update-24"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000060f10badea19 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:112
#2  0x000060f10bade84c in <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:66
#3  0x000060f10baddef9 in parking_lot_core::parking_lot::park::{{closure}} () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:635
#4  0x000060f10bada604 in with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#0}, parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#1}, parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#2}>> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:207
#5  park<parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#0}, parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#1}, parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#2}> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:600
#6  0x000060f10bae5431 in wait_for_readers () at src/raw_rwlock.rs:1022
#7  0x000060f10bae4283 in parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow () at src/raw_rwlock.rs:647
#8  0x000060f107975982 in <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:73
#9  0x000060f1078bb3b3 in lock_api::rwlock::RwLock<R,T>::write () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.13/src/rwlock.rs:500
#10 0x000060f1078cae0b in apply_points<bool, shard::update::upsert_points::{closure_env#3}<core::slice::iter::Iter<shard::operations::point_ops::PointStructPersisted>>, shard::segment_holder::{impl#2}::apply_points_with_conditional_move::{closure_env#0}<shard::update::upsert_points::{closure_env#1}<core::slice::iter::Iter<shard::operations::point_ops::PointStructPersisted>>, shard::update::upsert_points::{closure_env#3}<core::slice::iter::Iter<shard::operations::point_ops::PointStructPersisted>>, shard::update::upsert_points::{closure_env#2}<core::slice::iter::Iter<shard::operations::point_ops::PointStructPersisted>>>> () at lib/shard/src/segment_holder/mod.rs:579

The fix is to make the underlying estimate_cardinality used by the cluster info API non-blocking by offloading it to a dedicated blocking thread.

There is a bit of refactoring:

  • making estimate_cardinality async
  • using the thread-safe HwMeasurementAcc instead of HardwareCounterCell
  • cloning the Filter to be able to spawn a task 🙈
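
A minimal sketch of the offloading idea, using only the standard library. It is not the code from this PR: a plain `std::thread` plus an `mpsc` channel stand in for the runtime's blocking-task facility, and `Segment`, `Filter`, and `estimate_cardinality_offloaded` are simplified, hypothetical stand-ins. The borrowed filter is cloned because a spawned thread cannot hold a reference tied to the caller's lifetime.

```rust
use std::sync::mpsc::{self, Receiver};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for a segment guarded by a sync lock.
struct Segment {
    point_ids: Vec<u64>,
}

/// Hypothetical stand-in for the filter; Clone so it can be moved to a thread.
#[derive(Clone)]
struct Filter {
    min_id: u64,
}

/// Counts matching points on a separate thread so the caller never waits on
/// a segment lock. The result arrives later on the returned channel.
fn estimate_cardinality_offloaded(
    segments: &[Arc<RwLock<Segment>>],
    filter: Option<&Filter>,
) -> Receiver<usize> {
    // Own everything the worker needs: Arc clones and a cloned filter.
    let segments = segments.to_vec();
    let filter = filter.cloned();
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        let total: usize = segments
            .iter()
            .map(|segment| {
                // This read may block, but only the worker thread is parked.
                let guard = segment.read().expect("segment lock poisoned");
                match &filter {
                    Some(f) => guard.point_ids.iter().filter(|id| **id >= f.min_id).count(),
                    None => guard.point_ids.len(),
                }
            })
            .sum();
        // The receiver may already be gone; ignore that case.
        let _ = tx.send(total);
    });

    rx
}
```

With this shape, a caller that holds or contends with a segment lock gets the channel back immediately and can keep serving other work while the worker waits for the lock.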

@agourlay agourlay marked this pull request as ready for review September 17, 2025 13:29
});
let segments = self.segments.clone();
let hw_counter = hw_measurement_acc.get_counter_cell();
// clone filter for spawning task

I did not find an alternative to cloning the incoming Filter: a reference bound to 'a cannot be moved into a spawned task.

&'a self,
filter: Option<&'a Filter>,
hw_counter: &HardwareCounterCell,
hw_measurement_acc: &HwMeasurementAcc,

Refactored to use HwMeasurementAcc, as HardwareCounterCell is not Send/Sync.
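
As a rough illustration of why a thread-safe accumulator is needed once work moves to another thread, here is a sketch built on `Arc<AtomicUsize>`. `MeasurementAcc` and `measure_on_workers` are hypothetical and only assume the general shape of such a type; this is not the actual HwMeasurementAcc implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical thread-safe accumulator: every clone shares one counter.
#[derive(Clone, Default)]
struct MeasurementAcc {
    cpu: Arc<AtomicUsize>,
}

impl MeasurementAcc {
    fn add(&self, units: usize) {
        self.cpu.fetch_add(units, Ordering::Relaxed);
    }

    fn total(&self) -> usize {
        self.cpu.load(Ordering::Relaxed)
    }
}

/// Spawns `workers` threads that each report `units_each` to the accumulator.
fn measure_on_workers(acc: &MeasurementAcc, workers: usize, units_each: usize) {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each worker gets its own handle to the shared counter.
            let acc = acc.clone();
            thread::spawn(move || acc.add(units_each))
        })
        .collect();

    // Joining ensures every update is visible before the caller reads totals.
    for handle in handles {
        handle.join().expect("worker panicked");
    }
}
```

Because the accumulator is both Send and Sync, it can be moved into the spawned task and still be read by the caller afterwards.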

hw_measurement_acc: HwMeasurementAcc,
) -> CollectionResult<BTreeSet<PointIdType>> {
let stopping_guard = StoppingGuard::new();
// cloning filter spawning task

Highlighting that we already clone the filter for the same reason in read_filtered.

@agourlay agourlay requested a review from timvisee September 17, 2025 13:43
@qdrant qdrant deleted a comment from coderabbitai bot Sep 17, 2025
@timvisee timvisee merged commit f66dc31 into dev Sep 17, 2025
16 checks passed
@timvisee timvisee deleted the deadlock-actix-rt-cluster-info branch September 17, 2025 13:54
@timvisee timvisee mentioned this pull request Sep 29, 2025