Fix deadlock on Actix RT in Telemetry #7265

Merged
agourlay merged 2 commits into dev from deadlock-actix-rt
Sep 17, 2025
@agourlay commented Sep 16, 2025

Follow up on #7241

When the Actix runtime runs on a single thread, blocking that thread on a sync lock starves the whole runtime.

This happens when telemetry is requested while all of the following hold:

  • a long-running streaming snapshot is going out and holds a read lock
  • the lock in question is a segment lock (not the segment holder lock)
  • a write to the same segment is pending
  • the telemetry request takes a blocking read lock on that segment, which blocks the Actix runtime thread

I was able to reproduce the issue by adapting the existing deadlock test to put more load on the system.

e.g.

Thread 7 (Thread 0x7d69fa5ff6c0 (LWP 1119317) "actix-rt|system"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00006367ccaaf2b9 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:112
#2  0x00006367ccaaf0ec in <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:66
#3  0x00006367ccaab7cd in parking_lot_core::parking_lot::park::{{closure}} () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:635
#4  0x00006367ccaaab4f in with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>>> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:207
#5  park<parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#1}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>, parking_lot::raw_rwlock::{impl#10}::lock_common::{closure_env#2}<parking_lot::raw_rwlock::{impl#10}::lock_shared_slow::{closure_env#0}>> () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:600
#6  0x00006367ccab5ffb in parking_lot::raw_rwlock::RawRwLock::lock_common () at src/raw_rwlock.rs:1123
#7  0x00006367ccab4cfe in parking_lot::raw_rwlock::RawRwLock::lock_shared_slow () at src/raw_rwlock.rs:723
#8  0x00006367c8946162 in lock_shared () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot-0.12.4/src/raw_rwlock.rs:109
#9  0x00006367c888bb93 in lock_api::rwlock::RwLock<R,T>::read () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.13/src/rwlock.rs:468
#10 0x00006367c819f60c in collection::shards::local_shard::LocalShard::local_shard_status::{{closure}}::{{closure}} () at lib/collection/src/shards/local_shard/mod.rs:1047
#11 0x00006367c82d1d31 in core::iter::adapters::map::map_try_fold::{{closure}} () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:95
#12 0x00006367c82c8441 in core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:294
#13 0x00006367c7e5045a in core::iter::traits::iterator::Iterator::try_fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2426
#14 0x00006367c79f46b5 in <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/chain.rs:108
#15 0x00006367c82c59f1 in <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:121
#16 0x00006367c82c8937 in core::iter::traits::iterator::Iterator::any () at /home/agourlay/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2826
#17 0x00006367c819f150 in {async_fn#0} () at lib/collection/src/shards/local_shard/mod.rs:1048
#18 0x00006367c572d49e in {async_fn#0} () at lib/collection/src/shards/shard.rs:80
#19 0x00006367c67351d5 in {async_fn#0} () at lib/collection/src/shards/replica_set/telemetry.rs:16
#20 0x00006367c55f735e in {async_fn#0} () at lib/collection/src/collection/mod.rs:788
#21 0x00006367c6693708 in {async_fn#0} () at lib/storage/src/content_manager/toc/telemetry.rs:17
#22 0x00006367c594169d in {async_fn#0} () at src/common/telemetry_ops/collections_telemetry.rs:33
...

Thread 3 (Thread 0x7d69f79ff6c0 (LWP 1119328) "update-24"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00006367ccaaf2b9 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:112
#2  0x00006367ccaaf0ec in <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/thread_parker/linux.rs:66
#3  0x00006367ccaae799 in parking_lot_core::parking_lot::park::{{closure}} () at /home/agourlay/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parking_lot_core-0.9.11/src/parking_lot.rs:635
#4  0x00006367ccaaaea4 in with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#0}, parking_lot::raw_rwlock::{impl#10}::wait_for_readers::{closure_env#1},
...

The proposed fix is to make sure that fetching local_shard_status for the telemetry runs on a dedicated blocking thread.
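
As a rough illustration of the idea (not the code from this PR), the sketch below moves a blocking read onto a dedicated thread so that a single-threaded loop can keep serving other work while the lock is held elsewhere. The second commit title, "Spawn blocking task for sync lock", suggests the real change uses the async runtime's blocking task facility; plain `std` threads and channels stand in for that here. `Segment`, `status_on_dedicated_thread`, and `serve_while_status_is_blocked` are hypothetical names, not Qdrant APIs.

```rust
use std::sync::{mpsc, Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a segment guarded by a sync lock.
struct Segment {
    points: usize,
}

/// Takes the blocking read lock on a dedicated thread, so only that
/// helper thread parks if the lock is contended. The caller gets a
/// receiver that yields the value once the read has completed.
fn status_on_dedicated_thread(segment: Arc<RwLock<Segment>>) -> mpsc::Receiver<usize> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let guard = segment.read().unwrap(); // may block for a long time
        let _ = tx.send(guard.points); // ignore the error if the caller went away
    });
    rx
}

/// Simulates a single-threaded loop that stays responsive while the
/// status read is stuck behind a writer. Returns (points, ticks served).
fn serve_while_status_is_blocked() -> (usize, usize) {
    let segment = Arc::new(RwLock::new(Segment { points: 42 }));

    // A writer grabs the lock and holds it until told to release it.
    let writer_segment = Arc::clone(&segment);
    let (locked_tx, locked_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let writer = thread::spawn(move || {
        let mut guard = writer_segment.write().unwrap();
        locked_tx.send(()).unwrap();
        release_rx.recv().unwrap();
        guard.points += 1;
    });
    locked_rx.recv().unwrap(); // the write lock is now held

    let status_rx = status_on_dedicated_thread(Arc::clone(&segment));
    let mut ticks = 0;
    let points = loop {
        match status_rx.try_recv() {
            Ok(points) => break points,
            Err(mpsc::TryRecvError::Empty) => {
                ticks += 1; // the loop is free to serve another request here
                if ticks == 3 {
                    release_tx.send(()).unwrap(); // let the writer finish
                }
                thread::sleep(Duration::from_millis(1));
            }
            Err(mpsc::TryRecvError::Disconnected) => {
                panic!("status thread exited without a result")
            }
        }
    };
    writer.join().unwrap();
    (points, ticks)
}
```

Calling `serve_while_status_is_blocked()` returns the point count observed after the writer finishes, together with the number of loop iterations that ran while the read was pending. Had the loop called `segment.read()` directly, it would have parked on the lock and served nothing until the writer released it, which is the starvation described above.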

@agourlay changed the title from "Deadlock on Actix RT" to "Fix deadlock on Actix RT in Telemetry" Sep 17, 2025
@agourlay marked this pull request as ready for review September 17, 2025 08:40
@timvisee left a comment
This is a different place than the backtraces I've seen. But still a good thing to patch 🙌

@qdrant deleted a comment from coderabbitai bot Sep 17, 2025

@agourlay commented

This is a different place than the backtraces I've seen.

I agree, there are several sync locks to hunt down 👍

@agourlay merged commit 5238d16 into dev Sep 17, 2025
16 checks passed
@agourlay deleted the deadlock-actix-rt branch September 17, 2025 08:57
timvisee pushed a commit that referenced this pull request Sep 29, 2025
* Deadlock on Actix RT

* Spawn blocking task for sync lock
@timvisee mentioned this pull request Sep 29, 2025