Migrate WAL mutex from parking lot to tokio #6307
Conversation
📝 Walkthrough
The pull request introduces several changes aimed at transitioning from synchronous to asynchronous operations across the codebase. Key modifications include updating the locking mechanism for the Write-Ahead Log (WAL) by switching from parking_lot's synchronous mutex to tokio's asynchronous mutex.
Actionable comments posted: 0
🧹 Nitpick comments (4)
lib/collection/src/shards/shard.rs (1)
234-234: Consider releasing lock before logging
You're holding the WAL mutex lock while creating and logging this debug message. If the logging operation becomes slow for any reason, it might block other tasks. Consider reading last_index() into a local variable and then releasing the lock before calling log::debug!.
lib/collection/src/update_handler.rs (2)
695-695: Potential blocking flush in wait logic
Calling wal.lock().await.flush() in the path where wait is true may stall other tasks if flush is time-consuming. Consider whether you need a more fine-grained approach, such as performing the flush outside the main lock if feasible.
820-820: Avoid extended lock during acknowledgement
You're locking the WAL to call ack(ack). In heavily loaded environments, consider reading the necessary state under lock, releasing it, then acknowledging to minimize time spent holding the mutex.
lib/collection/src/shards/queue_proxy_shard.rs (1)
468-468: Potential concurrency bottleneck while reading WAL
Locking the WAL here to retrieve items is correct, but if transferring large batches, you might consider partial reads or using separate read structures to reduce lock contention.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
lib/collection/src/shards/local_shard/mod.rs (5 hunks)
lib/collection/src/shards/queue_proxy_shard.rs (5 hunks)
lib/collection/src/shards/replica_set/shard_transfer.rs (2 hunks)
lib/collection/src/shards/shard.rs (2 hunks)
lib/collection/src/update_handler.rs (5 hunks)
lib/collection/src/wal_delta.rs (26 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
lib/collection/src/shards/shard.rs (2)
  lib/collection/src/wal_delta.rs (2)
    wal (1615-1627)
    wal_version (121-128)
  lib/collection/src/shards/replica_set/shard_transfer.rs (1)
    wal_version (492-501)
lib/collection/src/wal_delta.rs (4)
  lib/collection/src/shards/replica_set/shard_transfer.rs (1)
    wal_version (492-501)
  lib/collection/src/shards/shard.rs (1)
    wal_version (250-266)
  lib/collection/src/update_handler.rs (1)
    new (132-169)
  lib/collection/src/shards/local_shard/mod.rs (2)
    new (161-239)
    new (1272-1277)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
🔇 Additional comments (41)
lib/collection/src/shards/shard.rs (1)
250-252: Details
✅ Verification successful
Verify async signature usage
The wal_version function is now asynchronous. Double-check that all consumers of wal_version() call it with .await and handle potential concurrency or error scenarios from the future properly.
🏁 Script executed:
#!/bin/bash
# Check for all usage of wal_version in the repo to ensure calls are awaited.
rg -A 3 "wal_version\("

Length of output: 1933
Async Signature Verified:
The asynchronous implementation of wal_version appears to be correctly integrated across the codebase. All consumers, such as those in lib/collection/src/shards/shard.rs, lib/collection/src/wal_delta.rs, lib/collection/src/shards/replica_set/shard_transfer.rs, and lib/collection/src/shards/transfer/wal_delta.rs, are properly using .await and handling errors as expected. No further action is required.
lib/collection/src/update_handler.rs (3)
246-252: Ensure asynchronous error handling in try_recover
This function has been changed to async. Confirm that any errors, especially those arising from wal.lock().await, are properly handled or bubbled up. Also verify that external callers are updated to use .await and are prepared to handle partial recoveries mid-await.
628-630: Double-check logic after async recovery
Here, try_recover is being awaited. If recovery fails, the code aborts further processing. Ensure that this is the intended flow and that no critical cleanup steps are missed after a failed recovery.
779-779: Confirm concurrency of async flush
Invoking flush_async() here is appropriate for non-blocking behavior. Verify downstream code to ensure it safely handles any partial flush states that might occur if tasks race to read from the WAL while this flush is ongoing.
69-75: Async constructor pattern
Defining pub async fn new and immediately calling Inner::new(...).await ensures the shard setup can await WAL locks or remote checks. This is appropriate if shard construction necessarily depends on async I/O. Confirm that any failures are properly handled or propagated as errors.
91-100: Async creation from existing WAL version
Similarly, making pub async fn new_from_version ensures that verifying the version within the WAL is fully asynchronous. Verify that the calling sites either handle or propagate potential concurrency errors when multiple tasks attempt creation from different versions.
389-394: Check nested WAL locking in new
Calling wrapped_shard.wal.wal.lock().await.last_index() inline is straightforward; just confirm that no other operation in this constructor accidentally re-locks the same WAL in nested code paths, which might risk deadlocks.
176-176: Adoption of tokio::sync::Mutex matches the new async design
Using tokio::sync::Mutex here is congruent with the PR objective of migrating to async locking, reducing potential blocking.
619-619: Asynchronous lock acquisition
Transitioning to .lock().await is correct for async contexts and aligns with the new async Mutex usage.
628-630: Improved logging granularity
This updated log message clarifies the WAL recovery start point. Looks good.
929-932: Documentation clarifies blocking lock usage
Explicitly stating the function will panic in async contexts helps avoid misuse. Consider elaborating any multi-threaded constraints if relevant.
939-939: Blocking lock is valid for synchronous usage
blocking_lock() is appropriate if strictly invoked from non-async code. Ensure external callers never await this function, as warned in the docstring.
968-972: Reinforcing the panic condition in documentation
Reiterating the risk of calling this function asynchronously helps steer correct usage patterns.
974-974: Second blocking lock usage
Same caution applies; ensure a strictly synchronous calling context to prevent runtime deadlocks.
lib/collection/src/shards/replica_set/shard_transfer.rs (2)
180-193: Async queue-proxy initialization
Awaiting either QueueProxyShard::new or ::new_from_version is correctly handled for both from_version scenarios.
500-500: Async wal_version retrieval
Using .await for the WAL version lookup ensures consistent async handling of I/O-bound operations.
5-5: Use of tokio::sync is correct
Migration from parking_lot to tokio's Mutex is well-implemented.
11-11: Refined visibility
Changing the type alias to pub(crate) properly restricts scope. No issues found.
50-53: Refined function signature
Returning an OwnedMutexGuard in an async context avoids lifetime complications and looks clean.
68-68: Owned lock acquisition
lock_owned(self.wal.clone()).await is correctly used to yield an OwnedMutexGuard.
112-112: Awaiting WAL lock
Replacing synchronous locking with .await aligns with the asynchronous migration.
121-122: Asynchronous wal_version
Switching to an async function and awaiting the lock is consistent with the new locking model.
136-136: Await lock before read
Ensuring the lock is awaited prevents blocking in an async context.
292-292: Arc<Mutex<...>> initialization
Wrapping the WAL in Arc<Mutex<...>> correctly adopts asynchronous locking patterns.
377-377: Verify single operation
Awaiting b_wal.wal.lock() before counting ensures proper synchronization.
389-390: Zipping multiple reads
Locking each WAL asynchronously then zipping results maintains consistency across WALs.
487-487: Count check
Locking and reading the WAL ensures the correct number of points is counted.
496-499: Multiple WAL comparison
Acquiring locks and chaining .read(0) calls for each WAL is a clear approach in test code.
583-583: Test assertion
Confirming the correct number of missed operations after an async lock is reliable.
593-596: Concurrent WAL checks
Awaiting each lock and zipping across multiple WALs is consistent with async usage.
689-689: Delta read check
Async lock usage matches the rest of the code and ensures consistency.
699-701: Chained locking
Locking and reading each WAL before zipping them together is a standard test pattern.
794-795: Double assertion
Ensuring both nodes have two operations post-delta is correct under async locking.
805-807: Await lock
Guard acquisition is awaited before reading and zipping with the counterpart WAL’s read, minimizing race conditions.
815-817: Consistency check
Again, verifying equality after awaiting each lock is central to safe concurrency.
821-823: Ensuring 3 operations
Confirming the WAL operation count matches across all nodes after acquiring locks.
839-839: Awaited lock in testing
Continuing to ensure we don't block elsewhere by asynchronously locking the WAL.
846-846: Lock for read
Explicitly awaiting each lock promotes correctness in these test flows.
853-853: Consistent async usage
No issues with another awaited lock call here.
1214-1214: Operation count assertion
Verifies newly inserted deltas are detected properly via async locks.
1399-1408: Collecting locked WAL references
Temporarily holding each WAL lock for comparison is an acceptable pattern in tests. Blocking concurrency here is fine in a test context.
Can we please get a small performance test for a quick sanity check that this is not much slower?
Here are some basic
* Migrate WAL from sync parking lot to async tokio mutex
* Improve logging
* Migrate tests
Migrates the WAL lock from a sync parking lot mutex to an async tokio mutex.
The mutex is almost exclusively used in async context. I therefore suggest to make it an async mutex, so that we don't run into the constant issue of trying to hold this lock across await points.
This is the first step in a bigger attempt to more clearly define the boundary between our sync (segments) and async (the rest) code. Currently, this boundary is very vague because we constantly mix up the two flavors, which creates constant problems. I'd like to clean up this technical debt a bit.
Contains no logic changes. Purely mechanical to change the mutex type.
All Submissions:
* Did you create your branch from dev?

New Feature Submissions:
* Did you run the cargo +nightly fmt --all command prior to submission?
* Did you check your code with the cargo clippy --all --all-features command?