threading: lock-free fast path for SemaphoreSlim.WaitAsync#125452
Open
thomhurst wants to merge 20 commits into
Open
threading: lock-free fast path for SemaphoreSlim.WaitAsync#125452thomhurst wants to merge 20 commits into
thomhurst wants to merge 20 commits into
Conversation
Author
|
@EgorBot -intel -amd -arm using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
[MemoryDiagnoser]
public class SemaphoreSlimUncontended
{
private SemaphoreSlim _sem = new SemaphoreSlim(1, 1);
[Benchmark]
public async Task WaitAsync_Release()
{
await _sem.WaitAsync();
_sem.Release();
}
} |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces a lock-free fast path in SemaphoreSlim.WaitAsync that attempts to acquire an available permit via CAS, avoiding taking m_lockObjAndDisposed when uncontended.
Changes:
- Added a CAS-based fast path to decrement
m_currentCountwhen a permit appears immediately available. - Added special-case handling to keep
AvailableWaitHandlestate consistent if it’s initialized concurrently during the fast-path acquire.
Member
|
Doesn't lock itself has fast paths for that? |
Author
|
@EgorBot -intel -amd -arm using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
[MemoryDiagnoser]
public class SemaphoreSlimUncontended
{
private SemaphoreSlim _sem = new SemaphoreSlim(1, 1);
[Benchmark]
public async Task WaitAsync_Release()
{
await _sem.WaitAsync();
_sem.Release();
}
} |
Author
|
@EgorBo I think just by not entering the lock we can save some time: EgorBot/Benchmarks#31 |
fb8aeb0 to
419dbb0
Compare
Author
|
@EgorBot -intel -amd -arm using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
[MemoryDiagnoser]
public class SemaphoreSlimUncontended
{
private SemaphoreSlim _sem = new SemaphoreSlim(1, 1);
[Benchmark]
public async Task WaitAsync_Release()
{
await _sem.WaitAsync();
_sem.Release();
}
} |
Member
Are the benchmarks hitting deadlocks? |
Use Interlocked.Add to apply a relative delta to m_currentCount rather than writing back an absolute snapshot-derived value, so concurrent lock-free decrements from the WaitAsync fast path are not overwritten.
Replace plain --m_currentCount with a CAS loop to prevent a double grant when the lock-free WaitAsync fast path decrements m_currentCount between the > 0 check and the decrement in the slow path. WaitCore is safe because m_waitCount++ on lock entry blocks the CAS guard for its entire critical section. WaitAsyncCore has no such protection.
Apply the same CAS-loop pattern to WaitCore's m_currentCount decrement that was applied to WaitAsyncCore in the previous commit. A fast-path thread that read m_waitCount = 0 before WaitCore's m_waitCount++ can still race with WaitCore's check-at-404 / decrement-at-407 sequence. The CAS loop serializes both operations on m_currentCount atomically.
… stress test The assert !waitSuccessful || m_currentCount > 0 in WaitCore could fire spuriously in Debug builds: the lock-free WaitAsync fast path runs outside the lock, so it can decrement m_currentCount to 0 between WaitUntilCountOrTimeout returning and the assert executing. Adds a stress test that races AvailableWaitHandle lazy initialization against WaitAsync fast-path acquires and verifies the handle is never signaled when CurrentCount == 0.
…t WaitAsync fast path
…phoreSlim Also fix CS0420 in Release(): Volatile.Read(ref volatile_field) triggers a compiler error in the coreclr project build; replaced with a plain field read (already volatile) so the testhost can be rebuilt with the fixed implementation.
…cessor task is always cancelled
- Extract duplicated CAS-decrement loop into TryDecrementCount() with AggressiveInlining, replacing inline copies in WaitCore and WaitAsyncCore - Strengthen Assert.InRange to Assert.Equal in NeverUnderflows test - Add bulk Release(2) concurrent stress test for Interlocked.Add delta math - Add cancellation-during-fast-path stress test for count integrity - Use m_currentCount (post-Add) instead of netCount for m_waitHandle.Set() - Add UncontendedSync and MixedSyncAsync benchmarks for sync path coverage
Prevents the background task from pegging a CPU core in CI while still exercising concurrent lazy initialization of the wait handle.
… file WaitAsync(CancellationToken) returns Task, not Task<bool>; the prior declaration didn't compile. Removed PerformanceTests/SemaphoreSlimBenchmarks.cs since it had no csproj and wasn't wired into any project; runtime perf benchmarks live in dotnet/performance, and the EgorBot inline benchmarks posted on the PR cover the relevant scenarios.
The comment narrated what the next several lines do; the variable name and surrounding structure already convey it.
Address review hazards in the WaitAsync CAS fast path: - Make m_waitCount and m_asyncHead volatile. The fast path reads them without the lock; sync-waiter writes inside the lock must publish via release semantics rather than depending on the lock release that the fast path bypasses. Without this, ARM64 can let the fast path observe m_waitCount == 0 while a sync waiter is parked, stealing the slot and leaving the waiter blocked. - Restructure AvailableWaitHandle init to publish-then-reflect: publish the handle unsignaled, full barrier, then conditionally Set based on the post-publish count read. Closes the race where ManualResetEvent allocation overlapped a fast-path CAS, leaving the handle Set with count == 0. - WaitCore: loop instead of falling through with a stale waitSuccessful when TryDecrementCount loses to a fast-path acquirer. Fixes the case where Wait(Infinite) could return without owning a permit (silently dropped by the void overload, lying about acquisition for bool overloads). - Release: use Interlocked.Add's return value for the MRE.Set sentinel so a fast-path decrement racing between the Add and the re-read doesn't mask the 0 -> positive transition. - Strengthen the AvailableWaitHandle init test: allocate a fresh SemaphoreSlim per iteration so each iteration is a real attempt at the race. With a single semaphore the race only fires on the first AvailableWaitHandle access.
Defensive: the prior placement was correct because a thrown OCE always short-circuits before re-entry, but moving 'oce' (and 'timedOut', already loop-local) inside makes the freshness invariant explicit and robust to future edits.
Original test had 8 workers each doing WaitAsync;WaitAsync;Release(2) against 4 permits. Under contention, 4 workers could each grab one permit then block on the second WaitAsync — no holder ever reaches Release(2). Helix CI hit this on loaded ARM/Mono interp legs and killed the work item at the 45-min timeout. Split into pure consumers (WaitAsync only) and a single producer issuing Release(2) bulk releases. Same intent — bulk Release racing concurrent fast-path waiters — but deadlock-free by construction.
f4ed333 to
60419be
Compare
Comment on lines
910
to
920
| // m_currentCount is declared volatile; this read sees any decrement by the lock-free WaitAsync fast path. | ||
| // The snapshot is an upper bound on the count for the duration of this method (a concurrent fast path | ||
| // may decrement, but no other thread can increment since Release holds the lock); the SemaphoreFullException | ||
| // check below uses it intentionally to honor the caller's view of the count at entry. | ||
| int snapshot = m_currentCount; | ||
| returnCount = snapshot; | ||
|
|
||
| if (m_maxCount - snapshot < releaseCount) | ||
| { | ||
| throw new SemaphoreFullException(); | ||
| } |
Comment on lines
+735
to
+751
| [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsMultithreadingSupported))] | ||
| public static async Task WaitAsync_CancellationDuringFastPath_NoCountCorruption() | ||
| { | ||
| const int Iterations = 10_000; | ||
| var sem = new SemaphoreSlim(1, 1); | ||
|
|
||
| for (int i = 0; i < Iterations; i++) | ||
| { | ||
| using var cts = new CancellationTokenSource(); | ||
| Task t = sem.WaitAsync(cts.Token); | ||
| cts.Cancel(); | ||
| try { await t; sem.Release(); } | ||
| catch (OperationCanceledException) { } | ||
| } | ||
|
|
||
| Assert.Equal(1, sem.CurrentCount); | ||
| } |
Comment on lines
+649
to
+663
| const int Threads = 16, Iterations = 1_000; | ||
| var sem = new SemaphoreSlim(1, 1); | ||
|
|
||
| await Task.WhenAll(Enumerable.Range(0, Threads).Select(_ => Task.Run(async () => | ||
| { | ||
| for (int i = 0; i < Iterations; i++) | ||
| { | ||
| if (await sem.WaitAsync(0)) | ||
| sem.Release(); | ||
| } | ||
| }))); | ||
|
|
||
| // Every successful WaitAsync(0) is paired with a Release; final count must be exactly 1. | ||
| Assert.Equal(1, sem.CurrentCount); | ||
| } |
… race test, dispose test semaphores - Release: replace the single m_currentCount snapshot + max-count check with a CAS loop that re-observes the live count, validates releaseCount against m_maxCount, and applies the increment atomically. Makes the max-count check linearizable with the increment so a concurrent lock-free WaitAsync fast-path decrement can no longer trigger a spurious SemaphoreFullException. - Tests: rewrite WaitAsync_CancellationDuringFastPath_NoCountCorruption (which never exercised the race, since an uncontended WaitAsync on (1,1) completes synchronously before Cancel) as WaitAsync_CancellationRacingAcquire_NoCountCorruption: contended workers with each token cancelled from a separate thread. - Tests: dispose the SemaphoreSlim instances in the new tests via using.
The handoff loop tracked two opposing counters (decrementing maxAsyncToRelease and incrementing asyncReleased) for the same loop progress. Use a single up-counter compared against the fixed bound instead. No behavior change.
Comment on lines
+751
to
+765
| using var cts = new CancellationTokenSource(); | ||
|
|
||
| // Cancel from another thread so it races the WaitAsync below. | ||
| Task canceller = Task.Run(cts.Cancel); | ||
|
|
||
| try | ||
| { | ||
| // Completes (acquires the permit) on success, throws on cancellation. | ||
| await sem.WaitAsync(cts.Token); | ||
| sem.Release(); | ||
| } | ||
| catch (OperationCanceledException) { } | ||
|
|
||
| await canceller; | ||
| } |
Comment on lines
+409
to
+414
| // Prepare for the main wait... | ||
| // wait until the count become greater than zero or the timeout is expired | ||
| try | ||
| { | ||
| timedOut = !WaitUntilCountOrTimeout(millisecondsTimeout, startTime, cancellationToken); | ||
| } |
Release handed permits to async waiters by CAS-bumping m_currentCount to observed + releaseCount, draining waiters from the list, then subtracting the handed-out count with Interlocked.Add. Between removing a waiter from the list and the subtract, m_currentCount was inflated while m_asyncHead and m_waitCount read as empty, so the lock-free WaitAsync fast path could acquire a permit already earmarked for that waiter. This double-acquired a permit and drove the count negative, after which a later Release exceeded m_maxCount and threw SemaphoreFullException. The weaker Mono memory model exposed it (Mono_MiniJIT libraries tests crashing in xUnit's WaitAsync throttle), but the race is platform-independent. Never publish the inflated count. Compute the post-release count in a local, drain waiters, and apply only the net delta (releaseCount - asyncReleased) in a single Interlocked.Add at the end, so the fast path never observes a permit reserved for a waiter. The max-count check is linearized before any waiter is touched by re-reading the live count on a mismatch, keeping the spurious-throw protection without the bump. Add Release_AsyncWaiterHandoff_RacingFastPath_NeverExceedsMaxCount to cover the async-waiter handoff racing the fast path. Also switch the cancellation race test's per-iteration canceller from Task.Run to ThreadPool to drop the per-iteration Task allocation, and fix a grammar typo in a WaitCore comment.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Use a lock-free CAS fast path in SemaphoreSlim.WaitAsync to skip the Monitor lock when a permit is immediately available, improving uncontended throughput