Add low-cost instrumented version of RuntimeAsyncTask::DispatchContinuations reclaiming lost performance.#126091
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-runtime |
There was a problem hiding this comment.
Pull request overview
This PR refactors runtime-async continuation dispatch to support a low-overhead “uninstrumented” fast path while still enabling Debugger/TPL (and future profiler) instrumentation via a separate, JIT-specialized codegen path, with updated flag plumbing and added tests to validate behavior and cleanup.
Changes:
- Introduces a generic, instrumentation-specialized
DispatchContinuations<TRuntimeAsyncTaskInstrumentation>()path and centralized runtime-async instrumentation flag management. - Refactors
Task’s runtime-async timestamp bookkeeping APIs to better support continuation chains and exception/unwind cleanup. - Adds/expands RuntimeAsync tests for timestamp cleanup, debugger detach behavior, continuation timestamp visibility, and TPL EventSource events; wires TPL EventSource enable/disable to update instrumentation flags.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/RuntimeAsyncTests.cs | Adds coverage for runtime-async instrumentation behavior (timestamps, detach, unwind/cancel, and TPL events). |
| src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/TplEventSource.cs | Updates runtime-async instrumentation flags when TPL EventSource commands change enabled keywords. |
| src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs | Refactors runtime-async timestamp dictionaries and adds helpers for chain timestamp propagation and cleanup. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Adds the generic instrumentation abstraction and splits dispatch/finalize into specialized uninstrumented vs instrumented implementations. |
...time/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/RuntimeAsyncTests.cs
Outdated
Show resolved
Hide resolved
...time/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/RuntimeAsyncTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
...time/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/RuntimeAsyncTests.cs
Outdated
Show resolved
Hide resolved
...time/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/RuntimeAsyncTests.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
|
Most failures are due to test |
|
s_asyncDebuggingEnabled implies that these TPL events have been enabled, with two exceptions:
We can assume that s_asyncDebuggingEnabled implies required TPL events enabled, and save the setting and checking of these flags. The downside is a tiny amount of overhead spent checking whether TPL events are enabled in these two exceptional cases during debugging - in my opinion, an acceptable tradeoff for simplification and any possible optimization of the hot path. If someone believes that scenario 1 is not very uncommon, it is possible to change Concord such that they will only be out of sync in case 2. |
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
Question is if pure TPL will have meaning here, like adding TPL events for runtime async activities in case a client only consumes TPL events to recreate async callstacks, or if that was never the intent with this change and it should only support the debugger features, then we should definitely put all logic under the same flag. If we could assume the feature will trigger the TPL event source, we could handle it all in the on command event onTPL event source, and we can do the fast check only on flags, maybe that was what you proposed @rcj1? In that case this feature won't light up without the TPL session enabled, but that might be, OK? It would speed up the instrumentation gate check since it will only look at the flags again and ignore s_asyncDebuggingEnabled as part of dispatch loop. |
I worry that all the The other thing with the current approach -- the inversion of control makes the dispatcher quite hard to follow for me. But I can live with this given the improved performance. |
Doesn't look too bad, I tried a binary counter micro benchmark that loads 4096 of the dispatchers: using System;
using System.Diagnostics;
using System.Threading.Tasks;
namespace AsyncMicro;
public class Program
{
static void Main()
{
Stopwatch timer = Stopwatch.StartNew();
new Program().Recurse<byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte>(0).GetAwaiter().GetResult();
Console.WriteLine("Took {0} ms", timer.ElapsedMilliseconds);
}
private async Task<V<T11, T10, T9, T8, T7, T6, T5, T4, T3, T2, T1, T0>> Recurse<T11, T10, T9, T8, T7, T6, T5, T4, T3, T2, T1, T0>(int n)
{
await Task.Yield();
//Console.WriteLine(n);
Task t;
if (typeof(T0) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, T5, T4, T3, T2, T1, int>(n + 1);
}
else if (typeof(T1) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, T5, T4, T3, T2, int, byte>(n + 1);
}
else if (typeof(T2) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, T5, T4, T3, int, byte, byte>(n + 1);
}
else if (typeof(T3) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, T5, T4, int, byte, byte, byte>(n + 1);
}
else if (typeof(T4) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, T5, int, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T5) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, T6, int, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T6) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, T7, int, byte, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T7) == typeof(byte))
{
t = Recurse<T11, T10, T9, T8, int, byte, byte, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T8) == typeof(byte))
{
t = Recurse<T11, T10, T9, int, byte, byte, byte, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T9) == typeof(byte))
{
t = Recurse<T11, T10, int, byte, byte, byte, byte, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T10) == typeof(byte))
{
t = Recurse<T11, int, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte>(n + 1);
}
else if (typeof(T11) == typeof(byte))
{
t = Recurse<int, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte>(n + 1);
}
else
{
return default;
}
await t;
return default;
}
private struct V<T11, T10, T9, T8, T7, T6, T5, T4, T3, T2, T1, T0>
{
public int X;
}
}The difference is only around 100 ms on this PR. (P.S. a simpler binary recursing generic hits some exponential behavior... Need to investigate that.) |
The alternative is to duplicate the DispatchContinuation implementation into two different methods, one default (with upgrade capabilities) and one instrumented version. In the end it depends on whats important, this PR currently take the reduce code duplication path, with slightly more complex dispatcher implementation (due to the instrumentation callbacks), and I think that is fine. Dropping the R2R exclusion reducing one indirect call on the instrumented path will simplify the instrumented path a little more as well, with very small size increase of S.P.C. |
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs
Outdated
Show resolved
Hide resolved
Add instrumentation probes to RuntimeAsyncTask DispatchContinuations loop. This is an extremely hot loop, in the centre of dispatching async continuations. dotnet#123727 introduced a regression of ~7% when adding additional instrumentation for debugger/tpl into the loop. This commit uses generic value type specialization to setup an interface for the probes that JIT can use to create two versions of codegen for this hot method, most of the probes will be transformed to noop when profiling/debugging is disabled, introduce minimal overhead to critical hot code path. Dispatch loop checks on entry if instrumentation is enabled, if so it will switch to instrumented version of the function. It also checks on each completion of a continuation if instrumentation flags changed and if that is the case it will again switch to the instrumented version. In total it performs small set of instructions to "upgrade" the method on entry, and also a fast check against a static on each loop to support late attach scenarios. This change will make sure the dispatch continuations loop is protected from future performance regressions when more instrumentation gets added.
…s/System.Runtime.CompilerServices/RuntimeAsyncTests.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…s/Task.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…s/System.Runtime.CompilerServices/RuntimeAsyncTests.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…s/Task.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…s/TplEventSource.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
069b76c to
75f5f5e
Compare
#123727 introduced a regression of ~7% adding additional instrumentation for Debugger/TPL into
RuntimeAsyncTask::DispatchContinuations. Going forward, there will be even more instrumentation needed when implementing async profiling and that would increase the overhead even more, so we need a way to isolate the instrumented vs none instrumented version of this method and regain some of the lost performance.This PR duplicates DispatchContinuations into two methods, regular and instrumented. Previous version of PR used a generic value type specialization, but it was decided that the simplification not using generic value type specialization is worth the duplication of the function.
To be able to "upgrade" from none instrumented to instrumented version of
DispatchContinuations, there are checks at method entry and after each completed continuation detecting if method should switch to instrumented version. Both checks are small and fast just a load of a flag in a static variable and checked if it's not 0.This change will make sure the
RuntimeAsyncTask::DispatchContinuationsis protected from future performance regressions when more instrumentation gets added into the instrumented version.Running the same benchmark, #123727 (comment) now shows the following numbers on old vs new implementation:
S.P.C)RuntimeAsyncTask::DispatchContinuations)Measurements done on Windows x64.
S.P.Cis 16 KB larger with this PR, due to duplicated DispatchContinuation method.JIT Size is 379 bytes smaller on none instrumented version of
RuntimeAsyncTask::DispatchContinuations, all previous instrumentation has been moved to instrumented version, completely eliminated in default method.Benchmark shows that this PR recover most of the performance previously lost in #123727.
Code paths triggering the use of instrumented version,
InstrumentedDispatchContinuationsare protected by aIsSupportedflag, so can be trimmed away, removing references toInstrumentedDispatchContinuations.Most changes in this PR are around extracting out existing instrumentation into the instrumentation implementation. PR also optimize some of the debugger instrumentations previously implemented reducing locking in scenarios where continuation chains are handled.
PR adds a number of new tests validating that the current debugger and TPL instrumentation is still working.
PR also adds preparation for async profiler instrumentation in the AsyncInstrumentation type. This type will be used by more scenarios going forward.