[Tests][UserEvents] Improve test timing and re-enable tests#124919
Merged
mdh1418 merged 6 commits intodotnet:mainfrom Mar 3, 2026
Merged
[Tests][UserEvents] Improve test timing and re-enable tests#124919mdh1418 merged 6 commits intodotnet:mainfrom
mdh1418 merged 6 commits intodotnet:mainfrom
Conversation
Update the Microsoft.OneCollect.RecordTrace package to 0.1.33421 which includes a fix to drain perf_event ring buffers after session disable, ensuring all buffered events are flushed to the trace file on SIGINT. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 700ms delay (EventGenerationDelayMs) before the tracee performs its event-generating action. record-trace must discover the process, send an EventPipe IPC command, register tracepoints, and enable PerfSession before the tracee writes events. Without this delay, events can be lost if the tracee writes before PerfSession is enabled — a race observed when the tracee is discovered during record-trace's /proc scan. Route all subprocess stderr to stdout with [stdout]/[stderr] tags so test output appears in the correct chronological order. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
record-trace traces system-wide (pid=-1), so the resulting nettrace file contains events from ALL processes on the system. Pass the tracee PID to each test's trace validator so it can filter events to only those from the test process, preventing false positives and false negatives caused by concurrent event sources. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
EnsureCleanDiagnosticPorts deletes leftover Unix domain sockets from /tmp/dotnet-diagnostic-*. On multi-user systems, these files may be owned by other users. Catch IOException and UnauthorizedAccessException when deleting files to avoid test failures from permission errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove CLRTestTargetUnsupported that was set in response to persistent flakiness (issue dotnet#123442). The root causes have been addressed: - OneCollect 0.1.33421 drains ring buffers on SIGINT - Synchronization delays cover tracepoint registration and PerfSession::enable ordering - PID filtering prevents cross-process event contamination Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request fixes flaky user events tracing tests that were failing intermittently on CI. The tests were originally disabled due to issue #123442 where traces would not contain expected events. The PR identifies and addresses three independent root causes: race conditions in trace setup timing, cross-process event contamination, and permission errors when cleaning diagnostic ports.
Changes:
- Increased tracee event generation delay from 200ms to 700ms based on empirical measurements to prevent race conditions
- Added ProcessID filtering to all trace validators to eliminate cross-process event contamination
- Added exception handling for diagnostic port cleanup to handle permission errors on shared CI machines
- Improved test output ordering by routing stderr to stdout with clear tags for better diagnostics
- Bumped record-trace version from 0.1.33304 to 0.1.33421 for EventPipe IPC handling fixes
- Re-enabled the tests by removing the
CLRTestTargetUnsupportedproperty
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/tests/tracing/userevents/Directory.Build.props | Re-enables the user events tests by removing the CLRTestTargetUnsupported property that was added due to issue #123442 |
| eng/Versions.props | Updates MicrosoftOneCollectRecordTraceVersion from 0.1.33304 to 0.1.33421 to pick up EventPipe IPC handling fixes |
| src/tests/tracing/userevents/common/UserEventsTestRunner.cs | Core infrastructure changes: increases event generation delay to 700ms, adds PID parameter to validator signature, improves output tagging, adds exception handling for diagnostic port cleanup |
| src/tests/tracing/userevents/basic/basic.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/activity/activity.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/custommetadata/custommetadata.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/managedevent/managedevent.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/multithread/multithread.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
This was referenced Feb 27, 2026
noahfalk
reviewed
Feb 27, 2026
Replace the fixed 700ms Thread.Sleep before the event-generating action with two complementary synchronization mechanisms: 1. EventCommandExecuted callback: The tracee waits on a ManualResetEventSlim that is set when the EventSource receives an Enable command via IPC, confirming tracepoints are registered. 2. Orchestrator delay (300ms): Delays tracee startup to let record-trace complete its /proc scan and enable PerfSession (ring buffers). This prevents the tracee from being discovered during the /proc scan, where it could receive IPC before ring buffers are active. All five tests now pass their EventSource to the runner. The basic test uses EventSource.GetSources() to find the NativeRuntimeEventSource since it is internal to System.Private.CoreLib. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced Feb 27, 2026
Member
Author
|
/azp run runtime-nativeaot-outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
noahfalk
approved these changes
Mar 3, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix flaky userevents tracing tests
Fixes #123442
Root Causes
Investigation identified three independent causes of test flakiness:
1. Race between tracee event generation and record-trace setup
record-trace must discover the tracee process, send an EventPipe IPC command (which the runtime processes to register user_events tracepoints), and enable its PerfSession (ring buffer collection) before the tracee writes any events. The tracee has no callback to know when these steps are complete. With the original 200ms delay, the tracee could write events before PerfSession was enabled — a race observed when the tracee is discovered during record-trace's
/procscan. Events written in this window are silently lost.Empirically measured on a 2-core system, which CI runs on and hit the failures more frequently than my local WSL2 environment, both tracepoint registration and PerfSession enable completed within 229ms (p99). The delay is increased to 700ms, providing a ~3x safety margin.
2. Cross-process event contamination
record-trace captures events from all .NET processes on the machine, not just the tracee. On CI machines running multiple .NET processes, the trace validators could match events from unrelated processes or fail to find the tracee's events among the noise.
Each validator now receives the tracee's PID and filters events by
ProcessID, ignoring events from other processes.3. Permission errors cleaning up diagnostic ports
EnsureCleanDiagnosticPortsdeletes zombie diagnostic IPC sockets in/tmp, but on shared CI machines these sockets may be owned by other users.UnauthorizedAccessExceptionandIOExceptionduring deletion would crash the test before it even started.These exceptions are now caught and logged, allowing the test to proceed.
Additional Improvements
[stdout]/[stderr]tags so test output appears in chronological order, making timing-sensitive failures easier to diagnose.Validation
Stress-tested with 500 consecutive runs (100 per scenario) on a 2-core system — 500/500 passed (0% failure rate, up from ~65% with the original code).