Skip to content

[Tests][UserEvents] Improve test timing and re-enable tests#124919

Merged
mdh1418 merged 6 commits intodotnet:mainfrom
mdh1418:fix/userevents-flaky-tests
Mar 3, 2026
Merged

[Tests][UserEvents] Improve test timing and re-enable tests#124919
mdh1418 merged 6 commits intodotnet:mainfrom
mdh1418:fix/userevents-flaky-tests

Conversation

@mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Feb 26, 2026

Fix flaky userevents tracing tests

Fixes #123442

Root Causes

Investigation identified three independent causes of test flakiness:

1. Race between tracee event generation and record-trace setup

record-trace must discover the tracee process, send an EventPipe IPC command (which the runtime processes to register user_events tracepoints), and enable its PerfSession (ring buffer collection) before the tracee writes any events. The tracee has no callback to know when these steps are complete. With the original 200ms delay, the tracee could write events before PerfSession was enabled — a race observed when the tracee is discovered during record-trace's /proc scan. Events written in this window are silently lost.

Empirically measured on a 2-core system, which CI runs on and hit the failures more frequently than my local WSL2 environment, both tracepoint registration and PerfSession enable completed within 229ms (p99). The delay is increased to 700ms, providing a ~3x safety margin.

2. Cross-process event contamination

record-trace captures events from all .NET processes on the machine, not just the tracee. On CI machines running multiple .NET processes, the trace validators could match events from unrelated processes or fail to find the tracee's events among the noise.

Each validator now receives the tracee's PID and filters events by ProcessID, ignoring events from other processes.

3. Permission errors cleaning up diagnostic ports

EnsureCleanDiagnosticPorts deletes zombie diagnostic IPC sockets in /tmp, but on shared CI machines these sockets may be owned by other users. UnauthorizedAccessException and IOException during deletion would crash the test before it even started.

These exceptions are now caught and logged, allowing the test to proceed.

Additional Improvements

  • Output ordering: Subprocess stderr is routed to stdout with [stdout]/[stderr] tags so test output appears in chronological order, making timing-sensitive failures easier to diagnose.
  • record-trace version: Bumped from 0.1.33304 to 0.1.33421 to pick up fixes for EventPipe IPC handling.

Validation

Stress-tested with 500 consecutive runs (100 per scenario) on a 2-core system — 500/500 passed (0% failure rate, up from ~65% with the original code).

mdh1418 and others added 5 commits February 26, 2026 02:28
Update the Microsoft.OneCollect.RecordTrace package to 0.1.33421 which
includes a fix to drain perf_event ring buffers after session disable,
ensuring all buffered events are flushed to the trace file on SIGINT.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 700ms delay (EventGenerationDelayMs) before the tracee performs its
event-generating action. record-trace must discover the process, send an
EventPipe IPC command, register tracepoints, and enable PerfSession before
the tracee writes events. Without this delay, events can be lost if the
tracee writes before PerfSession is enabled — a race observed when the
tracee is discovered during record-trace's /proc scan.

Route all subprocess stderr to stdout with [stdout]/[stderr] tags so test
output appears in the correct chronological order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
record-trace traces system-wide (pid=-1), so the resulting nettrace file
contains events from ALL processes on the system. Pass the tracee PID to
each test's trace validator so it can filter events to only those from
the test process, preventing false positives and false negatives caused
by concurrent event sources.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
EnsureCleanDiagnosticPorts deletes leftover Unix domain sockets from
/tmp/dotnet-diagnostic-*. On multi-user systems, these files may be
owned by other users. Catch IOException and UnauthorizedAccessException
when deleting files to avoid test failures from permission errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove CLRTestTargetUnsupported that was set in response to persistent
flakiness (issue dotnet#123442). The root causes have been addressed:
- OneCollect 0.1.33421 drains ring buffers on SIGINT
- Synchronization delays cover tracepoint registration and
  PerfSession::enable ordering
- PID filtering prevents cross-process event contamination

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes flaky user events tracing tests that were failing intermittently on CI. The tests were originally disabled due to issue #123442 where traces would not contain expected events. The PR identifies and addresses three independent root causes: race conditions in trace setup timing, cross-process event contamination, and permission errors when cleaning diagnostic ports.

Changes:

  • Increased tracee event generation delay from 200ms to 700ms based on empirical measurements to prevent race conditions
  • Added ProcessID filtering to all trace validators to eliminate cross-process event contamination
  • Added exception handling for diagnostic port cleanup to handle permission errors on shared CI machines
  • Improved test output ordering by routing stderr to stdout with clear tags for better diagnostics
  • Bumped record-trace version from 0.1.33304 to 0.1.33421 for EventPipe IPC handling fixes
  • Re-enabled the tests by removing the CLRTestTargetUnsupported property

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/tests/tracing/userevents/Directory.Build.props Re-enables the user events tests by removing the CLRTestTargetUnsupported property that was added due to issue #123442
eng/Versions.props Updates MicrosoftOneCollectRecordTraceVersion from 0.1.33304 to 0.1.33421 to pick up EventPipe IPC handling fixes
src/tests/tracing/userevents/common/UserEventsTestRunner.cs Core infrastructure changes: increases event generation delay to 700ms, adds PID parameter to validator signature, improves output tagging, adds exception handling for diagnostic port cleanup
src/tests/tracing/userevents/basic/basic.cs Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages
src/tests/tracing/userevents/activity/activity.cs Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages
src/tests/tracing/userevents/custommetadata/custommetadata.cs Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages
src/tests/tracing/userevents/managedevent/managedevent.cs Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages
src/tests/tracing/userevents/multithread/multithread.cs Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages

Replace the fixed 700ms Thread.Sleep before the event-generating action
with two complementary synchronization mechanisms:

1. EventCommandExecuted callback: The tracee waits on a ManualResetEventSlim
   that is set when the EventSource receives an Enable command via IPC,
   confirming tracepoints are registered.

2. Orchestrator delay (300ms): Delays tracee startup to let record-trace
   complete its /proc scan and enable PerfSession (ring buffers). This
   prevents the tracee from being discovered during the /proc scan, where
   it could receive IPC before ring buffers are active.

All five tests now pass their EventSource to the runner. The basic test
uses EventSource.GetSources() to find the NativeRuntimeEventSource since
it is internal to System.Private.CoreLib.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mdh1418
Copy link
Member Author

mdh1418 commented Mar 2, 2026

/azp run runtime-nativeaot-outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mdh1418 mdh1418 merged commit a155715 into dotnet:main Mar 3, 2026
176 of 196 checks passed
@mdh1418 mdh1418 deleted the fix/userevents-flaky-tests branch March 3, 2026 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Test][UserEvents] Trace file does not contain expected events

4 participants