Fix flaky MountVolume tests: replace fixed delays with polling loops#125914

Merged
danmoseley merged 4 commits into main from copilot/fix-mountvolume-test-flakiness
Mar 28, 2026

Contributor

Copilot AI commented Mar 22, 2026

Description

Directory_Delete_MountVolume and Directory_ReparsePoints_MountVolume are flaky on loaded CI machines because fixed-duration waits are insufficient for NTFS mount point operations to propagate.

Delete_MountVolume.cs

7 locations used Task.Delay(300).Wait() before asserting !Directory.Exists() after deleting through a mount point. 300 ms is not enough under load.

  • Added WaitForDirectoryGone(string path): polls Directory.Exists every 100 ms for up to 60 s using Stopwatch for readable elapsed tracking
  • Replaced all 7 fixed delays with WaitForDirectoryGone(<path>)
  • Replaced Task.Delay(300).Wait() in the DeleteDir retry loop with Thread.Sleep(300); removed unused System.Threading.Tasks import

// Before
Directory.Delete(dirNameReferredFromMountedDrive, true);
Task.Delay(300).Wait();
Eval(!Directory.Exists(dirName), "Err_20387g! ...");

// After
Directory.Delete(dirNameReferredFromMountedDrive, true);
WaitForDirectoryGone(dirName);
Eval(!Directory.Exists(dirName), "Err_20387g! ...");

// WaitForDirectoryGone implementation
Stopwatch sw = Stopwatch.StartNew();
while (Directory.Exists(path) && sw.Elapsed < TimeSpan.FromSeconds(60))
    Thread.Sleep(100);
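
For readers who want to try the polling pattern outside the test suite, here is a minimal, self-contained sketch. It is not the helper from the PR: the class name `DirectoryPolling`, the `bool` return value, and the `timeout`/`pollInterval` parameters are assumptions added for illustration (the PR's helper takes only a path and uses a fixed 60 s / 100 ms).

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Hypothetical wrapper class; the PR adds the helper directly to the test class.
public static class DirectoryPolling
{
    // Polls Directory.Exists until the directory disappears or the timeout elapses.
    // Returns true if the directory is gone, false if the timeout was reached first.
    // (Returning bool is an assumption; the original helper's return type is not shown.)
    public static bool WaitForDirectoryGone(string path, TimeSpan timeout, TimeSpan pollInterval)
    {
        Stopwatch sw = Stopwatch.StartNew();
        while (Directory.Exists(path))
        {
            if (sw.Elapsed >= timeout)
            {
                return false;
            }

            Thread.Sleep(pollInterval);
        }

        return true;
    }
}
```

A caller mirroring the values in this PR would use `DirectoryPolling.WaitForDirectoryGone(dirName, TimeSpan.FromSeconds(60), TimeSpan.FromMilliseconds(100))` before its `Directory.Exists` assertion. Because the loop exits as soon as the directory is gone, the long timeout only costs time when the test is about to fail anyway.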

ReparsePoints_MountVolume.cs

DeleteDir (called in finally blocks after MountHelper.Unmount) had no retry logic, so transient IOException during volume teardown would fail cleanup silently or throw.

  • Replaced the single-shot Directory.Delete with a retry loop that catches IOException specifically and retries up to 10× with a 200 ms back-off. The exception can occur transiently while the volume is still being unmounted: the reparse point directory may be momentarily locked by the kernel while the mount is torn down.
  • Added explanatory comment on the catch (IOException) block documenting the observed transient failure mode
  • Added using System.Threading
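
The retry behavior described in the bullets above can be sketched roughly as follows. This is an illustration, not the code from the PR: the names `CleanupHelper` and `DeleteDirWithRetry` are hypothetical, and letting the final failure propagate is an assumption, since the description does not say what happens after the last attempt.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper class; the PR modifies the test's existing DeleteDir method.
public static class CleanupHelper
{
    // Deletes a directory tree, retrying on IOException with a fixed delay between attempts.
    // Defaults mirror the values in the PR description (10 attempts, 200 ms).
    public static void DeleteDirWithRetry(string path, int maxAttempts = 10, int delayMs = 200)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                if (Directory.Exists(path))
                {
                    Directory.Delete(path, recursive: true);
                }

                return;
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // Assumed transient: the directory may still be locked while the
                // volume mount is being torn down. Wait, then try again.
                Thread.Sleep(delayMs);
            }
            // On the last attempt the filter is false, so the IOException propagates
            // to the caller (an assumption; the PR may handle this differently).
        }
    }
}
```

Catching only IOException keeps unrelated failures, such as UnauthorizedAccessException, visible on the first attempt instead of hiding them behind retries.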

Changes

  • Directory/Delete_MountVolume.cs — polling helper (60 s timeout, Stopwatch-based) + 7 delay replacements
  • Directory/ReparsePoints_MountVolume.cs — robust DeleteDir with IOException retry and explanatory comment

Testing

Tests are Windows-only ([PlatformSpecific(TestPlatforms.Windows)], requires NTFS) and require an elevated environment with mount point access. Validation requires observing reduced flakiness on CI.

Original prompt

This section details on the original issue you should resolve

<issue_title>MountVolume test flakiness: Delete_MountVolume and ReparsePoints_MountVolume</issue_title>
<issue_description>> [!NOTE]

This issue was created with Copilot assistance.

Tracking flaky MountVolume tests that are hitting blocking-clean-ci:

Root cause (Delete_MountVolume): 7 locations use a fixed Task.Delay(300) after Directory.Delete through NTFS mount points, then assert !Directory.Exists(). 300ms is insufficient on loaded CI machines.

Root cause (ReparsePoints_MountVolume): Needs investigation — no Task.Delay in the test; may be mount setup/teardown timing.

Proposed fix: Replace fixed delays with a polling loop (retry Directory.Exists with backoff, ~10s timeout).

Known Issue Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

</issue_description>

<agent_instructions>create a pr with the recommended test-only fix. the PR should be set to close the open bugs when merged</agent_instructions>

Comments on the Issue (you are @copilot in this section)


… loops

- Delete_MountVolume.cs: Replace 7 fixed Task.Delay(300).Wait() calls with
  WaitForDirectoryGone() polling helper (10s timeout, 100ms intervals using
  Environment.TickCount64) before Directory.Exists assertions. Also replace
  Task.Delay in DeleteDir retry loop with Thread.Sleep.
- ReparsePoints_MountVolume.cs: Add retry logic to DeleteDir with IOException
  handling and Thread.Sleep backoff to handle mount teardown timing.

Fixes #125295, #125624

Co-authored-by: danmoseley <6385855+danmoseley@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/90f5ab9b-bd66-4286-9b2c-72778993ada3
Copilot AI changed the title [WIP] Fix MountVolume test flakiness in Delete_MountVolume and ReparsePoints_MountVolume Fix flaky MountVolume tests: replace fixed delays with polling loops Mar 22, 2026
Copilot AI requested a review from danmoseley March 22, 2026 04:46
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

@danmoseley
Member

@copilot address feedback

Copilot AI requested a review from danmoseley March 22, 2026 05:34
@danmoseley danmoseley marked this pull request as ready for review March 22, 2026 05:43
Contributor

Copilot AI left a comment

Pull request overview

Reduces flakiness in Windows-only NTFS MountVolume filesystem tests by replacing fixed sleeps with polling/retry logic around directory deletion and cleanup.

Changes:

  • Delete_MountVolume.cs: replaces multiple fixed 300ms delays with a WaitForDirectoryGone polling helper (up to 10s), and switches a retry-loop delay to Thread.Sleep.
  • ReparsePoints_MountVolume.cs: hardens cleanup by retrying Directory.Delete on transient IOException during unmount teardown.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/Delete_MountVolume.cs | Introduces a polling helper and removes fixed-delay assumptions after deletes via mount points. |
| src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/ReparsePoints_MountVolume.cs | Adds targeted retry logic for transient unmount-related IOException during directory cleanup. |

@danmoseley
Member

@copilot OK address feedback and resolve it

…tForDirectoryGone

Co-authored-by: danmoseley <6385855+danmoseley@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/b0df647a-d5ae-4269-8916-d11d32a84cd5
Copilot AI requested a review from danmoseley March 23, 2026 03:30
@danmoseley danmoseley enabled auto-merge (squash) March 24, 2026 02:01
@jozkee
Member

jozkee commented Mar 28, 2026

Contributor Author

Copilot AI commented Mar 28, 2026

@copilot can you verify if this is going to fix the errors in https://helixr18s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-126092-merge-8b4026dc5a5b435f8d/System.IO....

This specific log shows two errors that are not the primary timing pattern this PR fixes:

  1. Scenario 1 (Err_768lme!): MountHelper.Unmount throws Win32 error: 4390 (ERROR_REPARSE_POINT_NOT_RESOLVED) in the finally cleanup block (original line 85)
  2. Scenario 2 (Err_231vwf!): Directory.Delete itself throws DirectoryNotFoundException when trying to delete through the mount path (original line 124)

The Eval(!Directory.Exists(...)) checks actually passed in this run — there are no Err_20387g!/Err_794aiu! messages in the output. So this particular failure instance was caused by mount infrastructure issues (unmount failure, path resolution through mount) rather than the fixed-delay timing pattern.

This PR primarily addresses the most commonly reported failure mode from #125295 — the Task.Delay(300) before Exists checks. By extending the wait to 60s with polling, filesystem operations are also more likely to fully complete before cleanup runs, which could indirectly reduce the Unmount failures. But the specific errors in this log (Win32 4390 from Unmount, DirectoryNotFoundException during Delete) represent a less common failure mode that may need separate investigation if it persists after this PR.

Copilot AI requested a review from jozkee March 28, 2026 04:01
@danmoseley
Member

I can do a follow up for the remainder mentioned if we merge this.

@danmoseley danmoseley merged commit a3741f7 into main Mar 28, 2026
83 of 86 checks passed
@danmoseley danmoseley deleted the copilot/fix-mountvolume-test-flakiness branch March 28, 2026 04:59
@danmoseley
Member

@copilot I'm confused by your analysis above about what remains to be addressed now that this is merged. It closed #125913. You mention #125295 and maybe issues not covered by that.

What related problems do you believe are not fixed by this PR? Is there another relevant PR already? What issues are they tracked by?

danmoseley added a commit that referenced this pull request Apr 11, 2026
…126660)

> [!NOTE]
> This PR was created with Copilot assistance.

## Fix deterministic MountVolume test failures on ARM64 Helix machines

Fixes #125295, fixes #125624, fixes #126627

### Problem

`Directory_Delete_MountVolume.RunTest` and
`Directory_ReparsePoints_MountVolume.runTest` fail deterministically
(~100% of the time, ~750ms duration) on the `Windows.11.Arm64.Open`
Helix machine pool. This is **not timing-related** and was not addressed
by the delay/polling fixes in #125914 or the Unmount resilience fix in
#125625 (those PRs fixed real timing issues -- pre-fix failures on other
configurations have since expired from AzDO retention, so we can't
verify directly, but there is no evidence they were ineffective for
their intended purpose).

**Root cause**: The ARM64 Helix machines have an E:\ drive (likely an
Azure resource/temp disk) that passes all `DriveInfo` checks --
`DriveType=Fixed`, `DriveFormat=NTFS`, `IsReady=True` -- but
`GetVolumeNameForVolumeMountPoint` fails with `ERROR_INVALID_PARAMETER`
(87). The drive has no volume GUID and doesn't support volume mount
point operations. `IOServices.GetNtfsDriveOtherThanCurrent()` returns
this drive, and the test crashes trying to use it.

Some ARM64 Helix machines have only C:\ and a CD-ROM (no second drive at
all). On those machines, the cross-drive scenarios already skip
gracefully and only same-drive scenarios 3.x run.

### Evidence

Analyzed Helix console logs from 5 post-fix builds (all
`arm64-NativeAOT-Win11`, same C:\ volume GUID). Every failure shows the
identical pattern:
- Scenario 1: `GetVolumeNameForVolumeMountPoint("E:\")` -> error 87
- Scenario 2: `SetVolumeMountPoint` onto E:\ succeeds but path traversal
through the mount point fails with `DirectoryNotFoundException`
- Scenarios 3.x (same-drive): Always pass

Reproduced locally by removing the real E: drive letter and creating
`SUBST E:` which exhibits identical error 87 behavior.

### Changes

1. **`IOServices.GetNtfsDriveOtherThan()`**: After the existing
Fixed/Ready/NTFS checks, also verify the drive has a volume GUID via
`GetVolumeNameForVolumeMountPoint`. Drives without one (SUBST drives,
Azure resource disks) are skipped.

2. **`DumpDriveInformation` diagnostic test**: New Helix-only test
(following the `DescriptionNameTests.DumpRuntimeInformationToConsole`
pattern) that dumps all drives with their volume GUIDs to the console
log. Makes future drive-related CI issues immediately diagnosable from
the same Helix work item log.

3. **`GetVolumeNameForVolumeMountPoint` P/Invoke in DllImports.cs**:
Uses `char[]` (not `StringBuilder`) because this file uses
`LibraryImport` which does not support `StringBuilder`.
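
As a rough illustration of the volume GUID check described in the list above, the sketch below probes a drive root with `GetVolumeNameForVolumeMountPointW`. It is not the code from the commit: it uses classic `DllImport` rather than the `LibraryImport` the commit mentions (to stay self-contained without the source generator), and the names `VolumeGuidProbe` and `TryGetVolumeGuid` are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper; the commit adds the check inside IOServices.GetNtfsDriveOtherThan().
public static class VolumeGuidProbe
{
    // A volume GUID path ("\\?\Volume{GUID}\") is about 50 characters; 260 leaves ample room.
    private const int BufferLength = 260;

    // Assumption: DllImport is used here for simplicity; the commit uses LibraryImport.
    [DllImport("kernel32.dll", EntryPoint = "GetVolumeNameForVolumeMountPointW",
        CharSet = CharSet.Unicode, SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool GetVolumeNameForVolumeMountPoint(
        string volumeMountPoint, [Out] char[] volumeName, uint bufferLength);

    // driveRoot must end with a trailing backslash, for example "C:\".
    // Returns false on non-Windows platforms or when the drive has no volume GUID.
    public static bool TryGetVolumeGuid(string driveRoot, out string volumeName)
    {
        volumeName = string.Empty;
        if (!OperatingSystem.IsWindows())
        {
            return false;
        }

        char[] buffer = new char[BufferLength];
        if (!GetVolumeNameForVolumeMountPoint(driveRoot, buffer, (uint)buffer.Length))
        {
            // For drives such as SUBST mappings, the commit reports error 87 here.
            return false;
        }

        int end = Array.IndexOf(buffer, '\0');
        volumeName = new string(buffer, 0, end < 0 ? buffer.Length : end);
        return true;
    }
}
```

A drive-selection loop could then skip any candidate for which `TryGetVolumeGuid` returns false, in addition to the existing Fixed/Ready/NTFS checks.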

### Local validation

| Scenario | Before fix | After fix |
|---|---|---|
| SUBST E: (no volume GUID) | Error 87 / DirectoryNotFoundException | Pass (SUBST filtered, scenarios 3.x run) |
| Real NTFS E: | Pass (all scenarios) | Pass (all scenarios) |
| Single-drive machine | Scenarios 1/2 skip, 3.x pass | Same -- no change |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

MountVolume test flakiness: Delete_MountVolume and ReparsePoints_MountVolume

5 participants