Fix flaky MountVolume tests: replace fixed delays with polling loops#125914
Fix flaky MountVolume tests: replace fixed delays with polling loops#125914danmoseley merged 4 commits intomainfrom
Conversation
… loops - Delete_MountVolume.cs: Replace 7 fixed Task.Delay(300).Wait() calls with WaitForDirectoryGone() polling helper (10s timeout, 100ms intervals using Environment.TickCount64) before Directory.Exists assertions. Also replace Task.Delay in DeleteDir retry loop with Thread.Sleep. - ReparsePoints_MountVolume.cs: Add retry logic to DeleteDir with IOException handling and Thread.Sleep backoff to handle mount teardown timing. Fixes #125295, #125624 Co-authored-by: danmoseley <6385855+danmoseley@users.noreply.github.com> Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/90f5ab9b-bd66-4286-9b2c-72778993ada3
|
Tagging subscribers to this area: @dotnet/area-system-io |
...aries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/ReparsePoints_MountVolume.cs
Show resolved
Hide resolved
|
@copilot address feedback |
Co-authored-by: danmoseley <6385855+danmoseley@users.noreply.github.com> Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/1cc6fac5-ff28-4293-b1dc-b35cdb75dd67
There was a problem hiding this comment.
Pull request overview
Reduces flakiness in Windows-only NTFS MountVolume filesystem tests by replacing fixed sleeps with polling/retry logic around directory deletion and cleanup.
Changes:
Delete_MountVolume.cs: replaces multiple fixed 300ms delays with aWaitForDirectoryGonepolling helper (up to 10s), and switches a retry-loop delay toThread.Sleep.ReparsePoints_MountVolume.cs: hardens cleanup by retryingDirectory.Deleteon transientIOExceptionduring unmount teardown.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/Delete_MountVolume.cs | Introduces a polling helper and removes fixed-delay assumptions after deletes via mount points. |
| src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/ReparsePoints_MountVolume.cs | Adds targeted retry logic for transient unmount-related IOException during directory cleanup. |
src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/Delete_MountVolume.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Runtime/tests/System.IO.FileSystem.Tests/Directory/Delete_MountVolume.cs
Outdated
Show resolved
Hide resolved
|
@copilot OK address feedback and resolve it |
…tForDirectoryGone Co-authored-by: danmoseley <6385855+danmoseley@users.noreply.github.com> Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/b0df647a-d5ae-4269-8916-d11d32a84cd5
|
@copilot can you verify if this is going to fix the errors in https://helixr18s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-126092-merge-8b4026dc5a5b435f8d/System.IO.FileSystem.Tests/1/console.7d4c5526.log?helixlogtype=result which is a log from #125295. |
This specific log shows two errors that are not the primary timing pattern this PR fixes:
The This PR primarily addresses the most commonly reported failure mode from #125295 — the |
|
I can do a follow up for the remainder mentioned if we merge this. |
|
@copilot confused by your analysis above about what remains to be addressed more this is merged. It closed #125913. You mention #125295 and maybe issues not covered by that. What related problems do you believe are not fixed by this pr? I'd there another relevant PR already? What issues are they tracked by? |
…126660) > [!NOTE] > This PR was created with Copilot assistance. ## Fix deterministic MountVolume test failures on ARM64 Helix machines Fixes #125295, fixes #125624, fixes #126627 ### Problem `Directory_Delete_MountVolume.RunTest` and `Directory_ReparsePoints_MountVolume.runTest` fail deterministically (~100% of the time, ~750ms duration) on the `Windows.11.Arm64.Open` Helix machine pool. This is **not timing-related** and was not addressed by the delay/polling fixes in #125914 or the Unmount resilience fix in #125625 (those PRs fixed real timing issues -- pre-fix failures on other configurations have since expired from AzDO retention, so we can't verify directly, but there is no evidence they were ineffective for their intended purpose). **Root cause**: The ARM64 Helix machines have an E:\ drive (likely an Azure resource/temp disk) that passes all `DriveInfo` checks -- `DriveType=Fixed`, `DriveFormat=NTFS`, `IsReady=True` -- but `GetVolumeNameForVolumeMountPoint` fails with `ERROR_INVALID_PARAMETER` (87). The drive has no volume GUID and doesn't support volume mount point operations. `IOServices.GetNtfsDriveOtherThanCurrent()` returns this drive, and the test crashes trying to use it. Some ARM64 Helix machines have only C:\ and a CD-ROM (no second drive at all). On those machines, the cross-drive scenarios already skip gracefully and only same-drive scenarios 3.x run. ### Evidence Analyzed Helix console logs from 5 post-fix builds (all `arm64-NativeAOT-Win11`, same C:\ volume GUID). Every failure shows the identical pattern: - Scenario 1: `GetVolumeNameForVolumeMountPoint("E:\")` -> error 87 - Scenario 2: `SetVolumeMountPoint` onto E:\ succeeds but path traversal through the mount point fails with `DirectoryNotFoundException` - Scenarios 3.x (same-drive): Always pass Reproduced locally by removing the real E: drive letter and creating `SUBST E:` which exhibits identical error 87 behavior. ### Changes 1. **`IOServices.GetNtfsDriveOtherThan()`**: After the existing Fixed/Ready/NTFS checks, also verify the drive has a volume GUID via `GetVolumeNameForVolumeMountPoint`. Drives without one (SUBST drives, Azure resource disks) are skipped. 2. **`DumpDriveInformation` diagnostic test**: New Helix-only test (following the `DescriptionNameTests.DumpRuntimeInformationToConsole` pattern) that dumps all drives with their volume GUIDs to the console log. Makes future drive-related CI issues immediately diagnosable from the same Helix work item log. 3. **`GetVolumeNameForVolumeMountPoint` P/Invoke in DllImports.cs**: Uses `char[]` (not `StringBuilder`) because this file uses `LibraryImport` which does not support `StringBuilder`. ### Local validation | Scenario | Before fix | After fix | |---|---|---| | SUBST E: (no volume GUID) | Error 87 / DirectoryNotFoundException | Pass (SUBST filtered, scenarios 3.x run) | | Real NTFS E: | Pass (all scenarios) | Pass (all scenarios) | | Single-drive machine | Scenarios 1/2 skip, 3.x pass | Same -- no change | --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Directory_Delete_MountVolumeandDirectory_ReparsePoints_MountVolumeare flaky on loaded CI machines because fixed-duration waits are insufficient for NTFS mount point operations to propagate.Delete_MountVolume.cs7 locations used
Task.Delay(300).Wait()before asserting!Directory.Exists()after deleting through a mount point. 300 ms is not enough under load.WaitForDirectoryGone(string path): pollsDirectory.Existsevery 100 ms for up to 60 s usingStopwatchfor readable elapsed trackingWaitForDirectoryGone(<path>)Task.Delay(300).Wait()in theDeleteDirretry loop withThread.Sleep(300); removed unusedSystem.Threading.TasksimportReparsePoints_MountVolume.csDeleteDir(called infinallyblocks afterMountHelper.Unmount) had no retry logic, so transientIOExceptionduring volume teardown would fail cleanup silently or throw.Directory.Deletewith a retry loop: catchesIOExceptionspecifically (which can occur transiently when the volume is still being unmounted — the reparse point directory may be momentarily locked by the kernel while the mount is being torn down), retries up to 10× with 200 ms back-offcatch (IOException)block documenting the observed transient failure modeusing System.ThreadingChanges
Directory/Delete_MountVolume.cs— polling helper (60 s timeout,Stopwatch-based) + 7 delay replacementsDirectory/ReparsePoints_MountVolume.cs— robustDeleteDirwithIOExceptionretry and explanatory commentTesting
Tests are Windows-only (
[PlatformSpecific(TestPlatforms.Windows)], requires NTFS) and require an elevated environment with mount point access. Validation requires observing reduced flakiness on CI.Original prompt
📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.