Conversation
…P_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
…ignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes tvOS/iOS 18+ “false failures” in XHarness’ Apple test/run flows by making device-result collection resilient to flaky log streaming and devicectl file-copy issues, so successful runs aren’t misreported as TIMED_OUT/APP_CRASH.
Changes:
- Add timeouts to
DeviceLogCapturer.StopCapture()log collection/reading to prevent indefinite blocking on broken tvOS log streaming. - Add retry-with-backoff to results copying via
devicectl, with injectable delays for unit tests. - Preserve “successful completion” state in
TestReporter.ParseResult()when the results file is missing after completion, and avoid crash-report retrieval noise inAppTesterwhen success is already confirmed.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/Microsoft.DotNet.XHarness.iOS.Shared/Logging/DeviceLogCapturer.cs |
Adds 2-minute timeouts to log collect / log show to prevent hangs on unresponsive devices. |
src/Microsoft.DotNet.XHarness.iOS.Shared/ResultFileHandler.cs |
Retries results-file copying (esp. devicectl) with backoff to handle transient device communication failures. |
src/Microsoft.DotNet.XHarness.iOS.Shared/TestReporter.cs |
Prevents “missing results file” from overriding an already-confirmed successful completion. |
src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs |
Skips crash-report retrieval/logging when test completion was already confirmed but result copy failed. |
tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/ResultFileHandlerTests.cs |
Updates/extends unit tests to cover retry behavior and new log messages. |
tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/TestReporterTests.cs |
Adds coverage ensuring “completed but results unavailable” stays Succeeded. |
.gitignore |
Ignores .nuget/ directory. |
You can also share your feedback on Copilot code review. Take the survey.
|
@steveisok is this going to fix this issue as well? #1552 |
I don't know if it'll fix it totally, but certainly makes things better. When it hits this it is misleading. |
|
@copilot if a helix queue is using osx.13., please change it to osx.15. |
- DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
|
The simulator.commands failure is known cc: #1545. |
|
/ba-g Known issues |
…P_CRASH after successful runs (dotnet#1551) * Initial plan * Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Improve test method naming per code review feedback Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Remove accidentally committed .nuget/nuget.exe binary and add to .gitignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Address PR review feedback - DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
…P_CRASH after successful runs (dotnet#1551) * Initial plan * Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Improve test method naming per code review feedback Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Remove accidentally committed .nuget/nuget.exe binary and add to .gitignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Address PR review feedback - DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
…P_CRASH after successful runs (dotnet#1551) * Initial plan * Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Improve test method naming per code review feedback Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Remove accidentally committed .nuget/nuget.exe binary and add to .gitignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Address PR review feedback - DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
DeviceLogCapturer.StopCapture()solog collect/log showdon't hang indefinitely (fixesapple runTIMED_OUT)ResultFileHandler.CopyResultsAsync()for transient devicectl errors (fixesapple testAPP_CRASH path)TestReporter.ParseResult()to not overrideSuccess=truewithcrashed=truewhen test log file is missing due to device communication failure (fixesapple testAPP_CRASH)AppTester.RunDeviceTests()andRunSimulatorTests()to skip misleading crash report logic when test run already completedDeviceLogCapturer: AddWaitForExit()flush after timed wait for async buffer drainDeviceLogCapturer: Fix_outputPathcleanup on all early-return paths (extractCleanupOutputPath())ResultFileHandler: MakeDefaultRetryDelaysMsprivate; clone array to prevent mutationosx.13.amd64.open→osx.15.amd64.openOriginal prompt
This section details on the original issue you should resolve
<issue_title>tvOS: Device communication failures cause false TIMED_OUT and APP_CRASH results after successful test runs</issue_title>
<issue_description>## Summary
When running tests on tvOS devices via
apple run --signal-app-end --expected-exit-code 42, XHarness correctly observesmlaunchexiting with the expected exit code (42), but then blocks indefinitely on a background device log reading task instead of using the mlaunch exit code. If the device log stream produces no data (empty log file), XHarness never completes and the Helix workload timeout kills it after 720 seconds.Root Cause
Comparing passing and failing runs on tvOS 26.2 / macOS 15.7 / Xcode 26.0.1:
Passing (DNCENGTVOS-106):
mlaunchexits with 42 ✅Wall Clock adjustment detectederror but file is non-emptyFailed to determine the exit code from .../device-DNCENGTVOS-106-*.logFailing (DNCENGTVOS-036):
mlaunchexits with 42 ✅Both runs emit the warning:
Exit code detection is not working on iOS/tvOS 15+ so the run will fail to match it with the expected valueThe critical difference is that the device log stream works on some Apple TV devices but not others. When it doesn't work, XHarness has a blocking dependency on the log reader completing before it can process the mlaunch exit code.
Expected Behavior
When
mlaunchexits with the expected exit code and the device log stream is empty/non-responsive, XHarness should not block indefinitely. It should either:Reproduction
This occurs on the
osx.15.amd64.appletv.openHelix queue with tvOS 26.2, macOS 15.7, Xcode 26.0.1, and XHarness 11.0.0-prerelease.26064.3.Affected devices: DNCENGTVOS-036, DNCENGTVOS-022 (device log streaming broken)
Working devices: DNCENGTVOS-106 (device log streaming works)
All are AppleTV5,3 (4th generation), same tvOS version (26.2, build 23K54).
Evidence
Key log lines in failing run:
warn: Exit code detection is not working on iOS/tvOS 15+dbug: Process mlaunch exited with 42fail: Run timed out after 720 secondsKey log lines in passing run:
dbug: Process mlaunch exited with 42dbug: Failed to determine the exit code from .../device-DNCENGTVOS-106-*.loginfo: Application has finished with exit code: 42 (as expected)Impact
This causes ~10% failure rate in the
tvos-arm64 Release AllSubsets_CoreCLR_Smokeleg in the dotnet/runtime CI pipeline (definition 129), specifically theiOS.CoreCLR.R2R.Testwork item. Failures depend on which Apple TV device the job lands on.</issue_description><agent_instructions>Fix #1548 — tvOS device communication failures cause false test failures after successful runs.
There are two XHarness code paths affected:
apple testpath (AppTester/TestReporter): After detecting"test end tag"and logging"Test run completed", XHarness tries to copy testResults.xml from the device via devicectl. When devicectl fails (Mercury error 1000, RSD error 0xE8000003), it falls through to crash detection and returns APP_CRASH (exit 80) even though it already knows tests completed.apple runpath (AppRunner): After mlaunch exits with the expected exit code, XHarness blocks on a device log reader task. If the device log stream is empty, it hangs until the Helix timeout kills it (exit 143).The fix should be layered:
apple testpath. The connection errors are likely transient."Test run completed"is detected or mlaunch exits with expected code, don't let subsequent device communication failures escalate to APP_CRASH or TIMED_OUT. Use the known-good result.[PASS]/[FAIL]lines al...✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.