Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs by Copilot · Pull Request #1551 · dotnet/xharness

Copilot · 2026-03-06T13:37:59Z

Fix 1: Add process timeouts (2 min) to DeviceLogCapturer.StopCapture() so log collect/log show don't hang indefinitely (fixes apple run TIMED_OUT)
Fix 2: Add retry with backoff (3 attempts, 5s/10s/20s) to ResultFileHandler.CopyResultsAsync() for transient devicectl errors (fixes apple test APP_CRASH path)
Fix 3: Update TestReporter.ParseResult() to not override Success=true with crashed=true when test log file is missing due to device communication failure (fixes apple test APP_CRASH)
Update AppTester.RunDeviceTests() and RunSimulatorTests() to skip misleading crash report logic when test run already completed
Update/add unit tests for new behavior
All tests pass (285 iOS.Shared + 77 Apple tests)
No CodeQL security alerts
PR review feedback addressed:
- DeviceLogCapturer: Add WaitForExit() flush after timed wait for async buffer drain
- DeviceLogCapturer: Fix _outputPath cleanup on all early-return paths (extract CleanupOutputPath())
- ResultFileHandler: Make DefaultRetryDelaysMs private; clone array to prevent mutation
- Integration tests: Update Helix queue osx.13.amd64.open → osx.15.amd64.open

Original prompt

This section details on the original issue you should resolve

<issue_title>tvOS: Device communication failures cause false TIMED_OUT and APP_CRASH results after successful test runs</issue_title>
<issue_description>## Summary

When running tests on tvOS devices via apple run --signal-app-end --expected-exit-code 42, XHarness correctly observes mlaunch exiting with the expected exit code (42), but then blocks indefinitely on a background device log reading task instead of using the mlaunch exit code. If the device log stream produces no data (empty log file), XHarness never completes and the Helix workload timeout kills it after 720 seconds.

Root Cause

Comparing passing and failing runs on tvOS 26.2 / macOS 15.7 / Xcode 26.0.1:

Passing (DNCENGTVOS-106):

mlaunch exits with 42 ✅
XHarness reads device logs → gets Wall Clock adjustment detected error but file is non-empty
Failed to determine the exit code from .../device-DNCENGTVOS-106-*.log
Falls back to mlaunch exit code → "Application has finished with exit code: 42 (as expected)" ✅

Failing (DNCENGTVOS-036):

mlaunch exits with 42 ✅
Device log file is empty (no data streamed from Apple TV)
XHarness hangs waiting on device log reader — never reaches fallback
After 720s: "Run timed out after 720 seconds" → exit 143 ❌

Both runs emit the warning: Exit code detection is not working on iOS/tvOS 15+ so the run will fail to match it with the expected value

The critical difference is that the device log stream works on some Apple TV devices but not others. When it doesn't work, XHarness has a blocking dependency on the log reader completing before it can process the mlaunch exit code.

Expected Behavior

When mlaunch exits with the expected exit code and the device log stream is empty/non-responsive, XHarness should not block indefinitely. It should either:

Use a timeout on the device log reading task, or
Recognize the mlaunch exit code immediately without waiting for the log stream

Reproduction

This occurs on the osx.15.amd64.appletv.open Helix queue with tvOS 26.2, macOS 15.7, Xcode 26.0.1, and XHarness 11.0.0-prerelease.26064.3.

Affected devices: DNCENGTVOS-036, DNCENGTVOS-022 (device log streaming broken)
Working devices: DNCENGTVOS-106 (device log streaming works)

All are AppleTV5,3 (4th generation), same tvOS version (26.2, build 23K54).

Evidence

Failing console log (DNCENGTVOS-036): https://helix.dot.net/api/2019-06-17/jobs/b5f2318f-9ef8-451f-92d3-06f7131853a3/workitems/iOS.CoreCLR.R2R.Test/console
Failing console log (DNCENGTVOS-022): https://helix.dot.net/api/2019-06-17/jobs/240c0204-e308-4a53-9b80-6833e1552fb3/workitems/iOS.CoreCLR.R2R.Test/console
Passing console log (DNCENGTVOS-106): https://helix.dot.net/api/2019-06-17/jobs/ea2e4b68-8b18-4a8a-8113-a275b43ad007/workitems/iOS.CoreCLR.R2R.Test/console

Key log lines in failing run:

Line ~525: warn: Exit code detection is not working on iOS/tvOS 15+
Line ~1970: dbug: Process mlaunch exited with 42
Line ~1975 (12 min later): fail: Run timed out after 720 seconds

Key log lines in passing run:

Line ~1637: dbug: Process mlaunch exited with 42
Line ~1640: dbug: Failed to determine the exit code from .../device-DNCENGTVOS-106-*.log
Line ~1642: info: Application has finished with exit code: 42 (as expected)

Impact

This causes ~10% failure rate in the tvos-arm64 Release AllSubsets_CoreCLR_Smoke leg in the dotnet/runtime CI pipeline (definition 129), specifically the iOS.CoreCLR.R2R.Test work item. Failures depend on which Apple TV device the job lands on.</issue_description>

<agent_instructions>Fix #1548 — tvOS device communication failures cause false test failures after successful runs.

There are two XHarness code paths affected:

apple test path (AppTester / TestReporter): After detecting "test end tag" and logging "Test run completed", XHarness tries to copy testResults.xml from the device via devicectl. When devicectl fails (Mercury error 1000, RSD error 0xE8000003), it falls through to crash detection and returns APP_CRASH (exit 80) even though it already knows tests completed.
apple run path (AppRunner): After mlaunch exits with the expected exit code, XHarness blocks on a device log reader task. If the device log stream is empty, it hangs until the Helix timeout kills it (exit 143).

The fix should be layered:

First: Add retry with backoff (3 attempts) to the devicectl file copy in the apple test path. The connection errors are likely transient.
Second: Track completion state — once "Test run completed" is detected or mlaunch exits with expected code, don't let subsequent device communication failures escalate to APP_CRASH or TIMED_OUT. Use the known-good result.
Third (safety net): If file copy fails after retries, synthesize testResults.xml from the [PASS]/[FAIL] lines al...

Fixes tvOS: Device communication failures cause false TIMED_OUT and APP_CRASH results after successful test runs #1548

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…P_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

…ignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

kotlarmilos

LGTM!

Copilot

Pull request overview

Fixes tvOS/iOS 18+ “false failures” in XHarness’ Apple test/run flows by making device-result collection resilient to flaky log streaming and devicectl file-copy issues, so successful runs aren’t misreported as TIMED_OUT/APP_CRASH.

Changes:

Add timeouts to DeviceLogCapturer.StopCapture() log collection/reading to prevent indefinite blocking on broken tvOS log streaming.
Add retry-with-backoff to results copying via devicectl, with injectable delays for unit tests.
Preserve “successful completion” state in TestReporter.ParseResult() when the results file is missing after completion, and avoid crash-report retrieval noise in AppTester when success is already confirmed.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`src/Microsoft.DotNet.XHarness.iOS.Shared/Logging/DeviceLogCapturer.cs`	Adds 2-minute timeouts to `log collect` / `log show` to prevent hangs on unresponsive devices.
`src/Microsoft.DotNet.XHarness.iOS.Shared/ResultFileHandler.cs`	Retries results-file copying (esp. `devicectl`) with backoff to handle transient device communication failures.
`src/Microsoft.DotNet.XHarness.iOS.Shared/TestReporter.cs`	Prevents “missing results file” from overriding an already-confirmed successful completion.
`src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs`	Skips crash-report retrieval/logging when test completion was already confirmed but result copy failed.
`tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/ResultFileHandlerTests.cs`	Updates/extends unit tests to cover retry behavior and new log messages.
`tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/TestReporterTests.cs`	Adds coverage ensuring “completed but results unavailable” stays `Succeeded`.
`.gitignore`	Ignores `.nuget/` directory.

You can also share your feedback on Copilot code review. Take the survey.

src/Microsoft.DotNet.XHarness.iOS.Shared/Logging/DeviceLogCapturer.cs

src/Microsoft.DotNet.XHarness.iOS.Shared/ResultFileHandler.cs

src/Microsoft.DotNet.XHarness.iOS.Shared/Logging/DeviceLogCapturer.cs

matouskozak · 2026-03-09T09:35:39Z

@steveisok is this going to fix this issue as well? #1552

steveisok · 2026-03-09T10:59:25Z

@steveisok is this going to fix this issue as well? #1552

I don't know if it'll fix it totally, but certainly makes things better. When it hits this it is misleading.

steveisok · 2026-03-09T11:05:21Z

@copilot if a helix queue is using osx.13., please change it to osx.15.

- DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

Copilot · 2026-03-09T11:09:00Z

@copilot if a helix queue is using osx.13., please change it to osx.15.

Updated in f36173a — changed osx.13.amd64.open to osx.15.amd64.open in tests/integration-tests/Apple/Simulator.Commands.Tests.proj.

matouskozak · 2026-03-09T11:20:25Z

The simulator.commands failure is known cc: #1545.

steveisok · 2026-03-10T02:24:40Z

/ba-g Known issues

…P_CRASH after successful runs (dotnet#1551) * Initial plan * Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH results Three layered fixes: 1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes in StopCapture() to prevent indefinite blocking when Apple TV device log streaming is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code). 2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays) to devicectl file copy in CopyResultsAsync() to handle transient device communication errors (com.apple.Mercury.error 1000, RSD error 0xE8000003). 3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult() (test run completed via app end signal), but the results file is missing (devicectl failed), return Succeeded instead of Crashed. Device communication failure after successful test completion is an infrastructure issue, not an app crash. 4. AppTester: Skip crash report retrieval when test run already completed and file copy fails, replacing misleading 'app may have crashed' message with accurate warning. Adds unit tests for retry behavior and the new TestReporter.ParseResult() path. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Improve test method naming per code review feedback Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Remove accidentally committed .nuget/nuget.exe binary and add to .gitignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> * Address PR review feedback - DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure async stdout/stderr buffers are fully drained before reading them (both log collect and log show processes). - DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly. - ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor); clone the array on assignment to prevent external mutation. - Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open. Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

Initial plan

4e77994

Copilot AI assigned Copilot and steveisok Mar 6, 2026

Copilot started work on behalf of steveisok March 6, 2026 13:38 View session

steveisok requested review from kotlarmilos and matouskozak March 6, 2026 13:53

Copilot AI and others added 2 commits March 6, 2026 13:56

Improve test method naming per code review feedback

96dd3df

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

Remove accidentally committed .nuget/nuget.exe binary and add to .git…

9978ed6

…ignore Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix tvOS device communication failures causing false results~~ Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs Mar 6, 2026

Copilot finished work on behalf of steveisok March 6, 2026 13:58

kotlarmilos approved these changes Mar 6, 2026

View reviewed changes

kotlarmilos marked this pull request as ready for review March 6, 2026 15:01

matouskozak requested a review from Copilot March 6, 2026 16:32

matouskozak approved these changes Mar 6, 2026

View reviewed changes

Copilot started reviewing on behalf of matouskozak March 6, 2026 16:32 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

Copilot started work on behalf of steveisok March 9, 2026 11:05 View session

Copilot finished work on behalf of steveisok March 9, 2026 11:09

build-analysis bot mentioned this pull request Mar 9, 2026

File upload failed even after retry dotnet/dnceng#1916

Open

3 tasks

steveisok enabled auto-merge (squash) March 10, 2026 02:24

steveisok merged commit a6f9fe6 into main Mar 10, 2026
15 of 17 checks passed

steveisok deleted the copilot/fix-tvos-device-communication-issue branch March 10, 2026 02:24

matouskozak mentioned this pull request Mar 11, 2026

[release/10.0] Selective merge of main to release/10.0 #1553

Merged

kotlarmilos mentioned this pull request Mar 16, 2026

[release/9.0] Backport iOS 18+ queue compatibility #1570

Closed

kotlarmilos mentioned this pull request Mar 16, 2026

[release/8.0] Backport iOS 18+ queue compatibility #1571

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs#1551

Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs#1551
steveisok merged 5 commits intomainfrom
copilot/fix-tvos-device-communication-issue

Copilot AI commented Mar 6, 2026 •

edited

Loading

Uh oh!

kotlarmilos left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matouskozak commented Mar 9, 2026

Uh oh!

steveisok commented Mar 9, 2026

Uh oh!

steveisok commented Mar 9, 2026

Uh oh!

Copilot AI commented Mar 9, 2026

Uh oh!

matouskozak commented Mar 9, 2026

Uh oh!

steveisok commented Mar 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Copilot AI commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Expected Behavior

Reproduction

Evidence

Impact

Uh oh!

kotlarmilos left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matouskozak commented Mar 9, 2026

Uh oh!

steveisok commented Mar 9, 2026

Uh oh!

steveisok commented Mar 9, 2026

Uh oh!

Copilot AI commented Mar 9, 2026

Uh oh!

matouskozak commented Mar 9, 2026

Uh oh!

steveisok commented Mar 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Copilot AI commented Mar 6, 2026 •

edited

Loading