
Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs #1551

Merged
steveisok merged 5 commits into main from
copilot/fix-tvos-device-communication-issue
Mar 10, 2026

Contributor

Copilot AI commented Mar 6, 2026

  • Fix 1: Add process timeouts (2 min) to DeviceLogCapturer.StopCapture() so log collect/log show don't hang indefinitely (fixes apple run TIMED_OUT)
  • Fix 2: Add retry with backoff (3 attempts, 5s/10s/20s) to ResultFileHandler.CopyResultsAsync() for transient devicectl errors (fixes apple test APP_CRASH path)
  • Fix 3: Update TestReporter.ParseResult() to not override Success=true with crashed=true when test log file is missing due to device communication failure (fixes apple test APP_CRASH)
  • Update AppTester.RunDeviceTests() and RunSimulatorTests() to skip misleading crash report logic when test run already completed
  • Update/add unit tests for new behavior
  • All tests pass (285 iOS.Shared + 77 Apple tests)
  • No CodeQL security alerts
  • PR review feedback addressed:
    • DeviceLogCapturer: Add WaitForExit() flush after timed wait for async buffer drain
    • DeviceLogCapturer: Fix _outputPath cleanup on all early-return paths (extract CleanupOutputPath())
    • ResultFileHandler: Make DefaultRetryDelaysMs private; clone array to prevent mutation
    • Integration tests: Update Helix queue osx.13.amd64.open → osx.15.amd64.open
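
As an illustration of the idea behind Fix 1 (bounding how long a log process may run), here is a minimal sketch in Python rather than XHarness's C#. The helper name `run_with_timeout` and the two stand-in commands are hypothetical; the real change lives in DeviceLogCapturer.StopCapture() and uses a 2-minute limit.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, stdout), or (None, "") on timeout.

    subprocess.run kills the child and waits for it when the timeout
    expires, so a stalled process cannot block the caller indefinitely.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout


# Hypothetical stand-ins for a responsive and a stalled log process.
fast = [sys.executable, "-c", "print('log line')"]
stalled = [sys.executable, "-c", "import time; time.sleep(30)"]

fast_result = run_with_timeout(fast, timeout_s=10)
stalled_result = run_with_timeout(stalled, timeout_s=1)  # the PR uses 120 s
```

The caller can treat a `None` exit code as "log unavailable" and carry on with whatever result it already has, instead of waiting for the outer workload timeout.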
Original prompt

This section details the original issue you should resolve

<issue_title>tvOS: Device communication failures cause false TIMED_OUT and APP_CRASH results after successful test runs</issue_title>
<issue_description>## Summary

When running tests on tvOS devices via apple run --signal-app-end --expected-exit-code 42, XHarness correctly observes mlaunch exiting with the expected exit code (42), but then blocks indefinitely on a background device log reading task instead of using the mlaunch exit code. If the device log stream produces no data (empty log file), XHarness never completes and the Helix workload timeout kills it after 720 seconds.

## Root Cause

Comparing passing and failing runs on tvOS 26.2 / macOS 15.7 / Xcode 26.0.1:

Passing (DNCENGTVOS-106):

  1. mlaunch exits with 42 ✅
  2. XHarness reads device logs → gets Wall Clock adjustment detected error but file is non-empty
  3. Failed to determine the exit code from .../device-DNCENGTVOS-106-*.log
  4. Falls back to mlaunch exit code → "Application has finished with exit code: 42 (as expected)"

Failing (DNCENGTVOS-036):

  1. mlaunch exits with 42 ✅
  2. Device log file is empty (no data streamed from Apple TV)
  3. XHarness hangs waiting on device log reader — never reaches fallback
  4. After 720s: "Run timed out after 720 seconds" → exit 143 ❌

Both runs emit the warning: Exit code detection is not working on iOS/tvOS 15+ so the run will fail to match it with the expected value

The critical difference is that the device log stream works on some Apple TV devices but not others. When it doesn't work, XHarness has a blocking dependency on the log reader completing before it can process the mlaunch exit code.

## Expected Behavior

When mlaunch exits with the expected exit code and the device log stream is empty/non-responsive, XHarness should not block indefinitely. It should either:

  • Use a timeout on the device log reading task, or
  • Recognize the mlaunch exit code immediately without waiting for the log stream

## Reproduction

This occurs on the osx.15.amd64.appletv.open Helix queue with tvOS 26.2, macOS 15.7, Xcode 26.0.1, and XHarness 11.0.0-prerelease.26064.3.

Affected devices: DNCENGTVOS-036, DNCENGTVOS-022 (device log streaming broken)
Working devices: DNCENGTVOS-106 (device log streaming works)

All are AppleTV5,3 (4th generation), same tvOS version (26.2, build 23K54).

## Evidence

Key log lines in failing run:

  • Line ~525: warn: Exit code detection is not working on iOS/tvOS 15+
  • Line ~1970: dbug: Process mlaunch exited with 42
  • Line ~1975 (12 min later): fail: Run timed out after 720 seconds

Key log lines in passing run:

  • Line ~1637: dbug: Process mlaunch exited with 42
  • Line ~1640: dbug: Failed to determine the exit code from .../device-DNCENGTVOS-106-*.log
  • Line ~1642: info: Application has finished with exit code: 42 (as expected)

## Impact

This causes ~10% failure rate in the tvos-arm64 Release AllSubsets_CoreCLR_Smoke leg in the dotnet/runtime CI pipeline (definition 129), specifically the iOS.CoreCLR.R2R.Test work item. Failures depend on which Apple TV device the job lands on.</issue_description>

<agent_instructions>Fix #1548 — tvOS device communication failures cause false test failures after successful runs.

There are two XHarness code paths affected:

  1. apple test path (AppTester / TestReporter): After detecting "test end tag" and logging "Test run completed", XHarness tries to copy testResults.xml from the device via devicectl. When devicectl fails (Mercury error 1000, RSD error 0xE8000003), it falls through to crash detection and returns APP_CRASH (exit 80) even though it already knows tests completed.

  2. apple run path (AppRunner): After mlaunch exits with the expected exit code, XHarness blocks on a device log reader task. If the device log stream is empty, it hangs until the Helix timeout kills it (exit 143).

The fix should be layered:

  • First: Add retry with backoff (3 attempts) to the devicectl file copy in the apple test path. The connection errors are likely transient.
  • Second: Track completion state — once "Test run completed" is detected or mlaunch exits with expected code, don't let subsequent device communication failures escalate to APP_CRASH or TIMED_OUT. Use the known-good result.
  • Third (safety net): If file copy fails after retries, synthesize testResults.xml from the [PASS]/[FAIL] lines al...


…P_CRASH results

Three layered fixes:

1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes
   in StopCapture() to prevent indefinite blocking when Apple TV device log streaming
   is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code).

2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays)
   to devicectl file copy in CopyResultsAsync() to handle transient device communication
   errors (com.apple.Mercury.error 1000, RSD error 0xE8000003).

3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult()
   (test run completed via app end signal), but the results file is missing (devicectl
   failed), return Succeeded instead of Crashed. Device communication failure after
   successful test completion is an infrastructure issue, not an app crash.

4. AppTester: Skip crash report retrieval when test run already completed and file copy
   fails, replacing misleading 'app may have crashed' message with accurate warning.

Adds unit tests for retry behavior and the new TestReporter.ParseResult() path.

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
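
The decision described in item 3 of the commit message can be sketched as a small function. This is an illustrative Python model, not the actual TestReporter.ParseResult() code: the `Outcome` enum, the `resolve_outcome` name, and its parameters are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CRASHED = "Crashed"


def resolve_outcome(run_completed: bool, results_passed: Optional[bool]) -> Outcome:
    """Pick the final outcome of a test run.

    run_completed:  the app-end signal was seen, so the run is known to
                    have finished.
    results_passed: verdict parsed from the results file, or None when
                    the file could not be copied from the device.
    """
    if results_passed is not None:
        # The results file is available, so it is the source of truth.
        return Outcome.SUCCEEDED if results_passed else Outcome.FAILED
    if run_completed:
        # Missing file after a completed run: infrastructure issue, not a crash.
        return Outcome.SUCCEEDED
    # No completion signal and no results: a crash is the best remaining guess.
    return Outcome.CRASHED


completed_no_file = resolve_outcome(run_completed=True, results_passed=None)
incomplete_no_file = resolve_outcome(run_completed=False, results_passed=None)
completed_with_failures = resolve_outcome(run_completed=True, results_passed=False)
```

The point of the ordering is that a missing file only downgrades the result when nothing else confirms the run finished.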
Copilot AI and others added 2 commits March 6, 2026 13:56
Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
…ignore

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix tvOS device communication failures causing false results" to "Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH after successful runs" Mar 6, 2026
Member

@kotlarmilos kotlarmilos left a comment


LGTM!

@kotlarmilos kotlarmilos marked this pull request as ready for review March 6, 2026 15:01
@matouskozak matouskozak requested a review from Copilot March 6, 2026 16:32

Copilot AI left a comment


Pull request overview

Fixes tvOS/iOS 18+ “false failures” in XHarness’ Apple test/run flows by making device-result collection resilient to flaky log streaming and devicectl file-copy issues, so successful runs aren’t misreported as TIMED_OUT/APP_CRASH.

Changes:

  • Add timeouts to DeviceLogCapturer.StopCapture() log collection/reading to prevent indefinite blocking on broken tvOS log streaming.
  • Add retry-with-backoff to results copying via devicectl, with injectable delays for unit tests.
  • Preserve “successful completion” state in TestReporter.ParseResult() when the results file is missing after completion, and avoid crash-report retrieval noise in AppTester when success is already confirmed.
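
A retry loop with injectable delays, as mentioned in the second bullet, might look like the sketch below. It is written in Python for brevity and is not the ResultFileHandler code: `copy_with_retry`, `copy_once`, and the `sleep` parameter are hypothetical names, and how the real code pairs its three delays (5s/10s/20s) with three attempts is not stated in this thread. Here a delay is applied only between attempts.

```python
import time

# Delays quoted in the PR description, in seconds.
DEFAULT_RETRY_DELAYS_S = (5, 10, 20)


def copy_with_retry(copy_once, delays_s=DEFAULT_RETRY_DELAYS_S,
                    max_attempts=3, sleep=time.sleep):
    """Call copy_once() until it returns True or the attempts run out.

    delays_s and sleep are parameters so that tests can inject zero
    delays or a recording stub instead of really sleeping.
    """
    for attempt in range(1, max_attempts + 1):
        if copy_once():
            return True
        if attempt < max_attempts:
            # Back off before the next attempt; reuse the last delay if
            # fewer delays than gaps were supplied.
            sleep(delays_s[min(attempt - 1, len(delays_s) - 1)])
    return False


# Usage: a copy that fails twice with a transient error, then succeeds.
outcomes = iter([False, False, True])
slept = []
ok = copy_with_retry(lambda: next(outcomes), sleep=slept.append)
```

Passing `slept.append` as the sleep function records the requested delays without waiting, which is the kind of test seam the bullet describes.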

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/Microsoft.DotNet.XHarness.iOS.Shared/Logging/DeviceLogCapturer.cs | Adds 2-minute timeouts to log collect / log show to prevent hangs on unresponsive devices. |
| src/Microsoft.DotNet.XHarness.iOS.Shared/ResultFileHandler.cs | Retries results-file copying (esp. devicectl) with backoff to handle transient device communication failures. |
| src/Microsoft.DotNet.XHarness.iOS.Shared/TestReporter.cs | Prevents “missing results file” from overriding an already-confirmed successful completion. |
| src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs | Skips crash-report retrieval/logging when test completion was already confirmed but result copy failed. |
| tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/ResultFileHandlerTests.cs | Updates/extends unit tests to cover retry behavior and new log messages. |
| tests/Microsoft.DotNet.XHarness.iOS.Shared.Tests/TestReporterTests.cs | Adds coverage ensuring “completed but results unavailable” stays Succeeded. |
| .gitignore | Ignores .nuget/ directory. |


@matouskozak
Member

@steveisok is this going to fix this issue as well? #1552

@steveisok
Member

> @steveisok is this going to fix this issue as well? #1552

I don't know if it'll fix it totally, but certainly makes things better. When it hits this it is misleading.

@steveisok
Member

@copilot if a helix queue is using osx.13., please change it to osx.15.

- DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure
  async stdout/stderr buffers are fully drained before reading them (both log collect
  and log show processes).
- DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and
  non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly.
- ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor);
  clone the array on assignment to prevent external mutation.
- Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open.

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
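
The third bullet (keep the default delays private, copy whatever the caller injects) is an instance of defensive copying. The Python sketch below shows the pattern under assumed names: `ResultCopier` and `retry_delays_ms` are hypothetical and only mirror the shape of the ResultFileHandler change described above.

```python
class ResultCopier:
    # Private default; tests override it through the constructor instead.
    _DEFAULT_RETRY_DELAYS_MS = (5000, 10000, 20000)

    def __init__(self, retry_delays_ms=None):
        source = (self._DEFAULT_RETRY_DELAYS_MS
                  if retry_delays_ms is None else retry_delays_ms)
        # Copy on assignment so later changes to the caller's list
        # cannot alter this instance's retry schedule.
        self._retry_delays_ms = list(source)

    @property
    def retry_delays_ms(self):
        # Hand out an immutable snapshot rather than the internal list.
        return tuple(self._retry_delays_ms)


injected = [0, 0, 0]
copier = ResultCopier(injected)
injected[0] = 999  # mutating the caller's list after construction
```

After the mutation, `copier.retry_delays_ms` still reflects the values passed at construction time, which is the property the review feedback asked for.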
Contributor Author

Copilot AI commented Mar 9, 2026

> @copilot if a helix queue is using osx.13., please change it to osx.15.

Updated in f36173a — changed osx.13.amd64.open to osx.15.amd64.open in tests/integration-tests/Apple/Simulator.Commands.Tests.proj.

@matouskozak
Member

The simulator.commands failure is known cc: #1545.

@steveisok steveisok enabled auto-merge (squash) March 10, 2026 02:24
@steveisok
Member

/ba-g Known issues

@steveisok steveisok merged commit a6f9fe6 into main Mar 10, 2026
15 of 17 checks passed
@steveisok steveisok deleted the copilot/fix-tvos-device-communication-issue branch March 10, 2026 02:24
matouskozak pushed a commit to matouskozak/xharness that referenced this pull request Mar 11, 2026
…P_CRASH after successful runs (dotnet#1551)

* Initial plan

* Fix tvOS device communication failures causing false TIMED_OUT and APP_CRASH results

Three layered fixes:

1. DeviceLogCapturer: Add 2-minute timeouts to 'log collect' and 'log show' processes
   in StopCapture() to prevent indefinite blocking when Apple TV device log streaming
   is broken (fixes apple run TIMED_OUT after mlaunch exits with expected code).

2. ResultFileHandler: Add retry with backoff (up to 3 attempts, 5s/10s/20s delays)
   to devicectl file copy in CopyResultsAsync() to handle transient device communication
   errors (com.apple.Mercury.error 1000, RSD error 0xE8000003).

3. TestReporter.ParseResult(): When Success=true was already set by CollectDeviceResult()
   (test run completed via app end signal), but the results file is missing (devicectl
   failed), return Succeeded instead of Crashed. Device communication failure after
   successful test completion is an infrastructure issue, not an app crash.

4. AppTester: Skip crash report retrieval when test run already completed and file copy
   fails, replacing misleading 'app may have crashed' message with accurate warning.

Adds unit tests for retry behavior and the new TestReporter.ParseResult() path.

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

* Improve test method naming per code review feedback

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

* Remove accidentally committed .nuget/nuget.exe binary and add to .gitignore

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

* Address PR review feedback

- DeviceLogCapturer: Add WaitForExit() flush after timed WaitForExit() to ensure
  async stdout/stderr buffers are fully drained before reading them (both log collect
  and log show processes).
- DeviceLogCapturer: Fix _outputPath cleanup on early-return paths (timeout and
  non-zero exit) by extracting CleanupOutputPath() helper and calling it uniformly.
- ResultFileHandler: Make DefaultRetryDelaysMs private (tests inject via constructor);
  clone the array on assignment to prevent external mutation.
- Integration tests: Update osx.13.amd64.open Helix queue to osx.15.amd64.open.

Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: steveisok <471438+steveisok@users.noreply.github.com>
kotlarmilos pushed a commit to kotlarmilos/xharness that referenced this pull request Mar 16, 2026
…P_CRASH after successful runs (dotnet#1551)

kotlarmilos pushed a commit to kotlarmilos/xharness that referenced this pull request Mar 16, 2026
…P_CRASH after successful runs (dotnet#1551)

Development

Successfully merging this pull request may close these issues.

tvOS: Device communication failures cause false TIMED_OUT and APP_CRASH results after successful test runs

5 participants