Add crash/failure telemetry to MSBuild#13270
Conversation
e5aaf87 to
f205d92
Compare
Add CrashTelemetry data class and CrashTelemetryRecorder helper that capture rich exception information: exception type, inner exception type, stack trace hash (SHA-256 for bucketing without PII), top stack frame, HResult, exit type classification, criticality flag, MSBuild version, framework name, and host. CrashTelemetryRecorder centralizes recording and flushing logic used by all three crash telemetry emission points: 1. MSBuild.exe (XMake.Execute): - All catch blocks record crash telemetry via RecordCrashTelemetry - FlushCrashTelemetry in the finally block emits via TelemetryManager 2. API mode (BuildManager.EndBuild): - Catch block records crash telemetry for shutdown exceptions - _threadException (node crashes) is recorded before re-throwing - FlushCrashTelemetry emits via TelemetryManager 3. Unhandled exceptions (ExceptionHandling.UnhandledExceptionHandler): - RecordAndFlushCrashTelemetry immediately emits since process is dying All telemetry code is best-effort with catch-all guards to prevent secondary failures during crash handling. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f205d92 to
5c4a3f1
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive crash and failure telemetry to MSBuild to improve diagnostics and error tracking. The implementation introduces a new CrashTelemetry class that captures rich exception information including exception types, stack trace hashes (SHA-256 for PII-free bucketing), top stack frames, HResult codes, exit type classifications, criticality flags, MSBuild version, framework name, and host environment.
Changes:
- Adds
CrashTelemetryandCrashTelemetryRecorderclasses to capture and emit crash telemetry - Integrates crash telemetry recording into all exception catch blocks in
XMake.Execute()andBuildManager.EndBuild() - Adds unhandled exception handler support via
ExceptionHandling.UnhandledExceptionHandler
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Framework/Telemetry/CrashTelemetry.cs | Core telemetry data class with exception information, stack hashing, and property serialization |
| src/Framework/Telemetry/CrashTelemetryRecorder.cs | Centralized helper for recording and flushing crash telemetry with best-effort error handling |
| src/Framework/Telemetry/TelemetryConstants.cs | Adds "Crash" constant for crash activity naming |
| src/Framework/Telemetry/KnownTelemetry.cs | Adds static CrashTelemetry property for crash telemetry storage |
| src/MSBuild/XMake.cs | Integrates crash telemetry recording in all exception catch blocks and finally block flush |
| src/Build/BackEnd/BuildManager/BuildManager.cs | Adds crash telemetry recording for BuildManager exceptions and thread exceptions |
| src/Shared/ExceptionHandling.cs | Adds crash telemetry recording for truly unhandled exceptions |
…ts, fix exitType - Extract GetHostName() as shared method in XMake.cs and BuildManager.cs to eliminate duplicated host detection logic - Sanitize StackTop to redact file paths that may contain PII (usernames) while preserving method names and line numbers - Use consistent exitType 'EndBuildFailure' in BuildManager instead of exception.GetType().Name - Add CrashTelemetry_Tests with 8 tests covering PopulateFromException, GetProperties, GetActivityProperties, PII redaction, and null handling - Add comment clarifying why Initialize is needed in RecordAndFlushCrashTelemetry Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JanProvaznik
left a comment
There was a problem hiding this comment.
I think this is the kind of change that would be better with a spec that has a feedback from the team.
| ShowHelpPrompt(); | ||
|
|
||
| exitType = ExitType.SwitchError; | ||
| RecordCrashTelemetry(e, exitType); |
There was a problem hiding this comment.
is this really interesting to collect the instances when users have typos in switches?
There was a problem hiding this comment.
yes, because it helps us to understand if documentation should be updated/other aliases of switches added.
| Exception ex = (Exception)e.ExceptionObject; | ||
| DumpExceptionToFile(ex); | ||
| #if !CLR2COMPATIBILITY && !MICROSOFT_BUILD_ENGINE_OM_UNITTESTS | ||
| RecordCrashTelemetryForUnhandledException(ex); |
There was a problem hiding this comment.
will this really avoid sending telemetry in our tests?
There was a problem hiding this comment.
It should be, but if our tests crash with unhandled exceptions - it's not a good sign.
| /// <summary> | ||
| /// Timestamp when the crash occurred. | ||
| /// </summary> | ||
| public DateTime? CrashTimestamp { get; set; } |
There was a problem hiding this comment.
why? isn't this already included in the data field in the database on ingestion?
| /// <summary> | ||
| /// The exit type / category of the crash (e.g., "LoggerFailure", "Unexpected", "UnhandledException"). | ||
| /// </summary> | ||
| public string? ExitType { get; set; } |
There was a problem hiding this comment.
typing this string seems a bit too liberal?
There was a problem hiding this comment.
I don't know!
it's a version 1, before that we haven't had insights in the crashes - once it's available and analyzed by the team, the corresponding adjustments can be made.
it's yet another telemetry event , i don't see why it's can't be iterative and addressed based on feedback later. |
…detection (#13289) ## Summary Improves the crash telemetry infrastructure to produce higher-quality, more actionable data and enable Prism dashboard visibility ( continuation of #13270 based on the first results) ## Changes ### Convert ExitType and CrashOrigin from strings to enums ### Remove noise from crash telemetry - Removed `SwitchError` and `InitializationError` from crash telemetry recording — these accounted for ~99.8% of crash events but represent expected user errors (bad CLI args, missing toolsets), not actual crashes ### Remove redundant `CrashTimestamp` - The database ingestion timestamp already captures this; the client-side timestamp added no value ### Deduplicate `GetHostName()` - Consolidated 3 copies of host-detection logic (VS / VSCode / `MSBUILD_HOST_NAME`) into a single `BuildEnvironmentState.GetHostName()` method ### Harden `DebugUtils` static constructor - Wrapped in try/catch to prevent `TypeInitializationException` crashes when environment variable access fails --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add CrashTelemetry class that captures rich exception information including exception type, inner exception type, stack trace hash (SHA-256 for bucketing without PII), top stack frame, HResult, exit type classification, criticality flag, MSBuild version, framework name, and host environment. Crash telemetry is emitted in two places: - XMake.Execute() catch blocks for all handled exception types - ExceptionHandling.UnhandledExceptionHandler for truly unhandled exceptions The telemetry is recorded via KnownTelemetry.CrashTelemetry and flushed in the finally block of XMake.Execute() using TelemetryManager and the existing IActivity/ActivitySource infrastructure. All telemetry code is best-effort with catch-all guards to prevent secondary failures during crash handling. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…detection (dotnet#13289) ## Summary Improves the crash telemetry infrastructure to produce higher-quality, more actionable data and enable Prism dashboard visibility ( continuation of dotnet#13270 based on the first results) ## Changes ### Convert ExitType and CrashOrigin from strings to enums ### Remove noise from crash telemetry - Removed `SwitchError` and `InitializationError` from crash telemetry recording — these accounted for ~99.8% of crash events but represent expected user errors (bad CLI args, missing toolsets), not actual crashes ### Remove redundant `CrashTimestamp` - The database ingestion timestamp already captures this; the client-side timestamp added no value ### Deduplicate `GetHostName()` - Consolidated 3 copies of host-detection logic (VS / VSCode / `MSBUILD_HOST_NAME`) into a single `BuildEnvironmentState.GetHostName()` method ### Harden `DebugUtils` static constructor - Wrapped in try/catch to prevent `TypeInitializationException` crashes when environment variable access fails --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add CrashTelemetry class that captures rich exception information including exception type, inner exception type, stack trace hash (SHA-256 for bucketing without PII), top stack frame, HResult, exit type classification, criticality flag, MSBuild version, framework name, and host environment.
Crash telemetry is emitted in two places:
The telemetry is recorded via KnownTelemetry.CrashTelemetry and flushed in the finally block of XMake.Execute() using TelemetryManager and the existing IActivity/ActivitySource infrastructure. All telemetry code is best-effort with catch-all guards to prevent secondary failures during crash handling.