fix: Write reports on concurrent crashes#7340
Conversation
Concurrent crashes from different threads were incorrectly detected as recrash, causing ENOENT errors when the recrash logic tried to read a crash report that hadn't been written yet. Add thread-aware atomic guard to ensure only one thread handles a crash. Same-thread recrash is still detected correctly; concurrent crashes from other threads are discarded. Fixes #3296
Semver Impact of This PR🟢 Patch (bug fixes) 📋 Changelog PreviewThis is how your changes will appear in the changelog. New Features ✨
Bug Fixes 🐛
Internal Changes 🔧Deps
🤖 This preview updates automatically when you update the PR. |
|
Please look closely at the monitor changes ( |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #7340 +/- ##
=============================================
- Coverage 85.324% 85.324% -0.001%
=============================================
Files 480 480
Lines 28620 28632 +12
Branches 12371 12398 +27
=============================================
+ Hits 24420 24430 +10
- Misses 4150 4154 +4
+ Partials 50 48 -2
... and 5 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
|
Reproduced: @IBAction func asyncCrash(_ sender: Any) {
for i in 1...6 {
DispatchQueue.global().async {
fatalError("Concurrent crash #\(i)")
}
}
} |
|
Sorry that it took so long. It just took me 2 minutes to give input 🤦 So first, I would double-check KSCrash. They did plenty of fixes over the past, and they might have a similar fix. And second, we need a review from a native dev. We could ping Mischan, but I think he might be sick this week. Thanks for looking into this @denrase 😃 |
|
@philipphofmann Ok, I aligned this with the KSCrash solution, where they suspend for 2 seconds when a concurrent thread also crashed. Also updated naming. This fixes our issue with incorrectly detecting a re-crash and also aligns us closer with KSCrash. Of course we still need a native team review here. 🙇 |
itaybre
left a comment
There was a problem hiding this comment.
LGTM, just some small comments
|
(sorry for the duplicated comments, github was having issues) |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Performance metrics 🚀
|
| Revision | Plain | With Sentry | Diff |
|---|---|---|---|
| 50e7b3e | 1221.54 ms | 1250.81 ms | 29.27 ms |
| ffe0649 | 1213.35 ms | 1248.64 ms | 35.29 ms |
| a1a9260 | 1192.15 ms | 1229.80 ms | 37.64 ms |
| 58a9225 | 1211.40 ms | 1238.88 ms | 27.48 ms |
| e8cc4e7 | 1216.94 ms | 1246.62 ms | 29.68 ms |
| 0e4b033 | 1218.59 ms | 1248.51 ms | 29.92 ms |
| 41b4993 | 1215.15 ms | 1248.14 ms | 32.99 ms |
| 88c11f3 | 1226.18 ms | 1262.62 ms | 36.44 ms |
| 5ca545a | 1219.06 ms | 1244.59 ms | 25.53 ms |
| e92ab66 | 1228.86 ms | 1258.43 ms | 29.57 ms |
App size
| Revision | Plain | With Sentry | Diff |
|---|---|---|---|
| 50e7b3e | 24.14 KiB | 1.04 MiB | 1.02 MiB |
| ffe0649 | 24.14 KiB | 1.06 MiB | 1.04 MiB |
| a1a9260 | 24.14 KiB | 1.08 MiB | 1.06 MiB |
| 58a9225 | 24.14 KiB | 1.06 MiB | 1.04 MiB |
| e8cc4e7 | 24.14 KiB | 1.11 MiB | 1.09 MiB |
| 0e4b033 | 24.14 KiB | 1.11 MiB | 1.09 MiB |
| 41b4993 | 24.14 KiB | 1.06 MiB | 1.04 MiB |
| 88c11f3 | 24.14 KiB | 1.11 MiB | 1.09 MiB |
| 5ca545a | 24.14 KiB | 1.06 MiB | 1.04 MiB |
| e92ab66 | 24.14 KiB | 1.10 MiB | 1.08 MiB |
|
@supervacuus I am once again asking for your review. 🙇 |
supervacuus
left a comment
There was a problem hiding this comment.
This looks good. The thing that surprises me the most, though, is that all monitors that invoke sentrycrashcm_notifyFatalException() have very different behavior:
- fatal signal handlers preempt the thread that crashed (and can preempt themselves), so they can reenter not only from other threads but from the same
- Mach exceptions are handled on a port listener thread and would only re-enter from another thread if the handler dishes out the retrieved exception to worker threads
- The
cxa_throwhook only reenters from other threads (edge-case is throwing fromcxa_throw, of course, which is certainly less defined than than raising a signal from a signal handler)
(I don't know enough about the NSException mechanism to categorize)
So, while guarding against reentrancy and async-signal-safety is certainly not a bad idea in all of the above, the exposure to certain issues is very different. This is primarily something that I would inline document. For instance, wasHandlingFatalException would never appear in Mach exception handling (unless that handler defers to worker threads), and an assigned thread wouldn't be the "crashing thread" but the crash handler thread. The code, as written in sentrycrashcm_notifyFatalException(), appears to be tuned to POSIX signal handling.
|
@supervacuus I have added additional comments derived from your feedback, do you think this is good enough? |
I think so. It is now clear that this is primarily a first preventative measure, using the worst-case monitor (the signal handler) as a baseline in the face of concurrent crashes, and is much better than how it was handled before. I know inline docs can get old quickly, but it certainly helps to orient whoever works on this next. |
📜 Description
sentrycrashcm_notifyFatalExceptionCapturedtosentrycrashcm_notifyFatalExceptionto move it closer to KSCrash naming.sentrycrashcm_notifyFatalExceptionto prevent deadlock (concurrency check must run before any threads are suspended)This eliminates the ENOENT race by ensuring only one thread ever enters crash handling at a time.
Aligned with KSCrash upstream (KSCrashMonitor.c):
💡 Motivation and Context
When multiple threads crash simultaneously, the second crash was incorrectly treated as a "recrash" (crash during crash handling) because g_handlingFatalException had no thread awareness.
Fixes #3296
💚 How did you test it?
📝 Checklist
You have to check all boxes before merging:
sendDefaultPIIis enabled.