Description
A large multi-threaded .NET application crashes with SIGSEGV inside the CLR Garbage Collector's mark/plan phase on .NET 10.0.3. The crash is 100% reproducible — two independent crashes hit the exact same instruction at offset 0x57d680 in libcoreclr.so (symbol14311 + 0xAF0). The same application binary, built targeting net9.0, runs stably on the .NET 9 runtime (13,700+ CPU-minutes of uptime, zero crashes) but consistently crashes within ~1 minute on .NET 10.0.3 via roll-forward.
Runtime: 10.0.326.7603, commit c2435c3e0f46de784341ac3ed62863ce77e117b4
Crash mechanism
A GC worker thread traverses the managed heap during the mark/plan phase and encounters a null/corrupt object reference in a heap slot:
57d679: 48 8b 03 mov (%rbx),%rax ; Load object ref from GC heap slot
57d67c: 48 83 e0 f8 and $0xfffffffffffffff8,%rax ; Clear GC mark/pin bits (low 3 bits)
57d680: 8b 48 04 mov 0x4(%rax),%ecx ; CRASH: rax=0 → reads address 0x4 → SIGSEGV
57d683: 83 38 00 cmpl $0x0,(%rax) ; Would check if object is free
57d686: 78 73 js 57d6fb ; Jump if free object (sign bit)
Register dump at crash point (crash 1, captured directly via LLDB):
rax = 0x0000000000000000 ← NULL after AND ~7 (the heap slot contained 0x0 or 0x1–0x7)
rbx = 0x00007882b594aa70 ← pointer to the GC heap slot
rip = 0x000078e09d37d680 libcoreclr.so`___lldb_unnamed_symbol14311 + 2800
The AND $~7 pattern clears the lowest 3 GC mark/pin bits. The result being zero means the slot value was 0x0–0x7 — an invalid/null object reference where a valid managed object pointer was expected. The function is likely gc_heap::mark_object_simple1 based on the bit-masking pattern and the call chain through GC mark/plan/worker functions.
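The masking step can be checked with a few lines of arithmetic. The following is only an illustrative Python sketch of the three instructions in the disassembly above, not runtime code; the helper name `faulting_address` is hypothetical.

```python
LOW_BITS_MASK = 0xFFFFFFFFFFFFFFF8  # immediate used by the AND at 0x57d67c


def faulting_address(slot_value: int) -> int:
    """Address read by `mov 0x4(%rax),%ecx` for a given heap slot value."""
    rax = slot_value & LOW_BITS_MASK  # and $0xfffffffffffffff8,%rax
    return rax + 0x4                  # mov 0x4(%rax),%ecx reads from here


# Slot values whose masked result is 0, so the load targets address 0x4.
candidates = [v for v in range(0x10) if faulting_address(v) == 0x4]
print([hex(v) for v in candidates])
```

Only the values 0x0 through 0x7 land on address 0x4, which is why rax = 0 at the crash point narrows the slot contents to that range.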
Reproduction Steps
We cannot share the proprietary application, but here is a detailed characterization to help reproduce:
- Application type: Long-running, real-time data processing console app with heavy network I/O (hundreds of WebSocket connections, HTTP clients). ASP.NET Core framework is referenced but no request pipeline is used.
- Build: Compiled with .NET 9 SDK, targeting net9.0 TFM. Runs on .NET 10 runtime via RollForward=LatestMajor.
- GC configuration (from runtimeconfig.json / Directory.Build.props):
{
"System.GC.Server": true,
"System.GC.DynamicAdaptationMode": 0,
"System.Runtime.TieredPGO": true
}
Server GC enabled, DATAS disabled, Dynamic PGO enabled.
- Workload characteristics at crash time:
  - Thread count: ~94 threads (crash 1), 1019 threads / 989 managed (crash 2)
  - GC pressure: Extremely heavy, with 58× Gen0, 33× Gen1, 17× Gen2 collections in ~54 seconds
  - CPU: 81.1% utilization
  - Memory: ~23 GB RSS on a 188 GB machine
  - Application-level GC monitoring detected frequent undesired GC pauses
  - Explicit LOH compaction triggered periodically by the application
- Concurrency profile:
  - ~400–600 persistent WebSocket connections (concurrent reads/writes)
  - Multiple HTTP client pools with warm-up across 10 source IP addresses
  - 11 custom busy-wait spin-loop threads
  - Concurrent queue operations (lock-free TryDequeue on custom concurrent queues)
  - JIT compilation still active (thread deep in libclrjit.so at crash time)
- Crash timing: Both crashes occurred within ~1 minute of application startup, during the first processing round, while the app was warming up connections and JIT-compiling.
- How to trigger: Start the application on .NET 10.0.3 runtime. The crash occurs deterministically within the first minute. No special input or external interaction required beyond normal startup.
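For anyone attempting a stand-alone repro, the runtime settings listed above can be collected into one runtimeconfig.json. The Python sketch below only assembles and prints such a file; the `runtimeOptions` layout follows the usual runtimeconfig.json shape, while the framework entry (name and version) is an assumption for a hypothetical minimal repro app and is not taken from the actual application.

```python
import json

# GC and PGO knobs exactly as listed in this report.
config_properties = {
    "System.GC.Server": True,
    "System.GC.DynamicAdaptationMode": 0,
    "System.Runtime.TieredPGO": True,
}

config = {
    "runtimeOptions": {
        "tfm": "net9.0",
        "rollForward": "LatestMajor",  # lets the net9.0 build run on .NET 10
        # Assumed framework reference for a hypothetical repro app.
        "framework": {"name": "Microsoft.NETCore.App", "version": "9.0.0"},
        "configProperties": config_properties,
    }
}

text = json.dumps(config, indent=2)
print(text)
```

The printed JSON would be saved next to the repro app's main assembly as its runtimeconfig.json; the real application may set some of these values through Directory.Build.props instead.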
Expected behavior
The application runs without crashing, as it does on .NET 9.0.1 (same binary, same workload, 13,700+ CPU-minutes stable).
Actual behavior
The application crashes with SIGSEGV (signal 11) within ~1 minute of startup. The crash occurs on a GC worker thread inside native CLR GC code — no managed code is involved on the crashing thread.
createdump output (crash 2):
[createdump] Gathering state for process 3591845 dotnet
[createdump] Crashing thread 36d0d0 signal 11 (000b)
[createdump] Writing crash report to file /tmp/dotnet_crash_3591845.dmp.crashreport.json
[createdump] Writing full dump to file /tmp/dotnet_crash_3591845.dmp
[createdump] Written 25329680384 bytes (6184004 pages) to core file
[createdump] Dump successfully written in 21927ms
Segmentation fault (core dumped)
Full native backtrace (crash 2, LLDB):
frame #0: libc.so.6`wait4 + 95 ← post-crash: waiting for createdump
frame #1: libcoreclr.so symbol16492 + 849 ← crash handler
frame #2: libcoreclr.so symbol16495 + 2959 ← crash handler
frame #3: libcoreclr.so symbol15913 + 265 ← crash handler
frame #4: libcoreclr.so symbol15901 + 427 ← crash handler
frame #5: libc.so.6 __restore_rt + 1 ← signal trampoline
frame #6: libcoreclr.so symbol14311 + 2800 ← ACTUAL CRASH POINT (offset 0x57d680)
frame #7: libcoreclr.so symbol14314 + 409 ← GC caller
frame #8: libcoreclr.so symbol13471 + 109 ← GC mark/plan phase
frame #9: libcoreclr.so symbol13488 + 702
frame #10: libcoreclr.so symbol13496 + 866
frame #11: libcoreclr.so symbol13426 + 375
frame #12: libcoreclr.so symbol13523 + 338
frame #13: libcoreclr.so symbol14219 + 1330 ← GC worker function
frame #14: libcoreclr.so symbol14216 + 262 ← GC thread entry
frame #15: libcoreclr.so symbol14423 + 304 ← thread start
frame #16: libcoreclr.so symbol11171 + 116
frame #17: libcoreclr.so symbol16527 + 505
frame #18: libc.so.6 pthread_start + 755
frame #19: libc.so.6 clone + 11
Evidence of reproducibility
Crash 1 (from apport core dump, LLDB): crashing thread at libcoreclr.so symbol14311 + 2800, rax = 0x0.
Crash 2 (reproduced with DOTNET_EnableCrashReport=1): crashing thread at libcoreclr.so symbol14311 + 2800 — identical offset.
In crash 1, another GC worker thread (#5) was stopped at symbol14311 + 2796 — just 4 bytes before the crash point — executing the preceding AND $~7 instruction. Multiple GC worker threads (#5, #41, #50, #64, #71, #85, #91, #93) were all active in the same GC code path.
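The offsets quoted in this report can be cross-checked against each other. The sketch below is plain arithmetic on numbers already given above; it assumes that 0x57d680 is an offset relative to the load base of libcoreclr.so, and the variable names are illustrative.

```python
CRASH_FILE_OFFSET = 0x57D680    # crashing instruction offset in libcoreclr.so
CRASH_SYMBOL_DELTA = 2800       # symbol14311 + 2800
OTHER_THREAD_DELTA = 2796       # GC worker thread #5 in crash 1
CRASH_RIP = 0x000078E09D37D680  # rip from the crash 1 register dump

symbol_start = CRASH_FILE_OFFSET - CRASH_SYMBOL_DELTA
other_thread_offset = symbol_start + OTHER_THREAD_DELTA
load_base = CRASH_RIP - CRASH_FILE_OFFSET

print(hex(CRASH_SYMBOL_DELTA))   # 0xaf0
print(hex(symbol_start))         # 0x57cb90
print(hex(other_thread_offset))  # 0x57d67c
print(hex(load_base))            # 0x78e09ce00000
```

Under that assumption, thread #5 sits at 0x57d67c, the AND instruction in the disassembly, and the computed load base is page-aligned, so the decimal and hexadecimal offsets in the report agree with one another.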
Regression?
Yes. The same application binary (built targeting net9.0) runs stably on .NET 9.0.1 but crashes consistently on .NET 10.0.3.
- .NET 9.0.1: 13,700+ CPU-minutes of stable operation, zero crashes, same workload.
- .NET 10.0.3: Crashes within ~1 minute, 100% reproducible (2/2 attempts crashed at identical instruction).
Known Workarounds
- Confirmed: Running on .NET 9.0.1 instead of 10.0.3 — the application is stable.
- Untested but possible: DOTNET_GCServer=0 (switch to Workstation GC) or DOTNET_GCHeapCount=4 (reduce parallel GC worker threads) may avoid the crash if it stems from a race between parallel GC worker threads.
Configuration
- Runtime: .NET 10.0.3, version 10.0.326.7603, commit c2435c3e0f46de784341ac3ed62863ce77e117b4
- SDK: 10.0.103
- OS: Ubuntu 22.04.1 LTS, kernel 6.8.12, x86_64
- Hardware: Reproduced on multiple machines, e.g. an AWS EC2 m5zn.metal instance (188 GB RAM, no swap, Intel Xeon Platinum, Skylake/Cascade Lake)
- GC mode: Server GC, DATAS disabled (System.GC.DynamicAdaptationMode = 0)
- Installed runtimes: 7.0.19, 9.0.1, 10.0.3 (side-by-side)
- libcoreclr.so: /usr/lib/dotnet/shared/Microsoft.NETCore.App/10.0.3/libcoreclr.so, stripped (no debug symbols), ELF 64-bit, build-id b2c2db54fcf5a2c174f255e4769dbcda03740d4c
Other information
Crash report JSON structure (DOTNET_EnableCrashReport=1)
The JSON crash report from crash 2 confirms:
- "version": "10.0.326.7603 @Commit: c2435c3e0f46de784341ac3ed62863ce77e117b4"
- "ExceptionType": "0x20000000"
- 1019 threads total, crashing thread index 550, "crashed": "true"
- 946 of 989 managed threads had only "unknown" method names, consistent with all managed threads being suspended during GC
Main thread managed stack at crash time
Thread 0 was in normal application startup — waiting on Task.InternalWait() inside a WebSocket stream connection setup. No unusual or error-related code path.
GC pause observation just before crash 2
The application's own GC monitoring detected a ~321ms stall of its busy-wait threads, with the last observed GC being a 13ms collection at 05:47:09.935–05:47:09.948. The crash occurred shortly after, consistent with a GC cycle that began normally but ended in the SIGSEGV.
dotnet-dump limitation
dotnet-dump version 9.0.661903 cannot analyze .NET 10 dumps (Failed to load data access module, 0x80004002). No .NET 10 version of dotnet-dump is available on NuGet, and setclrpath to the .NET 10 DAC directory also fails. Debug symbols for this runtime build are not available on the Microsoft symbol server (build-id b2c2db54fcf5a2c174f255e4769dbcda03740d4c).
Dumps available
Full core dumps (24–28 GB) and the 10 MB JSON crash report are preserved on the server and can be shared if needed for investigation.