SIGSEGV in GC mark phase (gc_heap::mark_object_simple1) on .NET 10.0.3 — null object reference in heap slot during Server GC #125169

@frakon

Description

A large multi-threaded .NET application crashes with SIGSEGV inside the CLR Garbage Collector's mark/plan phase on .NET 10.0.3. The crash is 100% reproducible — two independent crashes hit the exact same instruction at offset 0x57d680 in libcoreclr.so (symbol14311 + 0xAF0). The same application binary, built targeting net9.0, runs stably on the .NET 9 runtime (13,700+ CPU-minutes of uptime, zero crashes) but consistently crashes within ~1 minute on .NET 10.0.3 via roll-forward.

Runtime: 10.0.326.7603, commit c2435c3e0f46de784341ac3ed62863ce77e117b4

Crash mechanism

A GC worker thread traverses the managed heap during the mark/plan phase and encounters a null/corrupt object reference in a heap slot:

57d679:  48 8b 03             mov    (%rbx),%rax          ; Load object ref from GC heap slot
57d67c:  48 83 e0 f8          and    $0xfffffffffffffff8,%rax  ; Clear GC mark/pin bits (low 3 bits)
57d680:  8b 48 04             mov    0x4(%rax),%ecx       ; CRASH: rax=0 → reads address 0x4 → SIGSEGV
57d683:  83 38 00             cmpl   $0x0,(%rax)          ; Would check if object is free
57d686:  78 73                js     57d6fb               ; Jump if free object (sign bit)

Register dump at crash point (crash 1, captured directly via LLDB):

rax = 0x0000000000000000   ← NULL after AND ~7 (the heap slot contained 0x0 or 0x1–0x7)
rbx = 0x00007882b594aa70   ← pointer to the GC heap slot
rip = 0x000078e09d37d680   libcoreclr.so`___lldb_unnamed_symbol14311 + 2800

The AND $~7 pattern clears the lowest 3 bits, which the GC uses as mark/pin flags. A zero result means the slot value was in the range 0x0–0x7, i.e. an invalid or null object reference where a valid managed object pointer was expected. The function is likely gc_heap::mark_object_simple1, based on the bit-masking pattern and the call chain through the GC mark/plan/worker functions, but this is unconfirmed because the binary is stripped.
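
To make the masking step concrete, here is a minimal Python sketch that models the three instructions from 57d679 to 57d680. It is not runtime code: `LOW_BITS_MASK`, `masked_ref`, and `faulting_address` are hypothetical names introduced for illustration, and the 64-bit wraparound is an assumption matching x86_64.

```python
# Model of the faulting sequence: load slot, clear low 3 bits, read at +4.
# All names here are illustrative, not taken from the runtime sources.
LOW_BITS_MASK = 0xFFFFFFFFFFFFFFF8  # immediate of `and $0xfffffffffffffff8,%rax`


def masked_ref(slot_value: int) -> int:
    """Return the object address left after clearing the low 3 bits."""
    return slot_value & LOW_BITS_MASK


def faulting_address(slot_value: int) -> int:
    """Address read by `mov 0x4(%rax),%ecx` for a given slot value."""
    return (masked_ref(slot_value) + 0x4) & 0xFFFFFFFFFFFFFFFF


# Every small slot value that leaves rax == 0 after the mask.
zeroing_values = [v for v in range(16) if masked_ref(v) == 0]
print([hex(v) for v in zeroing_values])  # 0x0 through 0x7
print(hex(faulting_address(0x0)))        # 0x4
```

Any of the eight values 0x0 through 0x7 in the slot produces the same read at address 0x4, so the register dump alone cannot tell a plain null from a slot holding only flag bits.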


Reproduction Steps

We cannot share the proprietary application, but here is a detailed characterization to help reproduce:

  1. Application type: Long-running, real-time data processing console app with heavy network I/O (hundreds of WebSocket connections, HTTP clients). ASP.NET Core framework is referenced but no request pipeline is used.

  2. Build: Compiled with .NET 9 SDK, targeting net9.0 TFM. Runs on .NET 10 runtime via RollForward=LatestMajor.

  3. GC configuration (from runtimeconfig.json / Directory.Build.props):

    {
      "System.GC.Server": true,
      "System.GC.DynamicAdaptationMode": 0,
      "System.Runtime.TieredPGO": true
    }

    Server GC enabled, DATAS disabled, Dynamic PGO enabled.

  4. Workload characteristics at crash time:

    • Thread count: ~94 threads (crash 1), 1019 threads / 989 managed (crash 2)
    • GC pressure: Extremely heavy — 58× Gen0, 33× Gen1, 17× Gen2 collections in ~54 seconds
    • CPU: 81.1% utilization
    • Memory: ~23 GB RSS on a 188 GB machine
    • Application-level GC monitoring detected frequent undesired GC pauses
    • Explicit LOH compaction triggered periodically by the application

  5. Concurrency profile:

    • ~400–600 persistent WebSocket connections (concurrent reads/writes)
    • Multiple HTTP client pools with warm-up across 10 source IP addresses
    • 11 custom busy-wait spin-loop threads
    • Concurrent queue operations (lock-free TryDequeue on custom concurrent queues)
    • JIT compilation still active (thread deep in libclrjit.so at crash time)

  6. Crash timing: Both crashes occurred within ~1 minute of application startup, during the first processing round, while the app was warming up connections and JIT-compiling.

  7. How to trigger: Start the application on .NET 10.0.3 runtime. The crash occurs deterministically within the first minute. No special input or external interaction required beyond normal startup.
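
For anyone trying to mirror the setup in steps 2 and 3, the sketch below shows one way to check what a runtimeconfig.json asks the runtime to do. The embedded JSON is an assumed example of how such a file might look (the real file is not shared here), and `summarize_runtime_config` is a hypothetical helper. `runtimeOptions`, `configProperties`, and `rollForward` are standard runtimeconfig.json keys.

```python
import json

# Assumed example content; the actual runtimeconfig.json is not part of this report.
SAMPLE_RUNTIMECONFIG = """
{
  "runtimeOptions": {
    "tfm": "net9.0",
    "rollForward": "LatestMajor",
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.DynamicAdaptationMode": 0,
      "System.Runtime.TieredPGO": true
    }
  }
}
"""


def summarize_runtime_config(text: str) -> dict:
    """Extract the settings discussed in this report from runtimeconfig.json text."""
    options = json.loads(text).get("runtimeOptions", {})
    props = options.get("configProperties", {})
    return {
        "tfm": options.get("tfm"),
        "roll_forward": options.get("rollForward"),
        "server_gc": props.get("System.GC.Server", False),
        # DynamicAdaptationMode 0 turns DATAS off; a missing key leaves the runtime default.
        "datas_disabled": props.get("System.GC.DynamicAdaptationMode") == 0,
        "tiered_pgo": props.get("System.Runtime.TieredPGO"),
    }


summary = summarize_runtime_config(SAMPLE_RUNTIMECONFIG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Running the same helper against the generated `*.runtimeconfig.json` next to the application binary would confirm that the Server GC, DATAS, and roll-forward settings actually reached the deployed output.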

Expected behavior

The application runs without crashing, as it does on .NET 9.0.1 (same binary, same workload, 13,700+ CPU-minutes stable).

Actual behavior

The application crashes with SIGSEGV (signal 11) within ~1 minute of startup. The crash occurs on a GC worker thread inside native CLR GC code — no managed code is involved on the crashing thread.

createdump output (crash 2):

[createdump] Gathering state for process 3591845 dotnet
[createdump] Crashing thread 36d0d0 signal 11 (000b)
[createdump] Writing crash report to file /tmp/dotnet_crash_3591845.dmp.crashreport.json
[createdump] Writing full dump to file /tmp/dotnet_crash_3591845.dmp
[createdump] Written 25329680384 bytes (6184004 pages) to core file
[createdump] Dump successfully written in 21927ms
Segmentation fault (core dumped)

Full native backtrace (crash 2, LLDB):

frame #0:  libc.so.6`wait4 + 95                     ← post-crash: waiting for createdump
frame #1:  libcoreclr.so symbol16492 + 849           ← crash handler
frame #2:  libcoreclr.so symbol16495 + 2959          ← crash handler
frame #3:  libcoreclr.so symbol15913 + 265           ← crash handler
frame #4:  libcoreclr.so symbol15901 + 427           ← crash handler
frame #5:  libc.so.6 __restore_rt + 1                ← signal trampoline
frame #6:  libcoreclr.so symbol14311 + 2800          ← ACTUAL CRASH POINT (offset 0x57d680)
frame #7:  libcoreclr.so symbol14314 + 409           ← GC caller
frame #8:  libcoreclr.so symbol13471 + 109           ← GC mark/plan phase
frame #9:  libcoreclr.so symbol13488 + 702
frame #10: libcoreclr.so symbol13496 + 866
frame #11: libcoreclr.so symbol13426 + 375
frame #12: libcoreclr.so symbol13523 + 338
frame #13: libcoreclr.so symbol14219 + 1330          ← GC worker function
frame #14: libcoreclr.so symbol14216 + 262           ← GC thread entry
frame #15: libcoreclr.so symbol14423 + 304           ← thread start
frame #16: libcoreclr.so symbol11171 + 116
frame #17: libcoreclr.so symbol16527 + 505
frame #18: libc.so.6 pthread_start + 755
frame #19: libc.so.6 clone + 11

Evidence of reproducibility

Crash 1 (from apport core dump, LLDB): crashing thread at libcoreclr.so symbol14311 + 2800, rax = 0x0.

Crash 2 (reproduced with DOTNET_EnableCrashReport=1): crashing thread at libcoreclr.so symbol14311 + 2800, the identical offset.

In crash 1, another GC worker thread (#5) was stopped at symbol14311 + 2796 — just 4 bytes before the crash point — executing the preceding AND $~7 instruction. Multiple GC worker threads (#5, #41, #50, #64, #71, #85, #91, #93) were all active in the same GC code path.
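
The offsets quoted for the two crashes can be cross-checked with a little arithmetic. The sketch below uses only values from this report (rip, the file offset 0x57d680, and the symbol-relative offsets 2800 and 2796). The load base and function start are computed here rather than read from the dumps, the variable names are made up for illustration, and the subtraction assumes the file offset equals the offset from the mapping base.

```python
# Cross-check the addresses reported for crash 1.
# Inputs come from the report; everything derived is computed here.
RIP = 0x000078E09D37D680           # rip at the crash (crash 1)
CRASH_FILE_OFFSET = 0x57D680       # faulting instruction offset in libcoreclr.so
CRASH_SYMBOL_OFFSET = 2800         # symbol14311 + 2800
OTHER_THREAD_SYMBOL_OFFSET = 2796  # GC worker thread #5 in crash 1

# Where libcoreclr.so would be mapped, under the assumption stated above.
load_base = RIP - CRASH_FILE_OFFSET

# Start of the unnamed function, and the offset where thread #5 was stopped.
function_start = CRASH_FILE_OFFSET - CRASH_SYMBOL_OFFSET
other_thread_offset = function_start + OTHER_THREAD_SYMBOL_OFFSET

print(hex(CRASH_SYMBOL_OFFSET))  # 0xaf0
print(hex(load_base))            # 0x78e09ce00000
print(hex(function_start))       # 0x57cb90
print(hex(other_thread_offset))  # 0x57d67c
```

The computed base is page-aligned, and symbol14311 + 2796 lands on 57d67c, the `and` instruction in the disassembly, which is consistent with thread #5 being one instruction behind the crashing thread.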


Regression?

Yes. The same application binary (built targeting net9.0) runs stably on .NET 9.0.1 but crashes consistently on .NET 10.0.3.

  • .NET 9.0.1: 13,700+ CPU-minutes of stable operation, zero crashes, same workload.
  • .NET 10.0.3: Crashes within ~1 minute, 100% reproducible (2/2 attempts crashed at identical instruction).

Known Workarounds

  • Confirmed: Running on .NET 9.0.1 instead of 10.0.3 — the application is stable.
  • Untested but possible: DOTNET_GCServer=0 (switch to Workstation GC) or DOTNET_GCHeapCount=4 (reduce parallel GC worker threads) may avoid the crash if it is caused by a race between GC worker threads.
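
The untested workarounds could be tried one at a time with a small launcher such as the sketch below. It only builds the environment. `MyApp.dll` is a placeholder name, `build_env` and `WORKAROUNDS` are hypothetical helpers, and the variable names are copied as written above. Neither setting has been verified to avoid the crash.

```python
import os

# Candidate settings from the list above; neither has been verified to help.
WORKAROUNDS = {
    "workstation_gc": {"DOTNET_GCServer": "0"},
    "fewer_gc_heaps": {"DOTNET_GCHeapCount": "4"},
}


def build_env(workaround: str, base_env=None) -> dict:
    """Return a copy of the environment with exactly one workaround applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(WORKAROUNDS[workaround])
    return env


env = build_env("fewer_gc_heaps", base_env={"PATH": "/usr/bin"})
print(env["DOTNET_GCHeapCount"])

# To launch for real (MyApp.dll is a placeholder, not the actual binary):
#   import subprocess
#   subprocess.run(["dotnet", "MyApp.dll"], env=build_env("workstation_gc"))
```

Applying one setting per run keeps the results attributable to a single change. Note that the GC configuration documentation spells the first variable `DOTNET_gcServer` and reads `DOTNET_GCHeapCount` as a hexadecimal value (4 is the same either way).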

Configuration

  • Runtime: .NET 10.0.3, version 10.0.326.7603, commit c2435c3e0f46de784341ac3ed62863ce77e117b4
  • SDK: 10.0.103
  • OS: Ubuntu 22.04.1 LTS, kernel 6.8.12, x86_64
  • Hardware: Occurred on multiple machines, e.g. an AWS EC2 m5zn.metal instance with 188 GB RAM, no swap, Intel Xeon Platinum (Skylake/Cascade Lake)
  • GC mode: Server GC, DATAS disabled (System.GC.DynamicAdaptationMode = 0)
  • Installed runtimes: 7.0.19, 9.0.1, 10.0.3 (side-by-side)
  • libcoreclr.so: /usr/lib/dotnet/shared/Microsoft.NETCore.App/10.0.3/libcoreclr.so, stripped (no debug symbols), ELF 64-bit, build-id b2c2db54fcf5a2c174f255e4769dbcda03740d4c

Other information

Crash report JSON structure (DOTNET_EnableCrashReport=1)

The JSON crash report from crash 2 confirms:

  • "version": "10.0.326.7603 @Commit: c2435c3e0f46de784341ac3ed62863ce77e117b4"
  • "ExceptionType": "0x20000000"
  • 1019 threads total, crashing thread index 550, "crashed": "true"
  • 946 of 989 managed threads had only "unknown" method names — consistent with all managed threads being suspended during GC

Main thread managed stack at crash time

Thread 0 was in normal application startup — waiting on Task.InternalWait() inside a WebSocket stream connection setup. No unusual or error-related code path.

GC pause observation just before crash 2

The application's own GC monitoring detected a ~321ms stall of its busy-wait threads, with the last observed GC being a 13ms collection at 05:47:09.935–05:47:09.948. The crash occurred shortly after, consistent with a GC cycle that began normally but ended in the SIGSEGV.

dotnet-dump limitation

dotnet-dump version 9.0.661903 cannot analyze .NET 10 dumps (Failed to load data access module, 0x80004002). No .NET 10 version of dotnet-dump is available on NuGet, and setclrpath to the .NET 10 DAC directory also fails. Debug symbols for this runtime build are not available on the Microsoft symbol server (build-id b2c2db54fcf5a2c174f255e4769dbcda03740d4c).

Dumps available

Full core dumps (24–28 GB) and the 10 MB JSON crash report are preserved on the server and can be shared if needed for investigation.
