SIGSEGV in GC mark phase (gc_heap::mark_object_simple1) on .NET 10.0.3 — null object reference in heap slot during Server GC

### Description

A large multi-threaded .NET application crashes with SIGSEGV inside the CLR Garbage Collector's mark/plan phase on .NET 10.0.3. The crash has been **100% reproducible** so far (2 of 2 runs): two independent crashes hit the **exact same instruction** at offset `0x57d680` in `libcoreclr.so` (`symbol14311 + 0xAF0`). The same application binary, built targeting `net9.0`, runs stably on the .NET 9 runtime (13,700+ CPU-minutes of uptime, zero crashes) but consistently crashes within ~1 minute on .NET 10.0.3 via roll-forward.

**Runtime**: `10.0.326.7603`, commit `c2435c3e0f46de784341ac3ed62863ce77e117b4`

### Crash mechanism

A GC worker thread traverses the managed heap during the mark/plan phase and encounters a **null/corrupt object reference** in a heap slot:

```asm
57d679:  48 8b 03             mov    (%rbx),%rax          ; Load object ref from GC heap slot
57d67c:  48 83 e0 f8          and    $0xfffffffffffffff8,%rax  ; Clear GC mark/pin bits (low 3 bits)
57d680:  8b 48 04             mov    0x4(%rax),%ecx       ; CRASH: rax=0 → reads address 0x4 → SIGSEGV
57d683:  83 38 00             cmpl   $0x0,(%rax)          ; Would check if object is free
57d686:  78 73                js     57d6fb               ; Jump if free object (sign bit)
```

Register dump at crash point (crash 1, captured directly via LLDB):
```
rax = 0x0000000000000000   ← NULL after AND ~7 (the heap slot contained 0x0 or 0x1–0x7)
rbx = 0x00007882b594aa70   ← pointer to the GC heap slot
rip = 0x000078e09d37d680   libcoreclr.so`___lldb_unnamed_symbol14311 + 2800
```

The `AND $~7` pattern clears the lowest 3 GC mark/pin bits. The result being zero means the slot value was `0x0`–`0x7` — an invalid/null object reference where a valid managed object pointer was expected. The function is likely `gc_heap::mark_object_simple1` based on the bit-masking pattern and the call chain through GC mark/plan/worker functions.
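To make the masking argument concrete, here is a minimal sketch of the arithmetic performed by the `and` and the faulting `mov`. It only models the two instructions from the disassembly above; the function names are ours and do not correspond to anything in the runtime.

```python
# Models the two instructions before/at the crash point:
#   and $0xfffffffffffffff8, %rax   -> clear the low 3 bits
#   mov 0x4(%rax), %ecx             -> read 4 bytes at rax + 4
MASK = 0xFFFFFFFFFFFFFFF8  # 64-bit form of ~7


def masked_ref(slot_value: int) -> int:
    """Value left in rax after the AND, for a given heap slot value."""
    return slot_value & MASK


def fault_address(slot_value: int) -> int:
    """Address dereferenced by the faulting mov for a given slot value."""
    return masked_ref(slot_value) + 0x4


# Only slot values 0x0..0x7 can leave rax == 0 after the AND.
null_producing = [v for v in range(0x10) if masked_ref(v) == 0]
print([hex(v) for v in null_producing])

# With rax == 0 the load targets address 0x4, which is unmapped.
print(hex(fault_address(0x0)))
```

This is why the register dump alone cannot tell us whether the slot held a true null or a small tagged value: every value in `0x0`–`0x7` produces the same `rax` and the same fault address.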

---

### Reproduction Steps

We cannot share the proprietary application, but here is a detailed characterization to help reproduce:

1. **Application type**: Long-running, real-time data processing console app with heavy network I/O (hundreds of WebSocket connections, HTTP clients). ASP.NET Core framework is referenced but no request pipeline is used.

2. **Build**: Compiled with .NET 9 SDK, targeting `net9.0` TFM. Runs on .NET 10 runtime via `RollForward=LatestMajor`.

3. **GC configuration** (from `runtimeconfig.json` / `Directory.Build.props`):
   ```json
   {
     "System.GC.Server": true,
     "System.GC.DynamicAdaptationMode": 0,
     "System.Runtime.TieredPGO": true
   }
   ```
   Server GC enabled, DATAS disabled, Dynamic PGO enabled.

4. **Workload characteristics at crash time**:
   - **Thread count**: ~94 threads (crash 1), 1019 threads / 989 managed (crash 2)
   - **GC pressure**: Extremely heavy — 58× Gen0, 33× Gen1, 17× Gen2 collections in ~54 seconds
   - **CPU**: 81.1% utilization
   - **Memory**: ~23 GB RSS on a 188 GB machine
   - Application-level GC monitoring detected frequent undesired GC pauses
   - Explicit LOH compaction triggered periodically by the application

5. **Concurrency profile**:
   - ~400–600 persistent WebSocket connections (concurrent reads/writes)
   - Multiple HTTP client pools with warm-up across 10 source IP addresses
   - 11 custom busy-wait spin-loop threads
   - Concurrent queue operations (lock-free `TryDequeue` on custom concurrent queues)
   - JIT compilation still active (thread deep in `libclrjit.so` at crash time)

6. **Crash timing**: Both crashes occurred within ~1 minute of application startup, during the first processing round, while the app was warming up connections and JIT-compiling.

7. **How to trigger**: Start the application on .NET 10.0.3 runtime. The crash occurs deterministically within the first minute. No special input or external interaction required beyond normal startup.
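For anyone trying to mirror the GC configuration from step 3, the following sketch checks a `runtimeconfig.json` for the same three settings. It assumes the properties sit under `runtimeOptions.configProperties`, which is where the SDK normally emits them; the sample document and the `describe_gc_config` helper are illustrative only and not part of our application.

```python
import json

# Illustrative runtimeconfig.json containing the settings from step 3.
SAMPLE = """
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.DynamicAdaptationMode": 0,
      "System.Runtime.TieredPGO": true
    }
  }
}
"""


def describe_gc_config(text: str) -> dict:
    """Report whether the three settings from the report are present."""
    doc = json.loads(text)
    props = doc.get("runtimeOptions", {}).get("configProperties", {})
    return {
        "server_gc": props.get("System.GC.Server") is True,
        # Only True when the key is present and set to 0.
        "datas_explicitly_disabled": props.get("System.GC.DynamicAdaptationMode") == 0,
        "tiered_pgo": props.get("System.Runtime.TieredPGO") is True,
    }


print(describe_gc_config(SAMPLE))
```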

### Expected behavior

The application runs without crashing, as it does on .NET 9.0.1 (same binary, same workload, 13,700+ CPU-minutes stable).

### Actual behavior

The application crashes with **SIGSEGV (signal 11)** within ~1 minute of startup. The crash occurs on a **GC worker thread** inside native CLR GC code — no managed code is involved on the crashing thread.

### createdump output (crash 2):
```
[createdump] Gathering state for process 3591845 dotnet
[createdump] Crashing thread 36d0d0 signal 11 (000b)
[createdump] Writing crash report to file /tmp/dotnet_crash_3591845.dmp.crashreport.json
[createdump] Writing full dump to file /tmp/dotnet_crash_3591845.dmp
[createdump] Written 25329680384 bytes (6184004 pages) to core file
[createdump] Dump successfully written in 21927ms
Segmentation fault (core dumped)
```
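As a quick sanity check on the dump itself, the byte and page counts printed by createdump can be cross-checked against each other. The sketch below uses only the two numbers from the output above; the variable names are ours.

```python
# Numbers copied from the createdump output above.
BYTES_WRITTEN = 25329680384
PAGES_WRITTEN = 6184004

# Bytes per page implied by the two counters.
page_size, remainder = divmod(BYTES_WRITTEN, PAGES_WRITTEN)
print(page_size, remainder)

# Dump size in decimal GB and binary GiB for comparison with the reported RSS.
size_gb = BYTES_WRITTEN / 10**9
size_gib = BYTES_WRITTEN / 2**30
print(round(size_gb, 2), round(size_gib, 2))
```

The counters divide evenly into 4096-byte pages, and the resulting size is in line with the ~23 GB RSS listed under the workload characteristics, so the dump appears to be complete rather than truncated.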

### Full native backtrace (crash 2, LLDB):
```
frame #0:  libc.so.6`wait4 + 95                     ← post-crash: waiting for createdump
frame #1:  libcoreclr.so symbol16492 + 849           ← crash handler
frame #2:  libcoreclr.so symbol16495 + 2959          ← crash handler
frame #3:  libcoreclr.so symbol15913 + 265           ← crash handler
frame #4:  libcoreclr.so symbol15901 + 427           ← crash handler
frame #5:  libc.so.6 __restore_rt + 1                ← signal trampoline
frame #6:  libcoreclr.so symbol14311 + 2800          ← ACTUAL CRASH POINT (offset 0x57d680)
frame #7:  libcoreclr.so symbol14314 + 409           ← GC caller
frame #8:  libcoreclr.so symbol13471 + 109           ← GC mark/plan phase
frame #9:  libcoreclr.so symbol13488 + 702
frame #10: libcoreclr.so symbol13496 + 866
frame #11: libcoreclr.so symbol13426 + 375
frame #12: libcoreclr.so symbol13523 + 338
frame #13: libcoreclr.so symbol14219 + 1330          ← GC worker function
frame #14: libcoreclr.so symbol14216 + 262           ← GC thread entry
frame #15: libcoreclr.so symbol14423 + 304           ← thread start
frame #16: libcoreclr.so symbol11171 + 116
frame #17: libcoreclr.so symbol16527 + 505
frame #18: libc.so.6 pthread_start + 755
frame #19: libc.so.6 clone + 11
```

### Evidence of reproducibility

**Crash 1** (from apport core dump, LLDB): crashing thread at `libcoreclr.so symbol14311 + 2800`, `rax = 0x0`.

**Crash 2** (reproduced with `DOTNET_EnableCrashReport=1`): crashing thread at `libcoreclr.so symbol14311 + 2800` — **identical offset**.

In crash 1, another GC worker thread (#5) was stopped at `symbol14311 + 2796` — just **4 bytes before** the crash point — executing the preceding `AND $~7` instruction. Multiple GC worker threads (#5, #41, #50, #64, #71, #85, #91, #93) were all active in the same GC code path.
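The offset relationships quoted above can be verified with a few lines of arithmetic. This sketch assumes that `0x57d680` is an offset relative to the load address of `libcoreclr.so`, so that subtracting it from `rip` yields the module base; all names are ours.

```python
# Values taken from the disassembly, register dump, and backtraces above.
CRASH_FILE_OFFSET = 0x57D680      # offset of the faulting mov in libcoreclr.so
CRASH_SYMBOL_OFFSET = 2800        # symbol14311 + 2800 (crashing thread)
NEIGHBOR_SYMBOL_OFFSET = 2796     # symbol14311 + 2796 (thread #5, crash 1)
CRASH_RIP = 0x000078E09D37D680    # rip from the crash 1 register dump

# Start of the unnamed function, derived from the crash offset.
symbol_start = CRASH_FILE_OFFSET - CRASH_SYMBOL_OFFSET

# Where thread #5 was stopped, expressed as a file offset.
neighbor_file_offset = symbol_start + NEIGHBOR_SYMBOL_OFFSET

# Module load base in crash 1, under the assumption stated above.
load_base = CRASH_RIP - CRASH_FILE_OFFSET

print(hex(CRASH_SYMBOL_OFFSET))   # 0xaf0, matches "symbol14311 + 0xAF0"
print(hex(symbol_start))
print(hex(neighbor_file_offset))  # 0x57d67c, the AND instruction
print(hex(load_base))
```

The neighbor thread's offset lands on `57d67c`, the `and` instruction immediately before the faulting `mov`, which matches the disassembly in the crash mechanism section.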

---

### Regression?

**Yes.** The same application binary (built targeting `net9.0`) runs stably on .NET 9.0.1 but crashes consistently on .NET 10.0.3.

- **.NET 9.0.1**: **13,700+ CPU-minutes** of stable operation, zero crashes, same workload.
- **.NET 10.0.3**: Crashes within ~1 minute, 100% reproducible (2/2 attempts crashed at identical instruction).

### Known Workarounds

- **Confirmed**: Running on .NET 9.0.1 instead of 10.0.3 — the application is stable.
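- **Untested but possible**: `DOTNET_gcServer=0` (switch to Workstation GC) or `DOTNET_GCHeapCount=4` (reduce the number of GC heaps, and with it the number of parallel GC worker threads) may avoid the crash, if it is in fact caused by a race between GC worker threads.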

### Configuration

- **Runtime**: .NET 10.0.3, version `10.0.326.7603`, commit `c2435c3e0f46de784341ac3ed62863ce77e117b4`
- **SDK**: 10.0.103
- **OS**: Ubuntu 22.04.1 LTS, kernel `6.8.12`, x86_64
- **Hardware**: Reproduced on multiple machines, e.g. an AWS EC2 m5zn.metal instance, 188 GB RAM, no swap, Intel Xeon Platinum (Skylake/Cascade Lake)
- **GC mode**: Server GC, DATAS disabled (`System.GC.DynamicAdaptationMode = 0`)
- **Installed runtimes**: 7.0.19, 9.0.1, 10.0.3 (side-by-side)
- **libcoreclr.so**: `/usr/lib/dotnet/shared/Microsoft.NETCore.App/10.0.3/libcoreclr.so`, stripped (no debug symbols), ELF 64-bit, build-id `b2c2db54fcf5a2c174f255e4769dbcda03740d4c`

### Other information

### Crash report JSON structure (`DOTNET_EnableCrashReport=1`)

The JSON crash report from crash 2 confirms:
- `"version": "10.0.326.7603 @Commit: c2435c3e0f46de784341ac3ed62863ce77e117b4"`
- `"ExceptionType": "0x20000000"`
- 1019 threads total, crashing thread index 550, `"crashed": "true"`
- 946 of 989 managed threads had only `"unknown"` method names — consistent with all managed threads being suspended during GC

### Main thread managed stack at crash time

Thread 0 was in normal application startup — waiting on `Task.InternalWait()` inside a WebSocket stream connection setup. No unusual or error-related code path.

### GC pause observation just before crash 2

The application's own GC monitoring detected a ~321ms stall of its busy-wait threads, with the last observed GC being a 13ms collection at 05:47:09.935–05:47:09.948. The crash occurred shortly after, consistent with a GC cycle that began normally but ended in the SIGSEGV.

### dotnet-dump limitation

`dotnet-dump` version 9.0.661903 cannot analyze .NET 10 dumps (`Failed to load data access module, 0x80004002`). No .NET 10 version of `dotnet-dump` is available on NuGet, and `setclrpath` to the .NET 10 DAC directory also fails. Debug symbols for this runtime build are not available on the Microsoft symbol server (`build-id b2c2db54fcf5a2c174f255e4769dbcda03740d4c`).

### Dumps available

Full core dumps (24–28 GB) and the 10 MB JSON crash report are preserved on the server and can be shared if needed for investigation.

