Fix ARM64 interface dispatch cache torn read#126346
MichalStrehovsky merged 3 commits into dotnet:main from
Conversation
On ARM64, the CHECK_CACHE_ENTRY macro read m_pInstanceType and m_pTargetCode from a cache entry using two separate ldr instructions separated by a control dependency (cmp/bne). ARM64's weak memory model does not order loads across control dependencies, so the hardware can speculatively satisfy the second load (the target) before the first (the type) commits. When a concurrent thread atomically populates the entry via stlxp/casp (UpdateCacheEntryAtomically), the reader can observe the new m_pInstanceType but the old m_pTargetCode (0), then br to address 0.

Fix by using ldp to load both fields in a single instruction (single-copy atomic on FEAT_LSE2 / ARMv8.4+ hardware), plus a cbz guard to catch torn reads on pre-LSE2 hardware, where ldp pair atomicity is not architecturally guaranteed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
Pull request overview
Fixes a potential torn-read race in the ARM64 cached interface dispatch fast-path that could lead to branching to address 0 when reading a concurrently populated cache entry.
Changes:
- Replace the two independent loads of the cache-entry fields with a single ldp pair load to avoid reordering across control dependencies on ARM64.
- Add a cbz guard on the loaded target to treat observed torn reads (type updated, target still 0) as a cache miss on pre-LSE2 hardware.
- Mirror the changes in both the GAS (.S) and ARMASM (.asm) implementations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/runtime/arm64/StubDispatch.S | Updates ARM64 stub macro to use ldp + cbz to avoid torn cache-entry reads. |
| src/coreclr/runtime/arm64/StubDispatch.asm | Same logic as above for the ARMASM variant to keep implementations consistent. |
Instead of emitting an add instruction per cache entry when the ldp offset exceeds [-512,504], rebase x9 once when the threshold is crossed. This keeps the per-entry probe to a single ldp for all entries in the 32/64 slot stubs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run runtime-nativeaot-outerloop

Azure Pipelines successfully started running 1 pipeline(s).
Apple ARM64 platforms all have FEAT_LSE2, which makes ldp single-copy atomic for 16-byte aligned pairs. The cbz torn-read guard is unnecessary there. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
I also looked at whether the new scheme over at #123252 would have this issue in the monomorphic path (which looks similar to this one), and it seems like it doesn't because of its ldar use. I guess ldar could be another option here.
/azp run runtime-nativeaot-outerloop

Azure Pipelines successfully started running 1 pipeline(s).
/ba-g the GC test issue is unrelated and I fixed it. The leg timeouts are not interesting because they're not ARM64.
/backport to release/10.0
Started backporting to |
Fixes #126345