In the following blog post, a user points out that our compiler always emits load-linked/store-conditional (LL/SC) sequences on ARM, even on hardware that supports better alternatives.
https://megayuchi.com/2019/12/08/surface-pro-x-benchmark-from-the-programmers-point-of-view/
We should work with our hardware partners to determine where we can emit better instructions when a CPU feature test shows the hardware supports them.
I have asked folks who own these parts of the compiler to indicate here which specific intrinsics they want us to call.
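To make the discussion concrete, here is a rough sketch of what "emit better instructions given a CPU feature test" could look like at the library level. It is only an illustration under stated assumptions, not a proposal for the actual compiler change: the helper names (`cpu_has_lse_atomics`, `select_fetch_add`, `fetch_add_llsc`, `fetch_add_lse`) are hypothetical, both code paths are stand-ins that call the same C11 atomic, and the detection shown is the Linux/AArch64 one (`getauxval` with `HWCAP_ATOMICS`). On Windows the analogous query should be `IsProcessorFeaturePresent(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE)`, which is left out of the code to keep it short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8) /* ARMv8.1 LSE atomics bit in AT_HWCAP */
#endif
#endif

/* Hypothetical helper: true when the CPU reports ARMv8.1 LSE atomics.
 * Only the Linux/AArch64 query is sketched; every other platform
 * conservatively reports false and would need its own check. */
static bool cpu_has_lse_atomics(void) {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;
#endif
}

typedef int64_t (*fetch_add_fn)(_Atomic int64_t *, int64_t);

/* Stand-in for the baseline path (an LL/SC retry loop on ARMv8.0). */
static int64_t fetch_add_llsc(_Atomic int64_t *p, int64_t v) {
    assert(p != NULL);
    return atomic_fetch_add(p, v);
}

/* Stand-in for the single-instruction path (e.g. LDADDAL on ARMv8.1).
 * A real implementation would be compiled with LSE enabled; this one
 * deliberately reuses the portable atomic so the sketch runs anywhere. */
static int64_t fetch_add_lse(_Atomic int64_t *p, int64_t v) {
    assert(p != NULL);
    return atomic_fetch_add(p, v);
}

/* Run the feature test and return the matching implementation.
 * Callers would normally do this once and cache the pointer. */
static fetch_add_fn select_fetch_add(void) {
    return cpu_has_lse_atomics() ? fetch_add_lse : fetch_add_llsc;
}
```

A caller would fetch the pointer once at startup and then call through it, so the feature test stays off the hot path. Whether the compiler should do this via dispatch, via inline feature checks, or via a target switch is exactly the open question for the owners mentioned above.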