Skip to content

Fix for arm64 outerloop jitstress failures in FMA#126434

Merged
dhartglassMSFT merged 2 commits intodotnet:mainfrom
dhartglassMSFT:fix_126379
Apr 7, 2026
Merged

Fix for arm64 outerloop jitstress failures in FMA#126434
dhartglassMSFT merged 2 commits intodotnet:mainfrom
dhartglassMSFT:fix_126379

Conversation

@dhartglassMSFT
Copy link
Copy Markdown
Contributor

@dhartglassMSFT dhartglassMSFT commented Apr 1, 2026

Previous codegen for FMLA, when supplied with a non-constant operand, looked something like:

mov     v0.8b, v8.8b // fill in RMW operand
br      x20
fmla    v0.2s, v9.2s, v10.s[0]
b       G_M19624_IG23
fmla    v0.2s, v9.2s, v10.s[1]
b       G_M19624_IG23
// and so on for each of the immediates

With the recent refactor to move the mov-emit-for-RMW down into the emit routines, the mov is emitted instead when codegen happens for each arm of the jump table:

br      x20
mov     v0.8b, v8.8b                    // fill in RMW operand
fmla    v0.2s, v9.2s, v10.s[0]
b       G_M19624_IG24
mov     v0.8b, v8.8b                   // fill in RMW operand
fmla    v0.2s, v9.2s, v10.s[1]
b       G_M19624_IG24
// etc...

Bug happens because the jump table builder thought each case would contain only 1 instruction, not two.

@a74nh Another option to fix this is to undo the refactor just for this case, and not duplicate the mov in each branch. This would yield slightly smaller codesize. I think it shouldn't matter much overall because the size difference is small, and the non-constant-operand case is rare. Let me know if we should go this route instead

fixes #126379

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 1, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes ARM64 JIT codegen for certain HWIntrinsics that use a non-constant immediate (requiring a jump table) and have RMW semantics (or otherwise emit an extra mov), ensuring the jump table entry-size calculation matches the actual number of emitted instructions.

Changes:

  • Pass an explicit numInstrs to HWIntrinsicImmOpHelper for RMW SIMDByIndexedElement (4-operand) and RMW shift-by-immediate intrinsics to account for the extra mov that may be emitted.
  • Add debug assertions that the 2-operand and 3-operand SIMDByIndexedElement immediate cases are not treated as RMW.

@a74nh
Copy link
Copy Markdown
Contributor

a74nh commented Apr 2, 2026

@a74nh Another option to fix this is to undo the refactor just for this case, and not duplicate the mov in each branch. This would yield slightly smaller codesize. I think it shouldn't matter much overall because the size difference is small, and the non-constant-operand case is rare. Let me know if we should go this route instead

As long as we don't change the code in the emitter, then I'm happy.

If you undid the code in these two codegen functions, then there would be a mov added if required, and then the call to the emitter would be
emitIns_R_R_R_R_I(ins, emitSize, targetReg, targetReg, op2Reg, op3Reg, elementIndex, opt);
which means the emitter wouldn't emit a movprfx.

I'm happy with doing it that way.

Yes, it's only two functions (and they can't get inlined), but we should set a precedence if we need more in the future.

@dhartglassMSFT
Copy link
Copy Markdown
Contributor Author

/azp run runtime-coreclr jitstress2-jitstressregs

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dhartglassMSFT
Copy link
Copy Markdown
Contributor Author

jitstress regs failures are infrastructure failures

osx-arm64 stress passed, at least

@dhartglassMSFT dhartglassMSFT merged commit 95f94b6 into dotnet:main Apr 7, 2026
146 of 147 checks passed
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Apr 9, 2026
Previous codegen for FMLA, when supplied with a non-constant operand,
looked something like:
```
mov     v0.8b, v8.8b // fill in RMW operand
br      x20
fmla    v0.2s, v9.2s, v10.s[0]
b       G_M19624_IG23
fmla    v0.2s, v9.2s, v10.s[1]
b       G_M19624_IG23
// and so on for each of the immediates
```
With the recent refactor to move the mov-emit-for-RMW down into the emit
routines, the mov is emitted instead when codegen happens for each arm
of the jump table:
```
br      x20
mov     v0.8b, v8.8b                    // fill in RMW operand
fmla    v0.2s, v9.2s, v10.s[0]
b       G_M19624_IG24
mov     v0.8b, v8.8b                   // fill in RMW operand
fmla    v0.2s, v9.2s, v10.s[1]
b       G_M19624_IG24
// etc...
```
Bug happens because the jump table builder thought each case would
contain only 1 instruction, not two.

fixes dotnet#126379
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test Failure: _AdvSimd_ro::JIT.HardwareIntrinsics.Arm._AdvSimd.Program.MultiplyBySelectedScalarWideningUpperAndAdd_Vector128_Int16_Vector128_Int16_7()

4 participants