
8251152: ARM32: jtreg c2 Test8202414 test crash #48

Closed
fzhinkin wants to merge 1 commit into openjdk:master from fzhinkin:8251152-skip-unaligned-memory-accesses-related-tests

Conversation

@fzhinkin
Contributor

@fzhinkin fzhinkin commented Sep 7, 2020

Some CPUs (like ARM32) do not support unaligned memory accesses. To avoid JVM crashes, tests that perform such accesses should be skipped on the corresponding platforms.
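The gating pattern this PR adds can be sketched in plain Java as below. The real test calls `jdk.internal.misc.Unsafe.getUnsafe().unalignedAccess()` and throws jtreg's `SkippedException`; both are modeled here with simplified stand-ins so the sketch is self-contained.

```java
// Hedged, self-contained sketch of the skip pattern; SkippedException
// and the boolean parameter are stand-ins for jtreg's exception and
// Unsafe.unalignedAccess(), respectively.
public class UnalignedAccessGate {
    static class SkippedException extends RuntimeException {
        SkippedException(String msg) { super(msg); }
    }

    // In the real test, 'supported' would come from
    // jdk.internal.misc.Unsafe.getUnsafe().unalignedAccess().
    static boolean runOrSkip(boolean supported) {
        if (!supported) {
            throw new SkippedException(
                "CPU does not support unaligned memory accesses");
        }
        return true; // the actual test body would execute here
    }

    public static void main(String[] args) {
        try {
            runOrSkip(false);
        } catch (SkippedException e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

jtreg treats a test that throws `SkippedException` as skipped rather than failed, which is why the check throws instead of silently returning.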


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/48/head:pull/48
$ git checkout pull/48

@bridgekeeper

bridgekeeper Bot commented Sep 7, 2020

👋 Welcome back fzhinkin! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin Setting summary to Some CPUs (like ARM32) do not support unaligned memory accesses. To avoid JVM crashes, tests that perform such accesses should be skipped on the corresponding platforms.

@openjdk openjdk Bot changed the title 8251152: Skip Test8202414 on CPUs missing unaligned memory accesses support 8251152: ARM32: jtreg c2 Test8202414 test crash Sep 7, 2020
@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin This issue is referenced in the PR title - it will now be updated.

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin The command reviewer cannot be used in the pull request body. Please use it in a new comment.

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin The command reviewer cannot be used in the pull request body. Please use it in a new comment.

@fzhinkin
Contributor Author

fzhinkin commented Sep 7, 2020

/reviewer add iignatyev
/reviewer add clanger

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin
Reviewer iignatyev successfully added.

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin This change now passes all automated pre-integration checks. When the change also fulfills all project specific requirements, type /integrate in a new comment to proceed. After integration, the commit message will be:

8251152: ARM32: jtreg c2 Test8202414 test crash

Some CPUs (like ARM32) do not support unaligned memory accesses. To avoid JVM crashes, tests that perform such accesses should be skipped on the corresponding platforms.

Reviewed-by: iignatyev, clanger
  • If you would like to add a summary, use the /summary command.
  • To credit additional contributors, use the /contributor command.
  • To add additional solved issues, use the /issue command.

Since the source branch of this PR was last updated there have been 2 commits pushed to the master branch:

  • e0d5b5f: 8252627: Make it safe for JFR thread to read threadObj
  • e29c3f6: 8252661: Change SafepointMechanism terminology to talk less about "blocking"

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge master into your branch, and then specify the current head hash when integrating, like this: /integrate e0d5b5f7f2c7290db0680d060acad66066b83499.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk Bot added the ready Pull request is ready to be integrated label Sep 7, 2020
@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin
Reviewer clanger successfully added.

@fzhinkin
Contributor Author

fzhinkin commented Sep 7, 2020

/issue add 8251152

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin This issue is referenced in the PR title - it will now be updated.

@fzhinkin fzhinkin marked this pull request as ready for review September 7, 2020 11:19
@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin The following label will be automatically applied to this pull request: hotspot-compiler.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@openjdk openjdk Bot added hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review labels Sep 7, 2020
@mlbridge

mlbridge Bot commented Sep 7, 2020

Webrevs

```
// memory accesses. This test may cause JVM crash due to
// alignment check failure on such CPUs.
if (!jdk.internal.misc.Unsafe.getUnsafe().unalignedAccess()) {
    throw new SkippedException(
```
Member


nit: I don't think we need a line break here.

Contributor Author


Previously, all lines within this file were no longer than 80 chars, so I decided to follow the same restriction.

@fzhinkin
Contributor Author

fzhinkin commented Sep 7, 2020

/reviewer remove clanger
/reviewer add @RealCLanger

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin
Reviewer clanger successfully removed.

@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin
Reviewer clanger successfully added.

@fzhinkin
Contributor Author

fzhinkin commented Sep 7, 2020

/integrate

@openjdk openjdk Bot closed this Sep 7, 2020
@openjdk openjdk Bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 7, 2020
@openjdk

openjdk Bot commented Sep 7, 2020

@fzhinkin Since your change was applied there have been 2 commits pushed to the master branch:

  • e0d5b5f: 8252627: Make it safe for JFR thread to read threadObj
  • e29c3f6: 8252661: Change SafepointMechanism terminology to talk less about "blocking"

Your commit was automatically rebased without conflicts.

Pushed as commit 70d5cac.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
Restore looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl #16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl #32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, #16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, #32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, #48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, #64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, #80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, #96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, #112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, #128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], #144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], #16
  0x0000000106e4e000:   ldp x22, x23, [sp, #16]
  0x0000000106e4e004:   ldp x24, x25, [sp, #32]
  0x0000000106e4e008:   ldp x26, x27, [sp, #48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, #64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, #80]
  0x0000000106e4e014:   ldp x20, x21, [sp], #96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, #16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, #8]
```
e1iu pushed a commit to e1iu/jdk that referenced this pull request Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, using a more efficient approach to move mask value bits from a
predicate register to a general-purpose register, as x86 PMOVMSK[1]
does, by using BEXT[2], which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr #7
        orr     x0, x0, x0, lsr #14
        orr     x0, x0, x0, lsr #28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
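The semantics of VectorMaskToLong can be illustrated with a scalar Java sketch: pack one bit per mask lane into a long, lane 0 into bit 0. The commit above computes this in hardware via SVE2 BEXT; the method name and loop here are purely illustrative, not the HotSpot implementation.

```java
// Hedged scalar sketch of what VectorMaskToLong computes.
public class MaskToLong {
    static long maskToLong(boolean[] mask) {
        long bits = 0L;
        // At most 64 lanes fit into a long result.
        for (int i = 0; i < mask.length && i < 64; i++) {
            if (mask[i]) {
                bits |= 1L << i; // set bit i for an active lane
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // lanes {0, 2} active -> bits 0b101
        System.out.println(maskToLong(new boolean[] {true, false, true}));
    }
}
```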
e1iu pushed a commit to e1iu/jdk that referenced this pull request Apr 21, 2022
asotona added a commit to asotona/jdk that referenced this pull request Feb 9, 2023
asotona added a commit to asotona/jdk that referenced this pull request Feb 15, 2023
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request Jun 9, 2023
robehn pushed a commit to robehn/jdk that referenced this pull request Aug 15, 2023
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Nov 21, 2023
…ng into ldp/stp on AArch64

Macro-assembler on aarch64 can merge adjacent loads or stores
into ldp/stp[1]. For example, it can merge:
```
str     w20, [sp, #16]
str     w10, [sp, #20]
```
into
```
stp     w20, w10, [sp, #16]
```

But C2 may generate a sequence like:
```
str     x21, [sp, #8]
str     w20, [sp, #16]
str     x19, [sp, #24] <---
str     w10, [sp, #20] <--- Before sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```
We can't do any merging for non-adjacent loads or stores.

This patch sorts the spilling or unspilling sequence by offset during
the instruction scheduling and bundling phase. After that, we get a
new sequence:
```
str     x21, [sp, #8]
str     w20, [sp, #16]
str     w10, [sp, #20] <---
str     x19, [sp, #24] <--- After sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```

Then macro-assembler can do ld/st merging:
```
str     x21, [sp, #8]
stp     w20, w10, [sp, #16] <--- Merged
str     x19, [sp, #24]
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
```

To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String [] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`.

Before the patch, the macro-assembler performed ld/st merging
3688 times. After the patch, the number of merges increases to
3871, i.e. by ~5%.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
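The core idea of the sorting step above can be sketched in a few lines of Java: order independent spill stores by their stack offset so that adjacent ones become candidates for stp pairing. The `Store` record and names here are hypothetical stand-ins, not HotSpot's actual node representation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hedged sketch of sorting a spill sequence by offset.
public class SpillSort {
    record Store(String reg, int offset) {}

    static List<Store> sortByOffset(List<Store> spills) {
        List<Store> sorted = new ArrayList<>(spills);
        // Ascending offsets put adjacent slots next to each other,
        // which is what lets the macro-assembler merge into stp.
        sorted.sort(Comparator.comparingInt(Store::offset));
        return sorted;
    }

    public static void main(String[] args) {
        // Mirrors the example above: [sp, #24] and [sp, #20] swap.
        List<Store> spills = List.of(
            new Store("x19", 24), new Store("w10", 20));
        System.out.println(sortByOffset(spills));
    }
}
```

Note that this sketch only reorders stores that are mutually independent; the real patch applies the reordering inside the scheduling and bundling phase, where independence is already established.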
pf0n pushed a commit to pf0n/jdk that referenced this pull request Jul 9, 2025
fg1417 added a commit to fg1417/jdk that referenced this pull request Mar 12, 2026
JDK-8196064 added support for merging 4- and 8-byte scalar
load/store operations into load/store pairs. The AArch64 platform
also supports SIMD load/store pairs for SIMD&FP registers[1].

This patch extends that functionality to support merging 4-, 8-,
and 16-byte vector load/store operations into vector load/store
pairs.

For example, given the following assembly:
```
str q16, [x14, #32]
str q16, [x14, #48]
```
after this change, it can be merged into:
```
stp q16, q16, [x14, #32]
```
Tier 1~3 tests passed on aarch64 platform.

[1] https://developer.arm.com/documentation/ddi0602/2025-12/SIMD-FP-Instructions/LDP--SIMD-FP---Load-pair-of-SIMD-FP-registers-?lang=en
fg1417 added a commit to fg1417/jdk that referenced this pull request Mar 13, 2026
…marks after JDK-8340093

JDK-8340093 enabled auto-vectorization for more reduction loop cases
using 128-bit vector operations. As a result, the following
microbenchmarks are negatively affected:
VectorReduction2.longAddDotProduct
VectorReduction2.longMulDotProduct
VectorReduction2.longMulSimple

This patch fixes these regressions.

1. Improve code generation for MLA

For longAddDotProduct[1], the current implementation generates
vectorized code similar to:
```
ldr     q17, [x12, #16]
ldr     q18, [x11, #16]
mla     z16.d, p7/m, z17.d, z18.d
ldr     q17, [x11, #32]
ldr     q18, [x12, #32]
mla     z16.d, p7/m, z18.d, z17.d
...
ldr     q17, [x11, #128]
ldr     q18, [x12, #128]
mla     z16.d, p7/m, z18.d, z17.d
```
`z16` is the third source and destination register. There are
true dependencies between consecutive mla[2] instructions.
As a result, this vectorized code performs significantly worse
than the scalar version due to limited instruction-level
parallelism.

These mla instructions are produced by a backend match rule that
fuses AddVL and MulVL into a vector MLA[3]. In this situation,
avoiding instruction fusion and instead generating separate SVE
mul and add instructions can improve instruction-level parallelism
and overall performance.

To address this, this patch introduces
is_multiply_accumulate_candidate() to determine whether a node is
a suitable vector MLA candidate. For node patterns that may
increase execution latency, instruction fusion into MLA is
disabled.

After applying this patch, the generated assembly looks like:
```
ldr     q17, [x12, #16]
ldr     q18, [x11, #16]
ldr     q19, [x11, #32]
mul     z17.d, p7/m, z17.d, z18.d
ldr     q18, [x12, #32]
ldr     q20, [x11, #48]
mul     z18.d, p7/m, z18.d, z19.d
ldr     q19, [x12, #48]
add     v16.2d, v17.2d, v16.2d
ldr     q17, [x11, #64]
add     v16.2d, v18.2d, v16.2d
ldr     q18, [x12, #64]
mul     z19.d, p7/m, z19.d, z20.d
ldr     q20, [x12, #80]
add     v16.2d, v19.2d, v16.2d
```
This sequence exposes more independent operations and reduces
dependency chains, leading to improved performance.

Since SVE mls instructions may suffer from similar issues, the
same logic has been extended to cover MLS as well. Additional
microbenchmarks have been added accordingly.

2. Avoid vectorizing MUL-heavy loops

For longMulSimple[3], the generated vectorized code exhibits
long dependency chains of SVE mul instructions, which results
in worse performance than scalar execution:
```
ldr     q17, [x1, #16]
ldr     q18, [x1, #32]
mul     z17.d, p7/m, z17.d, z16.d
ldr     q16, [x1, #48]
mul     z17.d, p7/m, z17.d, z18.d
ldr     q18, [x1, #64]
mul     z16.d, p7/m, z16.d, z17.d
...
ldr     q16, [x1, #256]
mul     z17.d, p7/m, z17.d, z19.d
mul     z16.d, p7/m, z16.d, z17.d
```

To address this, the patch introduces a platform-specific interface:
`VTransformElementWiseVectorNode::node_weight()`.

For 128-bit operations, this interface detects consecutive vector
long multiply operations and increases the node weight to 4, which is
the minimum value required for the cost model to avoid vectorization
on both 128-bit and 256-bit platforms.

3. Results
Performance measurements on 128-bit and 256-bit SVE machines show that
these changes avoid harmful vectorization and improve overall
performance for the affected benchmarks.

patch: results obtained after applying this patch, using default
auto-vectorization settings (-XX:+UseSuperWord,
-XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)

main-default: results on mainline using the same default
auto-vectorization settings (-XX:+UseSuperWord,
-XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)

main-scalar: results on mainline with -XX:+UseSuperWord and
-XX:AutoVectorizationOverrideProfitability=0 (force scalar code)

The table below reports relative performance changes:
p/m1 = (patch - main-default) / main-default
p/m0 = (patch - main-scalar) / main-scalar

Mode: avgt
Unit: ns/op

Arm Neoverse V2 machine (128 bit SVE):
Benchmark                                         (COUNT)    p/m1       p/m0
TypeVectorOperationsSuperWord.mlaL                  512     0.16%      -50.42%
TypeVectorOperationsSuperWord.mlaL                  2048    0.26%      -56.70%
TypeVectorOperationsSuperWord.mlsL                  512     -0.10%     -50.37%
TypeVectorOperationsSuperWord.mlsL                  2048    0.14%      -56.82%
TypeVectorOperationsSuperWord.mulBigL               512     0.06%      -25.77%
TypeVectorOperationsSuperWord.mulBigL               2048    -0.02%     -19.63%
TypeVectorOperationsSuperWord.mulI                  512     0.63%      -63.44%
TypeVectorOperationsSuperWord.mulI                  2048    0.28%      -63.07%
TypeVectorOperationsSuperWord.mulL                  512     -0.03%     -50.47%
TypeVectorOperationsSuperWord.mulL                  2048    0.29%      -50.82%
TypeVectorOperationsSuperWord.mulMediumL            512     -0.19%     -27.54%
TypeVectorOperationsSuperWord.mulMediumL            2048    0.24%      -25.18%
TypeVectorOperationsSuperWord.mulMlaLDependent      512     0.30%      -28.70%
TypeVectorOperationsSuperWord.mulMlaLDependent      2048    0.12%      -26.74%
TypeVectorOperationsSuperWord.mulMlaLIndependent    512     -10.43%    -43.09%
TypeVectorOperationsSuperWord.mulMlaLIndependent    2048    -14.82%    -42.68%
VectorReduction2.WithSuperword.longAddBig           2048    -15.15%    -44.01%
VectorReduction2.WithSuperword.longAddBigMixSub1    2048    -6.19%     -43.92%
VectorReduction2.WithSuperword.longAddBigMixSub2    2048    -15.18%    -43.90%
VectorReduction2.WithSuperword.longAddBigMixSub3    2048    -5.74%     -43.87%
VectorReduction2.WithSuperword.longAddDotProduct    2048    -33.36%    -18.16%
VectorReduction2.WithSuperword.longAddSimple        2048    -0.02%     -6.72%
VectorReduction2.WithSuperword.longAndBig           2048    -16.32%    -44.06%
VectorReduction2.WithSuperword.longAndDotProduct    2048    -0.01%     -3.74%
VectorReduction2.WithSuperword.longAndSimple        2048    0.00%      -6.35%
VectorReduction2.WithSuperword.longMaxBig           2048    -15.29%    -52.09%
VectorReduction2.WithSuperword.longMaxDotProduct    2048    -0.03%     -52.08%
VectorReduction2.WithSuperword.longMaxSimple        2048    -0.40%     -52.74%
VectorReduction2.WithSuperword.longMinBig           2048    -14.88%    -51.70%
VectorReduction2.WithSuperword.longMinDotProduct    2048    0.01%      -52.21%
VectorReduction2.WithSuperword.longMinSimple        2048    0.26%      -52.88%
VectorReduction2.WithSuperword.longMulBig           2048    -2.21%     -0.07%
VectorReduction2.WithSuperword.longMulDotProduct    2048    -15.47%    0.00%
VectorReduction2.WithSuperword.longMulSimple        2048    -17.87%    -0.33%
VectorReduction2.WithSuperword.longOrBig            2048    -15.23%    -43.94%
VectorReduction2.WithSuperword.longOrDotProduct     2048    -0.01%     -3.83%
VectorReduction2.WithSuperword.longOrSimple         2048    -0.01%     -6.60%
VectorReduction2.WithSuperword.longXorBig           2048    -10.03%    -41.62%
VectorReduction2.WithSuperword.longXorDotProduct    2048    0.01%      -38.61%
VectorReduction2.WithSuperword.longXorSimple        2048    0.02%      -53.18%

Arm Neoverse V1 machine (256 bit SVE):
Note: In the current mainline code, the AArch64 backend supports
only 128-bit multiply long operations. Auto-vectorization accounts
for this backend constraint and splits 256-bit vectors into 128-bit
chunks so that the loop can still be vectorized. This is why
256-bit platforms also benefit from this patch.

No obvious performance changes are observed for other benchmarks.

Benchmark                           (COUNT)       p/m1       p/m0
VectorReduction2.longMulDotProduct    2048       -28.23%    0.00%
VectorReduction2.longMulSimple        2048       -19.29%    0.01%

Tier 1 - 3 passed on both aarch64 and x86 platforms.

[1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en
[3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617
[4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035
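For reference, the reduction shape discussed above (longAddDotProduct, per the linked VectorReduction2 benchmark) boils down to a loop like the following sketch; every iteration adds into the same accumulator, which is exactly the serial dependency chain the MLA form exposes.

```java
// Hedged sketch of the long dot-product reduction shape.
public class LongDot {
    static long dotProduct(long[] a, long[] b) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            // The multiply results are independent, but every add
            // feeds the single accumulator 'sum'.
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(
            dotProduct(new long[] {1, 2, 3}, new long[] {4, 5, 6}));
    }
}
```

Splitting the fused MLA back into separate mul and add, as the patch does, lets the independent multiplies run in parallel while only the adds stay serialized.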
fg1417 added a commit to fg1417/jdk that referenced this pull request Mar 30, 2026
The microbenchmark ArraysFill.testLongFill[1] on
128-bit vector platforms generates vectorized store instructions
with non-monotonic memory offsets, e.g.:

str q16, [x12, #80]
str q16, [x12, #48]
str q16, [x12, #128]
...

This arises because SuperWord only considers true dependencies
when building edges (see [3]), and
therefore does not enforce ordering among independent vector
memory operations. These nodes are later scheduled using RPO,
which can result in an apparently unordered sequence of memory
accesses.

This patch replaces RPO-based scheduling with a priority-based
topological sort to improve ordering and locality.

The scheduling policy is:
1. Prefer nodes whose weak predecessors have already been
scheduled.
2. Prioritize node types in the following order: scalar
operations (loads/stores, address expressions), vector arithmetic,
vector loads, vector stores, then others.
3. For independent loads/stores sharing the same base address,
prefer ascending offsets.
4. Use VTransformNodeIDX to ensure stable ordering.

With this change, the generated code becomes monotonic in memory
offsets:

str q16, [x12, #16]
str q16, [x12, #32]
str q16, [x12, #48]
...

On Arm Neoverse V2 machine (128 bit SVE), this improves the
following benchmarks:

TypeVectorOperationsSuperWord.java[2]

Benchmark          (COUNT)   Mode    Units    Difference
absD                 512     avgt    ns/op    -27.05%
absD                 2048    avgt    ns/op    -27.05%
absL                 512     avgt    ns/op    -24.46%
absL                 2048    avgt    ns/op    -27.26%
convertD2LBitsRaw    512     avgt    ns/op    -20.39%
convertD2LBitsRaw    2048    avgt    ns/op    -23.92%
convertF2L           512     avgt    ns/op    -16.82%
convertF2L           2048    avgt    ns/op    -22.60%
convertI2D           512     avgt    ns/op    -12.50%
convertI2D           2048    avgt    ns/op    -17.92%
convertLBits2D       512     avgt    ns/op    -27.13%
convertLBits2D       2048    avgt    ns/op    -31.69%
negD                 512     avgt    ns/op    -26.85%
negD                 2048    avgt    ns/op    -27.09%

ArraysFill.java[1]:

Benchmark       (size)    Mode     Units     Difference
testDoubleFill    250     thrpt    ops/ms    26.46%
testDoubleFill    266     thrpt    ops/ms    32.69%
testDoubleFill    511     thrpt    ops/ms    33.83%
testDoubleFill    2047    thrpt    ops/ms    45.35%
testDoubleFill    2048    thrpt    ops/ms    45.38%
testDoubleFill    8195    thrpt    ops/ms    49.32%
testLongFill      250     thrpt    ops/ms    28.12%
testLongFill      266     thrpt    ops/ms    40.30%
testLongFill      511     thrpt    ops/ms    34.79%
testLongFill      2047    thrpt    ops/ms    45.71%
testLongFill      2048    thrpt    ops/ms    53.07%
testLongFill      8195    thrpt    ops/ms    49.52%

No significant performance changes are observed on wider vector
platforms (e.g., 256-bit or 512-bit), where fewer vector
operations are generated in SuperWord and scheduling has less
impact.

[1] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/java/util/ArraysFill.java#L92
[2] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java
[3] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/src/hotspot/share/opto/superwordVTransformBuilder.cpp#L99
ruben-arm added a commit to ruben-arm/jdk that referenced this pull request Mar 30, 2026
Some vector operations do not have inputs and essentially initialize
vectors with a constant value. These operations can be marked for
spilling and subsequently rematerialized at every use. The result of
the transformation might look as follows:
   movi    v16.2d, #0x0
   str     q16, [x16, #64]
   movi    v16.2d, #0x0
   str     q16, [x16, #32]
   movi    v16.2d, #0x0
   str     q16, [x16, #16]
   movi    v16.2d, #0x0
   str     q16, [x16]
   movi    v16.2d, #0x0
   str     q16, [x16, #48]
   movi    v16.2d, #0x0
   str     q16, [x16, #112]
   movi    v16.2d, #0x0
   str     q16, [x16, #80]
   movi    v16.2d, #0x0
   str     q16, [x16, #96]

Introduce deduplication of these rematerialized vector
constant initializations reducing the above sequence to:
   movi    v16.2d, #0x0
   str     q16, [x16, #64]
   str     q16, [x16, #32]
   str     q16, [x16, #16]
   str     q16, [x16]
   str     q16, [x16, #48]
   str     q16, [x16, #112]
   str     q16, [x16, #80]
   str     q16, [x16, #96]
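The deduplication idea can be sketched over a plain instruction list: drop a rematerialized constant-init when the destination register still holds that constant. Plain strings and the clobber heuristic here are stand-ins for the compiler's real IR and liveness information, not the actual HotSpot logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of deduplicating rematerialized constant inits.
public class MoviDedup {
    static List<String> dedup(List<String> insns) {
        List<String> out = new ArrayList<>();
        String live = null; // last constant init still known valid
        for (String insn : insns) {
            if (insn.startsWith("movi")) {
                if (insn.equals(live)) {
                    continue; // register already holds this constant
                }
                live = insn;
            } else if (!insn.startsWith("str")) {
                live = null; // conservatively assume a clobber
            }
            out.add(insn);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of(
            "movi v16.2d, #0x0", "str q16, [x16]",
            "movi v16.2d, #0x0", "str q16, [x16, #16]")));
    }
}
```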
snake66 added a commit to snake66/jdk that referenced this pull request Apr 8, 2026

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

2 participants