[JVMCI] Libgraal can deadlock in blocking compilation mode #16
Closed
dougxc wants to merge 1 commit into openjdk:master from
Conversation
… JVMCI compilation
Welcome to the OpenJDK organization on GitHub! This repository is currently a read-only git mirror of the official Mercurial repository (located at https://hg.openjdk.java.net/). As such, we are not currently accepting pull requests here. If you would like to contribute to the OpenJDK project, please see https://openjdk.java.net/contribute/ on how to proceed. This pull request will be automatically closed.
fisk pushed a commit to fisk/jdk that referenced this pull request on Oct 28, 2020
8246039: SSLSocket HandshakeCompletedListeners are run on virtual threads
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 10, 2021
Like a scalar shift, a vector shift does nothing when the shift count is zero.
This patch implements the 'Identity' method for all kinds of vector shift
nodes to optimize out shifts by a zero 'ShiftCntV', which typically shows up
as a redundant 'mov' in the final generated code, like below:
```
add x17, x12, x14
ldr q16, [x17, #16]
mov v16.16b, v16.16b
add x14, x13, x14
str q16, [x14, #16]
```
With this patch, the code above could be optimized as below:
```
add x17, x12, x14
ldr q16, [x17, #16]
add x14, x13, x14
str q16, [x14, #16]
```
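As a scalar sanity check of the identity being implemented, here is a minimal Java sketch (the helper name is illustrative, not the C2 code): shifting by a count of zero must return the input unchanged, which is exactly why the redundant 'mov' above can be dropped.

```java
public class ShiftIdentity {
    // Element-wise shift of every lane; a count of zero leaves each lane
    // unchanged, so the whole operation is the identity on src.
    static int[] shiftLeft(int[] src, int count) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] << count;
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = {1, -2, 3, Integer.MIN_VALUE};
        int[] r = shiftLeft(a, 0);
        for (int i = 0; i < a.length; i++) {
            assert r[i] == a[i] : "shift by zero must be the identity";
        }
        System.out.println("ok");
    }
}
```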
[TESTS]
compiler/vectorapi/TestVectorShiftImm.java, jdk/incubator/vector and
hotspot::tier1 passed without new failures.
Change-Id: I7657c0daaa5f758966936b9ede670c8b9ad94c48
cushon pushed a commit to cushon/jdk that referenced this pull request on Apr 2, 2021
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Apr 7, 2021
The vector shift count was defined by two separate nodes (LShiftCntV and
RShiftCntV), which prevented them from being shared even when the shift
counts are the same.
```
public static void test_shiftv(int sh) {
for (int i = 0; i < N; i+=1) {
a0[i] = a1[i] << sh;
b0[i] = b1[i] >> sh;
}
}
```
Given the example above, by merging the same shift counts into one
node, it can be shared by the shift nodes (RShiftV or LShiftV) as
below:
```
Before:
1184 LShiftCntV === _ 1189 [[ 1185 ... ]]
1190 RShiftCntV === _ 1189 [[ 1191 ... ]]
1185 LShiftVI === _ 1181 1184 [[ 1186 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
After:
1190 ShiftCntV === _ 1189 [[ 1191 1204 ... ]]
1204 LShiftVI === _ 1211 1190 [[ 1203 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
```
The final code removes one redundant "dup" (scalar->vector),
saving one register.
```
Before:
dup v16.16b, w12
dup v17.16b, w12
...
ldr q18, [x13, #16]
sshl v18.4s, v18.4s, v16.4s
add x18, x16, x12 ; iaload
add x4, x15, x12
str q18, [x4, #16] ; iastore
ldr q18, [x18, #16]
add x12, x14, x12
neg v19.16b, v17.16b
sshl v18.4s, v18.4s, v19.4s
str q18, [x12, #16] ; iastore
After:
dup v16.16b, w11
...
ldr q17, [x13, #16]
sshl v17.4s, v17.4s, v16.4s
add x2, x22, x11 ; iaload
add x4, x16, x11
str q17, [x4, #16] ; iastore
ldr q17, [x2, #16]
add x11, x21, x11
neg v18.16b, v16.16b
sshl v17.4s, v17.4s, v18.4s
str q17, [x11, #16] ; iastore
```
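The sharing mechanism can be sketched as GVN-style hash-consing (a minimal illustrative model, not C2's actual GVN): once left and right shift counts use one opcode, identical (opcode, input) pairs collapse to a single node.

```java
import java.util.HashMap;
import java.util.Map;

public class ShiftCntSharing {
    static final Map<String, Integer> cache = new HashMap<>();
    static int nextId = 0;

    // GVN-style hash-consing: identical (opcode, input) keys share one node id.
    static int makeNode(String opcode, int input) {
        return cache.computeIfAbsent(opcode + ":" + input, k -> nextId++);
    }

    public static void main(String[] args) {
        int l = makeNode("LShiftCntV", 42);  // distinct opcodes never unify...
        int r = makeNode("RShiftCntV", 42);
        int s1 = makeNode("ShiftCntV", 42);  // ...one merged opcode does
        int s2 = makeNode("ShiftCntV", 42);
        System.out.println((l != r) + " " + (s1 == s2)); // prints "true true"
    }
}
```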
Change-Id: I047f3f32df9535d706a9920857d212610e8ce315
openjdk-notifier Bot pushed a commit that referenced this pull request on Oct 5, 2021
r18 should not be used as it is reserved as a platform register. Linux is fine with userspace using it, but Windows and recently also macOS ( openjdk/jdk11u-dev#301 (comment) ) are actually using it on the kernel side. The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404
I haven't seen this particular instance causing any issues in practice _yet_, presumably because it looks hard to align the stars in order to trigger a problem (between the stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that causes trouble as explained in the link above.
Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000138809b00: ldr x2, [x12, #16]
0x0000000138809b04: ldrh w2, [x2, #44]
0x0000000138809b08: add x24, x20, x2, uxtx #3
0x0000000138809b0c: sub x24, x24, #0x8
[...]
0x0000000138809fa4: stp x16, x17, [sp, #128]
0x0000000138809fa8: stp x18, x19, [sp, #144]
0x0000000138809fac: stp x20, x21, [sp, #160]
[...]
0x0000000138809fc0: stp x30, xzr, [sp, #240]
0x0000000138809fc4: mov x0, x28
;; 0x10864ACCC
0x0000000138809fc8: mov x9, #0xaccc  // #44236
0x0000000138809fcc: movk x9, #0x864, lsl #16
0x0000000138809fd0: movk x9, #0x1, lsl #32
0x0000000138809fd4: blr x9
0x0000000138809fd8: ldp x2, x3, [sp, #16]
[...]
0x0000000138809ff4: ldp x16, x17, [sp, #128]
0x0000000138809ff8: ldp x18, x19, [sp, #144]
0x0000000138809ffc: ldp x20, x21, [sp, #160]
```
After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000108e4db00: ldr x2, [x12, #16]
0x0000000108e4db04: ldrh w2, [x2, #44]
0x0000000108e4db08: add x24, x20, x2, uxtx #3
0x0000000108e4db0c: sub x24, x24, #0x8
[...]
0x0000000108e4dfa4: stp x16, x17, [sp, #128]
0x0000000108e4dfa8: stp x19, x20, [sp, #144]
0x0000000108e4dfac: stp x21, x22, [sp, #160]
[...]
0x0000000108e4dfbc: stp x29, x30, [sp, #224]
0x0000000108e4dfc0: mov x0, x28
;; 0x107E4A06C
0x0000000108e4dfc4: mov x9, #0xa06c  // #41068
0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16
0x0000000108e4dfcc: movk x9, #0x1, lsl #32
0x0000000108e4dfd0: blr x9
0x0000000108e4dfd4: ldp x2, x3, [sp, #16]
[...]
0x0000000108e4dff0: ldp x16, x17, [sp, #128]
0x0000000108e4dff4: ldp x19, x20, [sp, #144]
0x0000000108e4dff8: ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request on Oct 6, 2021
Restore looks like this now:
```
0x0000000106e4dfcc: movk x9, #0x5e4, lsl #16
0x0000000106e4dfd0: movk x9, #0x1, lsl #32
0x0000000106e4dfd4: blr x9
0x0000000106e4dfd8: ldp x2, x3, [sp, #16]
0x0000000106e4dfdc: ldp x4, x5, [sp, #32]
0x0000000106e4dfe0: ldp x6, x7, [sp, #48]
0x0000000106e4dfe4: ldp x8, x9, [sp, #64]
0x0000000106e4dfe8: ldp x10, x11, [sp, #80]
0x0000000106e4dfec: ldp x12, x13, [sp, #96]
0x0000000106e4dff0: ldp x14, x15, [sp, #112]
0x0000000106e4dff4: ldp x16, x17, [sp, #128]
0x0000000106e4dff8: ldp x0, x1, [sp], #144
0x0000000106e4dffc: ldp xzr, x19, [sp], #16
0x0000000106e4e000: ldp x22, x23, [sp, #16]
0x0000000106e4e004: ldp x24, x25, [sp, #32]
0x0000000106e4e008: ldp x26, x27, [sp, #48]
0x0000000106e4e00c: ldp x28, x29, [sp, #64]
0x0000000106e4e010: ldp x30, xzr, [sp, #80]
0x0000000106e4e014: ldp x20, x21, [sp], #96
0x0000000106e4e018: ldur x12, [x29, #-24]
0x0000000106e4e01c: ldr x22, [x12, #16]
0x0000000106e4e020: add x22, x22, #0x30
0x0000000106e4e024: ldr x8, [x28, #8]
```
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Dec 8, 2021
The patch aims to help optimize Math.abs() mainly from these three parts:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms
1. Remove redundant instructions for abs with constant values
If we can decide the value of the input node for function Math.abs()
at compile-time, we can substitute the Abs node with the absolute
value of the constant and don't have to calculate it at runtime.
For example,
```
int[] a;
for (int i = 0; i < SIZE; i++) {
  a[i] = Math.abs(-38);
}
```
Before the patch, the generated code for the testcase above is:
```
...
mov w10, #0xffffffda
cmp w10, wzr
cneg w17, w10, lt
dup v16.8h, w17
...
```
After the patch, the generated code for the testcase above is:
```
...
movi v16.4s, #0x26
...
```
2. Remove redundant instructions for abs with char type
In Java semantics, as the char type is always non-negative, we
can simply remove the AbsI node in the C2 middle end.
As for the vectorization part, in the current SLP, vectorization of
Math.abs() with char type was intentionally disabled after
JDK-8261022 because it generated incorrect results before. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.
For example,
```
char[] a;
char[] b;
for (int i = 0; i < SIZE; i++) {
  b[i] = (char) Math.abs(a[i]);
}
```
Before the patch, the generated assembly code for the testcase
above is:
```
B15:
add x13, x21, w20, sxtw #1
ldrh w11, [x13, #16]
cmp w11, wzr
cneg w10, w11, lt
strh w10, [x13, #16]
ldrh w10, [x13, #18]
cmp w10, wzr
cneg w10, w10, lt
strh w10, [x13, #18]
...
add w20, w20, #0x1
cmp w20, w17
b.lt B15
```
After the patch, the generated assembly code is:
```
B15:
sbfiz x18, x19, #1, #32
add x0, x14, x18
ldr q16, [x0, #16]
add x18, x21, x18
str q16, [x18, #16]
ldr q16, [x0, #32]
str q16, [x18, #32]
...
add w19, w19, #0x40
cmp w19, w17
b.lt B15
```
3. Convert some common abs operations to ideal forms
The patch overrides some virtual support functions for AbsNode
so that GVN optimization can work on it. Here are the optimizable
forms:
a) abs(0 - x) => abs(x)
Before the patch:
```
...
ldr w13, [x13, #16]
neg w13, w13
cmp w13, wzr
cneg w14, w13, lt
...
```
After the patch:
```
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
```
b) abs(abs(x)) => abs(x)
Before the patch:
```
...
ldr w12, [x12, #16]
cmp w12, wzr
cneg w12, w12, lt
cmp w12, wzr
cneg w12, w12, lt
...
```
After the patch:
```
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
```
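All three rewrites can be sanity-checked against Java's scalar semantics (a plain sketch, not the C2 code): constant folding of abs, abs as the identity on char, and the two algebraic identities.

```java
public class AbsIdentities {
    public static void main(String[] args) {
        // 1) Constant folding: Math.abs(-38) is a compile-time constant 38.
        assert Math.abs(-38) == 38;
        // 2) char is always non-negative, so abs is the identity on it.
        for (char c = 0; c < 1024; c++) {
            assert Math.abs(c) == c;
        }
        // 3a) abs(0 - x) => abs(x), and 3b) abs(abs(x)) => abs(x).
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE};
        for (int x : samples) {
            assert Math.abs(0 - x) == Math.abs(x);
            assert Math.abs(Math.abs(x)) == Math.abs(x);
        }
        System.out.println("ok");
    }
}
```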
Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
shqking added a commit to shqking/jdk that referenced this pull request on Mar 7, 2022
*** Implementation
In AArch64 NEON, vector shift right is implemented by the vector shift
left instructions (SSHL[1] and USHL[2]) with a negative shift count. In
the C2 backend, we generate a `neg` of the given shift count followed by
an `sshl` or `ushl` instruction.
For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from one vector (case-2).
This patch aims to optimize case-1. Specifically, we move the negate
from the RShiftV* rules to the RShiftCntV rule. As a result, the negate
can be hoisted outside of the loop if it's a loop invariant.
In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.
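The SSHL semantics this relies on can be modeled in scalar Java (an illustrative model of the instruction's lane behavior, not the instruction itself): one "shift left by signed count" primitive covers both directions, so the negate for a right shift can be computed once and hoisted out of the loop.

```java
public class SshlModel {
    // Scalar model of NEON SSHL lane semantics: a non-negative count shifts
    // left, a negative count shifts right (arithmetic). Illustrative only.
    static int sshl(int v, int count) {
        return count >= 0 ? v << count : v >> -count;
    }

    public static void main(String[] args) {
        int count = 3;
        int negated = -count;          // hoistable: computed once per loop
        int[] lanes = {64, -64, 7, -1};
        for (int v : lanes) {
            assert sshl(v, negated) == (v >> count);
        }
        System.out.println("ok");
    }
}
```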
*** Example
Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.
```
public void rShiftInt() {
for (int i = 0; i < SIZE; i++) {
intsB[i] = intsA[i] >> count;
}
}
```
Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.
```
0x0000ffff89057a64: dup v16.16b, w13 <-- dup
0x0000ffff89057a68: mov w12, #0x7d00 // #32000
0x0000ffff89057a6c: sub w13, w2, w10
0x0000ffff89057a70: cmp w2, w10
0x0000ffff89057a74: csel w13, wzr, w13, lt
0x0000ffff89057a78: mov w8, #0x7d00 // #32000
0x0000ffff89057a7c: cmp w13, w8
0x0000ffff89057a80: csel w13, w12, w13, hi
0x0000ffff89057a84: add w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32 <-- loop entry
0x0000ffff89057a94: add x15, x17, x13
0x0000ffff89057a98: ldr q17, [x15,#16]
0x0000ffff89057a9c: add x13, x0, x13
0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
0x0000ffff89057aa8: str q17, [x13,#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add w10, w10, #0x20
0x0000ffff89057b20: cmp w10, w14
0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
```
Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.
```
0x0000ffff8d053a68: neg w14, w13 <---- neg
0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
0x0000ffff8d053a70: sub w14, w2, w10
0x0000ffff8d053a74: cmp w2, w10
0x0000ffff8d053a78: csel w14, wzr, w14, lt
0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
0x0000ffff8d053a80: cmp w14, w8
0x0000ffff8d053a84: csel w14, w12, w14, hi
0x0000ffff8d053a88: add w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32 <-- loop entry
0x0000ffff8d053a94: add x15, x17, x14
0x0000ffff8d053a98: ldr q17, [x15,#16]
0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
0x0000ffff8d053aa0: add x14, x0, x14
0x0000ffff8d053aa4: str q17, [x14,#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add w10, w10, #0x20
0x0000ffff8d053b00: cmp w10, w13
0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
```
*** Testing
Tier1~3 tests passed on Linux/AArch64 platform.
*** Performance Evaluation
- Auto-vectorization
One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.
The following table shows the result. Column `Score-1` shows the score
before applying the current patch, and column `Score-2` shows the score
after applying it.
We see about 30% ~ 53% improvement on the microbenchmarks.
```
Benchmark Units Score-1 Score-2
VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
*: Logical shift right for the Byte type (urShiftByte) is not vectorized, as
discussed in [4].
```
- VectorAPI
Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.
```
Benchmark Units Score-1 Score-2
Byte128Vector.LSHL ops/ms 10867.666 10873.993
Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
Byte128Vector.LSHR ops/ms 8629.305 8629.343
Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
Byte128Vector.ASHR ops/ms 8619.691 8629.438
Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
Int128Vector.LSHL ops/ms 3104.213 3103.702
Int128Vector.LSHLShift ops/ms 3114.354 3114.371
Int128Vector.LSHR ops/ms 2380.717 2380.693
Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
Int128Vector.ASHR ops/ms 2380.668 2380.647
Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
Long128Vector.LSHL ops/ms 1586.907 1587.591
Long128Vector.LSHLShift ops/ms 1589.469 1589.540
Long128Vector.LSHR ops/ms 1209.754 1209.687
Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
Long128Vector.ASHR ops/ms 1209.713 1209.669
Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
Short128Vector.LSHL ops/ms 5945.542 5943.770
Short128Vector.LSHLShift ops/ms 5984.743 5984.640
Short128Vector.LSHR ops/ms 4613.378 4613.577
Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
Short128Vector.ASHR ops/ms 4613.389 4613.478
Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
```
1) For logical shift left (LSHL and LSHLShift) and shift right with a
variable vector shift count (LSHR and ASHR), we didn't observe much
change, which is expected.
2) For shift right with a scalar shift count (LSHRShift and ASHRShift),
about 25% ~ 30% improvement can be observed, and this benefit is
introduced by the current patch.
[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Mar 14, 2022
After JDK-8275317, C2's SLP vectorizer has supported type conversion
between types of the same data size. We can also support conversions
between different data sizes, like:
int <-> double
float <-> long
int <-> long
float <-> double
A typical test case:
```
int[] a;
double[] b;
for (int i = start; i < limit; i++) {
  b[i] = (double) a[i];
}
```
Our expected OptoAssembly code for one iteration is like below:
```
add R12, R2, R11, LShiftL #2
vector_load V16, [R12, #16]
vectorcast_i2d V16, V16 # convert I to D vector
add R11, R1, R11, LShiftL #3 # ptr
add R13, R11, #16 # ptr
vector_store [R13], V16
```
To enable the vectorization, the patch solves the following problems
in the SLP.
There are three main operations in the case above, LoadI, ConvI2D and
StoreD. Assuming that the vector length is 128 bits, how many scalar
nodes should be packed together to a vector? If we decide it
separately for each operation node, like what we did before the patch
in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes
in a vector node sequence, like loading 4 elements to a vector, then
typecasting 2 elements and lastly storing these 2 elements, they become
invalid. As a result, we should look through the whole def-use chain
and then pick the minimum of these element counts, like function
SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
generate valid vector node sequence, like loading 2 elements,
converting the 2 elements to another type and storing the 2 elements
with new type.
After this, LoadI nodes don't make full use of the whole vector and
only occupy part of it. So we adapt the code in
SuperWord::get_vw_bytes_special() to the situation.
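The pack-size rule above can be sketched as follows (a hypothetical helper, not SuperWord::max_vector_size_in_ud_chain() itself): the widest element type anywhere in the def-use chain bounds how many scalars fit in one vector.

```java
public class PackSize {
    // For a 128-bit (16-byte) vector and a chain LoadI -> ConvI2D -> StoreD,
    // the widest element (double, 8 bytes) limits the pack to 2 scalars,
    // even though 4 ints would fit on their own. Illustrative sketch only.
    static int maxPackSize(int vectorBytes, int... elemBytesInChain) {
        int widest = 0;
        for (int b : elemBytesInChain) {
            widest = Math.max(widest, b);
        }
        return vectorBytes / widest;
    }

    public static void main(String[] args) {
        System.out.println(maxPackSize(16, 4, 8, 8)); // prints 2
    }
}
```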
In SLP, we calculate a kind of alignment as a position trace for each
scalar node in the whole vector. In this case, the alignments for the 2
LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are
0 and 8. Here, 4 for LoadI and 8 for ConvI2D mean the same thing: each
marks the second node in the whole vector, and the difference between 4
and 8 is just due to their different data sizes. In this situation, we
should try to remove the impact caused by different
data sizes in SLP. For example, in the stage of
SuperWord::extend_packlist(), while determining if it's potential to
pack a pair of def nodes in the function SuperWord::follow_use_defs(),
we remove the side effect of different data size by transforming the
target alignment from the use node. Because we believe that, assuming
that the vector length is 512 bits, if the ConvI2D use nodes have
alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
these two LoadI nodes should be packed as a pair as well.
Similarly, when determining if the vectorization is profitable, type
conversion between different data sizes takes a type of one size and
produces a type of another size, hence special checks on alignment
and size should be applied, like what we do in SuperWord::is_vector_use().
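The alignment normalization can be sketched by dividing each node's byte alignment by its own element size to get a lane index (a hypothetical helper, not the actual SuperWord code): equal lane indices mean the nodes occupy the same position in the vector, regardless of data size.

```java
public class AlignNormalize {
    // A node's lane index inside the vector is its byte alignment divided by
    // its element size; comparing lane indices removes the data-size skew.
    static int laneIndex(int alignmentBytes, int elemBytes) {
        return alignmentBytes / elemBytes;
    }

    public static void main(String[] args) {
        // ConvI2D uses at alignments 16 and 24 (8-byte doubles) correspond to
        // LoadI defs at alignments 8 and 12 (4-byte ints): same lanes 2 and 3.
        assert laneIndex(16, 8) == laneIndex(8, 4);
        assert laneIndex(24, 8) == laneIndex(12, 4);
        System.out.println("ok");
    }
}
```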
After solving these problems, we successfully implemented the
vectorization of type conversion between different data sizes.
Here is the test data on NEON:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
perf data on X86:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
perf data on AVX512:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 24, 2022
This patch fixes the wrong matching rule of replicate2L_zero. It
matched "ReplicateI" by mistake, so long immediates (not only zero)
had to be moved to a register first and finally matched to replicate2L. To
fix this trivial bug, this patch fixes the typo and extends the
replicate2L_zero rule to replicate2L_imm, which now supports all possible
long immediate values.
The final code changes are shown below:
```
replicate2L_imm:
mov x13, #0xff
movk x13, #0xff, lsl #16
movk x13, #0xff, lsl #32
dup v16.2d, x13
=>
movi v16.2d, #0xff00ff00ff
```
[Test]
test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi
passed without failures.
Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Mar 28, 2022
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
short[] res = new short[SIZE];
for (int i = 0; i < SIZE; i++) {
res[i] = (short) (shorts[i] >>> 3);
}
return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1], because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthwhile to vectorize more cases at quite a low cost. Also,
unsigned shift right on signed subword types is not uncommon, and we
can find similar cases in the Lucene benchmark[2].
Taking unsigned right shift on the short type as an example:
```
Short:
| <- 16 bits -> | <- 16 bits -> |
| 1 1 1 ... 1 1 |     data      |
```
When the shift amount is a constant not greater than the number
of sign-extended bits (the 16 higher bits for the short type, shown
above), the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:
```
For T_SHORT (shift <= 16):
  src RShiftCntV shift      src RShiftCntV shift
    \      /          ==>     \      /
    URShiftVS                 RShiftVS
```
This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz x13, x10, #1, #32
add x15, x11, x13
ldr q16, [x15, #16]
sshr v16.8h, v16.8h, #3
add x13, x17, x13
str q16, [x13, #16]
...
```
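The legality condition can be checked directly against Java's scalar semantics (a plain sanity sketch, not the C2 code): for a short, the unsigned and signed right shifts agree after narrowing back to short whenever the shift amount is at most 16.

```java
public class SubwordShift {
    // The 16 high bits of the promoted int are sign-extension bits, so
    // >>> and >> can only differ in bits the (short) cast drops, as long
    // as the shift amount is <= 16.
    static boolean rewriteLegal(short s, int sh) {
        return (short) (s >>> sh) == (short) (s >> sh);
    }

    public static void main(String[] args) {
        short[] samples = {0, 1, -1, 1234, -1234, Short.MIN_VALUE, Short.MAX_VALUE};
        for (short s : samples) {
            for (int sh = 0; sh <= 16; sh++) {
                assert rewriteLegal(s, sh);
            }
        }
        // Beyond 16 the rewrite would be wrong, e.g. for s == -1, sh == 17.
        assert !rewriteLegal((short) -1, 17);
        System.out.println("ok");
    }
}
```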
Here is the performance data for micro-benchmark before and after
this patch on both AArch64 and x64 machines. We can observe about
~80% improvement with this patch.
The perf data on AArch64:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
after the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
The perf data on X86:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
After the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/
Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64 by using BEXT[2], which is available in SVE2. This gives a more
efficient way to move mask bits from a predicate register to a general
purpose register, as x86 PMOVMSK[1] does.
With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:
Before:
```
mov z16.b, p0/z, #1
fmov x0, d16
orr x0, x0, x0, lsr #7
orr x0, x0, x0, lsr #14
orr x0, x0, x0, lsr #28
and x0, x0, #0xff
fmov x8, v16.d[1]
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #8
orr x8, xzr, #0x2
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #16
orr x8, xzr, #0x3
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #24
orr x8, xzr, #0x4
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #32
mov x8, #0x5
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #40
orr x8, xzr, #0x6
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #48
orr x8, xzr, #0x7
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #56
```
After:
```
mov z16.b, p0/z, #1
mov z17.b, #1
bext z16.d, z16.d, z17.d
mov z17.d, #0
uzp1 z16.s, z16.s, z17.s
uzp1 z16.h, z16.h, z17.h
uzp1 z16.b, z16.b, z17.b
mov x0, v16.d[0]
```
[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
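What both the PMOVMSKB-style and the BEXT-based sequences compute is a lane-mask-to-long packing: the lowest bit of each byte lane is gathered into one 64-bit result. A minimal Python sketch of that semantics (an illustrative model, not JDK code; `mask_to_long` is a hypothetical helper name):

```python
def mask_to_long(lane_bits, num_lanes=64):
    """Pack the lowest bit of each lane into one integer, with
    lane 0 landing in bit 0 of the result (a 512-bit byte vector
    has 64 lanes, so the result fits in a long)."""
    result = 0
    for i, bit in enumerate(lane_bits[:num_lanes]):
        result |= (bit & 1) << i
    return result

# A mask selecting the even-numbered lanes packs to alternating bits.
even = [1 if i % 2 == 0 else 0 for i in range(64)]
packed = mask_to_long(even)
```

The BEXT sequence above performs the same bit gather in hardware, per 64-bit element, before the uzp1 instructions compress the per-element results together.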
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Apr 21, 2022
franferrax added a commit to franferrax/jdk that referenced this pull request on Aug 11, 2022
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Aug 17, 2022
After JDK-8283091, the loop below can be vectorized partially.
Statement 1 can be vectorized but statement 2 can't.
```
// int[] iArr; long[] lArrFld; int i1,i2;
for (i1 = 6; i1 < 227; i1++) {
iArr[i1] += lArrFld[i1]++; // statement 1
iArr[i1 + 1] -= (i2++); // statement 2
}
```
But we got incorrect results because the vector packs of iArr are
scheduled incorrectly like:
```
...
load_vector    XMM1,[R8 + #16 + R11 << #2]
movl    RDI, [R8 + #20 + R11 << #2]    # int
load_vector    XMM2,[R9 + #8 + R11 << #3]
subl    RDI, R11    # int
vpaddq  XMM3,XMM2,XMM0    ! add packedL
store_vector   [R9 + #8 + R11 << #3],XMM3
vector_cast_l2x  XMM2,XMM2    !
vpaddd  XMM1,XMM2,XMM1    ! add packedI
addl    RDI, #228   # int
movl    [R8 + #20 + R11 << #2], RDI    # int
movl    RBX, [R8 + #24 + R11 << #2]    # int
subl    RBX, R11    # int
addl    RBX, #227   # int
movl    [R8 + #24 + R11 << #2], RBX    # int
...
movl    RBX, [R8 + #40 + R11 << #2]    # int
subl    RBX, R11    # int
addl    RBX, #223   # int
movl    [R8 + #40 + R11 << #2], RBX    # int
movl    RDI, [R8 + #44 + R11 << #2]    # int
subl    RDI, R11    # int
addl    RDI, #222   # int
movl    [R8 + #44 + R11 << #2], RDI    # int
store_vector   [R8 + #16 + R11 << #2],XMM1
...
```
simplified as:
```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```
We cannot pick the memory state from the first load for the LoadI pack
here, as the LoadI vector operation must load the new values in memory
after iArr writes 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]' (statement 2).
We must take the memory state of the last load, where we have assigned
the new values ('iArr[i1 + 1] - (i2++)') to the iArr array.
In JDK-8240281, we picked the memory state of the first load. Different
from the scenario in JDK-8240281, the store, which depends on an
earlier load here, is in a pack to be scheduled, and the LoadI pack
depends on the last_mem. As designed[2], to schedule the StoreI pack,
all memory operations in another single pack must be moved in the same
direction. We know that the store in the pack depends on one of the loads
in the LoadI pack, so the LoadI pack should be scheduled before the StoreI
pack. The LoadI pack in turn depends on the last_mem, so the last_mem must
be scheduled before the LoadI pack and also before the store pack.
Therefore, we need to take the memory state of the last load for the
LoadI pack here.
To fix it, the patch adds additional checks while picking the memory state
of the first load. When the store is in a pack and the load pack relies
on the last_mem, we should not choose the memory state of the first load
but rather the memory state of the last load.
[1]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232
Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
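The mis-scheduling described above can be reproduced in plain Python. This is an illustrative model, not the superword code: the "vector load" is simulated by snapshotting four array elements at once (the memory state of the first load, as in the broken schedule), while statement 2 runs scalar in between and the vector store lands last, exactly as in the listing:

```python
def scalar(i_arr, l_arr):
    """Reference semantics of the original Java loop."""
    i_arr, l_arr, i2 = i_arr[:], l_arr[:], 0
    for i1 in range(6, 227):
        i_arr[i1] += l_arr[i1]; l_arr[i1] += 1   # statement 1
        i_arr[i1 + 1] -= i2; i2 += 1             # statement 2
    return i_arr

def mis_scheduled(i_arr, l_arr):
    """Groups of 4 iterations; the 'vector' add of statement 1 uses
    stale values loaded BEFORE statement 2's scalar stores ran."""
    i_arr, l_arr, i2 = i_arr[:], l_arr[:], 0
    i1 = 6
    while i1 + 4 <= 227:
        stale = i_arr[i1:i1 + 4]                 # memory state of the FIRST load
        for k in range(4):                       # statement 2 stores happen first...
            i_arr[i1 + k + 1] -= i2; i2 += 1
        for k in range(4):                       # ...then the vector add/store uses
            i_arr[i1 + k] = stale[k] + l_arr[i1 + k]  # stale inputs and clobbers them
            l_arr[i1 + k] += 1
        i1 += 4
    for j in range(i1, 227):                     # drain remaining iterations scalar
        i_arr[j] += l_arr[j]; l_arr[j] += 1
        i_arr[j + 1] -= i2; i2 += 1
    return i_arr
```

Running both on the same input shows the wrong schedule diverges from the scalar loop, which is the incorrect-result symptom the patch fixes by taking the memory state of the last load.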
Bhavana-Kilambi added a commit to Bhavana-Kilambi/jdk that referenced this pull request on Sep 5, 2022
…nodes

Recently we found that the rotate left/right benchmarks with the Vector API emit a redundant "and" instruction on both aarch64 and x86_64 machines which can be done away with. For example, and(and(a, b), b) generates two "and" instructions but can be reduced to a single "and" operation, and(a, b), since "and" (and "or") operations are commutative and idempotent in nature. This can help improve performance for workloads which apply multiple "and"/"or" operations with the same value by reducing them to fewer "and"/"or" operations.

This patch adds the following transformations for the vector logical operations AndV and OrV:

(OpV (OpV a b) b)       => (OpV a b)
(OpV (OpV a b) a)       => (OpV a b)
(OpV (OpV a b m1) b m1) => (OpV a b m1)
(OpV (OpV a b m1) a m1) => (OpV a b m1)
(OpV a (OpV a b))       => (OpV a b)
(OpV b (OpV a b))       => (OpV a b)
(OpV a (OpV a b m) m)   => (OpV a b m)

where Op = "And", "Or".

Links for the benchmarks tested are given below:
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764

Before this patch, the disassembly for one of these testcases (IntMaxVector.ROR) for Neon is shown below:

ldr  q16, [x12, #16]
and  v16.16b, v16.16b, v20.16b
and  v16.16b, v16.16b, v20.16b
add  x12, x16, x11
sub  v17.4s, v21.4s, v16.4s
ldr  q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add  x11, x18, x11
neg  v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr  v16.16b, v17.16b, v19.16b
str  q16, [x11, #16]

After this patch, the disassembly for the same testcase is shown below:

ldr  q16, [x12, #16]
and  v16.16b, v16.16b, v20.16b
add  x12, x16, x11
sub  v17.4s, v21.4s, v16.4s
ldr  q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add  x11, x18, x11
neg  v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr  v16.16b, v17.16b, v19.16b
str  q16, [x11, #16]

The other tests also emit an extra "and" instruction as shown above for the vector ROR/ROL operations.

Below are the performance results for the vectorapi rotate tests (tests given in the links above) with this patch on aarch64 and x86_64 machines (for int and long types):

Benchmark           aarch64   x86_64
IntMaxVector.ROL     25.57%   26.09%
IntMaxVector.ROR     23.75%   24.15%
LongMaxVector.ROL    28.91%   28.51%
LongMaxVector.ROR    16.51%   29.11%

The percentage indicates the gain in performance (ops/ms) with this patch over the master build without it. The machine descriptions are given below:
aarch64 - 128-bit aarch64 machine
x86_64  - 256-bit x86 machine
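The rewrite relies on "and"/"or" being commutative and idempotent. A Python sketch of the identity on term trees (a hypothetical `simplify` helper modeling the listed transformations, not the C2 Ideal-graph code):

```python
def simplify(op, x, y):
    """Apply (OpV (OpV a b) b) => (OpV a b) style rewrites for a
    commutative, idempotent op. Terms are leaf strings or nested
    tuples of the form (op, lhs, rhs)."""
    if isinstance(y, tuple) and y[0] == op and x in (y[1], y[2]):
        return y            # (OpV a (OpV a b)) / (OpV b (OpV a b)) => (OpV a b)
    if isinstance(x, tuple) and x[0] == op and y in (x[1], x[2]):
        return x            # (OpV (OpV a b) b) / (OpV (OpV a b) a) => (OpV a b)
    return (op, x, y)       # nothing to fold
```

Because x & y already equals (x & y) & y bitwise, dropping the outer node never changes the result, which is why the redundant "and" instruction can simply disappear from the generated code.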
openjdk-notifier Bot pushed a commit that referenced this pull request on Nov 9, 2022
Fix failing tests
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Nov 29, 2022
…erOfTrailingZeros/numberOfLeadingZeros()`

Background: The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while the Vector API[2] for them returns long type. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two types of vector nodes taking long type: one produces a long vector type for the Vector API, and the other produces an int vector type by casting the long-type result from the first one. We can move the casting work for auto-vectorization of the Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes the several issues below.

1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

In the patch, during the stage of generating the vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword makes two consecutive vector nodes: the first one, the same as the Vector API, does the real execution and produces a long-type result, and the second one casts the result to int vector type. For those platforms which already vectorized these Java APIs correctly before, the patch has no real impact on the final generated assembly code and, consequently, no performance regression.

2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platforms

These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, so the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the alignment issue, the patch corrects their related alignment, just as it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable.

3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` on aarch64 platforms with more than 128 bits

Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could be vectorized on sve platforms when `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a special vector implementation for the Java API, so the generated code was incorrect, like:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Incorrectly use integer rbit/clz insn for long type vector
  *rbit z16.s, p7/m, z16.s
  *clz  z16.s, p7/m, z16.s
  add   x13, x16, x13, uxtx #2
  str   q16, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```
It caused a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase passes and the code is corrected:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Compute with long vector type and convert to int vector type
  *rbit z16.d, p7/m, z16.d
  *clz  z16.d, p7/m, z16.d
  *mov  z24.d, #0
  *uzp1 z25.s, z16.s, z24.s
  add   x13, x16, x13, uxtx #2
  str   q25, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```

4. Fix an assertion failure on x86 avx2 platforms

Before, on x86 avx2 platforms, there was an assertion failure when C2 tried to vectorize loops like:
```
// long[] ia;
// int[] ic;
for (int i = 0; i < LENGTH; ++i) {
    ic[i] = Long.numberOfLeadingZeros(ia[i]);
}
```
The x86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platforms, but it uses `evpmovqd()` to do the casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure is fixed naturally.

Tier 1~3 passed with no new failures on Linux AArch64/X86 platforms.

[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
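For reference, the scalar semantics that the vectorized code must match can be sketched in Python (pure-Python reimplementations of the documented `Long` behavior, not JDK code). The long-in/int-out shape of these functions is exactly the mismatch the mid-end cast resolves:

```python
MASK64 = (1 << 64) - 1

def bit_count(x):
    """Long.bitCount: number of set bits in the 64-bit two's-complement value."""
    return bin(x & MASK64).count("1")

def ntz(x):
    """Long.numberOfTrailingZeros: 64 for zero, else index of lowest set bit."""
    x &= MASK64
    if x == 0:
        return 64
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def nlz(x):
    """Long.numberOfLeadingZeros: 64 for zero, else zeros above the top set bit."""
    x &= MASK64
    return 64 - x.bit_length()
```

Each function consumes a 64-bit value but its result always fits in an int (0..64), which is why the vector implementation can compute in long lanes and then narrow.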
stefank pushed a commit to stefank/jdk that referenced this pull request on Mar 24, 2023
…dk#16)

* Only use conditional far branch in copy_memory for zgc
* Remove unused code
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request on Mar 29, 2023
Co-authored-by: Xin Liu <xxinliu@amazon.com>
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request on Apr 4, 2023
robehn pushed a commit to robehn/jdk that referenced this pull request on Aug 15, 2023
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request on Aug 18, 2023
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Nov 21, 2023
…ng into ldp/stp on AArch64

The macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp[1]. For example, it can merge:
```
str w20, [sp, #16]
str w10, [sp, #20]
```
into
```
stp w20, w10, [sp, #16]
```
But C2 may generate a sequence like:
```
str x21, [sp, #8]
str w20, [sp, #16]
str x19, [sp, #24]   <---
str w10, [sp, #20]   <--- Before sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
We can't do any merging for non-adjacent loads or stores. The patch sorts the spilling or unspilling sequence in order of offset during the instruction scheduling and bundling phase. After that, we get a new sequence:
```
str x21, [sp, #8]
str w20, [sp, #16]
str w10, [sp, #20]   <---
str x19, [sp, #24]   <--- After sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
Then the macro-assembler can do ld/st merging:
```
str x21, [sp, #8]
stp w20, w10, [sp, #16]   <--- Merged
str x19, [sp, #24]
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`. Before the patch, the macro-assembler did ld/st merging 3688 times. After the patch, the number of ld/st merges increases to 3871, by ~5%.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
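The sort-then-merge idea can be sketched in Python. This is a hypothetical model of the policy described in the commit message, not the HotSpot code; `merge_ldst` and the `(reg, offset, width_bytes)` tuples are invented for illustration:

```python
def merge_ldst(stores):
    """Sort spill stores by stack offset, then merge adjacent
    same-width pairs into a single stp, the way the aarch64
    macro-assembler merges contiguous str instructions."""
    stores = sorted(stores, key=lambda s: s[1])   # sort by offset
    out, i = [], 0
    while i < len(stores):
        if (i + 1 < len(stores)
                and stores[i][2] == stores[i + 1][2]              # same width
                and stores[i + 1][1] == stores[i][1] + stores[i][2]):  # adjacent
            reg, off, _ = stores[i]
            out.append(f"stp {reg}, {stores[i + 1][0]}, [sp, #{off}]")
            i += 2
        else:
            reg, off, _ = stores[i]
            out.append(f"str {reg}, [sp, #{off}]")
            i += 1
    return out
```

Feeding it the commit's example spill sequence produces the merged `stp w20, w10, [sp, #16]` pair while leaving the non-adjacent stores alone.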
openjdk-notifier Bot pushed a commit that referenced this pull request on Apr 11, 2024
Add framework for other platforms. Moved fill_to_memory_atomic back to the .cpp from the .hpp in order to fix the 32-bit build.
lahodaj added a commit to lahodaj/jdk that referenced this pull request on Mar 17, 2025
pf0n pushed a commit to pf0n/jdk that referenced this pull request on Jul 9, 2025
* Initial cut for repeatable builds
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
fg1417 added a commit to fg1417/jdk that referenced this pull request on Mar 13, 2026
…marks after JDK-8340093

JDK-8340093 enabled auto-vectorization for more reduction loop cases using 128-bit vector operations. As a result, the following microbenchmarks are negatively affected:
VectorReduction2.longAddDotProduct
VectorReduction2.longMulDotProduct
VectorReduction2.longMulSimple
This patch fixes these regressions.

1. Improve code generation for MLA

For longAddDotProduct[1], the current implementation generates vectorized code similar to:
```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
mla z16.d, p7/m, z17.d, z18.d
ldr q17, [x11, #32]
ldr q18, [x12, #32]
mla z16.d, p7/m, z18.d, z17.d
...
ldr q17, [x11, #128]
ldr q18, [x12, #128]
mla z16.d, p7/m, z18.d, z17.d
```
`z16` is the third source and destination register. There are true dependencies between consecutive mla[2] instructions. As a result, this vectorized code performs significantly worse than the scalar version due to limited instruction-level parallelism.

These mla instructions are produced by a backend match rule that fuses AddVL and MulVL into a vector MLA[3]. In this situation, avoiding instruction fusion and instead generating separate SVE mul and add instructions can improve instruction-level parallelism and overall performance. To address this, this patch introduces is_multiply_accumulate_candidate() to determine whether a node is a suitable vector MLA candidate. For node patterns that may increase execution latency, instruction fusion into MLA is disabled.

After applying this patch, the generated assembly looks like:
```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
ldr q19, [x11, #32]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x12, #32]
ldr q20, [x11, #48]
mul z18.d, p7/m, z18.d, z19.d
ldr q19, [x12, #48]
add v16.2d, v17.2d, v16.2d
ldr q17, [x11, #64]
add v16.2d, v18.2d, v16.2d
ldr q18, [x12, #64]
mul z19.d, p7/m, z19.d, z20.d
ldr q20, [x12, #80]
add v16.2d, v19.2d, v16.2d
```
This sequence exposes more independent operations and reduces dependency chains, leading to improved performance. Since SVE mls instructions may suffer from similar issues, the same logic has been extended to cover MLS as well. Additional microbenchmarks have been added accordingly.

2. Avoid vectorizing MUL-heavy loops

For longMulSimple[4], the generated vectorized code exhibits long dependency chains of SVE mul instructions, which results in worse performance than scalar execution:
```
ldr q17, [x1, #16]
ldr q18, [x1, #32]
mul z17.d, p7/m, z17.d, z16.d
ldr q16, [x1, #48]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x1, #64]
mul z16.d, p7/m, z16.d, z17.d
...
ldr q16, [x1, #256]
mul z17.d, p7/m, z17.d, z19.d
mul z16.d, p7/m, z16.d, z17.d
```
To address this, the patch introduces a platform-specific interface: `VTransformElementWiseVectorNode::node_weight()`. For 128-bit operations, this interface detects consecutive vector long multiply operations and increases the node weight to 4, which is the minimum value required for the cost model to avoid vectorization on both 128-bit and 256-bit platforms.

3. Results

Performance measurements on 128-bit and 256-bit SVE machines show that these changes avoid harmful vectorization and improve overall performance for the affected benchmarks.

patch: results obtained after applying this patch, using default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)
main-default: results on mainline using the same default auto-vectorization settings
main-scalar: results on mainline with -XX:+UseSuperWord and -XX:AutoVectorizationOverrideProfitability=0 (force scalar code)

The table below reports relative performance changes:
p/m1 = (patch - main-default) / main-default
p/m0 = (patch - main-scalar) / main-scalar

Mode: avgt
Unit: ns/op

Arm Neoverse V2 machine (128-bit SVE):
```
Benchmark                                          (COUNT)     p/m1     p/m0
TypeVectorOperationsSuperWord.mlaL                     512    0.16%  -50.42%
TypeVectorOperationsSuperWord.mlaL                    2048    0.26%  -56.70%
TypeVectorOperationsSuperWord.mlsL                     512   -0.10%  -50.37%
TypeVectorOperationsSuperWord.mlsL                    2048    0.14%  -56.82%
TypeVectorOperationsSuperWord.mulBigL                  512    0.06%  -25.77%
TypeVectorOperationsSuperWord.mulBigL                 2048   -0.02%  -19.63%
TypeVectorOperationsSuperWord.mulI                     512    0.63%  -63.44%
TypeVectorOperationsSuperWord.mulI                    2048    0.28%  -63.07%
TypeVectorOperationsSuperWord.mulL                     512   -0.03%  -50.47%
TypeVectorOperationsSuperWord.mulL                    2048    0.29%  -50.82%
TypeVectorOperationsSuperWord.mulMediumL               512   -0.19%  -27.54%
TypeVectorOperationsSuperWord.mulMediumL              2048    0.24%  -25.18%
TypeVectorOperationsSuperWord.mulMlaLDependent         512    0.30%  -28.70%
TypeVectorOperationsSuperWord.mulMlaLDependent        2048    0.12%  -26.74%
TypeVectorOperationsSuperWord.mulMlaLIndependent       512  -10.43%  -43.09%
TypeVectorOperationsSuperWord.mulMlaLIndependent      2048  -14.82%  -42.68%
VectorReduction2.WithSuperword.longAddBig             2048  -15.15%  -44.01%
VectorReduction2.WithSuperword.longAddBigMixSub1      2048   -6.19%  -43.92%
VectorReduction2.WithSuperword.longAddBigMixSub2      2048  -15.18%  -43.90%
VectorReduction2.WithSuperword.longAddBigMixSub3      2048   -5.74%  -43.87%
VectorReduction2.WithSuperword.longAddDotProduct      2048  -33.36%  -18.16%
VectorReduction2.WithSuperword.longAddSimple          2048   -0.02%   -6.72%
VectorReduction2.WithSuperword.longAndBig             2048  -16.32%  -44.06%
VectorReduction2.WithSuperword.longAndDotProduct      2048   -0.01%   -3.74%
VectorReduction2.WithSuperword.longAndSimple          2048    0.00%   -6.35%
VectorReduction2.WithSuperword.longMaxBig             2048  -15.29%  -52.09%
VectorReduction2.WithSuperword.longMaxDotProduct      2048   -0.03%  -52.08%
VectorReduction2.WithSuperword.longMaxSimple          2048   -0.40%  -52.74%
VectorReduction2.WithSuperword.longMinBig             2048  -14.88%  -51.70%
VectorReduction2.WithSuperword.longMinDotProduct      2048    0.01%  -52.21%
VectorReduction2.WithSuperword.longMinSimple          2048    0.26%  -52.88%
VectorReduction2.WithSuperword.longMulBig             2048   -2.21%   -0.07%
VectorReduction2.WithSuperword.longMulDotProduct      2048  -15.47%    0.00%
VectorReduction2.WithSuperword.longMulSimple          2048  -17.87%   -0.33%
VectorReduction2.WithSuperword.longOrBig              2048  -15.23%  -43.94%
VectorReduction2.WithSuperword.longOrDotProduct       2048   -0.01%   -3.83%
VectorReduction2.WithSuperword.longOrSimple           2048   -0.01%   -6.60%
VectorReduction2.WithSuperword.longXorBig             2048  -10.03%  -41.62%
VectorReduction2.WithSuperword.longXorDotProduct      2048    0.01%  -38.61%
VectorReduction2.WithSuperword.longXorSimple          2048    0.02%  -53.18%
```

Arm Neoverse V1 machine (256-bit SVE):

Note: In the current mainline code, the AArch64 backend supports only 128-bit multiply long operations. Auto-vectorization accounts for this backend constraint and splits 256-bit vectors into 128-bit chunks so that the loop can still be vectorized. This is why 256-bit platforms also benefit from this patch. No obvious performance changes are observed for other benchmarks.
```
Benchmark                           (COUNT)     p/m1     p/m0
VectorReduction2.longMulDotProduct     2048  -28.23%    0.00%
VectorReduction2.longMulSimple         2048  -19.29%    0.01%
```

Tier 1 - 3 passed on both aarch64 and x86 platforms.

[1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en
[3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617
[4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035
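The ILP argument rests on a simple algebraic fact: an associative-and-commutative reduction can be split into independent partial accumulators without changing the result. A Python sketch (illustrative only; function names are invented, not from the patch):

```python
def dot_single_acc(a, b):
    """One accumulator: models the mla chain, where each step
    depends on the previous step's destination register."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_split_acc(a, b, lanes=4):
    """Independent partial accumulators: models separate mul + add
    with a final reduction, exposing instruction-level parallelism
    because the four running sums can be updated concurrently."""
    accs = [0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % lanes] += x * y
    return sum(accs)
```

Both variants compute the same dot product; the hardware-visible difference is that the split form breaks the serial dependency chain, which is what the separate mul/add code generation achieves.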
fg1417 added a commit to fg1417/jdk that referenced this pull request on Mar 30, 2026
The microbenchmark ArraysFill.testLongFill[1] on 128-bit vector platforms generates vectorized store instructions with non-monotonic memory offsets, e.g.:

str q16, [x12, #80]
str q16, [x12, #48]
str q16, [x12, #128]
...

This arises because SuperWord only considers true dependencies when building edges (see [3]), and therefore does not enforce ordering among independent vector memory operations. These nodes are later scheduled using RPO, which can result in an apparently unordered sequence of memory accesses.

This patch replaces RPO-based scheduling with a priority-based topological sort to improve ordering and locality. The scheduling policy is:
1. Prefer nodes whose weak predecessors have already been scheduled.
2. Prioritize node types in the following order: scalar operations (loads/stores, address expressions), vector arithmetic, vector loads, vector stores, then others.
3. For independent loads/stores sharing the same base address, prefer ascending offsets.
4. Use VTransformNodeIDX to ensure stable ordering.

With this change, the generated code becomes monotonic in memory offsets:

str q16, [x12, #16]
str q16, [x12, #32]
str q16, [x12, #48]
...

On an Arm Neoverse V2 machine (128-bit SVE), this improves the following benchmarks:

TypeVectorOperationsSuperWord.java[2]:
Benchmark            (COUNT)  Mode  Units  Difference
absD                     512  avgt  ns/op     -27.05%
absD                    2048  avgt  ns/op     -27.05%
absL                     512  avgt  ns/op     -24.46%
absL                    2048  avgt  ns/op     -27.26%
convertD2LBitsRaw        512  avgt  ns/op     -20.39%
convertD2LBitsRaw       2048  avgt  ns/op     -23.92%
convertF2L               512  avgt  ns/op     -16.82%
convertF2L              2048  avgt  ns/op     -22.60%
convertI2D               512  avgt  ns/op     -12.50%
convertI2D              2048  avgt  ns/op     -17.92%
convertLBits2D           512  avgt  ns/op     -27.13%
convertLBits2D          2048  avgt  ns/op     -31.69%
negD                     512  avgt  ns/op     -26.85%
negD                    2048  avgt  ns/op     -27.09%

ArraysFill.java[1]:
Benchmark         (size)  Mode   Units   Difference
testDoubleFill       250  thrpt  ops/ms      26.46%
testDoubleFill       266  thrpt  ops/ms      32.69%
testDoubleFill       511  thrpt  ops/ms      33.83%
testDoubleFill      2047  thrpt  ops/ms      45.35%
testDoubleFill      2048  thrpt  ops/ms      45.38%
testDoubleFill      8195  thrpt  ops/ms      49.32%
testLongFill         250  thrpt  ops/ms      28.12%
testLongFill         266  thrpt  ops/ms      40.30%
testLongFill         511  thrpt  ops/ms      34.79%
testLongFill        2047  thrpt  ops/ms      45.71%
testLongFill        2048  thrpt  ops/ms      53.07%
testLongFill        8195  thrpt  ops/ms      49.52%

No significant performance changes are observed on wider vector platforms (e.g., 256-bit or 512-bit), where fewer vector operations are generated in SuperWord and scheduling has less impact.

[1] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/java/util/ArraysFill.java#L92
[2] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java
[3] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/src/hotspot/share/opto/superwordVTransformBuilder.cpp#L99
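The priority-based topological sort can be sketched in Python. This is a hypothetical model of the policy described above, not the C2 implementation; the `priority` callback stands in for the (kind rank, base, offset, VTransformNodeIDX) key:

```python
import heapq

def schedule(nodes, deps, priority):
    """Topological sort that, among the currently ready nodes, always
    picks the one with the smallest priority key. `deps` maps a node
    to the set of predecessors that must be scheduled first."""
    indeg = {n: len(deps.get(n, ())) for n in nodes}
    succs = {n: [] for n in nodes}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    ready = [(priority(n), n) for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for s in succs[n]:          # releasing a node may make successors ready
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (priority(s), s))
    return order
```

With no dependencies and the store offset as the priority key, four independent stores come out in ascending offset order, which is exactly the monotonic sequence shown above; true dependencies still override the priority.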
ruben-arm added a commit to ruben-arm/jdk that referenced this pull request on Mar 30, 2026
Some vector operations do not have inputs and essentially initialize vectors with a constant value. These operations can be marked for spilling and subsequently rematerialized at every use. The result of the transformation might look as follows:

movi v16.2d, #0x0
str  q16, [x16, #64]
movi v16.2d, #0x0
str  q16, [x16, #32]
movi v16.2d, #0x0
str  q16, [x16, #16]
movi v16.2d, #0x0
str  q16, [x16]
movi v16.2d, #0x0
str  q16, [x16, #48]
movi v16.2d, #0x0
str  q16, [x16, #112]
movi v16.2d, #0x0
str  q16, [x16, #80]
movi v16.2d, #0x0
str  q16, [x16, #96]

Introduce deduplication of these rematerialized vector constant initializations, reducing the above sequence to:

movi v16.2d, #0x0
str  q16, [x16, #64]
str  q16, [x16, #32]
str  q16, [x16, #16]
str  q16, [x16]
str  q16, [x16, #48]
str  q16, [x16, #112]
str  q16, [x16, #80]
str  q16, [x16, #96]
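The deduplication can be modeled as a peephole over the instruction stream that tracks which constant a register currently holds. A Python sketch (illustrative model with invented instruction tuples, not the actual register-allocator change; it assumes, as in the sequence above, that the intervening uses do not clobber the register):

```python
def dedup_remat(insns):
    """Drop a rematerialized constant init when the register already
    holds that constant and nothing has clobbered it since.
    Instructions are modeled as ('init', reg, value) for the movi
    and ('use', reg, text) for the dependent store."""
    known = {}   # reg -> constant value currently live in the register
    out = []
    for ins in insns:
        if ins[0] == "init":
            _, reg, val = ins
            if known.get(reg) == val:
                continue             # redundant re-init: skip it
            known[reg] = val
            out.append(ins)
        else:
            out.append(ins)          # stores read the register, don't clobber it
    return out
```

Applied to the movi/str pattern above, only the first movi survives while every str is kept, matching the reduced sequence.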
snake66 added a commit to snake66/jdk that referenced this pull request on Apr 20, 2026
Revert "Sync constructors for ThreadWXEnable with MacOS impl"
https://bugs.openjdk.java.net/browse/JDK-8252543