[JVMCI] Libgraal can deadlock in blocking compilation mode #16
Closed
dougxc wants to merge 1 commit into openjdk:master from
Conversation
… JVMCI compilation
Welcome to the OpenJDK organization on GitHub! This repository is currently a read-only git mirror of the official Mercurial repository (located at https://hg.openjdk.java.net/). As such, we are not currently accepting pull requests here. If you would like to contribute to the OpenJDK project, please see https://openjdk.java.net/contribute/ on how to proceed. This pull request will be automatically closed.
fisk pushed a commit to fisk/jdk that referenced this pull request on Oct 28, 2020
8246039: SSLSocket HandshakeCompletedListeners are run on virtual threads
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 10, 2021
Like a scalar shift, a vector shift does nothing when the shift count is zero.
This patch implements the 'Identity' method for all kinds of vector shift
nodes to optimize out shifts by a zero 'ShiftCntV', which typically shows up
as a redundant 'mov' in the final generated code, like below:
```
add x17, x12, x14
ldr q16, [x17, #16]
mov v16.16b, v16.16b
add x14, x13, x14
str q16, [x14, #16]
```
With this patch, the code above could be optimized as below:
```
add x17, x12, x14
ldr q16, [x17, #16]
add x14, x13, x14
str q16, [x14, #16]
```
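As a scalar sanity check of the identity being implemented, here is a minimal Java sketch (the helper name is illustrative, not the C2 code): shifting by a count of zero must return the input unchanged, which is exactly why the redundant 'mov' above can be dropped.

```java
public class ShiftIdentity {
    // Element-wise shift of every lane; a count of zero leaves each lane
    // unchanged, so the whole operation is the identity on src.
    static int[] shiftLeft(int[] src, int count) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] << count;
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = {1, -2, 3, Integer.MIN_VALUE};
        int[] r = shiftLeft(a, 0);
        for (int i = 0; i < a.length; i++) {
            assert r[i] == a[i] : "shift by zero must be the identity";
        }
        System.out.println("ok");
    }
}
```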
[TESTS]
compiler/vectorapi/TestVectorShiftImm.java, jdk/incubator/vector and
hotspot::tier1 passed without new failures.
Change-Id: I7657c0daaa5f758966936b9ede670c8b9ad94c48
cushon pushed a commit to cushon/jdk that referenced this pull request on Apr 2, 2021
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Apr 7, 2021
The vector shift count was defined by two separate nodes (LShiftCntV and
RShiftCntV), which prevented them from being shared even when the shift
counts are the same.
```
public static void test_shiftv(int sh) {
for (int i = 0; i < N; i+=1) {
a0[i] = a1[i] << sh;
b0[i] = b1[i] >> sh;
}
}
```
Given the example above, by merging the same shift counts into one
node, it can be shared by the shift nodes (RShiftV or LShiftV) as
below:
```
Before:
1184 LShiftCntV === _ 1189 [[ 1185 ... ]]
1190 RShiftCntV === _ 1189 [[ 1191 ... ]]
1185 LShiftVI === _ 1181 1184 [[ 1186 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
After:
1190 ShiftCntV === _ 1189 [[ 1191 1204 ... ]]
1204 LShiftVI === _ 1211 1190 [[ 1203 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
```
The final code removes one redundant "dup" (scalar->vector),
saving one register.
```
Before:
dup v16.16b, w12
dup v17.16b, w12
...
ldr q18, [x13, #16]
sshl v18.4s, v18.4s, v16.4s
add x18, x16, x12 ; iaload
add x4, x15, x12
str q18, [x4, #16] ; iastore
ldr q18, [x18, #16]
add x12, x14, x12
neg v19.16b, v17.16b
sshl v18.4s, v18.4s, v19.4s
str q18, [x12, #16] ; iastore
After:
dup v16.16b, w11
...
ldr q17, [x13, #16]
sshl v17.4s, v17.4s, v16.4s
add x2, x22, x11 ; iaload
add x4, x16, x11
str q17, [x4, #16] ; iastore
ldr q17, [x2, #16]
add x11, x21, x11
neg v18.16b, v16.16b
sshl v17.4s, v17.4s, v18.4s
str q17, [x11, #16] ; iastore
```
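The sharing mechanism can be sketched as GVN-style hash-consing (a minimal illustrative model, not C2's actual GVN): once left and right shift counts use one opcode, identical (opcode, input) pairs collapse to a single node.

```java
import java.util.HashMap;
import java.util.Map;

public class ShiftCntSharing {
    static final Map<String, Integer> cache = new HashMap<>();
    static int nextId = 0;

    // GVN-style hash-consing: identical (opcode, input) keys share one node id.
    static int makeNode(String opcode, int input) {
        return cache.computeIfAbsent(opcode + ":" + input, k -> nextId++);
    }

    public static void main(String[] args) {
        int l = makeNode("LShiftCntV", 42);  // distinct opcodes never unify...
        int r = makeNode("RShiftCntV", 42);
        int s1 = makeNode("ShiftCntV", 42);  // ...one merged opcode does
        int s2 = makeNode("ShiftCntV", 42);
        System.out.println((l != r) + " " + (s1 == s2)); // prints "true true"
    }
}
```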
Change-Id: I047f3f32df9535d706a9920857d212610e8ce315
openjdk-notifier Bot pushed a commit that referenced this pull request on Oct 5, 2021
r18 should not be used as it is reserved as a platform register. Linux is fine with userspace using it, but Windows and recently also macOS ( openjdk/jdk11u-dev#301 (comment) ) are actually using it on the kernel side. The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404
I haven't seen this particular instance causing any issues in practice _yet_, presumably because it looks hard to align the stars in order to trigger a problem (between the stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that causes trouble as explained in the link above.
Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000138809b00: ldr x2, [x12, #16]
0x0000000138809b04: ldrh w2, [x2, #44]
0x0000000138809b08: add x24, x20, x2, uxtx #3
0x0000000138809b0c: sub x24, x24, #0x8
[...]
0x0000000138809fa4: stp x16, x17, [sp, #128]
0x0000000138809fa8: stp x18, x19, [sp, #144]
0x0000000138809fac: stp x20, x21, [sp, #160]
[...]
0x0000000138809fc0: stp x30, xzr, [sp, #240]
0x0000000138809fc4: mov x0, x28
;; 0x10864ACCC
0x0000000138809fc8: mov x9, #0xaccc  // #44236
0x0000000138809fcc: movk x9, #0x864, lsl #16
0x0000000138809fd0: movk x9, #0x1, lsl #32
0x0000000138809fd4: blr x9
0x0000000138809fd8: ldp x2, x3, [sp, #16]
[...]
0x0000000138809ff4: ldp x16, x17, [sp, #128]
0x0000000138809ff8: ldp x18, x19, [sp, #144]
0x0000000138809ffc: ldp x20, x21, [sp, #160]
```
After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000108e4db00: ldr x2, [x12, #16]
0x0000000108e4db04: ldrh w2, [x2, #44]
0x0000000108e4db08: add x24, x20, x2, uxtx #3
0x0000000108e4db0c: sub x24, x24, #0x8
[...]
0x0000000108e4dfa4: stp x16, x17, [sp, #128]
0x0000000108e4dfa8: stp x19, x20, [sp, #144]
0x0000000108e4dfac: stp x21, x22, [sp, #160]
[...]
0x0000000108e4dfbc: stp x29, x30, [sp, #224]
0x0000000108e4dfc0: mov x0, x28
;; 0x107E4A06C
0x0000000108e4dfc4: mov x9, #0xa06c  // #41068
0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16
0x0000000108e4dfcc: movk x9, #0x1, lsl #32
0x0000000108e4dfd0: blr x9
0x0000000108e4dfd4: ldp x2, x3, [sp, #16]
[...]
0x0000000108e4dff0: ldp x16, x17, [sp, #128]
0x0000000108e4dff4: ldp x19, x20, [sp, #144]
0x0000000108e4dff8: ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request on Oct 6, 2021
Restore looks like this now:
```
0x0000000106e4dfcc: movk x9, #0x5e4, lsl #16
0x0000000106e4dfd0: movk x9, #0x1, lsl #32
0x0000000106e4dfd4: blr x9
0x0000000106e4dfd8: ldp x2, x3, [sp, #16]
0x0000000106e4dfdc: ldp x4, x5, [sp, #32]
0x0000000106e4dfe0: ldp x6, x7, [sp, #48]
0x0000000106e4dfe4: ldp x8, x9, [sp, #64]
0x0000000106e4dfe8: ldp x10, x11, [sp, #80]
0x0000000106e4dfec: ldp x12, x13, [sp, #96]
0x0000000106e4dff0: ldp x14, x15, [sp, #112]
0x0000000106e4dff4: ldp x16, x17, [sp, #128]
0x0000000106e4dff8: ldp x0, x1, [sp], #144
0x0000000106e4dffc: ldp xzr, x19, [sp], #16
0x0000000106e4e000: ldp x22, x23, [sp, #16]
0x0000000106e4e004: ldp x24, x25, [sp, #32]
0x0000000106e4e008: ldp x26, x27, [sp, #48]
0x0000000106e4e00c: ldp x28, x29, [sp, #64]
0x0000000106e4e010: ldp x30, xzr, [sp, #80]
0x0000000106e4e014: ldp x20, x21, [sp], #96
0x0000000106e4e018: ldur x12, [x29, #-24]
0x0000000106e4e01c: ldr x22, [x12, #16]
0x0000000106e4e020: add x22, x22, #0x30
0x0000000106e4e024: ldr x8, [x28, #8]
```
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Dec 8, 2021
The patch aims to help optimize Math.abs() mainly from these three parts:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms
1. Remove redundant instructions for abs with constant values
If we can decide the value of the input node for function Math.abs()
at compile-time, we can substitute the Abs node with the absolute
value of the constant and don't have to calculate it at runtime.
For example,
```
int[] a;
for (int i = 0; i < SIZE; i++) {
  a[i] = Math.abs(-38);
}
```
Before the patch, the generated code for the testcase above is:
```
...
mov w10, #0xffffffda
cmp w10, wzr
cneg w17, w10, lt
dup v16.8h, w17
...
```
After the patch, the generated code for the testcase above is:
```
...
movi v16.4s, #0x26
...
```
2. Remove redundant instructions for abs with char type
In Java semantics, as the char type is always non-negative, we
can simply remove the AbsI node in the C2 middle end.
As for the vectorization part, in the current SLP, vectorization of
Math.abs() with char type was intentionally disabled after
JDK-8261022 because it generated incorrect results before. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.
For example,
```
char[] a;
char[] b;
for (int i = 0; i < SIZE; i++) {
  b[i] = (char) Math.abs(a[i]);
}
```
Before the patch, the generated assembly code for the testcase
above is:
```
B15:
add x13, x21, w20, sxtw #1
ldrh w11, [x13, #16]
cmp w11, wzr
cneg w10, w11, lt
strh w10, [x13, #16]
ldrh w10, [x13, #18]
cmp w10, wzr
cneg w10, w10, lt
strh w10, [x13, #18]
...
add w20, w20, #0x1
cmp w20, w17
b.lt B15
```
After the patch, the generated assembly code is:
```
B15:
sbfiz x18, x19, #1, #32
add x0, x14, x18
ldr q16, [x0, #16]
add x18, x21, x18
str q16, [x18, #16]
ldr q16, [x0, #32]
str q16, [x18, #32]
...
add w19, w19, #0x40
cmp w19, w17
b.lt B15
```
3. Convert some common abs operations to ideal forms
The patch overrides some virtual support functions for AbsNode
so that GVN optimization can work on it. Here are the optimizable
forms:
a) abs(0 - x) => abs(x)
Before the patch:
```
...
ldr w13, [x13, #16]
neg w13, w13
cmp w13, wzr
cneg w14, w13, lt
...
```
After the patch:
```
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
```
b) abs(abs(x)) => abs(x)
Before the patch:
```
...
ldr w12, [x12, #16]
cmp w12, wzr
cneg w12, w12, lt
cmp w12, wzr
cneg w12, w12, lt
...
```
After the patch:
```
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
```
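All three rewrites can be sanity-checked against Java's scalar semantics (a plain sketch, not the C2 code): constant folding of abs, abs as the identity on char, and the two algebraic identities.

```java
public class AbsIdentities {
    public static void main(String[] args) {
        // 1) Constant folding: Math.abs(-38) is a compile-time constant 38.
        assert Math.abs(-38) == 38;
        // 2) char is always non-negative, so abs is the identity on it.
        for (char c = 0; c < 1024; c++) {
            assert Math.abs(c) == c;
        }
        // 3a) abs(0 - x) => abs(x), and 3b) abs(abs(x)) => abs(x).
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE};
        for (int x : samples) {
            assert Math.abs(0 - x) == Math.abs(x);
            assert Math.abs(Math.abs(x)) == Math.abs(x);
        }
        System.out.println("ok");
    }
}
```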
Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
shqking added a commit to shqking/jdk that referenced this pull request on Mar 7, 2022
*** Implementation
In AArch64 NEON, vector shift right is implemented by the vector shift
left instructions (SSHL[1] and USHL[2]) with a negative shift count. In
the C2 backend, we generate a `neg` of the given shift count followed by
an `sshl` or `ushl` instruction.
For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from one vector (case-2).
This patch aims to optimize case-1. Specifically, we move the negate
from the RShiftV* rules to the RShiftCntV rule. As a result, the negate
can be hoisted outside of the loop if it's a loop invariant.
In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.
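The SSHL semantics this relies on can be modeled in scalar Java (an illustrative model of the instruction's lane behavior, not the instruction itself): one "shift left by signed count" primitive covers both directions, so the negate for a right shift can be computed once and hoisted out of the loop.

```java
public class SshlModel {
    // Scalar model of NEON SSHL lane semantics: a non-negative count shifts
    // left, a negative count shifts right (arithmetic). Illustrative only.
    static int sshl(int v, int count) {
        return count >= 0 ? v << count : v >> -count;
    }

    public static void main(String[] args) {
        int count = 3;
        int negated = -count;          // hoistable: computed once per loop
        int[] lanes = {64, -64, 7, -1};
        for (int v : lanes) {
            assert sshl(v, negated) == (v >> count);
        }
        System.out.println("ok");
    }
}
```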
*** Example
Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.
```
public void rShiftInt() {
for (int i = 0; i < SIZE; i++) {
intsB[i] = intsA[i] >> count;
}
}
```
Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.
```
0x0000ffff89057a64: dup v16.16b, w13 <-- dup
0x0000ffff89057a68: mov w12, #0x7d00 // #32000
0x0000ffff89057a6c: sub w13, w2, w10
0x0000ffff89057a70: cmp w2, w10
0x0000ffff89057a74: csel w13, wzr, w13, lt
0x0000ffff89057a78: mov w8, #0x7d00 // #32000
0x0000ffff89057a7c: cmp w13, w8
0x0000ffff89057a80: csel w13, w12, w13, hi
0x0000ffff89057a84: add w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32 <-- loop entry
0x0000ffff89057a94: add x15, x17, x13
0x0000ffff89057a98: ldr q17, [x15,#16]
0x0000ffff89057a9c: add x13, x0, x13
0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
0x0000ffff89057aa8: str q17, [x13,#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add w10, w10, #0x20
0x0000ffff89057b20: cmp w10, w14
0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
```
Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.
```
0x0000ffff8d053a68: neg w14, w13 <---- neg
0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
0x0000ffff8d053a70: sub w14, w2, w10
0x0000ffff8d053a74: cmp w2, w10
0x0000ffff8d053a78: csel w14, wzr, w14, lt
0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
0x0000ffff8d053a80: cmp w14, w8
0x0000ffff8d053a84: csel w14, w12, w14, hi
0x0000ffff8d053a88: add w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32 <-- loop entry
0x0000ffff8d053a94: add x15, x17, x14
0x0000ffff8d053a98: ldr q17, [x15,#16]
0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
0x0000ffff8d053aa0: add x14, x0, x14
0x0000ffff8d053aa4: str q17, [x14,#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add w10, w10, #0x20
0x0000ffff8d053b00: cmp w10, w13
0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
```
*** Testing
Tier1~3 tests passed on Linux/AArch64 platform.
*** Performance Evaluation
- Auto-vectorization
One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.
The following table shows the result. Column `Score-1` shows the score
before applying the current patch, and column `Score-2` shows the score
after applying it.
We see about 30% ~ 53% improvement on the microbenchmarks.
```
Benchmark Units Score-1 Score-2
VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
*: Logical shift right for the Byte type (urShiftByte) is not vectorized, as
discussed in [4].
```
- VectorAPI
Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.
```
Benchmark Units Score-1 Score-2
Byte128Vector.LSHL ops/ms 10867.666 10873.993
Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
Byte128Vector.LSHR ops/ms 8629.305 8629.343
Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
Byte128Vector.ASHR ops/ms 8619.691 8629.438
Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
Int128Vector.LSHL ops/ms 3104.213 3103.702
Int128Vector.LSHLShift ops/ms 3114.354 3114.371
Int128Vector.LSHR ops/ms 2380.717 2380.693
Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
Int128Vector.ASHR ops/ms 2380.668 2380.647
Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
Long128Vector.LSHL ops/ms 1586.907 1587.591
Long128Vector.LSHLShift ops/ms 1589.469 1589.540
Long128Vector.LSHR ops/ms 1209.754 1209.687
Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
Long128Vector.ASHR ops/ms 1209.713 1209.669
Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
Short128Vector.LSHL ops/ms 5945.542 5943.770
Short128Vector.LSHLShift ops/ms 5984.743 5984.640
Short128Vector.LSHR ops/ms 4613.378 4613.577
Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
Short128Vector.ASHR ops/ms 4613.389 4613.478
Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
```
1) For logical shift left (LSHL and LSHLShift) and shift right with a
variable vector shift count (LSHR and ASHR), we didn't observe much
change, which is expected.
2) For shift right with a scalar shift count (LSHRShift and ASHRShift),
about 25% ~ 30% improvement can be observed, and this benefit is
introduced by the current patch.
[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Mar 14, 2022
After JDK-8275317, C2's SLP vectorizer has supported type conversion
between types of the same data size. We can also support conversions
between different data sizes, like:
int <-> double
float <-> long
int <-> long
float <-> double
A typical test case:
```
int[] a;
double[] b;
for (int i = start; i < limit; i++) {
  b[i] = (double) a[i];
}
```
Our expected OptoAssembly code for one iteration is like below:
```
add R12, R2, R11, LShiftL #2
vector_load V16, [R12, #16]
vectorcast_i2d V16, V16 # convert I to D vector
add R11, R1, R11, LShiftL #3 # ptr
add R13, R11, #16 # ptr
vector_store [R13], V16
```
To enable the vectorization, the patch solves the following problems
in the SLP.
There are three main operations in the case above, LoadI, ConvI2D and
StoreD. Assuming that the vector length is 128 bits, how many scalar
nodes should be packed together to a vector? If we decide it
separately for each operation node, like what we did before the patch
in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes
in a vector node sequence, like loading 4 elements to a vector, then
typecasting 2 elements and lastly storing these 2 elements, they become
invalid. As a result, we should look through the whole def-use chain
and then pick the minimum of these element counts, like function
SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
generate valid vector node sequence, like loading 2 elements,
converting the 2 elements to another type and storing the 2 elements
with new type.
After this, LoadI nodes don't make full use of the whole vector and
only occupy part of it. So we adapt the code in
SuperWord::get_vw_bytes_special() to the situation.
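The pack-size rule above can be sketched as follows (a hypothetical helper, not SuperWord::max_vector_size_in_ud_chain() itself): the widest element type anywhere in the def-use chain bounds how many scalars fit in one vector.

```java
public class PackSize {
    // For a 128-bit (16-byte) vector and a chain LoadI -> ConvI2D -> StoreD,
    // the widest element (double, 8 bytes) limits the pack to 2 scalars,
    // even though 4 ints would fit on their own. Illustrative sketch only.
    static int maxPackSize(int vectorBytes, int... elemBytesInChain) {
        int widest = 0;
        for (int b : elemBytesInChain) {
            widest = Math.max(widest, b);
        }
        return vectorBytes / widest;
    }

    public static void main(String[] args) {
        System.out.println(maxPackSize(16, 4, 8, 8)); // prints 2
    }
}
```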
In SLP, we calculate a kind of alignment as a position trace for each
scalar node in the whole vector. In this case, the alignments for the 2
LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are
0 and 8. Here, 4 for LoadI and 8 for ConvI2D mean the same thing: each
marks the second node in the whole vector, and the difference between 4
and 8 is just due to their different data sizes. In this situation, we
should try to remove the impact caused by different
data sizes in SLP. For example, in the stage of
SuperWord::extend_packlist(), while determining if it's potential to
pack a pair of def nodes in the function SuperWord::follow_use_defs(),
we remove the side effect of different data size by transforming the
target alignment from the use node. Because we believe that, assuming
that the vector length is 512 bits, if the ConvI2D use nodes have
alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
these two LoadI nodes should be packed as a pair as well.
Similarly, when determining if the vectorization is profitable, type
conversion between different data sizes takes a type of one size and
produces a type of another size, hence special checks on alignment
and size should be applied, like what we do in SuperWord::is_vector_use().
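The alignment normalization can be sketched by dividing each node's byte alignment by its own element size to get a lane index (a hypothetical helper, not the actual SuperWord code): equal lane indices mean the nodes occupy the same position in the vector, regardless of data size.

```java
public class AlignNormalize {
    // A node's lane index inside the vector is its byte alignment divided by
    // its element size; comparing lane indices removes the data-size skew.
    static int laneIndex(int alignmentBytes, int elemBytes) {
        return alignmentBytes / elemBytes;
    }

    public static void main(String[] args) {
        // ConvI2D uses at alignments 16 and 24 (8-byte doubles) correspond to
        // LoadI defs at alignments 8 and 12 (4-byte ints): same lanes 2 and 3.
        assert laneIndex(16, 8) == laneIndex(8, 4);
        assert laneIndex(24, 8) == laneIndex(12, 4);
        System.out.println("ok");
    }
}
```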
After solving these problems, we successfully implemented the
vectorization of type conversion between different data sizes.
Here is the test data on NEON:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
perf data on X86:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
perf data on AVX512:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 24, 2022
This patch fixes the wrong matching rule of replicate2L_zero. It
matched "ReplicateI" by mistake, so long immediates (not only zero)
had to be moved to a register first and finally matched to replicate2L. To
fix this trivial bug, this patch fixes the typo and extends the
replicate2L_zero rule to replicate2L_imm, which now supports all possible
long immediate values.
The final code changes are shown below:
```
replicate2L_imm:
mov x13, #0xff
movk x13, #0xff, lsl #16
movk x13, #0xff, lsl #32
dup v16.2d, x13
=>
movi v16.2d, #0xff00ff00ff
```
[Test]
test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi
passed without failures.
Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Mar 28, 2022
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
short[] res = new short[SIZE];
for (int i = 0; i < SIZE; i++) {
res[i] = (short) (shorts[i] >>> 3);
}
return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1], because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthwhile to vectorize more cases at quite a low cost. Also,
unsigned shift right on signed subword types is not uncommon, and we
can find similar cases in the Lucene benchmark[2].
Taking unsigned right shift on the short type as an example:
```
Short:
| <- 16 bits -> | <- 16 bits -> |
| 1 1 1 ... 1 1 |     data      |
```
When the shift amount is a constant not greater than the number
of sign-extended bits (the 16 higher bits for the short type, shown
above), the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:
```
For T_SHORT (shift <= 16):
  src RShiftCntV shift      src RShiftCntV shift
    \      /          ==>     \      /
    URShiftVS                 RShiftVS
```
This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz x13, x10, #1, #32
add x15, x11, x13
ldr q16, [x15, #16]
sshr v16.8h, v16.8h, #3
add x13, x17, x13
str q16, [x13, #16]
...
```
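The legality condition can be checked directly against Java's scalar semantics (a plain sanity sketch, not the C2 code): for a short, the unsigned and signed right shifts agree after narrowing back to short whenever the shift amount is at most 16.

```java
public class SubwordShift {
    // The 16 high bits of the promoted int are sign-extension bits, so
    // >>> and >> can only differ in bits the (short) cast drops, as long
    // as the shift amount is <= 16.
    static boolean rewriteLegal(short s, int sh) {
        return (short) (s >>> sh) == (short) (s >> sh);
    }

    public static void main(String[] args) {
        short[] samples = {0, 1, -1, 1234, -1234, Short.MIN_VALUE, Short.MAX_VALUE};
        for (short s : samples) {
            for (int sh = 0; sh <= 16; sh++) {
                assert rewriteLegal(s, sh);
            }
        }
        // Beyond 16 the rewrite would be wrong, e.g. for s == -1, sh == 17.
        assert !rewriteLegal((short) -1, 17);
        System.out.println("ok");
    }
}
```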
Here is the performance data for micro-benchmark before and after
this patch on both AArch64 and x64 machines. We can observe about
~80% improvement with this patch.
The perf data on AArch64:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
after the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
The perf data on X86:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
After the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/
Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64 by using BEXT[2], which is available in SVE2. This gives a more
efficient way to move mask bits from a predicate register to a general
purpose register, as x86 PMOVMSK[1] does.
With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:
Before:
```
mov z16.b, p0/z, #1
fmov x0, d16
orr x0, x0, x0, lsr #7
orr x0, x0, x0, lsr #14
orr x0, x0, x0, lsr #28
and x0, x0, #0xff
fmov x8, v16.d[1]
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #8
orr x8, xzr, #0x2
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #16
orr x8, xzr, #0x3
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #24
orr x8, xzr, #0x4
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #32
mov x8, #0x5
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #40
orr x8, xzr, #0x6
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #48
orr x8, xzr, #0x7
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #56
```
After:
```
mov z16.b, p0/z, #1
mov z17.b, #1
bext z16.d, z16.d, z17.d
mov z17.d, #0
uzp1 z16.s, z16.s, z17.s
uzp1 z16.h, z16.h, z17.h
uzp1 z16.b, z16.b, z17.b
mov x0, v16.d[0]
```
[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
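What both the PMOVMSKB-style and the BEXT-based sequences compute is a lane-mask-to-long packing: the lowest bit of each byte lane is gathered into one 64-bit result. A minimal Python sketch of that semantics (an illustrative model, not JDK code; `mask_to_long` is a hypothetical helper name):

```python
def mask_to_long(lane_bits, num_lanes=64):
    """Pack the lowest bit of each lane into one integer, with
    lane 0 landing in bit 0 of the result (a 512-bit byte vector
    has 64 lanes, so the result fits in a long)."""
    result = 0
    for i, bit in enumerate(lane_bits[:num_lanes]):
        result |= (bit & 1) << i
    return result

# A mask selecting the even-numbered lanes packs to alternating bits.
even = [1 if i % 2 == 0 else 0 for i in range(64)]
packed = mask_to_long(even)
```

The BEXT sequence above performs the same bit gather in hardware, per 64-bit element, before the uzp1 instructions compress the per-element results together.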
e1iu pushed a commit to e1iu/jdk that referenced this pull request on Apr 21, 2022
franferrax added a commit to franferrax/jdk that referenced this pull request on Aug 11, 2022
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Aug 17, 2022
After JDK-8283091, the loop below can be vectorized partially.
Statement 1 can be vectorized but statement 2 can't.
```
// int[] iArr; long[] lArrFld; int i1,i2;
for (i1 = 6; i1 < 227; i1++) {
iArr[i1] += lArrFld[i1]++; // statement 1
iArr[i1 + 1] -= (i2++); // statement 2
}
```
But we got incorrect results because the vector packs of iArr are
scheduled incorrectly like:
```
...
load_vector    XMM1,[R8 + #16 + R11 << #2]
movl    RDI, [R8 + #20 + R11 << #2]    # int
load_vector    XMM2,[R9 + #8 + R11 << #3]
subl    RDI, R11    # int
vpaddq  XMM3,XMM2,XMM0    ! add packedL
store_vector   [R9 + #8 + R11 << #3],XMM3
vector_cast_l2x  XMM2,XMM2    !
vpaddd  XMM1,XMM2,XMM1    ! add packedI
addl    RDI, #228   # int
movl    [R8 + #20 + R11 << #2], RDI    # int
movl    RBX, [R8 + #24 + R11 << #2]    # int
subl    RBX, R11    # int
addl    RBX, #227   # int
movl    [R8 + #24 + R11 << #2], RBX    # int
...
movl    RBX, [R8 + #40 + R11 << #2]    # int
subl    RBX, R11    # int
addl    RBX, #223   # int
movl    [R8 + #40 + R11 << #2], RBX    # int
movl    RDI, [R8 + #44 + R11 << #2]    # int
subl    RDI, R11    # int
addl    RDI, #222   # int
movl    [R8 + #44 + R11 << #2], RDI    # int
store_vector   [R8 + #16 + R11 << #2],XMM1
...
```
simplified as:
```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```
We cannot pick the memory state from the first load for the LoadI pack
here, as the LoadI vector operation must load the new values in memory
after iArr writes 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]' (statement 2).
We must take the memory state of the last load, where we have assigned
the new values ('iArr[i1 + 1] - (i2++)') to the iArr array.
In JDK-8240281, we picked the memory state of the first load. Different
from the scenario in JDK-8240281, the store, which depends on an
earlier load here, is in a pack to be scheduled, and the LoadI pack
depends on the last_mem. As designed[2], to schedule the StoreI pack,
all memory operations in another single pack must be moved in the same
direction. We know that the store in the pack depends on one of the loads
in the LoadI pack, so the LoadI pack should be scheduled before the StoreI
pack. The LoadI pack in turn depends on the last_mem, so the last_mem must
be scheduled before the LoadI pack and also before the store pack.
Therefore, we need to take the memory state of the last load for the
LoadI pack here.
To fix it, the patch adds additional checks while picking the memory state
of the first load. When the store is in a pack and the load pack relies
on the last_mem, we should not choose the memory state of the first load
but rather the memory state of the last load.
[1]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232
Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
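The mis-scheduling described above can be reproduced in plain Python. This is an illustrative model, not the superword code: the "vector load" is simulated by snapshotting four array elements at once (the memory state of the first load, as in the broken schedule), while statement 2 runs scalar in between and the vector store lands last, exactly as in the listing:

```python
def scalar(i_arr, l_arr):
    """Reference semantics of the original Java loop."""
    i_arr, l_arr, i2 = i_arr[:], l_arr[:], 0
    for i1 in range(6, 227):
        i_arr[i1] += l_arr[i1]; l_arr[i1] += 1   # statement 1
        i_arr[i1 + 1] -= i2; i2 += 1             # statement 2
    return i_arr

def mis_scheduled(i_arr, l_arr):
    """Groups of 4 iterations; the 'vector' add of statement 1 uses
    stale values loaded BEFORE statement 2's scalar stores ran."""
    i_arr, l_arr, i2 = i_arr[:], l_arr[:], 0
    i1 = 6
    while i1 + 4 <= 227:
        stale = i_arr[i1:i1 + 4]                 # memory state of the FIRST load
        for k in range(4):                       # statement 2 stores happen first...
            i_arr[i1 + k + 1] -= i2; i2 += 1
        for k in range(4):                       # ...then the vector add/store uses
            i_arr[i1 + k] = stale[k] + l_arr[i1 + k]  # stale inputs and clobbers them
            l_arr[i1 + k] += 1
        i1 += 4
    for j in range(i1, 227):                     # drain remaining iterations scalar
        i_arr[j] += l_arr[j]; l_arr[j] += 1
        i_arr[j + 1] -= i2; i2 += 1
    return i_arr
```

Running both on the same input shows the wrong schedule diverges from the scalar loop, which is the incorrect-result symptom the patch fixes by taking the memory state of the last load.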
Bhavana-Kilambi added a commit to Bhavana-Kilambi/jdk that referenced this pull request on Sep 5, 2022
…nodes

Recently we found that the rotate left/right benchmarks with the Vector API emit a redundant "and" instruction on both aarch64 and x86_64 machines which can be done away with. For example, and(and(a, b), b) generates two "and" instructions but can be reduced to a single "and" operation, and(a, b), since "and" (and "or") operations are commutative and idempotent in nature. This can help improve performance for workloads which apply multiple "and"/"or" operations with the same value by reducing them to fewer "and"/"or" operations.

This patch adds the following transformations for the vector logical operations AndV and OrV:

(OpV (OpV a b) b)       => (OpV a b)
(OpV (OpV a b) a)       => (OpV a b)
(OpV (OpV a b m1) b m1) => (OpV a b m1)
(OpV (OpV a b m1) a m1) => (OpV a b m1)
(OpV a (OpV a b))       => (OpV a b)
(OpV b (OpV a b))       => (OpV a b)
(OpV a (OpV a b m) m)   => (OpV a b m)

where Op = "And", "Or".

Links for the benchmarks tested are given below:
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764

Before this patch, the disassembly for one of these testcases (IntMaxVector.ROR) for Neon is shown below:

ldr  q16, [x12, #16]
and  v16.16b, v16.16b, v20.16b
and  v16.16b, v16.16b, v20.16b
add  x12, x16, x11
sub  v17.4s, v21.4s, v16.4s
ldr  q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add  x11, x18, x11
neg  v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr  v16.16b, v17.16b, v19.16b
str  q16, [x11, #16]

After this patch, the disassembly for the same testcase is shown below:

ldr  q16, [x12, #16]
and  v16.16b, v16.16b, v20.16b
add  x12, x16, x11
sub  v17.4s, v21.4s, v16.4s
ldr  q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add  x11, x18, x11
neg  v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr  v16.16b, v17.16b, v19.16b
str  q16, [x11, #16]

The other tests also emit an extra "and" instruction as shown above for the vector ROR/ROL operations.

Below are the performance results for the vectorapi rotate tests (tests given in the links above) with this patch on aarch64 and x86_64 machines (for int and long types):

Benchmark           aarch64   x86_64
IntMaxVector.ROL     25.57%   26.09%
IntMaxVector.ROR     23.75%   24.15%
LongMaxVector.ROL    28.91%   28.51%
LongMaxVector.ROR    16.51%   29.11%

The percentage indicates the gain in performance (ops/ms) with this patch over the master build without it. The machine descriptions are given below:
aarch64 - 128-bit aarch64 machine
x86_64  - 256-bit x86 machine
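The rewrite relies on "and"/"or" being commutative and idempotent. A Python sketch of the identity on term trees (a hypothetical `simplify` helper modeling the listed transformations, not the C2 Ideal-graph code):

```python
def simplify(op, x, y):
    """Apply (OpV (OpV a b) b) => (OpV a b) style rewrites for a
    commutative, idempotent op. Terms are leaf strings or nested
    tuples of the form (op, lhs, rhs)."""
    if isinstance(y, tuple) and y[0] == op and x in (y[1], y[2]):
        return y            # (OpV a (OpV a b)) / (OpV b (OpV a b)) => (OpV a b)
    if isinstance(x, tuple) and x[0] == op and y in (x[1], x[2]):
        return x            # (OpV (OpV a b) b) / (OpV (OpV a b) a) => (OpV a b)
    return (op, x, y)       # nothing to fold
```

Because x & y already equals (x & y) & y bitwise, dropping the outer node never changes the result, which is why the redundant "and" instruction can simply disappear from the generated code.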
openjdk-notifier Bot pushed a commit that referenced this pull request on Nov 9, 2022
Fix failing tests
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Nov 29, 2022
…erOfTrailingZeros/numberOfLeadingZeros()`

Background: The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while the Vector API[2] for them returns long type. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two types of vector nodes taking long type: one produces a long vector type for the Vector API, and the other produces an int vector type by casting the long-type result from the first one. We can move the casting work for auto-vectorization of the Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes the several issues below.

1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

In the patch, during the stage of generating the vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword makes two consecutive vector nodes: the first one, the same as the Vector API, does the real execution and produces a long-type result, and the second one casts the result to int vector type. For those platforms which already vectorized these Java APIs correctly before, the patch has no real impact on the final generated assembly code and, consequently, no performance regression.

2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platforms

These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, so the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the alignment issue, the patch corrects their related alignment, just as it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable.

3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` on aarch64 platforms with more than 128 bits

Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could be vectorized on sve platforms when `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a special vector implementation for the Java API, so the generated code was incorrect, like:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Incorrectly use integer rbit/clz insn for long type vector
  *rbit z16.s, p7/m, z16.s
  *clz  z16.s, p7/m, z16.s
  add   x13, x16, x13, uxtx #2
  str   q16, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```
It caused a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase passes and the code is corrected:
```
LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Compute with long vector type and convert to int vector type
  *rbit z16.d, p7/m, z16.d
  *clz  z16.d, p7/m, z16.d
  *mov  z24.d, #0
  *uzp1 z25.s, z16.s, z24.s
  add   x13, x16, x13, uxtx #2
  str   q25, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
```

4. Fix an assertion failure on x86 avx2 platforms

Before, on x86 avx2 platforms, there was an assertion failure when C2 tried to vectorize loops like:
```
// long[] ia;
// int[] ic;
for (int i = 0; i < LENGTH; ++i) {
    ic[i] = Long.numberOfLeadingZeros(ia[i]);
}
```
The x86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platforms, but it uses `evpmovqd()` to do the casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure is fixed naturally.

Tier 1~3 passed with no new failures on Linux AArch64/X86 platforms.

[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
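For reference, the scalar semantics that the vectorized code must match can be sketched in Python (pure-Python reimplementations of the documented `Long` behavior, not JDK code). The long-in/int-out shape of these functions is exactly the mismatch the mid-end cast resolves:

```python
MASK64 = (1 << 64) - 1

def bit_count(x):
    """Long.bitCount: number of set bits in the 64-bit two's-complement value."""
    return bin(x & MASK64).count("1")

def ntz(x):
    """Long.numberOfTrailingZeros: 64 for zero, else index of lowest set bit."""
    x &= MASK64
    if x == 0:
        return 64
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def nlz(x):
    """Long.numberOfLeadingZeros: 64 for zero, else zeros above the top set bit."""
    x &= MASK64
    return 64 - x.bit_length()
```

Each function consumes a 64-bit value but its result always fits in an int (0..64), which is why the vector implementation can compute in long lanes and then narrow.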
stefank pushed a commit to stefank/jdk that referenced this pull request on Mar 24, 2023
…dk#16)

* Only use conditional far branch in copy_memory for zgc
* Remove unused code
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request on Mar 29, 2023
Co-authored-by: Xin Liu <xxinliu@amazon.com>
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request on Apr 4, 2023
robehn pushed a commit to robehn/jdk that referenced this pull request on Aug 15, 2023
gnu-andrew pushed a commit to gnu-andrew/jdk that referenced this pull request on Aug 18, 2023
fg1417 pushed a commit to fg1417/jdk that referenced this pull request on Nov 21, 2023
…ng into ldp/stp on AArch64

The macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp[1]. For example, it can merge:
```
str w20, [sp, #16]
str w10, [sp, #20]
```
into
```
stp w20, w10, [sp, #16]
```
But C2 may generate a sequence like:
```
str x21, [sp, #8]
str w20, [sp, #16]
str x19, [sp, #24]   <---
str w10, [sp, #20]   <--- Before sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
We can't do any merging for non-adjacent loads or stores. The patch sorts the spilling or unspilling sequence in order of offset during the instruction scheduling and bundling phase. After that, we get a new sequence:
```
str x21, [sp, #8]
str w20, [sp, #16]
str w10, [sp, #20]   <---
str x19, [sp, #24]   <--- After sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
Then the macro-assembler can do ld/st merging:
```
str x21, [sp, #8]
stp w20, w10, [sp, #16]   <--- Merged
str x19, [sp, #24]
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`. Before the patch, the macro-assembler did ld/st merging 3688 times. After the patch, the number of ld/st merges increases to 3871, by ~5%.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
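The sort-then-merge idea can be sketched in Python. This is a hypothetical model of the policy described in the commit message, not the HotSpot code; `merge_ldst` and the `(reg, offset, width_bytes)` tuples are invented for illustration:

```python
def merge_ldst(stores):
    """Sort spill stores by stack offset, then merge adjacent
    same-width pairs into a single stp, the way the aarch64
    macro-assembler merges contiguous str instructions."""
    stores = sorted(stores, key=lambda s: s[1])   # sort by offset
    out, i = [], 0
    while i < len(stores):
        if (i + 1 < len(stores)
                and stores[i][2] == stores[i + 1][2]              # same width
                and stores[i + 1][1] == stores[i][1] + stores[i][2]):  # adjacent
            reg, off, _ = stores[i]
            out.append(f"stp {reg}, {stores[i + 1][0]}, [sp, #{off}]")
            i += 2
        else:
            reg, off, _ = stores[i]
            out.append(f"str {reg}, [sp, #{off}]")
            i += 1
    return out
```

Feeding it the commit's example spill sequence produces the merged `stp w20, w10, [sp, #16]` pair while leaving the non-adjacent stores alone.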
openjdk-notifier Bot pushed a commit that referenced this pull request on Apr 11, 2024
Add framework for other platforms. Moved fill_to_memory_atomic back to the .cpp from the .hpp in order to fix the 32-bit build.
lahodaj added a commit to lahodaj/jdk that referenced this pull request on Mar 17, 2025
pf0n pushed a commit to pf0n/jdk that referenced this pull request on Jul 9, 2025
* Initial cut for repeatable builds
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
fg1417 added a commit to fg1417/jdk that referenced this pull request on Mar 13, 2026
…marks after JDK-8340093

JDK-8340093 enabled auto-vectorization for more reduction loop cases using 128-bit vector operations. As a result, the following microbenchmarks are negatively affected:
VectorReduction2.longAddDotProduct
VectorReduction2.longMulDotProduct
VectorReduction2.longMulSimple
This patch fixes these regressions.

1. Improve code generation for MLA

For longAddDotProduct[1], the current implementation generates vectorized code similar to:
```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
mla z16.d, p7/m, z17.d, z18.d
ldr q17, [x11, #32]
ldr q18, [x12, #32]
mla z16.d, p7/m, z18.d, z17.d
...
ldr q17, [x11, #128]
ldr q18, [x12, #128]
mla z16.d, p7/m, z18.d, z17.d
```
`z16` is the third source and destination register. There are true dependencies between consecutive mla[2] instructions. As a result, this vectorized code performs significantly worse than the scalar version due to limited instruction-level parallelism.

These mla instructions are produced by a backend match rule that fuses AddVL and MulVL into a vector MLA[3]. In this situation, avoiding instruction fusion and instead generating separate SVE mul and add instructions can improve instruction-level parallelism and overall performance. To address this, this patch introduces is_multiply_accumulate_candidate() to determine whether a node is a suitable vector MLA candidate. For node patterns that may increase execution latency, instruction fusion into MLA is disabled.

After applying this patch, the generated assembly looks like:
```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
ldr q19, [x11, #32]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x12, #32]
ldr q20, [x11, #48]
mul z18.d, p7/m, z18.d, z19.d
ldr q19, [x12, #48]
add v16.2d, v17.2d, v16.2d
ldr q17, [x11, #64]
add v16.2d, v18.2d, v16.2d
ldr q18, [x12, #64]
mul z19.d, p7/m, z19.d, z20.d
ldr q20, [x12, #80]
add v16.2d, v19.2d, v16.2d
```
This sequence exposes more independent operations and reduces dependency chains, leading to improved performance. Since SVE mls instructions may suffer from similar issues, the same logic has been extended to cover MLS as well. Additional microbenchmarks have been added accordingly.

2. Avoid vectorizing MUL-heavy loops

For longMulSimple[4], the generated vectorized code exhibits long dependency chains of SVE mul instructions, which results in worse performance than scalar execution:
```
ldr q17, [x1, #16]
ldr q18, [x1, #32]
mul z17.d, p7/m, z17.d, z16.d
ldr q16, [x1, #48]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x1, #64]
mul z16.d, p7/m, z16.d, z17.d
...
ldr q16, [x1, #256]
mul z17.d, p7/m, z17.d, z19.d
mul z16.d, p7/m, z16.d, z17.d
```
To address this, the patch introduces a platform-specific interface: `VTransformElementWiseVectorNode::node_weight()`. For 128-bit operations, this interface detects consecutive vector long multiply operations and increases the node weight to 4, which is the minimum value required for the cost model to avoid vectorization on both 128-bit and 256-bit platforms.

3. Results

Performance measurements on 128-bit and 256-bit SVE machines show that these changes avoid harmful vectorization and improve overall performance for the affected benchmarks.

patch: results obtained after applying this patch, using default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)
main-default: results on mainline using the same default auto-vectorization settings
main-scalar: results on mainline with -XX:+UseSuperWord and -XX:AutoVectorizationOverrideProfitability=0 (force scalar code)

The table below reports relative performance changes:
p/m1 = (patch - main-default) / main-default
p/m0 = (patch - main-scalar) / main-scalar

Mode: avgt
Unit: ns/op

Arm Neoverse V2 machine (128-bit SVE):
```
Benchmark                                          (COUNT)     p/m1     p/m0
TypeVectorOperationsSuperWord.mlaL                     512    0.16%  -50.42%
TypeVectorOperationsSuperWord.mlaL                    2048    0.26%  -56.70%
TypeVectorOperationsSuperWord.mlsL                     512   -0.10%  -50.37%
TypeVectorOperationsSuperWord.mlsL                    2048    0.14%  -56.82%
TypeVectorOperationsSuperWord.mulBigL                  512    0.06%  -25.77%
TypeVectorOperationsSuperWord.mulBigL                 2048   -0.02%  -19.63%
TypeVectorOperationsSuperWord.mulI                     512    0.63%  -63.44%
TypeVectorOperationsSuperWord.mulI                    2048    0.28%  -63.07%
TypeVectorOperationsSuperWord.mulL                     512   -0.03%  -50.47%
TypeVectorOperationsSuperWord.mulL                    2048    0.29%  -50.82%
TypeVectorOperationsSuperWord.mulMediumL               512   -0.19%  -27.54%
TypeVectorOperationsSuperWord.mulMediumL              2048    0.24%  -25.18%
TypeVectorOperationsSuperWord.mulMlaLDependent         512    0.30%  -28.70%
TypeVectorOperationsSuperWord.mulMlaLDependent        2048    0.12%  -26.74%
TypeVectorOperationsSuperWord.mulMlaLIndependent       512  -10.43%  -43.09%
TypeVectorOperationsSuperWord.mulMlaLIndependent      2048  -14.82%  -42.68%
VectorReduction2.WithSuperword.longAddBig             2048  -15.15%  -44.01%
VectorReduction2.WithSuperword.longAddBigMixSub1      2048   -6.19%  -43.92%
VectorReduction2.WithSuperword.longAddBigMixSub2      2048  -15.18%  -43.90%
VectorReduction2.WithSuperword.longAddBigMixSub3      2048   -5.74%  -43.87%
VectorReduction2.WithSuperword.longAddDotProduct      2048  -33.36%  -18.16%
VectorReduction2.WithSuperword.longAddSimple          2048   -0.02%   -6.72%
VectorReduction2.WithSuperword.longAndBig             2048  -16.32%  -44.06%
VectorReduction2.WithSuperword.longAndDotProduct      2048   -0.01%   -3.74%
VectorReduction2.WithSuperword.longAndSimple          2048    0.00%   -6.35%
VectorReduction2.WithSuperword.longMaxBig             2048  -15.29%  -52.09%
VectorReduction2.WithSuperword.longMaxDotProduct      2048   -0.03%  -52.08%
VectorReduction2.WithSuperword.longMaxSimple          2048   -0.40%  -52.74%
VectorReduction2.WithSuperword.longMinBig             2048  -14.88%  -51.70%
VectorReduction2.WithSuperword.longMinDotProduct      2048    0.01%  -52.21%
VectorReduction2.WithSuperword.longMinSimple          2048    0.26%  -52.88%
VectorReduction2.WithSuperword.longMulBig             2048   -2.21%   -0.07%
VectorReduction2.WithSuperword.longMulDotProduct      2048  -15.47%    0.00%
VectorReduction2.WithSuperword.longMulSimple          2048  -17.87%   -0.33%
VectorReduction2.WithSuperword.longOrBig              2048  -15.23%  -43.94%
VectorReduction2.WithSuperword.longOrDotProduct       2048   -0.01%   -3.83%
VectorReduction2.WithSuperword.longOrSimple           2048   -0.01%   -6.60%
VectorReduction2.WithSuperword.longXorBig             2048  -10.03%  -41.62%
VectorReduction2.WithSuperword.longXorDotProduct      2048    0.01%  -38.61%
VectorReduction2.WithSuperword.longXorSimple          2048    0.02%  -53.18%
```

Arm Neoverse V1 machine (256-bit SVE):

Note: In the current mainline code, the AArch64 backend supports only 128-bit multiply long operations. Auto-vectorization accounts for this backend constraint and splits 256-bit vectors into 128-bit chunks so that the loop can still be vectorized. This is why 256-bit platforms also benefit from this patch. No obvious performance changes are observed for other benchmarks.
```
Benchmark                           (COUNT)     p/m1     p/m0
VectorReduction2.longMulDotProduct     2048  -28.23%    0.00%
VectorReduction2.longMulSimple         2048  -19.29%    0.01%
```

Tier 1 - 3 passed on both aarch64 and x86 platforms.

[1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en
[3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617
[4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035
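The ILP argument rests on a simple algebraic fact: an associative-and-commutative reduction can be split into independent partial accumulators without changing the result. A Python sketch (illustrative only; function names are invented, not from the patch):

```python
def dot_single_acc(a, b):
    """One accumulator: models the mla chain, where each step
    depends on the previous step's destination register."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_split_acc(a, b, lanes=4):
    """Independent partial accumulators: models separate mul + add
    with a final reduction, exposing instruction-level parallelism
    because the four running sums can be updated concurrently."""
    accs = [0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % lanes] += x * y
    return sum(accs)
```

Both variants compute the same dot product; the hardware-visible difference is that the split form breaks the serial dependency chain, which is what the separate mul/add code generation achieves.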
fg1417 added a commit to fg1417/jdk that referenced this pull request on Mar 30, 2026
The microbenchmark ArraysFill.testLongFill[1] on 128-bit vector platforms generates vectorized store instructions with non-monotonic memory offsets, e.g.:

str q16, [x12, #80]
str q16, [x12, #48]
str q16, [x12, #128]
...

This arises because SuperWord only considers true dependencies when building edges (see [3]), and therefore does not enforce ordering among independent vector memory operations. These nodes are later scheduled using RPO, which can result in an apparently unordered sequence of memory accesses.

This patch replaces RPO-based scheduling with a priority-based topological sort to improve ordering and locality. The scheduling policy is:
1. Prefer nodes whose weak predecessors have already been scheduled.
2. Prioritize node types in the following order: scalar operations (loads/stores, address expressions), vector arithmetic, vector loads, vector stores, then others.
3. For independent loads/stores sharing the same base address, prefer ascending offsets.
4. Use VTransformNodeIDX to ensure stable ordering.

With this change, the generated code becomes monotonic in memory offsets:

str q16, [x12, #16]
str q16, [x12, #32]
str q16, [x12, #48]
...

On an Arm Neoverse V2 machine (128-bit SVE), this improves the following benchmarks:

TypeVectorOperationsSuperWord.java[2]:
Benchmark            (COUNT)  Mode  Units  Difference
absD                     512  avgt  ns/op     -27.05%
absD                    2048  avgt  ns/op     -27.05%
absL                     512  avgt  ns/op     -24.46%
absL                    2048  avgt  ns/op     -27.26%
convertD2LBitsRaw        512  avgt  ns/op     -20.39%
convertD2LBitsRaw       2048  avgt  ns/op     -23.92%
convertF2L               512  avgt  ns/op     -16.82%
convertF2L              2048  avgt  ns/op     -22.60%
convertI2D               512  avgt  ns/op     -12.50%
convertI2D              2048  avgt  ns/op     -17.92%
convertLBits2D           512  avgt  ns/op     -27.13%
convertLBits2D          2048  avgt  ns/op     -31.69%
negD                     512  avgt  ns/op     -26.85%
negD                    2048  avgt  ns/op     -27.09%

ArraysFill.java[1]:
Benchmark         (size)  Mode   Units   Difference
testDoubleFill       250  thrpt  ops/ms      26.46%
testDoubleFill       266  thrpt  ops/ms      32.69%
testDoubleFill       511  thrpt  ops/ms      33.83%
testDoubleFill      2047  thrpt  ops/ms      45.35%
testDoubleFill      2048  thrpt  ops/ms      45.38%
testDoubleFill      8195  thrpt  ops/ms      49.32%
testLongFill         250  thrpt  ops/ms      28.12%
testLongFill         266  thrpt  ops/ms      40.30%
testLongFill         511  thrpt  ops/ms      34.79%
testLongFill        2047  thrpt  ops/ms      45.71%
testLongFill        2048  thrpt  ops/ms      53.07%
testLongFill        8195  thrpt  ops/ms      49.52%

No significant performance changes are observed on wider vector platforms (e.g., 256-bit or 512-bit), where fewer vector operations are generated in SuperWord and scheduling has less impact.

[1] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/java/util/ArraysFill.java#L92
[2] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java
[3] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/src/hotspot/share/opto/superwordVTransformBuilder.cpp#L99
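The priority-based topological sort can be sketched in Python. This is a hypothetical model of the policy described above, not the C2 implementation; the `priority` callback stands in for the (kind rank, base, offset, VTransformNodeIDX) key:

```python
import heapq

def schedule(nodes, deps, priority):
    """Topological sort that, among the currently ready nodes, always
    picks the one with the smallest priority key. `deps` maps a node
    to the set of predecessors that must be scheduled first."""
    indeg = {n: len(deps.get(n, ())) for n in nodes}
    succs = {n: [] for n in nodes}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    ready = [(priority(n), n) for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for s in succs[n]:          # releasing a node may make successors ready
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (priority(s), s))
    return order
```

With no dependencies and the store offset as the priority key, four independent stores come out in ascending offset order, which is exactly the monotonic sequence shown above; true dependencies still override the priority.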
ruben-arm added a commit to ruben-arm/jdk that referenced this pull request on Mar 30, 2026
Some vector operations do not have inputs and essentially initialize vectors with a constant value. These operations can be marked for spilling and subsequently rematerialized at every use. The result of the transformation might look as follows:

movi v16.2d, #0x0
str  q16, [x16, #64]
movi v16.2d, #0x0
str  q16, [x16, #32]
movi v16.2d, #0x0
str  q16, [x16, #16]
movi v16.2d, #0x0
str  q16, [x16]
movi v16.2d, #0x0
str  q16, [x16, #48]
movi v16.2d, #0x0
str  q16, [x16, #112]
movi v16.2d, #0x0
str  q16, [x16, #80]
movi v16.2d, #0x0
str  q16, [x16, #96]

Introduce deduplication of these rematerialized vector constant initializations, reducing the above sequence to:

movi v16.2d, #0x0
str  q16, [x16, #64]
str  q16, [x16, #32]
str  q16, [x16, #16]
str  q16, [x16]
str  q16, [x16, #48]
str  q16, [x16, #112]
str  q16, [x16, #80]
str  q16, [x16, #96]
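The deduplication can be modeled as a peephole over the instruction stream that tracks which constant a register currently holds. A Python sketch (illustrative model with invented instruction tuples, not the actual register-allocator change; it assumes, as in the sequence above, that the intervening uses do not clobber the register):

```python
def dedup_remat(insns):
    """Drop a rematerialized constant init when the register already
    holds that constant and nothing has clobbered it since.
    Instructions are modeled as ('init', reg, value) for the movi
    and ('use', reg, text) for the dependent store."""
    known = {}   # reg -> constant value currently live in the register
    out = []
    for ins in insns:
        if ins[0] == "init":
            _, reg, val = ins
            if known.get(reg) == val:
                continue             # redundant re-init: skip it
            known[reg] = val
            out.append(ins)
        else:
            out.append(ins)          # stores read the register, don't clobber it
    return out
```

Applied to the movi/str pattern above, only the first movi survives while every str is kept, matching the reduced sequence.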
snake66 added a commit to snake66/jdk that referenced this pull request on Apr 20, 2026
Revert "Sync constructors for ThreadWXEnable with MacOS impl"
https://bugs.openjdk.java.net/browse/JDK-8252543