8255949: AArch64: Add support for vectorized shift right and accumulate#1087
8255949: AArch64: Add support for vectorized shift right and accumulate#1087dgbo wants to merge 1 commit intoopenjdk:masterfrom
Conversation
|
👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into |
|
Mailing list message from Andrew Haley on hotspot-compiler-dev: On 11/6/20 3:44 AM, Dong Bo wrote:
Do you find it disappointing that there is such a small improvement? -- |
@theRealAph Thanks for the quick review. For test shiftURightAccumulateByte, as claimed before, it is not vectorized with/without this patch, so the performance are all the same. For other tests (14.13%~19.53% improvement), I checked the profile from BTW, according to the hardware PMU counters, 99.617%~99.901% the memory accesses mainly hit in L1/L2 data cache. I think that's why the improvements are small, hope this could address what you considered, thanks. The profile with test shiftRightAccumulateByte (14.13% improvement): |
|
Mailing list message from Andrew Haley on hotspot-compiler-dev: On 11/7/20 8:43 AM, Dong Bo wrote:
OK, but let's think about how this works in the real world outside Hopefully, though, shifting and accumulating isn't the only thing With that in mind, please produce a benchmark that fits in L1, so -- |
I think the benchmark fits L1 already. Tests shift(U)RightAccumulateLong handle the maximum size of data. According to the PMU counters, the cache misses count is negligible. The cache refill nosie may be from the OS intterrupts, context switch or somesuch. I tried 2 other ways to see if we can do better, but both failed.
The improvement regressed to 7.24%, due to additinal
But it fails to vectorlize. :( |
|
Mailing list message from Andrew Haley on hotspot-compiler-dev: On 11/9/20 5:55 AM, Dong Bo wrote:
Wow, OK. So the problem is that the memory system can barely keep up with Approved. -- |
|
Mailing list message from [E)](mailto:dongbo4@huawei.com (dongbo) on hotspot-compiler-dev: On 2020/11/9 17:37, Andrew Haley wrote:
Totally agree.
Thanks. Could you please approve this on the Github page of these PR? Link: https://git.openjdk.java.net/jdk/pull/1087 BTW, the Base64.encode intrinsic we discussed few days ago has not been Is there any further consideration for that? Base64.encode PR link: https://git.openjdk.java.net/jdk/pull/992 |
|
@dgbo This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 47 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
|
Mailing list message from Andrew Haley on hotspot-compiler-dev: On 09/11/2020 12:40, dongbo (E) wrote:
Yes, there was one minor style thing: Register doff = c_rarg4; // position for writing to dest array Register codec = r6; I guess the change in naming style here is because isURL really is an Otherwise it's all perfectly fine. -- |
|
@theRealAph Thanks for the review. I'll fix the register naming style of Base64.encode intrinisc in that PR as suggested. |
|
/sponsor |
|
@RealFYang @dgbo Since your change was applied there have been 47 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit f71f9dc. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
*** Implementation
In AArch64 NEON, vector shift right is implemented by vector shift left
instructions (SSHL[1] and USHL[2]) with negative shift count value. In
C2 backend, we generate a `neg` to given shift value followed by `sshl`
or `ushl` instruction.
For vector shift right, the vector shift count has two origins:
1) it can be duplicated from scalar variable/immediate(case-1),
2) it can be loaded directly from one vector(case-2).
This patch aims to optimize case-1. Specifically, we move the negate
from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.
In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.
*** Example
Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.
```
public void rShiftInt() {
for (int i = 0; i < SIZE; i++) {
intsB[i] = intsA[i] >> count;
}
}
```
Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.
```
0x0000ffff89057a64: dup v16.16b, w13 <-- dup
0x0000ffff89057a68: mov w12, #0x7d00 // #32000
0x0000ffff89057a6c: sub w13, w2, w10
0x0000ffff89057a70: cmp w2, w10
0x0000ffff89057a74: csel w13, wzr, w13, lt
0x0000ffff89057a78: mov w8, #0x7d00 // #32000
0x0000ffff89057a7c: cmp w13, w8
0x0000ffff89057a80: csel w13, w12, w13, hi
0x0000ffff89057a84: add w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, openjdk#2, openjdk#32 <-- loop entry
0x0000ffff89057a94: add x15, x17, x13
0x0000ffff89057a98: ldr q17, [x15,openjdk#16]
0x0000ffff89057a9c: add x13, x0, x13
0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
0x0000ffff89057aa8: str q17, [x13,openjdk#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add w10, w10, #0x20
0x0000ffff89057b20: cmp w10, w14
0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
```
Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.
```
0x0000ffff8d053a68: neg w14, w13 <---- neg
0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
0x0000ffff8d053a70: sub w14, w2, w10
0x0000ffff8d053a74: cmp w2, w10
0x0000ffff8d053a78: csel w14, wzr, w14, lt
0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
0x0000ffff8d053a80: cmp w14, w8
0x0000ffff8d053a84: csel w14, w12, w14, hi
0x0000ffff8d053a88: add w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, openjdk#2, openjdk#32 <-- loop entry
0x0000ffff8d053a94: add x15, x17, x14
0x0000ffff8d053a98: ldr q17, [x15,openjdk#16]
0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
0x0000ffff8d053aa0: add x14, x0, x14
0x0000ffff8d053aa4: str q17, [x14,openjdk#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add w10, w10, #0x20
0x0000ffff8d053b00: cmp w10, w13
0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
```
*** Testing
Tier1~3 tests passed on Linux/AArch64 platform.
*** Performance Evaluation
- Auto-vectorization
One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.
The following table shows the result. Column `Score-1` shows the score
before we apply current patch, and column `Score-2` shows the score when
we apply current patch.
We witness about 30% ~ 53% improvement on microbenchmarks.
```
Benchmark Units Score-1 Score-2
VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
*: Logical shift right for Byte type(urShiftByte) is not vectorized, as
disscussed in [4].
```
- VectorAPI
Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.
```
Benchmark Units Score-1 Score-2
Byte128Vector.LSHL ops/ms 10867.666 10873.993
Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
Byte128Vector.LSHR ops/ms 8629.305 8629.343
Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
Byte128Vector.ASHR ops/ms 8619.691 8629.438
Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
Int128Vector.LSHL ops/ms 3104.213 3103.702
Int128Vector.LSHLShift ops/ms 3114.354 3114.371
Int128Vector.LSHR ops/ms 2380.717 2380.693
Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
Int128Vector.ASHR ops/ms 2380.668 2380.647
Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
Long128Vector.LSHL ops/ms 1586.907 1587.591
Long128Vector.LSHLShift ops/ms 1589.469 1589.540
Long128Vector.LSHR ops/ms 1209.754 1209.687
Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
Long128Vector.ASHR ops/ms 1209.713 1209.669
Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
Short128Vector.LSHL ops/ms 5945.542 5943.770
Short128Vector.LSHLShift ops/ms 5984.743 5984.640
Short128Vector.LSHR ops/ms 4613.378 4613.577
Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
Short128Vector.ASHR ops/ms 4613.389 4613.478
Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
```
1) For logical shift left(LSHL and LSHLShift), and shift right with
variable vector shift count(LSHR and ASHR) cases, we didn't find much
changes, which is expected.
2) For shift right with scalar shift count(LSHRShift and ASHRShift)
case, about 25% ~ 30% improvement can be observed, and this benefit is
introduced by current patch.
[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
Verified with linux-aarch64-server-release, tier1-3.
Added a JMH micro
test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.javafor performance test.We witness about ~20% with different basic types on Kunpeng916. The JMH results:
We checked the shiftURightAccumulateByte test, the performance stays same since it is not vectorized with or without this patch, due to:
We also tried the existing vector operation micro urShiftB, i.e.:
It is not vectorlized too. Seems it's hard to match JAVA code with the URShiftVB node.
Progress
Testing
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/1087/head:pull/1087$ git checkout pull/1087