8255949: AArch64: Add support for vectorized shift right and accumulate by dgbo · Pull Request #1087 · openjdk/jdk

dgbo · 2020-11-06T03:36:57Z

This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.

Verified with linux-aarch64-server-release, tier1-3.

Added a JMH micro test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java for performance test.
We witness about ~20% with different basic types on Kunpeng916. The JMH results:

Benchmark                                         (count)  (seed)  Mode  Cnt    Score   Error  Units
# before, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
# after shift right and accumulate, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op

We checked the shiftURightAccumulateByte test, the performance stays same since it is not vectorized with or without this patch, due to:

src/hotspot/share/opto/vectornode.cpp, line 226:
  case Op_URShiftI:
    switch (bt) {
    case T_BOOLEAN:return Op_URShiftVB;
    case T_CHAR:   return Op_URShiftVS;
    case T_BYTE:
    case T_SHORT:  return 0; // Vector logical right shift for signed short
                             // values produces incorrect Java result for
                             // negative data because java code should convert
                             // a short value into int value with sign
                             // extension before a shift.
    case T_INT:    return Op_URShiftVI;
    default:       ShouldNotReachHere(); return 0;
    }

We also tried the existing vector operation micro urShiftB, i.e.:

test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
    @Benchmark
    public void urShiftB() {
        for (int i = 0; i < COUNT; i++) {
            resB[i] = (byte) (bytesA[i] >>> 3);
        }
    }

It is not vectorlized too. Seems it's hard to match JAVA code with the URShiftVB node.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Testing

	Linux x64	Linux x86	Windows x64	macOS x64
Build	✔️ (5/5 passed)	✔️ (2/2 passed)	✔️ (2/2 passed)	✔️ (2/2 passed)
Test (tier1)	✔️ (9/9 passed)	✔️ (9/9 passed)	✔️ (9/9 passed)	✔️ (9/9 passed)

Issue

JDK-8255949: AArch64: Add support for vectorized shift right and accumulate

Reviewers

Andrew Haley (@theRealAph - Reviewer)

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1087/head:pull/1087
$ git checkout pull/1087

bridgekeeper · 2020-11-06T03:37:55Z

👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2020-11-06T03:39:22Z

@dgbo The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2020-11-06T03:44:05Z

Webrevs

00: Full (71edbca)

mlbridge · 2020-11-06T10:10:54Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 11/6/20 3:44 AM, Dong Bo wrote:

Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
We witness about ~20% with different basic types on Kunpeng916.

Do you find it disappointing that there is such a small improvement?
Do you konw why that is? Perhaps the benchmark is memory bound, or
somesuch?

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

dgbo · 2020-11-07T08:40:52Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 11/6/20 3:44 AM, Dong Bo wrote:

Added a JMH micro test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java for performance test.
We witness about ~20% with different basic types on Kunpeng916.

Do you find it disappointing that there is such a small improvement?
Do you konw why that is? Perhaps the benchmark is memory bound, or
somesuch?

@theRealAph Thanks for the quick review.

For test shiftURightAccumulateByte, as claimed before, it is not vectorized with/without this patch, so the performance are all the same.

For other tests (14.13%~19.53% improvement), I checked the profile from -prof perfasm in JMH framwork.
The runtime is mainly took by load/store instructions other than shifting and accumulating.
As far as I considered, there is no way that we can test these improvements without these memory accesses.

BTW, according to the hardware PMU counters, 99.617%~99.901% the memory accesses mainly hit in L1/L2 data cache.
But the cpu cycles took for load/store in L1/L2 data cache can still be serveral times more than shifting and accumulating registers.

I think that's why the improvements are small, hope this could address what you considered, thanks.

The profile with test shiftRightAccumulateByte (14.13% improvement):

# Before							
         ││  0x0000ffff68309804:   add  x6, x2, x15
         ││  0x0000ffff68309808:   add  x7, x3, x15
 19.81%  ││  0x0000ffff6830980c:   ldr  q16, [x6,#16]
  3.81%  ││  0x0000ffff68309810:   ldr  q17, [x7,#16]
         ││  0x0000ffff68309814:   sshr v16.16b, v16.16b, #1
         ││  0x0000ffff68309818:   add  v16.16b, v16.16b, v17.16b
         ││  0x0000ffff6830981c:   add  x15, x4, x15
         ││  0x0000ffff68309820:   str  q16, [x15,#16]
  4.06%  ││  0x0000ffff68309824:   ldr  q16, [x6,#32]
  3.79%  ││  0x0000ffff68309828:   ldr  q17, [x7,#32]
         ││  0x0000ffff6830982c:   sshr v16.16b, v16.16b, #1
         ││  0x0000ffff68309830:   add  v16.16b, v16.16b, v17.16b
         ││  0x0000ffff68309834:   str  q16, [x15,#32]
  6.05%  ││  0x0000ffff68309838:   ldr  q16, [x6,#48]
  3.48%  ││  0x0000ffff6830983c:   ldr  q17, [x7,#48]
         ││  0x0000ffff68309840:   sshr v16.16b, v16.16b, #1
         ││  0x0000ffff68309844:   add  v16.16b, v16.16b, v17.16b
  0.25%  ││  0x0000ffff68309848:   str  q16, [x15,#48]
  8.67%  ││  0x0000ffff6830984c:   ldr  q16, [x6,#64]
  4.30%  ││  0x0000ffff68309850:   ldr  q17, [x7,#64]
         ││  0x0000ffff68309854:   sshr v16.16b, v16.16b, #1
         ││  0x0000ffff68309858:   add  v16.16b, v16.16b, v17.16b
  0.06%  ││  0x0000ffff6830985c:   str  q16, [x15,#64]

# After
         ││  0x0000ffff98308d64:   add  x6, x2, x15
 14.77%  ││  0x0000ffff98308d68:   ldr  q16, [x6,#16]
         ││  0x0000ffff98308d6c:   add  x7, x3, x15
  4.55%  ││  0x0000ffff98308d70:   ldr  q17, [x7,#16]
         ││  0x0000ffff98308d74:   ssra v17.16b, v16.16b, #1
         ││  0x0000ffff98308d78:   add  x15, x4, x15
  0.02%  ││  0x0000ffff98308d7c:   str  q17, [x15,#16]
  6.14%  ││  0x0000ffff98308d80:   ldr  q16, [x6,#32]
  5.22%  ││  0x0000ffff98308d84:   ldr  q17, [x7,#32]
         ││  0x0000ffff98308d88:   ssra v17.16b, v16.16b, #1
         ││  0x0000ffff98308d8c:   str  q17, [x15,#32]
  5.26%  ││  0x0000ffff98308d90:   ldr  q16, [x6,#48]
  5.14%  ││  0x0000ffff98308d94:   ldr  q17, [x7,#48]
         ││  0x0000ffff98308d98:   ssra v17.16b, v16.16b, #1
         ││  0x0000ffff98308d9c:   str  q17, [x15,#48]
  6.56%  ││  0x0000ffff98308da0:   ldr  q16, [x6,#64]
  5.10%  ││  0x0000ffff98308da4:   ldr  q17, [x7,#64]
         ││  0x0000ffff98308da8:   ssra v17.16b, v16.16b, #1
  0.06%  ││  0x0000ffff98308dac:   str  q17, [x15,#64]

mlbridge · 2020-11-07T09:14:54Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 11/7/20 8:43 AM, Dong Bo wrote:

For other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in JMH framwork.
The runtime is mainly took by load/store instructions other than shifting and accumulating.
As far as I considered, there is no way that we can test these improvements without these memory accesses.

BTW, according to the hardware PMU counters, 99.617%~99.901% the memory accesses mainly hit in L1/L2 data cache.
But the cpu cycles took for load/store in L1/L2 data cache can still be serveral times more than shifting and accumulating registers.

I think that's why the improvements are small, hope this could address what you considered, thanks.

OK, but let's think about how this works in the real world outside
benchmarking. If you're missing L1 it really doesn't matter much what
you do with the data, that 12-cycle load latency is going to dominate
whether you use vectorized shifts or not.

Hopefully, though, shifting and accumulating isn't the only thing
you're doing with that data. Probably, you're going to be doing
other things with it too.

With that in mind, please produce a benchmark that fits in L1, so
that we can see if it works better.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

dgbo · 2020-11-09T05:53:28Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 11/7/20 8:43 AM, Dong Bo wrote:

I think that's why the improvements are small, hope this could address what you considered, thanks.

OK, but let's think about how this works in the real world outside
benchmarking. If you're missing L1 it really doesn't matter much what
you do with the data, that 12-cycle load latency is going to dominate
whether you use vectorized shifts or not.

Hopefully, though, shifting and accumulating isn't the only thing
you're doing with that data. Probably, you're going to be doing
other things with it too.

With that in mind, please produce a benchmark that fits in L1, so
that we can see if it works better.

I think the benchmark fits L1 already.

Tests shift(U)RightAccumulateLong handle the maximum size of data.
The array size is 1028 (count=1028), basic type long (8B), there are 3 arrays. So the data size is abount 24KB.
The data cache of Kunpeng916 (cpu cortex-A72) is 32KB per core, it can hold all the data accessed.

According to the PMU counters, the cache misses count is negligible.
The perf L1-DCache profile of shiftRightAccumulateByte (improvement 14.13%, 3KB data size):

# r3: L1-DCache refill
# r4: L1-DCache accesses
# (1 - r3/r4) is L1-DCache hit ratio
$ perf stat -C 0-3 -e r3,r4 -I 1000
#           time             counts unit events
     1.000212280             32,169      r3
     1.000212280      2,582,977,726      r4
     2.000423060             34,958      r3
     2.000423060      2,582,545,543      r4
     3.000591100             67,446      r3
     3.000591100      2,583,826,062      r4
     4.000828880             35,932      r3
     4.000828880      2,583,342,061      r4
     5.001060280             39,008      r3
     5.001060280      2,582,724,118      r4

The cache refill nosie may be from the OS intterrupts, context switch or somesuch.

I tried 2 other ways to see if we can do better, but both failed.

Reducing the number of load instuctions, like:

--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -81,7 +81,7 @@ public class VectorShiftAccumulate {
     @Benchmark
     public void shiftRightAccumulateByte() {
         for (int i = 0; i < count; i++) {
-            bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+            bytesD[i] = (byte) (0x45 + (bytesB[i] >> 1));
         }
     }

The improvement regressed to 7.24%, due to additinal mov is added to keep the constant unchanged during the loop:

# after, shift and accumulate, additional mov is added
         ││  0x0000ffff703075a4:   add  x16, x3, x15
 15.65%  ││  0x0000ffff703075a8:   ldr  q17, [x16,#16]
         ││  0x0000ffff703075ac:   mov  v18.16b, v16.16b
         ││  0x0000ffff703075b0:   ssra v18.16b, v17.16b, #1
# before, default
 10.43%  ││  0x0000ffff98309f0c:   ldr  q16, [x6,#16]
  4.41%  ││  0x0000ffff98309f10:   ldr  q17, [x7,#16]
         ││  0x0000ffff98309f14:   sshr v16.16b, v16.16b, #1
         ││  0x0000ffff98309f18:   add  v16.16b, v16.16b, v17.16b

Adding more shift and accumulate operations, like:

--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -82,6 +82,7 @@ public class VectorShiftAccumulate {
     public void shiftRightAccumulateByte() {
         for (int i = 0; i < count; i++) {
             bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+            bytesA[i] = (byte) (bytesB[i] + (bytesD[i] >> 1));
         }
     }

But it fails to vectorlize. :(

mlbridge · 2020-11-09T09:39:03Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 11/9/20 5:55 AM, Dong Bo wrote:

On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:

This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.

Verified with linux-aarch64-server-release, tier1-3.

Added a JMH micro `test/micro/org/openjdk/bench We witness about ~20% with different basic Benchmark # before, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte VectorShiftAccumulate.shiftRightAccumulateInt VectorShiftAccumulate.shiftRightAccumulateLong VectorShiftAccumulate.shiftRightAccumulateShort VectorShiftAccumulate.shiftURightAccumulateByte VectorShiftAccumulate.shiftURightAccumulateChar VectorShiftAccumulate.shiftURightAccumulateInt VectorShiftAccumulate.shiftURightAccumulateLong # after shift right and accumulate, Kunpeng VectorShiftAccumulate.shiftRightAccumulateByte VectorShiftAccumulate.shiftRightAccumulateInt VectorShiftAccumulate.shiftRightAccumulateLong VectorShiftAccumulate.shiftRightAccumulateShort VectorShiftAccumulate.shiftURightAccumulateByte VectorShiftAccumulate.shiftURightAccumulateChar VectorShiftAccumulate.shiftURightAccumulateInt VectorShiftAccumulate.shiftURightAccumulateLong /vm/compiler/VectorShiftAccumulate.java` for performance test.
types on Kunpeng916. The JMH results:
(count) (seed) Mode Cnt Score Error Units
1028 0 avgt 10 146.259 ? 0.123 ns/op
1028 0 avgt 10 454.781 ? 3.856 ns/op
1028 0 avgt 10 938.842 ? 23.288 ns/op
1028 0 avgt 10 205.493 ? 4.938 ns/op
1028 0 avgt 10 905.483 ? 0.309 ns/op (not vectorized)
1028 0 avgt 10 220.847 ? 5.868 ns/op
1028 0 avgt 10 442.587 ? 6.980 ns/op
1028 0 avgt 10 936.289 ? 21.458 ns/op
916
1028 0 avgt 10 125.586 ? 0.204 ns/op
1028 0 avgt 10 365.973 ? 6.466 ns/op
1028 0 avgt 10 804.605 ? 12.336 ns/op
1028 0 avgt 10 170.123 ? 4.678 ns/op
1028 0 avgt 10 905.779 ? 0.587 ns/op (not vectorized)
1028 0 avgt 10 185.799 ? 4.764 ns/op
1028 0 avgt 10 364.360 ? 6.522 ns/op
1028 0 avgt 10 800.737 ? 13.735 ns/op

We checked the shiftURightAccumulateByte test, the performance stays same since it is not vectorized with or without this patch, due to:
src/hotspot/share/opto/vectornode.cpp, line 226:
case Op_URShiftI:
switch (bt) {
case T_BOOLEAN:return Op_URShiftVB;
case T_CHAR: return Op_URShiftVS;
case T_BYTE:
case T_SHORT: return 0; // Vector logical right shift for signed short
// values produces incorrect Java result for
// negative data because java code should convert
// a short value into int value with sign
// extension before a shift.
case T_INT: return Op_URShiftVI;
default: ShouldNotReachHere(); return 0;
}
We also tried the existing vector operation micro urShiftB, i.e.:
test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
@benchmark
public void urShiftB() {
for (int i = 0; i < COUNT; i++) {
resB[i] = (byte) (bytesA[i] >>> 3);
}
}
It is not vectorlized too. Seems it's hard to match JAVA code with the URShiftVB node.

_Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_

On 11/6/20 3:44 AM, Dong Bo wrote:

Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
We witness about ~20% with different basic types on Kunpeng916.

Do you find it disappointing that there is such a small improvement?
Do you konw why that is? Perhaps the benchmark is memory bound, or
somesuch?

@theRealAph Thanks for the quick review.

For test shiftURightAccumulateByte, as claimed before, it is not vectorized with/without this patch, so the performance are all the same.

For other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in JMH framwork.
The runtime is mainly took by load/store instructions other than shifting and accumulating.
As far as I considered, there is no way that we can test these improvements without these memory accesses.

BTW, according to the hardware PMU counters, 99.617%~99.901% the memory accesses mainly hit in L1/L2 data cache.
But the cpu cycles took for load/store in L1/L2 data cache can still be serveral times more than shifting and accumulating registers.

I think that's why the improvements are small, hope this could address what you considered, thanks.

The profile with test shiftRightAccumulateByte (14.13% improvement):

# Before
?? 0x0000ffff68309804: add x6, x2, x15
?? 0x0000ffff68309808: add x7, x3, x15
19.81% ?? 0x0000ffff6830980c: ldr q16, [x6,#16]
3.81% ?? 0x0000ffff68309810: ldr q17, [x7,#16]
?? 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
?? 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
?? 0x0000ffff6830981c: add x15, x4, x15
?? 0x0000ffff68309820: str q16, [x15,#16]
4.06% ?? 0x0000ffff68309824: ldr q16, [x6,#32]
3.79% ?? 0x0000ffff68309828: ldr q17, [x7,#32]
?? 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
?? 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
?? 0x0000ffff68309834: str q16, [x15,#32]
6.05% ?? 0x0000ffff68309838: ldr q16, [x6,#48]
3.48% ?? 0x0000ffff6830983c: ldr q17, [x7,#48]
?? 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
?? 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
0.25% ?? 0x0000ffff68309848: str q16, [x15,#48]
8.67% ?? 0x0000ffff6830984c: ldr q16, [x6,#64]
4.30% ?? 0x0000ffff68309850: ldr q17, [x7,#64]
?? 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
?? 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
0.06% ?? 0x0000ffff6830985c: str q16, [x15,#64]

# After
?? 0x0000ffff98308d64: add x6, x2, x15
14.77% ?? 0x0000ffff98308d68: ldr q16, [x6,#16]
?? 0x0000ffff98308d6c: add x7, x3, x15
4.55% ?? 0x0000ffff98308d70: ldr q17, [x7,#16]
?? 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
?? 0x0000ffff98308d78: add x15, x4, x15
0.02% ?? 0x0000ffff98308d7c: str q17, [x15,#16]
6.14% ?? 0x0000ffff98308d80: ldr q16, [x6,#32]
5.22% ?? 0x0000ffff98308d84: ldr q17, [x7,#32]
?? 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
?? 0x0000ffff98308d8c: str q17, [x15,#32]
5.26% ?? 0x0000ffff98308d90: ldr q16, [x6,#48]
5.14% ?? 0x0000ffff98308d94: ldr q17, [x7,#48]
?? 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
?? 0x0000ffff98308d9c: str q17, [x15,#48]
6.56% ?? 0x0000ffff98308da0: ldr q16, [x6,#64]
5.10% ?? 0x0000ffff98308da4: ldr q17, [x7,#64]
?? 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
0.06% ?? 0x0000ffff98308dac: str q17, [x15,#64]

_Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_

On 11/7/20 8:43 AM, Dong Bo wrote:

I think that's why the improvements are small, hope this could address what you considered, thanks.

OK, but let's think about how this works in the real world outside
benchmarking. If you're missing L1 it really doesn't matter much what
you do with the data, that 12-cycle load latency is going to dominate
whether you use vectorized shifts or not.

Hopefully, though, shifting and accumulating isn't the only thing
you're doing with that data. Probably, you're going to be doing
other things with it too.

With that in mind, please produce a benchmark that fits in L1, so
that we can see if it works better.

I think the benchmark fits L1 already.

Tests shift(U)RightAccumulateLong handle the maximum size of data.
The array size is 1028 (count=1028), basic type long (8B), there are 3 arrays. So the data size is abount 24KB.
The data cache of Kunpeng916 (cpu cortex-A72) is 32KB per core, it can hold all the data accessed.

Wow, OK. So the problem is that the memory system can barely keep up with
the processor, even when all data is coming in from L1. Fair enough.

Approved.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

mlbridge · 2020-11-09T12:42:00Z

Mailing list message from [E)](mailto:dongbo4@huawei.com (dongbo) on hotspot-compiler-dev:

On 2020/11/9 17:37, Andrew Haley wrote:

On 11/9/20 5:55 AM, Dong Bo wrote:

On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:

This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.

Verified with linux-aarch64-server-release, tier1-3.

Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
We witness about ~20% with different basic types on Kunpeng916.

_Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_

On 11/6/20 3:44 AM, Dong Bo wrote:

Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
We witness about ~20% with different basic types on Kunpeng916.
Do you find it disappointing that there is such a small improvement?
Do you konw why that is? Perhaps the benchmark is memory bound, or
somesuch?
@theRealAph Thanks for the quick review.

BTW, according to the hardware PMU counters, 99.617%~99.901% the memory accesses mainly hit in L1/L2 data cache.
But the cpu cycles took for load/store in L1/L2 data cache can still be serveral times more than shifting and accumulating registers.

I think that's why the improvements are small, hope this could address what you considered, thanks.
_Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_

On 11/7/20 8:43 AM, Dong Bo wrote:

I think that's why the improvements are small, hope this could address what you considered, thanks.
OK, but let's think about how this works in the real world outside
benchmarking. If you're missing L1 it really doesn't matter much what
you do with the data, that 12-cycle load latency is going to dominate
whether you use vectorized shifts or not.

Hopefully, though, shifting and accumulating isn't the only thing
you're doing with that data. Probably, you're going to be doing
other things with it too.

With that in mind, please produce a benchmark that fits in L1, so
that we can see if it works better.

I think the benchmark fits L1 already.

Tests shift(U)RightAccumulateLong handle the maximum size of data.
The array size is 1028 (count=1028), basic type long (8B), there are 3 arrays. So the data size is abount 24KB.
The data cache of Kunpeng916 (cpu cortex-A72) is 32KB per core, it can hold all the data accessed.
Wow, OK. So the problem is that the memory system can barely keep up with
the processor, even when all data is coming in from L1. Fair enough.

Totally agree.

Approved.

Thanks. Could you please approve this on the Github page of these PR?

Link: https://git.openjdk.java.net/jdk/pull/1087

BTW, the Base64.encode intrinsic we discussed few days ago has not been
approved neither.

Is there any further consideration for that?

Base64.encode PR link: https://git.openjdk.java.net/jdk/pull/992

openjdk · 2020-11-09T16:08:51Z

@dgbo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8255949: AArch64: Add support for vectorized shift right and accumulate

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 47 new commits pushed to the master branch:

1332ba3: 8256039: Shenandoah: runtime/stringtable/StringTableCleaningTest.java fails
11431b1: 4619330: All built-in java.awt.color.ColorSpace fields should be specified as such
17f04fc: 8254078: DataOutputStream is very slow post-disabling of Biased Locking
79b7909: 8255980: G1 Service thread register_task can be used after shutdown
dd8e4ff: 8255711: Fix and unify hotspot signal handlers
d99e1f6: 8255991: Shenandoah: Apply 'weak' LRB on cmpxchg and xchg
c7551c3: 8256014: Eliminate the warning about serialization in non-public API of Swing
2d6c28d: 6847157: java.lang.NullPointerException: HDC for component at sun.java2d.loops.Blit.Blit
3ce09c0: 8255920: J2DBench should support CS_PYCC color profile
2c8f4e2: 8255799: AArch64: CPU_A53MAC feature may be set incorrectly
... and 37 more: https://git.openjdk.java.net/jdk/compare/397bae20e9e6ae91a7d3ee74460e0e3c9da572f9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

mlbridge · 2020-11-09T16:15:57Z

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 09/11/2020 12:40, dongbo (E) wrote:

BTW, the Base64.encode intrinsic we discussed few days ago has not been
approved neither.

Is there any further consideration for that?

Base64.encode PR link: https://git.openjdk.java.net/jdk/pull/992

Yes, there was one minor style thing:

Register doff = c_rarg4; // position for writing to dest array
Register isURL = c_rarg5; // Base64 or URL chracter set

Register codec = r6;
Register length = r7;

I guess the change in naming style here is because isURL really is an
argument, but code and length are temps, but this expects the reader
to be aware that r6 == c_rarg6 and r7 == c_rarg7. I just find the change
in register naming style confusing.

Otherwise it's all perfectly fine.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

dgbo · 2020-11-10T01:17:56Z

@theRealAph Thanks for the review.
/integrate

I'll fix the register naming style of Base64.encode intrinisc in that PR as suggested.

openjdk · 2020-11-10T01:18:54Z

@dgbo
Your change (at version 71edbca) is now ready to be sponsored by a Committer.

RealFYang · 2020-11-10T01:23:40Z

/sponsor

openjdk · 2020-11-10T01:24:45Z

@RealFYang @dgbo Since your change was applied there have been 47 commits pushed to the master branch:

1332ba3: 8256039: Shenandoah: runtime/stringtable/StringTableCleaningTest.java fails
11431b1: 4619330: All built-in java.awt.color.ColorSpace fields should be specified as such
17f04fc: 8254078: DataOutputStream is very slow post-disabling of Biased Locking
79b7909: 8255980: G1 Service thread register_task can be used after shutdown
dd8e4ff: 8255711: Fix and unify hotspot signal handlers
d99e1f6: 8255991: Shenandoah: Apply 'weak' LRB on cmpxchg and xchg
c7551c3: 8256014: Eliminate the warning about serialization in non-public API of Swing
2d6c28d: 6847157: java.lang.NullPointerException: HDC for component at sun.java2d.loops.Blit.Blit
3ce09c0: 8255920: J2DBench should support CS_PYCC color profile
2c8f4e2: 8255799: AArch64: CPU_A53MAC feature may be set incorrectly
... and 37 more: https://git.openjdk.java.net/jdk/compare/397bae20e9e6ae91a7d3ee74460e0e3c9da572f9...master

Your commit was automatically rebased without conflicts.

Pushed as commit f71f9dc.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Reviewed-by: aph

*** Implementation In AArch64 NEON, vector shift right is implemented by vector shift left instructions (SSHL[1] and USHL[2]) with negative shift count value. In C2 backend, we generate a `neg` to given shift value followed by `sshl` or `ushl` instruction. For vector shift right, the vector shift count has two origins: 1) it can be duplicated from scalar variable/immediate(case-1), 2) it can be loaded directly from one vector(case-2). This patch aims to optimize case-1. Specifically, we move the negate from RShiftV* rules to RShiftCntV rule. As a result, the negate can be hoisted outside of the loop if it's a loop invariant. In this patch, 1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle shift left and shift right respectively. Compared to vslcnt* rules, the negate is conducted in vsrcnt*. 2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use vsra*_var and vsrl*_var rules to handle case-2. Note that ShiftVNode::is_var_shift() can be used to distinguish case-1 from case-2. 3) we add one assertion for the vs*_imm rules as we have done on ARM32[3]. 4) several style issues are resolved. *** Example Take function `rShiftInt()` in the newly added micro benchmark VectorShiftRight.java as an example. ``` public void rShiftInt() { for (int i = 0; i < SIZE; i++) { intsB[i] = intsA[i] >> count; } } ``` Arithmetic shift right is conducted inside a big loop. The following code snippet shows the disassembly code generated by auto-vectorization before we apply current patch. We can see that `neg` is conducted in the loop body. ``` 0x0000ffff89057a64: dup v16.16b, w13 <-- dup 0x0000ffff89057a68: mov w12, #0x7d00 // #32000 0x0000ffff89057a6c: sub w13, w2, w10 0x0000ffff89057a70: cmp w2, w10 0x0000ffff89057a74: csel w13, wzr, w13, lt 0x0000ffff89057a78: mov w8, #0x7d00 // #32000 0x0000ffff89057a7c: cmp w13, w8 0x0000ffff89057a80: csel w13, w12, w13, hi 0x0000ffff89057a84: add w14, w13, w10 0x0000ffff89057a88: nop 0x0000ffff89057a8c: nop 0x0000ffff89057a90: sbfiz x13, x10, openjdk#2, openjdk#32 <-- loop entry 0x0000ffff89057a94: add x15, x17, x13 0x0000ffff89057a98: ldr q17, [x15,openjdk#16] 0x0000ffff89057a9c: add x13, x0, x13 0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg 0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right 0x0000ffff89057aa8: str q17, [x13,openjdk#16] 0x0000ffff89057aac: ... 0x0000ffff89057b1c: add w10, w10, #0x20 0x0000ffff89057b20: cmp w10, w14 0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end ``` Here is the disassembly code after we apply current patch. We can see that the negate is no longer conducted inside the loop, and it is hoisted to the outside. ``` 0x0000ffff8d053a68: neg w14, w13 <---- neg 0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup 0x0000ffff8d053a70: sub w14, w2, w10 0x0000ffff8d053a74: cmp w2, w10 0x0000ffff8d053a78: csel w14, wzr, w14, lt 0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000 0x0000ffff8d053a80: cmp w14, w8 0x0000ffff8d053a84: csel w14, w12, w14, hi 0x0000ffff8d053a88: add w13, w14, w10 0x0000ffff8d053a8c: nop 0x0000ffff8d053a90: sbfiz x14, x10, openjdk#2, openjdk#32 <-- loop entry 0x0000ffff8d053a94: add x15, x17, x14 0x0000ffff8d053a98: ldr q17, [x15,openjdk#16] 0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right 0x0000ffff8d053aa0: add x14, x0, x14 0x0000ffff8d053aa4: str q17, [x14,openjdk#16] 0x0000ffff8d053aa8: ... 0x0000ffff8d053afc: add w10, w10, #0x20 0x0000ffff8d053b00: cmp w10, w13 0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end ``` *** Testing Tier1~3 tests passed on Linux/AArch64 platform. *** Performance Evaluation - Auto-vectorization One micro benchmark, i.e. VectorShiftRight.java, is added by this patch in order to evaluate the optimization on vector shift right. The following table shows the result. Column `Score-1` shows the score before we apply current patch, and column `Score-2` shows the score when we apply current patch. We witness about 30% ~ 53% improvement on microbenchmarks. ``` Benchmark Units Score-1 Score-2 VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353 VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941 VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247 VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762 VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*) VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478 VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928 VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991 *: Logical shift right for Byte type(urShiftByte) is not vectorized, as disscussed in [4]. ``` - VectorAPI Furthermore, we also evaluate the impact of this patch on VectorAPI benchmarks, e.g., [5]. Details can be found in the table below. Columns `Score-1` and `Score-2` show the scores before and after applying current patch. ``` Benchmark Units Score-1 Score-2 Byte128Vector.LSHL ops/ms 10867.666 10873.993 Byte128Vector.LSHLShift ops/ms 10945.729 10945.741 Byte128Vector.LSHR ops/ms 8629.305 8629.343 Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <-- Byte128Vector.ASHR ops/ms 8619.691 8629.438 Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <-- Int128Vector.LSHL ops/ms 3104.213 3103.702 Int128Vector.LSHLShift ops/ms 3114.354 3114.371 Int128Vector.LSHR ops/ms 2380.717 2380.693 Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <-- Int128Vector.ASHR ops/ms 2380.668 2380.647 Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <-- Long128Vector.LSHL ops/ms 1586.907 1587.591 Long128Vector.LSHLShift ops/ms 1589.469 1589.540 Long128Vector.LSHR ops/ms 1209.754 1209.687 Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <-- Long128Vector.ASHR ops/ms 1209.713 1209.669 Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <-- Short128Vector.LSHL ops/ms 5945.542 5943.770 Short128Vector.LSHLShift ops/ms 5984.743 5984.640 Short128Vector.LSHR ops/ms 4613.378 4613.577 Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <-- Short128Vector.ASHR ops/ms 4613.389 4613.478 Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <-- ``` 1) For logical shift left(LSHL and LSHLShift), and shift right with variable vector shift count(LSHR and ASHR) cases, we didn't find much changes, which is expected. 2) For shift right with scalar shift count(LSHRShift and ASHRShift) case, about 25% ~ 30% improvement can be observed, and this benefit is introduced by current patch. [1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register-- [2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register-- [3] openjdk/jdk18#41 [4] openjdk#1087 [5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509

8255949: AArch64: Add support for vectorized shift right and accumulate

71edbca

openjdk Bot added the rfr Pull request is ready for review label Nov 6, 2020

openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 6, 2020

theRealAph approved these changes Nov 9, 2020

View reviewed changes

openjdk Bot added the ready Pull request is ready to be integrated label Nov 9, 2020

openjdk Bot added the sponsor Pull request is ready to be sponsored label Nov 10, 2020

openjdk Bot closed this Nov 10, 2020

openjdk Bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 10, 2020

openjdk-notifier Bot referenced this pull request Nov 10, 2020

8255949: AArch64: Add support for vectorized shift right and accumulate

f71f9dc

Reviewed-by: aph

dgbo deleted the aarch64-shiftR-accumulate branch November 10, 2020 01:25

shqking mentioned this pull request Mar 7, 2022

8265263: AArch64: Combine vneg with right shift count #7724

Closed

3 tasks

Conversation

dgbo commented Nov 6, 2020 • edited by openjdk Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Testing

Issue

Reviewers

Download

Uh oh!

bridgekeeper Bot commented Nov 6, 2020

Uh oh!

openjdk Bot commented Nov 6, 2020

Uh oh!

mlbridge Bot commented Nov 6, 2020

Webrevs

Uh oh!

mlbridge Bot commented Nov 6, 2020

Uh oh!

dgbo commented Nov 7, 2020

Uh oh!

mlbridge Bot commented Nov 7, 2020

Uh oh!

dgbo commented Nov 9, 2020

Uh oh!

mlbridge Bot commented Nov 9, 2020

Uh oh!

mlbridge Bot commented Nov 9, 2020

Uh oh!

openjdk Bot commented Nov 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mlbridge Bot commented Nov 9, 2020

Uh oh!

dgbo commented Nov 10, 2020

Uh oh!

openjdk Bot commented Nov 10, 2020

Uh oh!

RealFYang commented Nov 10, 2020

Uh oh!

openjdk Bot commented Nov 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

dgbo commented Nov 6, 2020 •

edited by openjdk Bot

Loading

openjdk Bot commented Nov 9, 2020 •

edited

Loading