Optimize Hpack and AsciiString hashcode and equals #8902

normanmaurer merged 2 commits into netty:4.1 from
Conversation
Motivation:

While looking at hpack header-processing hotspots I noticed some low-level too-big-to-inline methods which can be shrunk.

Modifications:

Reduce bytecode size and/or runtime operations used for the following methods:

- `PlatformDependent0.equals(byte[], ...)`
- `PlatformDependent0.equalsConstantTime(byte[], ...)`
- `PlatformDependent0.hashCodeAscii(byte[], int, int)`
- `PlatformDependent.hashCodeAscii(CharSequence)`

Result:

Existing benchmarks show decent improvement.

Before:
```
Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  17200229.374 ± 1701239.198  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   3386061.629 ±   72264.685  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    507579.209 ±   65883.951  ops/s
```

After:
```
Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  29221527.058 ± 4805825.836  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   6556251.645 ±  466115.199  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    879828.889 ±  148136.641  ops/s
```

Before:
```
Benchmark                          (size)  Mode  Cnt     Score    Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     4.263 ±  0.110  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     5.206 ±  0.133  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     8.160 ±  0.320  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    13.810 ±  0.751  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    89.077 ±  7.275  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   773.940 ± 24.579  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  7546.807 ± 110.395 ns/op
```

After:
```
Benchmark                          (size)  Mode  Cnt     Score    Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     3.337 ±  0.087  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     4.286 ±  0.194  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     7.817 ±  0.123  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    11.260 ±  0.412  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    84.255 ±  2.596  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   591.892 ±  5.136  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  6978.859 ± 285.043 ns/op
```
Can one of the admins verify this patch?

@netty-bot test this please
```java
final int length = bytes.length(), remainingBytes = length & 7;
// Benchmarking shows that by just naively looping for inputs 8~31 bytes long we incur a relatively large
// performance penalty (only achieve about 60% performance of loop which iterates over each char). So because
// of this we take special provisions to unroll the looping for these conditions.
```
Not sure, I can test it...
@normanmaurer interesting, it does appear to still be the case.
This PR as-is:
```
AsciiStringBenchmark.hashCodeBenchCharSequenceNew   8  thrpt  5  84938768.167 ± 3807502.233  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew  10  thrpt  5  75864099.463 ± 3345939.549  ops/s
```

Without the 8-31 bytes special case:
```
AsciiStringBenchmark.hashCodeBenchCharSequenceNew   8  thrpt  5  55068716.482 ± 8293068.816  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew  10  thrpt  5  52594830.642 ± 4131819.596  ops/s
```
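To illustrate why the special case matters, here is a minimal sketch of the unrolling idea being discussed. The class name, seed, and mixing function are illustrative placeholders (not Netty's actual `hashCodeAscii` constants or logic); the point is only the structure: a wide loop plus a small switch on the `length & mask` remainder, so short-to-medium inputs avoid per-char loop overhead.

```java
// Hypothetical sketch: loop unrolling for an ASCII hash, analogous in shape
// (not in constants) to the hashCodeAscii special-casing discussed above.
public final class AsciiHashSketch {
    private static final int SEED = 0xc2b2ae35; // placeholder seed, not Netty's

    // Naive baseline: one char per iteration.
    public static int hashNaive(CharSequence s) {
        int hash = SEED;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }

    // Unrolled variant: consume 4 chars per iteration, then a fall-through
    // switch handles the 0-3 char remainder (mirroring the length & 7 trick).
    public static int hashUnrolled(CharSequence s) {
        int hash = SEED;
        final int length = s.length(), remaining = length & 3;
        int i = 0;
        for (; i < length - remaining; i += 4) {
            hash = (((hash * 31 + s.charAt(i)) * 31 + s.charAt(i + 1))
                    * 31 + s.charAt(i + 2)) * 31 + s.charAt(i + 3);
        }
        switch (remaining) {
            case 3: hash = hash * 31 + s.charAt(i++); // fall through
            case 2: hash = hash * 31 + s.charAt(i++); // fall through
            case 1: hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }
}
```

Both variants compute the same value; the unrolled one simply does fewer loop-condition checks per char, which is where the benchmark delta for 8-31 byte inputs comes from.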
```diff
 public static int hashCodeAscii(CharSequence bytes) {
     int hash = HASH_CODE_ASCII_SEED;
-    final int remainingBytes = bytes.length() & 7;
+    final int length = bytes.length(), remainingBytes = length & 7;
```
I'll defer to @normanmaurer on this, but comma locals aren't good for readability.
```java
final long diff = startPos2 - startPos1;
if (length > 7) {
    final long end = baseOffset1 + remainingBytes;
    for (long i = baseOffset1 - 8 + length; i >= end; i -= 8) {
```
@normanmaurer IIRC Java aligns objects on word boundaries, which would imply arrays also start on word boundaries. Doesn't this imply that iterating backwards from a likely odd-numbered offset would result in torn reads? I'm surprised this isn't starting at the beginning.
@carl-mastrangelo I just left that aspect as it was before, and assumed the reason in this particular method at least was that statistically strings are more likely to differ near the end than the start (common prefixes). That consideration wouldn't apply to the other methods though which always cover all the bytes.
Also note that this is comparing arbitrary regions so the start might not align with a word boundary either, and even if it did it might not for the destination array. And presumably BYTE_ARRAY_BASE_OFFSET would need to be taken into account too.
So I guess it would need some experiments and if there was a big win for aligning the long reads in this way we could consider special-casing very large inputs for the other methods. But I'm not sure how common large inputs are in general since I think this is mainly used for headers?
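For readers following the discussion, here is a minimal plain-array sketch of the tail-first, 8-bytes-at-a-time comparison structure being debated. Netty's real version reads longs via `Unsafe` with base offsets and a precomputed offset diff; the class name and the `readLong` helper below are stand-ins for illustration only.

```java
// Hypothetical sketch of back-to-front, long-at-a-time equality checking.
public final class BytesEqualSketch {
    // Plain-array stand-in for an Unsafe 8-byte read (alignment-agnostic).
    private static long readLong(byte[] a, int index) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (a[index + i] & 0xffL);
        }
        return v;
    }

    public static boolean equals(byte[] b1, int start1, byte[] b2, int start2, int length) {
        final int remainingBytes = length & 7;
        final int diff = start2 - start1;
        // Walk from the tail toward the head one long at a time; as noted
        // above, strings with common prefixes tend to differ near the end.
        for (int i = start1 + length - 8; i >= start1 + remainingBytes; i -= 8) {
            if (readLong(b1, i) != readLong(b2, i + diff)) {
                return false;
            }
        }
        // Compare the leftover 0-7 head bytes individually.
        for (int i = 0; i < remainingBytes; i++) {
            if (b1[start1 + i] != b2[start2 + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note the alignment concern raised above: because `start1` and `start2` are arbitrary, neither the 8-byte reads here nor in the real `Unsafe`-based version are guaranteed to be word-aligned.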
> I think this is mainly used for headers?
Well, at least for gRPC, I think we use the constant time one usually, because we can't tell if a header is sensitive (like an auth header).
> Also note that this is comparing arbitrary regions so the start might not align with a word boundary either.
Yeah, I considered the same thing, but my gut says pretty much everyone starts at 0. Since this function is public and not specific to Hpack, we don't know the scope of usage. You are probably right that it doesn't matter; I was just expressing surprise at the original structure.
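Since the constant-time variant comes up here (gRPC using it for potentially sensitive headers), a minimal sketch of what "constant time" means for equality may help. This is the standard XOR-accumulate technique, not Netty's actual `equalsConstantTime` implementation (which, per the PR, works `Unsafe` long-at-a-time); the class name is a placeholder.

```java
// Hypothetical sketch of constant-time byte comparison: accumulate all
// differences with XOR/OR so running time does not depend on where (or
// whether) the inputs diverge -- no data-dependent early exit.
public final class ConstantTimeEqualsSketch {
    // Returns 1 when equal, 0 otherwise, mirroring Netty's int convention.
    public static int equalsConstantTime(byte[] b1, int start1, byte[] b2, int start2, int length) {
        int result = 0;
        for (int i = 0; i < length; i++) {
            // Every byte is always examined, even after a mismatch is found.
            result |= b1[start1 + i] ^ b2[start2 + i];
        }
        return result == 0 ? 1 : 0;
    }
}
```

The early-exit version above is faster on mismatches, which is exactly the timing signal a constant-time comparison must avoid leaking when the compared value is secret (e.g. an auth header).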
for style consistency
@netty-bot test this please

@carl-mastrangelo any more comments or are you happy now?

@netty-bot test this please

Also @ejona86

@njhill thanks a lot!
I left out the `AsciiStringBenchmark` results for brevity but they show ~5-10% improvement, depending on length.