Optimize Hpack and AsciiString hashcode and equals #8902

normanmaurer merged 2 commits into netty:4.1 from
Conversation
Motivation:

While looking at hpack header-processing hotspots I noticed some low-level too-big-to-inline methods which can be shrunk.

Modifications:

Reduce bytecode size and/or runtime operations used for the following methods:

- `PlatformDependent0.equals(byte[], ...)`
- `PlatformDependent0.equalsConstantTime(byte[], ...)`
- `PlatformDependent0.hashCodeAscii(byte[], int, int)`
- `PlatformDependent.hashCodeAscii(CharSequence)`

Result:

Existing benchmarks show decent improvement.

Before:
```
Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  17200229.374 ± 1701239.198  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   3386061.629 ±   72264.685  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    507579.209 ±   65883.951  ops/s
```

After:
```
Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  29221527.058 ± 4805825.836  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   6556251.645 ±  466115.199  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    879828.889 ±  148136.641  ops/s
```

Before:
```
Benchmark                          (size)  Mode  Cnt     Score    Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     4.263 ±  0.110  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     5.206 ±  0.133  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     8.160 ±  0.320  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    13.810 ±  0.751  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    89.077 ±  7.275  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   773.940 ± 24.579  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  7546.807 ± 110.395 ns/op
```

After:
```
Benchmark                          (size)  Mode  Cnt     Score    Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     3.337 ±  0.087  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     4.286 ±  0.194  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     7.817 ±  0.123  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    11.260 ±  0.412  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    84.255 ±  2.596  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   591.892 ±  5.136  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  6978.859 ± 285.043 ns/op
```
Can one of the admins verify this patch?

@netty-bot test this please
```java
final int length = bytes.length(), remainingBytes = length & 7;
// Benchmarking shows that by just naively looping for inputs 8~31 bytes long we incur a relatively large
// performance penalty (only achieve about 60% performance of loop which iterates over each char). So because
// of this we take special provisions to unroll the looping for these conditions.
```
Not sure, I can test it...
@normanmaurer interesting, it does appear to still be the case.
This PR as-is:
```
AsciiStringBenchmark.hashCodeBenchCharSequenceNew   8  thrpt  5  84938768.167 ± 3807502.233  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew  10  thrpt  5  75864099.463 ± 3345939.549  ops/s
```

Without the 8-31 bytes special case:
```
AsciiStringBenchmark.hashCodeBenchCharSequenceNew   8  thrpt  5  55068716.482 ± 8293068.816  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew  10  thrpt  5  52594830.642 ± 4131819.596  ops/s
```
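To illustrate why the special case matters, here is a minimal sketch of the unrolling idea being discussed. The class name, seed, and mixing function are illustrative placeholders (not Netty's actual `hashCodeAscii` constants or logic); the point is only the structure: a wide loop plus a small switch on the `length & mask` remainder, so short-to-medium inputs avoid per-char loop overhead.

```java
// Hypothetical sketch: loop unrolling for an ASCII hash, analogous in shape
// (not in constants) to the hashCodeAscii special-casing discussed above.
public final class AsciiHashSketch {
    private static final int SEED = 0xc2b2ae35; // placeholder seed, not Netty's

    // Naive baseline: one char per iteration.
    public static int hashNaive(CharSequence s) {
        int hash = SEED;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }

    // Unrolled variant: consume 4 chars per iteration, then a fall-through
    // switch handles the 0-3 char remainder (mirroring the length & 7 trick).
    public static int hashUnrolled(CharSequence s) {
        int hash = SEED;
        final int length = s.length(), remaining = length & 3;
        int i = 0;
        for (; i < length - remaining; i += 4) {
            hash = (((hash * 31 + s.charAt(i)) * 31 + s.charAt(i + 1))
                    * 31 + s.charAt(i + 2)) * 31 + s.charAt(i + 3);
        }
        switch (remaining) {
            case 3: hash = hash * 31 + s.charAt(i++); // fall through
            case 2: hash = hash * 31 + s.charAt(i++); // fall through
            case 1: hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }
}
```

Both variants compute the same value; the unrolled one simply does fewer loop-condition checks per char, which is where the benchmark delta for 8-31 byte inputs comes from.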
```diff
 public static int hashCodeAscii(CharSequence bytes) {
     int hash = HASH_CODE_ASCII_SEED;
-    final int remainingBytes = bytes.length() & 7;
+    final int length = bytes.length(), remainingBytes = length & 7;
```
I'll defer to @normanmaurer on this, but comma locals aren't good for readability.
```java
final long diff = startPos2 - startPos1;
if (length > 7) {
    final long end = baseOffset1 + remainingBytes;
    for (long i = baseOffset1 - 8 + length; i >= end; i -= 8) {
```
@normanmaurer IIRC Java aligns objects on word boundaries, which would imply arrays also start on word boundaries. Doesn't this imply that iterating backwards from a likely odd-numbered offset would result in torn reads? I'm surprised this isn't starting at the beginning.
@carl-mastrangelo I just left that aspect as it was before, and assumed the reason in this particular method at least was that statistically strings are more likely to differ near the end than the start (common prefixes). That consideration wouldn't apply to the other methods though which always cover all the bytes.
Also note that this is comparing arbitrary regions so the start might not align with a word boundary either, and even if it did it might not for the destination array. And presumably BYTE_ARRAY_BASE_OFFSET would need to be taken into account too.
So I guess it would need some experiments and if there was a big win for aligning the long reads in this way we could consider special-casing very large inputs for the other methods. But I'm not sure how common large inputs are in general since I think this is mainly used for headers?
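For readers following the discussion, here is a minimal plain-array sketch of the tail-first, 8-bytes-at-a-time comparison structure being debated. Netty's real version reads longs via `Unsafe` with base offsets and a precomputed offset diff; the class name and the `readLong` helper below are stand-ins for illustration only.

```java
// Hypothetical sketch of back-to-front, long-at-a-time equality checking.
public final class BytesEqualSketch {
    // Plain-array stand-in for an Unsafe 8-byte read (alignment-agnostic).
    private static long readLong(byte[] a, int index) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (a[index + i] & 0xffL);
        }
        return v;
    }

    public static boolean equals(byte[] b1, int start1, byte[] b2, int start2, int length) {
        final int remainingBytes = length & 7;
        final int diff = start2 - start1;
        // Walk from the tail toward the head one long at a time; as noted
        // above, strings with common prefixes tend to differ near the end.
        for (int i = start1 + length - 8; i >= start1 + remainingBytes; i -= 8) {
            if (readLong(b1, i) != readLong(b2, i + diff)) {
                return false;
            }
        }
        // Compare the leftover 0-7 head bytes individually.
        for (int i = 0; i < remainingBytes; i++) {
            if (b1[start1 + i] != b2[start2 + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note the alignment concern raised above: because `start1` and `start2` are arbitrary, neither the 8-byte reads here nor in the real `Unsafe`-based version are guaranteed to be word-aligned.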
> I think this is mainly used for headers?
Well, at least for gRPC, I think we use the constant time one usually, because we can't tell if a header is sensitive (like an auth header).
> Also note that this is comparing arbitrary regions so the start might not align with a word boundary either.
Yeah, I considered the same thing, but my gut says pretty much everyone starts at 0. Since this function is public and not specific to Hpack, we don't know the scope of usage. You are probably right that it doesn't matter; I was just expressing surprise at the original structure.
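Since the constant-time variant comes up here (gRPC using it for potentially sensitive headers), a minimal sketch of what "constant time" means for equality may help. This is the standard XOR-accumulate technique, not Netty's actual `equalsConstantTime` implementation (which, per the PR, works `Unsafe` long-at-a-time); the class name is a placeholder.

```java
// Hypothetical sketch of constant-time byte comparison: accumulate all
// differences with XOR/OR so running time does not depend on where (or
// whether) the inputs diverge -- no data-dependent early exit.
public final class ConstantTimeEqualsSketch {
    // Returns 1 when equal, 0 otherwise, mirroring Netty's int convention.
    public static int equalsConstantTime(byte[] b1, int start1, byte[] b2, int start2, int length) {
        int result = 0;
        for (int i = 0; i < length; i++) {
            // Every byte is always examined, even after a mismatch is found.
            result |= b1[start1 + i] ^ b2[start2 + i];
        }
        return result == 0 ? 1 : 0;
    }
}
```

The early-exit version above is faster on mismatches, which is exactly the timing signal a constant-time comparison must avoid leaking when the compared value is secret (e.g. an auth header).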
for style consistency
@netty-bot test this please

@carl-mastrangelo any more comments or are you happy now?

@netty-bot test this please

Also @ejona86

@njhill thanks a lot!
I left out the `AsciiStringBenchmark` results for brevity but they show ~5-10% improvement, depending on length.