Skip to content

Optimize Hpack and AsciiString hashcode and equals#8902

Merged
normanmaurer merged 2 commits intonetty:4.1from
njhill:fast-arr-cmp
Mar 8, 2019
Merged

Optimize Hpack and AsciiString hashcode and equals#8902
normanmaurer merged 2 commits intonetty:4.1from
njhill:fast-arr-cmp

Conversation

@njhill
Copy link
Copy Markdown
Member

@njhill njhill commented Mar 1, 2019

Motivation:

While looking at hpack header-processing hotspots I noticed some low level too-big-to-inline methods which can be shrunk.

Modifications:

Reduce bytecode size and/or runtime operations used for the following methods:

  • PlatformDependent0.equals(byte[], ...)
  • PlatformDependent0.equalsConstantTime(byte[], ...)
  • PlatformDependent0.hashCodeAscii(byte[],int,int)
  • PlatformDependent.hashCodeAscii(CharSequence)

Result:

Existing benchmarks show decent improvement

Before

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  17200229.374 ± 1701239.198  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   3386061.629 ±   72264.685  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    507579.209 ±   65883.951  ops/s

After

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  29221527.058 ± 4805825.836  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   6556251.645 ±  466115.199  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    879828.889 ±  148136.641  ops/s

Before

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     4.263 ±   0.110  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     5.206 ±   0.133  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     8.160 ±   0.320  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    13.810 ±   0.751  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    89.077 ±   7.275  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   773.940 ±  24.579  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  7546.807 ± 110.395  ns/op

After

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     3.337 ±   0.087  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     4.286 ±   0.194  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     7.817 ±   0.123  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    11.260 ±   0.412  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    84.255 ±   2.596  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   591.892 ±   5.136  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  6978.859 ± 285.043  ns/op

I left out the AsciiStringBenchmark results for brevity but they show ~5-10% improvement, depending on length.

Motivation:

While looking at hpack header-processing hotspots I noticed some low
level too-big-to-inline methods which can be shrunk.

Modifications:

Reduce bytecode size and/or runtime operations used for the following
methods:

PlatformDependent0.equals(byte[], ...)
PlatformDependent0.equalsConstantTime(byte[], ...)
PlatformDependent0.hashCodeAscii(byte[],int,int)
PlatformDependent.hashCodeAscii(CharSequence)

Result:

Existing benchmarks show decent improvement

Before

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  17200229.374 ± 1701239.198  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   3386061.629 ±   72264.685  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    507579.209 ±   65883.951  ops/s

After

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  29221527.058 ± 4805825.836  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   6556251.645 ±  466115.199  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    879828.889 ±  148136.641  ops/s

Before

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     4.263 ±   0.110  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     5.206 ±   0.133  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     8.160 ±   0.320  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    13.810 ±   0.751  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    89.077 ±   7.275  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   773.940 ±  24.579  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  7546.807 ± 110.395  ns/op

After

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     3.337 ±   0.087  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     4.286 ±   0.194  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     7.817 ±   0.123  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    11.260 ±   0.412  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    84.255 ±   2.596  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   591.892 ±   5.136  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  6978.859 ± 285.043  ns/op
@netty-bot
Copy link
Copy Markdown

Can one of the admins verify this patch?

@normanmaurer
Copy link
Copy Markdown
Member

@netty-bot test this please

final int length = bytes.length(), remainingBytes = length & 7;
// Benchmarking shows that by just naively looping for inputs 8~31 bytes long we incur a relatively large
// performance penalty (only achieve about 60% performance of loop which iterates over each char). So because
// of this we take special provisions to unroll the looping for these conditions.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njhill is this comment still true ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, I can test it...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@normanmaurer interesting, it does appear to still be the case.

This PR as-is:

AsciiStringBenchmark.hashCodeBenchCharSequenceNew       8  thrpt    5   84938768.167 ±  3807502.233  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew      10  thrpt    5   75864099.463 ±  3345939.549  ops/s

Without 8-31 bytes special case:

AsciiStringBenchmark.hashCodeBenchCharSequenceNew       8  thrpt    5   55068716.482 ±  8293068.816  ops/s
AsciiStringBenchmark.hashCodeBenchCharSequenceNew      10  thrpt    5   52594830.642 ±  4131819.596  ops/s

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njhill thanks for testing :)

public static int hashCodeAscii(CharSequence bytes) {
int hash = HASH_CODE_ASCII_SEED;
final int remainingBytes = bytes.length() & 7;
final int length = bytes.length(), remainingBytes = length & 7;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @normanmaurer on this, but comma locals aren't good for readability.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @carl-mastrangelo

final long diff = startPos2 - startPos1;
if (length > 7) {
final long end = baseOffset1 + remainingBytes;
for (long i = baseOffset1 - 8 + length; i >= end; i -= 8) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@normanmaurer IIRC Java aligns objects on word boundaries, which would imply arrays also start on word boundaries. Doesn't this imply that iterating from a likely odd number offset backwards would result in torn reads? Im surprised this isn't starting at the beginning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. @njhill mind checking ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carl-mastrangelo I just left that aspect as it was before, and assumed the reason in this particular method at least was that statistically strings are more likely to differ near the end than the start (common prefixes). That consideration wouldn't apply to the other methods though which always cover all the bytes.

Also note that this is comparing arbitrary regions so the start might not align with a word boundary either, and even if it did it might not for the destination array. And presumably BYTE_ARRAY_BASE_OFFSET would need to be taken into account too.

So I guess it would need some experiments and if there was a big win for aligning the long reads in this way we could consider special-casing very large inputs for the other methods. But I'm not sure how common large inputs are in general since I think this is mainly used for headers?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is mainly used for headers?

Well, at least for gRPC, I think we use the constant time one usually, because we can't tell if a header is sensitive (like an auth header).

Also note that this is comparing arbitrary regions so the start might not align with a word boundary either.

Yeah, I considered the same thing, but my gut says pretty much everyone starts at 0. Since this function is public and not specific to Hpack, we don't know the scope of usage. You are probably right that it doesn't matter, I was jut expressing surprise for the original structure.

@normanmaurer
Copy link
Copy Markdown
Member

@netty-bot test this please

Copy link
Copy Markdown
Contributor

@vkostyukov vkostyukov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@normanmaurer
Copy link
Copy Markdown
Member

@carl-mastrangelo any more comments or you are happy now ?

@normanmaurer normanmaurer reopened this Mar 6, 2019
@normanmaurer
Copy link
Copy Markdown
Member

@netty-bot test this please

@normanmaurer
Copy link
Copy Markdown
Member

Also @ejona86

@normanmaurer normanmaurer added this to the 4.1.34.Final milestone Mar 7, 2019
Copy link
Copy Markdown
Member

@carl-mastrangelo carl-mastrangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@normanmaurer normanmaurer merged commit b2eaab0 into netty:4.1 Mar 8, 2019
@normanmaurer
Copy link
Copy Markdown
Member

@njhill thanks a lot!

normanmaurer pushed a commit that referenced this pull request Mar 8, 2019
Motivation:

While looking at hpack header-processing hotspots I noticed some low
level too-big-to-inline methods which can be shrunk.

Modifications:

Reduce bytecode size and/or runtime operations used for the following
methods:

PlatformDependent0.equals(byte[], ...)
PlatformDependent0.equalsConstantTime(byte[], ...)
PlatformDependent0.hashCodeAscii(byte[],int,int)
PlatformDependent.hashCodeAscii(CharSequence)

Result:

Existing benchmarks show decent improvement

Before

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  17200229.374 ± 1701239.198  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   3386061.629 ±   72264.685  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    507579.209 ±   65883.951  ops/s

After

Benchmark                     (size)   Mode  Cnt         Score         Error  Units
HpackUtilBenchmark.newEquals   SMALL  thrpt    5  29221527.058 ± 4805825.836  ops/s
HpackUtilBenchmark.newEquals  MEDIUM  thrpt    5   6556251.645 ±  466115.199  ops/s
HpackUtilBenchmark.newEquals   LARGE  thrpt    5    879828.889 ±  148136.641  ops/s

Before

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     4.263 ±   0.110  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     5.206 ±   0.133  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     8.160 ±   0.320  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    13.810 ±   0.751  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    89.077 ±   7.275  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   773.940 ±  24.579  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  7546.807 ± 110.395  ns/op

After

Benchmark                          (size)  Mode  Cnt     Score     Error  Units
PlatformDepBench.unsafeBytesEqual       4  avgt   10     3.337 ±   0.087  ns/op
PlatformDepBench.unsafeBytesEqual      10  avgt   10     4.286 ±   0.194  ns/op
PlatformDepBench.unsafeBytesEqual      50  avgt   10     7.817 ±   0.123  ns/op
PlatformDepBench.unsafeBytesEqual     100  avgt   10    11.260 ±   0.412  ns/op
PlatformDepBench.unsafeBytesEqual    1000  avgt   10    84.255 ±   2.596  ns/op
PlatformDepBench.unsafeBytesEqual   10000  avgt   10   591.892 ±   5.136  ns/op
PlatformDepBench.unsafeBytesEqual  100000  avgt   10  6978.859 ± 285.043  ns/op
@njhill njhill deleted the fast-arr-cmp branch March 8, 2019 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants