-
-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Buffer alignment #6293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Buffer alignment #6293
Conversation
| int alignCapacity(int reqCapacity) { | ||
| int delta = reqCapacity & cacheAlignmentMask; | ||
| if (delta == 0) { | ||
| return reqCapacity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix formatting.
| // testInternalNioBuffer(128); | ||
| // testInternalNioBuffer(1024); | ||
| // testInternalNioBuffer(4 * 1024); | ||
| // testInternalNioBuffer(64 * 1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you do this ?
| PooledByteBufAllocator pool = new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64); | ||
| ByteBuf buff = pool.directBuffer(4096); | ||
| for(int i = 0; i < 4096; i++) { | ||
| buff.writeByte(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix formatting.
| public void testArenaMetricsCacheAlign() { | ||
| testArenaMetrics0(new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64), 100, 1, 1, 0); | ||
| } | ||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add empty line above
| } | ||
| buff.release(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line
| public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder, | ||
| int tinyCacheSize, int smallCacheSize, int normalCacheSize, | ||
| boolean useCacheForAllThreads) { | ||
| boolean useCacheForAllThreads, int cacheAlignment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiril-me we also need to keep the old constructor to not break the API.
| return reqCapacity; | ||
| } else { | ||
| return alignCapacity(reqCapacity); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this:
return cacheAlignment == 0 ? reqCapacity : alignCapacity(reqCapacity);| @Override | ||
| protected PoolChunk<byte[]> newUnpooledChunk(int capacity) { | ||
| return new PoolChunk<byte[]>(this, new byte[capacity], capacity); | ||
| return new PoolChunk<byte[]>(this, new byte[capacity], capacity, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so as we only do this for direct buffers why not rename it to directMemoryCacheAlignment or something like this. This also is true for the system property etc.
| memory = allocateDirect(capacity + cacheAlignment); | ||
| offset = offsetCacheLine(memory, cacheAlignmentMask); | ||
| } | ||
| return new PoolChunk<ByteBuffer>(this, memory, capacity, offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider changing this to:
return cacheAlignment == 0 ? new PoolChunk<ByteBuffer>(this, allocateDirect(capacity), capacity, 0) :
new PoolChunk<ByteBuffer>(this, allocateDirect(capacity + cacheAlignment), capacity, offsetCacheLine(memory, cacheAlignmentMask)) ;| protected PoolChunk<ByteBuffer> newChunk(int pageSize, int maxOrder, int pageShifts, int chunkSize) { | ||
| final ByteBuffer memory; | ||
| final int offset; | ||
| if (cacheAlignment == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as below
| pooledDirectBuffers[i].writeBytes(bytes); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line
| int block = size / 128; | ||
| for (int i = 0; i < pooledDirectBuffers.length; i++) { | ||
| byte[] bytes = new byte[block]; | ||
| rand.nextBytes(bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiril-me the allocating and filling of bytes[] should not happen in the benchmark itself, but be part of the @Setup otherwise it will affect the benchmark. Same goes fro everything else. that is not pooledDirectBuffers[i].writeBytes(bytes). Even better would be to also remove the array access here.
| } | ||
|
|
||
| @Benchmark | ||
| public void writeRead() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as below.
|
|
||
| import java.util.HashMap; | ||
| import java.util.Map; | ||
| /* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move copyright to the top of the file and also change year to 2017
| public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder, | ||
| int tinyCacheSize, int smallCacheSize, int normalCacheSize, | ||
| boolean useCacheForAllThreads) { | ||
| boolean useCacheForAllThreads, int cacheAlignment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verify cacheAlignment is >= 0
|
@kiril-me also please re-run benchmarks and update here once you are done |
|
|
||
| private PooledByteBufAllocator pooledAllocator; | ||
|
|
||
| private ByteBuf pooledDirectBuffers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: pooledDirectBuffers -> pooledDirectBuffer
| @Benchmark | ||
| public void write() { | ||
| pooledDirectBuffers.writeBytes(bytes); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiril-me also add a benchmark which just reads ? For this you will need to write in the doSetup() method tho.
|
@kiril-me also ensure you show the new numbers after the changes are in. |
|
@kiril-me please rebase on top of current 4.1 so it only includes your commit. |
|
@normanmaurer I made changes. Reworked benchmarks. Still need to make research how to make benchmarks stable. |
|
@kiril-me let me know once I should check again |
|
@normanmaurer I changed benchmarks. I have two direct buffers. First is default direct buffer for whom I calculate offset in case it was aligned. Because we want to have the miss-align buffer. The second is 64-byte aligned. The performance is visible for large buffer sizes. |
|
@kiril-me please share the new results. |
|
Result: |
|
@kiril-me didnt you say that there is a performance win with cache alignment ? All the numbers in your results suggest otherwise. |
|
I used average time benchmark. The lower value the better. Here is results for 1048576 buffer. |
|
@kiril-me ah doh! I did not notice you used |
|
Throughput measurement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't you share almost all the code while the only difference would be the alignOffset in some cases ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? I use alignOffset for the miss-align line. Yes, it will be used in some cases. I didn't find the way to change offset inside buffer once it was created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment that explains the 1137 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be static final and also please add a comment why you used 4
|
@kiril-me also thanks for all the effort! Looks very good :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scottmitch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look good! few small comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have to be a power of 2 for the mask below to work? if so should we enforce that somewhere for example: warn and go to the next positive power of 2, or set to 0?
MathUtil.safeFindNextPositivePowerOfTwo maybe useful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check inside PooledByteBufAllocator. Should I add it in PoolArena too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PooledByteBufAllocator is good enough IMHO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could be tertiary for slightly less code:
return delta == 0 ? reqCapacity : reqCapacity + directMemoryCacheAlignment - delta;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: else is not necessary because you return in the if statement above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: else is not necessary because you return in the if statement above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you make a temporary for size and sizeMask but not for alignOffset ... do we need any temporaries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I made it temporary as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious why the temporaries are necessary ... is this just preference or habit from dealing with volatiles/mutable state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's habit to be sure that data is mutable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
@kiril-me please squash |
64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data-structures over 64 bytes. Requiring padding to a multiple of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler and more efficient code. Modification: At the moment cache alignment must be setup manually. But probably it might be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files Result: Buffer alignment works better than miss-align cache.
|
@kiril-me once you signed the ICLA I can merge this one... Thanks! |
|
Done. When are you planning to release 4.0.45.Final? |
|
@kiril-me thanks... within the next two weeks. |


Motivation:
64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data-structures over 64 bytes.
Requiring padding to a multiple of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.
Modification:
At the moment cache alignment must be setup manually. But probably it might be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files
buffer/src/main/java/io/netty/buffer/PoolArena.java
buffer/src/main/java/io/netty/buffer/PoolChunk.java
buffer/src/main/java/io/netty/buffer/PooledByteBuf.java
buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
buffer/src/test/java/io/netty/buffer/AbstractByteBufTest.java
buffer/src/test/java/io/netty/buffer/PoolArenaTest.java
buffer/src/test/java/io/netty/buffer/PooledByteBufAllocatorTest.java
microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorBenchmark.java
microbench/src/main/java/io/netty/microbench/buffer/PooledByteBufAllocatorAlignBenchmark.java
Benchmarks show better write and read performance on large buffer size.