
Conversation

@kiril-me (Contributor) commented Jan 29, 2017:

Motivation:

64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data structures over 64 bytes.
Padding to a multiple of 64 bytes allows SIMD instructions to be used consistently in loops without additional conditional checks, which should allow for simpler and more efficient code.
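
As a rough illustration of the padding this implies (a sketch only, not this PR's code; the method name is made up):

static int alignTo64(int reqCapacity) {
    // Round a requested capacity up to the next multiple of 64 bytes.
    // This works because 64 is a power of two: the low 6 bits hold the remainder.
    int mask = 64 - 1;
    int delta = reqCapacity & mask;          // bytes past the last 64-byte boundary
    return delta == 0 ? reqCapacity : reqCapacity + 64 - delta;
}

For example, alignTo64(100) returns 128.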

Modification:

At the moment the cache alignment must be set up manually, but it could potentially be read from the system. The original code was introduced by @normanmaurer in https://github.com/netty/netty/pull/4726/files

buffer/src/main/java/io/netty/buffer/PoolArena.java
buffer/src/main/java/io/netty/buffer/PoolChunk.java
buffer/src/main/java/io/netty/buffer/PooledByteBuf.java
buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
buffer/src/test/java/io/netty/buffer/AbstractByteBufTest.java
buffer/src/test/java/io/netty/buffer/PoolArenaTest.java
buffer/src/test/java/io/netty/buffer/PooledByteBufAllocatorTest.java
microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorBenchmark.java

microbench/src/main/java/io/netty/microbench/buffer/PooledByteBufAllocatorAlignBenchmark.java

Result:
Benchmark                                       (cacheAlign)   (size)  Mode  Cnt   Score   Error  Units
PooledByteBufAllocatorAlignBenchmark.read                  0    01024  avgt   25   0.013 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    04096  avgt   25   0.050 ± 0.004  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    16384  avgt   25   0.202 ± 0.018  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    65536  avgt   25   0.840 ± 0.065  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0  1048576  avgt   25  23.778 ± 4.068  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    01024  avgt   25   0.012 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    04096  avgt   25   0.047 ± 0.003  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    16384  avgt   25   0.200 ± 0.022  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    65536  avgt   25   0.749 ± 0.079  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64  1048576  avgt   25  13.331 ± 1.396  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    01024  avgt   25   0.013 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    04096  avgt   25   0.050 ± 0.004  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    16384  avgt   25   0.220 ± 0.027  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    65536  avgt   25   0.830 ± 0.067  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0  1048576  avgt   25  16.060 ± 0.484  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    01024  avgt   25   0.012 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    04096  avgt   25   0.045 ± 0.003  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    16384  avgt   25   0.177 ± 0.011  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    65536  avgt   25   0.746 ± 0.076  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64  1048576  avgt   25  14.150 ± 0.619  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    01024  avgt   25   0.023 ± 0.002  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    04096  avgt   25   0.094 ± 0.007  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    16384  avgt   25   0.380 ± 0.028  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    65536  avgt   25   1.477 ± 0.127  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0  1048576  avgt   25  27.154 ± 2.389  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    01024  avgt   25   0.021 ± 0.002  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    04096  avgt   25   0.087 ± 0.009  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    16384  avgt   25   0.353 ± 0.037  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    65536  avgt   25   1.367 ± 0.112  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64  1048576  avgt   25  22.501 ± 1.552  ms/op

The benchmarks show better read and write performance for large buffer sizes.

int alignCapacity(int reqCapacity) {
    int delta = reqCapacity & cacheAlignmentMask;
    if (delta == 0) {
        return reqCapacity;

Member:

please fix formatting.

// testInternalNioBuffer(128);
// testInternalNioBuffer(1024);
// testInternalNioBuffer(4 * 1024);
// testInternalNioBuffer(64 * 1024);

Member:

Why did you do this?

PooledByteBufAllocator pool = new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64);
ByteBuf buff = pool.directBuffer(4096);
for (int i = 0; i < 4096; i++) {
    buff.writeByte(100);

Member:

please fix formatting.

public void testArenaMetricsCacheAlign() {
    testArenaMetrics0(new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64), 100, 1, 1, 0);
}
@Test

Member:

Add empty line above

    }
    buff.release();
}

Member:

remove empty line

 public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder,
                               int tinyCacheSize, int smallCacheSize, int normalCacheSize,
-                              boolean useCacheForAllThreads) {
+                              boolean useCacheForAllThreads, int cacheAlignment) {

Member:

@kiril-me we also need to keep the old constructor to not break the API.

        return reqCapacity;
    } else {
        return alignCapacity(reqCapacity);
    }

Member:

make this:

return cacheAlignment == 0 ? reqCapacity : alignCapacity(reqCapacity);

 @Override
 protected PoolChunk<byte[]> newUnpooledChunk(int capacity) {
-    return new PoolChunk<byte[]>(this, new byte[capacity], capacity);
+    return new PoolChunk<byte[]>(this, new byte[capacity], capacity, 0);

Member:

so as we only do this for direct buffers why not rename it to directMemoryCacheAlignment or something like this. This also is true for the system property etc.

        memory = allocateDirect(capacity + cacheAlignment);
        offset = offsetCacheLine(memory, cacheAlignmentMask);
    }
    return new PoolChunk<ByteBuffer>(this, memory, capacity, offset);

Member:

consider changing this to:

return cacheAlignment == 0 ? new PoolChunk<ByteBuffer>(this, allocateDirect(capacity), capacity, 0) :
        new PoolChunk<ByteBuffer>(this, allocateDirect(capacity + cacheAlignment), capacity, offsetCacheLine(memory, cacheAlignmentMask));
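
For context, the arithmetic a helper like offsetCacheLine has to perform can be sketched as follows (illustrative only; the signature and parameter names are assumptions, not Netty's actual code):

static int cacheLineOffset(long address, int alignment) {
    // How many bytes to skip so that (address + offset) lands on the next
    // cache-line boundary; alignment must be a power of two.
    int mask = alignment - 1;
    return (int) ((alignment - (address & mask)) & mask); // 0 if already aligned
}

Since the buffer is over-allocated by cacheAlignment bytes above, skipping up to alignment - 1 bytes still leaves the full requested capacity available.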

protected PoolChunk<ByteBuffer> newChunk(int pageSize, int maxOrder, int pageShifts, int chunkSize) {
    final ByteBuffer memory;
    final int offset;
    if (cacheAlignment == 0) {

Member:

same as below

        pooledDirectBuffers[i].writeBytes(bytes);
    }
}

Member:

remove empty line

int block = size / 128;
for (int i = 0; i < pooledDirectBuffers.length; i++) {
    byte[] bytes = new byte[block];
    rand.nextBytes(bytes);

Member:

@kiril-me the allocating and filling of bytes[] should not happen in the benchmark itself, but should be part of the @Setup, otherwise it will affect the benchmark. The same goes for everything else that is not pooledDirectBuffers[i].writeBytes(bytes). Even better would be to also remove the array access here.
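
A minimal sketch of the structure being suggested (class name, size, and allocator settings are assumptions, not the PR's actual benchmark):

import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

@State(Scope.Benchmark)
public class AlignedWriteBenchmarkSketch {

    private static final int SIZE = 64 * 1024;

    private ByteBuf pooledDirectBuffer;
    private byte[] bytes;

    @Setup
    public void doSetup() {
        // Allocation and filling happen once here, so the benchmark body
        // measures only the writeBytes(...) call itself.
        pooledDirectBuffer = new PooledByteBufAllocator(true).directBuffer(SIZE);
        bytes = new byte[SIZE];
        new Random(42).nextBytes(bytes);
    }

    @Benchmark
    public void write() {
        pooledDirectBuffer.clear();
        pooledDirectBuffer.writeBytes(bytes);
    }

    @TearDown
    public void doTearDown() {
        pooledDirectBuffer.release();
    }
}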

}

@Benchmark
public void writeRead() {

Member:

same comment as below.


import java.util.HashMap;
import java.util.Map;
/*

Member:

move the copyright to the top of the file and also change the year to 2017

 public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder,
                               int tinyCacheSize, int smallCacheSize, int normalCacheSize,
-                              boolean useCacheForAllThreads) {
+                              boolean useCacheForAllThreads, int cacheAlignment) {

Member:

verify cacheAlignment is >= 0
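
A sketch of the kind of check meant here (the exact message text is an assumption):

if (cacheAlignment < 0) {
    throw new IllegalArgumentException("cacheAlignment: " + cacheAlignment + " (expected: >= 0)");
}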

@normanmaurer (Member):

@kiril-me also please re-run benchmarks and update here once you are done


private PooledByteBufAllocator pooledAllocator;

private ByteBuf pooledDirectBuffers;

Member:

nit: pooledDirectBuffers -> pooledDirectBuffer

@Benchmark
public void write() {
pooledDirectBuffers.writeBytes(bytes);
}

Member:

@kiril-me also add a benchmark which just reads? For that you will need to do the writing in the doSetup() method, though.

@normanmaurer (Member):

@kiril-me also ensure you show the new numbers after the changes are in.

@normanmaurer (Member):

@kiril-me please rebase on top of current 4.1 so it only includes your commit.

@kiril-me (Contributor, Author):

@normanmaurer I made the changes and reworked the benchmarks. I still need to research how to make the benchmarks stable.

@normanmaurer (Member):

@kiril-me let me know once I should check again

@kiril-me (Contributor, Author) commented Feb 1, 2017:

@normanmaurer I changed the benchmarks. I now have two direct buffers. The first is the default direct buffer, for which I calculate an offset in case it happens to be aligned, because we want a misaligned buffer to compare against. The second is 64-byte aligned. The performance difference is visible for large buffer sizes.

@normanmaurer (Member):

@kiril-me please share the new results.

@kiril-me (Contributor, Author) commented Feb 1, 2017:

Result:

Benchmark                                       (cacheAlign)   (size)  Mode  Cnt   Score   Error  Units
PooledByteBufAllocatorAlignBenchmark.read                  0    01024  avgt   25   0.013 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    04096  avgt   25   0.050 ± 0.004  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    16384  avgt   25   0.202 ± 0.018  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0    65536  avgt   25   0.840 ± 0.065  ms/op
PooledByteBufAllocatorAlignBenchmark.read                  0  1048576  avgt   25  23.778 ± 4.068  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    01024  avgt   25   0.012 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    04096  avgt   25   0.047 ± 0.003  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    16384  avgt   25   0.200 ± 0.022  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64    65536  avgt   25   0.749 ± 0.079  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64  1048576  avgt   25  13.331 ± 1.396  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    01024  avgt   25   0.013 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    04096  avgt   25   0.050 ± 0.004  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    16384  avgt   25   0.220 ± 0.027  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0    65536  avgt   25   0.830 ± 0.067  ms/op
PooledByteBufAllocatorAlignBenchmark.write                 0  1048576  avgt   25  16.060 ± 0.484  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    01024  avgt   25   0.012 ± 0.001  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    04096  avgt   25   0.045 ± 0.003  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    16384  avgt   25   0.177 ± 0.011  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64    65536  avgt   25   0.746 ± 0.076  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64  1048576  avgt   25  14.150 ± 0.619  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    01024  avgt   25   0.023 ± 0.002  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    04096  avgt   25   0.094 ± 0.007  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    16384  avgt   25   0.380 ± 0.028  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0    65536  avgt   25   1.477 ± 0.127  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead             0  1048576  avgt   25  27.154 ± 2.389  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    01024  avgt   25   0.021 ± 0.002  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    04096  avgt   25   0.087 ± 0.009  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    16384  avgt   25   0.353 ± 0.037  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64    65536  avgt   25   1.367 ± 0.112  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64  1048576  avgt   25  22.501 ± 1.552  ms/op

@normanmaurer (Member):

@kiril-me didn't you say that there is a performance win with cache alignment? All the numbers in your results suggest otherwise.

@kiril-me (Contributor, Author) commented Feb 1, 2017:

I used the average-time benchmark mode, so the lower the value the better. Here are the results for the 1048576-byte buffer.

PooledByteBufAllocatorAlignBenchmark.read                  0  1048576  avgt   25  23.778 ± 4.068  ms/op
PooledByteBufAllocatorAlignBenchmark.read                 64  1048576  avgt   25  13.331 ± 1.396  ms/op

PooledByteBufAllocatorAlignBenchmark.write                 0  1048576  avgt   25  16.060 ± 0.484  ms/op
PooledByteBufAllocatorAlignBenchmark.write                64  1048576  avgt   25  14.150 ± 0.619  ms/op

PooledByteBufAllocatorAlignBenchmark.writeRead             0  1048576  avgt   25  27.154 ± 2.389  ms/op
PooledByteBufAllocatorAlignBenchmark.writeRead            64  1048576  avgt   25  22.501 ± 1.552  ms/op
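
For reads at this size that is 23.778 ms vs 13.331 ms, roughly 1.8x faster with 64-byte alignment.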

@normanmaurer (Member):

@kiril-me ah, doh! I did not notice you used avgt!

@kiril-me (Contributor, Author) commented Feb 1, 2017:

Throughput measurements (higher is better):

Benchmark                                       (cacheAlign)   (size)   Mode  Cnt   Score   Error   Units
PooledByteBufAllocatorAlignBenchmark.read                  0    01024  thrpt   25  77.005 ± 7.958  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                  0    04096  thrpt   25  19.920 ± 1.757  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                  0    16384  thrpt   25   5.038 ± 0.544  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                  0    65536  thrpt   25   1.170 ± 0.156  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                  0  1048576  thrpt   25   0.052 ± 0.003  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                 64    01024  thrpt   25  81.833 ± 6.673  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                 64    04096  thrpt   25  21.035 ± 1.704  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                 64    16384  thrpt   25   5.465 ± 0.443  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                 64    65536  thrpt   25   1.296 ± 0.112  ops/ms
PooledByteBufAllocatorAlignBenchmark.read                 64  1048576  thrpt   25   0.076 ± 0.007  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                 0    01024  thrpt   25  77.216 ± 7.052  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                 0    04096  thrpt   25  19.165 ± 1.373  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                 0    16384  thrpt   25   4.969 ± 0.332  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                 0    65536  thrpt   25   1.241 ± 0.087  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                 0  1048576  thrpt   25   0.062 ± 0.003  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                64    01024  thrpt   25  85.550 ± 6.341  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                64    04096  thrpt   25  21.650 ± 1.796  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                64    16384  thrpt   25   5.365 ± 0.455  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                64    65536  thrpt   25   1.323 ± 0.096  ops/ms
PooledByteBufAllocatorAlignBenchmark.write                64  1048576  thrpt   25   0.074 ± 0.004  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead             0    01024  thrpt   25  42.563 ± 4.060  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead             0    04096  thrpt   25  10.743 ± 0.958  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead             0    16384  thrpt   25   2.688 ± 0.190  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead             0    65536  thrpt   25   0.670 ± 0.042  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead             0  1048576  thrpt   25   0.040 ± 0.003  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead            64    01024  thrpt   25  44.415 ± 4.095  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead            64    04096  thrpt   25  11.130 ± 0.854  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead            64    16384  thrpt   25   2.896 ± 0.182  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead            64    65536  thrpt   25   0.717 ± 0.047  ops/ms
PooledByteBufAllocatorAlignBenchmark.writeRead            64  1048576  thrpt   25   0.043 ± 0.004  ops/ms

Member:

couldn't you share almost all the code, while the only difference would be the alignOffset in some cases?

Author:

What do you mean? I use alignOffset for the misaligned case. Yes, it will only be used in some cases. I didn't find a way to change the offset inside a buffer once it has been created.

Member:

never mind...

Member:

can you add a comment that explains the 1137?

Member:

This can be static final; also please add a comment explaining why you used 4.

@normanmaurer (Member):

@kiril-me also thanks for all the effort! Looks very good :)

Comment:

MAJOR: Constructor has 9 parameters, which is greater than 7 authorized. (rule)

@Scottmitch (Member) left a comment:

changes look good! few small comments.

Member:

does this have to be a power of 2 for the mask below to work? If so, should we enforce that somewhere, for example: warn and go to the next positive power of 2, or set to 0?

MathUtil.safeFindNextPositivePowerOfTwo may be useful here.
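
The mask trick only behaves like a modulo when the alignment is a power of two; a small sketch of why:

int reqCapacity = 100;
int alignment = 64;              // power of two: 0b100_0000
int mask = alignment - 1;        // 0b011_1111
int delta = reqCapacity & mask;  // 36, same as 100 % 64
// With a non-power-of-two alignment such as 48, mask = 47 = 0b10_1111 and
// 100 & 47 == 36 while 100 % 48 == 4, so the mask no longer computes the remainder.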

Author:

I added a check inside PooledByteBufAllocator. Should I add it in PoolArena too?

Member:

PooledByteBufAllocator is good enough IMHO

Member:

nit: could be a ternary for slightly less code:

return delta == 0 ? reqCapacity : reqCapacity + directMemoryCacheAlignment - delta;

Member:

nit: else is not necessary because you return in the if statement above

Member:

nit: else is not necessary because you return in the if statement above

Member:

you make a temporary for size and sizeMask but not for alignOffset ... do we need any temporaries?

Author:

yes, I made it temporary as well.

Member:

just curious why the temporaries are necessary ... is this just preference or habit from dealing with volatiles/mutable state?

@kiril-me (Contributor, Author) commented Feb 4, 2017:

it's a habit to be sure that the data is mutable

Member:

+1

Member:

+1

@normanmaurer normanmaurer self-assigned this Feb 3, 2017
@normanmaurer normanmaurer added this to the 4.0.45.Final milestone Feb 3, 2017
@normanmaurer (Member):

@kiril-me please squash

    Motivation:

    64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data structures over 64 bytes.
    Padding to a multiple of 64 bytes allows SIMD instructions to be used consistently in loops without additional conditional checks, which should allow for simpler and more efficient code.

    Modification:

    At the moment the cache alignment must be set up manually, but it could potentially be read from the system. The original code was introduced by @normanmaurer in https://github.com/netty/netty/pull/4726/files

    Result:

    Aligned buffers perform better than misaligned ones.
@normanmaurer (Member) commented Feb 6, 2017:

@kiril-me once you have signed the ICLA I can merge this one... Thanks!

http://netty.io/s/icla

@kiril-me (Contributor, Author) commented Feb 6, 2017:

Done. When are you planning to release 4.0.45.Final?

@normanmaurer (Member):

@kiril-me thanks... within the next two weeks.

@normanmaurer (Member):

Cherry-picked into 4.1 (66b9be3) and 4.0 (2f0b079)

@kiril-me thanks a lot!
