core,netty: retain buffer in MessageFramer to reuse for subsequent messages #2092
carl-mastrangelo wants to merge 3 commits into grpc:master from
Conversation
When a buffer is allocated in the MessageFramer, the capacity is at least 4K in size, which leaves a lot of space wasted on small messages. This change alters the WritableBuffers to attempt to split themselves into a "remaining" buffer that reuses space. This should only have a serious impact on streaming, and only when messages are small. Benchmarks for FlowControlled messages per second are promising, showing at least a 10% improvement.

Results (all rows are `i.g.benchmarks.netty.FlowControlledMessagesPerSecondBenchmark.stream:messagesPerSecond`, responseSize SMALL, thrpt mode, Cnt 10; scores in ops/s):

| channelCount | clientExecutor | maxConcurrentStreams | Before | After | Speedup |
|---|---|---|---|---|---|
| 1 | DEFAULT | 1 | 755282.751 ±29529.715 | 905364.204 ±29097.292 | 1.198708964 |
| 1 | DEFAULT | 2 | 791301.174 ±20232.22 | 994319.597 ±40626.872 | 1.256562772 |
| 1 | DEFAULT | 10 | 658210.954 ±21180.062 | 857288.67 ±34114.984 | 1.302452754 |
| 1 | DEFAULT | 100 | 657083.367 ±19745.56 | 944762.512 ±19204.798 | 1.437812246 |
| 1 | DIRECT | 1 | 887901.239 ±19285.937 | 1343311.784 ±31011.844 | 1.512906757 |
| 1 | DIRECT | 2 | 970308.851 ±27653.139 | 1307188.816 ±35358.558 | 1.347188387 |
| 1 | DIRECT | 10 | 735228.889 ±24770.534 | 1199052.762 ±33300.275 | 1.630856431 |
| 1 | DIRECT | 100 | 716183.74 ±12900.773 | 1127230.832 ±33133.574 | 1.573940833 |
| 2 | DEFAULT | 1 | 1291716.222 ±31902.31 | 1573493.799 ±39530.656 | 1.218142013 |
| 2 | DEFAULT | 2 | 1268950.933 ±42048.729 | 1536156.083 ±34455.726 | 1.210571696 |
| 2 | DEFAULT | 10 | 1128048.116 ±26606.668 | 1356561.752 ±23734.552 | 1.20257437 |
| 2 | DEFAULT | 100 | 988027.512 ±23084.526 | 1347382.882 ±30614.784 | 1.363709882 |
| 2 | DIRECT | 1 | 1478108.147 ±65561.972 | 2163858.787 ±79516.791 | 1.463938069 |
| 2 | DIRECT | 2 | 1524911.028 ±61438.415 | 2176216.821 ±73718.403 | 1.427110685 |
| 2 | DIRECT | 10 | 1313905.422 ±36649.59 | 2055314.846 ±73080.768 | 1.564279142 |
| 2 | DIRECT | 100 | 1117786.572 ±35215.016 | 1865375.865 ±49877.609 | 1.668812197 |
| 4 | DEFAULT | 1 | 1878378.751 ±26112.523 | 1928202.574 ±43622.205 | 1.026524908 |
| 4 | DEFAULT | 2 | 1982666.038 ±24621.442 | 1941969.538 ±40390.353 | 0.9794738503 |
| 4 | DEFAULT | 10 | 1707623.426 ±20835.524 | 1725687.175 ±18794.404 | 1.010578298 |
| 4 | DEFAULT | 100 | ≈0 | 1490847.182 ±43668.394 | n/a |
| 4 | DIRECT | 1 | 2375070.989 ±53326.398 | 3600611.491 ±68078.043 | 1.51600163 |
| 4 | DIRECT | 2 | 2462611.986 ±49230.177 | 3370385.331 ±73550.54 | 1.368622158 |
| 4 | DIRECT | 10 | 2108888.769 ±37617.889 | 3179585.616 ±75017.316 | 1.507706648 |
| 4 | DIRECT | 100 | ≈0 | 2292144.755 ±69949.443 | n/a |
@carl-mastrangelo, we can't merge this right now, right? Because otherwise we could leak buffers?
I'm pretty sure we already leak buffers if close/flush is never called.
It is worth noting that reducing the minimum size of the buffer in NettyWritableBufferAllocator to 32 produces a similar effect. It looks like it may be slightly faster for the Flow Control benchmark.
Which is not currently possible, except for unary requests that lack a halfClose. It's hard to imagine a unary request lacking a halfClose that isn't a blatant bug in the application, and one that wouldn't be noticed when the unary request is never sent. That leakage is impossible when using our stubs. With this PR, the only way we don't leak is if halfClose/close is called (note that cancel() isn't enough). That impacts every outbound streaming RPC. That concerns me, and doesn't seem like something that can be ignored.
That sounds like a much easier solution. If it gets us performance in the same ballpark, I'd be much more eager to take it. If we're going to retain direct buffers in
From skimming over the code, it seems like here we are also allocating a 4K buffer when we know we only need 5 bytes: https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/MessageFramer.java#L204
I agree with Eric that this sounds much simpler. Why 32? Is that specific to the benchmark payload size? Do we know how often piggy-backed writes in the message framer actually happen?
32 = 5 bytes of frame plus enough space to hold small messages. The next size down is 16, which my gut says is too small. Not changing the smallest size was based on a conversation with Louis; I was under the impression that changing the size was not feasible.

Reusing the buffer also has some benefits, assuming that chunks are returned in power-of-two sizes. A stream of large messages that straddle a power-of-two boundary would still cause a lot of waste (possibly up to half the buffer). If the allocator instead returns chunks in size classes spaced 2^-3 apart (so there is never more than 12.5% waste), then it doesn't make much sense to reuse the buffer. That is a decision for the WritableBufferAllocator and the WritableBuffer: they can decide whether splitting the buffer makes sense based on information the MessageFramer doesn't have, which is why it can return null.

As for your question about piggy-backed writes: they happen in streaming RPCs, which is why the benchmarks no longer OOM with this change. Another common case is streaming large chunks of data from a database, as Cloud Spanner does. Having the stream close out the buffer safely is possible.
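The split-and-reuse idea can be sketched with plain `java.nio.ByteBuffer` (a hypothetical illustration with made-up names and thresholds; the real change lives in the `WritableBuffer` implementations and uses the transport's own buffer types):

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch: after framing a small message into a large chunk,
 * hand back the unused tail so the next message doesn't need a fresh 4K
 * allocation. Not the actual gRPC code.
 */
public class BufferSplitSketch {
  static final int MIN_CHUNK = 4096; // the 4K minimum discussed above

  /** Returns the unused remainder of {@code buf}, or null if reuse is not worthwhile. */
  static ByteBuffer remainder(ByteBuffer buf) {
    // The buffer/allocator pair may decide splitting is not worth it
    // (e.g. too little space left); null means "allocate fresh next time".
    if (buf.remaining() < 32) {
      return null;
    }
    return buf.slice(); // shares the backing array; no new allocation
  }

  public static void main(String[] args) {
    ByteBuffer chunk = ByteBuffer.allocate(MIN_CHUNK);
    chunk.put(new byte[5 + 32]); // 5-byte frame header plus a small message
    ByteBuffer reuse = remainder(chunk);
    System.out.println(reuse.capacity()); // 4096 - 37 = 4059 bytes left over
  }
}
```

With a real pooled allocator the slice must also carry the reference count of the parent chunk, which is exactly where the leak concern above comes from.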
I thought on server streaming we call
@buchgr You are correct, but I don't think that is relevant. The next write after the flush is going to want a buffer.
```java
buffer = null;
// If we have a buffer still and we are not trying to do the last flush on a close() call,
// try and reuse the remainder of the buffer.
if (buf != null && !isClosed()) {
```
One alternative way of doing this is to hold a reference to the buffer but not eagerly increment its reference count. When we attempt to use the buffer the next time, we can try to increment its refcount; if we can't, it has been recycled already and we will need to get a new buffer. This should be thread safe, and it doesn't introduce a memory-leak risk of any significance.
Don't think that works. Note that not only the backing memory is pooled/reused, but also the ByteBuf object itself. How do you distinguish whether a ByteBuf object with refCnt() > 0 is still the same object with the same backing memory, or whether it has meanwhile been claimed by some other call site? Even if, in addition to the ref count, we compared ByteBuf.memoryAddress, it wouldn't be safe because 1) we can't compare memoryAddress + refCnt and increment refCnt atomically, 2) by chance another call site might have gotten the same ByteBuf object with the same backing memory and we can't distinguish that case, and 3) there is some time between getting the buffer from the recycler and assigning it backing memory, so you could end up with a ByteBuf that is not backed by any (valid) memory.
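The "optimistic retain" idea and its pitfall can be modeled with a bare `AtomicInteger` refcount (a hypothetical sketch; Netty's real `ReferenceCounted` pooling and recycling are more involved):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a pooled object's reference count. */
public class OptimisticRetain {
  /** CAS loop: only bump the count if it is still nonzero. */
  static boolean tryRetain(AtomicInteger refCnt) {
    for (;;) {
      int current = refCnt.get();
      if (current == 0) {
        return false; // already released back to the pool; allocate fresh
      }
      if (refCnt.compareAndSet(current, current + 1)) {
        return true; // we now hold an extra reference
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(tryRetain(new AtomicInteger(1))); // true: still live
    System.out.println(tryRetain(new AtomicInteger(0))); // false: recycled
    // The objection above: if the pool hands the *same* object to another
    // caller, refCnt is nonzero again and tryRetain succeeds even though
    // the memory now belongs to someone else (an ABA-style problem).
  }
}
```

The CAS itself is safe; what it cannot tell you is whether a nonzero count belongs to your old use of the buffer or to a new owner after recycling, which is the crux of the objection.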
Just so I don't confuse things, here is my understanding. Right now (current master), we don't piggy-back writes for server streaming. We only piggy-back writes for client streaming, as we don't flush until 4K have been written or half close (whichever comes first). So in a sense that 4K minimum size right now is mostly useless in terms of doing fewer allocations, and wastes lots of space for small writes. Correct?

Now in this PR you suggest to slice the buffer after flush and re-use it. That sounds like a good idea, but with the downside that we might leak buffers if halfClose/close never gets called. Did you measure the allocation latency? It might be so fast that slicing vs. allocating doesn't really matter. Note that small allocations are served from thread caches without synchronization, so doing several small allocations vs. one 4K allocation might be faster. I am not exactly sure how Netty's allocator thread caches work. Real jemalloc does cache allocations up to 32K by default, but with fewer slots for larger buffers. So it might cache e.g. up to 1000 64-byte buffers, but only 10 4K buffers or so. Allocation/deallocation in a jemalloc thread cache is just pointer bumping.

Your other suggestion is to do one allocation per write, with that allocation having a minimum size of 32 bytes. Now my question: why have the minimum size if we effectively never piggy-back? On the fast path (known payload length), we know how much space we need anyway.
Current master does no piggy-backing to my knowledge. I also think we would apply any solution to both client streaming and server streaming.
I think the minimum was originally put in to avoid fragmentation in the allocator, although it was also written with flush delaying in mind (which we don't do today). Yeah, we may not really need the minimum size; we could just tell Netty what we need and let Netty's allocator do what it thinks is best. The only advantage we have vs. the allocator is that the allocator API has stricter semantics than we need: when we allocate a size, the size is only a hint. We can handle smaller sizes and sometimes make use of larger ones. Netty's allocator may allocate something larger, but I haven't seen an API to learn how much extra space there is past the capacity we requested. Right now the min and max in NettyWritableBufferAllocator are semi-arbitrary. I don't believe we've done any real investigation on whether the numbers are appropriate. It was introduced in f7010f2 via #313 and hasn't changed since.
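The min/max behavior being discussed can be sketched as a simple clamp of the requested capacity (hypothetical constants and method name; the actual values and code in NettyWritableBufferAllocator may differ):

```java
/**
 * Hedged sketch of clamping a requested capacity between a floor and a
 * ceiling, in the spirit of NettyWritableBufferAllocator. The constants
 * here are assumptions based on the thread, not the library's source.
 */
public class CapacityClamp {
  static final int MIN_BUFFER = 4096;        // the 4K minimum discussed above
  static final int MAX_BUFFER = 1024 * 1024; // assumed 1 MB ceiling

  static int clamp(int capacityHint) {
    // The hint is only a hint: small requests are rounded up to MIN_BUFFER,
    // huge requests are capped so one message never demands a giant
    // contiguous chunk in a single allocation.
    return Math.min(MAX_BUFFER, Math.max(MIN_BUFFER, capacityHint));
  }

  public static void main(String[] args) {
    System.out.println(clamp(5));         // 4096: a 5-byte frame still costs 4K
    System.out.println(clamp(64 * 1024)); // 65536: within range, used as-is
    System.out.println(clamp(1 << 30));   // 1048576: capped at MAX_BUFFER
  }
}
```

The first case is exactly the waste this PR targets: the floor turns a 5-byte request into a 4K allocation.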
You are right. So if we never piggy-back, then there is no point in having a minimum size really?
I think we should have confidence in the allocator and let the allocator deal with fragmentation.
It will fit an allocation to a specific size class. Those size classes were chosen to keep overhead to a minimum, so I think that's fine. I would argue that the allocator would certainly do a better job of keeping that extra capacity to a minimum than us choosing some fixed size class.
So we get rid of the minimum size in
While working on #2139, I noticed that a large minimum size does make sense for another reason, that's not immediately obvious. In the That's more a problem of the current code structure though and could be fixed, as we call |
But just before writeRaw we will generally allocate an appropriately-sized buffer. (Maybe that allocation in writeRaw should go away? It is probably left over from an earlier version of the code.)
yep, but the buffer allocator is allowed to return a buffer less than |
But if the message is 1 GB, we don't want a contiguous buffer. 1024 1 MB buffers would be fine. We could make writeRaw a bit smarter and inform it how much we expect is remaining? So then if we get a smaller buffer initially we will continue asking for large buffers. |
In case you are thinking about peak memory usage - we wouldn't be flushing until that 1GB has been framed. So for peak memory usage it should make no difference. Also I am not saying that a contiguous 1GB buffer should be returned. In fact, the allocator might not be able to do that. It's specific to the transport. However, it could always return a
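The idea of backing one large logical write with many fixed-size chunks, instead of one contiguous allocation, can be sketched with plain byte arrays (a hypothetical helper; a real transport would use its allocator's composite-buffer support rather than copying):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: split a large message across fixed-size chunks so
 * no single allocation has to be contiguous and huge.
 */
public class ChunkedBuffer {
  static List<byte[]> writeChunked(byte[] message, int chunkSize) {
    List<byte[]> chunks = new ArrayList<>();
    for (int off = 0; off < message.length; off += chunkSize) {
      int len = Math.min(chunkSize, message.length - off);
      byte[] chunk = new byte[len]; // each allocation stays small
      System.arraycopy(message, off, chunk, 0, len);
      chunks.add(chunk);
    }
    return chunks;
  }

  public static void main(String[] args) {
    byte[] message = new byte[10 * 1024 * 1024]; // 10 MB payload
    List<byte[]> chunks = writeChunked(message, 1024 * 1024); // 1 MB chunks
    System.out.println(chunks.size()); // 10 chunks, none larger than 1 MB
  }
}
```

A chunked layout also lets earlier pieces be released as they are written to the wire, which a single contiguous buffer cannot do.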
@buchgr Having to allocate any memory seems like a problem in the near future. I expect that proto will be able to self-describe as ByteBuffers (or equivalent) and thus not need to be copied by the message framer. In that world, does it make sense to move serialization into the transport?
I was not thinking of peak memory usage. I was thinking more about fragmentation and the effects of large contiguous allocations. For example, it would be virtually impossible to reliably allocate a 1 GB buffer on a 32-bit VM, even assuming you have the memory available. The flushing could be changed fairly trivially, but I don't expect much benefit from going down that road. Pretty soon we will want protobuf to stop implementing KnownLength, because that will allow writing backward with only a single pass over the protobuf, instead of the two passes we do now: one to get lengths, one to write. Right now protobuf itself requires two passes, so it is "free" for us to get the length. But Nathan's work to write backward will be available eventually. Also, you can't release any of the 1 GB buffer until the entire thing is written. Splitting the buffer into smaller pieces is a feature, not a downfall. At some point combining buffers stops being a performance gain, and that makes sense, because otherwise writev/readv wouldn't exist. Only small buffers cause issues.
I don't see what advantage that provides. Whether it is 1
What self-describing feature are you talking about? I've not heard anything about that. (Or rather, I've not heard of anything along those lines that isn't a completely different wire format.) Even if we had that, though, it seems like it wouldn't matter where we serialize, since serializing wouldn't actually need to serialize anything, so it is close to free. I do know there is work to serialize directly to
My thinking was that the code in |
Oh, okay. That's fair. That may be the case. Although I think My instinct tells me that we don't save much though, in that by trying to simplify we make things more complicated, more opaque, and harder to optimize. Especially if we ever stop flushing after every message.
The following code, allocating 1024 blocks of 1 MB each, would take about 15 seconds, which could be a bit long, taking into account an insert for a

```java
import java.util.LinkedList;

public class GBConsecutive {
  public static void main(String[] args) {
    long tick = System.currentTimeMillis();
    LinkedList<byte[]> ll = new LinkedList<byte[]>();
    int ONE_MB = 1048576;
    for (int i = 0; i < 1024; i++) {
      byte[] ONE_MB_BYTE = new byte[ONE_MB];
      ll.add(ONE_MB_BYTE);
    }
    long elapsed = System.currentTimeMillis() - tick;
    System.out.println("Elapsed milliseconds: " + elapsed);
  }
}
```
Now the above code runs in 46 seconds, and when I perform it all at once like this, it runs in 28 seconds:

```java
import java.util.LinkedList;

public class GBConsecutiveAll {
  public static void main(String[] args) {
    long tick = System.currentTimeMillis();
    LinkedList<byte[]> ll = new LinkedList<byte[]>();
    int ONE_GB = 1073741824;
    // Single allocation of one contiguous 1 GB array.
    byte[] ONE_GB_BYTE = new byte[ONE_GB];
    ll.add(ONE_GB_BYTE);
    long elapsed = System.currentTimeMillis() - tick;
    System.out.println("Elapsed milliseconds: " + elapsed);
  }
}
```
cc: @louiscryan