Streamline CompositeByteBuf internals#8437
Conversation
|
Can one of the admins verify this patch? |
|
Sequential access benchmark before: After: Random access benchmark before: After: Create and write benchmark before: After:
|
|
@netty-bot test this please |
30e97d0 to
44ebe5b
Compare
|
@normanmaurer I have re-based and fixed the failing unit tests (sorry, I just ran the buffer ones and forgot to run the whole suite). The failures were due to ref-count of components no longer being checked at addition time - it was done implicitly before when they were sliced. |
|
@netty-bot test this please |
a9b524e to
efdf2fa
Compare
Motivation: CompositeByteBuf is a powerful and versatile abstraction, allowing for manipulation of large data without copying bytes. There is still a non-negligible cost to reading/writing however relative to "singular" ByteBufs, and this can be mostly eliminated with some rework of the internals. My use case is message modification/transformation while zero-copy proxying. For example replacing a string within a large message with one of a different length Modifications: - No longer slice added buffers and unwrap added slices - Components store target buf offset relative to position in composite buf - Less allocations, object footprint, pointer indirection, offset arithmetic - Use Component[] rather than ArrayList<Component> - Avoid pointer indirection and duplicate bounds check, more efficient backing array growth - Facilitates optimization when doing bulk-inserts - inserting n ByteBufs behind m is now O(m + n) instead of O(mn) - Avoid unnecessary casting and method call indirection via superclass - Eliminate some duplicate range/ref checks via non-checking versions of toComponentIndex and findComponent - Add simple fast-path for toComponentIndex(0); add racy cache of last-accessed Component to findComponent(int) - Override forEachByte0(...) and forEachByteDesc0(...) methods - Make use of RecyclableArrayList in nioBuffers(int, int) (in line with FasterCompositeByteBuf impl) - Modify addComponents0(boolean,int,Iterable) to use the Iterable directly rather than copy to an array first (and possibly to an ArrayList before that) - Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform repeated array insertions and avoid second loop for offset updates - Simplify other logic in various places, in particular the general pattern used where a sub-range is iterated over - Add benchmarks to demonstrate some improvements While refactoring I also came across a couple of clear bugs. They are fixed in these changes but I will open another PR with unit tests and fixes to the current version. Result: Much faster creation, manipulation, and access; many fewer allocations and smaller footprint. Benchmark results to follow.
efdf2fa to
e981383
Compare
|
@normanmaurer I've now squashed and rebased this. Take your time! :) |
mosesn
left a comment
There was a problem hiding this comment.
this generally seems good, there are a lot of changes, where do most of the benefits come from? are they from the iterator and byte processing?
|
Thanks for the review @mosesn
The biggest all-round benefits appear to be from the removal of the slices and component ArrayList. The ByteProcessor methods obviously benefit significantly from being overridden. Sequential reading/writing benefits significantly from the "lastAccessed" cache. Most of the other changes benefit the speed of construction and manipulation of the components. |
|
@netty-bot test this please |
normanmaurer
left a comment
There was a problem hiding this comment.
just nits: after this LGTM... Great work!
- also null out Component.slice in the transferTo case - remove redundant updateComponentOffsets(0) in consolidate() - simplify clearComps()
|
@normanmaurer @mosesn I've added njhill@b8ed3ba to address the comments; it also includes a couple of minor tweaks to things I noticed while looking over it again. |
|
@netty-bot test this please |
|
@njhill thanks a lot! |
| shiftComps(cIndex, count); // will increase componentCount | ||
| ci = cIndex; // only set this after we've shifted so that finally block logic is always correct | ||
| int nextOffset = cIndex > 0 ? components[cIndex - 1].endOffset : 0; | ||
| for (ByteBuf b; arrOffset < len && (b = buffers[arrOffset]) != null; arrOffset++, ci++) { |
There was a problem hiding this comment.
This is kind of difficult to read. You should consider making you code slightly more verbose, but faster to consume.
There was a problem hiding this comment.
Thanks @carl-mastrangelo, I do have a habit of over-compactness sometimes. I have some smaller follow-on updates to the class in mind anyhow so will try and modify this to be a bit more verbose/readable.
Motivation In netty#8758, @doom369 reported an infinite loop bug in CompositeByteBuf which was introduced in netty#8437. This is the same small fix for that, along with fixes for two other bugs found while re-inspecting the changes and adding unit tests. Modification - Replace recursive call to toComponentIndex with toComponentIndex0 as intended - Add missed "lastAccessed" racy cache invalidation in capacity(int) method - Fix incorrect determination of initial offset in non-zero cIndex case of updateComponentOffsets method - New unit tests for previously uncovered methods Results Fewer bugs.
Motivation In #8758, @doom369 reported an infinite loop bug in CompositeByteBuf which was introduced in #8437. This is the same small fix for that, along with fixes for two other bugs found while re-inspecting the changes and adding unit tests. Modification - Replace recursive call to toComponentIndex with toComponentIndex0 as intended - Add missed "lastAccessed" racy cache invalidation in capacity(int) method - Fix incorrect determination of initial offset in non-zero cIndex case of updateComponentOffsets method - New unit tests for previously uncovered methods Results Fewer bugs.
Motivation In #8758, @doom369 reported an infinite loop bug in CompositeByteBuf which was introduced in #8437. This is the same small fix for that, along with fixes for two other bugs found while re-inspecting the changes and adding unit tests. Modification - Replace recursive call to toComponentIndex with toComponentIndex0 as intended - Add missed "lastAccessed" racy cache invalidation in capacity(int) method - Fix incorrect determination of initial offset in non-zero cIndex case of updateComponentOffsets method - New unit tests for previously uncovered methods Results Fewer bugs.
Motivation There's some miscellaneous cleanup/simplification of CompositeByteBuf which would help make the code a bit clearer. Modifications - Simplify web of constructors and addComponents methods, reducing duplication of logic - Rename `Component.freeIfNecessary()` method to just `free()`, which is less confusing (see netty#8641) - Make loop in addComponents0(...) method more verbose/readable (see netty#8437 (comment)) - Simplify addition/subtraction in setBytes(...) methods Result Smaller/clearer code
Motivation There's some miscellaneous cleanup/simplification of CompositeByteBuf which would help make the code a bit clearer. Modifications - Simplify web of constructors and addComponents methods, reducing duplication of logic - Rename `Component.freeIfNecessary()` method to just `free()`, which is less confusing (see #8641) - Make loop in addComponents0(...) method more verbose/readable (see #8437 (comment)) - Simplify addition/subtraction in setBytes(...) methods Result Smaller/clearer code
Motivation There's some miscellaneous cleanup/simplification of CompositeByteBuf which would help make the code a bit clearer. Modifications - Simplify web of constructors and addComponents methods, reducing duplication of logic - Rename `Component.freeIfNecessary()` method to just `free()`, which is less confusing (see #8641) - Make loop in addComponents0(...) method more verbose/readable (see #8437 (comment)) - Simplify addition/subtraction in setBytes(...) methods Result Smaller/clearer code
Motivation A small thread-safety bug was introduced during the internal optimizations of ComponentByteBuf made a while back in netty#8437. When there is a single component which was added as a slice, internalNioBuffer(int,int) will currently return the unwrapped slice's un-duplicated internal NIO buffer. This is not safe since it could be modified concurrently with other usage of that parent buffer. Modifications Delegate internalNioBuffer to nioBuffer in this case, which returns a duplicate. This matches what's done in derived buffers in general (for the same reason). Add unit test. Result Fixed possible thread-safety bug
#9169) Motivation A small thread-safety bug was introduced during the internal optimizations of ComponentByteBuf made a while back in #8437. When there is a single component which was added as a slice, internalNioBuffer(int,int) will currently return the unwrapped slice's un-duplicated internal NIO buffer. This is not safe since it could be modified concurrently with other usage of that parent buffer. Modifications Delegate internalNioBuffer to nioBuffer in this case, which returns a duplicate. This matches what's done in derived buffers in general (for the same reason). Add unit test. Result Fixed possible thread-safety bug
#9169) Motivation A small thread-safety bug was introduced during the internal optimizations of ComponentByteBuf made a while back in #8437. When there is a single component which was added as a slice, internalNioBuffer(int,int) will currently return the unwrapped slice's un-duplicated internal NIO buffer. This is not safe since it could be modified concurrently with other usage of that parent buffer. Modifications Delegate internalNioBuffer to nioBuffer in this case, which returns a duplicate. This matches what's done in derived buffers in general (for the same reason). Add unit test. Result Fixed possible thread-safety bug
Motivation The way that Components are added/stored in CompositeByteBuf was changed in netty#8437 to minimize slicing and runtime access indirection, with the obvious intention to preserve the effective behaviour of public methods. @jingene discovered a discrepancy, reported in netty#9398, where the component(int) and componentAtOffset(int) methods can return an incorrect ByteBuf, i.e. not equivalent to a slice of the added buffer at the time it was added (the previous behaviour). This can happen in particular if the added ByteBuf is a wrapped slice, e.g. a leak aware or unreleasable buffer. Upon further scrutiny another subtle deviation was also noticed: When the added `ByteBuf is a slice whose readable region does not cover the whole buffer internalComponent(int) returns the original slice (with capacity != readableBytes) rather than a sliced slice. Modifications Unfortunately to fix this robustly required slightly more invasive change than first expected. - A final ByteBuf srcBuf field has been added to the Component class, which always holds the originally-added buffer. This simplifies some of the existing fragile logic where we need access to this (e.g. when releasing) - Component has been made abstract with two final impls. Which of these is used depends on whether the srcBuf is already a slice of the component's range (or equivalent to one) - Correctly handle the various places where access to the pre-unwrapped buffer is important, including the problem component(int) methods - Extend the pre-unwrapping optimization to cover WrappedByteBufs, SwappedByteBufs, and duplicates in addition to slices - Unit test for the original bug provided by @jingene Result - Correct behaviour of the (internal)component(AtOffset) methods in all cases - Reduced indirection when accessing components that were added from more kinds of ByteBufs (duplicates in particular) - Less fragile logic related to lazy-slicing components I don't expect there to be any noticeable performance impact of splitting Component into two subclasses since most of the methods are still final in the superclass and the ones that aren't will benefit from bimorphic inlining. But will aim to benchmark nonetheless. There could be slight increase in mem use due to two additional fields in _one_ of the two Component subtypes, but I think this is needed for correctness and still much better than slicing components unconditionally when adding them. Co-authored-by: jingene <jingene0206@gmail.com>
Motivation:
CompositeByteBufis a powerful and versatile abstraction, allowing for manipulation of large data without copying bytes. There is still a non-negligible cost to reading/writing however relative to "singular"ByteBufs, and this can be mostly eliminated with some rework of the internals.My use case is message modification/transformation while zero-copy proxying. For example replacing a string within a large message with one of a different length
Modifications:
Component[]rather thanArrayList<Component>ByteBufs behind m is now O(m + n) instead of O(mn)toComponentIndexandfindComponenttoComponentIndex(0); add racy cache of last-accessed Component tofindComponent(int)forEachByte0(...)andforEachByteDesc0(...)methodsRecyclableArrayListinnioBuffers(int, int)(in line withFasterCompositeByteBufimpl)addComponents0(boolean,int,Iterable)to use theIterabledirectly rather than copy to an array first (and possibly to anArrayListbefore that)addComponents0(boolean,int,ByteBuf[],int)to not perform repeated array insertions and avoid second loop for offset updatesWhile refactoring I also came across a couple of clear bugs. They are fixed in these changes but I will open another PR with unit tests and fixes to the current version.
Result:
Much faster creation, manipulation, and access; many fewer allocations and smaller footprint. Benchmark results to follow.