
Streamline CompositeByteBuf internals #8437

Merged
normanmaurer merged 2 commits into netty:4.1 from njhill:composite-streamline on Nov 3, 2018

Conversation

@njhill
Member

@njhill njhill commented Oct 27, 2018

Motivation:

CompositeByteBuf is a powerful and versatile abstraction, allowing manipulation of large data without copying bytes. There is, however, still a non-negligible cost to reading/writing relative to "singular" ByteBufs, and this can be mostly eliminated by reworking the internals.

My use case is message modification/transformation while zero-copy proxying; for example, replacing a string within a large message with one of a different length.

Modifications:

  • No longer slice added buffers and unwrap added slices
    • Components store target buf offset relative to position in composite buf
    • Fewer allocations, smaller object footprint, less pointer indirection and offset arithmetic
  • Use Component[] rather than ArrayList<Component>
    • Avoid pointer indirection and duplicate bounds checks; more efficient backing array growth
    • Facilitates optimization when doing bulk-inserts - inserting n ByteBufs behind m is now O(m + n) instead of O(mn)
  • Avoid unnecessary casting and method call indirection via superclass
  • Eliminate some duplicate range/ref checks via non-checking versions of toComponentIndex and findComponent
  • Add simple fast-path for toComponentIndex(0); add racy cache of last-accessed Component to findComponent(int)
  • Override forEachByte0(...) and forEachByteDesc0(...) methods
  • Make use of RecyclableArrayList in nioBuffers(int, int) (in line with FasterCompositeByteBuf impl)
  • Modify addComponents0(boolean,int,Iterable) to use the Iterable directly rather than copy to an array first (and possibly to an ArrayList before that)
  • Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform repeated array insertions and avoid second loop for offset updates
  • Simplify other logic in various places, in particular the general pattern used where a sub-range is iterated over
  • Add benchmarks to demonstrate some improvements
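
The first two bullets can be sketched together. This is a minimal illustration with invented names, not the actual io.netty.buffer.CompositeByteBuf code: components carry absolute offsets within the composite, so translating a composite index is one subtraction, and findComponent is a binary search over a flat array fronted by a racy last-accessed cache.

```java
// Hypothetical simplification of the flattened-component idea; names are
// illustrative, not netty's real internals.
public class CompositeIndexSketch {
    static final class Component {
        final int offset;    // start offset within the composite buffer
        final int endOffset; // exclusive end offset within the composite

        Component(int offset, int length) {
            this.offset = offset;
            this.endOffset = offset + length;
        }

        // Composite index -> index into the wrapped buffer: a single
        // subtraction, with no slice object or extra indirection in between.
        int idx(int compositeIndex) {
            return compositeIndex - offset;
        }
    }

    private final Component[] components; // flat array instead of ArrayList
    private Component lastAccessed;       // racy cache; safe since fields are final

    CompositeIndexSketch(int... lengths) {
        components = new Component[lengths.length];
        int offset = 0;
        for (int i = 0; i < lengths.length; i++) {
            components[i] = new Component(offset, lengths[i]);
            offset = components[i].endOffset;
        }
    }

    // Assumes 0 <= compositeIndex < total capacity.
    Component findComponent(int compositeIndex) {
        Component la = lastAccessed;
        if (la != null && compositeIndex >= la.offset && compositeIndex < la.endOffset) {
            return la; // fast path for sequential access patterns
        }
        for (int low = 0, high = components.length; ; ) {
            int mid = (low + high) >>> 1;
            Component c = components[mid];
            if (compositeIndex >= c.endOffset) {
                low = mid + 1;
            } else if (compositeIndex < c.offset) {
                high = mid;
            } else {
                lastAccessed = c; // publish to the racy cache
                return c;
            }
        }
    }
}
```

The cache is deliberately unsynchronized: reading a stale Component is harmless because its fields are final, and a miss simply falls through to the binary search.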

While refactoring I also came across a couple of clear bugs. They are fixed in these changes but I will open another PR with unit tests and fixes to the current version.

Result:

Much faster creation, manipulation, and access; many fewer allocations and smaller footprint. Benchmark results to follow.

@netty-bot

Can one of the admins verify this patch?

@njhill
Member Author

njhill commented Oct 27, 2018

Sequential access benchmark before:

Benchmark               (bufferType)   (size)   Mode  Cnt         Score        Error  Units
forEachByte             SMALL_CHUNKS        8  thrpt   20  26993071.439 ± 219325.277  ops/s
forEachByte             SMALL_CHUNKS       64  thrpt   20  23873737.218 ± 171486.366  ops/s
forEachByte             SMALL_CHUNKS     1024  thrpt   20  19524219.387 ± 171803.902  ops/s
forEachByte             SMALL_CHUNKS    10240  thrpt   20  16371334.078 ± 202567.089  ops/s
forEachByte             SMALL_CHUNKS   102400  thrpt   20  14407374.550 ± 137438.057  ops/s
forEachByte             SMALL_CHUNKS  1024000  thrpt   20  11661776.666 ± 119680.974  ops/s
forEachByte             LARGE_CHUNKS        8  thrpt   20  33611609.026 ± 287654.748  ops/s
forEachByte             LARGE_CHUNKS       64  thrpt   20  33652029.577 ± 280328.451  ops/s
forEachByte             LARGE_CHUNKS     1024  thrpt   20  29717325.621 ± 157056.609  ops/s
forEachByte             LARGE_CHUNKS    10240  thrpt   20  24705063.350 ± 234310.406  ops/s
forEachByte             LARGE_CHUNKS   102400  thrpt   20  20561512.069 ± 156557.907  ops/s
forEachByte             LARGE_CHUNKS  1024000  thrpt   20  17808390.595 ±  87640.287  ops/s
sequentialWriteAndRead  SMALL_CHUNKS        8  thrpt   20   4362862.448 ±  31972.845  ops/s
sequentialWriteAndRead  SMALL_CHUNKS       64  thrpt   20    458934.160 ±   2397.932  ops/s
sequentialWriteAndRead  SMALL_CHUNKS     1024  thrpt   20     15448.106 ±   1278.067  ops/s
sequentialWriteAndRead  SMALL_CHUNKS    10240  thrpt   20      1130.944 ±      5.192  ops/s
sequentialWriteAndRead  SMALL_CHUNKS   102400  thrpt   20       108.989 ±      0.478  ops/s
sequentialWriteAndRead  SMALL_CHUNKS  1024000  thrpt   20         9.183 ±      0.213  ops/s
sequentialWriteAndRead  LARGE_CHUNKS        8  thrpt   20   6237428.716 ±  41521.522  ops/s
sequentialWriteAndRead  LARGE_CHUNKS       64  thrpt   20    845245.939 ±  10866.322  ops/s
sequentialWriteAndRead  LARGE_CHUNKS     1024  thrpt   20     38610.312 ±    329.698  ops/s
sequentialWriteAndRead  LARGE_CHUNKS    10240  thrpt   20      2996.666 ±     38.982  ops/s
sequentialWriteAndRead  LARGE_CHUNKS   102400  thrpt   20       243.859 ±      2.593  ops/s
sequentialWriteAndRead  LARGE_CHUNKS  1024000  thrpt   20        19.296 ±      0.240  ops/s

After:

Benchmark               (bufferType)   (size)   Mode  Cnt         Score         Error  Units
forEachByte             SMALL_CHUNKS        8  thrpt   20  68062399.601 ± 6766022.135  ops/s
forEachByte             SMALL_CHUNKS       64  thrpt   20  69543939.577 ± 4830276.158  ops/s
forEachByte             SMALL_CHUNKS     1024  thrpt   20  55004050.513 ± 3936975.234  ops/s
forEachByte             SMALL_CHUNKS    10240  thrpt   20  48133958.927 ±  969792.765  ops/s
forEachByte             SMALL_CHUNKS   102400  thrpt   20  37868245.224 ± 1384262.077  ops/s
forEachByte             SMALL_CHUNKS  1024000  thrpt   20  37194864.817 ± 2248474.343  ops/s
forEachByte             LARGE_CHUNKS        8  thrpt   20  94935640.886 ± 1928949.654  ops/s
forEachByte             LARGE_CHUNKS       64  thrpt   20  95666251.202 ± 1826611.657  ops/s
forEachByte             LARGE_CHUNKS     1024  thrpt   20  87083674.874 ± 1305642.972  ops/s
forEachByte             LARGE_CHUNKS    10240  thrpt   20  74770743.128 ±  488339.091  ops/s
forEachByte             LARGE_CHUNKS   102400  thrpt   20  61742575.861 ±  286639.001  ops/s
forEachByte             LARGE_CHUNKS  1024000  thrpt   20  49095591.069 ± 2823970.971  ops/s
sequentialWriteAndRead  SMALL_CHUNKS        8  thrpt   20  11910592.825 ±  121065.538  ops/s
sequentialWriteAndRead  SMALL_CHUNKS       64  thrpt   20   1771930.034 ±   44302.166  ops/s
sequentialWriteAndRead  SMALL_CHUNKS     1024  thrpt   20     52373.348 ±    6128.125  ops/s
sequentialWriteAndRead  SMALL_CHUNKS    10240  thrpt   20      2918.622 ±      66.881  ops/s
sequentialWriteAndRead  SMALL_CHUNKS   102400  thrpt   20       269.001 ±      10.828  ops/s
sequentialWriteAndRead  SMALL_CHUNKS  1024000  thrpt   20        24.236 ±       1.213  ops/s
sequentialWriteAndRead  LARGE_CHUNKS        8  thrpt   20  28439985.180 ±  342735.897  ops/s
sequentialWriteAndRead  LARGE_CHUNKS       64  thrpt   20   5432692.507 ±   21276.229  ops/s
sequentialWriteAndRead  LARGE_CHUNKS     1024  thrpt   20    154414.637 ±   23798.671  ops/s
sequentialWriteAndRead  LARGE_CHUNKS    10240  thrpt   20     15639.020 ±    2702.698  ops/s
sequentialWriteAndRead  LARGE_CHUNKS   102400  thrpt   20      2005.657 ±      13.568  ops/s
sequentialWriteAndRead  LARGE_CHUNKS  1024000  thrpt   20       194.357 ±       1.032  ops/s

Random access benchmark before:

Benchmark   (bufferType)   (size)   Mode  Cnt         Score         Error  Units
setGetLong  SMALL_CHUNKS       64  thrpt   20   4290155.605 ±   45371.841  ops/s
setGetLong  SMALL_CHUNKS    10240  thrpt   20    818391.458 ±   11681.233  ops/s
setGetLong  SMALL_CHUNKS  1024000  thrpt   20    435844.142 ±    4360.760  ops/s
setGetLong  LARGE_CHUNKS       64  thrpt   20  26527380.867 ±  822279.326  ops/s
setGetLong  LARGE_CHUNKS    10240  thrpt   20  16899663.785 ±  349608.066  ops/s
setGetLong  LARGE_CHUNKS  1024000  thrpt   20   5854026.521 ±   20542.515  ops/s
setLong     SMALL_CHUNKS       64  thrpt   20   7978590.875 ±  254796.749  ops/s
setLong     SMALL_CHUNKS    10240  thrpt   20   1607988.364 ±   27542.171  ops/s
setLong     SMALL_CHUNKS  1024000  thrpt   20    705170.809 ±   15939.167  ops/s
setLong     LARGE_CHUNKS       64  thrpt   20  34921276.445 ± 2481802.392  ops/s
setLong     LARGE_CHUNKS    10240  thrpt   20  23420150.831 ±   83684.512  ops/s
setLong     LARGE_CHUNKS  1024000  thrpt   20   9589351.189 ±   79475.182  ops/s

After:

Benchmark   (bufferType)   (size)   Mode  Cnt         Score         Error  Units
setGetLong  SMALL_CHUNKS       64  thrpt   20   7816830.465 ±  350508.182  ops/s
setGetLong  SMALL_CHUNKS    10240  thrpt   20   1995144.457 ±   28483.814  ops/s
setGetLong  SMALL_CHUNKS  1024000  thrpt   20    933882.582 ±   76850.014  ops/s
setGetLong  LARGE_CHUNKS       64  thrpt   20  45196249.259 ± 3371949.390  ops/s
setGetLong  LARGE_CHUNKS    10240  thrpt   20  19241447.949 ±  120664.468  ops/s
setGetLong  LARGE_CHUNKS  1024000  thrpt   20   8777183.501 ±  216361.178  ops/s
setLong     SMALL_CHUNKS       64  thrpt   20  12787399.243 ±  494984.215  ops/s
setLong     SMALL_CHUNKS    10240  thrpt   20   3873818.287 ±   54400.633  ops/s
setLong     SMALL_CHUNKS  1024000  thrpt   20   1380204.611 ±   94828.223  ops/s
setLong     LARGE_CHUNKS       64  thrpt   20  56391729.890 ± 3434506.497  ops/s
setLong     LARGE_CHUNKS    10240  thrpt   20  21926368.046 ± 1154927.398  ops/s
setLong     LARGE_CHUNKS  1024000  thrpt   20  10125780.537 ±  307389.290  ops/s

Create and write benchmark before:

Benchmark  (bufferType)   (size)   Mode  Cnt        Score        Error  Units
writeCBB   SMALL_CHUNKS       64  thrpt   24  1250520.113 ±  35029.851  ops/s
writeCBB   SMALL_CHUNKS     1024  thrpt   24   168285.985 ±   4150.415  ops/s
writeCBB   SMALL_CHUNKS    10240  thrpt   24    17817.229 ±    341.587  ops/s
writeCBB   SMALL_CHUNKS   102400  thrpt   24     1247.043 ±     71.196  ops/s
writeCBB   SMALL_CHUNKS  1024000  thrpt   24      112.241 ±      5.766  ops/s
writeCBB   LARGE_CHUNKS       64  thrpt   24  4847690.410 ±  19710.479  ops/s
writeCBB   LARGE_CHUNKS     1024  thrpt   24  4679833.835 ± 194635.132  ops/s
writeCBB   LARGE_CHUNKS    10240  thrpt   24  1436468.046 ±  36013.585  ops/s
writeCBB   LARGE_CHUNKS   102400  thrpt   24   150188.013 ±   2729.452  ops/s
writeCBB   LARGE_CHUNKS  1024000  thrpt   24    12876.866 ±    188.330  ops/s

After:

Benchmark  (bufferType)   (size)   Mode  Cnt        Score        Error  Units
writeCBB   SMALL_CHUNKS       64  thrpt   24  1523615.079 ±  40609.703  ops/s
writeCBB   SMALL_CHUNKS     1024  thrpt   24   309682.828 ±  16358.499  ops/s
writeCBB   SMALL_CHUNKS    10240  thrpt   24    27729.313 ±   1178.068  ops/s
writeCBB   SMALL_CHUNKS   102400  thrpt   24     1703.018 ±     26.623  ops/s
writeCBB   SMALL_CHUNKS  1024000  thrpt   24      141.040 ±     13.677  ops/s
writeCBB   LARGE_CHUNKS       64  thrpt   24  6130592.282 ±  92724.933  ops/s
writeCBB   LARGE_CHUNKS     1024  thrpt   24  6468883.047 ± 376330.828  ops/s
writeCBB   LARGE_CHUNKS    10240  thrpt   24  1836046.592 ±  39547.323  ops/s
writeCBB   LARGE_CHUNKS   102400  thrpt   24   195067.557 ±   7758.292  ops/s
writeCBB   LARGE_CHUNKS  1024000  thrpt   24    14938.708 ±    221.054  ops/s

FixedCompositeByteBuf is still faster for this, but all of its advantage comes from construction speed; actually writing the resulting buffer is slower:

writeFCBB  SMALL_CHUNKS       64  thrpt   24  2477965.194 ± 230073.269  ops/s
writeFCBB  SMALL_CHUNKS     1024  thrpt   24   404597.067 ±  19591.733  ops/s
writeFCBB  SMALL_CHUNKS    10240  thrpt   24    40672.568 ±   1989.881  ops/s
writeFCBB  SMALL_CHUNKS   102400  thrpt   24     3748.048 ±     25.265  ops/s
writeFCBB  SMALL_CHUNKS  1024000  thrpt   24      324.002 ±     12.610  ops/s
writeFCBB  LARGE_CHUNKS       64  thrpt   24  8175906.738 ± 216508.810  ops/s
writeFCBB  LARGE_CHUNKS     1024  thrpt   24  8604834.785 ± 465917.018  ops/s
writeFCBB  LARGE_CHUNKS    10240  thrpt   24  2061924.498 ±  62903.237  ops/s
writeFCBB  LARGE_CHUNKS   102400  thrpt   24   220805.358 ±   7421.653  ops/s
writeFCBB  LARGE_CHUNKS  1024000  thrpt   24    17907.005 ±    309.598  ops/s

@normanmaurer
Member

@netty-bot test this please

@njhill njhill force-pushed the composite-streamline branch from 30e97d0 to 44ebe5b on October 28, 2018 16:52
@njhill
Member Author

njhill commented Oct 28, 2018

@normanmaurer I have rebased and fixed the failing unit tests (sorry, I just ran the buffer ones and forgot to run the whole suite). The failures were due to the ref-count of components no longer being checked at addition time; previously this was done implicitly when they were sliced.

@normanmaurer
Member

@netty-bot test this please

@njhill njhill force-pushed the composite-streamline branch from efdf2fa to e981383 on October 30, 2018 20:08
@njhill
Member Author

njhill commented Oct 30, 2018

@normanmaurer I've now squashed and rebased this. Take your time! :)

Contributor

@mosesn mosesn left a comment


This generally seems good. There are a lot of changes; where do most of the benefits come from? Are they from the iterator and byte processing?

@njhill
Member Author

njhill commented Nov 2, 2018

Thanks for the review @mosesn

where do most of the benefits come from? are they from the iterator and byte processing?

The biggest all-round benefits appear to be from the removal of the slices and component ArrayList. The ByteProcessor methods obviously benefit significantly from being overridden. Sequential reading/writing benefits significantly from the "lastAccessed" cache. Most of the other changes benefit the speed of construction and manipulation of the components.
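
One of the construction-time wins, the O(m + n) bulk insert, can be sketched in isolation. This is a hypothetical helper using int[] as a stand-in for the Component[] array, not the actual netty code: the tail is shifted once with System.arraycopy rather than once per inserted element.

```java
public class BulkInsertSketch {
    // Insert all of "added" into the first baseCount elements of "base" at
    // index cIndex with a single tail shift: O(m + n) overall, instead of the
    // O(mn) cost of inserting the n new elements one at a time.
    static int[] bulkInsert(int[] base, int baseCount, int cIndex, int[] added) {
        int[] result = new int[baseCount + added.length];
        System.arraycopy(base, 0, result, 0, cIndex);              // head, unchanged
        System.arraycopy(added, 0, result, cIndex, added.length);  // new block
        System.arraycopy(base, cIndex, result, cIndex + added.length,
                baseCount - cIndex);                               // tail, moved once
        return result;
    }
}
```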

@normanmaurer
Member

@netty-bot test this please

Member

@normanmaurer normanmaurer left a comment


just nits: after this LGTM... Great work!

- also null out Component.slice in the transferTo case
- remove redundant updateComponentOffsets(0) in consolidate()
- simplify clearComps()
@njhill
Member Author

njhill commented Nov 2, 2018

@normanmaurer @mosesn I've added njhill@b8ed3ba to address the comments; it also includes a couple of minor tweaks to things I noticed while looking over it again.

@normanmaurer
Member

@netty-bot test this please

@normanmaurer normanmaurer merged commit 10539f4 into netty:4.1 Nov 3, 2018
@normanmaurer
Member

@njhill thanks a lot!

@normanmaurer normanmaurer added this to the 4.1.32.Final milestone Nov 3, 2018
@normanmaurer normanmaurer self-assigned this Nov 3, 2018
@njhill njhill deleted the composite-streamline branch November 4, 2018 04:46
shiftComps(cIndex, count); // will increase componentCount
ci = cIndex; // only set this after we've shifted so that finally block logic is always correct
int nextOffset = cIndex > 0 ? components[cIndex - 1].endOffset : 0;
for (ByteBuf b; arrOffset < len && (b = buffers[arrOffset]) != null; arrOffset++, ci++) {
Member


This is kind of difficult to read. You should consider making your code slightly more verbose, but faster to consume.

Member Author


Thanks @carl-mastrangelo, I do have a habit of over-compactness sometimes. I have some smaller follow-on updates to the class in mind anyhow, so I will try to make this a bit more verbose/readable.

njhill added a commit to njhill/netty that referenced this pull request Jan 23, 2019
Motivation

In netty#8758, @doom369 reported an infinite loop bug in CompositeByteBuf
which was introduced in netty#8437.

This is the same small fix for that, along with fixes for two other bugs
found while re-inspecting the changes and adding unit tests.

Modification

- Replace recursive call to toComponentIndex with toComponentIndex0 as
intended
- Add missed "lastAccessed" racy cache invalidation in capacity(int)
method
- Fix incorrect determination of initial offset in non-zero cIndex case
of updateComponentOffsets method
- New unit tests for previously uncovered methods

Results

Fewer bugs.
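
The first bullet describes a classic delegation slip. A hypothetical sketch (invented class with stand-in logic, echoing only the toComponentIndex/toComponentIndex0 naming from the fix):

```java
// Hypothetical illustration of the recursion bug; the class and the
// component-lookup logic are invented for the example.
public class RecursionBugSketch {
    static int toComponentIndexBuggy(int offset) {
        checkIndex(offset);
        // BUG: recurses into itself instead of delegating, so a call
        // never terminates normally.
        return toComponentIndexBuggy(offset);
    }

    static int toComponentIndexFixed(int offset) {
        checkIndex(offset);
        return toComponentIndex0(offset); // intended: delegate to the unchecked version
    }

    private static void checkIndex(int offset) {
        if (offset < 0) {
            throw new IndexOutOfBoundsException("offset: " + offset);
        }
    }

    // Stand-in for the real unchecked component lookup.
    private static int toComponentIndex0(int offset) {
        return offset / 16;
    }
}
```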
normanmaurer pushed a commit that referenced this pull request Jan 24, 2019
normanmaurer pushed a commit that referenced this pull request Jan 24, 2019
@njhill njhill mentioned this pull request Jan 25, 2019
njhill added a commit to njhill/netty that referenced this pull request Jan 29, 2019
Motivation

There's some miscellaneous cleanup/simplification of CompositeByteBuf
which would help make the code a bit clearer.

Modifications

- Simplify web of constructors and addComponents methods, reducing
duplication of logic
- Rename `Component.freeIfNecessary()` method to just `free()`, which is
less confusing (see netty#8641)
- Make loop in addComponents0(...) method more verbose/readable (see
netty#8437 (comment))
- Simplify addition/subtraction in setBytes(...) methods

Result

Smaller/clearer code
normanmaurer pushed a commit that referenced this pull request Feb 1, 2019
normanmaurer pushed a commit that referenced this pull request Feb 1, 2019
njhill added a commit to njhill/netty that referenced this pull request May 21, 2019
Motivation

A small thread-safety bug was introduced during the internal
optimizations of CompositeByteBuf made a while back in netty#8437. When there
is a single component which was added as a slice,
internalNioBuffer(int,int) will currently return the unwrapped slice's
un-duplicated internal NIO buffer. This is not safe since it could be
modified concurrently with other usage of that parent buffer.

Modifications

Delegate internalNioBuffer to nioBuffer in this case, which returns a
duplicate. This matches what's done in derived buffers in general
(for the same reason). Add unit test.

Result

Fixed possible thread-safety bug
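
The hazard being fixed here can be reproduced with plain java.nio, independent of netty: a ByteBuffer's position and limit are mutable, so handing the same instance to two callers lets one corrupt the other's view, whereas duplicate() shares the bytes but not the indices. (Sketch of the general NIO behaviour, with invented method names, not netty's code.)

```java
import java.nio.ByteBuffer;

public class DuplicateSketch {
    // Returns the parent's position after a caller repositions the view it
    // was handed. With shareInternal=true the caller gets the parent buffer
    // itself (the pre-fix behaviour); with false it gets a duplicate (the fix).
    static int positionAfterCallerSeeks(boolean shareInternal) {
        ByteBuffer parent = ByteBuffer.allocate(16);
        ByteBuffer view = shareInternal ? parent : parent.duplicate();
        view.position(8);         // the caller moves "its" position...
        return parent.position(); // ...did the parent's position move too?
    }
}
```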
normanmaurer pushed a commit that referenced this pull request May 22, 2019
normanmaurer pushed a commit that referenced this pull request May 22, 2019
njhill added a commit to njhill/netty that referenced this pull request Aug 28, 2019
Motivation

The way that Components are added/stored in CompositeByteBuf was changed
in netty#8437 to minimize slicing and runtime access indirection, with the
obvious intention to preserve the effective behaviour of public methods.

@jingene discovered a discrepancy, reported in netty#9398, where the
component(int) and componentAtOffset(int) methods can return an
incorrect ByteBuf, i.e. not equivalent to a slice of the added buffer at
the time it was added (the previous behaviour). This can happen in
particular if the added ByteBuf is a wrapped slice, e.g. a leak aware or
unreleasable buffer.

Upon further scrutiny another subtle deviation was also noticed: when
the added `ByteBuf` is a slice whose readable region does not cover the
whole buffer, internalComponent(int) returns the original slice (with
capacity != readableBytes) rather than a sliced slice.

Modifications

Unfortunately to fix this robustly required slightly more invasive
change than first expected.
- A final ByteBuf srcBuf field has been added to the Component class,
which always holds the originally-added buffer. This simplifies some of
the existing fragile logic where we need access to this (e.g. when
releasing)
- Component has been made abstract with two final impls. Which of these
is used depends on whether the srcBuf is already a slice of the
component's range (or equivalent to one)
- Correctly handle the various places where access to the pre-unwrapped
buffer is important, including the problem component(int) methods
- Extend the pre-unwrapping optimization to cover WrappedByteBufs,
SwappedByteBufs, and duplicates in addition to slices
- Unit test for the original bug provided by @jingene

Result

- Correct behaviour of the (internal)component(AtOffset) methods in all
cases
- Reduced indirection when accessing components that were added from
more kinds of ByteBufs (duplicates in particular)
- Less fragile logic related to lazy-slicing components

I don't expect there to be any noticeable performance impact of
splitting Component into two subclasses since most of the methods are
still final in the superclass and the ones that aren't will benefit from
bimorphic inlining. But will aim to benchmark nonetheless.

There could be slight increase in mem use due to two additional fields
in _one_ of the two Component subtypes, but I think this is needed for
correctness and still much better than slicing components
unconditionally when adding them.


Co-authored-by: jingene <jingene0206@gmail.com>
