GH-40039: [Java][FlightRPC] Improve performance by removing unnecessary memory copies#40042
Merged
lidavidm merged 1 commit intoapache:mainfrom Feb 12, 2024
Merged
GH-40039: [Java][FlightRPC] Improve performance by removing unnecessary memory copies#40042lidavidm merged 1 commit intoapache:mainfrom
lidavidm merged 1 commit intoapache:mainfrom
Conversation
|
|
lidavidm
approved these changes
Feb 12, 2024
Member
lidavidm
left a comment
There was a problem hiding this comment.
Great find, thank you for the fix & the explanation!
|
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 66351e3. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
dgreiss
pushed a commit
to dgreiss/arrow
that referenced
this pull request
Feb 19, 2024
…ecessary memory copies (apache#40042) ### Rationale for this change Described in details in the issue: apache#40039 Summary: class ArrowMessage uses CompositeByteBuf to avoid memory copies but `maxNumComponents` for it is calculated incorrectly and as a result memory copies are still performed which significantly affects the performance of the server. ### What changes are included in this PR? Changing maxNumComponents to `Integer.MAX_VALUE` because we never want to silently merge large buffers into one. User can set useZeroCopy=false (default) and then the library will copy data into a new buffer before sending it to Netty for write. ### Are these changes tested? **TestPerf: 30% throughput boost** ``` BEFORE Transferred 100000000 records totaling 3200000000 bytes at 877.812629 MiB/s. 28764164.218015 record/s. 7024.784185 batch/s. AFTER Transferred 100000000 records totaling 3200000000 bytes at 1145.333893 MiB/s. 37530301.022096 record/s. 9165.650116 batch/s. ``` Also tested with a simple client-server application and I saw even more significant performance boost if padding isn't needed. Two tests with zero-copy set to true: **50 batches, 30 columns (Int32), 199999 rows in each batch** - before change: throughput ~25Gbit/s (memory copy happens in `grpc-nio-worker-ELG-*`) - after change: throughput ~32Gbit/s (20% boost) **50 batches, 30 columns (Int32), 200k rows in each batch** - before change: throughput ~15Gbit/s (much slower than with 199999 because memory copy happens in `flight-server-default-executor-*` thread and blocks server from writing next batch. - after change: throughput ~32Gbit/s (**115% boost**) * Closes: apache#40039 Authored-by: Lev Tolmachev <lev.tolmachev@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
pribor
pushed a commit
to GlobalWebIndex/arrow
that referenced
this pull request
Oct 24, 2025
…ecessary memory copies (apache#40042) ### Rationale for this change Described in details in the issue: apache#40039 Summary: class ArrowMessage uses CompositeByteBuf to avoid memory copies but `maxNumComponents` for it is calculated incorrectly and as a result memory copies are still performed which significantly affects the performance of the server. ### What changes are included in this PR? Changing maxNumComponents to `Integer.MAX_VALUE` because we never want to silently merge large buffers into one. User can set useZeroCopy=false (default) and then the library will copy data into a new buffer before sending it to Netty for write. ### Are these changes tested? **TestPerf: 30% throughput boost** ``` BEFORE Transferred 100000000 records totaling 3200000000 bytes at 877.812629 MiB/s. 28764164.218015 record/s. 7024.784185 batch/s. AFTER Transferred 100000000 records totaling 3200000000 bytes at 1145.333893 MiB/s. 37530301.022096 record/s. 9165.650116 batch/s. ``` Also tested with a simple client-server application and I saw even more significant performance boost if padding isn't needed. Two tests with zero-copy set to true: **50 batches, 30 columns (Int32), 199999 rows in each batch** - before change: throughput ~25Gbit/s (memory copy happens in `grpc-nio-worker-ELG-*`) - after change: throughput ~32Gbit/s (20% boost) **50 batches, 30 columns (Int32), 200k rows in each batch** - before change: throughput ~15Gbit/s (much slower than with 199999 because memory copy happens in `flight-server-default-executor-*` thread and blocks server from writing next batch. - after change: throughput ~32Gbit/s (**115% boost**) * Closes: apache#40039 Authored-by: Lev Tolmachev <lev.tolmachev@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Described in details in the issue: #40039
Summary: class ArrowMessage uses CompositeByteBuf to avoid memory copies but
maxNumComponentsfor it is calculated incorrectly and as a result memory copies are still performed which significantly affects the performance of the server.What changes are included in this PR?
Changing maxNumComponents to
Integer.MAX_VALUEbecause we never want to silently merge large buffers into one.User can set useZeroCopy=false (default) and then the library will copy data into a new buffer before sending it to Netty for write.
Are these changes tested?
TestPerf: 30% throughput boost
Also tested with a simple client-server application and I saw even more significant performance boost if padding isn't needed.
Two tests with zero-copy set to true:
50 batches, 30 columns (Int32), 199999 rows in each batch
grpc-nio-worker-ELG-*)50 batches, 30 columns (Int32), 200k rows in each batch
flight-server-default-executor-*thread and blocks server from writing next batch.