
GH-40039: [Java][FlightRPC] Improve performance by removing unnecessary memory copies#40042

Merged
lidavidm merged 1 commit into apache:main from tolmalev:main on Feb 12, 2024

Conversation

@tolmalev (Contributor) commented Feb 12, 2024

Rationale for this change

Described in detail in issue #40039.

Summary: the ArrowMessage class uses CompositeByteBuf to avoid memory copies, but its maxNumComponents is computed incorrectly; as a result, memory copies are still performed, which significantly degrades server performance.

What changes are included in this PR?

maxNumComponents is changed to Integer.MAX_VALUE, because we never want Netty to silently merge large buffers into one.

Users can set useZeroCopy=false (the default), in which case the library copies the data into a new buffer before handing it to Netty for writing.
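To illustrate the failure mode, here is a toy model in plain Java (not Netty's actual CompositeByteBuf; the class and method names below are invented for this sketch) of how a too-small maxNumComponents silently consolidates components, copying the very buffers the composite was supposed to keep zero-copy:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a composite buffer with a component cap: once the
// component count exceeds maxNumComponents, all components are merged
// (copied) into one flat buffer, defeating zero-copy.
final class ToyComposite {
    private final int maxNumComponents;
    private final List<byte[]> components = new ArrayList<>();

    ToyComposite(int maxNumComponents) {
        this.maxNumComponents = maxNumComponents;
    }

    void addComponent(byte[] buf) {
        components.add(buf);
        if (components.size() > maxNumComponents) {
            consolidate(); // the hidden memory copy
        }
    }

    private void consolidate() {
        int total = components.stream().mapToInt(b -> b.length).sum();
        byte[] merged = new byte[total];
        int off = 0;
        for (byte[] b : components) {
            System.arraycopy(b, 0, merged, off, b.length);
            off += b.length;
        }
        components.clear();
        components.add(merged);
    }

    int numComponents() {
        return components.size();
    }
}

public class Demo {
    public static void main(String[] args) {
        // Too-small cap: adding a 4th buffer triggers a full copy.
        ToyComposite small = new ToyComposite(3);
        for (int i = 0; i < 4; i++) small.addComponent(new byte[1024]);
        System.out.println(small.numComponents()); // 1: everything merged

        // Effectively unbounded cap (the spirit of the fix): buffers
        // stay separate and no copy ever happens.
        ToyComposite large = new ToyComposite(Integer.MAX_VALUE);
        for (int i = 0; i < 4; i++) large.addComponent(new byte[1024]);
        System.out.println(large.numComponents()); // 4: still zero-copy
    }
}
```

The fix follows the same logic as the second case: with the cap at Integer.MAX_VALUE, consolidation can never trigger, so the components are handed to the transport as-is.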

Are these changes tested?

TestPerf: 30% throughput boost

BEFORE
Transferred 100000000 records totaling 3200000000 bytes at 877.812629 MiB/s. 28764164.218015 record/s. 7024.784185 batch/s.

AFTER
Transferred 100000000 records totaling 3200000000 bytes at 1145.333893 MiB/s. 37530301.022096 record/s. 9165.650116 batch/s.

I also tested with a simple client-server application and saw an even larger performance boost when padding isn't needed.

Two tests with zero-copy set to true:
50 batches, 30 columns (Int32), 199999 rows in each batch

  • before change: throughput ~25Gbit/s (memory copy happens in grpc-nio-worker-ELG-*)
  • after change: throughput ~32Gbit/s (20% boost)

50 batches, 30 columns (Int32), 200k rows in each batch

  • before change: throughput ~15Gbit/s (much slower than with 199999 rows, because the memory copy happens in the flight-server-default-executor-* thread and blocks the server from writing the next batch)
  • after change: throughput ~32Gbit/s (115% boost)
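For context on why 199999 vs. 200000 rows behave differently: the Arrow IPC format pads each buffer to an 8-byte boundary, so 199999 Int32 values (799996 bytes) require a 4-byte padding buffer, while 200000 values (800000 bytes) are already aligned and need none. A small sketch of that arithmetic (the helper name is mine, not from the library):

```java
public class PaddingDemo {
    // Length of a buffer of `rows` fixed-width values after padding to
    // the 8-byte boundary the Arrow IPC format requires.
    static long paddedLength(long rows, long bytesPerValue) {
        long raw = rows * bytesPerValue;
        return (raw + 7) & ~7L; // round up to the next multiple of 8
    }

    public static void main(String[] args) {
        // 199999 Int32 values: 799996 raw bytes, padded to 800000,
        // so a 4-byte padding buffer must be appended.
        System.out.println(paddedLength(199_999, 4)); // 800000

        // 200000 Int32 values: 800000 raw bytes, already aligned,
        // so no padding buffer is needed.
        System.out.println(paddedLength(200_000, 4)); // 800000
    }
}
```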

@tolmalev tolmalev requested a review from lidavidm as a code owner February 12, 2024 07:47
@github-actions

⚠️ GitHub issue #40039 has been automatically assigned in GitHub to PR creator.

@lidavidm (Member) left a comment


Great find, thank you for the fix & the explanation!

@lidavidm lidavidm changed the title GH-40039: [Java][FlightRpc] Improve performance by removing unnecessary memory copies GH-40039: [Java][FlightRPC] Improve performance by removing unnecessary memory copies Feb 12, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Feb 12, 2024
@lidavidm lidavidm merged commit 66351e3 into apache:main Feb 12, 2024
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Feb 12, 2024
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 66351e3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ecessary memory copies (apache#40042)

Authored-by: Lev Tolmachev <lev.tolmachev@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ecessary memory copies (apache#40042)

Successfully merging this pull request may close these issues.

[Java][FlightRpc] server zero-copy doesn't work if padding buffers are needed to serialise response
