GH-39377: [C++] IO: Reuse same buffer in CompressedInputStream#39807
GH-39377: [C++] IO: Reuse same buffer in CompressedInputStream#39807pitrou merged 23 commits intoapache:mainfrom
Conversation
|
|
pitrou
left a comment
There was a problem hiding this comment.
This may be a good idea, but have you run any benchmarks?
cpp/src/arrow/io/compressed.cc
Outdated
| std::shared_ptr<ResizableBuffer> decompressed_; | ||
| std::shared_ptr<ResizableBuffer> decompressed_impl_; |
There was a problem hiding this comment.
Why are there two buffers suddently? What is the difference between the two?
There was a problem hiding this comment.
It's not neccessary to maintain two buffer, I use it here for simplify the changes(previous impl set decompressed_ = nullptr to means decompressed is consumed). Seems this is not a good idea.
Currently not, let me find and run them |
|
I really need to work on these stream classes before I can fee confident reviewing these optimizations. |
This comment was marked as outdated.
This comment was marked as outdated.
|
You should benchmark using a faster codec such as LZ4, if you really want to measure the overhead of CompressedInputStream. |
|
(After changing |
|
Sorry for delaying, I'm suffering from to much work this two weeks. I'll enhance this on weekend |
|
It's ok @mapleFU ! |
# Conflicts: # cpp/src/arrow/dataset/dataset_writer.cc
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
@mapleFU I don't think there any existing C++ benchmarks that exercise CompressedInputStream (except the ones you have added). There may be Python benchmarks reading compressed CSV files, I'm not sure. |
|
Aha, you're right |
pitrou
left a comment
There was a problem hiding this comment.
+1. I unfortunately can't reproduce any speedup when reading a compressed CSV file from Python, but this looks potentially useful anyway.
|
@github-actions crossbow submit -g cpp |
|
Yeah, I'm use a CompressedInputStream to read csv/tsv/json... from remote object store. The previous code works well but this patch is some minor optimizations about it. |
|
Revision: 8ca0840 Submitted crossbow builds: ursacomputing/crossbow @ actions-371b6d0039 |
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit aae2557. There were 8 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
Don't understand the reason, would native read by R uses CompressedInputStream...? |
|
It's certainly unrelated. |
|
@github-actions crossbow submit test-r-rstudio-r-base-4.1-opensuse153 |
This comment was marked as outdated.
This comment was marked as outdated.
|
Revision: 8ca0840 Submitted crossbow builds: ursacomputing/crossbow @ actions-7cf4daee98
|
### Rationale for this change Previous pr ( #39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: #41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#39807) ### Rationale for this change This patch reuses the same buffer in `CompressedInputStream`. It includes the `decompress_` and `compress_` buffer ### What changes are included in this PR? 1. For `compress_`, allocate and reuse same buffer with `kChunkSize` (64KB), and reusing it 2. For `decompress_`, reusing a same buffer (mostly 1MB) without continues `Reallocate` In the worst case, `decompress_` might hold a large buffer. ### Are these changes tested? Already ### Are there any user-facing changes? `CompressedInputStream` might has larger buffer * Closes: apache#39377 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <anmmscs_maple@qq.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
This patch reuses the same buffer in
CompressedInputStream. It includes thedecompress_andcompress_bufferWhat changes are included in this PR?
compress_, allocate and reuse same buffer withkChunkSize(64KB), and reusing itdecompress_, reusing a same buffer (mostly 1MB) without continuesReallocateIn the worst case,
decompress_might hold a large buffer.Are these changes tested?
Already
Are there any user-facing changes?
CompressedInputStreammight has larger buffer